Daily arXiv Papers - 2025-11-13

AI-enhanced summaries of 14 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

Main category: cs.CL

TL;DR: SeniorTalk is a Chinese spoken dialogue dataset addressing the scarcity of speech data from individuals aged 75+, featuring 55.53 hours of natural conversations from 202 participants with balanced demographic representation.

Motivation: Current voice technologies have performance gaps for elderly users due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations, with limited data on super-aged individuals.

Method: Created SeniorTalk dataset with 55.53 hours of speech from 101 natural conversations involving 202 participants, strategically balanced across gender, region, and age, with detailed multi-dimensional annotations.

Result: The dataset enables extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, providing crucial insights for developing speech technologies for elderly populations.

Conclusion: SeniorTalk addresses the critical data scarcity for individuals aged 75+ and supports development of more effective voice technologies for aging populations through comprehensive multi-dimensional annotations and balanced demographic representation.

Abstract: While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.

[2] Where did you get that? Towards Summarization Attribution for Analysts

Violet B, John M. Conroy, Sean Lynch, Danielle M, Neil P. Molino, Aaron Wiechmann, Julia S. Yang

Main category: cs.CL

TL;DR: Automatic attribution methods linking summary sentences to source text, using hybrid summarization and a custom typology for error analysis.

Motivation: Analysts need attribution to report information sources, requiring automatic methods to link summary sentences to source documents.

Method: Hybrid summarization (automatic paraphrase of an extractive summary) and a custom typology for identifying attribution error categories.

Result: Proposed approach enables attribution linking and identifies different types of attribution-related errors.

Conclusion: Hybrid summarization facilitates attribution, and a custom typology helps analyze attribution error patterns.

Abstract: Analysts require attribution, as nothing can be reported without knowing the source of the information. In this paper, we will focus on automatic methods for attribution, linking each sentence in the summary to a portion of the source text, which may be in one or more documents. We explore using a hybrid summarization, i.e., an automatic paraphrase of an extractive summary, to ease attribution. We also use a custom typology to identify the proportion of different categories of attribution-related errors.
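
The linking step described above can be illustrated with a minimal baseline: match each summary sentence to the source sentence it overlaps most. The paper does not disclose its scoring function at this level of detail, so the Jaccard token overlap below is an assumed stand-in, and the example sentences are invented.

```python
# Illustrative baseline for sentence-level attribution: link each summary
# sentence to the source sentence with the highest token-overlap (Jaccard)
# score. This is a sketch of the linking step only, not the paper's method.

def tokens(sentence: str) -> set:
    return {w.strip(".,").lower() for w in sentence.split()}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def attribute(summary_sents, source_sents):
    """Return, for each summary sentence, (best source index, score)."""
    links = []
    for s in summary_sents:
        scores = [jaccard(tokens(s), tokens(src)) for src in source_sents]
        best = max(range(len(scores)), key=scores.__getitem__)
        links.append((best, scores[best]))
    return links

source = ["The rover landed in Jezero crater in 2021.",
          "Its mission is to search for signs of ancient life."]
summary = ["The rover seeks signs of ancient life."]
print(attribute(summary, source))
```

A production system would substitute embedding similarity or an entailment model for the overlap score, but the linking loop stays the same.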

[3] GMTRouter: Personalized LLM Router over Multi-turn User Interactions

Encheng Xie, Yihang Sun, Tao Feng, Jiaxuan You

Main category: cs.CL

TL;DR: GMTRouter is a personalized LLM routing system that uses heterogeneous graph learning to capture user preferences from few-shot data, outperforming baselines in accuracy and AUC while adapting to new users without extensive fine-tuning.

Motivation: Existing LLM routing approaches lack full personalization and fail to capture complex user-LLM interactions. User preference data is often scarce, noisy, and inconsistent, limiting methods that rely solely on user-specific data.

Method: Represent multi-turn user-LLM interactions as a heterogeneous graph with user, LLM, query, and response nodes. Use tailored message-passing mechanism to learn user preferences from few-shot data within a lightweight inductive graph learning framework.

Result: GMTRouter consistently outperforms strong baselines with 0.9-21.6% higher accuracy and 0.006-0.309 higher AUC across multiple datasets. It effectively adapts to new users and evolving preferences using only few-shot data.

Conclusion: GMTRouter provides an effective solution for personalized LLM routing by leveraging graph-based learning to capture user preferences from limited data, enabling adaptation to new users without extensive fine-tuning requirements.

Abstract: Large Language Model (LLM) routing has demonstrated strong capability in balancing response quality with computational cost. As users exhibit diverse preferences, personalization has attracted increasing attention in LLM routing, since even identical queries may require different models to generate responses tailored to individual needs. However, existing approaches are not fully personalized and often fail to capture the complex interactions between specific users and LLMs. Moreover, user preference data is typically scarce, noisy, and inconsistent in format, which limits the effectiveness of methods that rely solely on user-specific data. To address these challenges, we propose GMTRouter, which represents multi-turn user-LLM interactions as a heterogeneous graph with four node types: user, LLM, query, and response, thereby preserving the rich relational structure of the interaction. Through a tailored message-passing mechanism, GMTRouter learns to capture user preferences from few-shot data within a lightweight inductive graph learning framework, enabling effective personalization. Extensive experiments demonstrate that GMTRouter consistently outperforms strong baselines, achieving 0.9 to 21.6 percent higher accuracy and 0.006 to 0.309 higher AUC across multiple datasets. More importantly, we demonstrate that GMTRouter can adapt to new users and evolving preferences using only few-shot data, without extensive fine-tuning. The code for GMTRouter is publicly available at https://github.com/ulab-uiuc/GMTRouter.
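
The four-node-type interaction graph above can be sketched as a small data structure. The node types come from the abstract; the relation names (`issued`, `produced`, `answered_by`) and the reverse-edge convention are illustrative assumptions, not GMTRouter's actual schema, and the message-passing model itself is omitted.

```python
# Sketch of a heterogeneous user-LLM interaction graph with the four node
# types named in the paper. Relation names are invented for illustration.
from collections import defaultdict

class HeteroGraph:
    def __init__(self):
        self.nodes = {}               # node_id -> node_type
        self.adj = defaultdict(list)  # node_id -> [(relation, neighbor_id)]

    def add_node(self, node_id, node_type):
        assert node_type in {"user", "llm", "query", "response"}
        self.nodes[node_id] = node_type

    def add_edge(self, src, relation, dst):
        # Store both directions so message passing can flow either way.
        self.adj[src].append((relation, dst))
        self.adj[dst].append(("rev_" + relation, src))

    def neighbors(self, node_id, relation=None):
        return [d for r, d in self.adj[node_id] if relation in (None, r)]

g = HeteroGraph()
g.add_node("u1", "user")
g.add_node("q1", "query")
g.add_node("m1", "llm")
g.add_node("r1", "response")
g.add_edge("u1", "issued", "q1")
g.add_edge("m1", "produced", "r1")
g.add_edge("q1", "answered_by", "r1")
print(g.neighbors("q1"))
```

In practice a framework like DGL or PyTorch Geometric would hold this structure and run the learned message passing over it.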

[4] The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions

Azza Bouleimen, Giordano De Marzo, Taehee Kim, Nicolò Pagan, Hannah Metzler, Silvia Giordano, David Garcia

Main category: cs.CL

TL;DR: LLMs can generate social media conversations that are mistaken for human content 39% of the time, demonstrating their potential for realistic social simulation and misuse risks.

Motivation: To evaluate whether LLMs can convincingly mimic human group conversations on social media, as their validity for simulating online communities remains largely untested.

Method: Collected authentic human conversations from Reddit and generated artificial conversations on the same topics using Llama 3 70B and GPT-4o, then presented them side-by-side to study participants for identification.

Result: LLM-generated conversations were mistaken for human-created content 39% of the time. Participants correctly identified Llama 3 conversations as AI-generated only 56% of the time, barely better than random chance.

Conclusion: LLMs can generate sufficiently realistic social media conversations to deceive humans, highlighting both promising potential for social simulation and warning about potential misuse for generating inauthentic content.

Abstract: Large Language Models (LLMs) offer new avenues to simulate online communities and social media. Potential applications range from testing the design of content recommendation algorithms to estimating the effects of content policies and interventions. However, the validity of using LLMs to simulate conversations between various users remains largely untested. We evaluated whether LLMs can convincingly mimic human group conversations on social media. We collected authentic human conversations from Reddit and generated artificial conversations on the same topic with two LLMs: Llama 3 70B and GPT-4o. When presented side-by-side to study participants, LLM-generated conversations were mistaken for human-created content 39% of the time. In particular, when evaluating conversations generated by Llama 3, participants correctly identified them as AI-generated only 56% of the time, barely better than random chance. Our study demonstrates that LLMs can generate social media conversations sufficiently realistic to deceive humans when reading them, highlighting both a promising potential for social simulation and a warning message about the potential misuse of LLMs to generate new inauthentic social media content.

Abha Jha, Abel Salinas, Fred Morstatter

Main category: cs.CL

TL;DR: LLMs show promise for legal analysis but have dangerous vulnerabilities in generating bioweapon instructions despite safeguards. The paper proposes knowledge graph + RAG methodology to evaluate LLM understanding of bioweapons law, revealing safety limitations while suggesting improved frameworks.

Motivation: To address the contradiction that LLMs can interpret laws but also generate unsafe outputs like bioweapon creation steps, despite having safeguards in place.

Method: Knowledge graph construction combined with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs’ legal understanding, intent assessment, and safety vulnerabilities in bioweapons law contexts.

Result: Significant limitations found in LLMs’ reasoning and safety mechanisms regarding bioweapons law, but pathways identified for improvement through enhanced safety protocols and legal reasoning frameworks.

Conclusion: By combining improved safety measures with robust legal reasoning, LLMs can be developed to ethically assist in sensitive legal domains as protectors rather than enablers of law violations.

Abstract: The rise of Large Language Models (LLMs) offers transformative potential for interpreting complex legal frameworks, such as Title 18 Section 175 of the US Code, which governs biological weapons. These systems hold promise for advancing legal analysis and compliance monitoring in sensitive domains. However, this capability comes with a troubling contradiction: while LLMs can analyze and interpret laws, they also demonstrate alarming vulnerabilities in generating unsafe outputs, such as actionable steps for bioweapon creation, despite their safeguards. To address this challenge, we propose a methodology that integrates knowledge graph construction with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs’ understanding of this law, their capacity to assess legal intent (mens rea), and their potential for unsafe applications. Through structured experiments, we assess their accuracy in identifying legal violations, generating prohibited instructions, and detecting unlawful intent in bioweapons-related scenarios. Our findings reveal significant limitations in LLMs’ reasoning and safety mechanisms, but they also point the way forward. By combining enhanced safety protocols with more robust legal reasoning frameworks, this research lays the groundwork for developing LLMs that can ethically and securely assist in sensitive legal domains - ensuring they act as protectors of the law rather than inadvertent enablers of its violation.

Azmine Toushik Wasi, Wahid Faisal, Mst Rafia Islam

Main category: cs.CL

TL;DR: Mina is a multilingual LLM-based legal assistant for Bangladesh that achieved 75-80% scores on bar exam components, matching human performance in providing affordable legal assistance.

Motivation: To address barriers to affordable legal advice in Bangladesh including complex legal language, procedural opacity, and high costs, especially given the lack of Bengali-language AI legal assistants.

Method: Uses multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivered via interactive chat interface.

Result: Scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams of Bangladesh Bar Council Exams, matching or surpassing average human performance with demonstrated clarity and sound legal reasoning.

Conclusion: Mina shows potential as a low-cost, multilingual AI assistant that can automate legal tasks and scale access to justice, offering insights for building domain-specific, low-resource systems and sustainable public-service AI deployment.

Abstract: Bangladesh’s low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.

[7] Diverse Preference Learning for Capabilities and Alignment

Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell

Main category: cs.CL

TL;DR: Soft Preference Learning addresses LLM output diversity loss from alignment algorithms by decoupling KL divergence terms, improving both capabilities and alignment.

Motivation: Alignment algorithms like RLHF and DPO reduce LLM output diversity, causing repetitive text and narrow societal perspectives due to the KL divergence regularizer overweighting majority opinions.

Method: Proposes Soft Preference Learning which decouples entropy and cross-entropy terms in KL penalty for fine-grained control over generation diversity.

Result: LLMs trained with Soft Preference Learning achieve higher accuracy on difficult tasks, produce more diverse outputs, represent wider societal viewpoints, and show improved logit calibration.

Conclusion: Soft Preference Learning is a Pareto improvement over standard temperature scaling, effectively balancing alignment objectives while preserving output diversity.

Abstract: The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty - allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Soft Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.
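
The decoupling described above rests on the identity KL(p || q) = CE(p, q) − H(p): the KL penalty is a cross-entropy term minus an entropy term, and weighting the two separately gives independent control over how strongly the model is pulled toward the reference versus how much output entropy it keeps. The coefficient names below (`alpha`, `beta`) and the exact weighted form are assumptions for illustration, not the paper's parameterization.

```python
import math

# Sketch of a decoupled KL penalty: KL(p || q) = CE(p, q) - H(p).
# Scaling the entropy and cross-entropy terms separately (alpha, beta are
# illustrative names) lets a trainer keep the pull toward the reference
# distribution q while relaxing the pressure that collapses entropy.

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

def soft_kl_penalty(p, q, alpha=0.5, beta=1.0):
    # alpha scales the entropy term, beta the cross-entropy term;
    # alpha == beta == 1 recovers the standard KL penalty.
    return beta * cross_entropy(p, q) - alpha * entropy(p)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
assert abs(soft_kl_penalty(p, q, 1.0, 1.0) - kl(p, q)) < 1e-12
print(kl(p, q), soft_kl_penalty(p, q))
```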

[8] POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation

Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Jin Li, Yizhou Peng, Longbiao Wang, Jianwu Dang, Nyima Tashi

Main category: cs.CL

TL;DR: POTSA is a novel framework that uses Optimal Transport and parallel speech pairs to improve multilingual speech-to-text translation by aligning semantic representations across languages, especially benefiting low-resource languages.

Motivation: Existing SpeechLLMs overlook semantic commonalities across source languages, leading to biased translation performance and poor performance on low-resource languages.

Method: Uses cross-lingual parallel speech pairs with Optimal Transport: 1) Bias Compensation module for coarse alignment, 2) token-level OT constraints on Q-Former for fine-grained consistency, 3) layer scheduling to focus OT on most beneficial layers.

Result: Achieves SOTA on FLEURS dataset: +0.93 BLEU average on five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.

Conclusion: POTSA effectively bridges high- and low-resource translation gaps by leveraging semantic commonalities through Optimal Transport alignment, demonstrating significant improvements in multilingual speech translation.

Abstract: Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
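
A token-level OT constraint of the kind described above can be sketched with entropy-regularized (Sinkhorn) optimal transport between two token-embedding sequences, e.g. a parallel utterance pair. The cost function, regularization strength, and iteration count below are illustrative defaults, not POTSA's settings.

```python
import math

# Minimal Sinkhorn sketch of a token-level OT alignment cost between two
# embedding sequences X and Y (lists of equal-dimension vectors).
# eps and iters are illustrative defaults, not the paper's configuration.

def sinkhorn_cost(X, Y, eps=0.1, iters=200):
    n, m = len(X), len(Y)
    # Squared-Euclidean cost matrix between token embeddings.
    C = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in Y] for x in X]
    K = [[math.exp(-c / eps) for c in row] for row in C]
    a = [1.0 / n] * n            # uniform marginals over tokens
    b = [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):       # alternating marginal projections
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v); OT cost = <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j]
               for i in range(n) for j in range(m))

X = [[0.0, 0.0], [1.0, 0.0]]
Y = [[0.0, 0.1], [1.0, 0.1]]
print(round(sinkhorn_cost(X, Y), 4))
```

Used as a training loss, this cost pushes the Q-Former to place parallel tokens from the two languages close together; libraries such as POT provide hardened implementations of the same computation.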

[9] Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning

Joongho Kim, Xirui Huang, Zarreen Reza, Gabriel Grand, Kevin Zhu, Ryan Lagasse

Main category: cs.CL

TL;DR: SSDP introduces semantic similarity-based dynamic pruning to reduce computational costs in Tree-of-Thought reasoning by clustering and pruning redundant reasoning paths in real time.

Motivation: Tree-of-Thought reasoning improves LLM problem-solving but suffers from high computational costs due to semantic redundancy, where different branches explore equivalent reasoning paths.

Method: Semantic Similarity-Based Dynamic Pruning (SSDP) integrates online semantic merging into parallelized tree search, enabling real-time clustering and pruning of redundant reasoning steps.

Result: SSDP achieves up to 2.3x speedup over state-of-the-art baselines while maintaining competitive accuracy (within 5% of strongest baseline) and reduces explored nodes by 85-90% on benchmarks like GSM8K and MATH500.

Conclusion: SSDP provides a practical approach for efficient and scalable LLM reasoning through semantic-based dynamic pruning, significantly reducing computational overhead while preserving reasoning quality.

Abstract: Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at https://github.com/kimjoonghokim/SSDP.
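
The merging idea above can be sketched as follows: a newly expanded reasoning step is pruned when its embedding is close (cosine similarity above a threshold) to a step already kept. The toy vectors and the 0.95 threshold are assumptions; a real system would use a sentence encoder and SSDP's own merge criterion.

```python
import math

# Toy sketch of online semantic merging: keep the first representative of
# each cluster of near-duplicate reasoning steps, prune the rest.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prune_redundant(steps, threshold=0.95):
    """steps: list of (text, embedding). Return texts of kept steps."""
    kept = []
    for text, emb in steps:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((text, emb))
    return [text for text, _ in kept]

steps = [
    ("add 3 to both sides", [1.0, 0.0, 0.1]),
    ("move 3 across the equals sign", [0.99, 0.01, 0.12]),  # near duplicate
    ("factor the left side", [0.0, 1.0, 0.0]),
]
print(prune_redundant(steps))
```

Running this per expansion round, rather than once at the end, is what makes the pruning "online": redundant branches never get expanded further.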

[10] What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs’ Self-consistency Via Adversarial Nudge

Arka Dutta, Sujan Dutta, Rijul Magu, Soumyajit Datta, Munmun De Choudhury, Ashiqur R. KhudaBukhsh

Main category: cs.CL

TL;DR: A framework for stress testing LLM factual fidelity against adversarial nudges shows varying resilience across models, with Claude being most robust and Gemini/DeepSeek least resilient.

Motivation: Hallucinations pose critical challenges for LLM deployment in high-stakes domains, requiring testing of factual fidelity against adversarial manipulation.

Method: Three-step framework: 1) LLM generates truths and lies for closed domains, 2) LLM verifies these assertions, 3) Tests LLM robustness against self-generated lies in movies and novels domains.

Result: Claude showed strong resilience, GPT and Grok moderate resilience, while Gemini and DeepSeek demonstrated weak resilience to adversarial nudges.

Conclusion: Findings raise alarm about LLM susceptibility to adversarial manipulation, especially concerning given increasing public reliance on LLMs for information seeking.

Abstract: Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.

[11] Self-HarmLLM: Can Large Language Model Harm Itself?

Heehwan Kim, Sungjune Park, Daeseon Choi

Main category: cs.CL

TL;DR: Self-HarmLLM: Using a model’s own mitigated harmful queries as new inputs can jailbreak LLM guardrails, with up to 65% transformation and 41% jailbreak success rates.

Motivation: Existing defenses assume external attackers, but a model's own output could become an attack vector. The study explores whether mitigated harmful queries generated by the same model can jailbreak its guardrails.

Method: Proposed Self-HarmLLM scenario using Mitigated Harmful Queries (MHQs) - ambiguous queries preserving original harmful intent. Tested on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions with both automated and human evaluation.

Result: Up to 52% transformation and 33% jailbreak success in Zero-shot; up to 65% transformation and 41% jailbreak success in Few-shot. Automated evaluation consistently overestimated jailbreak success by average 52% compared to human evaluation.

Conclusion: The method proves valid as an attack scenario, indicating need for fundamental reconsideration of guardrail design and more robust evaluation methodology beyond automated assessment alone.

Abstract: Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model’s own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.

[12] OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, Jiawei Zhou

Main category: cs.CL

TL;DR: OKBench is an automated framework for creating dynamic knowledge benchmarks to evaluate LLMs on evolving information, addressing limitations of static benchmarks.

Motivation: Static benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements.

Method: An agentic framework that automates sourcing, creation, validation, and distribution of benchmarks, focusing on news domain with daily knowledge updates.

Result: Evaluation reveals distinct model behaviors with new information and shows retrieval narrows performance gap between small and large models.

Conclusion: Dynamic knowledge benchmarks are crucial for evaluating LLMs, and OKBench democratizes benchmark creation while enabling thorough evaluation of retrieval-augmented methods.

Abstract: Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.

[13] Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study

Yilan Liu

Main category: cs.CL

TL;DR: A proof-of-concept system using retrieval-augmented generation (RAG) with curated knowledge bases to automatically generate pediatric speech-language pathology clinical vignettes, addressing limitations of general-purpose LLMs.

Motivation: Manual creation of clinical vignettes for SLP education is time-intensive, and general-purpose LLMs lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision.

Method: Multi-model RAG-based system integrating curated domain knowledge with engineered prompt templates, tested with five commercial and open-source LLMs across seven diverse pediatric disorder scenarios, with automated quality assessment using multi-dimensional rubrics.

Result: Technical feasibility demonstrated for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, enabling privacy-preserving institutional deployment. Generated content aligned with professional guidelines.

Conclusion: The proof-of-concept shows promise for automated SLP case material generation, but requires extensive validation through expert review and testing before educational implementation. Future applications may extend to clinical decision support and IEP goal generation.

Abstract: Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.
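
The retrieval-then-generate flow above can be sketched minimally: score curated knowledge-base snippets against the request, then assemble a grounded prompt for whatever LLM backs the system. The snippet texts, scoring function, and prompt wording below are invented for illustration; the study's actual knowledge base and templates are not reproduced in this summary.

```python
# Minimal RAG sketch: bag-of-words retrieval over a curated knowledge base,
# followed by grounded prompt assembly. All content here is illustrative.

def score(query: str, doc: str) -> int:
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, kb, k=2):
    return sorted(kb, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, kb):
    context = "\n".join(f"- {d}" for d in retrieve(query, kb))
    return (f"Using only the reference notes below, write a pediatric "
            f"speech-language pathology vignette.\nNotes:\n{context}\n"
            f"Request: {query}")

kb = [
    "Phonological disorders involve predictable sound pattern errors",
    "Stuttering onset is most common between ages 2 and 5",
    "IEP goals should be measurable and time-bound",
]
print(build_prompt("vignette about stuttering onset in preschool ages", kb))
```

The "Using only the reference notes" instruction is the part that curbs hallucination: generation is anchored to retrieved, guideline-aligned content rather than the model's parametric knowledge.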

[14] Evaluating DisCoCirc in Translation Tasks & its Limitations: A Comparative Study Between Bengali & English

Nazmoon Falgunee Moon

Main category: cs.CL

TL;DR: Extends DisCoCirc formalism to Bengali for English-Bengali translation, finding limitations in handling structural variations between languages despite prior claims of reduced bureaucracy.

DetailsMotivation: To develop a DisCoCirc framework for Bengali and test its effectiveness in translation tasks, reassessing claims about reducing language bureaucracy.

Method: Extended DisCoCirc formalism to Bengali language, applied to English-Bengali translation tasks, and analyzed structural variations between languages.

Result: DisCoCirc works well for large parts of language but faces limitations due to structural variations between English and Bengali, struggling even with simple sentences.

Conclusion: The framework has constraints in translation tasks, diverging from prior claims, and suggests scope for future improvements to handle language structural variations.

Abstract: In [4], the authors present the DisCoCirc (Distributed Compositional Circuits) formalism for the English language, a grammar-based framework derived from the production rules that incorporates circuit-like representations in order to give a precise categorical theoretical structure to the language. In this paper, we extend this approach to develop a similar framework for Bengali and apply it to translation tasks between English and Bengali. A central focus of our work lies in reassessing the effectiveness of DisCoCirc in reducing language bureaucracy. Unlike the result suggested in [5], our findings indicate that although it works well for a large part of the language, it still faces limitations due to the structural variation of the two languages. We discuss possible methods that might handle these shortcomings and show that, in practice, DisCoCirc still struggles even with relatively simple sentences. This divergence from prior claims not only highlights the framework’s constraints in translation but also suggests scope for future improvement. Apart from our primary focus on English-Bengali translation, we also take a short detour to examine English conjunctions, following [1], showing a connection between conjunctions and Boolean logic.

[15] A Super-Learner with Large Language Models for Medical Emergency Advising

Sergey K. Aityan, Abdolreza Mosaddegh, Rolando Herrero, Haitham Tayyar, Jiang Han, Vikram Sawant, Qi Chen, Rishabh Jain, Aruna Senthamaraikannan, Stephen Wood, Manuel Mersini, Rita Lazzaro, Mario Balzaneli, Nicola Iacovazzo, Ciro Gargiulo Isacco

Main category: cs.CL

TL;DR: A super-learner system called MEDAS integrates five major LLMs (Gemini, Llama, Grok, GPT, Claude) using meta-learning to achieve 70% diagnostic accuracy for emergency medicine cases, outperforming individual LLMs (58-65%) and human doctors.

DetailsMotivation: To improve medical decision-support in emergency medicine by leveraging the collective capabilities of multiple LLMs through ensemble learning, as individual LLMs show varying diagnostic accuracies that still exceed human performance.

Method: Built MEDAS super-learner system that integrates five major LLMs using a meta-learner approach to learn different capabilities of each LLM and leverage their collective diagnostic knowledge.

Result: Individual LLMs achieved 58-65% diagnostic accuracy, while the MEDAS super-learner achieved 70% accuracy, with at least one individual LLM reaching 85% correct diagnoses in the ensemble.

Conclusion: Meta-learning integration of multiple LLMs significantly improves diagnostic accuracy over individual models, demonstrating that collective knowledge from different medical datasets can be effectively leveraged for emergency medicine diagnostics.

Abstract: Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients’ conditions and make diagnoses. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years, and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied the responses of a group of different LLMs to real cases in emergency medicine. The results of our study on five of the most renowned LLMs showed significant differences in their capabilities for diagnosing acute diseases in medical emergencies, with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built MEDAS (Medical Emergency Diagnostic Advising System), a super-learner of five major LLMs (Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning the different capabilities of each LLM, leveraging the collective capabilities of all LLMs in the cluster to improve diagnostic accuracy. The results of our study showed that the aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
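The abstract does not describe MEDAS's meta-learner in detail. As a purely illustrative sketch of the general super-learner pattern it names (hypothetical model names, toy diagnoses, and an accuracy-weighted vote standing in for whatever meta-learner the authors actually use):

```python
from collections import defaultdict

def train_meta_weights(predictions, truths):
    """Learn a per-model reliability weight from held-out cases.
    predictions: {model_name: [diagnosis per case]}, truths: [diagnosis per case]."""
    weights = {}
    for model, preds in predictions.items():
        correct = sum(p == t for p, t in zip(preds, truths))
        weights[model] = correct / len(truths)  # held-out accuracy as weight
    return weights

def super_learner(case_preds, weights):
    """Weighted vote over the diagnoses proposed by each base LLM."""
    scores = defaultdict(float)
    for model, diagnosis in case_preds.items():
        scores[diagnosis] += weights.get(model, 0.0)
    return max(scores, key=scores.get)

# Toy held-out data (hypothetical): three base models, four emergency cases.
preds = {
    "model_a": ["MI", "stroke", "sepsis", "MI"],
    "model_b": ["MI", "stroke", "MI", "PE"],
    "model_c": ["PE", "stroke", "sepsis", "MI"],
}
truths = ["MI", "stroke", "sepsis", "MI"]
w = train_meta_weights(preds, truths)
print(super_learner({"model_a": "MI", "model_b": "PE", "model_c": "MI"}, w))  # "MI"
```

The point of the pattern, as in the paper's result, is that a weighted combination can exceed any single model's accuracy when the models err on different cases.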

[16] Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM

Yibai Liu, Shihang Wang, Zeming Liu, Zheming Song, Junzhe Wang, Jingjing Liu, Qingjie Liu, Yunhong Wang

Main category: cs.CL

TL;DR: GrADS is a self-adaptive gradient-aware data selection method that identifies optimal training subsets using gradient analysis, achieving superior performance with only 5-50% of data while mitigating catastrophic forgetting.

DetailsMotivation: SFT for domain specialization is resource-intensive and causes catastrophic forgetting of general capabilities in LLMs, creating a need for more efficient fine-tuning approaches.

Method: Analyzes gradients from preliminary training to design self-guided criteria based on gradient magnitude and statistical distribution, prioritizing examples that maximize learning effectiveness.

Result: With only 5% of selected data, LLMs surpass full-dataset fine-tuning performance; 50% data yields significant improvements while substantially reducing catastrophic forgetting across medical, legal, and financial domains.

Conclusion: GrADS enables efficient domain adaptation of LLMs through intelligent data selection, achieving better performance with less data and mitigating catastrophic forgetting.

Abstract: Although large language models (LLMs) have achieved impressive results across numerous tasks, supervised fine-tuning (SFT) remains essential for adapting these models to specialized domains. However, SFT for domain specialization can be resource-intensive and sometimes leads to a deterioration in general capabilities due to catastrophic forgetting (CF). To address these issues, we propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of LLMs, which identifies effective subsets of training data by analyzing gradients obtained from a preliminary training phase. Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model’s learning process. This approach enables the acquisition of representative samples that enhance LLMs’ understanding of domain-specific tasks. Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness. Remarkably, utilizing merely 5% of the GrADS-selected data, LLMs already surpass the performance of those fine-tuned on the entire dataset, and increasing to 50% of the data yields further significant improvements, while substantially mitigating catastrophic forgetting. We will release our code for GrADS later.
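The code for GrADS is not yet released, but the core idea the abstract describes, scoring examples by gradient signal from a preliminary pass and keeping the most informative fraction, can be sketched as follows (a simplification: only gradient magnitude is used here, whereas GrADS also considers the statistical distribution of gradients):

```python
import math

def gradient_norm(grads):
    """L2 norm of a flattened per-example gradient."""
    return math.sqrt(sum(g * g for g in grads))

def select_by_gradient(example_grads, keep_frac=0.05):
    """Keep the examples whose preliminary-phase gradient magnitude is
    largest, as a proxy for their contribution to learning."""
    scored = sorted(example_grads.items(),
                    key=lambda kv: gradient_norm(kv[1]), reverse=True)
    k = max(1, int(len(scored) * keep_frac))
    return [ex_id for ex_id, _ in scored[:k]]

# Hypothetical per-example gradients from a preliminary training phase.
grads = {
    "ex1": [0.1, 0.2],   # small gradient: the model already fits this
    "ex2": [1.5, -2.0],  # large gradient: informative example
    "ex3": [0.05, 0.0],
    "ex4": [0.9, 0.4],
}
print(select_by_gradient(grads, keep_frac=0.5))  # ['ex2', 'ex4']
```

In a real fine-tuning setup the per-example gradients would come from backpropagation on the preliminary model rather than hand-written lists.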

[17] Detecting Suicidal Ideation in Text with Interpretable Deep Learning: A CNN-BiGRU with Attention Mechanism

Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail

Main category: cs.CL

TL;DR: A hybrid CNN-BiGRU model with SHAP explainability achieves 93.97% accuracy in detecting suicidal ideation from social media data, outperforming existing methods.

DetailsMotivation: Suicide is the second leading cause of death for adolescents, and social media platforms contain signals of suicidal intentions that can be detected early.

Method: Combines CNN for local feature extraction and BiGRU for bidirectional sequence modeling, enhanced with attention mechanisms and SHAP for interpretability.

Result: Achieved 93.97% accuracy on a public dataset, outperforming state-of-the-art machine learning and deep learning models in comparative studies.

Conclusion: The hybrid CNN-BiGRU framework with explainable AI provides an effective and reliable solution for suicide detection from social media data.

Abstract: Worldwide, suicide is the second leading cause of death among adolescents, with past suicide attempts being an important predictor of future suicides. While some people with suicidal thoughts may try to suppress them, many signal their intentions on social media platforms. To address these issues, we propose a new hybrid deep learning scheme, the combination of a CNN architecture and a BiGRU, which can accurately identify patterns of suicidal ideation in social network (SN) datasets. We also apply Explainable AI methods using SHapley Additive exPlanations (SHAP) to interpret the prediction results and verify model reliability. This integration of CNN local feature extraction, BiGRU bidirectional sequence modeling, attention mechanisms, and SHAP interpretability provides a comprehensive framework for suicide detection. Training and evaluation of the system were performed on a publicly available dataset, with several performance metrics used to evaluate model performance. Our method achieved 93.97% accuracy in experimental results. A comparative study against state-of-the-art machine learning and deep learning models from the existing literature demonstrates the superiority of our proposed technique over all competing methods.

[18] Structured Uncertainty guided Clarification for LLM Agents

Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha

Main category: cs.CL

TL;DR: SAGE-Agent uses structured uncertainty modeling to efficiently clarify ambiguous tool-call parameters, achieving higher task coverage with fewer questions than existing methods.

DetailsMotivation: Ambiguous user instructions in LLM agents lead to incorrect tool invocations and task failures, requiring a principled approach to handle uncertainty in tool-call parameters.

Method: Model joint tool-argument clarification as a POMDP with EVPI objective for optimal question selection, using aspect-based cost modeling to prevent redundancy. Also uses uncertainty-weighted GRPO training for reinforcement learning.

Result: Increases coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to baselines. Boosts When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model).

Conclusion: Structured uncertainty provides a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.

Abstract: LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$\times$ compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
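The abstract names an Expected Value of Perfect Information (EVPI) objective without spelling it out. A minimal sketch of the underlying calculation (hypothetical argument distributions and costs, not SAGE-Agent's implementation): a clarification question is scored by the expected drop in entropy over the uncertain tool argument, minus the cost of asking.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a {value: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_value_of_question(prior, answers, cost):
    """EVPI-style score: expected reduction in uncertainty over the
    tool-call argument if the question is answered, minus asking cost.
    `answers` maps each possible answer to (probability, posterior dist)."""
    expected_posterior_h = sum(p * entropy(post) for p, post in answers.values())
    return (entropy(prior) - expected_posterior_h) - cost

# Hypothetical uncertainty over a "destination" argument of a booking tool.
prior = {"Paris": 0.5, "Perth": 0.5}
# A question that fully resolves the argument...
resolving = {"ans=Paris": (0.5, {"Paris": 1.0}), "ans=Perth": (0.5, {"Perth": 1.0})}
# ...versus one whose answer leaves the uncertainty untouched.
useless = {"ans=any": (1.0, {"Paris": 0.5, "Perth": 0.5})}
print(expected_value_of_question(prior, resolving, cost=0.1))  # 0.9
print(expected_value_of_question(prior, useless, cost=0.1))    # -0.1
```

Selecting the argmax of this score over candidate questions yields the informative question and skips the redundant one, which is the behavior the aspect-based cost modeling is meant to enforce.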

[19] Toward Automated Cognitive Assessment in Parkinson’s Disease Using Pretrained Language Models

Varada Khanna, Nilay Bhatt, Ikgyu Shin, Sule Tinaz, Yang Ren, Hua Xu, Vipina K. Keloth

Main category: cs.CL

TL;DR: NLP models were developed to extract cognitive process categories from Parkinson’s disease patient narratives, with fine-tuned Meta-Llama-3-8B-Instruct achieving the best performance.

DetailsMotivation: To understand cognitive experiences in Parkinson's disease patients by extracting insights from unstructured narratives, which is challenging due to subtle and overlapping cognitive constructs.

Method: Compared three NLP model families: Bio_ClinicalBERT for nested entity recognition, fine-tuned Meta-Llama-3-8B-Instruct using QLoRA, and GPT-4o mini in zero- and few-shot settings for extracting seven cognitive categories.

Result: Fine-tuned Meta-Llama-3-8B-Instruct achieved highest overall F1-scores (0.74 micro, 0.59 macro), excelling in context-dependent categories. Bio_ClinicalBERT had high precision but low recall, performing well on some categories but failing on others like thought and emotion.

Conclusion: This NLP task is challenging due to abstract cognitive processes, but refined systems show promise for low-burden cognitive monitoring in Parkinson’s disease as a complement to formal assessments.

Abstract: Understanding how individuals with Parkinson’s disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families, a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a fine-tuned Meta-Llama-3-8B-Instruct model using QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings, were compared on their performance in extracting seven categories. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall; it performed comparably to Llama on some category types, such as location and time, but failed on others, such as thought, emotion, and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.

[20] BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

Farah Binta Haque, Md Yasin, Shishir Saha, Md Shoaib Akhter Rafi, Farig Sadeque

Main category: cs.CL

TL;DR: BNLI is a refined Bengali NLI dataset addressing inconsistencies in existing resources, with rigorous annotation and balanced classes, benchmarked on transformer models to improve reliability for Bengali language inference.

DetailsMotivation: Existing Bengali NLI datasets have inconsistencies like annotation errors, ambiguous pairs, and limited linguistic diversity, hindering effective model training and evaluation.

Method: Constructed through rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes; benchmarked with transformer-based architectures including multilingual and Bengali-specific models.

Result: Experimental findings show improved reliability and interpretability with BNLI, establishing it as a strong foundation for Bengali and low-resource language inference research.

Conclusion: BNLI provides a reliable, curated dataset that advances Bengali NLI research and serves as a foundation for low-resource language inference tasks.

Abstract: Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.

[21] Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents

Yejin Yoon, Yuri Son, Namyoung So, Minseo Kim, Minsoo Cho, Chanhee Park, Seungshin Lee, Taeuk Kim

Main category: cs.CL

TL;DR: TACT dataset enables unified modeling of task-oriented dialogue and chitchat with fluid transitions, enhanced by DPO for improved response quality and transition control.

DetailsMotivation: Real-world conversations naturally transition between task-oriented and open-ended modes, but existing systems handle them separately, creating a gap in unified conversational modeling.

Method: Introduces TACT dataset with diverse mode transitions, proposes Switch and Recovery metrics, and applies Direct Preference Optimization (DPO) to enhance transition-aware dialogue modeling.

Result: TACT-trained models outperform baselines, achieving 75.74% joint mode-intent accuracy and 70.1% win rate against GPT-4o in human evaluation.

Conclusion: Combining structurally diverse transition data with DPO enables more proactive and transition-aware conversational agents with improved response quality.

Abstract: Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited progress in unifying the two. Yet, real-world conversations naturally involve fluid transitions between these modes. To address this gap, we introduce TACT (TOD-And-Chitchat Transition), a dataset designed for transition-aware dialogue modeling that incorporates structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex conversational dynamics. To evaluate an agent’s ability to initiate and recover from mode transitions, we propose two new metrics – Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields additional gains, achieving 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation. These results demonstrate that pairing structurally diverse data with DPO enhances response quality and transition control, paving the way for more proactive and transition-aware conversational agents.
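The abstract applies Direct Preference Optimization (DPO) without defining it. For reference, the standard DPO objective (from Rafailov et al.; the paper presumably uses this or a close variant) trains the policy $\pi_\theta$ against a frozen reference $\pi_{\mathrm{ref}}$ on preference pairs of a chosen response $y_w$ and a rejected response $y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
    \log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

Here $\sigma$ is the logistic function and $\beta$ controls how far the policy may drift from the reference; in TACT's setting the preferred responses would be those handling mode transitions well.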

[22] BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation

Fuyi Yang, Chenchen Ye, Mingyu Derek Ma, Yijia Xiao, Matthew Yang, Wei Wang

Main category: cs.CL

TL;DR: BioVerge is a benchmark and LLM-based agent framework for biomedical hypothesis generation that addresses limitations of traditional methods by using structured and textual data from PubMed, with self-evaluation improving hypothesis novelty and relevance.

DetailsMotivation: Current biomedical hypothesis generation methods rely on single data types or predefined patterns, limiting discovery of novel connections. LLM agents show potential but lack standardized datasets and environments for biomedical applications.

Method: BioVerge Agent uses ReAct-based approach with Generation and Evaluation modules that iteratively produce and self-assess hypotheses. Combines structured and textual data from historical biomedical hypotheses and PubMed literature.

Result: Different agent architectures affect exploration diversity and reasoning strategies. Both structured and textual information sources provide unique critical contexts. Self-evaluation significantly improves hypothesis novelty and relevance.

Conclusion: BioVerge provides a standardized environment for biomedical hypothesis generation, demonstrating that LLM agents with self-evaluation and diverse data sources can effectively generate novel and relevant biomedical hypotheses.

Abstract: Hypothesis generation in biomedical research has traditionally centered on uncovering hidden relationships within vast scientific literature, often using methods like Literature-Based Discovery (LBD). Despite progress, current approaches typically depend on single data types or predefined extraction patterns, which restricts the discovery of novel and complex connections. Recent advances in Large Language Model (LLM) agents show significant potential, with capabilities in information retrieval, reasoning, and generation. However, their application to biomedical hypothesis generation has been limited by the absence of standardized datasets and execution environments. To address this, we introduce BioVerge, a comprehensive benchmark, and BioVerge Agent, an LLM-based agent framework, to create a standardized environment for exploring biomedical hypothesis generation at the frontier of existing scientific knowledge. Our dataset includes structured and textual data derived from historical biomedical hypotheses and PubMed literature, organized to support exploration by LLM agents. BioVerge Agent utilizes a ReAct-based approach with distinct Generation and Evaluation modules that iteratively produce and self-assess hypothesis proposals. Through extensive experimentation, we uncover key insights: 1) different architectures of BioVerge Agent influence exploration diversity and reasoning strategies; 2) structured and textual information sources each provide unique, critical contexts that enhance hypothesis generation; and 3) self-evaluation significantly improves the novelty and relevance of proposed hypotheses.

[23] Hallucinate or Memorize? The Two Sides of Probabilistic Learning in Large Language Models

Junichiro Niimi

Main category: cs.CL

TL;DR: LLM citation hallucination depends on citation count as proxy for training data redundancy - highly cited papers show lower hallucination rates due to verbatim memorization beyond ~1,000 citations.

DetailsMotivation: Address the problem of LLMs hallucinating non-existent papers in citation recommendation, building on prior studies about how training data frequency affects factual accuracy.

Method: Used GPT-4.1 to generate 100 citations across 20 computer-science domains, manually verified them, and measured factual consistency via cosine similarity between generated and authentic metadata.

Result: Citation count strongly correlated with factual accuracy; bibliographic information becomes almost verbatim memorized beyond ~1,000 citations; memory interference occurs when multiple highly cited papers share similar content.

Conclusion: There’s a threshold (~1,000 citations) where generalization shifts into memorization, with highly cited papers being nearly verbatim retained in LLMs.

Abstract: Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in citation recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM’s ability to correctly produce bibliographic records depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., those that appear more frequently in the pretraining corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record appears in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 citations across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) citation count is strongly correlated with factual accuracy, (ii) bibliographic information becomes almost verbatim memorized beyond roughly 1,000 citations, and (iii) memory interference occurs when multiple highly cited papers share similar content. These findings indicate a threshold where generalization shifts into memorization, with highly cited papers being nearly verbatim retained in the model.
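The cosine-similarity consistency check the abstract mentions can be sketched with a bag-of-words representation (the paper does not specify its vectorization here, so this token-count variant is only illustrative; an embedding model would be the more likely choice):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two metadata strings, using simple
    lowercase token counts as the vector representation."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

generated = "Attention Is All You Need, Vaswani et al., NeurIPS 2017"
authentic = "Attention Is All You Need, Vaswani et al., NIPS 2017"
hallucinated = "Deep Fictional Networks, Smith et al., 2019"
# A near-verbatim record scores much higher than a fabricated one.
print(cosine_sim(generated, authentic) > cosine_sim(generated, hallucinated))  # True
```

A near-1.0 similarity against the authentic record is the signature of the verbatim memorization the paper reports above roughly 1,000 citations.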

[24] HalluClean: A Unified Framework to Combat Hallucinations in LLMs

Yaxin Zhao, Yu Zhang

Main category: cs.CL

TL;DR: HalluClean is a lightweight, task-agnostic framework that detects and corrects hallucinations in LLM-generated text using a reasoning-enhanced paradigm without external knowledge or supervised training.

DetailsMotivation: LLMs often produce hallucinated content that undermines factual reliability, creating a need for methods to improve factual consistency in LLM outputs.

Method: Uses a reasoning-enhanced paradigm with planning, execution, and revision stages to identify and refine unsupported claims. Employs minimal task-routing prompts for zero-shot generalization across domains without external knowledge sources.

Result: Significantly improves factual consistency and outperforms competitive baselines across five tasks: question answering, dialogue, summarization, math word problems, and contradiction detection.

Conclusion: HalluClean demonstrates potential to enhance the trustworthiness of LLM outputs in real-world applications through its lightweight, task-agnostic approach to hallucination detection and correction.

Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
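The plan/execute/revise decomposition described above can be sketched as a pipeline skeleton (the prompts, the `llm` interface, and the stub model below are all hypothetical; HalluClean's actual task-routing prompts are not reproduced here):

```python
def plan_execute_revise(claims, llm):
    """Minimal three-stage skeleton in the spirit of HalluClean.
    `llm(prompt)` is any callable returning a string."""
    revised = []
    for claim in claims:
        # Plan: decide what would need to hold for the claim to be supported.
        plan = llm(f"List the checks needed to verify: {claim}")
        # Execute: reason through the checks and judge the claim.
        verdict = llm(f"Claim: {claim}\nChecks:\n{plan}\n"
                      f"Is the claim supported? yes/no")
        # Revise: rewrite claims judged unsupported.
        if verdict.strip().lower().startswith("no"):
            claim = llm(f"Rewrite this claim to remove unsupported content: {claim}")
        revised.append(claim)
    return revised

# A stub "LLM" that flags one known-bad claim, for demonstration only.
def stub_llm(prompt):
    if "Is the claim supported?" in prompt:
        return "no" if "moon is made of cheese" in prompt else "yes"
    if prompt.startswith("Rewrite"):
        return "[claim removed: unsupported]"
    return "check sources"

print(plan_execute_revise(
    ["water boils at 100 C", "the moon is made of cheese"], stub_llm))
```

With a real model behind `llm`, each stage would be a separate prompted call, which is what makes the framework task-agnostic and free of external knowledge sources.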

[25] TiDAR: Think in Diffusion, Talk in Autoregression

Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov

Main category: cs.CL

TL;DR: TiDAR is a hybrid architecture that combines diffusion for parallel token drafting and autoregressive sampling in a single forward pass, achieving AR-level quality with 4.71x-5.91x higher throughput.

DetailsMotivation: To bridge the gap between diffusion models' parallel generation capability and autoregressive models' superior quality, creating a synergy that delivers both high throughput and AR-level quality.

Method: TiDAR uses structured attention masks to enable parallel token drafting via diffusion and autoregressive sampling in a single forward pass, exploiting GPU compute density while maintaining exact KV cache support.

Result: TiDAR outperforms speculative decoding in throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality, achieving 4.71x-5.91x higher tokens per second than AR models.

Conclusion: TiDAR successfully closes the quality gap with AR models while significantly improving generation throughput, making it the first architecture to effectively balance parallel drafting efficiency with autoregressive-level quality.

Abstract: Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
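The draft-then-verify pattern that TiDAR shares with speculative decoding can be illustrated with a toy loop (hypothetical toy "models" over a fixed string; TiDAR's single-pass structured attention masks and KV-cache handling are not modeled here):

```python
def draft_and_verify(draft_fn, verify_fn, prefix, block=4):
    """Toy draft-then-verify step: a parallel drafter proposes `block`
    tokens; the autoregressive verifier accepts the longest agreeing
    prefix and, on the first disagreement, substitutes its own token."""
    draft = draft_fn(prefix, block)
    accepted = []
    for tok in draft:
        expected = verify_fn(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # drafter matched the verifier
        else:
            accepted.append(expected)  # take the verifier's token, stop
            break
    return accepted

# Hypothetical toy models over a fixed target sequence of characters.
TARGET = list("the quick brown fox")
verify = lambda ctx: TARGET[len(ctx)]                    # "AR" ground truth
drafter = lambda ctx, n: TARGET[len(ctx):len(ctx) + n]   # perfect drafter
print("".join(draft_and_verify(drafter, verify, [], block=4)))  # "the "
```

When the drafter is accurate, several tokens are accepted per step, which is where the multi-fold throughput gain over one-token-per-pass AR decoding comes from; TiDAR's contribution is doing the drafting and verification inside one forward pass.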

[26] EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Longfei Zuo, Barbara Plank, Siyao Peng

Main category: cs.CL

TL;DR: EVADE framework uses LLMs to generate and validate explanations for detecting errors in NLI datasets, reducing human annotation costs while improving dataset quality.

DetailsMotivation: Human label variation in NLI makes it hard to separate annotation errors from plausible variation, and existing two-round human annotation frameworks like VARIERR are costly and limit coverage.

Method: Proposes EVADE framework using LLMs to generate and validate explanations for error detection, comparing human- and LLM-detected errors across distribution comparison, validation overlap, and fine-tuning impact.

Result: LLM validation aligns explanation distributions with human annotations, and removing LLM-detected errors improves fine-tuning performance more than removing human-identified errors.

Conclusion: LLMs can scale error detection effectively, reducing human effort while enhancing dataset quality under label variation scenarios.

Abstract: High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework VARIERR (Weber-Genzel et al., 2024) asks multiple annotators to explain their label decisions in the first round and flag errors via validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields greater improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

[27] SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving

Shengmin Piao, Sanghyun Park

Main category: cs.CL

TL;DR: SpiralThinker is a unified framework for latent reasoning that performs iterative updates over latent representations, enabling extended implicit reasoning without generating tokens, with progressive alignment maintaining coherence between latent and textual reasoning.

DetailsMotivation: Existing latent reasoning methods lack mechanisms for stable evolution of latent representations and systematic interleaving of implicit and explicit reasoning.

Method: Iterative updates over latent representations with progressive alignment objective and structured annotations to maintain coherence between latent and textual reasoning.

Result: Achieves best overall performance among latent reasoning approaches across mathematical, logical, and commonsense reasoning tasks, consistently surpassing previous methods across all benchmarks.

Conclusion: SpiralThinker bridges iterative computation and latent reasoning, demonstrating that aligned iterative updates can reliably steer reasoning in the latent space, with both iteration and alignment being indispensable.

Abstract: Recent advances in large reasoning models have been driven by reinforcement learning and test-time scaling, accompanied by growing interest in latent rather than purely textual reasoning. However, existing latent reasoning methods lack mechanisms to ensure stable evolution of latent representations and a systematic way to interleave implicit and explicit reasoning. We introduce SpiralThinker, a unified framework that performs iterative updates over latent representations, enabling extended implicit reasoning without generating additional tokens. A progressive alignment objective combined with structured annotations maintains coherence between latent and textual reasoning. Across mathematical, logical, and commonsense reasoning tasks, SpiralThinker achieves the best overall performance among latent reasoning approaches, consistently surpassing previous methods across all benchmarks. Detailed analyses reveal that both iteration and alignment are indispensable, the numbers of latent tokens and iterations exhibit dataset-specific optima, and appropriate alignment proves critical for an effective iterative process. Overall, SpiralThinker bridges iterative computation and latent reasoning, demonstrating that aligned iterative updates can reliably steer reasoning in the latent space.
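The core mechanism, iterative refinement of a latent state without emitting tokens, can be sketched generically. The residual update rule and normalization below are placeholders for illustration, not the paper's learned update; `step_fn` stands in for the model's latent transition.

```python
import numpy as np

def spiral_latent_reasoning(h0, step_fn, iters=4):
    """Refine a latent state over several iterations without generating
    tokens -- the general shape of 'iterative updates over latent
    representations'. The residual step and renormalization are generic
    stand-ins for the paper's learned update."""
    h = h0
    trajectory = [h0]
    for _ in range(iters):
        h = h + step_fn(h)           # residual latent update
        h = h / np.linalg.norm(h)    # keep the state on a stable scale
        trajectory.append(h)
    return h, trajectory

# Toy attractor: the update pulls the state toward a target direction.
t_hat = np.array([0.6, 0.8])
h_final, traj = spiral_latent_reasoning(np.array([1.0, 0.0]),
                                        lambda h: 0.5 * (t_hat - h))
```

The paper's alignment objective would additionally tie intermediate states like those in `traj` to textual reasoning steps.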

[28] Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models

Zhouxing Tan, Ruochong Xiong, Yulong Wan, Jinlong Ma, Hanlin Xue, Qichun Deng, Haifeng Jing, Zhengtong Zhang, Depei Liu, Shiyuan Luo, Junfei Liu

Main category: cs.CL

TL;DR: The paper proposes a trajectory-based evaluation framework for assessing LLMs’ emotional support capabilities, moving beyond static dialogues to track emotional dynamics over time using psychologically-grounded strategies and novel metrics.

DetailsMotivation: Existing LLM evaluations for emotional support rely on short, static dialogues and fail to capture the dynamic, long-term nature of emotional support interactions, necessitating a more realistic assessment approach.

Method: Developed a large-scale benchmark with 328 emotional contexts and 1,152 disturbance events, constrained model outputs using validated emotion regulation strategies, modeled user emotional trajectories as Markov processes, and introduced three trajectory-level metrics (BEL, ETV, ECP).

Result: Extensive evaluations across diverse LLMs revealed significant disparities in emotional support capabilities, providing actionable insights for model development.

Conclusion: The trajectory-based framework enables comprehensive assessment of long-term emotional support performance in LLMs, addressing limitations of snapshot-based evaluations and supporting improved model development.

Abstract: Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. However, existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. To overcome this limitation, we shift from snapshot-based evaluation to trajectory-based assessment, adopting a user-centered perspective that evaluates models based on their ability to improve and stabilize user emotional states over time. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios. To encourage psychologically grounded responses, we constrain model outputs using validated emotion regulation strategies such as situation selection and cognitive reappraisal. User emotional trajectories are modeled as a first-order Markov process, and we apply causally-adjusted emotion estimation to obtain unbiased emotional state tracking. Based on this framework, we introduce three trajectory-level metrics: Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP). These metrics collectively capture user emotional dynamics over time and support comprehensive evaluation of long-term emotional support performance of LLMs. Extensive evaluations across a diverse set of LLMs reveal significant disparities in emotional support capabilities and provide actionable insights for model development.
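The three trajectory-level metrics can be illustrated on a per-turn emotion time series. The formulas below are plausible readings of the metric names, not the paper's exact definitions:

```python
import statistics

def trajectory_metrics(emotions, warmup=3):
    """Illustrative formulations of the paper's trajectory metrics for a
    user emotion series (e.g. valence in [-1, 1], one value per turn).
    These are guesses at the definitions, not taken from the paper:
      BEL - Baseline Emotional Level: mean emotion after a warmup period.
      ETV - Emotional Trajectory Volatility: std of turn-to-turn changes.
      ECP - Emotional Centroid Position: time-weighted mean, so later
            turns (after support has taken effect) count more.
    """
    bel = statistics.fmean(emotions[warmup:])
    diffs = [b - a for a, b in zip(emotions, emotions[1:])]
    etv = statistics.pstdev(diffs)
    weights = range(1, len(emotions) + 1)
    ecp = sum(w * e for w, e in zip(weights, emotions)) / sum(weights)
    return bel, etv, ecp

# A user who starts distressed and gradually stabilizes:
traj = [-0.6, -0.4, -0.1, 0.0, 0.2, 0.3, 0.35]
bel, etv, ecp = trajectory_metrics(traj)
```

A supportive model should raise BEL and ECP over the dialogue while keeping ETV low.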

[29] A Neurosymbolic Approach to Natural Language Formalization and Verification

Sam Bayless, Stefano Buliani, Darion Cassel, Byron Cook, Duncan Clough, Rémi Delmas, Nafi Diallo, Ferhat Erata, Nick Feng, Dimitra Giannakopoulou, Aman Goel, Aditya Gokhale, Joe Hendrix, Marc Hudak, Dejan Jovanović, Andrew M. Kent, Benjamin Kiesl-Reiter, Jeffrey J. Kuna, Nadia Labai, Joseph Lilien, Divya Raghunathan, Zvonimir Rakamarić, Niloofar Razavi, Michael Tautschnig, Ali Torkamani, Nathaniel Weir, Michael W. Whalen, Jianan Yao

Main category: cs.CL

TL;DR: A neurosymbolic framework combining LLMs with logical formalization to ensure policy compliance in regulated industries, achieving over 99% soundness in logical validation.

DetailsMotivation: Address the stochastic nature of LLMs that limits their adoption in regulated industries like finance and healthcare where strict policy compliance is required.

Method: Two-stage framework: (1) formalize natural language policies using LLMs with optional human guidance, (2) inference-time autoformalization to validate logical correctness with redundant formalization steps for cross-checking semantic equivalence.

Result: Achieves over 99% soundness with near-zero false positive rate in identifying logical validity, producing auditable logical artifacts.

Conclusion: The approach enables reliable policy compliance verification in regulated domains while maintaining auditability and providing artifacts for improving original text.

Abstract: Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text.
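The redundant-formalization cross-check can be illustrated in miniature. Below, independently produced formalizations of one policy are modeled as boolean predicates, and an exhaustive truth-table comparison stands in for the semantic-equivalence query a real system would pose to a solver; the policy and predicates are invented for illustration.

```python
from itertools import product

# Toy stand-ins for redundant formalizations of the same policy, e.g.
# "access is allowed only if the user is verified and the amount is
# below the limit". A real pipeline would emit logic and call a solver;
# here a truth-table check substitutes for semantic equivalence.
f1 = lambda verified, under_limit: verified and under_limit
f2 = lambda verified, under_limit: not (not verified or not under_limit)
f3 = lambda verified, under_limit: verified  # a faulty formalization

def semantically_equivalent(a, b, arity=2):
    return all(a(*bits) == b(*bits)
               for bits in product([False, True], repeat=arity))

def cross_check(formalizations):
    """Accept only if all redundant formalizations agree pairwise."""
    first = formalizations[0]
    return all(semantically_equivalent(first, g) for g in formalizations[1:])

ok = cross_check([f1, f2])        # equivalent forms: accept
bad = cross_check([f1, f2, f3])   # disagreement flags a formalization error
```

Rejecting on any disagreement is what drives the false positive rate toward zero: an answer is only trusted when every independent formalization agrees.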

[30] MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

Gailun Zeng, Ziyang Luo, Hongzhan Lin, Yuchen Tian, Kaixin Li, Ziyang Gong, Jianxiong Guo, Jing Ma

Main category: cs.CL

TL;DR: MM-CRITIC is a comprehensive benchmark for evaluating multimodal critique capabilities of Large Multimodal Models across basic, correction, and comparison dimensions, covering 8 task types and over 500 tasks.

DetailsMotivation: Multimodal critique ability is crucial for LMMs to self-improve and become reliable AI assistants, but this capability remains underexplored compared to language-only settings.

Method: Created MM-CRITIC benchmark with 4471 samples, integrated expert-informed ground answers into scoring rubrics, and used GPT-4o to annotate responses and generate reference critiques for reliable evaluation.

Result: Extensive experiments validated MM-CRITIC’s effectiveness and provided comprehensive assessment of leading LMMs’ critique capabilities, revealing correlations between response quality and critique, and varying difficulty across dimensions.

Conclusion: MM-CRITIC serves as a reliable benchmark for evaluating multimodal critique abilities, offering key insights into LMMs’ performance and enabling future improvements in multimodal AI assistants.

Abstract: The ability of critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs’ critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.

[31] Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition

Chao Wang, Yuqing Cai, Renzeng Duojie, Jin Zhang, Yutong Liu, Nyima Tashi

Main category: cs.CL

TL;DR: Streaming speech recognition framework for Amdo Tibetan using hybrid CTC/Attention with dynamic chunking and external language model, achieving 48.15% relative WER improvement.

DetailsMotivation: To address context truncation problems in fixed-chunk methods and adapt to varying speaking rates in Amdo Tibetan speech recognition.

Method: Hybrid CTC/Attention architecture with context-aware dynamic chunking mechanism, linguistically motivated lexicon based on Tibetan orthography, and external language model integration during decoding.

Result: Achieved 6.23% WER on test set, 48.15% relative improvement over fixed-chunk baseline, with reduced latency and performance close to global decoding.

Conclusion: The proposed framework effectively handles Tibetan linguistic characteristics and streaming requirements, demonstrating significant performance gains over traditional methods.

Abstract: In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15% relative improvement over the fixed-chunk baseline, while significantly reducing recognition latency and maintaining performance close to global decoding.
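The chunking idea can be sketched independently of the acoustic model. Below, a per-frame boundary score (which in the paper would be derived from encoder states) closes the current chunk early when it crosses a threshold, otherwise the chunk grows to a maximum width; the scoring rule and thresholds are illustrative assumptions.

```python
def dynamic_chunks(boundary_scores, min_w=4, max_w=16, threshold=0.5):
    """Split a frame sequence into variable-width chunks.

    Illustrative stand-in for the paper's mechanism: a chunk closes
    early when the per-frame boundary score crosses `threshold`
    (adapting to fast speech / natural boundaries), otherwise it grows
    to `max_w` as in a fixed-chunk scheme.
    """
    chunks, start = [], 0
    for i, s in enumerate(boundary_scores):
        width = i - start + 1
        if (width >= min_w and s >= threshold) or width == max_w:
            chunks.append((start, i + 1))
            start = i + 1
    if start < len(boundary_scores):
        chunks.append((start, len(boundary_scores)))
    return chunks

# Two confident boundaries (0.9, 0.8) produce early chunk closures:
scores = [0.1] * 5 + [0.9] + [0.2] * 10 + [0.8] + [0.1] * 3
chunks = dynamic_chunks(scores)
```

With all scores low, the function degenerates to fixed `max_w` chunks, making the fixed-chunk baseline a special case.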

[32] Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Xueqi Cheng

Main category: cs.CL

TL;DR: Bi-RAR introduces a bidirectional retrieval-augmented reasoning framework that evaluates intermediate reasoning steps in both forward and backward directions using information distance, improving multi-step reasoning in RAG systems.

DetailsMotivation: Current RAG approaches struggle with complex multi-step reasoning and suffer from reward hacking due to outcome-based supervision that lacks explicit guidance for intermediate steps.

Method: Proposes Bi-RAR framework with bidirectional information distance based on Kolmogorov complexity, approximated via language model probabilities, and uses multi-objective reinforcement learning with cascading rewards for optimization.

Result: Empirical evaluation on seven QA benchmarks shows Bi-RAR outperforms previous methods and enables efficient interaction with search engines during training and inference.

Conclusion: Bi-RAR effectively addresses limitations in multi-step reasoning for RAG systems through bidirectional evaluation and reinforcement learning, demonstrating superior performance across diverse question answering tasks.

Abstract: Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
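The bidirectional distance rests on a standard approximation: the Kolmogorov complexity K(y | x) is upper-bounded by a language model's negative log-likelihood of y given x. A minimal sketch, where `toy_prob` is a word-overlap stand-in for an actual LM and the exact forward/backward conditioning is a guess at the paper's formulation:

```python
import math

def nll(text, context, cond_prob):
    """Approximate K(text | context) by LM negative log-likelihood.
    `cond_prob` stands in for a language model's p(text | context)."""
    return -math.log(cond_prob(text, context))

def bidirectional_distance(step, question, answer, cond_prob):
    """Forward: how far the current reasoning still is from the answer.
    Backward: how well it accounts for the question. Both conditionings
    are illustrative guesses at the paper's exact formulation."""
    forward = nll(answer, question + " " + step, cond_prob)
    backward = nll(question, step, cond_prob)
    return forward, backward

# Toy model: probability grows with word overlap between text and context.
def toy_prob(text, context):
    t, c = set(text.lower().split()), set(context.lower().split())
    return (1 + len(t & c)) / (1 + len(t))

f, b = bidirectional_distance(
    step="Paris is the capital of France",
    question="What is the capital of France?",
    answer="Paris",
    cond_prob=toy_prob,
)
```

A reasoning step that already contains the answer drives the forward distance to zero, while the backward distance rewards steps that still cover the question.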

[33] Assessing the Capabilities of LLMs in Humor:A Multi-dimensional Analysis of Oogiri Generation and Evaluation

Ritsu Sakabe, Hwichan Kim, Tosho Hirasawa, Mamoru Komachi

Main category: cs.CL

TL;DR: This paper evaluates LLMs’ humor capabilities using Japanese Oogiri comedy games, finding they perform at low-to-mid human level but lack empathy, which explains their inability to replicate human humor assessment.

DetailsMotivation: Previous humor evaluations of LLMs have been single-dimensional (just 'funny/not funny'), but a multifaceted understanding is needed for sophisticated dialogue systems.

Method: Expanded Oogiri datasets with LLM-generated responses, manually annotated with 5-point ratings across six dimensions (Novelty, Clarity, Relevance, Intelligence, Empathy, Overall Funniness), then evaluated LLMs on generation and evaluation tasks.

Result: LLMs generate responses at low-to-mid human performance level but show notable lack of Empathy. LLMs prioritize Novelty in evaluations while humans prioritize Empathy, explaining their failure to replicate human humor assessment.

Conclusion: Empathy is crucial for humor understanding, and LLMs’ current focus on novelty over empathy limits their humor capabilities. The annotated corpus is released to support development of more emotionally intelligent conversational agents.

Abstract: Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply "funny." This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a form of Japanese improvisational comedy games. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses using a six-dimensional evaluation. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.

[34] One-Topic-Doesn’t-Fit-All: Transcreating Reading Comprehension Test for Personalized Learning

Jieun Han, Daniel Lee, Haneul Yoo, Jinsung Yoon, Junyeong Park, Suin Kim, So-Yeon Ahn, Alice Oh

Main category: cs.CL

TL;DR: A personalized English reading comprehension test generation system using GPT-4o that creates interest-aligned passages and questions from RACE-C dataset, showing improved comprehension and motivation in EFL learners.

DetailsMotivation: To address the need for personalized learning in EFL education by creating reading materials that align with individual students' interests to enhance engagement and motivation in reading comprehension.

Method: Developed a structured content transcreation pipeline using GPT-4o, starting from RACE-C dataset, with topic extraction, Bloom’s taxonomy question classification, linguistic feature analysis, and content transcreation to generate personalized passages and questions.

Result: Controlled experiment with South Korean EFL learners showed students using personalized reading passages demonstrated improved comprehension and better motivation retention compared to non-personalized materials.

Conclusion: Personalized reading comprehension tests tailored to students’ interests effectively enhance both comprehension outcomes and motivation in EFL learning environments.

Abstract: Personalized learning has gained attention in English as a Foreign Language (EFL) education, where engagement and motivation play crucial roles in reading comprehension. We propose a novel approach to generating personalized English reading comprehension tests tailored to students’ interests. We develop a structured content transcreation pipeline using OpenAI’s gpt-4o, where we start with the RACE-C dataset, and generate new passages and multiple-choice reading comprehension questions that are linguistically similar to the original passages but semantically aligned with individual learners’ interests. Our methodology integrates topic extraction, question classification based on Bloom’s taxonomy, linguistic feature analysis, and content transcreation to enhance student engagement. We conduct a controlled experiment with EFL learners in South Korea to examine the impact of interest-aligned reading materials on comprehension and motivation. Our results show students learning with personalized reading passages demonstrate improved comprehension and motivation retention compared to those learning with non-personalized materials.

[35] DoPE: Denoising Rotary Position Embedding

Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong

Main category: cs.CL

TL;DR: DoPE is a training-free method that improves length extrapolation in Transformers by detecting outlier frequency bands in positional encoding feature maps and reparameterizing them with Gaussian distributions, effectively mitigating attention sinks.

DetailsMotivation: Rotary Position Embedding (RoPE) has inherent limits that weaken length extrapolation, causing attention sink phenomena that disrupt balanced attention patterns in extended contexts.

Method: Reinterpret attention maps as noisy feature maps, use truncated matrix entropy to detect outlier frequency bands, and reparameterize with parameter-free Gaussian distributions for robust extrapolation.

Result: Significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens) in needle-in-a-haystack and many-shot in-context learning tasks.

Conclusion: The denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization.

Abstract: Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is available at https://The-physical-picture-of-LLMs.github.io
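One plausible reading of "truncated matrix entropy" is the entropy of an attention map's top-k normalized singular values; the paper's exact definition may differ. The toy below shows why such a statistic can flag attention sinks: a sink-like map is nearly rank-1 and so has near-zero entropy, while diffuse attention spreads mass over many singular directions.

```python
import numpy as np

def truncated_matrix_entropy(attn, k=8):
    """Entropy of the top-k normalized singular values of an attention
    map -- a plausible reading of 'truncated matrix entropy', not the
    paper's verified definition. Low entropy means attention mass
    collapses onto few directions (e.g. an attention sink)."""
    s = np.linalg.svd(attn, compute_uv=False)[:k]
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
n = 32
# Sink-like map: almost all attention on the first token (near rank-1).
sink = np.full((n, n), 1e-3)
sink[:, 0] = 1.0
sink /= sink.sum(axis=1, keepdims=True)
# Diffuse map: near-uniform attention with small perturbations.
diffuse = rng.uniform(0.9, 1.1, (n, n))
diffuse /= diffuse.sum(axis=1, keepdims=True)

e_sink = truncated_matrix_entropy(sink)
e_diffuse = truncated_matrix_entropy(diffuse)
```

In DoPE's framing, frequency bands whose maps look like `sink` would be flagged as outliers and reparameterized.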

[36] LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls

Kangning Zhang, Wenxiang Jiao, Kounianhua Du, Yuan Lu, Weiwen Liu, Weinan Zhang, Lei Zhang, Yong Yu

Main category: cs.CL

TL;DR: LoopTool is an automated framework that integrates data synthesis and model training in a closed loop to improve LLM tool learning by diagnosing weaknesses, correcting noisy labels, and generating challenging samples.

DetailsMotivation: Current tool learning approaches use static synthetic data pipelines where data generation and model training are separate processes, failing to adapt to model weaknesses and allowing noisy labels to degrade training efficiency.

Method: LoopTool uses three modules: Greedy Capability Probing to diagnose model capabilities, Judgement-Guided Label Verification to correct annotation errors, and Error-Driven Data Expansion to generate challenging samples based on failures.

Result: An 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves state-of-the-art results on BFCL-v3 and ACEBench benchmarks for its scale.

Conclusion: Closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs within a cost-effective open-source ecosystem.

Abstract: Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by the static synthetic data pipelines where data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model’s specific weaknesses and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model’s mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
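The closed loop itself is simple plumbing around the three modules. The sketch below mirrors the GCP → JGLV → EDDE → retrain cycle with pluggable callables; all function bodies in the toy usage are invented stand-ins, not the released code.

```python
def looptool_round(model, dataset, probe, verify, expand, train):
    """One closed-loop data/training round mirroring the paper's modules
    (all callables are illustrative stand-ins):
      probe  - GCP: diagnose which items the model fails on;
      verify - JGLV: correct noisy labels in the dataset;
      expand - EDDE: synthesize hard samples from the failures;
      train  - retrain the model on the refined, expanded data."""
    failures = probe(model, dataset)
    dataset = verify(dataset)
    dataset = dataset + expand(failures)
    return train(model, dataset), dataset

# Toy run: the "model" is a set of mastered items.
model = {"a"}
dataset = ["a", "b"]
model2, dataset2 = looptool_round(
    model, dataset,
    probe=lambda m, d: [x for x in d if x not in m],
    verify=lambda d: d,                          # no noisy labels in this toy
    expand=lambda fails: [x + "_hard" for x in fails],
    train=lambda m, d: m | set(d),
)
```

Iterating the round keeps the data distribution focused on whatever the current model still gets wrong, which is the adaptivity static pipelines lack.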

[37] A Hybrid Search for Complex Table Question Answering in Securities Report

Daiki Shirafuji, Koji Tanaka, Tatsuhiko Saito

Main category: cs.CL

TL;DR: Proposes a cell extraction method for Table Question Answering that automatically identifies table headers using hybrid retrieval and selects answers from cell intersections, outperforming GPT-4o mini.

DetailsMotivation: LLMs struggle with complex table structures in TQA, often providing incorrect answers when processing entire tables as text due to inability to capture structural information.

Method: Hybrid retrieval combining language model and TF-IDF to compute question-cell similarities, estimate table headers, and select answer cells at row-column intersections. Uses contrastive learning on question-header pairs.

Result: Achieved 74.6% accuracy on TQA dataset from NTCIR-18 U4 shared task, outperforming GPT-4o mini (63.9%).

Conclusion: The proposed pipeline effectively handles complex table structures without manual header identification, showing significant improvement over existing LLMs. Future work will incorporate more efficient text-search models.

Abstract: Recently, Large Language Models (LLMs) are gaining increased attention in the domain of Table Question Answering (TQA), particularly for extracting information from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA without manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach in the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini (63.9%). Although we used traditional encoder models for retrieval in this study, in future work we plan to incorporate more efficient text-search models to improve performance and narrow the gap with human evaluation results.
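The header-intersection idea can be shown compactly. In the sketch below, a word-overlap score stands in for the TF-IDF component and `dense_sim` is left pluggable because the paper's contrastively trained encoder is not reproduced here; the table, headers, and weighting are invented for illustration.

```python
def tokens(text):
    return {w.strip("?.,") for w in text.lower().split()}

def lexical_sim(q, text):
    """Word-overlap score standing in for the TF-IDF component."""
    qs, ts = tokens(q), tokens(text)
    return len(qs & ts) / (len(ts) or 1)

def answer_cell(question, row_headers, col_headers, cells,
                dense_sim=None, alpha=0.5):
    """Return the cell at the intersection of the best-matching row and
    column header, scoring each header with a weighted mix of a dense
    (language-model) similarity and a lexical one -- a sketch of the
    hybrid retrieval, not the paper's implementation."""
    dense_sim = dense_sim or lexical_sim  # fallback: lexical only
    def score(h):
        return (alpha * dense_sim(question, h)
                + (1 - alpha) * lexical_sim(question, h))
    r = max(range(len(row_headers)), key=lambda i: score(row_headers[i]))
    c = max(range(len(col_headers)), key=lambda j: score(col_headers[j]))
    return cells[r][c]

rows = ["net sales", "operating income", "total assets"]
cols = ["fiscal 2023", "fiscal 2024"]
cells = [["1,200", "1,350"], ["210", "260"], ["5,400", "5,900"]]
ans = answer_cell("What was the operating income in fiscal 2024?",
                  rows, cols, cells)
```

Because only headers are scored (not every cell), the method sidesteps feeding the full table to an LLM.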

[38] Context is Enough: Empirical Validation of $\textit{Sequentiality}$ on Essays

Amal Sunny, Advay Gupta, Vishnu Sreekumar

Main category: cs.CL

TL;DR: Context-based sequentiality (using only contextual terms) better aligns with human assessments of essay organization and cohesion than the original topic-context combined approach, and enhances automated essay scoring when combined with standard linguistic features.

DetailsMotivation: To empirically validate the proposal that using only contextual terms in sequentiality measurement is more conceptually valid and interpretable than the original topic-context combined approach, and address concerns about topic selection confounding and lack of validation against ground-truth flow measures.

Method: Used two essay datasets (ASAP++ and ELLIPSE) with human-annotated trait scores to compare different sequentiality formulations. Evaluated contextual-only sequentiality against topic-only and original combined versions, and tested combinations with standard linguistic features and zero-shot LLM predictions.

Result: Contextual sequentiality aligns more closely with human assessments of Organization and Cohesion. When combined with standard linguistic features, it adds more predictive value than topic-only or original sequentiality, and outperforms zero-shot LLM predictions.

Conclusion: Context-based sequentiality is a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks, supporting explicit modeling of sentence-to-sentence flow.

Abstract: Recent work has proposed using Large Language Models (LLMs) to quantify narrative flow through a measure called sequentiality, which combines topic and contextual terms. A recent critique argued that the original results were confounded by how topics were selected for the topic-based component, and noted that the metric had not been validated against ground-truth measures of flow. That work proposed using only the contextual term as a more conceptually valid and interpretable alternative. In this paper, we empirically validate that proposal. Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, we show that the contextual version of sequentiality aligns more closely with human assessments of discourse-level traits such as Organization and Cohesion. While zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations when combined with standard linguistic features. Notably, this combination also outperforms the zero-shot LLM predictions, highlighting the value of explicitly modeling sentence-to-sentence flow. Our findings support the use of context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks.
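
As a rough illustration of the contextual-only formulation (one plausible reading of the metric, not the paper's exact code), the sketch below averages each sentence's negative log-likelihood given its preceding sentences, with the language-model scorer left pluggable; the toy unigram scorer is purely hypothetical.

```python
def contextual_sequentiality(sentences, nll):
    """Average per-sentence NLL conditioned on all preceding sentences.
    `nll(sentence, context)` is any language-model scorer; a lower average
    indicates text the model finds easier to continue (smoother flow)."""
    total = 0.0
    for i, sent in enumerate(sentences):
        context = " ".join(sentences[:i])
        total += nll(sent, context)
    return total / len(sentences)

def toy_nll(sentence, context):
    """Toy scorer: words already seen in the context are 'cheap'
    (stands in for a real LLM's conditional NLL)."""
    seen = set(context.lower().split())
    words = sentence.lower().split()
    return sum(0.5 if w in seen else 2.0 for w in words) / len(words)

smooth = ["the cat sat", "the cat slept", "the cat purred"]
choppy = ["the cat sat", "quarterly revenue grew", "bake at 200 degrees"]
s_smooth = contextual_sequentiality(smooth, toy_nll)
s_choppy = contextual_sequentiality(choppy, toy_nll)
```

A real scorer would replace `toy_nll` with per-token NLLs from an LLM; the essay-level value can then be used as a feature alongside standard linguistic features.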

[39] The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Francois Meyer, Jan Buys

Main category: cs.CL

TL;DR: This paper studies how subword segmentation evolves during neural language model training, showing that learnable subwords go through four distinct learning stages and become finer-grained during finetuning, with benefits for morphologically complex languages.

Motivation: To understand the learning dynamics of subword segmentation when it's optimized during training rather than fixed in preprocessing, and explore how subwords evolve across different morphological language types.

Method: Extended the subword segmental language model (SSLM) framework to support pretraining and finetuning, trained models for three typologically diverse languages (isi-Xhosa, Setswana, English), and analyzed subword dynamics using linguistic metrics like morphology, productivity, and fertility.

Result: Identified four stages of subword learning with isi-Xhosa showing greater instability, observed subword boundaries shifting to finer granularity during finetuning, and demonstrated improved text generation and cross-lingual transfer for low-resource morphologically complex languages.

Conclusion: Learnable subwords offer a promising approach for handling morphologically complex languages, with dynamic optimization during training leading to better performance than fixed preprocessing approaches.

Abstract: Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

[40] Pretraining Finnish ModernBERTs

Akseli Reunamo, Laura-Maria Peltonen, Hans Moen, Sampo Pyysalo

Main category: cs.CL

TL;DR: ModernBERT encoder models in six sizes (51M-475M parameters), pretrained with limited multilingualism emphasizing languages relevant to Finland; they are competitive with existing multilingual models and outperform monolingual models on long-context tasks.

Motivation: To develop multilingual encoder models specifically focused on languages relevant to Finland, addressing the need for efficient models that perform well on long-context tasks.

Method: Pretrained ModernBERT encoder models in six parameter sizes (51M to 475M) with a limited multilingual emphasis on languages relevant to Finland, experimenting with different data in the final training stage.

Result: Models are competitive with or superior to existing multilingual models, and outperform monolingual models on tasks requiring context longer than 512 tokens.

Conclusion: The developed ModernBERT models address the need for efficient multilingual models for languages relevant to Finland, with strong performance on long-context tasks, and are publicly released.

Abstract: This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland. Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens. We present empirical results on using different data in the final stage of training. The code and models are publicly released.

[41] Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

Main category: cs.CL

TL;DR: RLVR methods struggle with early training collapse in honesty alignment tasks. The paper proposes Anchor, a reinforcement learning method that injects ground truth trajectories to stabilize learning and improve deductive reasoning performance.

Motivation: Existing RLVR methods optimize only for final outcomes, making models vulnerable to collapse when negative rewards dominate early training, especially in honesty alignment where models must identify unanswerable queries.

Method: Created two deductive reasoning datasets from graph structures (linear algebra and logical inference) with unanswerable cases, then proposed Anchor - an RL method that injects ground truth trajectories into rollouts to prevent early training collapse.

Result: Anchor stabilizes learning and significantly improves overall reasoning performance compared to GRPO and other methods, demonstrating the importance of training dynamics for reliable deductive reasoning.

Conclusion: Training dynamics are crucial for honesty alignment in language models, and the proposed Anchor method effectively addresses early training collapse to enable reliable deductive reasoning.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.
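
The core idea of Anchor, as described above, can be sketched in a few lines: append the ground-truth trajectory (with its known reward) to each rollout group before computing GRPO-style group-normalized advantages, so a positive learning signal exists even when every sampled rollout fails. All names are illustrative; this is not the authors' implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: (r - group mean) / group std."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sd for r in rewards]

def anchored_group(sampled, gt_trajectory, gt_reward=1.0):
    """Inject the ground-truth trajectory into the rollout group."""
    trajectories = [t for t, _ in sampled] + [gt_trajectory]
    rewards = [r for _, r in sampled] + [gt_reward]
    return list(zip(trajectories, grpo_advantages(rewards)))

# early-training failure case: every sampled rollout earns reward 0,
# so plain GRPO advantages would all be zero and learning would stall
sampled = [("wrong proof A", 0.0), ("wrong proof B", 0.0), ("wrong proof C", 0.0)]
group = anchored_group(sampled, "ground-truth derivation")
```

Without the anchor, `grpo_advantages([0.0, 0.0, 0.0])` is all zeros (no gradient); with it, the ground-truth trajectory receives a positive advantage and the failed rollouts negative ones.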

[42] C³TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation

Yu Li, Zhe Yang, Yi Huang, Xin Liu, Guilin Qi

Main category: cs.CL

TL;DR: C³TG is a two-phase framework for fine-grained multi-attribute text control that uses attribute classifiers and iterative optimization to resolve conflicts without model modifications.

Motivation: Existing methods struggle with precise multi-attribute control, lack coordination for conflicting attributes, and don't incorporate iterative optimization in controlled text generation.

Method: Two-phase framework: generation phase pairs LLM with attribute classifiers using weighted KL-divergence; optimization phase uses energy function with classifier scores and penalty terms for iterative conflict resolution.

Result: Significantly outperforms baselines across attribute accuracy, linguistic fluency, output diversity metrics while reducing toxicity.

Conclusion: C³TG provides effective and flexible multi-dimensional text attribute control without costly model modifications.

Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C³TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C³TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C³TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C³TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.
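
The generation-phase reweighting can be illustrated as a product-of-experts adjustment of the next-token distribution, one common way such classifier guidance is implemented; the weights, toy vocabulary, and classifier are assumptions, not C³TG's actual code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_distribution(lm_logits, classifier_logps, weights):
    """Shift the LM's next-token log-probabilities by weighted classifier
    signals: log p'(x) proportional to log p_LM(x) + sum_k w_k * log p_k(attr | x)."""
    adjusted = list(lm_logits)
    for w, logps in zip(weights, classifier_logps):
        adjusted = [a + w * lp for a, lp in zip(adjusted, logps)]
    return softmax(adjusted)

# toy 3-token vocabulary with one "politeness" classifier favouring token 2
lm_logits = [1.0, 1.0, 0.0]
polite_logps = [math.log(0.1), math.log(0.2), math.log(0.9)]
base = softmax(lm_logits)
guided = guided_distribution(lm_logits, [polite_logps], weights=[2.0])
```

With several classifiers active at once, the per-classifier weights are where a conflict-resolution mechanism like C³TG's iterative optimization phase would intervene.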

[43] LiteraryTaste: A Preference Dataset for Creative Writing Personalization

John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yi Wang, Yuqian Sun, Tiffany Wang, Shm Garanganao Almeda, Brett A. Halperin, Yuwen Lu, Max Kreminski

Main category: cs.CL

TL;DR: The paper introduces LiteraryTaste, a dataset for personalizing creative writing LLMs by capturing individual reading preferences through both self-reported habits and annotated text pair preferences.

Motivation: People have diverse creative writing preferences, but current LLMs treat these preferences as monolithic. There's a need to develop personalized creative writing models that adapt to individual user tastes.

Method: Created LiteraryTaste dataset with 60 participants providing both stated preferences (self-reported reading habits) and revealed preferences (annotations on 100 pairs of short creative writing texts). Used finetuned transformer encoder to model preferences and LLM-driven interpretability pipeline to analyze preference variations.

Result: Found significant divergence in creative writing preferences among individuals. Transformer encoder achieved 75.8% accuracy for personal revealed preferences and 67.7% for collective preferences. Stated preferences had limited utility in predicting revealed preferences.

Conclusion: The work establishes a foundation for personalizing creative writing technologies by demonstrating the importance of modeling individual revealed preferences rather than relying solely on stated preferences or treating preferences as monolithic.

Abstract: People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user’s preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people’s preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.

[44] Towards Explainable Khmer Polarity Classification

Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing

Main category: cs.CL

TL;DR: This paper proposes an explainable Khmer polarity classifier by fine-tuning Qwen-3 model to provide self-explanations for its predictions, along with a new Khmer polarity dataset.

Motivation: Existing Khmer models predict polarity labels without explaining the rationale behind predictions, lacking explainability in Khmer NLP tasks.

Method: Fine-tuned an instruction-based reasoning Qwen-3 model and created a new Khmer polarity dataset using heuristic rules and human curation.

Result: The fine-tuned model accurately predicts labels and provides reasoning by identifying polarity-related keywords/phrases to support predictions.

Conclusion: The approach successfully creates an explainable Khmer polarity classifier and contributes a publicly available Khmer polarity dataset for future research.

Abstract: Khmer polarity classification is a fundamental natural language processing task that assigns a positive, negative, or neutral label to a given Khmer text input. Existing Khmer models typically predict the label without explaining the rationale behind the prediction. This paper proposes an explainable Khmer polarity classifier by fine-tuning an instruction-based reasoning Qwen-3 model. The notion of explainability in this paper is limited to self-explanations, which the model uses to rationalize its predictions. Experimental results show that the fine-tuned model not only predicts labels accurately but also provides reasoning by identifying polarity-related keywords or phrases to support its predictions. In addition, we contribute a new Khmer polarity dataset consisting of short- to medium-length casual, romanized, and mixed-code Khmer expressions. This dataset was constructed using both heuristic rules and human curation and is publicly available through a gated Hugging Face repository (rinabuoy/khmerpolarity_nonreasoning). The fine-tuned Qwen-3 models are also made available in the same Hugging Face account.

[45] AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness

Zhuoqun Huang, Neil G. Marchant, Olga Ohrimenko, Benjamin I. P. Rubinstein

Main category: cs.CL

TL;DR: Introduces AdaptDel methods with adaptable deletion rates for certified robustness against edit distance perturbations in sequence classification, achieving significant improvements over state-of-the-art methods.

Motivation: Current methods use fixed-rate deletion mechanisms which are suboptimal for naturally occurring inputs of varying lengths, such as sentences in NLP tasks.

Method: Extends randomized smoothing framework to variable-rate deletion with AdaptDel methods that dynamically adjust deletion rates based on input properties.

Result: Achieves up to 30 orders of magnitude improvement to median cardinality of the certified region over state-of-the-art certifications in natural language tasks.

Conclusion: AdaptDel methods provide effective certified robustness for sequence classification against edit distance perturbations by adapting deletion rates to input characteristics.

Abstract: We consider the problem of certified robustness for sequence classification against edit distance perturbations. Naturally occurring inputs of varying lengths (e.g., sentences in natural language processing tasks) present a challenge to current methods that employ fixed-rate deletion mechanisms and lead to suboptimal performance. To this end, we introduce AdaptDel methods with adaptable deletion rates that dynamically adjust based on input properties. We extend the theoretical framework of randomized smoothing to variable-rate deletion, ensuring sound certification with respect to edit distance. We achieve strong empirical results in natural language tasks, observing up to 30 orders of magnitude improvement to median cardinality of the certified region, over state-of-the-art certifications.
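
A stripped-down view of the idea: a deletion mechanism whose rate adapts to input length, wrapped in a majority-vote smoothed classifier. The schedule, constants, and toy classifier are illustrative assumptions; the paper's certified mechanism is considerably more involved.

```python
import random
from collections import Counter

def adaptive_rate(length, base=0.3, target_kept=8):
    """Toy adaptive schedule: keep roughly `target_kept` tokens on long
    inputs; fall back to a fixed base rate on short ones."""
    if length <= target_kept:
        return base
    return 1.0 - target_kept / length

def delete_tokens(tokens, rate, rng):
    kept = [t for t in tokens if rng.random() >= rate]
    return kept or tokens[:1]  # never return an empty sequence

def smoothed_predict(tokens, classify, n_samples=200, seed=0):
    """Majority vote of the base classifier over randomly deleted variants."""
    rng = random.Random(seed)
    rate = adaptive_rate(len(tokens))
    votes = Counter(classify(delete_tokens(tokens, rate, rng)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def toy_clf(tokens):
    """Stand-in base classifier: keyword-count sentiment."""
    good = sum(t in {"great", "love"} for t in tokens)
    bad = sum(t == "boring" for t in tokens)
    return "positive" if good >= bad else "negative"

pred = smoothed_predict("great love great movie".split(), toy_clf)
```

The certification step (bounding how many edits can flip the vote) is where the paper's variable-rate extension of randomized smoothing does the real work.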

[46] mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

Arka Mukherjee, Shreya Ghosh

Main category: cs.CL

TL;DR: mmJEE-Eval is a multimodal bilingual benchmark using JEE Advanced exam questions that reveals significant gaps between frontier VLMs (77-84% accuracy) and open-source models (37-45%) on scientific reasoning tasks, exposing limitations in current reasoning capabilities.

Motivation: Existing multimodal reasoning benchmarks fail to distinguish true scientific reasoning from pattern-matching, creating a need for more challenging evaluation that tests genuine problem-solving abilities.

Method: Created mmJEE-Eval benchmark with 1,460 bilingual (English/Hindi) questions from JEE Advanced exams (2019-2025) spanning Physics, Chemistry, and Mathematics, then evaluated 17 state-of-the-art models including frontier VLMs and open-source alternatives.

Result: Frontier VLMs (GPT-5, Gemini 2.5) achieve 77-84% accuracy on 2025 questions, while open-source models plateau at 37-45% despite scaling to 400B parameters. Closed models show high accuracy but collapse on meta-cognitive reasoning tasks (GPT-5 fixes only 5.2% of errors).

Conclusion: mmJEE-Eval effectively segregates superior training and reasoning methodologies, showing current benchmarks underestimate reasoning gaps and highlighting the need for more challenging evaluation standards in multimodal AI.

Abstract: Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India’s JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontier models from Google and OpenAI show high problem-solving accuracies (up to 100% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% of errors). Systematic ablations show mmJEE-Eval’s difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io

[47] Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling

Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: SeerSC is a dynamic self-consistency framework that improves token efficiency and reduces latency in LLM inference by integrating System 1 and System 2 reasoning, achieving up to 47% token reduction and 43% latency reduction.

Motivation: Test-time scaling improves LLM performance but incurs high computational costs and latency. Existing dynamic self-consistency methods are limited by sequential request latency.

Method: Integrates System 1 (fast reasoning) and System 2 (deliberate reasoning). Uses System 1 to compute answer entropy for queries, which evaluates sample scaling potential for dynamic self-consistency under System 2. Enables parallel generation to reduce latency.

Result: Achieves up to 47% reduction in token consumption and 43% reduction in inference latency without significant performance loss. Outperforms existing methods.

Conclusion: SeerSC effectively addresses computational cost and latency issues in test-time scaling through integrated System 1/System 2 reasoning, providing substantial efficiency improvements while maintaining performance.

Abstract: Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we utilize the rapid System 1 to compute the answer entropy for given queries. This score is then used to evaluate the potential of samples for scaling, enabling dynamic self-consistency under System 2. Benefiting from the early and accurate estimation provided by System 1, the proposed method can reduce token usage while simultaneously achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
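
The budget-allocation idea can be sketched as: sample cheap System-1 answers, measure their entropy, and scale the expensive System-2 self-consistency budget with the normalized disagreement. Budget bounds and names are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the answer distribution from fast samples."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def allocate_budget(answers, min_samples=1, max_samples=16):
    """Scale the System-2 sample budget with normalized answer entropy."""
    distinct = len(set(answers))
    h_max = math.log2(distinct) if distinct > 1 else 1.0
    frac = answer_entropy(answers) / h_max
    return max(min_samples, round(frac * max_samples))

easy = ["42", "42", "42", "42"]  # System 1 agrees -> minimal budget
hard = ["42", "17", "9", "42"]   # System 1 disagrees -> larger budget
```

Low-entropy queries get the minimal budget (and could be answered immediately), while ambiguous queries receive the full System-2 budget, which can then be generated in parallel to cut latency.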

[48] Spider4SSC & S2CLite: A text-to-multi-query-language dataset using lightweight ontology-agnostic SPARQL to Cypher parser

Martin Vejvar, Yasutaka Fujimoto

Main category: cs.CL

TL;DR: S2CLite is a rule-based SPARQL-to-Cypher parser that achieves 77.8% parsing accuracy, outperforming state-of-the-art methods, and generates the Spider4SSC dataset with unified text-to-query capabilities.

Motivation: To enable efficient SPARQL to Cypher translation without requiring RDF graphs or external tools, addressing limitations of existing solutions that have high parsing error rates.

Method: Developed a lightweight, ontology-agnostic, purely rule-based parser inspired by traditional programming language compilers that translates SPARQL queries into Cypher queries.

Result: Achieved 77.8% parsing accuracy on Spider4SPARQL (vs. 44.2% for S2CTrans) and 96.6% execution accuracy on the subset of queries parsed by both tools, outperforming S2CTrans by 7.3%. Generated the Spider4SSC dataset with 4525 questions and three equivalent sets of 2581 queries in SQL, SPARQL, and Cypher.

Conclusion: S2CLite provides a robust solution for SPARQL-to-Cypher translation with significantly improved accuracy, and the Spider4SSC dataset enables unified text-to-query research across multiple query languages.

Abstract: We present Spider4SSC dataset and S2CLite parsing tool. S2CLite is a lightweight, ontology-agnostic parser that translates SPARQL queries into Cypher queries, enabling both in-situ and large-scale SPARQL to Cypher translation. Unlike existing solutions, S2CLite is purely rule-based (inspired by traditional programming language compilers) and operates without requiring an RDF graph or external tools. Experiments conducted on the BSBM42 and Spider4SPARQL datasets show that S2CLite significantly reduces query parsing errors, achieving a total parsing accuracy of 77.8% on Spider4SPARQL compared to 44.2% by the state-of-the-art S2CTrans. Furthermore, S2CLite achieved a 96.6% execution accuracy on the intersecting subset of queries parsed by both parsers, outperforming S2CTrans by 7.3%. We further use S2CLite to parse Spider4SPARQL queries to Cypher and generate Spider4SSC, a unified Text-to-Query language (SQL, SPARQL, Cypher) dataset with 4525 unique questions and 3 equivalent sets of 2581 matching queries (SQL, SPARQL and Cypher). We open-source S2CLite for further development on GitHub (github.com/vejvarm/S2CLite) and provide the clean Spider4SSC dataset for download.
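
For flavor, the kind of purely rule-based mapping a compiler-style parser performs can be illustrated on a single basic triple pattern. This toy regex rule is not part of S2CLite; the real grammar lives in the GitHub repository linked above.

```python
import re

def triple_to_cypher(sparql):
    """Toy rule: SELECT ?x WHERE { ?s <rel> ?o . }  ->  MATCH (s)-[:rel]->(o) RETURN x."""
    m = re.search(
        r"SELECT\s+\?(\w+)\s+WHERE\s*\{\s*\?(\w+)\s+<(\w+)>\s+\?(\w+)\s*\.?\s*\}",
        sparql, re.IGNORECASE)
    if not m:
        raise ValueError("pattern not supported by this toy rule")
    proj, subj, rel, obj = m.groups()
    return f"MATCH ({subj})-[:{rel}]->({obj}) RETURN {proj}"

cypher = triple_to_cypher("SELECT ?person WHERE { ?person <worksFor> ?company . }")
```

A full translator must also handle prefixed IRIs, FILTER/OPTIONAL clauses, and aggregation, which is where rule-based systems historically accumulated the parsing errors S2CLite aims to reduce.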

[49] MTQ-Eval: Multilingual Text Quality Evaluation for Language Models

Rhitabrat Pokharel, Ameeta Agrawal

Main category: cs.CL

TL;DR: MTQ-Eval is a multilingual text quality evaluation framework that trains LLMs on automatically generated quality preference data, showing improved performance across 115 languages and downstream tasks.

Motivation: To determine if LLMs can effectively evaluate general text quality in multilingual contexts beyond task-specific evaluations.

Method: Automatically generate text quality preference data, then train open-source base LLMs to align with ratings of high- and low-quality text by adjusting internal representations.

Result: Comprehensive evaluation across 115 languages demonstrates improved performance, with enhanced evaluation capability leading to notable improvements in downstream tasks.

Conclusion: The proposed MTQ-Eval framework successfully extends LLM evaluation capabilities to general text quality assessment in multilingual contexts.

Abstract: The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts, adjusting its internal representations. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.

[50] Self-Correcting Large Language Models: Generation vs. Multiple Choice

Hossein A. Rahmani, Satyapriya Krishna, Xi Wang, Mohammadmehdi Naghiaei, Emine Yilmaz

Main category: cs.CL

TL;DR: LLMs show different self-correction behaviors in open-ended generation vs. multiple-choice tasks, with implications for agentic applications.

Motivation: To systematically investigate how self-correction mechanisms differ between open-ended text generation and multiple-choice selection paradigms in LLMs.

Method: Conducted systematic comparison across various natural language understanding and reasoning tasks using different scale and family language models, analyzing performance trends and error-correction behaviors.

Result: Open-ended generation benefits from flexibility and compositional refinement, while multiple-choice selection leverages clearer boundaries but is limited by provided options. Distinct improvement patterns and failure modes observed.

Conclusion: Self-correction mechanism design should consider task structure and output space interaction, with implications for both knowledge-intensive reasoning and decision-oriented LLM applications.

Abstract: Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes: While open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs.

[51] AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

Ruibo Deng, Duanyu Feng, Wenqiang Lei

Main category: cs.CL

TL;DR: AMaPO addresses the Overfitting-Underfitting Dilemma in offline preference optimization by using adaptive margins to dynamically reallocate learning effort between correctly and incorrectly ranked samples.

Motivation: Current offline preference optimization methods suffer from a fundamental Overfitting-Underfitting Dilemma where models waste gradients on correctly ranked samples while providing insufficient correction for misranked ones, limiting ranking accuracy.

Method: Proposes Adaptive Margin-attached Preference Optimization (AMaPO) with instance-wise adaptive margins refined through Z-normalization and exponential scaling to amplify gradients for misranked samples and suppress them for correct ones.

Result: Extensive experiments show AMaPO achieves better ranking accuracy and superior downstream alignment performance while successfully mitigating overfitting and underfitting issues.

Conclusion: AMaPO provides a principled solution to the Overfitting-Underfitting Dilemma in offline preference optimization, offering improved performance through dynamic gradient reallocation.

Abstract: Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, its effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO achieves better ranking accuracy and superior downstream alignment performance, and targeted analysis confirms that it successfully mitigates the core overfitting and underfitting issues.
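The abstract names the mechanism but not the exact loss. A minimal hypothetical sketch, assuming a DPO-style implicit reward gap per preference pair and the Z-normalization plus exponential scaling described above (the function names and the `beta` scale are illustrative, not the paper's formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_margins(reward_gaps, beta=1.0):
    """Instance-wise adaptive margin (illustrative sketch).

    reward_gaps: implicit reward difference (chosen - rejected) per pair;
    misranked pairs have negative gaps. Z-normalizing the gaps and applying
    a decreasing exponential gives misranked pairs large margins (amplified
    gradients) and well-ranked pairs margins near zero (suppressed gradients).
    """
    gaps = np.asarray(reward_gaps, dtype=float)
    z = (gaps - gaps.mean()) / (gaps.std() + 1e-8)   # Z-normalization
    return np.exp(-beta * z)                          # exponential scaling

def margin_attached_loss(reward_gaps, beta=1.0):
    gaps = np.asarray(reward_gaps, dtype=float)
    m = adaptive_margins(gaps, beta)
    # Logistic preference loss with the per-instance margin attached:
    # misranked pairs (large margin) contribute larger loss and gradient.
    return float(-np.log(sigmoid(gaps - m)).mean())
```

The monotonicity is the point: the most misranked pair receives the largest margin, the most confidently correct pair the smallest.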

[52] Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune

Main category: cs.CL

TL;DR: A strong Multimodal Large Language Model is developed for Basque, a low-resource language, using custom datasets; around 20% Basque multimodal data suffices for solid performance, and a Basque-instructed backbone LLM is not required.

DetailsMotivation: Current MLLMs perform well in high-resource languages but lack comparable performance in low-resource languages like Basque within the open science community.

Method: Created custom training/evaluation datasets, used Llama-3.1-Instruct and Basque-adapted Latxa as backbones, experimented with different data mixtures for training.

Result: Only 20% Basque multimodal data achieves solid results; Basque-instructed backbone LLM is not necessary for strong Basque MLLM performance.

Conclusion: Provides pathway for developing MLLMs for other low-resource languages by openly releasing resources and demonstrating efficient training approaches.

Abstract: Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way for developing MLLMs for other low-resource languages, and we openly release our resources.

[53] CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

Bichen Wang, Yixin Sun, Junzhe Wang, Hao Yang, Xing Fu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin

Main category: cs.CL

TL;DR: CARE-Bench is a dynamic, interactive benchmark for evaluating LLMs’ psychological counseling competence using diverse client profiles from real cases and multidimensional evaluation based on psychological scales.

DetailsMotivation: Address the gap between growing demand for counseling services and limited availability by creating a robust benchmark to assess LLMs' counseling abilities, overcoming limitations of existing unprofessional client simulation and static evaluation formats.

Method: Built diverse client profiles from real-world counseling cases following expert guidelines, created dynamic interactive evaluation format, and established multidimensional performance evaluation using established psychological scales.

Result: Evaluation of general-purpose LLMs and specialized counseling models revealed current limitations, with detailed failure analysis conducted in collaboration with psychologists for different client types.

Conclusion: Provides directions for developing more comprehensive, universal, and effective counseling models by identifying specific failure reasons across different client types.

Abstract: The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model’s comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs’ failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.

[54] GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning

Wolfgang Otto, Lu Gan, Sharmila Upadhyaya, Saurav Karmakar, Stefan Dietze

Main category: cs.CL

TL;DR: GSAP-ERE is a manually curated fine-grained dataset for extracting information from ML publications, containing 63K entities and 35K relations from 100 papers, enabling improved knowledge graph construction and reproducibility assessment.

DetailsMotivation: To extract and connect fine-grained information in ML research (e.g., method training and data usage) to improve understanding and reproducibility of ML-related research at scale.

Method: Created a manually curated dataset with 10 entity types and 18 relation types from full texts of 100 ML publications, then used it to fine-tune models and test LLM prompting strategies.

Result: Fine-tuned models significantly outperformed LLM prompting (NER: 80.6% vs 44.4%, RE: 54.0% vs 10.1%). The dataset enables effective information extraction for KG construction and reproducibility monitoring.

Conclusion: Supervised models using curated datasets like GSAP-ERE substantially outperform LLM prompting for scholarly IE, highlighting the need for such datasets to advance research in this domain.

Abstract: Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications makes it possible to identify information about research concepts and resources at scale and is therefore a pathway to improve understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g., method training and data usage, we introduce GSAP-ERE. It is a manually curated fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE using Large Language Models (LLM). We observe that the performance of state-of-the-art LLM prompting methods is largely outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This performance disparity between supervised models and unsupervised usage of LLMs suggests that datasets like GSAP-ERE are needed to advance research in the domain of scholarly information extraction.

[55] BIG5-TPoT: Predicting BIG Five Personality Traits, Facets, and Items Through Targeted Preselection of Texts

Triet M. Le, Arjun Chandra, C. Anton Rytting, Valerie P. Karuzis, Vladimir Rife, William A. Simpson

Main category: cs.CL

TL;DR: The TPoT method semantically filters texts before feeding them to a deep learning model for Big Five personality prediction, improving accuracy while handling large text volumes.

DetailsMotivation: Predicting personalities from large text volumes is challenging due to input limits in language models and need for better prediction accuracy.

Method: Targeted preselection of texts (TPoT) - semantically filter texts relevant to specific Big Five traits/facets/items before input to BIG5-TPoT deep learning model.

Result: Improved Mean Absolute Error and accuracy metrics for personality prediction on Stream of Consciousness Essays dataset.

Conclusion: TPoT is an effective strategy that addresses input text limits while enhancing personality prediction performance through semantic filtering.

Abstract: Predicting an individual’s personalities from their generated texts is a challenging task, especially when the text volume is large. In this paper, we introduce a straightforward yet effective novel strategy called targeted preselection of texts (TPoT). This method semantically filters the texts as input to a deep learning model, specifically designed to predict a Big Five personality trait, facet, or item, referred to as the BIG5-TPoT model. By selecting texts that are semantically relevant to a particular trait, facet, or item, this strategy not only addresses the issue of input text limits in large language models but also improves the Mean Absolute Error and accuracy metrics in predictions for the Stream of Consciousness Essays dataset.
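The preselection step can be pictured as ranking texts by semantic similarity to a description of the target trait and keeping only the top matches. A minimal sketch, using a toy bag-of-words embedding as a stand-in for the sentence encoder a real system would use (the trait description and `top_k` value are illustrative assumptions):

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; a real pipeline would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def preselect(texts, trait_description, top_k=2):
    """Keep only the texts most semantically relevant to the target trait,
    so the downstream model's input stays within its length limit."""
    anchor = embed(trait_description)
    return sorted(texts, key=lambda t: cosine(embed(t), anchor),
                  reverse=True)[:top_k]
```

Filtering before prediction is what lets the approach scale to text volumes far beyond a model's context limit.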

[56] Readability Measures and Automatic Text Simplification: In the Search of a Construct

Rémi Cardon, A. Seza Doğruöz

Main category: cs.CL

TL;DR: Readability measures don’t correlate well with ATS evaluation metrics or human judgment, highlighting the need for clearer definition of simplification constructs.

DetailsMotivation: To investigate the relationship between readability measures and automatic text simplification evaluation, as previous studies focused on ATS metrics vs human judgment but not readability measures.

Method: Conducted correlation studies between readability measures and human judgment, and between readability measures and ATS evaluation metrics on English texts.

Result: Readability measures generally do not correlate well with automatic metrics and human judgment in ATS.

Conclusion: Since the three assessment angles (readability measures, ATS metrics, human judgment) show low correlations, there’s a need for clearer definition of the construct in automatic text simplification.

Abstract: Readability is a key concept in the current era of abundant written information. To help make texts more readable and information more accessible to everyone, a line of research aims at making texts accessible for their target audience: automatic text simplification (ATS). Lately, there have been studies on the correlations between automatic evaluation metrics in ATS and human judgment. However, the correlations between those two aspects and commonly available readability measures (such as readability formulas or linguistic features) have not been the focus of as much attention. In this work, we investigate the place of readability measures in ATS by complementing the existing studies on evaluation metrics and human judgment, on English. We first discuss the relationship between ATS and research in readability, then we report a study on correlations between readability measures and human judgment, and between readability measures and ATS evaluation metrics. We identify that, in general, readability measures do not correlate well with automatic metrics and human judgment. We argue that as the three different angles from which simplification can be assessed tend to exhibit rather low correlations with one another, there is a need for a clear definition of the construct in ATS.

[57] SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification

Mohamed Elaraby, Jyoti Prakash Maheswari

Main category: cs.CL

TL;DR: SynClaimEval framework evaluates synthetic data utility for long-context claim verification, examining input characteristics, synthesis logic, and explanation quality.

DetailsMotivation: High cost of constructing annotated resources for training and evaluation of LLMs with extended context windows, with synthetic data offering a scalable alternative.

Method: Introduces SynClaimEval framework that examines three dimensions: input characteristics (context length, out-of-domain generalization), synthesis logic (claim complexity, error type variation), and explanation quality (evidence consistency).

Result: Long-context synthesis improves verification in base instruction-tuned models, especially when augmenting human-written datasets. Synthesis enhances explanation quality even when verification scores don’t improve.

Conclusion: Synthetic data strengthens both performance and explainability in long-context claim verification tasks, with particular benefits for explanation quality.

Abstract: Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification – a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.

[58] NaturalTurn: A Method to Segment Speech into Psychologically Meaningful Conversational Turns

Gus Cooney, Andrew Reece

Main category: cs.CL

TL;DR: NaturalTurn is a turn-segmentation algorithm that accurately identifies conversational turns by distinguishing primary speaker turns from listener backchannels and parallel speech, enabling better analysis of conversational dynamics.

DetailsMotivation: Researchers lack scalable methods to segment speech-to-text transcripts into conversational turns, which are fundamental building blocks for studying social interaction dynamics in large conversation datasets.

Method: NaturalTurn algorithm distinguishes speakers’ primary conversational turns from listeners’ secondary utterances (backchannels, brief interjections, parallel speech) to accurately capture conversational exchange dynamics.

Result: NaturalTurn outperforms baseline models by producing turns with realistic durations and gaps, revealing stronger linguistic alignment patterns between speakers, and uncovering hidden relationships between turn-taking and affective outcomes.

Conclusion: NaturalTurn represents a pragmatic advancement in turn models that enables researchers to link turn-taking dynamics with important social interaction outcomes, supporting the central goals of conversation science.

Abstract: Conversation is a subject of increasing interest in the social, cognitive, and computational sciences. Yet as conversational datasets continue to increase in size and complexity, researchers lack scalable methods to segment speech-to-text transcripts into conversational “turns”, the basic building blocks of social interaction. We discuss this challenge and then introduce “NaturalTurn,” a turn-segmentation algorithm designed to accurately capture the dynamics of conversational exchange. NaturalTurn operates by distinguishing speakers’ primary conversational turns from listeners’ secondary utterances, such as backchannels, brief interjections, and other forms of parallel speech that characterize human conversation. Using data from a large conversation corpus, we show that NaturalTurn captures conversational turns more accurately than a baseline model. For example, it produces turns with durations and gaps that match the empirical literature, reveals stronger linguistic alignment patterns between speakers, and uncovers otherwise hidden relationships between turn-taking and affective outcomes. NaturalTurn thus represents a pragmatic development in machine-generated transcript-processing methods, or “turn models”, that will enable researchers to link turn-taking dynamics with important outcomes of social interaction, a central goal of conversation science.
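The core idea, merging a speaker's consecutive utterances into one turn while treating short listener backchannels as secondary speech rather than turn boundaries, can be sketched heuristically. The backchannel list, word threshold, and record format below are illustrative assumptions, not the paper's actual algorithm:

```python
BACKCHANNELS = {"yeah", "mm-hmm", "uh-huh", "right", "ok", "okay", "wow"}

def is_backchannel(utterance, max_words=2):
    """Heuristic: a very short utterance made entirely of backchannel tokens."""
    words = [w.strip(".!?,") for w in utterance["text"].lower().split()]
    return 0 < len(words) <= max_words and all(w in BACKCHANNELS for w in words)

def segment_turns(utterances):
    """Merge same-speaker utterances into turns, skipping listener backchannels
    so they do not split the primary speaker's turn in two."""
    turns = []
    for utt in utterances:
        if is_backchannel(utt):
            continue  # secondary listener utterance: not a new turn
        if turns and turns[-1]["speaker"] == utt["speaker"]:
            turns[-1]["text"] += " " + utt["text"]  # same speaker continues
        else:
            turns.append({"speaker": utt["speaker"], "text": utt["text"]})
    return turns
```

Under this heuristic, a listener's "Mm-hmm." no longer cuts the primary speaker's utterance into two artificially short turns, which is what produces the more realistic turn durations and gaps the abstract reports.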

[59] Evaluating Deep Unlearning in Large Language Models

Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri

Main category: cs.CL

TL;DR: Deep unlearning aims to remove target facts AND prevent their deduction via logical reasoning from retained knowledge in LLMs, going beyond simple fact removal.

DetailsMotivation: Current fact unlearning methods focus only on removing target facts but overlook deductive connections to other knowledge, potentially allowing facts to be reconstructed through reasoning.

Method: Proposes deep unlearning setting with three metrics (Success-DU, Recall, Accuracy) and benchmarks using MQuAKE dataset and newly constructed Eval-DU dataset for multi-step deductions.

Result: Current unlearning methods struggle with deep unlearning - either fail to deeply unlearn or excessively remove unrelated facts, compromising model utility.

Conclusion: Targeted algorithms need to be developed specifically for robust/deep fact unlearning in LLMs to handle deductive reasoning chains.

Abstract: Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on fact unlearning in LLMs has mostly focused on removing a specified target fact robustly, but often overlooks its deductive connections to other knowledge. We propose a new setting for fact unlearning, deep unlearning, where the goal is not only to remove a target fact but also to prevent it from being deduced via logical reasoning from knowledge retained in the LLM. We propose three novel metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the utility of the remaining model. To benchmark this setting, we leverage both (1) an existing real-world knowledge dataset, MQuAKE, which provides one-step deduction instances, and (2) Eval-DU, a newly constructed semi-synthetic dataset that allows multiple steps of realistic deduction among synthetic facts. Experiments reveal that current methods struggle with deep unlearning: they either fail to deeply unlearn, or excessively remove unrelated facts. Our results suggest that targeted algorithms may have to be developed for robust, deep fact unlearning in LLMs.
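The deduction criterion can be illustrated with a toy propositional forward-chaining check: a fact counts as deeply unlearned only if the retained facts plus known rules cannot re-derive it. The facts, rules, and fixpoint loop below are a hypothetical illustration of the setting, not the paper's metrics:

```python
def forward_chain(facts, rules, max_steps=10):
    """Propositional forward chaining: apply Horn-style rules
    (premises -> conclusion) until a fixpoint or the step limit."""
    known = set(facts)
    for _ in range(max_steps):
        derived = {conclusion for premises, conclusion in rules
                   if set(premises) <= known and conclusion not in known}
        if not derived:
            break
        known |= derived
    return known

def is_deeply_unlearned(target_fact, retained_facts, rules):
    """Deep unlearning requires that the target is not re-derivable
    from the knowledge the model still holds."""
    return target_fact not in forward_chain(retained_facts, rules)
```

This makes the failure mode concrete: deleting only the target fact while keeping its logical premises leaves the fact recoverable by reasoning.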

[60] Large Language Model Benchmarks in Medical Tasks

Lawrence K. Q. Yan, Qian Niu, Ming Li, Yichao Zhang, Caitlyn Heqi Yin, Cheng Fei, Benji Peng, Ziqian Bi, Pohsun Feng, Keyu Chen, Tianyang Wang, Yunze Wang, Silin Chen, Ming Liu, Junyu Liu, Xinyuan Song, Riyang Bao, Zekun Jiang, Ziyuan Qin

Main category: cs.CL

TL;DR: Survey of benchmark datasets for medical LLMs across text, image, and multimodal domains, covering EHRs, dialogues, QA, and image captioning.

DetailsMotivation: Evaluate LLM performance in medicine using diverse benchmark datasets as these models become increasingly applied in clinical settings.

Method: Comprehensive categorization of medical datasets by modality (text, image, multimodal), analyzing their structure, significance, and impact on clinical LLM development.

Result: Identified key benchmarks including MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert that advance medical report generation, clinical summarization, and synthetic data tasks.

Conclusion: Highlights need for more diverse language datasets, structured omics data, and innovative synthesis approaches to advance multimodal medical intelligence.

Abstract: With the increasing application of large language models (LLMs) in the medical domain, evaluating these models’ performance using benchmark datasets has become crucial. This paper presents a comprehensive survey of various benchmark datasets employed in medical LLM tasks. These datasets span multiple modalities including text, image, and multimodal benchmarks, focusing on different aspects of medical knowledge such as electronic health records (EHRs), doctor-patient dialogues, medical question-answering, and medical image captioning. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advancements in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks for advancing multimodal medical intelligence, emphasizing the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research in the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.

[61] Training and Evaluating Language Models with Template-based Data Generation

Yifan Zhang

Main category: cs.CL

TL;DR: Template-based Data Generation (TDG) uses GPT-4 to create parameterized meta-templates that generate scalable, high-quality math problems with verifiable solutions, addressing data scarcity for reasoning tasks.

DetailsMotivation: LLMs struggle with complex multi-step reasoning, especially in math, due to lack of large-scale, high-quality domain-specific datasets needed to develop sophisticated reasoning abilities.

Method: TDG paradigm uses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates that synthesize infinite streams of high-quality problems and solutions, creating TemplateGSM dataset with 7M+ grade school math problems.

Result: Created TemplateMath Part I: TemplateGSM with over 7 million synthetically generated math problems, each with programmatically verifiable solutions, providing unprecedented quality at scale for supervised fine-tuning and RLVR alignment.

Conclusion: TDG and TemplateGSM provide scalable solution to data and verification bottleneck, enabling new generation of LLMs with powerful, reliable reasoning skills through automated template-based data generation.

Abstract: The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine-tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by leveraging GPT-4 to generate meta-templates, ensuring diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills. Project Page: https://github.com/iiis-ai/TemplateMath
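A meta-template in this paradigm can be pictured as a parameterized generator that emits a problem statement together with code computing the ground-truth answer, so every sampled problem is programmatically verifiable. The template below (names, ranges, and the purchasing scenario) is a hypothetical illustration, not one of the paper's actual templates:

```python
import random

def notebooks_template(rng):
    """One parameterized meta-template: each call instantiates a fresh
    grade-school problem whose answer is computed, not free-form text."""
    name = rng.choice(["Ava", "Ben", "Caleb"])
    items = rng.randint(3, 12)   # template parameters sampled per instance
    price = rng.randint(2, 9)
    question = (f"{name} buys {items} notebooks at ${price} each. "
                f"How much does {name} spend in total?")
    answer = items * price       # the programmatically verifiable solution
    return question, answer

def generate(template, n, seed=0):
    """Synthesize n problem/answer pairs from one meta-template."""
    rng = random.Random(seed)
    return [template(rng) for _ in range(n)]
```

Because the answer is computed by the template itself, the same machinery supports both supervised fine-tuning data and the verifiable reward signal needed for RLVR.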

[62] OpenGenAlign: A Preference Dataset and Benchmark for Trustworthy Reward Modeling in Open-Ended, Long-Context Generation

Hanning Zhang, Juntong Song, Juno Zhu, Yuanhao Wu, Tong Zhang, Cheng Niu

Main category: cs.CL

TL;DR: OpenGenAlign is a framework and dataset for developing reward models to evaluate and improve open-ended long-context generation in LLMs, addressing hallucination, comprehensiveness, reliability, and efficiency.

DetailsMotivation: Reward modeling is crucial for LLM improvement but its capability for open-ended long-context generation is rarely explored, creating a gap in evaluating and enhancing generation quality in such scenarios.

Method: Developed the OpenGenAlign framework with an automated pipeline to evaluate LLM outputs across long-context QA, Data-to-Text, and Summarization, creating 33K high-quality preference pairs with an 81% human agreement rate.

Result: Existing reward models perform suboptimally on the benchmark, while the trained reward model achieves superior performance and effectively improves policy model generation quality using RL. The framework also enables effective guided generation and integrates well with reward data from other domains.

Conclusion: OpenGenAlign successfully addresses the gap in evaluating open-ended long-context generation and demonstrates superior performance in improving generation quality through reward modeling and RL integration.

Abstract: Reward Modeling is critical in evaluating and improving the generation of Large Language Models (LLMs). While numerous recent works have shown its feasibility in improving safety, helpfulness, reasoning, and instruction-following ability, its capability and generalization to open-ended long-context generation is still rarely explored. In this paper, we introduce OpenGenAlign, a framework and a high-quality dataset designed to develop reward models that evaluate and improve hallucination-free, comprehensive, reliable, and efficient open-ended long-context generation. We define four key metrics to assess generation quality and develop an automated pipeline to evaluate the outputs of multiple LLMs across long-context QA, Data-to-Text, and Summarization scenarios using o3, yielding 33K high-quality preference pairs with a human agreement rate of 81%. Experimental results first demonstrate that existing reward models perform suboptimally on the held-out benchmark, while our trained reward model achieves superior performance and effectively improves the generation quality of the policy models using Reinforcement Learning (RL). Additionally, OpenGenAlign can be used for effective guided generation on existing datasets. Furthermore, we demonstrate that OpenGenAlign can be integrated with reward data from other domains to achieve better performance.

[63] How Linguistics Learned to Stop Worrying and Love the Language Models

Richard Futrell, Kyle Mahowald

Main category: cs.CL

TL;DR: Language models can contribute to linguistics but don’t replace linguistic theory; they serve as model systems for usage-based approaches.

DetailsMotivation: To address the debate between extreme views that either dismiss language models' relevance to linguistics or claim they make linguistic theory obsolete.

Method: Argumentative analysis of the relationship between language models and linguistics, examining their potential contributions and limitations.

Result: Language models can inform fundamental questions about linguistic structure, processing, and learning, while forcing reconsideration of foundational linguistic arguments.

Conclusion: An optimistic view that language models serve as valuable model systems for gradient, usage-based approaches to language without replacing linguistic structure and theory.

Abstract: Language models can produce fluent, grammatical text. Nonetheless, some maintain that language models don’t really learn language and also that, even if they did, that would not be informative for the study of human learning and processing. On the other side, there have been claims that the success of LMs obviates the need for studying linguistic theory and structure. We argue that both extremes are wrong. LMs can contribute to fundamental questions about linguistic structure, language processing, and learning. They force us to rethink arguments and ways of thinking that have been foundational in linguistics. While they do not replace linguistic structure and theory, they serve as model systems and working proofs of concept for gradient, usage-based approaches to language. We offer an optimistic take on the relationship between language models and linguistics.

[64] FedP$^2$EFT: Federated Learning to Personalize PEFT for Multilingual LLMs

Royson Lee, Minyoung Kim, Fady Rezk, Rui Li, Stylianos I. Venieris, Timothy Hospedales

Main category: cs.CL

TL;DR: FedP^2EFT is a federated learning method that automatically learns optimal personalized PEFT structures for multilingual LLMs using Bayesian sparse rank selection, outperforming existing methods.

DetailsMotivation: To improve client-specific performance in federated learning for multilingual LLMs by automating the personalization strategy for PEFT modules, avoiding manual configuration and overfitting issues.

Method: Uses federated learning-to-personalize with Bayesian sparse rank selection to collaboratively learn optimal personalized PEFT structures (LoRA layers and ranks) for each client in cross-device FL settings.

Result: Demonstrates significant performance improvements over existing personalized fine-tuning methods on both simulated and real-world multilingual FL benchmarks.

Conclusion: FedP^2EFT effectively automates PEFT structure personalization for multilingual LLMs in FL, complementing existing FL methods while avoiding overfitting in low-data regimes.

Abstract: Federated learning (FL) has enabled the training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP$^2$EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP$^2$EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP$^2$EFT largely outperforms existing personalized fine-tuning methods, while complementing other existing FL methods.

[65] Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment

Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver

Main category: cs.CL

TL;DR: This paper explores using small, open-source LLMs for argument mining in student essays through few-shot prompting and fine-tuning, achieving better performance than baselines in segmentation and classification tasks.

Motivation: To provide accessible, privacy-preserving, and computationally efficient tools for educators to give targeted feedback on students' argumentation skills using small LLMs that can be deployed locally.

Method: Leveraged open-source small LLMs with few-shot prompting and fine-tuning for three tasks: argument segmentation, argument type classification, and argument quality assessment on the Feedback Prize dataset.

Result: Fine-tuned small LLMs outperformed baseline methods in argument segmentation and type classification, while few-shot prompting achieved comparable performance to baselines in quality assessment.

Conclusion: Small open-source LLMs have significant educational potential for providing real-time, personalized feedback on argumentation skills while ensuring low computational cost and privacy protection.

Abstract: Argument mining algorithms analyze the argumentative structure of essays, making them a valuable tool for enhancing education by providing targeted feedback on the students’ argumentation skills. While current methods often use encoder or encoder-decoder deep learning architectures, decoder-only models remain largely unexplored, offering a promising research direction. This paper proposes leveraging open-source, small Large Language Models (LLMs) for argument mining through few-shot prompting and fine-tuning. These models’ small size and open-source nature ensure accessibility, privacy, and computational efficiency, enabling schools and educators to adopt and deploy them locally. Specifically, we perform three tasks: segmentation of student essays into arguments, classification of the arguments by type, and assessment of their quality. We empirically evaluate the models on the Feedback Prize - Predicting Effective Arguments dataset of grade 6-12 students essays and demonstrate how fine-tuned small LLMs outperform baseline methods in segmenting the essays and determining the argument types while few-shot prompting yields comparable performance to that of the baselines in assessing quality. This work highlights the educational potential of small, open-source LLMs to provide real-time, personalized feedback, enhancing independent learning and writing skills while ensuring low computational cost and privacy.

[66] Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs

Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, Anurag Beniwal

Main category: cs.CL

TL;DR: Semantic Volume is a novel mathematical measure that quantifies both external and internal uncertainty in LLMs by perturbing queries/responses, embedding them in semantic space, and computing the determinant of their Gram matrix to capture dispersion.

Motivation: LLMs suffer from hallucinations due to both internal uncertainty (missing/conflicting knowledge) and external uncertainty (ambiguous user queries), but existing methods mainly focus on internal uncertainty.

Method: Perturb queries and responses, embed them in semantic space, and compute the Gram matrix determinant of the embedding vectors to measure dispersion as uncertainty, providing unsupervised detection without internal LLM access.

Result: Extensive experiments show Semantic Volume consistently outperforms existing baselines in both external and internal uncertainty detection tasks.

Conclusion: Semantic Volume is a robust, interpretable approach that improves LLM reliability by systematically detecting uncertainty in both user queries and model responses, with theoretical links to differential entropy.

Abstract: Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. However, they are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. Existing methods for hallucination detection primarily focus on quantifying internal uncertainty, which arises from missing or conflicting knowledge within the model. However, hallucinations can also stem from external uncertainty, where ambiguous user queries lead to multiple possible interpretations. In this work, we introduce Semantic Volume, a novel mathematical measure for quantifying both external and internal uncertainty in LLMs. Our approach perturbs queries and responses, embeds them in a semantic space, and computes the Gram matrix determinant of the embedding vectors, capturing their dispersion as a measure of uncertainty. Our framework provides a generalizable and unsupervised uncertainty detection method without requiring internal access to LLMs. We conduct extensive experiments on both external and internal uncertainty detections, demonstrating that our Semantic Volume method consistently outperforms existing baselines in both tasks. Additionally, we provide theoretical insights linking our measure to differential entropy, unifying and extending previous sampling-based uncertainty measures such as the semantic entropy. Semantic Volume is shown to be a robust and interpretable approach to improving the reliability of LLMs by systematically detecting uncertainty in both user queries and model responses.
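The dispersion score at the heart of the method can be sketched in a few lines. The unit-normalization and regularization constant below are illustrative choices, not specified by the paper:

```python
import numpy as np

def semantic_volume(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """Dispersion of a set of embedding vectors, measured as the
    log-determinant of their Gram matrix: the more the vectors spread
    out in semantic space, the larger the (log-)volume they span."""
    # Unit-normalize rows so the score reflects direction, not magnitude.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = X @ X.T
    # Regularize so near-collinear vectors still give a finite log-determinant.
    _, logdet = np.linalg.slogdet(gram + eps * np.eye(gram.shape[0]))
    return float(logdet)
```

In use, one would sample several perturbed responses, embed them with any sentence encoder, and pass the stacked vectors to this function; a higher score indicates greater semantic dispersion and hence higher uncertainty.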

[67] MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization

Jian Zhang, Zhangqi Wang, Haiping Zhu, Kangda Cheng, Kai He, Bo Li, Qika Lin, Jun Liu, Erik Cambria

Main category: cs.CL

TL;DR: MARS is a multi-agent framework for automated prompt optimization that uses Socratic dialogue and POMDP modeling to adaptively refine prompts, outperforming existing methods.

Motivation: Existing automated prompt optimization methods suffer from rigid template structures and inefficient exploration in the prompt space, limiting their effectiveness.

Method: Proposed MARS framework with five agents: Planner generates optimization trajectories, Teacher-Critic-Student triad conducts Socratic dialogue for iterative prompt refinement, and Target agent executes prompts for performance feedback, all modeled as a POMDP.

Result: Extensive experiments show MARS outperforms existing APO methods in optimization performance, search efficiency, and interpretability across multiple datasets.

Conclusion: MARS provides an effective and interpretable approach to automated prompt optimization by integrating reasoning, feedback, and state transition into a unified hidden-state evolution process.

Abstract: Large language models (LLMs) typically operate in a question-answering paradigm, where the quality of the input prompt critically affects the response. Automated Prompt Optimization (APO) aims to overcome the cognitive biases of manually crafted prompts and explore a broader prompt design space. However, existing APO methods often suffer from rigid template structures and inefficient exploration in the prompt space. To this end, we propose a Multi-Agent Adaptive Reasoning with Socratic guidance framework (MARS) for APO. MARS consists of five complementary agents and formulates the optimization process as a Partially Observable Markov Decision Process (POMDP), enabling adaptive prompt refinement through explicit state modeling and interactive feedback. Specifically, a Planner agent generates flexible optimization trajectories, a Teacher-Critic-Student triad engages in Socratic-style dialogue to iteratively optimize the prompt based on pseudo-gradient signals in the text space, and a Target agent executes the prompt in downstream tasks to provide performance feedback. MARS integrates reasoning, feedback, and state transition into a unified hidden-state evolution process, improving both the effectiveness and interpretability of optimization. Extensive experiments on multiple datasets demonstrate that MARS outperforms existing APO methods in terms of optimization performance, search efficiency, and interpretability.

[68] Exploiting individual differences to bootstrap communication

Richard A. Blythe, Casimir Fisch

Main category: cs.CL

TL;DR: Communication systems can emerge from non-communicative behaviors through individual differences and shared intentionality, without requiring pre-existing feedback mechanisms.

Motivation: To explain how communication can be bootstrapped from non-communicative behaviors, overcoming the circular problem that feedback requires existing communication to determine success.

Method: A model showing communication emergence in large populations through individual behavioral differences, predictability in situations, and alignment of psychological states via shared intentionality.

Result: The model demonstrates that an unbounded communication system can emerge without pre-existing means to determine communicative success, relying on social cognition capabilities.

Conclusion: Large flexible communication systems like language may derive from general social cognition capacities rather than requiring specialized communication-specific mechanisms.

Abstract: Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.

[69] Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Dylan Bouchard, Mohit Singh Chauhan

Main category: cs.CL

TL;DR: Proposes a versatile framework for detecting hallucinations in LLMs using uncertainty quantification techniques and a tunable ensemble approach, with implementation available in the UQLM toolkit.

Motivation: Hallucinations are a persistent problem in LLMs, especially critical in high-stakes domains like healthcare and finance, requiring effective detection methods.

Method: Adapts various uncertainty quantification techniques (black-box UQ, white-box UQ, LLM-as-a-Judge) into standardized confidence scores and proposes a tunable ensemble approach that combines these scores.

Result: The tunable ensemble typically surpasses individual components and outperforms existing hallucination detection methods in experiments using LLM question-answering benchmarks.

Conclusion: Customized hallucination detection strategies improve the accuracy and reliability of LLMs, with the framework being practical for real-world applications.

Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for closed-book hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper’s companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
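The ensemble idea reduces to a convex combination of standardized confidence scores. A minimal fixed-weight sketch (the scorer names and weights are hypothetical; UQLM's actual ensemble tunes the weights per use case):

```python
def ensemble_confidence(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Convex combination of per-scorer confidence scores (each in [0, 1]).
    Weights are normalized so the ensemble score also lies in [0, 1]."""
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        raise ValueError("at least one scorer must have positive weight")
    return sum(weights.get(name, 0.0) * s for name, s in scores.items()) / total

# Example: combine a black-box consistency score, a white-box token-probability
# score, and an LLM-judge score (names and weights are illustrative).
conf = ensemble_confidence(
    {"black_box": 0.8, "white_box": 0.6, "judge": 0.9},
    {"black_box": 0.5, "white_box": 0.25, "judge": 0.25},
)
```

Tuning would then amount to optimizing the weight vector against labeled hallucination data for the target use case.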

[70] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Main category: cs.CL

TL;DR: CANOE framework reduces LLM hallucinations by synthesizing short-form QA data and using Dual-GRPO reinforcement learning with rule-based rewards, improving faithfulness across 11 tasks without human annotations.

Motivation: Teaching LLMs to be faithful to the provided context is crucial for reliable information-seeking systems, as hallucinations undermine trust and accuracy.

Method: Synthesize short-form QA data with four diverse tasks, then use Dual-GRPO reinforcement learning with three tailored rule-based rewards to optimize both short-form and long-form response generation without human preference data.

Result: CANOE greatly improves faithfulness across 11 different tasks, outperforming advanced LLMs like GPT-4o and OpenAI o1.

Conclusion: The framework successfully reduces LLM hallucinations systematically without human annotations, demonstrating effectiveness across diverse downstream tasks.

Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to reduce faithfulness hallucinations of LLMs across different downstream tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.

[71] IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, AiTi Aw

Main category: cs.CL

TL;DR: IFEval-Audio is a new evaluation dataset for assessing instruction-following capabilities in audio-based large language models, containing 280 audio-instruction-answer triples across six dimensions.

Motivation: Instruction-following ability often deteriorates in multimodal models after alignment with non-text modalities like audio, and this capability remains largely unexplored in audio-based LLMs.

Method: Created IFEval-Audio dataset with 280 audio-instruction-answer triples across six dimensions (Content, Capitalization, Symbol, List Structure, Length, Format), where each example pairs audio input with text instruction requiring structured output.

Result: Benchmarked state-of-the-art audio LLMs on their audio-involved instruction-following capabilities using the IFEval-Audio dataset.

Conclusion: The dataset is publicly released to support future research in evaluating instruction-following in audio-based large language models.

Abstract: Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.

[72] LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High

Judith Sieker, Clara Lachenmaier, Sina Zarrieß

Main category: cs.CL

TL;DR: LLMs struggle to detect false presuppositions in political contexts, with performance varying by linguistic construction, political party, and scenario probability.

Motivation: To investigate whether LLMs, like humans, fail to detect and correct misleading assumptions introduced as false presuppositions, especially in high-stakes political misinformation contexts.

Method: Used linguistic presupposition analysis with a newly created dataset, testing three LLMs (GPT-4-o, LLama-3-8B, Mistral-7B-v03) under different conditions including linguistic construction, political party, and scenario probability.

Result: Models showed poor performance in recognizing false presuppositions, with varying sensitivity across different conditions.

Conclusion: Linguistic presupposition analysis is valuable for uncovering how LLMs reinforce political misinformation through their failure to detect false presuppositions.

Abstract: This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less sensitive to adopt or reject false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI’s GPT-4-o, Meta’s LLama-3-8B, and MistralAI’s Mistral-7B-v03. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.

[73] Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

Jinwen Chen, Hainan Zhang, Fei Sun, Qinnan Zhang, Sijia Wen, Ziwei Wang, Zhiming Zheng

Main category: cs.CL

TL;DR: RFTC detects poisoned examples in LLM fine-tuning by using reference model comparison and TF-IDF clustering, outperforming existing methods in accuracy and downstream performance.

Motivation: Existing detectors for data poisoning in LLM fine-tuning are either unsuitable for generation tasks or degrade quality through rewriting, creating a need for efficient pre-fine-tuning detection.

Method: RFTC uses Reference-Filtration to flag suspicious examples by comparing responses with a reference model, then applies TF-IDF clustering on suspicious sets to identify poisoned examples based on intra-class distance.

Result: On two machine translation datasets and one QA dataset, RFTC outperforms prior detectors in both detection accuracy and downstream performance of fine-tuned models.

Conclusion: RFTC provides an effective and robust solution for detecting poisoned examples before fine-tuning, leveraging the compact clustering pattern of poisoned responses in the TF-IDF space.

Abstract: Stealthy data poisoning during fine-tuning can backdoor large language models (LLMs), threatening downstream safety. Existing detectors either use classifier-style probability signals–ill-suited to generation–or rely on rewriting, which can degrade quality and even introduce new triggers. We address the practical need to efficiently remove poisoned examples before or during fine-tuning. We observe a robust signal in the response space: after applying TF-IDF to model responses, poisoned examples form compact clusters (driven by consistent malicious outputs), while clean examples remain dispersed. We leverage this with RFTC–Reference-Filtration + TF-IDF Clustering. RFTC first compares each example’s response with that of a reference model and flags those with large deviations as suspicious; it then performs TF-IDF clustering on the suspicious set and identifies true poisoned examples using intra-class distance. On two machine translation datasets and one QA dataset, RFTC outperforms prior detectors in both detection accuracy and the downstream performance of the fine-tuned models. Ablations with different reference models further validate the effectiveness and robustness of Reference-Filtration.
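The clustering signal RFTC exploits can be illustrated with a small stdlib sketch: near-duplicate (poisoned-style) responses have a much smaller mean intra-class TF-IDF distance than diverse clean ones. The whitespace tokenization and smoothed IDF below are illustrative choices, and this covers only the second stage's signal, not the reference-model filtration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_intra_distance(docs):
    """Average pairwise cosine distance within a group of responses.
    Poisoned responses, being near-duplicates, score close to 0."""
    vecs = tfidf_vectors(docs)
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(1.0 - cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
```

A low intra-class distance within a suspicious cluster is the cue that its members share a consistent (malicious) output pattern.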

[74] anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding

Haitao Li, Ziyu Li, Yiheng Mao, Ziyi Liu, Zhoujian Sun, Zhengxing Huang

Main category: cs.CL

TL;DR: Developed anyECG-chat, a multimodal LLM for ECG analysis that supports diverse tasks including report generation, abnormal waveform localization, and open-ended QA, handling dynamic-length and multiple ECG inputs for both hospital and home environments.

Motivation: Existing ECG-focused MLLMs are limited to single 12-lead, short-duration ECG inputs and primarily focus on report generation, underutilizing MLLM potential for broader ECG analysis tasks.

Method: Created anyECG dataset with diverse tasks and ECG types, then trained anyECG-chat model using three-stage curriculum training to support dynamic-length and multiple ECG inputs.

Result: anyECG-chat successfully supports various practical scenarios including report generation, abnormal waveform localization for long-duration reduced-lead ECGs, and comprehensive multiple ECG comparison analysis.

Conclusion: The proposed anyECG-chat model effectively addresses limitations of existing ECG MLLMs by supporting flexible ECG inputs and diverse analysis tasks, demonstrating practical utility across hospital and home environments.

Abstract: The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs primarily focus on report generation tasks, often limited to single 12-lead, short-duration (10s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop a MLLM for ECG analysis that supports a broader range of tasks and more flexible ECG inputs. However, existing ECG-QA datasets are often monotonous. To address this gap, we first constructed the anyECG dataset, which encompasses a wide variety of tasks, including report generation, abnormal waveform localization, and open-ended question answering. In addition to standard hospital ECGs, we introduced long-duration reduced-lead ECGs for home environments and multiple ECG comparison scenarios commonly encountered in clinical practice. Furthermore, we propose the anyECG-chat model, which supports dynamic-length ECG inputs and multiple ECG inputs. We trained the model using a three-stage curriculum training recipe with the anyECG dataset. A comprehensive evaluation was conducted, demonstrating that anyECG-chat is capable of supporting various practical application scenarios, including not only common report generation tasks but also abnormal waveform localization for long-duration reduced-lead ECGs in home environments and comprehensive comparative analysis of multiple ECGs. Our code and data are available at: https://github.com/CuCl-2/anyECG-chat.

[75] ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Minda Hu, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

Main category: cs.CL

TL;DR: LLMs struggle with unsolvable math problems, fabricating unreliable responses. The study creates a ReliableMath dataset and finds that while larger LLMs can improve with reliable prompts, smaller LLMs need alignment strategies to enhance reliability.

Motivation: To investigate LLM reliability in mathematical reasoning tasks, particularly their tendency to fabricate responses for unsolvable problems, which undermines trust in their outputs.

Method: Developed a ReliableMath dataset with solvable and unsolvable problems using a construction workflow with human evaluations. Tested various LLMs with reliable prompts and proposed an alignment strategy for small LLMs.

Result: LLMs fail to identify unsolvable problems and generate fabricated responses. Larger LLMs improve reliability with prompts but still lag on unsolvable problems. Small LLMs show little progress without alignment strategies.

Conclusion: LLM reliability in math reasoning is limited, especially for unsolvable problems. Alignment strategies can significantly improve small LLMs’ reliability on both in-domain and out-of-domain tasks.

Abstract: Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs’ reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks.

[76] IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata

Main category: cs.CL

TL;DR: Introduces IndoPref, the first fully human-authored Indonesian preference dataset with 522 prompts and 4,099 human-annotated pairwise preferences across 5 LLMs, addressing the underrepresentation of Indonesian in LLM research.

Motivation: Indonesian is spoken by over 200 million people but remains underrepresented in LLM preference research, with existing multilingual datasets often being English translations that lack cultural and linguistic authenticity.

Method: Created a fully human-authored Indonesian preference dataset with native Indonesian annotations, spanning 10 diverse categories and comparing five instruction-tuned LLMs using pairwise preference comparisons with strong inter-annotator agreement.

Result: Produced IndoPref dataset with 522 prompts and 4,099 human-annotated pairwise preferences, achieving strong inter-annotator agreement measured by Krippendorff’s alpha across multiple LLM comparisons.

Conclusion: IndoPref enables practitioners to identify fine-grained strengths and weaknesses of LLMs for Indonesian language, providing the first culturally authentic benchmark for evaluating Indonesian text generation quality.

Abstract: Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset designed to evaluate the naturalness and quality of LLM-generated text. The dataset contains 522 prompts and yields 4,099 human-annotated pairwise preferences from comparisons across five instruction-tuned LLMs. All annotations are natively written in Indonesian with strong inter-annotator agreement, measured by Krippendorff’s alpha. Our benchmark spans 10 diverse categories, enabling practitioners to identify LLMs’ fine-grained strengths and weaknesses.

[77] Unveiling Super Experts in Mixture-of-Experts Large Language Models

Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, Kehong Yuan

Main category: cs.CL

TL;DR: Discovery of Super Experts (SEs) - a small but critical subset of experts in MoE LLMs that cause extreme activation outliers and are essential for model performance, especially in mathematical reasoning.

Motivation: To understand the internal dynamics of Mixture of Experts (MoE) LLMs and identify why certain experts have disproportionate impact on model performance despite their limited numbers.

Method: Systematic investigation through pruning experiments, activation analysis, and studying the distribution and characteristics of Super Experts across various open-source MoE LLMs.

Result: SEs are characterized by rare but extreme activation outliers in down_proj outputs, causing massive activations between decoder layers. Pruning just a few SEs (e.g., 3 out of 6,144) causes significant performance degradation and repetitive outputs. SEs are the primary source of systematic outlier mechanisms in Transformers.

Conclusion: Super Experts play a pivotal role in MoE LLMs’ forward inference, particularly in maintaining attention sinks and mathematical reasoning capabilities. Their compression disrupts the systematic outlier mechanism, leading to model collapse.

Abstract: In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in MoE LLMs’ forward inference. These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs: (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further investigate why compressing SEs exerts such a pronounced impact. We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this process, ultimately causing the collapse of attention sinks. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in the current knowledge. The code is provided at https://github.com/ZunhaiSu/Super-Experts-Profilling.
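The pruning probe the abstract describes can be illustrated on a toy MoE layer: rank experts by the magnitude of their (down_proj-style) output activation, zero out the top-ranked one, and observe how much the layer output shifts. This is an illustrative sketch with made-up numbers, not the paper's procedure:

```python
def moe_forward(x, expert_weights, gates, pruned=frozenset()):
    """Toy MoE layer: each expert scales the input; pruned experts emit 0."""
    return sum(g * (0.0 if i in pruned else w * x)
               for i, (w, g) in enumerate(zip(expert_weights, gates)))

# Expert 2 produces a rare but extreme activation (an outlier "super expert")
weights = [0.5, 0.7, 40.0, 0.6]
gates = [0.3, 0.3, 0.1, 0.3]
x = 1.0

baseline = moe_forward(x, weights, gates)
# Rank experts by output magnitude and prune the single largest
super_expert = max(range(len(weights)), key=lambda i: abs(weights[i] * x))
pruned_out = moe_forward(x, weights, gates, pruned={super_expert})

print(super_expert)               # 2: the outlier expert is found
print(abs(baseline - pruned_out)) # ~4.0: one expert dominates the output
```

Despite a small gate weight (0.1), the outlier expert accounts for most of the layer output, mirroring the paper's observation that a handful of experts are disproportionately important.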

[78] Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, David Wu, David JH Wu, Abass Conteh, Sarita Khemani, Saloni Kumar Maharaj, Vishnu Ravi, Arth Pahwa, Yingjie Weng, Leah Rosengaus, Lena Giang, Kelvin Zhenghao Li, Olivia Jee, Daniel Shirvani, Ethan Goh, Jonathan H. Chen

Main category: cs.CL

TL;DR: LLMs can generate comprehensive clinical consultation templates but produce excessively long content and fail to prioritize key clinical questions under length constraints, with performance varying by medical specialty.

DetailsMotivation: To evaluate LLMs' capacity to generate structured clinical consultation templates for electronic consultations, addressing the need for efficient physician communication in healthcare settings.

Method: Used 145 expert-crafted templates from Stanford’s eConsult team to assess frontier models through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis.

Result: Models like o3 achieved high comprehensiveness (up to 92.2%) but generated excessively long templates and failed to correctly prioritize clinically important questions under length constraints, with significant performance degradation in narrative-driven specialties.

Conclusion: LLMs can enhance structured clinical information exchange between physicians but require more robust evaluation methods that capture prioritization ability within real-world physician time constraints.

Abstract: This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford’s eConsult team, we assess frontier models – including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro – for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model’s ability to prioritize clinically salient information within the time constraints of real-world physician communication.

[79] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models

Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, Di Wang

Main category: cs.CL

TL;DR: LLMs exhibit sycophantic behavior by agreeing with user opinions even when they contradict facts, and this emerges through a two-stage process involving late-layer output preference shifts and deeper representational divergence.

DetailsMotivation: To understand the internal mechanisms behind sycophantic behavior in LLMs, as prior work documented the tendency but didn't explain how it arises mechanistically.

Method: Systematically studied how user opinions induce sycophancy across model families, using logit-lens analysis and causal activation patching to identify the emergence process.

Result: Simple opinion statements reliably induce sycophancy, user expertise framing has negligible impact, and first-person prompts create stronger representational perturbations than third-person framings.

Conclusion: Sycophancy is not a surface-level artifact but emerges from structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.

Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (“I believe...”) consistently induce higher sycophancy rates than third-person framings (“They believe...”) by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
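The logit-lens technique mentioned above projects each layer's intermediate hidden state through the model's final unembedding matrix, reading off a "current best guess" token distribution per layer. A tiny numerical sketch with a made-up 2-d hidden space and 3-token vocabulary (not the paper's models or data):

```python
import math

def logit_lens(hidden_state, unembedding):
    """Project an intermediate hidden state through the unembedding matrix
    and softmax the result, giving a per-layer token distribution."""
    logits = [sum(h * w for h, w in zip(hidden_state, row))
              for row in unembedding]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Unembedding rows for a 3-token vocab over a 2-d hidden space (toy values)
W_U = [[1.0, 0.0],   # token 0: the factual answer
       [0.0, 1.0],   # token 1: the user-flattering answer
       [0.5, 0.5]]   # token 2: other

# Hidden states per layer: early layers favor the factual token, then a
# late-layer shift moves mass to the sycophantic one (the pattern reported)
layers = [[2.0, 0.1], [1.5, 0.8], [0.2, 2.5]]
for i, h in enumerate(layers):
    probs = logit_lens(h, W_U)
    print(f"layer {i}: argmax token = {probs.index(max(probs))}")
```

Running this prints argmax token 0 for the first two layers and token 1 for the last, the kind of late-layer preference flip the analysis looks for.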

[80] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu

Main category: cs.CL

TL;DR: ComoRAG is a dynamic, iterative RAG framework that mimics human cognitive reasoning by using iterative reasoning cycles with a dynamic memory workspace to handle complex narrative comprehension in long stories.

DetailsMotivation: Traditional RAG methods are insufficient for long narrative comprehension due to their stateless, single-step retrieval that fails to capture evolving character relations and dynamic plotlines in extended contexts.

Method: ComoRAG uses iterative reasoning cycles where it generates probing queries to explore new paths, integrates retrieved evidence into a global memory pool, and consolidates knowledge to build coherent context for query resolution.

Result: Outperforms strong RAG baselines with up to 11% relative gains across four challenging long-context narrative benchmarks (200K+ tokens), particularly excelling at complex queries requiring global context comprehension.

Conclusion: ComoRAG provides a principled, cognitively motivated paradigm for retrieval-based stateful reasoning, effectively handling the dynamic nature of narrative comprehension in long stories through iterative evidence acquisition and memory consolidation.

Abstract: Narrative comprehension on long stories and novels has been a challenging domain owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs’ diminished reasoning over extended contexts and their high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition on reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains of up to 11% over the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global context comprehension, offering a principled, cognitively motivated paradigm towards retrieval-based stateful reasoning. Our framework is made publicly available at https://github.com/EternityJune25/ComoRAG.
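The probe-retrieve-consolidate cycle described above can be sketched with stand-ins: a naive keyword-overlap retriever and a list as the global memory pool. Everything here (function names, probes, corpus) is illustrative, not the ComoRAG implementation:

```python
def best_match(query, candidates):
    """Score candidates by naive keyword overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(p.lower().split())), p) for p in candidates]
    return max(scored, key=lambda t: t[0])  # (score, passage)

def comorag_style_loop(question, corpus, probes, max_cycles=3):
    """Iterative cycles: issue a probing query, retrieve new evidence,
    consolidate it into a global memory pool; stop at an impasse."""
    memory = []
    for cycle in range(max_cycles):
        probe = probes[cycle % len(probes)]
        candidates = [p for p in corpus if p not in memory]
        if not candidates:
            break
        score, evidence = best_match(f"{question} {probe}", candidates)
        if score == 0:           # no new relevant evidence: impasse
            break
        memory.append(evidence)  # consolidate into the memory pool
    return memory

corpus = [
    "Ann met Bob in chapter one",
    "Bob betrayed Ann near the end",
    "the weather was mild all spring",
]
memory = comorag_style_loop("why did Ann distrust Bob", corpus,
                            probes=["meeting", "betrayal"])
print(len(memory))  # 2: evidence consolidated over two cycles, then impasse
```

The real system replaces each stand-in with an LLM: probing queries are generated, retrieval is dense, and consolidation summarizes rather than appends, but the stateful loop structure is the same.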

[81] Mitigating Hallucinations in Large Language Models via Causal Reasoning

Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao

Main category: cs.CL

TL;DR: CDCR-SFT is a supervised fine-tuning framework that trains LLMs to explicitly construct causal DAGs and perform reasoning over them, achieving state-of-the-art causal reasoning performance and reducing hallucinations.

DetailsMotivation: Existing reasoning approaches in LLMs operate at linguistic token level rather than modeling underlying causal relationships, lacking ability to represent conditional independencies or satisfy causal identification assumptions.

Method: Introduces CDCR-SFT framework that trains LLMs to explicitly construct variable-level directed acyclic graphs (DAGs) and perform reasoning over them, using a dataset of 25,368 samples with explicit causal DAGs.

Result: Achieves 95.33% accuracy on CLADDER (surpassing human performance of 94.8%), reduces hallucination on HaluEval by 10%, and improves causal reasoning across four LLMs on eight tasks.

Conclusion: Explicit causal structure modeling in LLMs can effectively mitigate logical inconsistencies and hallucinations in LLM outputs.

Abstract: Large language models (LLMs) exhibit logically inconsistent hallucinations that appear coherent yet violate reasoning principles, with recent research suggesting an inverse relationship between causal reasoning capabilities and such hallucinations. However, existing reasoning approaches in LLMs, such as Chain-of-Thought (CoT) and its graph-based variants, operate at the linguistic token level rather than modeling the underlying causal relationships between variables, lacking the ability to represent conditional independencies or satisfy causal identification assumptions. To bridge this gap, we introduce causal-DAG construction and reasoning (CDCR-SFT), a supervised fine-tuning framework that trains LLMs to explicitly construct a variable-level directed acyclic graph (DAG) and then perform reasoning over it. Moreover, we present a dataset, CausalDR, comprising 25,368 samples, where each sample includes an input question, explicit causal DAG, graph-based reasoning trace, and validated answer. Experiments on four LLMs across eight tasks show that CDCR-SFT improves causal reasoning capability, achieving state-of-the-art 95.33% accuracy on CLADDER (surpassing human performance of 94.8% for the first time), and reduces hallucination on HaluEval by 10%. This demonstrates that explicit causal structure modeling in LLMs can effectively mitigate logical inconsistencies in LLM outputs. Code is available at https://github.com/MrLYG/CDCR-SFT.
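A variable-level causal DAG like the one the framework trains models to construct can be represented as an adjacency map, with a topological sort both validating acyclicity and providing a causal order for reasoning. A minimal sketch (the smoking/tar/cancer graph is a textbook example, not from the paper's dataset):

```python
from collections import deque

def topological_order(dag):
    """Kahn's algorithm over a {node: [children]} map. Returns a causal
    ordering, or raises ValueError if the graph contains a cycle."""
    indeg = {v: 0 for v in dag}
    for children in dag.values():
        for c in children:
            indeg[c] = indeg.get(c, 0) + 1
    queue = deque(sorted(v for v, d in indeg.items() if d == 0))
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in dag.get(v, []):
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    if len(order) != len(indeg):
        raise ValueError("graph has a cycle, not a valid causal DAG")
    return order

# A toy variable-level DAG: smoking -> tar -> cancer, smoking -> cancer
dag = {"smoking": ["tar", "cancer"], "tar": ["cancer"], "cancer": []}
print(topological_order(dag))  # ['smoking', 'tar', 'cancer']
```

Reasoning over the DAG then proceeds in this causal order, so each variable is explained only in terms of its ancestors, which is what token-level CoT cannot guarantee.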

[82] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu

Main category: cs.CL

TL;DR: SPARK is a training-free KV cache compression method that applies channel-level sparsity pruning to reduce memory usage while maintaining model accuracy, achieving over 30% KV cache reduction with minimal performance degradation.

DetailsMotivation: Current KV cache compression methods focus on temporal axis compression but ignore fine-grained importance variations across feature dimensions, limiting their ability to balance efficiency and accuracy effectively.

Method: SPARK applies unstructured sparsity by pruning KV cache at the channel level and dynamically restoring pruned entries during attention computation, making it orthogonal to existing compression techniques.

Result: SPARK reduces KV cache storage by over 30% compared to eviction-based methods while preserving or improving model accuracy, and maintains performance with less than 5% degradation even at 80% pruning ratio.

Conclusion: SPARK effectively addresses the KV cache bottleneck through channel-level sparsity, enabling longer sequence processing within the same memory budget while maintaining model performance.

Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
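The channel-level prune-then-restore idea can be sketched on a single key vector: keep only the largest-magnitude channels as (index, value) pairs, then densify with zeros when the attention dot product is computed. A toy sketch with made-up numbers, not the SPARK kernel (which selects channels query-adaptively):

```python
def prune_channels(vec, keep):
    """Keep the `keep` largest-magnitude channels as {index: value}."""
    idx = sorted(range(len(vec)), key=lambda i: -abs(vec[i]))[:keep]
    return {i: vec[i] for i in idx}

def restore(sparse, dim):
    """Densify a pruned vector, filling pruned channels with zeros."""
    return [sparse.get(i, 0.0) for i in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

key = [0.9, 0.01, -1.2, 0.02, 0.7, 0.0]   # several channels are near-zero
query = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

sparse_key = prune_channels(key, keep=3)   # 50% channel pruning
approx = dot(query, restore(sparse_key, len(key)))
exact = dot(query, key)
print(round(exact, 3), round(approx, 3))   # 0.43 0.4: small score error
```

Because the dropped channels carry near-zero values, the attention score barely changes while the stored key shrinks by half, which is the intuition behind exploiting channel saliency.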

[83] Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs

Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, Zilong Zheng

Main category: cs.CL

TL;DR: The paper proposes Router Lens to identify context-faithful experts in LLMs and introduces CEFT, a lightweight fine-tuning method that selectively optimizes these experts to improve context faithfulness more efficiently than full fine-tuning.

DetailsMotivation: Large language models often fail to ground their outputs in provided context, leading to irrelevant responses. The work explores whether certain experts in mixture-of-experts architectures specialize in context utilization as a pathway to improve context faithfulness.

Method: Proposed Router Lens method to identify context-faithful experts, and introduced Context-faithful Expert Fine-Tuning (CEFT) - a lightweight optimization approach that selectively fine-tunes only the identified context-faithful experts.

Result: Analysis reveals that context-faithful experts progressively amplify attention to relevant contextual information, enhancing context grounding. CEFT matches or surpasses full fine-tuning performance across various benchmarks while being significantly more efficient.

Conclusion: Targeted optimization of context-faithful experts through CEFT provides an effective and efficient approach to improve context faithfulness in LLMs, leveraging the emergent specialization observed in mixture-of-experts architectures.

Abstract: Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization, offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.

[84] Can Language Models Handle a Non-Gregorian Calendar? The Case of the Japanese wareki

Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

Main category: cs.CL

TL;DR: Evaluation of language models’ ability to handle Japanese wareki calendar system, revealing struggles with calendar arithmetic and knowledge despite some models performing basic conversions.

DetailsMotivation: Most temporal reasoning research focuses on Gregorian calendar, but many non-Gregorian systems like Japanese wareki are culturally important and actively used, yet their handling by LMs remains unevaluated.

Method: Created datasets requiring temporal knowledge and reasoning with wareki dates, then evaluated open and closed language models on calendar conversions, arithmetic, and knowledge tasks.

Result: Some models can perform calendar conversions, but GPT-4o, Deepseek V3, and Japanese-centric models struggle with Japanese calendar arithmetic and knowledge; error analysis suggests corpus frequency and Gregorian bias as explanations.

Conclusion: Current LMs are inadequately equipped for culture-specific calendar tasks, highlighting the need for developing models better suited for diverse temporal systems beyond the Gregorian calendar.

Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. How well current LMs handle such non-Gregorian calendars has not yet been evaluated. Here, we present a systematic evaluation of how well language models handle one such non-Gregorian system: the Japanese wareki. We create datasets that require temporal knowledge and reasoning with wareki dates. Evaluating open and closed LMs, we find that some models can perform calendar conversions, but GPT-4o, Deepseek V3, and even Japanese-centric models struggle with Japanese calendar arithmetic and knowledge involving wareki dates. Error analysis suggests the corpus frequency of Japanese calendar expressions and a Gregorian bias in the models’ knowledge as possible explanations. Our results show the importance of developing LMs that are better equipped for culture-specific tasks such as calendar understanding.
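The kind of conversion the benchmark probes is mechanically simple for recent eras: each era's year 1 maps to a fixed Gregorian start year (Meiji 1868, Taisho 1912, Showa 1926, Heisei 1989, Reiwa 2019). A minimal sketch that deliberately ignores the exact mid-year transition dates:

```python
# Era name -> first Gregorian year of that era (year 1 of the era).
# Simplified: real conversions must also handle exact transition dates.
ERA_START = {"Meiji": 1868, "Taisho": 1912, "Showa": 1926,
             "Heisei": 1989, "Reiwa": 2019}

def wareki_to_gregorian(era, year):
    """e.g. Reiwa 1 -> 2019, Heisei 31 -> 2019."""
    return ERA_START[era] + year - 1

def gregorian_to_wareki(year):
    """Map a Gregorian year to the era in force at the start of that year."""
    era = max((start, name) for name, start in ERA_START.items()
              if start <= year)[1]
    return era, year - ERA_START[era] + 1

print(wareki_to_gregorian("Reiwa", 7))    # 2025
print(gregorian_to_wareki(2001))          # ('Heisei', 13)
```

That the arithmetic is a one-line offset, yet models still fail at it, supports the paper's point that the gap is one of knowledge and corpus exposure rather than computational difficulty.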

[85] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

Krithi Shailya, Akhilesh Kumar Mishra, Gokul S Krishnan, Balaraman Ravindran

Main category: cs.CL

TL;DR: LLMs show significant geographic, demographic, and economic biases in university recommendations, favoring Global North institutions and reinforcing gender stereotypes, despite some diversity in LLaMA-3.1.

DetailsMotivation: To examine biases in LLM-based educational recommendations that risk perpetuating societal inequalities in higher education access.

Method: Analyzed 25,000+ recommendations from three open-source LLMs (LLaMA-3.1-8B, Gemma-7B, Mistral-7B) using 360 simulated user profiles varying by gender, nationality, and economic status, with a novel multi-dimensional evaluation framework.

Result: Strong biases found: Global North institutions favored, gender stereotypes reinforced, institutional repetition prevalent. LLaMA-3.1 showed highest diversity (481 universities across 58 countries) but systemic disparities persisted.

Conclusion: Urgent need for bias consideration in educational LLMs to ensure equitable global access to higher education, requiring multi-dimensional evaluation beyond just accuracy metrics.

Abstract: Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.
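Diversity and repetition statistics like those reported (unique institutions, country coverage) reduce to simple counting over the recommendation lists. A minimal sketch with invented data and metric names, not the paper's evaluation framework:

```python
from collections import Counter

def diversity_report(recommendations):
    """recommendations: list of (university, country) pairs."""
    unis = Counter(u for u, _ in recommendations)
    countries = {c for _, c in recommendations}
    top_uni, top_n = unis.most_common(1)[0]
    return {
        "unique_universities": len(unis),
        "countries_covered": len(countries),
        "repetition_rate": top_n / len(recommendations),
    }

recs = [("Uni A", "US"), ("Uni A", "US"), ("Uni B", "UK"), ("Uni C", "IN")]
print(diversity_report(recs))
# {'unique_universities': 3, 'countries_covered': 3, 'repetition_rate': 0.5}
```

Comparing such reports across simulated user profiles (varying gender, nationality, economic status) is what surfaces the disparities the paper describes.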

[86] Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance

Ahmed Alajrami, Xingwei Tan, Nikolaos Aletras

Main category: cs.CL

TL;DR: Instruction-tuning with perturbed instructions can improve LLMs’ resistance to noisy inputs and sometimes enhance downstream performance.

DetailsMotivation: LLMs are sensitive to minor variations in instruction phrasing, and this study explores whether introducing perturbations during instruction-tuning can make models more resilient to noisy user inputs.

Method: Instruction-tuning with perturbations like removing stop words or shuffling words, then evaluating on original and perturbed versions of benchmarks (MMLU, BBH, GSM8K) while assessing learning dynamics and behavior shifts.

Result: Surprisingly, instruction-tuning on perturbed instructions can improve downstream performance in some cases, making LLMs more resilient to noisy inputs.

Conclusion: Including perturbed instructions in instruction-tuning is important as it enhances LLMs’ robustness against noisy user inputs.

Abstract: Instruction-tuning plays a vital role in enhancing the task-solving abilities of large language models (LLMs), improving their usability in generating helpful responses on various tasks. However, previous work has demonstrated that they are sensitive to minor variations in instruction phrasing. In this paper, we explore whether introducing perturbations in instruction-tuning data can enhance LLMs’ resistance against noisy instructions. We focus on how instruction-tuning with perturbations, such as removing stop words or shuffling words, affects LLMs’ performance on the original and perturbed versions of widely-used benchmarks (MMLU, BBH, GSM8K). We further assess learning dynamics and potential shifts in model behavior. Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance. These findings highlight the importance of including perturbed instructions in instruction-tuning, which can make LLMs more resilient to noisy user inputs.
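The two perturbations named in the abstract are straightforward to implement; a minimal sketch, where the tiny stop-word list is a stand-in for whatever list the authors used:

```python
import random

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and"}  # tiny stand-in list

def remove_stop_words(instruction):
    """Perturbation 1: drop common function words."""
    return " ".join(w for w in instruction.split()
                    if w.lower() not in STOP_WORDS)

def shuffle_words(instruction, seed=0):
    """Perturbation 2: randomly reorder the words (seeded for repeatability)."""
    words = instruction.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

inst = "Summarize the main idea of the passage in one sentence"
print(remove_stop_words(inst))  # "Summarize main idea passage one sentence"
print(shuffle_words(inst))      # same words, scrambled order
```

Applying such functions to the instruction field of a tuning dataset, while leaving the target responses untouched, reproduces the training setup the paper studies.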

[87] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium

Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li

Main category: cs.CL

TL;DR: High-performance matrix multiplication optimization for LLM inference on AWS Trainium AI accelerator using kernel fusion and caching strategies

DetailsMotivation: Trainium AI accelerator provides cost-effective solutions but its systolic array architecture and data layout requirements make high-performance challenging

Method: Designed techniques using kernel fusion and novel caching strategies to reduce data movement, maximize SRAM bandwidth, and avoid expensive matrix transpose

Result: Achieved 1.35x average speedup (up to 2.22x) for matmul kernel and 1.66x average speedup (up to 2.49x) for end-to-end LLM inference compared to AWS implementation

Conclusion: The proposed optimization techniques significantly improve LLM inference performance on Trainium by addressing architectural challenges

Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirements on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of the matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.

[88] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache

Dinghong Song, Yuan Feng, Yiwei Wang, Shangye Chen, Cyril Guyot, Filip Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li

Main category: cs.CL

TL;DR: AttnCache accelerates LLM prefill inference by caching and reusing similar attention maps, reducing computational overhead of self-attention with minimal accuracy loss.

DetailsMotivation: Many real-world workloads use only the prefill stage of LLM inference where self-attention's quadratic complexity becomes the main bottleneck. Semantically different sentences often produce similar attention maps.

Method: Proposes AttnCache framework that builds an attention map memorization database, uses efficient caching and similarity search to identify and reuse pre-cached attention maps during inference.

Result: Achieves average 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU with negligible accuracy degradation.

Conclusion: AttnCache effectively accelerates prefill-only LLM inference by leveraging attention map similarity, providing significant speed improvements with minimal impact on accuracy.

Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
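The cache-and-reuse pattern can be sketched with stand-ins: a cheap bag-of-words signature in place of a learned embedding, and linear scan in place of the paper's similarity search. The class and threshold below are illustrative, not the AttnCache implementation:

```python
import math
from collections import Counter

def signature(sentence):
    """Cheap stand-in for a sentence embedding: a bag-of-words vector."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a)   # Counter returns 0 for missing keys
    return num / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

class AttnMapCache:
    """Reuse a cached attention map when a new input is similar enough."""
    def __init__(self, threshold=0.8):
        self.entries = []           # list of (signature, attention_map)
        self.threshold = threshold

    def lookup(self, sentence):
        sig = signature(sentence)
        for cached_sig, attn in self.entries:
            if cosine(sig, cached_sig) >= self.threshold:
                return attn         # cache hit: skip attention computation
        return None                 # cache miss: compute attention as usual

    def store(self, sentence, attn):
        self.entries.append((signature(sentence), attn))

cache = AttnMapCache()
cache.store("the cat sat on the mat", attn="<attention map A>")
print(cache.lookup("the cat sat on the mat today"))  # similar -> reused map
print(cache.lookup("quarterly revenue grew fast"))   # None -> must compute
```

On a miss the caller computes attention normally and calls `store`, so over time the database covers the workload's recurring input patterns.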

[89] Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

Main category: cs.CL

TL;DR: The paper proposes training LLMs to reason about instruction hierarchies, enabling them to prioritize system instructions over user inputs, which improves reliability and safety against attacks.

DetailsMotivation: As LLMs take on high-stakes roles, they need to reconcile competing instructions from multiple sources (developers, users, tools) within prompts, requiring enforcement of instruction hierarchies for reliability and controllability.

Method: Reframe instruction hierarchy resolution as reasoning task, construct VerIH dataset with aligned/conflicting system-user instructions, use lightweight reinforcement learning to transfer reasoning capabilities to instruction prioritization.

Result: Finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, generalize to safety-critical settings, enhance robustness against jailbreak and prompt injection attacks.

Conclusion: Reasoning over instruction hierarchies provides practical path to reliable LLMs where system prompt updates yield controllable and robust behavior changes.

Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first “think” about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

[90] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus

Main category: cs.CL

TL;DR: LLMs struggle to authentically role-play villainous characters due to safety alignment, showing declining fidelity as character morality decreases.

DetailsMotivation: To investigate how LLMs' safety alignment conflicts with their ability to authentically portray morally ambiguous or villainous characters in creative generation tasks.

Method: Created Moral RolePlay benchmark with four-level moral alignment scale and balanced test set, then evaluated state-of-the-art LLMs role-playing characters from moral paragons to pure villains.

Result: Models show consistent decline in role-playing fidelity as character morality decreases, struggle most with traits antithetical to safety principles (Deceitful, Manipulative), and substitute nuanced malevolence with superficial aggression.

Conclusion: There is a fundamental tension between model safety and creative fidelity, with safety-aligned models performing poorly in villain role-playing, highlighting the need for more nuanced alignment methods.

Abstract: Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

[91] Quantifying Edits Decay in Fine-tuned LLMs

Yinjie Cheng, Paul Youssef, Christin Seifert, Jörg Schlötterer, Zhixue Zhao

Main category: cs.CL

TL;DR: Fine-tuning after knowledge editing causes edits to decay, with survival rates varying by editing method and fine-tuning approach. Selective-layer fine-tuning can effectively remove edits.

DetailsMotivation: To understand if knowledge edits survive fine-tuning, motivated by practical needs to either preserve beneficial edits or remove malicious ones.

Method: Systematic evaluation of 232 configurations using 2 editing methods (MEMIT, AlphaEdit) and 3 fine-tuning approaches (full-parameter, LoRA, DoRA) across 5 LLMs and 3 datasets.

Result: Edits decay after fine-tuning, with AlphaEdit decaying more than MEMIT. Fine-tuning edited layers only effectively removes edits, while fine-tuning non-edited layers impairs more edits than full fine-tuning.

Conclusion: Knowledge editing evaluation must consider the full LLM pipeline, and selective-layer fine-tuning provides actionable strategies for managing edits during fine-tuning.

Abstract: Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits as shown in Figure 1, current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edits decay after fine-tuning, investigating how fine-tuning affects knowledge editing. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
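The survival measurement can be sketched as follows (a minimal illustration with a stubbed model; `model_answer` and the example facts are hypothetical, not the paper's evaluation code): an edit "survives" fine-tuning if the fine-tuned model still returns the edited target for the edit prompt.

```python
# Minimal sketch (stubbed model, hypothetical facts): fraction of
# knowledge edits that still hold after fine-tuning.

def edit_survival_rate(edits, model_answer):
    """edits: (prompt, edited_target) pairs applied before fine-tuning.
    model_answer: callable querying the fine-tuned model."""
    survived = sum(1 for prompt, target in edits
                   if model_answer(prompt) == target)
    return survived / len(edits)

# A stub fine-tuned model that has forgotten one of three earlier edits:
answers = {"capital of X?": "Y", "CEO of A?": "B", "sky color?": "green"}
edits = [("capital of X?", "Y"), ("CEO of A?", "B"), ("sky color?", "blue")]
rate = edit_survival_rate(edits, answers.get)
print(rate)  # two of the three edits survive
```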

[92] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

Fangqi Dai, Xingjian Jiang, Zizhuang Deng

Main category: cs.CL

TL;DR: HLPD is a novel method for detecting machine-revised text using human language preference optimization, achieving significant improvements over existing approaches in identifying texts modified by advanced LLMs.

DetailsMotivation: To address the challenge of detecting machine-revised text, especially in black-box settings where the generating model is unknown, and prevent misinformation from trustworthy-looking LLM-generated content.

Method: Proposes Human Language Preference Detection (HLPD) with Human Language Preference Optimization (HLPO) - a reward-based alignment process that shifts the scoring model’s token distribution toward human-like writing patterns to enhance sensitivity to human writing.

Result: HLPD achieves 15.11% relative improvement in AUROC over ImBD for GPT-series revised texts, and 45.56% improvement over Fast-DetectGPT. For advanced LLMs, it achieves highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%.

Conclusion: HLPD effectively addresses the limitations of previous methods in detecting machine-revised text, demonstrating superior performance across various LLM revision scenarios through its human language preference optimization approach.

Abstract: To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model’s token distribution toward human-like writing, making the model more sensitive to human writing, therefore enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.

[93] Categorical Emotions or Appraisals – Which Emotion Model Explains Argument Convincingness Better?

Lynn Greschner, Meike Bauer, Sabine Weber, Roman Klinger

Main category: cs.CL

TL;DR: Appraisal theories are more effective than categorical emotions for predicting argument convincingness, as they capture subjective cognitive evaluations of importance and impact.

DetailsMotivation: Argument convincingness depends not just on structure and speaker credibility, but also on subjective emotional responses influenced by recipients' goals, knowledge, and stance. Appraisal theories can link cognitive assessments to emotions but their suitability for argument analysis remains unexplored.

Method: Used zero-shot prompting experiments with the ContArgA corpus to evaluate the importance of gold-annotated and predicted emotions and appraisals for assessing subjective convincingness labels.

Result: Categorical emotion information improves convincingness prediction, but appraisals provide more pronounced improvement, demonstrating their superior effectiveness.

Conclusion: This is the first systematic comparison showing appraisals’ advantage over categorical emotions for convincingness prediction, offering insights for computational argumentation theory and practice.

Abstract: The convincingness of an argument does not only depend on its structure (logos), the person who makes the argument (ethos), but also on the emotion that it causes in the recipient (pathos). While the overall intensity and categorical values of emotions in arguments have received considerable attention in the research community, we argue that the emotion an argument evokes in a recipient is subjective. It depends on the recipient’s goals, standards, prior knowledge, and stance. Appraisal theories lend themselves as a link between the subjective cognitive assessment of events and emotions. They have been used in event-centric emotion analysis, but their suitability for assessing argument convincingness remains unexplored. In this paper, we evaluate whether appraisal theories are suitable for emotion analysis in arguments by considering subjective cognitive evaluations of the importance and impact of an argument on its receiver. Based on the annotations in the recently published ContArgA corpus, we perform zero-shot prompting experiments to evaluate the importance of gold-annotated and predicted emotions and appraisals for the assessment of the subjective convincingness labels. We find that, while categorical emotion information does improve convincingness prediction, the improvement is more pronounced with appraisals. This work presents the first systematic comparison between emotion models for convincingness prediction, demonstrating the advantage of appraisals, providing insights for theoretical and practical applications in computational argumentation.

[94] Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production

Shiva Upadhye, Richard Futrell

Main category: cs.CL

TL;DR: This paper investigates backward predictability effects in natural language production using improved measures and language models, showing how both past and future context influence word duration and substitution errors.

DetailsMotivation: To better understand the poorly-understood backward predictability effect of words given their future context, which may relate to future planning in language production.

Method: Two studies using naturalistic speech corpora with improved predictability measures and powerful language models, including a new information-theoretic measure integrating both future and past context predictability.

Result: The proposed backward predictability measure yields similar effects across both word duration and substitution error studies, with fine-grained error analysis revealing how speakers prioritize different information types during lexical planning.

Conclusion: The findings illuminate the functional roles of past and future context in word encoding and choice, bridging contextual predictability effects with sentence planning mechanisms.

Abstract: Contextual predictability shapes both the form and choice of words in online language production. The effects of the predictability of a word given its previous context are generally well-understood in both production and comprehension, but studies of naturalistic production have also revealed a poorly-understood backward predictability effect of a word given its future context, which may be related to future planning. Here, in two studies of naturalistic speech corpora, we investigate backward predictability effects using improved measures and more powerful language models, introducing a new principled and conceptually motivated information-theoretic predictability measure that integrates predictability from both the future and the past context. Our first study revisits classic predictability effects on word duration. Our second study investigates substitution errors within a generative framework that independently models the effects of lexical, contextual, and communicative factors on word choice, while predicting the actual words that surface as speech errors. We find that our proposed conceptually-motivated alternative to backward predictability yields qualitatively similar effects across both studies. Through a fine-grained analysis of substitution errors, we further show that different kinds of errors are suggestive of how speakers prioritize form, meaning, and context-based information during lexical planning. Together, these findings illuminate the functional roles of past and future context in how speakers encode and choose words, offering a bridge between contextual predictability effects and the mechanisms of sentence planning.
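The two notions of predictability can be illustrated with bigram estimates on a toy corpus (an assumption purely for illustration; the paper estimates these quantities with modern neural language models, not bigram counts):

```python
import math
from collections import Counter

# Toy illustration: forward surprisal -log2 P(w | previous word) versus
# backward surprisal -log2 P(w | following word), from one tiny corpus.

corpus = "the cat sat on the mat the cat ran".split()
pairs = list(zip(corpus, corpus[1:]))        # (word, next_word) bigrams
left_counts = Counter(w for w, _ in pairs)   # word occurring as left context
right_counts = Counter(n for _, n in pairs)  # word occurring as right context
pair_counts = Counter(pairs)

def forward_surprisal(word, prev):
    """-log2 P(word | preceding word), in bits."""
    return math.log2(left_counts[prev] / pair_counts[(prev, word)])

def backward_surprisal(word, nxt):
    """-log2 P(word | following word), from the same bigram table."""
    return math.log2(right_counts[nxt] / pair_counts[(word, nxt)])

fwd = forward_surprisal("cat", "the")   # P(cat | the) = 2/3
bwd = backward_surprisal("cat", "sat")  # P(cat | "sat" follows) = 1
print(round(fwd, 3), bwd)  # 0.585 0.0
```

Here "cat" is cheap to encode given its future context ("sat" is only ever preceded by "cat") even though it is not fully determined by its past context.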

[95] Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

Main category: cs.CL

TL;DR: MLLMs struggle with cross-modality skill composition despite various improvement attempts.

DetailsMotivation: To evaluate how well Multimodal Large Language Models can combine previously learned skills across different modalities to solve new tasks.

Method: Created three evaluation tasks requiring sequential composition of two modality-dependent skills, tested open MLLMs using direct prompting and two-step cascaded inference, then explored chain-of-thought prompting and fine-tuning to improve composition.

Result: All evaluated MLLMs showed significant cross-modality skill composition gaps. Chain-of-thought prompting and fine-tuning improved performance but substantial gaps remained.

Conclusion: Current MLLMs have significant limitations in cross-modal skill composition, requiring more research to address this fundamental capability gap.

Abstract: Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLM), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore two alternatives: i) use chain-of-thought prompting to explicitly instruct MLLMs for skill composition and ii) a specific fine-tuning recipe to promote skill composition. Although those strategies improve model performance, they still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
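The two-step cascaded setting can be sketched with stubbed skill functions (names and the toy task are hypothetical): the first skill's intermediate output is fed explicitly into the second, manually enforcing the composition that direct prompting leaves implicit.

```python
# Sketch of two-step cascaded inference (skill functions are stubs; the
# real tasks compose modality-dependent skills of an MLLM).

def cascade(skill_a, skill_b, task_input):
    """Pass the first skill's intermediate result explicitly to the second."""
    return skill_b(skill_a(task_input))

recognize = {"img_7": 7}.get   # stub vision skill: read the digit in an image
spell = {7: "seven"}.get       # stub language skill: spell the digit out
result = cascade(recognize, spell, "img_7")
print(result)  # seven
```

Comparing a model's direct-prompting accuracy against this cascaded upper bound is what exposes the composition gap.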

cs.CV

[96] Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants

I. Bailo, F. Buonora, G. Ciarfaglia, L. T. Consoli, A. Evangelista, M. Gabusi, M. Ghiani, C. Petracca Ciavarella, F. Picariello, F. Sarcina, F. Tuosto, V. Zullo, L. Airoldi, G. Bruno, D. D. Gobbo, S. Pezzenati, G. A. Tona

Main category: cs.CV

TL;DR: AI-powered system for automating gas plant digitization using OCR, Vision LLM, Object Detection, and a novel Transformer architecture to extract design data and hierarchical plant structures from P&ID documents with high accuracy.

DetailsMotivation: To automate the digitization of SNAM's gas infrastructure plants by extracting information from P&ID documents, streamlining daily work processes and overcoming data standardization challenges through AI technologies.

Method: Uses OCR, Vision LLM, Object Detection, Relational Reasoning, and optimization algorithms. Introduces a novel Transformer architecture for Scene Graph Generation to analyze complex relations between plant components from P&ID PDFs.

Result: Achieved 91% accuracy in extracting textual design data, 93% accuracy in component identification, and approximately 80% accuracy in extracting hierarchical plant structures.

Conclusion: The synergistic combination of AI technologies successfully automates plant digitization, overcoming data variety challenges and achieving high accuracy rates for both design data extraction and plant topology reconstruction.

Abstract: The energy transition is a key theme of recent decades in determining a future of eco-sustainability, and an area of such importance cannot disregard digitization, innovation, and the new technological tools available. This is the context in which the Generative Artificial Intelligence models described in this paper are positioned, developed by Engineering Ingegneria Informatica SpA in order to automate the acquisition of plant structures for the energy infrastructure of SNAM, a leading gas transportation company in Italy and Europe. The digitization of a gas plant consists in registering all its relevant information through the interpretation of the related documentation. The aim of this work is therefore to design an effective solution based on Artificial Intelligence techniques to automate the extraction of the information necessary for the digitization of a plant, in order to streamline the daily work of MGM users. The solution receives the P&IDs of the plant as input, each in PDF format, and uses OCR, Vision LLM, Object Detection, Relational Reasoning and optimization algorithms to return an output consisting of two sets of information: a structured overview of the relevant design data and the hierarchical framework of the plant. To achieve convincing results, we extend a state-of-the-art model for Scene Graph Generation, introducing a new Transformer architecture with the aim of deepening the analysis of the complex relations between the plant's components. The synergistic use of the listed AI-based technologies made it possible to overcome many obstacles arising from the high variety of data, due to the lack of standardization. An accuracy of 91% has been achieved in the extraction of textual information relating to design data. Regarding the plant topology, 93% of components are correctly identified and the hierarchical structure is extracted with an accuracy of around 80%.

[97] Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel, Alexander Waibel

Main category: cs.CV

TL;DR: Proposes a systematic evaluation methodology to detect and quantify lip leakage in inpainting-based talking face generation, where generated lips are influenced by reference images rather than audio.

DetailsMotivation: Standard metrics and test setups fail to detect lip leaking, where generated lips are influenced by identity reference images rather than driving audio, compromising audio-visual synchronization.

Method: Uses three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis, with derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores.

Result: Provides a framework to analyze how different identity reference selections affect lip leakage, establishing a more reliable benchmark for talking face generation evaluation.

Conclusion: The proposed model-agnostic methodology enables systematic detection and quantification of lip leakage, offering insights for reference design and improving evaluation reliability in talking face generation research.

Abstract: Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setup. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
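The discrepancy idea behind the mismatched-audio setup can be sketched as follows (the scores here are stubbed and the helper name is hypothetical; a real evaluation would run a lip-sync scorer over generated videos): leakage shows up as a small gap between matched-audio and mismatched-audio sync scores, because leaked lips move plausibly regardless of the driving audio.

```python
# Illustrative sketch (stubbed scores, hypothetical helper name): probing
# lip leakage via the gap between matched and mismatched sync scores.

def lip_sync_discrepancy(score_matched, score_mismatched):
    """Large gap: lips follow the audio. Near-zero gap: likely leakage
    from the identity reference image."""
    return score_matched - score_mismatched

leaky = lip_sync_discrepancy(score_matched=7.1, score_mismatched=6.8)
faithful = lip_sync_discrepancy(score_matched=7.3, score_mismatched=2.1)
print(faithful > leaky)  # True: the second model's lips track the audio
```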

[98] Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

Xianhui Meng, Yukang Huo, Li Zhang, Liu Liu, Haonan Jiang, Yan Zhong, Pingrui Zhang, Cewu Lu, Jun Liu

Main category: cs.CV

TL;DR: PPF-Tracker is a novel point-pair-based framework for articulated object pose tracking that uses SE(3) quasi-canonicalization, Point Pair Features, and kinematic constraints to achieve robust multi-frame tracking.

DetailsMotivation: Articulated object pose tracking is underexplored compared to rigid objects due to inherent kinematic constraints, creating challenges in robotics and manipulation tasks.

Method: Performs SE(3) quasi-canonicalization of point clouds, models objects using Point Pair Features to predict pose voting parameters, and incorporates semantic joint axis information to impose unified kinematic constraints.

Result: Demonstrates strong generalization across diverse synthetic datasets and real-world scenarios, showing effectiveness and robustness in multi-frame pose tracking.

Conclusion: PPF-Tracker advances articulated object pose tracking and can foster progress in robotics, embodied intelligence, and augmented reality applications.

Abstract: Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed \textbf{PPF-Tracker}. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. Codes are available at https://github.com/mengxh20/PPFTracker.
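The underlying PPF descriptor (from the classic point-pair-feature literature, not this paper's full pipeline) encodes, for two oriented points, the pair distance plus three angles, all of which are invariant under rigid SE(3) motion:

```python
import math

# Sketch of the classic Point Pair Feature: for oriented points (p1, n1)
# and (p2, n2), the descriptor is (||d||, angle(n1,d), angle(n2,d),
# angle(n1,n2)) with d = p2 - p1. Rigid-motion invariant by construction.

def _angle(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def point_pair_feature(p1, n1, p2, n2):
    d = tuple(b - a for a, b in zip(p1, p2))
    dist = math.sqrt(sum(c * c for c in d))
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))

f = point_pair_feature((0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 0, 1))
print([round(x, 3) for x in f])  # unit distance, parallel normals
```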

[99] A Multi-Drone Multi-View Dataset and Deep Learning Framework for Pedestrian Detection and Tracking

Kosta Dakic, Kanchana Thilakarathna, Rodrigo N. Calheiros, Teng Joon Lim

Main category: cs.CV

TL;DR: MATRIX introduces a multi-drone surveillance dataset with dynamic camera positions and a deep learning framework for robust pedestrian tracking in complex urban environments with occlusions.

DetailsMotivation: Existing multi-drone surveillance approaches struggle with dynamic camera positions and complex occlusions, limiting their effectiveness in real-world urban scenarios.

Method: Proposes a framework with real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird’s-eye-view representation for dynamic drone-based surveillance.

Result: Maintains ~90% detection/tracking accuracy and ~80% trajectory tracking in complex environments, outperforming static camera methods which degrade significantly. Shows strong generalization and graceful performance degradation under camera failures.

Conclusion: MATRIX dataset and framework provide essential benchmarks for advancing dynamic multi-view surveillance systems with practical robustness for real-world deployments.

Abstract: Multi-drone surveillance systems offer enhanced coverage and robustness for pedestrian tracking, yet existing approaches struggle with dynamic camera positions and complex occlusions. This paper introduces MATRIX (Multi-Aerial TRacking In compleX environments), a comprehensive dataset featuring synchronized footage from eight drones with continuously changing positions, and a novel deep learning framework for multi-view detection and tracking. Unlike existing datasets that rely on static cameras or limited drone coverage, MATRIX provides a challenging scenario with 40 pedestrians and a significant architectural obstruction in an urban environment. Our framework addresses the unique challenges of dynamic drone-based surveillance through real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird’s-eye-view (BEV) representation. Experimental results demonstrate that while static camera methods maintain over 90% detection and tracking precision and accuracy metrics in a simplified MATRIX environment without an obstruction, 10 pedestrians and a much smaller observational area, their performance significantly degrades in the complex environment. Our proposed approach maintains robust performance with $\sim$90% detection and tracking accuracy, as well as successfully tracks $\sim$80% of trajectories under challenging conditions. Transfer learning experiments reveal strong generalization capabilities, with the pretrained model achieving much higher detection and tracking accuracy performance compared to training the model from scratch. Additionally, systematic camera dropout experiments reveal graceful performance degradation, demonstrating practical robustness for real-world deployments where camera failures may occur. The MATRIX dataset and framework provide essential benchmarks for advancing dynamic multi-view surveillance systems.

[100] Learning Topology-Driven Multi-Subspace Fusion for Grassmannian Deep Network

Xuan Yu, Tianyang Xu

Main category: cs.CV

TL;DR: A topology-driven multi-subspace fusion framework on Grassmannian manifold that enables adaptive subspace collaboration through dynamic selection and fusion of multiple subspaces, outperforming static single-subspace approaches.

DetailsMotivation: Existing Grassmannian approaches rely on static single-subspace representations, neglecting the dynamic interplay between multiple subspaces needed to capture complex geometric structures.

Method: Proposes adaptive multi-subspace modeling with topological convergence analysis for dynamic subspace selection/weighting, and multi-subspace interaction blocks using Fréchet mean optimization on the manifold, with Riemannian batch normalization and mutual information regularization.

Result: Achieves state-of-the-art performance on 3D action recognition (HDM05, FPHA), EEG classification (MAMEM-SSVEPII), and graph tasks, demonstrating superior discriminability and interpretability.

Conclusion: Successfully adapts multi-channel interaction philosophy from Euclidean networks to non-Euclidean domains, advancing geometric deep learning with improved subspace collaboration and theoretical convergence guarantees.

Abstract: Grassmannian manifold offers a powerful carrier for geometric representation learning by modelling high-dimensional data as low-dimensional subspaces. However, existing approaches predominantly rely on static single-subspace representations, neglecting the dynamic interplay between multiple subspaces critical for capturing complex geometric structures. To address this limitation, we propose a topology-driven multi-subspace fusion framework that enables adaptive subspace collaboration on the Grassmannian. Our solution introduces two key innovations: (1) Inspired by the Kolmogorov-Arnold representation theorem, an adaptive multi-subspace modelling mechanism is proposed that dynamically selects and weights task-relevant subspaces via topological convergence analysis, and (2) a multi-subspace interaction block that fuses heterogeneous geometric representations through Fréchet mean optimisation on the manifold. Theoretically, we establish the convergence guarantees of adaptive subspaces under a projection metric topology, ensuring stable gradient-based optimisation. Practically, we integrate Riemannian batch normalisation and mutual information regularisation to enhance discriminability and robustness. Extensive experiments on 3D action recognition (HDM05, FPHA), EEG classification (MAMEM-SSVEPII), and graph tasks demonstrate state-of-the-art performance. Our work not only advances geometric deep learning but also successfully adapts the proven multi-channel interaction philosophy of Euclidean networks to non-Euclidean domains, achieving superior discriminability and interpretability.
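As a numerical aside on the fusion operator: the Fréchet mean minimises the sum of squared distances to the sample points, and in Euclidean space it reduces to the arithmetic mean. This toy check verifies that property; on the Grassmannian the same objective is minimised iteratively under the projection metric, which is not shown here.

```python
# Illustration only: the Fréchet mean objective in 1-D Euclidean space,
# where its minimiser coincides with the arithmetic mean.

def frechet_objective(candidate, points):
    return sum((candidate - p) ** 2 for p in points)

points = [1.0, 2.0, 6.0]
mean = sum(points) / len(points)  # 3.0
is_min = all(
    frechet_objective(mean, points) <= frechet_objective(mean + eps, points)
    for eps in (-0.5, -0.1, 0.1, 0.5)
)
print(mean, is_min)  # 3.0 True
```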

[101] Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany

Main category: cs.CV

TL;DR: TTM is a training-free framework for motion- and appearance-controlled video generation using I2V diffusion models, leveraging crude reference animations as motion cues without requiring model fine-tuning.

DetailsMotivation: Existing diffusion-based video generation lacks precise motion control, and prior motion-conditioned methods require computationally expensive model-specific fine-tuning.

Method: Uses crude reference animations from user-friendly manipulations, adapts SDEdit’s mechanism to video domain, employs dual-clock denoising for region-dependent motion alignment while preserving appearance with image conditioning.

Result: Matches or exceeds training-based baselines in realism and motion control on object and camera motion benchmarks, with precise appearance control through pixel-level conditioning.

Conclusion: TTM provides effective motion and appearance control for video generation without additional training costs, compatible with any backbone diffusion model.

Abstract: Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit’s use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
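A minimal sketch of what a dual-clock schedule might look like (an assumption about the mechanism, not the authors' code): each region follows its own "clock" that decides down to which timestep the sampler keeps overwriting it with the noised crude reference animation, SDEdit-style, so motion-specified regions stay anchored longer than free regions.

```python
# Toy sketch (assumed mechanism, hypothetical thresholds): region-dependent
# anchoring of the sampler to the noised reference animation.

T_MOTION_ANCHOR = 300  # motion regions stay anchored down to t = 300
T_FREE_ANCHOR = 700    # free regions are released earlier, at t = 700

def keep_reference(t, in_motion_region):
    """True while this region should still follow the reference animation."""
    anchor_until = T_MOTION_ANCHOR if in_motion_region else T_FREE_ANCHOR
    return t >= anchor_until

# At t = 500, free regions already evolve on their own while motion regions
# remain pinned to the user's coarse animation:
print(keep_reference(500, True), keep_reference(500, False))  # True False
```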

[102] CADIC: Continual Anomaly Detection Based on Incremental Coreset

Gen Yang, Zhipeng Deng, Junfeng Man

Main category: cs.CV

TL;DR: Proposes a unified memory bank framework for Continual Anomaly Detection that eliminates task-specific memory fragmentation, achieving state-of-the-art performance on multiple datasets.

DetailsMotivation: Existing embedding-based CAD approaches require constructing class-specific sub-memory banks for each task, which restricts flexibility and scalability.

Method: A novel CAD framework where all tasks share a unified memory bank, incrementally updating embeddings within a fixed-size coreset during training, and using nearest-neighbor matching for anomaly detection in inference.

Result: Achieved average image-level AUROC scores of 0.972 (MVTec AD) and 0.891 (Visa), with 100% accuracy on a real-world electronic paper dataset.

Conclusion: The proposed unified memory bank approach effectively addresses memory fragmentation issues in CAD, demonstrating superior performance and practical robustness.

Abstract: The primary objective of Continual Anomaly Detection (CAD) is to learn the normal patterns of new tasks under dynamic data distribution assumptions while mitigating catastrophic forgetting. Existing embedding-based CAD approaches continuously update a memory bank with new embeddings to adapt to sequential tasks. However, these methods require constructing class-specific sub-memory banks for each task, which restricts their flexibility and scalability. To address this limitation, we propose a novel CAD framework where all tasks share a unified memory bank. During training, the method incrementally updates embeddings within a fixed-size coreset, enabling continuous knowledge acquisition from sequential tasks without task-specific memory fragmentation. In the inference phase, anomaly scores are computed via a nearest-neighbor matching mechanism, achieving state-of-the-art detection accuracy. We validate the method through comprehensive experiments on MVTec AD and Visa datasets. Results show that our approach outperforms existing baselines, achieving average image-level AUROC scores of 0.972 (MVTec AD) and 0.891 (Visa). Notably, on a real-world electronic paper dataset, it demonstrates 100% accuracy in anomaly sample detection, confirming its robustness in practical scenarios. The implementation will be open-sourced on GitHub.
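The abstract's unified memory bank can be sketched in a few lines: a fixed-size coreset shared across tasks, updated as new normal embeddings arrive, with anomaly scores given by nearest-neighbor distance at inference. This is a minimal illustration assuming a farthest-point-sampling update; the paper's exact coreset update rule is not specified here, and the function names are hypothetical.

```python
import numpy as np

def update_coreset(coreset, new_embeddings, max_size):
    """Merge new embeddings into the shared bank, then subsample back to a
    fixed-size coreset via farthest-point sampling (a common heuristic;
    the paper's actual update rule may differ)."""
    pool = np.vstack([coreset, new_embeddings]) if len(coreset) else new_embeddings
    if len(pool) <= max_size:
        return pool
    chosen = [0]
    dists = np.linalg.norm(pool - pool[0], axis=1)
    for _ in range(max_size - 1):
        idx = int(np.argmax(dists))  # point farthest from current selection
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(pool - pool[idx], axis=1))
    return pool[chosen]

def anomaly_score(query, coreset):
    """Nearest-neighbor distance to the unified memory bank."""
    return float(np.min(np.linalg.norm(coreset - query, axis=1)))
```

A query embedding close to the stored normal patterns scores low; one far from every coreset entry scores high and is flagged as anomalous.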

[103] Predict and Resist: Long-Term Accident Anticipation under Sensor Noise

Xingcheng Liu, Bin Rao, Yanchen Guan, Chengyue Wang, Haicheng Liao, Jiaxun Zhang, Chengyu Lin, Meixin Zhu, Zhenning Li

Main category: cs.CV

TL;DR: A unified framework combining diffusion-based denoising with time-aware actor-critic modeling for robust accident anticipation in autonomous driving, addressing sensory noise and timing reliability challenges.

DetailsMotivation: To enable proactive and safe autonomous driving by overcoming two key real-world challenges: noisy/degraded sensory inputs from weather, motion blur, or hardware limitations, and the need for timely yet reliable predictions that balance early alerts with false-alarm suppression.

Method: Integrates diffusion-based denoising with a time-aware actor-critic model. The diffusion module reconstructs noise-resilient image and object features through iterative refinement, while the actor-critic architecture uses long-horizon temporal reasoning and time-weighted rewards to determine optimal alert timing.

Result: Achieves state-of-the-art accuracy on three benchmark datasets (DAD, CCD, A3D) with significant gains in mean time-to-accident, while maintaining robust performance under Gaussian and impulse noise. Produces earlier, more stable, and human-aligned predictions in both routine and complex traffic scenarios.

Conclusion: The proposed framework demonstrates strong potential for real-world, safety-critical deployment by providing robust accident anticipation that handles sensory degradation while delivering timely and reliable predictions.

Abstract: Accident anticipation is essential for proactive and safe autonomous driving, where even a brief advance warning can enable critical evasive actions. However, two key challenges hinder real-world deployment: (1) noisy or degraded sensory inputs from weather, motion blur, or hardware limitations, and (2) the need to issue timely yet reliable predictions that balance early alerts with false-alarm suppression. We propose a unified framework that integrates diffusion-based denoising with a time-aware actor-critic model to address these challenges. The diffusion module reconstructs noise-resilient image and object features through iterative refinement, preserving critical motion and interaction cues under sensor degradation. In parallel, the actor-critic architecture leverages long-horizon temporal reasoning and time-weighted rewards to determine the optimal moment to raise an alert, aligning early detection with reliability. Experiments on three benchmark datasets (DAD, CCD, A3D) demonstrate state-of-the-art accuracy and significant gains in mean time-to-accident, while maintaining robust performance under Gaussian and impulse noise. Qualitative analyses further show that our model produces earlier, more stable, and human-aligned predictions in both routine and highly complex traffic scenarios, highlighting its potential for real-world, safety-critical deployment.
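The "time-weighted rewards" idea behind the actor-critic alert policy can be illustrated with a toy reward function: correct alerts earn more the earlier they come (saturating so the agent is not pushed to alert immediately regardless of evidence), while false alarms and missed accidents are penalized. This is a generic shaping, not the paper's exact reward; all names and constants here are illustrative.

```python
import math

def alert_reward(alert_frame, accident_frame, is_accident_video, lam=0.05):
    """Toy time-weighted reward for an alert decision (hypothetical form).
    Earlier correct alerts earn larger rewards; false alarms and missed
    accidents are penalized."""
    if not is_accident_video:
        # Any alert in an accident-free video is a false alarm.
        return -1.0 if alert_frame is not None else 0.0
    if alert_frame is None or alert_frame >= accident_frame:
        return -1.0  # missed the accident or alerted too late
    tta = accident_frame - alert_frame  # frames of advance warning
    # Reward grows with time-to-accident but saturates at 1.0.
    return 1.0 - math.exp(-lam * tta)
```

Under this shaping, an alert 50 frames before the accident outscores one 10 frames before it, which is exactly the mean time-to-accident metric the paper reports gains on.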

[104] RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation

Hae-Won Jo, Yeong-Jun Cho

Main category: cs.CV

TL;DR: RS-Net is a modular framework for Dynamic Scene Graph Generation that scores object pair importance using spatial and temporal context, improving relation prediction without changing existing model architectures.

DetailsMotivation: Existing DSGG methods lack guidance for non-related object pairs, making it difficult to identify meaningful relations during inference due to training only on annotated pairs.

Method: Proposes Relation Scoring Network (RS-Net) with spatial context encoder using learnable context tokens and temporal encoder for video-level aggregation, integrated into unified triplet scoring mechanism.

Result: Experiments on Action Genome dataset show consistent improvements in Recall and Precision across baselines, with notable gains in mean Recall, addressing long-tailed relation distribution while maintaining competitive efficiency.

Conclusion: RS-Net effectively enhances relation prediction in DSGG by scoring contextual importance of object pairs, achieving superior performance over state-of-the-art methods without architectural changes.

Abstract: Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.

[105] Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding

Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah

Main category: cs.CV

TL;DR: A lightweight Anonymizing Adapter Module (AAM) that removes private information from video features in latent space while maintaining utility across downstream tasks.

DetailsMotivation: Current privacy methods require pixel-level anonymization and retraining, making them unsuitable for video foundation models. Extracted visual features inadvertently reveal sensitive personal information.

Method: Plug-and-play adapter module with three training objectives: clip-level self-supervised privacy objective, co-training for utility retention, and latent consistency loss for generalization.

Result: 35% reduction in privacy leakage while maintaining near-baseline utility across Action Recognition, Temporal Action Detection, and Anomaly Detection tasks. Also mitigates gender bias in action recognition.

Conclusion: The proposed AAM effectively preserves privacy in video foundation models without requiring retraining, maintaining utility across various downstream tasks while reducing biases.

Abstract: We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.

[106] Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

Main category: cs.CV

TL;DR: Autoregressive next-pixel prediction shows divergent optimal scaling strategies for classification vs generation tasks, with compute being the primary bottleneck rather than data availability.

DetailsMotivation: To investigate the scaling properties of autoregressive next-pixel prediction as a unified framework for vision models and understand optimal scaling strategies across different vision tasks.

Method: Trained Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs at resolutions starting from 32x32, evaluating three metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality via Fréchet Distance.

Result: Found optimal scaling strategies are task-dependent - generation requires 3-5x faster data growth than classification. Higher resolutions require model size to grow much faster than data size. Compute is the primary bottleneck, not training data.

Conclusion: Pixel-by-pixel image modeling is forecasted to be feasible within 5 years given current compute growth rates of 4-5x annually.

Abstract: This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. At a fixed 32x32 resolution alone, the optimal scaling properties for image classification and image generation diverge, where the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
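The next-pixel prediction objective itself is simple to state: flatten the image in raster-scan order and score each pixel's 256-way intensity distribution given the prefix before it. The sketch below computes that objective in bits per pixel for an arbitrary predictor; `predict_logits` is a stand-in for the paper's Transformer, not its actual interface.

```python
import numpy as np

def next_pixel_nll(image, predict_logits):
    """Average negative log-likelihood of each pixel given its raster-scan
    prefix, in bits per pixel. `predict_logits(prefix)` is a hypothetical
    stand-in for the autoregressive model: it returns 256 logits for the
    next intensity value."""
    seq = image.reshape(-1)  # raster-scan flattening
    total = 0.0
    for t, target in enumerate(seq):
        logits = predict_logits(seq[:t])
        logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        total -= logp[target]
    return total / (len(seq) * np.log(2))  # convert nats to bits
```

A model that assigns uniform probability to all 256 intensities scores exactly 8 bits per pixel, the natural upper baseline for this objective.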

[107] Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification

Abhipsa Basu, Aviral Gupta, Abhijnya Bhat, R. Venkatesh Babu

Main category: cs.CV

TL;DR: The paper explores diffusion model finetuning techniques (LoRA and DreamBooth) to generate balanced training data that preserves original distributions, addressing dataset bias in image classification.

DetailsMotivation: To address biases in image classification systems caused by uneven group representation in training data, where certain attributes become disproportionately associated with specific groups (e.g., blond hair with females).

Method: Uses multiple diffusion-finetuning techniques including LoRA and DreamBooth, with clustering within groups to handle intra-group variations. Generates group-balanced data for pretraining followed by fine-tuning on real data.

Result: The finetuning approaches outperform vanilla Stable Diffusion and achieve comparable results to SOTA debiasing techniques like Group-DRO, while surpassing them as dataset bias severity increases.

Conclusion: Diffusion model finetuning techniques effectively generate balanced training data that preserves original distributions, offering a robust solution for dataset bias in image classification systems.

Abstract: Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases.

[108] WiCV at CVPR 2025: The Women in Computer Vision Workshop

Estefania Talavera, Deblina Bhattacharjee, Himangi Mittal, Mengwei Ren, Karen Sanchez, Carla Muntean, JungEun Kim, Mona Jalal

Main category: cs.CV

TL;DR: Overview report of WiCV@CVPR 2025 workshop documenting program details, participation statistics, mentorship outcomes, and historical trends to support diversity initiatives in computer vision.

DetailsMotivation: To document the impact and evolution of the Women in Computer Vision Workshop as a reference for future editions and other diversity initiatives in AI and computer vision communities.

Method: Analysis of workshop program, participation statistics, mentorship matching data, and historical trends from previous WiCV editions.

Result: WiCV@CVPR 2025 featured 14 accepted papers (5 oral presentations), 36 extended abstract posters, 80 mentees matched with 37 mentors, over 100 onsite participants, and $44,000 in travel grants and diversity awards from 10 sponsors.

Conclusion: The 16th edition of WiCV successfully continued its mission to increase visibility, inclusion, and professional growth of women and underrepresented minorities in computer vision, demonstrating sustained impact through technical presentations, networking, and mentorship programs.

Abstract: The Women in Computer Vision Workshop (WiCV@CVPR 2025) was held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) in Nashville, Tennessee, United States. This report presents an overview of the workshop program, participation statistics, mentorship outcomes, and historical trends from previous WiCV editions. The goal is to document the impact and evolution of WiCV as a reference for future editions and for other initiatives aimed at advancing diversity, equity, and inclusion within the AI and computer vision communities. WiCV@CVPR 2025 marked the 16th edition of this long-standing event dedicated to increasing the visibility, inclusion, and professional growth of women and underrepresented minorities in the computer vision community. This year’s workshop featured 14 accepted papers in the CVPR Workshop Proceedings out of 32 full-paper submissions. Five of these were selected for oral presentations, while all 14 were also presented as posters, along with 36 extended abstract posters accepted from 62 short-paper submissions, which are not included in the proceedings. The mentoring program matched 80 mentees with 37 mentors from both academia and industry. The 2025 edition attracted over 100 onsite participants, fostering rich technical and networking interactions across all career stages. Supported by 10 sponsors and approximately $44,000 USD in travel grants and diversity awards, WiCV continued its mission to empower emerging researchers and amplify diverse voices in computer vision.

[109] Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation

Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Haiyu Wu, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

Main category: cs.CV

TL;DR: An optimal transport-guided prompt learning framework that preserves structural consistency between pre-trained and fine-tuned vision-language models to improve generalization and prevent overfitting.

DetailsMotivation: Existing prompt learning methods for vision-language models lead to overfitting and degrade zero-shot generalization, creating a need for better adaptation techniques that preserve pre-trained knowledge.

Method: Uses optimal transport to guide prompt learning by preserving structural consistency of feature distributions between pre-trained and fine-tuned models, with joint constraints on both vision and text representations for holistic feature alignment.

Result: Outperforms existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without requiring additional augmentation or ensemble techniques.

Conclusion: The proposed optimal transport-guided framework effectively balances adaptation and generalization in vision-language models, demonstrating superior performance across various evaluation scenarios.

Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
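The OT regularizer's core ingredient is a transport cost between the feature distributions of the pre-trained and fine-tuned models. A standard way to compute such a cost between two finite feature sets is entropic OT via Sinkhorn iterations, sketched below with uniform weights. This is a generic stand-in, not the paper's exact formulation; parameters like `eps` are illustrative.

```python
import numpy as np

def sinkhorn_ot(X, Y, eps=0.5, n_iter=200):
    """Entropic optimal-transport cost between two feature sets (uniform
    marginals) via Sinkhorn iterations — a generic illustration of the
    kind of OT term used to keep fine-tuned features close to the
    pre-trained distribution."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2
    K = np.exp(-C / eps)              # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iter):          # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return float(np.sum(P * C))
```

Because OT compares whole distributions rather than matching features point-wise, a small cost still leaves the prompt-tuned features room to move, which is the "expanded feasible parameter space" the abstract describes.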

[110] Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation

Abu Taib Mohammed Shahjahan, A. Ben Hamza

Main category: cs.CV

TL;DR: PoseKAN introduces an adaptive graph Kolmogorov-Arnold Network framework for 2D-to-3D pose estimation that overcomes GCN limitations by using learnable activation functions and multi-hop feature aggregation.

DetailsMotivation: GCN-based methods for 3D human pose estimation have limited receptive fields that struggle with long-range dependencies needed for handling occlusions and depth ambiguities, and exhibit spectral bias that prioritizes low-frequency components over high-frequency details.

Method: Extends KANs to graph-based learning with learnable functions on graph edges, multi-hop feature aggregation for local and distant neighbor information, residual PoseKAN blocks for deeper refinement, and global response normalization for feature selectivity.

Result: Extensive experiments on benchmark datasets demonstrate competitive performance against state-of-the-art methods.

Conclusion: PoseKAN provides a more expressive and adaptive framework for 3D pose estimation by overcoming GCN limitations through learnable activation functions and improved spatial awareness.

Abstract: Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture long-range dependencies essential for handling occlusions and depth ambiguities. They also exhibit spectral bias, which prioritizes low-frequency components while struggling to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold Network (KAN) framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs that use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This enhances the model’s adaptability, making it more expressive in learning complex pose variations. Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors, leading to improved spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement, and a global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate the competitive performance of our model against state-of-the-art methods.

[111] SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph

Jingjie He, Weijie Liang, Zihan Shan, Matthew Caesar

Main category: cs.CV

TL;DR: SIFT-Graph is a multimodal defense framework that enhances vision model robustness by combining SIFT keypoints with Graph Attention Networks to create structure-aware features resilient to adversarial attacks.

DetailsMotivation: Traditional vision models rely on fragile pixel-level representations vulnerable to adversarial attacks, lacking mechanisms to incorporate inherently robust visual features.

Method: Integrates Scale-Invariant Feature Transform keypoints with Graph Attention Network to capture scale/rotation invariant local structures, then fuses these robust embeddings with traditional vision models like Vision Transformers and CNNs.

Result: Effectively improves model robustness against gradient-based white box adversarial attacks while maintaining only marginal drop in clean accuracy.

Conclusion: The framework successfully creates a unified, structure-aware defensive model that enhances adversarial robustness by leveraging multimodal feature aggregation.

Abstract: Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain, lacking mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform keypoints with a Graph Attention Network to capture scale and rotation invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with traditional vision models, such as Vision Transformers and Convolutional Neural Networks, to form a unified, structure-aware and perturbation-defensive model. Preliminary results demonstrate that our method effectively improves the robustness of vision models against gradient-based white-box adversarial attacks, while incurring only a marginal drop in clean accuracy.

[112] DT-NVS: Diffusion Transformers for Novel View Synthesis

Wonbong Jang, Jonathan Tremblay, Lourdes Agapito

Main category: cs.CV

TL;DR: DT-NVS is a 3D diffusion model for generalized novel view synthesis from single images, trained on real-world videos with image-only losses and novel transformer architectures.

DetailsMotivation: Existing methods focus on small camera movements or unnatural object-centric scenes, limiting real-world applications. The goal is to generate novel views of natural everyday scenes from single images.

Method: Uses a 3D-aware diffusion model with a transformer backbone, novel camera conditioning strategies for unaligned datasets, and a training paradigm that swaps the reference-frame role between the conditioning image and the noisy input.

Result: Shows improvements over state-of-the-art 3D diffusion models and deterministic approaches on generalized novel view synthesis, generating diverse outputs.

Conclusion: DT-NVS successfully addresses the under-explored problem of natural scene novel view synthesis from single images, outperforming existing methods through innovative transformer architectures and training strategies.

Abstract: Generating novel views of a natural scene, e.g., everyday scenes both indoors and outdoors, from a single view is an under-explored problem, even though it is an organic extension to object-centric novel view synthesis. Existing diffusion-based approaches focus rather on small camera movements in real scenes or only consider unnatural object-centric scenes, limiting their potential applications in real-world settings. In this paper we move away from these constrained regimes and propose a 3D diffusion model trained with image-only losses on a large-scale dataset of real-world, multi-category, unaligned, and casually acquired videos of everyday scenes. We propose DT-NVS, a 3D-aware diffusion model for generalized novel view synthesis that exploits a transformer-based architecture backbone. We make significant contributions to transformer and self-attention architectures to translate images to 3D representations, and novel camera conditioning strategies to allow training on real-world unaligned datasets. In addition, we introduce a novel training paradigm swapping the role of reference frame between the conditioning image and the sampled noisy input. We evaluate our approach on the 3D task of generalized novel view synthesis from a single input image and show improvements over state-of-the-art 3D-aware diffusion models and deterministic approaches, while generating diverse outputs.

[113] Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms

Jiaxun Guo, Manar Amayri, Nizar Bouguila, Xin Liu, Wentao Fan

Main category: cs.CV

TL;DR: Proposes Shadow-informed Pose Feature (SiPF) and Rotation-invariant Attention Convolution (RIAttnConv) to address wing-tip feature collapse in 3D point cloud analysis by preserving global pose awareness while maintaining rotation invariance.

DetailsMotivation: Existing rotation-invariant methods lose global pose information, making them unable to distinguish geometrically similar but spatially distinct structures (wing-tip feature collapse).

Method: Introduces SiPF that augments local RI descriptors with globally consistent reference points (‘shadows’) from learned shared rotation, and RIAttnConv operator that integrates SiPFs into feature aggregation. Uses Bingham distribution-based shadow locating module.

Result: Substantially outperforms existing RI methods on 3D classification and part segmentation benchmarks, especially in fine-grained spatial discrimination tasks under arbitrary rotations.

Conclusion: The proposed approach successfully overcomes wing-tip feature collapse by preserving global pose awareness while maintaining rotation invariance, enabling better discrimination of structurally similar components.

Abstract: Recent advances in rotation-invariant (RI) learning for 3D point clouds typically replace raw coordinates with handcrafted RI features to ensure robustness under arbitrary rotations. However, these approaches often suffer from the loss of global pose information, making them incapable of distinguishing geometrically similar but spatially distinct structures. We identify that this limitation stems from the restricted receptive field in existing RI methods, leading to Wing-tip feature collapse, a failure to differentiate symmetric components (e.g., left and right airplane wings) due to indistinguishable local geometries. To overcome this challenge, we introduce the Shadow-informed Pose Feature (SiPF), which augments local RI descriptors with a globally consistent reference point (referred to as the ‘shadow’) derived from a learned shared rotation. This mechanism enables the model to preserve global pose awareness while maintaining rotation invariance. We further propose Rotation-invariant Attention Convolution (RIAttnConv), an attention-based operator that integrates SiPFs into the feature aggregation process, thereby enhancing the model’s capacity to distinguish structurally similar components. Additionally, we design a task-adaptive shadow locating module based on the Bingham distribution over unit quaternions, which dynamically learns the optimal global rotation for constructing consistent shadows. Extensive experiments on 3D classification and part segmentation benchmarks demonstrate that our approach substantially outperforms existing RI methods, particularly in tasks requiring fine-grained spatial discrimination under arbitrary rotations.
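The "wing-tip feature collapse" the abstract describes is easy to reproduce with a toy rotation-invariant descriptor: sorted neighbor distances are unchanged by any rotation, but they are also identical for two mirror-symmetric parts, so a purely local RI feature cannot tell the left wing from the right. The sketch below is an illustration of the failure mode, not the paper's SiPF construction.

```python
import numpy as np

def local_ri_descriptor(points, center_idx, k=3):
    """Toy rotation-invariant descriptor: sorted distances from one point to
    its k nearest neighbors. Invariant to rotation — and, per the paper's
    'wing-tip collapse' observation, also blind to global pose."""
    center = points[center_idx]
    d = np.linalg.norm(points - center, axis=1)
    return np.sort(d)[1:k + 1]  # skip the zero self-distance

def random_rotation(rng):
    """Random 3x3 orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))  # fix column signs
```

Running this on a mirror-symmetric "two wing tips" point set shows both failure and invariance at once: the left- and right-tip descriptors are identical, and rotating the whole cloud leaves each descriptor unchanged. SiPF's globally consistent "shadow" reference point is precisely what breaks this tie.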

[114] SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation

Hu Cui, Wenqiang Hua, Renjing Huang, Shurui Jia, Tessai Hayama

Main category: cs.CV

TL;DR: SAS-SSM introduces a structure-aware spatiotemporal convolution and stride-based scan strategy for 3D human pose estimation, maintaining spatial structure while achieving linear complexity.

DetailsMotivation: Existing SSM-based methods flatten pose sequences, disrupting spatial structure and entangling spatial-temporal features, making it hard to capture complex pose dependencies.

Method: Uses structure-aware spatiotemporal convolution for local joint interactions, then stride-based scan for multi-scale global structural representations.

Result: SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models.

Conclusion: The proposed SAS-SSM effectively models both local and global pose information while maintaining linear computational complexity.

Abstract: Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations to flatten detected 2D pose sequences into purely temporal sequences, either locally or globally. This approach disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies. To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity. Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models. The source code is available at https://hucui2022.github.io/sasmamba_proj/.
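The stride-based scan can be sketched generically. This is our own toy construction, under the assumption that a stride-s scan visits every s-th frame while keeping each frame's joints contiguous; the paper's exact operator may differ.

```python
import numpy as np

def stride_scans(seq, strides=(1, 2, 4)):
    """Build multi-scale scan orders over a (frames, joints, dims) pose sequence.
    A stride-s scan visits every s-th frame, keeping each frame's joints as one
    contiguous chunk rather than interleaving joints across time."""
    scans = []
    for s in strides:
        for offset in range(s):               # offsets together cover all frames
            sub = seq[offset::s]              # frames offset, offset+s, ...
            scans.append(sub.reshape(-1, seq.shape[-1]))
    return scans

T, J, C = 16, 17, 3                           # frames, joints (e.g. COCO-style), channels
seq = np.arange(T * J * C, dtype=np.float64).reshape(T, J, C)
scans = stride_scans(seq)
```

Larger strides give each scan a longer temporal reach per step, while the union of offsets at each stride still covers every frame, so no joint token is dropped.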

[115] Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks

Cheng Wang, Shuisheng Zhou, Fengjiao Peng, Jin Sheng, Feng Ye, Yinli Dong

Main category: cs.CV

TL;DR: Proposes MFAVBs, a novel Vision Transformer-based architecture for contrastive image clustering that explicitly fuses features from positive pairs through multiple fusion-augmentation blocks, outperforming state-of-the-art methods.

DetailsMotivation: Existing contrastive learning networks with parameter sharing or momentum updating may not fully exploit complementarity and similarity of positive pairs for clustering feature extraction.

Method: Designs MFAVBs using ViTs: two augmentations as positive pairs go through shared-weight ViTs, features are fused into a larger ViT, then split into new augmented pairs for multiple fusion-augmentation cycles. Uses CLIP-extracted features for preprocessing.

Result: Experiments on seven public datasets show MFAVBs as contrastive clustering backbone outperforms state-of-the-art techniques in clustering performance.

Conclusion: MFAVBs effectively enhance contrastive clustering by explicitly fusing positive pair features through multiple fusion-augmentation operations, demonstrating superior performance over existing methods.

Abstract: In the field of image clustering, the widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity between negative pairs of the inputs. Extant contrastive learning networks, whose two encoders often interact only implicitly through parameter sharing or momentum updating, may not fully exploit the complementarity and similarity of the positive pairs to extract clustering features from input data. To explicitly fuse the learned features of positive pairs, we design novel multiple fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT). First, two preprocessed augmentations forming a positive pair are separately fed into two shared-weight ViTs, and their output features are fused and passed into a larger ViT. Second, the learned features are split into a pair of new augmented positive samples and passed to the next FAVB, enabling repeated fusion and augmentation through the MFAVBs. Finally, the learned features are projected into both instance-level and cluster-level spaces to calculate the cross-entropy loss, followed by parameter updates via backpropagation to complete the training process. To further enhance the model's ability to distinguish between similar images, the inputs to our network are preprocessed augmentations with features extracted from the pretrained CLIP model. Our experiments on seven public datasets demonstrate that MFAVBs, serving as the backbone for contrastive clustering, outperform state-of-the-art techniques in terms of clustering performance.
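The instance-level contrastive objective described above is typically an NT-Xent-style cross-entropy over positive pairs. Here is a minimal NumPy sketch of that generic loss; the temperature and batch values are illustrative, not the authors' settings.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over a batch of positive pairs: z1[i] and z2[i] are
    embeddings of two augmentations of sample i."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity space
    sim = z @ z.T / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                      # exclude self-similarity
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # each row's positive
    logits = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.05 * rng.normal(size=(8, 16))   # well-aligned positive pairs
loss_aligned = nt_xent(z1, z2)
loss_random = nt_xent(z1, rng.normal(size=(8, 16)))  # unrelated "pairs"
```

Well-aligned positive pairs yield a much lower loss than unrelated ones, which is the pressure that drives the fused features toward cluster-friendly representations.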

[116] Classifying Histopathologic Glioblastoma Sub-regions with EfficientNet

Sanyukta Adap, Ujjwal Baid, Spyridon Bakas

Main category: cs.CV

TL;DR: A deep learning approach using EfficientNet architectures was developed to classify six histopathological regions in glioblastoma (GBM) tissue sections, achieving high training performance but showing generalization challenges on validation and test data.

DetailsMotivation: To enable automated, robust identification of distinct histological sub-regions in GBM for better morphological understanding, as current clinical diagnostics have not substantially improved patient prognosis despite advancements.

Method: Four-step deep learning approach using EfficientNet architectures (B0-B4) evaluated on BraTS-Path 2024 dataset with digitized H&E stained GBM tissue sections annotated for six regions, using 5-fold cross-validation.

Result: EfficientNet-B1 and B4 achieved F1 score of 0.98 on training data, but performance dropped to 0.546 on validation and 0.517 on test data for 6-class classification, highlighting generalization challenges.

Conclusion: While the approach shows promise for automated GBM histopathological analysis, the performance gap between training and test data underscores the critical need for models that generalize well to new data for clinical applications.

Abstract: Glioblastoma (GBM) is the most common aggressive, fast-growing brain tumor, with a grim prognosis. Despite clinical diagnostic advancements, there have not been any substantial improvements to patient prognosis. Histopathological assessment of excised tumors is the first line of clinical diagnostic routine. We hypothesize that automated, robust, and accurate identification of distinct histological sub-regions within GBM could contribute to morphologically understanding this disease at scale. In this study, we designed a four-step deep learning approach to classify six (6) histopathological regions and quantitatively evaluated it on the BraTS-Path 2024 challenge dataset, which includes digitized Hematoxylin & Eosin (H&E) stained GBM tissue sections annotated for six distinct regions. We used the challenge’s publicly available training dataset to develop and evaluate the effectiveness of several variants of EfficientNet architectures (i.e., B0, B1, B2, B3, B4). EfficientNet-B1 and EfficientNet-B4 achieved the best performance, with an F1 score of 0.98 in a 5-fold cross-validation configuration using the BraTS-Path training set. The quantitative performance evaluation of our proposed approach with EfficientNet-B1 on the BraTS-Path hold-out validation data and the final hidden testing data yielded F1 scores of 0.546 and 0.517, respectively, for the associated 6-class classification task. The difference in the performance on training, validation, and testing data highlights the challenge of developing models that generalize well to new data, which is crucial for clinical applications. The source code of the proposed approach can be found at the GitHub repository of Indiana University Division of Computational Pathology: https://github.com/IUCompPath/brats-path-2024-enet.

[117] Improving VisNet for Object Recognition

Mehdi Fatan Serj, C. Alejandro Parraga, Xavier Otazu

Main category: cs.CV

TL;DR: Enhanced VisNet variants with RBF neurons, Mahalanobis distance learning, and retinal preprocessing significantly improve object recognition and symmetry classification across multiple datasets compared to baseline VisNet.

DetailsMotivation: To bridge the gap between efficient human visual object recognition and artificial systems by developing biologically inspired neural networks that can perform robust and transformation-invariant visual recognition.

Method: Developed enhanced VisNet variants incorporating radial basis function neurons, Mahalanobis distance-based learning, and retina-like preprocessing, leveraging Hebbian learning and temporal continuity to build invariant representations.

Result: Experimental results across MNIST, CIFAR10, and custom symmetric object datasets show that enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model.

Conclusion: VisNet-inspired architectures offer a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence, demonstrating adaptability and biological relevance.

Abstract: Object recognition plays a fundamental role in how biological organisms perceive and interact with their environment. While the human visual system performs this task with remarkable efficiency, reproducing similar capabilities in artificial systems remains challenging. This study investigates VisNet, a biologically inspired neural network model, and several enhanced variants incorporating radial basis function neurons, Mahalanobis distance-based learning, and retina-like preprocessing for both general object recognition and symmetry classification. By leveraging principles of Hebbian learning and temporal continuity, associating temporally adjacent views to build invariant representations, VisNet and its extensions capture robust, transformation-invariant features. Experimental results across multiple datasets, including MNIST, CIFAR10, and custom symmetric object sets, show that these enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model. These findings underscore the adaptability and biological relevance of VisNet-inspired architectures, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence. Keywords: VisNet, Object Recognition, Symmetry Detection, Hebbian Learning, RBF Neurons, Mahalanobis Distance, Biologically Inspired Models, Invariant Representations

[118] Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

Riling Wei, Kelu Yao, Chuanguang Yang, Jin Wang, Zhuoyan Gao, Chao Li

Main category: cs.CV

TL;DR: The paper proposes SemBridge, a framework for Asymmetric Cross-modal Knowledge Distillation (ACKD) that enables knowledge transfer between modalities with limited semantic overlap, addressing the limitations of traditional symmetric approaches that require strongly paired data.

DetailsMotivation: Traditional Cross-modal Knowledge Distillation (SCKD) requires strongly paired modalities with strong semantic connections, which is often unavailable in real-world scenarios. The authors aim to address this limitation by exploring knowledge distillation under weak semantic consistency.

Method: The proposed SemBridge framework includes two key modules: 1) Student-Friendly Matching module that uses self-supervised learning to acquire semantic-based knowledge and provides personalized instruction by dynamically selecting relevant teacher samples, and 2) Semantic-aware Knowledge Alignment module that employs Lagrangian optimization to find optimal transport paths.

Result: The framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets, particularly on a curated benchmark dataset of Multi-Spectral and asymmetric RGB images for remote sensing scene classification.

Conclusion: The proposed ACKD approach and SemBridge framework effectively address the challenges of knowledge distillation under weak semantic consistency, providing a more flexible and practical solution for real-world applications where strongly paired modalities are unavailable.

Abstract: Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verified based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments exhibit that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
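The Semantic-aware Knowledge Alignment module "seeks the optimal transport path by employing Lagrangian optimization". The paper's exact solver isn't given here, but a standard entropy-regularized Sinkhorn iteration shows what computing a transport plan between teacher and student sample distributions looks like; all values below are illustrative.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.1, n_iter=200):
    """Entropy-regularized optimal transport: alternately rescale the Gibbs
    kernel so the plan's marginals match a (rows) and b (columns)."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy cost of matching 3 teacher samples to 3 student samples
cost = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
a = np.full(3, 1 / 3)   # uniform mass on teacher samples
b = np.full(3, 1 / 3)   # uniform mass on student samples
plan = sinkhorn_plan(cost, a, b)
```

The resulting plan concentrates mass where the cost is low (here, near the diagonal), i.e., it routes each student sample's supervision through the semantically cheapest teacher samples.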

[119] LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.CV

TL;DR: A framework that enhances semi-supervised document layout detection by fusing visual predictions with structural priors from LLMs using probabilistic weighting, achieving state-of-the-art results with minimal labeled data.

DetailsMotivation: Document layout understanding remains data-intensive despite advances in semi-supervised learning, and there's a need to leverage LLMs' structural understanding to reduce annotation requirements.

Method: OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels with principled probabilistic weighting.

Result: Achieved 88.2±0.3 AP with lightweight SwiftFormer using only 5% labels on PubLayNet, and 89.7±0.4 AP with LayoutLMv3, surpassing standard semi-supervised learning and matching UDOP which requires extensive pretraining.

Conclusion: LLM structural priors are complementary to both lightweight and pretrained architectures, enabling privacy-preserving deployment and providing targeted semantic disambiguation beyond simple text heuristics.

Abstract: Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels. Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2$\pm$0.3 AP using only 5% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7$\pm$0.4 AP, surpassing both LayoutLMv3 with standard semi-supervised learning (89.1$\pm$0.4 AP, p=0.02) and matching UDOP~\cite{udop} (89.8 AP) which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7% of cases, +3.8 AP gain) beyond simple text heuristics. Total system cost includes $12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.
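The inverse-variance fusion step can be sketched directly. The box coordinates and variances below are made-up illustrations, and the paper's actual parameterization of detector and LLM uncertainty may differ.

```python
import numpy as np

def inverse_variance_fuse(mu_a, var_a, mu_b, var_b):
    """Fuse two noisy estimates by inverse-variance weighting: the more
    certain source gets proportionally more weight, and the fused variance
    is smaller than either input's."""
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    mu = (w_a * mu_a + w_b * mu_b) / (w_a + w_b)
    var = 1.0 / (w_a + w_b)
    return mu, var

# Hypothetical (x1, y1, x2, y2) box from the teacher detector vs. the LLM-inferred region
teacher_box = np.array([100.0, 50.0, 300.0, 400.0])
teacher_var = np.array([25.0, 25.0, 25.0, 25.0])      # confident visual detector
llm_box = np.array([110.0, 60.0, 290.0, 390.0])
llm_var = np.array([100.0, 100.0, 100.0, 100.0])      # noisier structural prior

fused_box, fused_var = inverse_variance_fuse(teacher_box, teacher_var, llm_box, llm_var)
```

The fused estimate leans toward the lower-variance detector while still absorbing the LLM prior, and its variance drops below both inputs', which is what makes the refined pseudo-labels sharper than either source alone.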

[120] Consistency Change Detection Framework for Unsupervised Remote Sensing Change Detection

Yating Liu, Yan Lu

Main category: cs.CV

TL;DR: Proposes CCDF framework with Cycle Consistency and Semantic Consistency modules to address generator overfitting in unsupervised remote sensing change detection, outperforming state-of-the-art methods.

DetailsMotivation: Previous unsupervised methods suffer from poor performance due to generator overfitting when attempting style transfer across multi-temporal remote sensing images for change detection.

Method: Introduces Consistency Change Detection Framework (CCDF) with two modules: Cycle Consistency (CC) to reduce generator overfitting, and Semantic Consistency (SC) for detail reconstruction.

Result: Extensive experiments show the method outperforms other state-of-the-art approaches in unsupervised remote sensing change detection.

Conclusion: The proposed CCDF framework effectively addresses generator overfitting issues and improves change detection performance through consistency modules.

Abstract: Unsupervised remote sensing change detection aims to monitor and analyze changes from multi-temporal remote sensing images in the same geometric region at different times, without the need for labeled training data. Previous unsupervised methods attempt to achieve style transfer across multi-temporal remote sensing images through reconstruction by a generator network, and then capture the unreconstructable areas as the changed regions. However, it often leads to poor performance due to generator overfitting. In this paper, we propose a novel Consistency Change Detection Framework (CCDF) to address this challenge. Specifically, we introduce a Cycle Consistency (CC) module to reduce the overfitting issues in the generator-based reconstruction. Additionally, we propose a Semantic Consistency (SC) module to enable detail reconstruction. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches.

[121] HitoMi-Cam: A Shape-Agnostic Person Detection Method Using the Spectral Characteristics of Clothing

Shuji Ono

Main category: cs.CV

TL;DR: HitoMi-Cam is a lightweight, shape-agnostic person detection method using spectral reflectance of clothing, achieving 23.2 fps on edge devices and outperforming CNNs in unpredictable shape scenarios like disaster rescue.

DetailsMotivation: CNN-based object detection suffers from shape dependency that degrades performance for postures not in training data, particularly problematic in real-world scenarios like disaster rescue where human shapes are unpredictable.

Method: Spectral-based approach using clothing reflectance properties, implemented on resource-constrained edge devices without GPU. Method is shape-agnostic and focuses on spectral characteristics rather than visual shapes.

Result: Achieved 23.2 fps processing speed (253x190 pixels) and 93.5% AP in search/rescue scenarios, significantly outperforming CNN models (best AP 53.8%). Minimal false positives across all evaluation scenarios.

Conclusion: HitoMi-Cam serves as complementary tool to CNNs for specific conditions, demonstrating spectral-based detection is viable for real-time edge applications in unpredictable shape environments like disaster rescue.

Abstract: While convolutional neural network (CNN)-based object detection is widely used, it exhibits a shape dependency that degrades performance for postures not included in the training data. Building upon our previous simulation study published in this journal, this study implements and evaluates the spectral-based approach on physical hardware to address this limitation. Specifically, this paper introduces HitoMi-Cam, a lightweight and shape-agnostic person detection method that uses the spectral reflectance properties of clothing. The author implemented the system on a resource-constrained edge device without a GPU to assess its practical viability. The results indicate that a processing speed of 23.2 frames per second (fps) (253x190 pixels) is achievable, suggesting that the method can be used for real-time applications. In a simulated search and rescue scenario where the performance of CNNs declines, HitoMi-Cam achieved an average precision (AP) of 93.5%, surpassing that of the compared CNN models (best AP of 53.8%). Throughout all evaluation scenarios, the occurrence of false positives remained minimal. This study positions the HitoMi-Cam method not as a replacement for CNN-based detectors but as a complementary tool under specific conditions. The results indicate that spectral-based person detection can be a viable option for real-time operation on edge devices in real-world environments where shapes are unpredictable, such as disaster rescue.
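A shape-agnostic spectral rule can be illustrated with a toy normalized-difference index. HitoMi-Cam's actual bands and decision rule are not given in this summary, so the bands, threshold, and values below are hypothetical stand-ins.

```python
import numpy as np

def spectral_mask(band_a, band_b, threshold=0.2):
    """Per-pixel normalized-difference index between two spectral bands,
    thresholded into a binary detection mask. No geometry is used, so the
    rule is indifferent to posture or silhouette."""
    a = band_a.astype(np.float64)
    b = band_b.astype(np.float64)
    index = (a - b) / np.maximum(a + b, 1e-9)
    return index > threshold

# Toy 4x4 scene: "clothing" pixels (left half) reflect strongly in band A
band_a = np.array([[0.9, 0.9, 0.1, 0.1]] * 4)
band_b = np.array([[0.2, 0.2, 0.1, 0.1]] * 4)
mask = spectral_mask(band_a, band_b)
```

Because the mask is computed per pixel from reflectance alone, detection does not degrade for postures absent from any training set, which is the core advantage claimed over CNN detectors.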

[122] Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

Zimao Lu, Hui Xu, Bing Liu, Ke Wang

Main category: cs.CV

TL;DR: Negative Entity Suppression (NES) method improves zero-shot image captioning by reducing hallucinations through filtering negative entities from retrieved content and applying attention-level suppression.

DetailsMotivation: Text-only training for zero-shot image captioning suffers from poor cross-domain generalization and hallucination when encountering novel visual environments, while retrieval-based methods can exacerbate this issue with irrelevant entities.

Method: NES integrates three stages: (1) uses synthetic images for consistent image-to-text retrieval, (2) filters negative entities from retrieved content, and (3) applies attention-level suppression using identified negative entities.

Result: NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving state-of-the-art results in zero-shot image captioning.

Conclusion: The proposed NES method effectively addresses hallucination in zero-shot image captioning by suppressing negative entities, demonstrating superior cross-domain generalization capabilities.

Abstract: Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities, objects that appear in a generated caption but are absent from the input, and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC. Our code is available at https://github.com/nidongpinyinme/NESCap.
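Attention-level suppression can be sketched as a pre-softmax penalty on tokens flagged as negative entities. This is a generic mechanism with an arbitrary penalty value, not NES's exact formulation.

```python
import numpy as np

def suppress_attention(logits, negative_mask, penalty=10.0):
    """Push down the attention logits of flagged tokens before the softmax,
    so they receive almost no attention mass while the distribution stays
    normalized over the remaining tokens."""
    logits = logits - penalty * negative_mask
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.5, 1.0, 0.5])
neg = np.array([0.0, 1.0, 0.0, 0.0])   # token 1 flagged as a negative entity
attn = suppress_attention(logits, neg)
base = suppress_attention(logits, np.zeros(4))
```

The flagged token's attention weight collapses to near zero while the others renormalize, starving hallucination-prone features of influence on the generated caption.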

[123] SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization

Tianyu Guo, Shanwei Zhao, Shiai Zhu, Chenguang Ma

Main category: cs.CV

TL;DR: SPEED-Q is a novel framework for quantizing Vision-Language Models (1B-2B parameters) to low bits (2-4 bits) for edge deployment, addressing modality sensitivity differences and training instability through staged processing and enhanced distillation.

DetailsMotivation: Enable deployment of VLMs on resource-constrained edge devices through aggressive quantization, as existing methods rarely explore low-bit quantization for small-scale VLMs suitable for edge deployment.

Method: Proposes SPEED-Q framework with: 1) staged sensitivity adaptive mechanism to harmonize performance across vision and language components, 2) distillation-enhanced quantization strategy to stabilize training and reduce data dependence.

Result: Achieves up to 6x higher accuracy than existing methods under 2-bit settings, consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings across multiple benchmarks.

Conclusion: SPEED-Q enables accurate, stable, and data-efficient quantization of complex VLMs, making it the first framework tailored for quantizing entire small-scale billion-parameter VLMs to low bits for edge deployment.

Abstract: Deploying Vision-Language Models (VLMs) on edge devices (e.g., smartphones and robots) is crucial for enabling low-latency and privacy-preserving intelligent applications. Given the resource constraints of these devices, quantization offers a promising solution by improving memory efficiency and reducing bandwidth requirements, thereby facilitating the deployment of VLMs. However, existing research has rarely explored aggressive quantization on VLMs, particularly for the models ranging from 1B to 2B parameters, which are more suitable for resource-constrained edge devices. In this paper, we propose SPEED-Q, a novel Staged Processing with Enhanced Distillation framework for VLM low-bit weight-only quantization that systematically addresses the following two critical obstacles: (1) significant discrepancies in quantization sensitivity between vision (ViT) and language (LLM) components in VLMs; (2) training instability arising from the reduced numerical precision inherent in low-bit quantization. In SPEED-Q, a staged sensitivity adaptive mechanism is introduced to effectively harmonize performance across different modalities. We further propose a distillation-enhanced quantization strategy to stabilize the training process and reduce data dependence. Together, SPEED-Q enables accurate, stable, and data-efficient quantization of complex VLMs. SPEED-Q is the first framework tailored for quantizing entire small-scale billion-parameter VLMs to low bits. Extensive experiments across multiple benchmarks demonstrate that SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings and consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings. Our code and models are available at https://github.com/antgroup/SPEED-Q.
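The 2-bit vs. 4-bit gap SPEED-Q targets can be made concrete with a plain round-to-nearest symmetric quantizer. This baseline is not SPEED-Q's method, only the standard weight-only scheme such frameworks improve upon.

```python
import numpy as np

def quantize_weights(w, bits=2):
    """Symmetric uniform weight-only quantization to `bits` bits: returns
    integer codes and the per-tensor scale needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1              # 1 for 2-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.05, size=(64, 64))   # a stand-in weight matrix
q4, s4 = quantize_weights(w, bits=4)
q2, s2 = quantize_weights(w, bits=2)
err4 = np.abs(dequantize(q4, s4) - w).mean()
err2 = np.abs(dequantize(q2, s2) - w).mean()
```

With only four representable levels at 2 bits, the reconstruction error grows sharply relative to 4 bits, which is why sensitivity-adaptive staging and distillation become necessary at that precision.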

[124] Machines Serve Human: A Novel Variable Human-machine Collaborative Compression Framework

Zifu Zhang, Shengxi Li, Xiancheng Sun, Mai Xu, Zhengyuan Liu, Jingyuan Xia

Main category: cs.CV

TL;DR: Diff-FCHM is a novel collaborative compression method that uses machine-vision-oriented compression as the basis for human vision, employing diffusion priors to restore high-fidelity details for human vision while maintaining efficiency for machine vision tasks.

DetailsMotivation: Existing collaborative compression methods are built on human-vision pipelines, which are inefficient for machine vision since machines only need core regions of images/videos. This creates complexity and bit-rate issues when aggregating machine-vision compression.

Method: Proposes Diff-FCHM which uses machine-vision-oriented compression as the foundation, progressively aggregates semantics from machine-vision compression, and uses diffusion priors to restore high-fidelity details for human vision. Includes a plug-and-play variable bit-rate strategy for machine vision tasks.

Result: Experimental results show consistently superior performance on both machine-vision and human-vision compression with remarkable margins compared to existing methods.

Conclusion: Diff-FCHM successfully demonstrates that using machine-vision-oriented compression as the basis for collaborative compression, combined with diffusion priors for detail restoration, achieves superior performance for both human and machine vision tasks.

Abstract: Human-machine collaborative compression has been receiving increasing research effort for reducing image/video data, serving as the basis for both human perception and machine intelligence. Existing collaborative methods are dominantly built upon the de facto human-vision compression pipeline, exhibiting deficiencies in complexity and bit-rate when aggregating the machine-vision compression. Indeed, machine vision solely focuses on the core regions within the image/video, requiring much less information compared with the compressed information for human vision. In this paper, we thus make the first successful attempt at a novel collaborative compression method based on machine-vision-oriented compression, instead of the human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. A plug-and-play variable bit-rate strategy is also developed for machine vision tasks. Then, we propose to progressively aggregate the semantics from the machine-vision compression, whilst seamlessly tailoring the diffusion prior to restore high-fidelity details for human vision, thus named as diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify the consistently superior performance of our Diff-FCHM, on both machine-vision and human-vision compression with remarkable margins. Our code will be released upon acceptance.

[125] From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model

Hanbo Cheng, Peng Wang, Kaixiang Lei, Qi Li, Zhen Zou, Pengfei Hu, Jun Du

Main category: cs.CV

TL;DR: Hierarchical Distillation (HD) framework combines trajectory-based and distribution-based distillation methods to achieve high-fidelity single-step diffusion models, achieving FID of 2.26 on ImageNet 256×256.

DetailsMotivation: Address the fundamental trade-off between trajectory-based methods (preserve global structure but lose high-frequency details) and distribution-based methods (higher fidelity but suffer from mode collapse and unstable training) in diffusion model distillation.

Method: Proposes Hierarchical Distillation framework: 1) Use trajectory distillation to create structural “sketch” as initialization, 2) Apply distribution-based refinement with Adaptive Weighted Discriminator (AWD) that dynamically allocates token weights to focus on local imperfections.

Result: State-of-the-art performance: single-step model achieves FID of 2.26 on ImageNet 256×256 (rivaling 250-step teacher), and promising results on high-resolution text-to-image MJHQ benchmark.

Conclusion: Establishes a robust new paradigm for high-fidelity, single-step diffusion models by synergistically combining trajectory and distribution distillation approaches.

Abstract: The inference latency of diffusion models remains a critical barrier to their real-time application. While trajectory-based and distribution-based step distillation methods offer solutions, they present a fundamental trade-off. Trajectory-based methods preserve global structure but act as a “lossy compressor”, sacrificing high-frequency details. Conversely, distribution-based methods can achieve higher fidelity but often suffer from mode collapse and unstable training. This paper recasts them from independent paradigms into synergistic components within our novel Hierarchical Distillation (HD) framework. We leverage trajectory distillation not as a final generator, but to establish a structural “sketch”, providing a near-optimal initialization for the subsequent distribution-based refinement stage. This strategy yields an ideal initial distribution that enhances the ceiling of overall performance. To further improve quality, we introduce and refine the adversarial training process. We find standard discriminator structures are ineffective at refining an already high-quality generator. To overcome this, we introduce the Adaptive Weighted Discriminator (AWD), tailored for the HD pipeline. By dynamically allocating token weights, AWD focuses on local imperfections, enabling efficient detail refinement. Our approach demonstrates state-of-the-art performance across diverse tasks. On ImageNet 256×256, our single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, proving its generalizability. Our method establishes a robust new paradigm for high-fidelity, single-step diffusion models.

[126] Boosting Adversarial Transferability via Ensemble Non-Attention

Yipeng Zou, Qin Liu, Jie Wu, Yu Peng, Guo Chen, Hui Zhou, Guanghui Ye

Main category: cs.CV

TL;DR: NAMEA is a novel ensemble attack that integrates gradients from non-attention areas of heterogeneous models to improve adversarial transferability across different architectures like CNNs and ViTs.

DetailsMotivation: Previous ensemble attacks show poor performance when transferring across heterogeneous model architectures due to widely differing gradient update directions, making it hard to reduce gradient variance while utilizing individual models effectively.

Method: Decouples gradients from attention and non-attention areas of ensemble models, then merges them using meta-learning. Specifically integrates gradients from non-attention areas into iterative gradient optimization.

Result: Outperforms state-of-the-art ensemble attacks AdaEA and SMER by average margins of 15.0% and 9.6% respectively on ImageNet dataset.

Conclusion: First work to explore ensemble non-attention for boosting cross-architecture transferability, providing new insights into launching effective ensemble attacks.

Abstract: Ensemble attacks integrate the outputs of surrogate models with diverse architectures, which can be combined with various gradient-based attacks to improve adversarial transferability. However, previous work shows unsatisfactory attack performance when transferring across heterogeneous model architectures. The main reason is that the gradient update directions of heterogeneous surrogate models differ widely, making it hard to reduce the gradient variance of ensemble models while making the best of each individual model. To tackle this challenge, we design a novel ensemble attack, NAMEA, which for the first time integrates the gradients from the non-attention areas of ensemble models into the iterative gradient optimization process. Our design is inspired by the observation that the attention areas of heterogeneous models vary sharply; thus the non-attention areas of ViTs are likely to be the focus of CNNs, and vice versa. Therefore, we merge the gradients respectively from the attention and non-attention areas of ensemble models so as to fuse the transfer information of CNNs and ViTs. Specifically, we pioneer a new way of decoupling the gradients of non-attention areas from those of attention areas, while merging gradients by meta-learning. Empirical evaluations on the ImageNet dataset indicate that NAMEA outperforms the state-of-the-art ensemble attacks AdaEA and SMER by averages of 15.0% and 9.6%, respectively. This work is the first attempt to explore the power of ensemble non-attention in boosting cross-architecture transferability, providing new insights into launching ensemble attacks.
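The masked split-and-merge idea behind NAMEA can be sketched in a few lines. This is a simplified stand-in, not the paper's method: NAMEA merges via meta-learning, whereas the fixed mixing weight `lam` and all names below are hypothetical.

```python
def merge_ensemble_gradients(grads, attn_masks, lam=0.5):
    """Toy ensemble gradient merge: for each surrogate model, split its
    input gradient into attention-area and non-attention-area parts with
    a binary mask, reweight the two parts, and average across models."""
    n, dim = len(grads), len(grads[0])
    merged = [0.0] * dim
    for g, mask in zip(grads, attn_masks):
        for i in range(dim):
            # attention-area components get weight lam,
            # non-attention-area components get weight (1 - lam)
            w = lam if mask[i] else (1.0 - lam)
            merged[i] += w * g[i] / n
    return merged
```

With `lam < 0.5` the regions one architecture ignores (but another attends to) contribute more to the merged update, which is the intuition behind fusing CNN and ViT transfer information.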

[127] Neural B-frame Video Compression with Bi-directional Reference Harmonization

Yuxi Liu, Dengchao Jin, Shuai Huo, Jiawen Gu, Chao Zhou, Huihui Bai, Ming Lu, Zhan Ma

Main category: cs.CV

TL;DR: BRHVC is a novel neural B-frame video compression method that harmonizes bi-directional references using Bi-directional Motion Converge and Bi-directional Contextual Fusion to improve compression performance.

DetailsMotivation: Neural B-frame video compression is underexplored compared to P-frame compression, and existing hierarchical coding can cause unbalanced contribution from reference frames due to large frame spans.

Method: Proposes Bi-directional Motion Converge (BMC) to converge multiple optical flows for more accurate motion compensation, and Bi-directional Contextual Fusion (BCF) to model reference context weights based on motion compensation accuracy.

Result: BRHVC outperforms previous state-of-the-art neural video compression methods and even surpasses traditional coding VTM-RA on HEVC datasets under random access configuration.

Conclusion: The proposed BRHVC method effectively harmonizes bi-directional references through efficient motion and context utilization, achieving superior compression performance.

Abstract: Neural video compression (NVC) has made significant progress in recent years, while neural B-frame video compression (NBVC) remains underexplored compared to P-frame compression. NBVC can adopt bi-directional reference frames for better compression performance. However, NBVC’s hierarchical coding may complicate continuous temporal prediction, especially at some hierarchical levels with a large frame span, which could cause the contribution of the two reference frames to be unbalanced. To optimize reference information utilization, we propose a novel NBVC method, termed Bi-directional Reference Harmonization Video Compression (BRHVC), with the proposed Bi-directional Motion Converge (BMC) and Bi-directional Contextual Fusion (BCF). BMC converges multiple optical flows in motion compression, leading to more accurate motion compensation on a larger scale. BCF then explicitly models the weights of reference contexts under the guidance of motion compensation accuracy. With more efficient motions and contexts, BRHVC can effectively harmonize bi-directional references. Experimental results indicate that our BRHVC outperforms previous state-of-the-art NVC methods, even surpassing the traditional codec VTM-RA (under the random access configuration) on the HEVC datasets. The source code is released at https://github.com/kwai/NVC.
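The core of BCF, weighting the two reference contexts by how accurately each was motion-compensated, can be illustrated with a toy scalar version. The softmax-over-negative-error weighting and all names below are illustrative assumptions, not the paper's learned fusion:

```python
import math

def fuse_bidirectional(ctx_past, ctx_future, err_past, err_future):
    """Blend two warped reference contexts with weights derived from
    their motion-compensation errors: the reference compensated more
    accurately (lower error) contributes more (softmax over -error)."""
    wp, wf = math.exp(-err_past), math.exp(-err_future)
    s = wp + wf
    wp, wf = wp / s, wf / s
    return [wp * p + wf * f for p, f in zip(ctx_past, ctx_future)]
```

When one reference sits across a large frame span and warps poorly, its weight collapses toward zero instead of contributing an unbalanced, misaligned context.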

[128] FGM-HD: Boosting Generation Diversity of Fractal Generative Models through Hausdorff Dimension Induction

Haowei Zhang, Yuanpei Zhao, Jizhe Zhou, Mao Li

Main category: cs.CV

TL;DR: Proposes FGM-HD framework that uses Hausdorff Dimension to enhance diversity in Fractal Generative Models while maintaining image quality, achieving 39% diversity improvement on ImageNet.

DetailsMotivation: Fractal Generative Models generate high-quality images but suffer from limited diversity due to inherent self-similarity, creating a need for methods that enhance diversity without compromising visual quality.

Method: Uses learnable HD estimation from image embeddings, HD-based loss with momentum-driven scheduling during training, and HD-guided rejection sampling during inference to select geometrically richer outputs.

Result: Achieves 39% improvement in output diversity compared to vanilla FGMs on ImageNet dataset while preserving comparable image quality.

Conclusion: Successfully introduces Hausdorff Dimension into FGM for the first time, effectively enhancing diversity while providing theoretical contributions to FGM development.

Abstract: Improving the diversity of generated results while maintaining high visual quality remains a significant challenge in image generation tasks. Fractal Generative Models (FGMs) are efficient in generating high-quality images, but their inherent self-similarity limits the diversity of output images. To address this issue, we propose a novel approach based on the Hausdorff Dimension (HD), a widely recognized concept in fractal geometry used to quantify structural complexity, which aids in enhancing the diversity of generated outputs. To incorporate HD into FGM, we propose a learnable HD estimation method that predicts HD directly from image embeddings, addressing computational cost concerns. However, simply introducing HD into a hybrid loss is insufficient to enhance diversity in FGMs due to: 1) degradation of image quality, and 2) limited improvement in generation diversity. To this end, during training, we adopt an HD-based loss with a monotonic momentum-driven scheduling strategy to progressively optimize the hyperparameters, obtaining optimal diversity without sacrificing visual quality. Moreover, during inference, we employ HD-guided rejection sampling to select geometrically richer outputs. Extensive experiments on the ImageNet dataset demonstrate that our FGM-HD framework yields a 39% improvement in output diversity compared to vanilla FGMs, while preserving comparable image quality. To our knowledge, this is the first work to introduce HD into FGMs. Our method effectively enhances the diversity of generated outputs while offering a principled theoretical contribution to FGM development.
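The paper predicts HD with a learned estimator from image embeddings, but the underlying quantity can be approximated with classic box counting, and HD-guided rejection sampling then reduces to an argmax over candidates. A minimal sketch under that substitution (names hypothetical; box counting over 2-D point sets stands in for the learned estimator):

```python
import math

def box_count_dimension(points, scales=(2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension (a practical proxy for the
    Hausdorff dimension) of a 2-D point set in [0, 1)^2: count occupied
    grid cells at several scales, then fit the slope of log N vs log n."""
    xs, ys = [], []
    for n in scales:
        occupied = {(int(x * n), int(y * n)) for x, y in points}
        xs.append(math.log(n))
        ys.append(math.log(len(occupied)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

def hd_rejection_sample(generate, n_candidates=8):
    """HD-guided rejection sampling: draw several candidates and keep
    the geometrically richest one (highest estimated dimension)."""
    return max((generate() for _ in range(n_candidates)),
               key=box_count_dimension)
```

A straight line of points scores near 1, an area-filling set near 2, so the argmax favors outputs with richer geometric structure.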

[129] AuthSig: Safeguarding Scanned Signatures Against Unauthorized Reuse in Paperless Workflows

RuiQiang Zhang, Zehua Ma, Guanjie Wang, Chang Liu, Hengyi Wang, Weiming Zhang

Main category: cs.CV

TL;DR: AuthSig is a static electronic signature framework that uses generative models and watermarks to bind authentication information to signature images, enabling reliable verification and preventing malicious reuse.

DetailsMotivation: Static scanned signatures are widely used but lack authentication attributes, making them vulnerable to copying and reuse. Current solutions like dynamic pressure-sensitive or PKI-based signatures are not as convenient.

Method: Uses generative models to modulate style embeddings during signature generation to implicitly encode watermark bits. Introduces keypoint-driven data augmentation to enhance style diversity for robust watermark embedding.

Result: Achieves over 98% extraction accuracy under digital-domain distortions and signature-specific degradations, and remains effective in print-scan scenarios.

Conclusion: AuthSig provides a practical solution for authenticating static electronic signatures by embedding imperceptible watermarks that enforce One Signature, One Use policy while maintaining visual authenticity.

Abstract: With the deepening trend of paperless workflows, signatures as a means of identity authentication are gradually shifting from traditional ink-on-paper to electronic formats. Despite the availability of dynamic pressure-sensitive and PKI-based digital signatures, static scanned signatures remain prevalent in practice due to their convenience. However, these static images, having almost lost their authentication attributes, cannot be reliably verified and are vulnerable to malicious copying and reuse. To address these issues, we propose AuthSig, a novel static electronic signature framework based on generative models and watermarking, which binds authentication information to the signature image. Leveraging the human visual system’s insensitivity to subtle style variations, AuthSig finely modulates style embeddings during generation to implicitly encode watermark bits, enforcing a One Signature, One Use policy. To overcome the scarcity of handwritten signature data and the limitations of traditional augmentation methods, we introduce a keypoint-driven data augmentation strategy that effectively enhances style diversity to support robust watermark embedding. Experimental results show that AuthSig achieves over 98% extraction accuracy under both digital-domain distortions and signature-specific degradations, and remains effective even in print-scan scenarios.

[130] Efficient and Effective In-context Demonstration Selection with Coreset

Zihua Wang, Jiarui Wang, Haiyang Xu, Ming Yan, Fei Huang, Xu Yang, Xiu-Shen Wei, Siya Mi, Yu Zhang

Main category: cs.CV

TL;DR: Proposes CoDR framework for efficient and effective demonstration selection in in-context learning using coreset-based dual retrieval.

DetailsMotivation: Traditional demonstration selection methods for in-context learning are inefficient or suboptimal, struggling to balance efficiency and effectiveness.

Method: Uses cluster-pruning to construct diverse coresets and dual retrieval mechanism for global demonstration selection while maintaining efficiency.

Result: Significantly improves ICL performance compared to existing strategies.

Conclusion: CoDR provides a robust solution for effective and efficient demonstration selection in in-context learning.

Abstract: In-context learning (ICL) has emerged as a powerful paradigm for Large Visual Language Models (LVLMs), enabling them to leverage a few examples directly from input contexts. However, the effectiveness of this approach is heavily reliant on the selection of demonstrations, a process that is NP-hard. Traditional strategies, including random, similarity-based sampling and infoscore-based sampling, often lead to inefficiencies or suboptimal performance, struggling to balance both efficiency and effectiveness in demonstration selection. In this paper, we propose a novel demonstration selection framework named Coreset-based Dual Retrieval (CoDR). We show that samples within a diverse subset achieve a higher expected mutual information. To implement this, we introduce a cluster-pruning method to construct a diverse coreset that aligns more effectively with the query while maintaining diversity. Additionally, we develop a dual retrieval mechanism that enhances the selection process by achieving global demonstration selection while preserving efficiency. Experimental results demonstrate that our method significantly improves the ICL performance compared to the existing strategies, providing a robust solution for effective and efficient demonstration selection.
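A diverse coreset of demonstrations can be illustrated with greedy farthest-point sampling over embedding vectors. This is a generic stand-in for the paper's cluster-pruning construction, with all names hypothetical:

```python
def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def diverse_coreset(embeddings, k):
    """Greedy farthest-point sampling: seed with the first embedding,
    then repeatedly add the candidate farthest from the current coreset.
    Returns indices into `embeddings`."""
    chosen = [0]
    # dists[i] = squared distance from embedding i to its nearest chosen point
    dists = [squared_dist(embeddings[0], e) for e in embeddings]
    while len(chosen) < k:
        nxt = max(range(len(embeddings)), key=lambda i: dists[i])
        chosen.append(nxt)
        for i, e in enumerate(embeddings):
            dists[i] = min(dists[i], squared_dist(embeddings[nxt], e))
    return chosen
```

Retrieval at query time would then search only this small, diverse pool, which is how a coreset keeps demonstration selection both effective and cheap.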

[131] WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images

Yifei Sun, Yuzhi He, Junhao Jia, Jinhong Wang, Ruiquan Ge, Changmiao Wang, Hongxia Xu

Main category: cs.CV

TL;DR: WDT-MD is a Wavelet Diffusion Transformer framework that addresses limitations in diffusion-based microaneurysm detection by preventing identity mapping, distinguishing MAs from other anomalies, and improving normal feature reconstruction.

DetailsMotivation: Current diffusion-based anomaly detection methods for microaneurysm screening suffer from identity mapping, inability to distinguish MAs from other anomalies, and poor reconstruction of normal features, limiting their clinical application.

Method: Proposed WDT-MD framework with three innovations: noise-encoded image conditioning to avoid identity mapping, pseudo-normal pattern synthesis via inpainting for pixel-level supervision, and wavelet diffusion Transformer combining global modeling with multi-scale wavelet analysis.

Result: Comprehensive experiments on IDRiD and e-ophtha MA datasets show WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level microaneurysm detection.

Conclusion: WDT-MD represents a significant advancement for improving early diabetic retinopathy screening through more accurate microaneurysm detection.

Abstract: Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 μm lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to “identity mapping”, where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid “identity mapping” by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection. This advancement holds significant promise for improving early DR screening.
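The multi-scale wavelet analysis the framework builds on can be illustrated with one level of the Haar transform, the simplest wavelet: pairwise averages capture coarse structure, pairwise differences capture the fine detail band where tiny lesions live. A minimal 1-D sketch (the paper operates on 2-D images with a learned Transformer on top):

```python
def haar_level(signal):
    """One level of the Haar wavelet transform: pairwise averages give
    the low-pass band (coarse structure), pairwise half-differences the
    high-pass band (fine detail)."""
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2
              for i in range(len(signal) // 2)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2
              for i in range(len(signal) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Perfect reconstruction from the two bands."""
    out = []
    for a, d in zip(approx, detail):
        out += [a + d, a - d]
    return out
```

Recursing on the low-pass band yields the multi-scale decomposition; because the transform is exactly invertible, no image information is lost in the wavelet domain.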

[132] An ICTM-RMSAV Framework for Bias-Field Aware Image Segmentation under Poisson and Multiplicative Noise

Xinyu Wang, Wenjun Yao, Fanghui Song, Zhichang Guo

Main category: cs.CV

TL;DR: Proposes a variational segmentation model combining denoising and bias field estimation to handle images with intensity inhomogeneity and multiplicative/Poisson noise, using ICTM with RMSAV optimization.

DetailsMotivation: Existing segmentation methods degrade with heavy noise and intensity inhomogeneity, particularly with Gamma-distributed multiplicative noise and Poisson noise.

Method: Integrates denoising (I-divergence + adaptive TV regularizer) with bias field estimation in ICTM framework, using spatially adaptive weights and RMSAV optimization scheme.

Result: Extensive experiments show superior accuracy and robustness on synthetic and real-world images with intensity inhomogeneity and diverse noise types compared to competing methods.

Conclusion: The proposed model effectively handles challenging segmentation scenarios with noise and intensity inhomogeneity, achieving better performance than existing approaches.

Abstract: Image segmentation is a core task in image processing, yet many methods degrade when images are heavily corrupted by noise and exhibit intensity inhomogeneity. Within the iterative-convolution thresholding method (ICTM) framework, we propose a variational segmentation model that integrates denoising terms. Specifically, the denoising component consists of an I-divergence term and an adaptive total-variation (TV) regularizer, making the model well suited to images contaminated by Gamma-distributed multiplicative noise and Poisson noise. A spatially adaptive weight derived from a gray-level indicator guides diffusion differently across regions of varying intensity. To further address intensity inhomogeneity, we estimate a smoothly varying bias field, which improves segmentation accuracy. Regions are represented by characteristic functions, with contour length encoded accordingly. For efficient optimization, we couple ICTM with a relaxed modified scalar auxiliary variable (RMSAV) scheme. Extensive experiments on synthetic and real-world images with intensity inhomogeneity and diverse noise types show that the proposed model achieves superior accuracy and robustness compared with competing approaches.

[133] T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection

Jiazhou Zhou, Qing Jiang, Kanghao Chen, Lutao Jiang, Yuanhuiyi Lyu, Ying-Cong Chen, Lei Zhang

Main category: cs.CV

TL;DR: T-Rex-Omni is an open-set object detection framework that incorporates negative visual prompts to address limitations of positive-only methods, improving robustness against visually similar distractors through unified prompt encoding, training-free negative suppression, and discriminative loss functions.

DetailsMotivation: Current open-set object detectors rely exclusively on positive indicators (text descriptions or visual exemplars), making them vulnerable to visually similar but semantically different distractors. This positive-only paradigm needs enhancement to handle hard negative cases effectively.

Method: 1) Unified visual prompt encoder for joint processing of positive and negative prompts; 2) Training-free Negating Negative Computing (NNC) module to dynamically suppress negative responses; 3) Negating Negative Hinge (NNH) loss for discriminative margin enforcement during fine-tuning; 4) Flexible deployment supporting both positive-only and joint positive-negative inference modes.

Result: Achieves remarkable zero-shot detection performance, significantly narrowing the gap between visual-prompted and text-prompted methods. Particularly strong in long-tailed scenarios (51.2 AP_r on LVIS-minival). Demonstrates effectiveness in handling hard negative distractors.

Conclusion: Negative prompts represent a crucial new dimension for advancing open-set visual recognition systems, establishing T-Rex-Omni as an effective framework that overcomes limitations of positive-only approaches through comprehensive negative prompt integration.

Abstract: Object detection methods have evolved from closed-set to open-set paradigms over the years. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators based on given prompts like text descriptions or visual exemplars. This positive-only paradigm experiences consistent vulnerability to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, a training-free Negating Negative Computing (NNC) module is proposed to dynamically suppress negative responses during the probability computing stage. To further boost performance through fine-tuning, our Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate remarkable zero-shot detection performance, significantly narrowing the performance gap between visual-prompted and text-prompted methods while showing particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
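The training-free suppression step can be pictured as score arithmetic over prompt embeddings: match a region against the positive prompts, then subtract its best match against the negative prompts. A toy sketch in the spirit of NNC (the scale `alpha` and all names are assumptions, not the paper's formulation):

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-8)

def negated_score(region, positives, negatives, alpha=1.0):
    """Score a region embedding by its best positive-prompt match minus
    its best negative-prompt match: a high score means the region looks
    like some positive prompt and like none of the negatives."""
    pos = max(cosine(region, p) for p in positives)
    neg = max((cosine(region, n) for n in negatives), default=0.0)
    return pos - alpha * max(neg, 0.0)
```

A visually similar distractor supplied as a negative prompt drags its own score down, while a region unlike every negative keeps its positive score, which is the intended suppression of hard negatives.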

[134] Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

Liu Yu, Zhonghao Chen, Ping Kuang, Zhikun Feng, Fan Zhou, Lan Wang, Gillian Dobbie

Main category: cs.CV

TL;DR: Owl is a framework that reduces object hallucination in LVLMs by modeling the causal relationship between visual and textual attention, using a novel VTACR metric to detect imbalance and dynamically adjust attention.

DetailsMotivation: Existing methods address visual or textual attention separately, ignoring their interaction as key causal factors for object hallucination in LVLMs.

Method: Proposes VTACR metric to quantify modality imbalance, uses structural causal modeling, implements fine-grained attention intervention, and applies dual-path contrastive decoding.

Result: Achieves significant hallucination reduction on POPE and CHAIR benchmarks, setting new state-of-the-art in faithfulness while preserving vision-language understanding.

Conclusion: Owl effectively mitigates object hallucination by causally modeling attention interactions and dynamically balancing visual-textual contributions during decoding.

Abstract: Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones – letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL
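On a single attention row, the VTACR metric and the intervention it triggers reduce to ratio-and-reweight arithmetic. A toy sketch with a hypothetical threshold `tau` and boost factor (the paper's intervention is token- and layer-wise and guided by the causal analysis):

```python
def vtacr(attn, visual_idx):
    """Visual-to-Textual Attention Contribution Ratio for one attention
    row: attention mass on visual tokens over mass on textual tokens."""
    visual = sum(attn[i] for i in visual_idx)
    textual = sum(w for i, w in enumerate(attn) if i not in visual_idx)
    return visual / max(textual, 1e-8)

def intervene(attn, visual_idx, tau=1.0, boost=2.0):
    """If text dominates (VTACR below tau), boost visual-token attention
    and renormalize so the row still sums to one."""
    if vtacr(attn, visual_idx) >= tau:
        return attn
    out = [w * boost if i in visual_idx else w for i, w in enumerate(attn)]
    s = sum(out)
    return [w / s for w in out]
```

Low-VTACR rows are exactly the text-prior-dominated cases the paper links to hallucination, so boosting only those rows strengthens visual grounding without disturbing well-balanced ones.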

[135] Dense Cross-Scale Image Alignment With Fully Spatial Correlation and Just Noticeable Difference Guidance

Jinkun You, Jiaxue Li, Jie Zhang, Yicong Zhou

Main category: cs.CV

TL;DR: Proposes a dense cross-scale image alignment model that improves accuracy and efficiency by leveraging cross-scale feature correlations and a fully spatial correlation module.

DetailsMotivation: Existing unsupervised image alignment methods have limited accuracy and high computational complexity, creating a need for more efficient and accurate solutions.

Method: Uses dense cross-scale alignment with correlations between cross-scale features, a fully spatial correlation module, and incorporates just noticeable difference to focus on distortion-sensitive regions.

Result: Extensive experiments show the method surpasses state-of-the-art approaches in both quantitative and qualitative evaluations.

Conclusion: The proposed model achieves superior alignment accuracy with flexible efficiency trade-offs and eliminates noticeable alignment errors.

Abstract: Existing unsupervised image alignment methods exhibit limited accuracy and high computational complexity. To address these challenges, we propose a dense cross-scale image alignment model. It takes into account the correlations between cross-scale features to decrease the alignment difficulty. Our model supports flexible trade-offs between accuracy and efficiency by adjusting the number of scales utilized. Additionally, we introduce a fully spatial correlation module to further improve accuracy while maintaining low computational costs. We incorporate the just noticeable difference to encourage our model to focus on image regions more sensitive to distortions, eliminating noticeable alignment errors. Extensive quantitative and qualitative experiments demonstrate that our method surpasses state-of-the-art approaches.

[136] USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

Penghui Niu, Taotao Cai, Jiashuai She, Yajuan Zhang, Junhua Gu, Ping Zhang, Jungong Han, Jianxin Li

Main category: cs.CV

TL;DR: USF-Net is a unified spatiotemporal fusion network for cloud image sequence extrapolation that addresses limitations in adaptive feature extraction, temporal guidance, and computational efficiency through adaptive large-kernel convolutions and low-complexity attention mechanisms.

DetailsMotivation: Existing cloud image extrapolation methods have limitations: lack adaptive feature extraction mechanisms, insufficient temporal guidance for long-range dependencies, and high computational costs from attention mechanisms, which hinder practical deployment in photovoltaic systems.

Method: Proposed USF-Net with encoder-decoder framework using adaptive large-kernel convolutions and low-complexity attention. Includes USTM with SiB (multi-scale context capture) and TiB (long-range temporal modeling), plus DSM with TGM for unified spatiotemporal dependencies. Decoder uses DUM to prevent ghosting effects.

Result: Extensive experiments on the new ASI-CIS dataset show USF-Net significantly outperforms state-of-the-art methods, achieving superior balance between prediction accuracy and computational efficiency for cloud extrapolation.

Conclusion: USF-Net effectively addresses key challenges in cloud image sequence extrapolation through its unified spatiotemporal fusion approach, demonstrating improved performance and efficiency for practical photovoltaic system applications.

Abstract: Ground-based remote sensing cloud image sequence extrapolation is a key research area in the development of photovoltaic power systems. However, existing approaches exhibit several limitations: (1) they primarily rely on static kernels to augment feature information, lacking adaptive mechanisms to extract features at varying resolutions dynamically; (2) temporal guidance is insufficient, leading to suboptimal modeling of long-range spatiotemporal dependencies; and (3) the quadratic computational cost of attention mechanisms is often overlooked, limiting efficiency in practical deployment. To address these challenges, we propose USF-Net, a Unified Spatiotemporal Fusion Network that integrates adaptive large-kernel convolutions and a low-complexity attention mechanism, combining temporal flow information within an encoder-decoder framework. Specifically, the encoder employs three basic layers to extract features. These are followed by the USTM, which comprises: (1) a SiB equipped with an SSM that dynamically captures multi-scale contextual information, and (2) a TiB featuring a TAM that effectively models long-range temporal dependencies while maintaining computational efficiency. In addition, a DSM with a TGM is introduced to enable unified modeling of temporally guided spatiotemporal dependencies. On the decoder side, a DUM is employed to address the common “ghosting effect.” It utilizes the initial temporal state as an attention operator to preserve critical motion signatures. As a key contribution, we also introduce and release the ASI-CIS dataset. Extensive experiments on ASI-CIS demonstrate that USF-Net significantly outperforms state-of-the-art methods, establishing a superior balance between prediction accuracy and computational efficiency for ground-based cloud extrapolation. The dataset and source code will be available at https://github.com/she1110/ASI-CIS.

[137] 4KDehazeFlow: Ultra-High-Definition Image Dehazing via Flow Matching

Xingchi Chen, Pu Wang, Xuerui Li, Chaopeng Li, Juxiang Zhou, Jianhou Gan, Dianjie Lu, Guijuan Zhang, Wenqi Ren, Zhuoran Zheng

Main category: cs.CV

TL;DR: 4KDehazeFlow is a novel UHD image dehazing method using Flow Matching and a Haze-Aware vector field for progressive optimization, achieving superior performance with a 2 dB PSNR improvement over state-of-the-art methods.

DetailsMotivation: Address challenges in UHD image dehazing including limited scene adaptability in prior-based methods and high computational complexity with color distortion in deep learning approaches.

Method: Models dehazing as progressive optimization of continuous vector field flow using Flow Matching and Haze-Aware vector field. Uses learnable 3D LUT for efficient inference and fourth-order Runge-Kutta ODE solver for stable artifact suppression.

Result: Exceeds seven state-of-the-art methods with a 2 dB PSNR increase, stronger performance in dense haze, and superior color fidelity.

Conclusion: 4KDehazeFlow provides efficient data-driven adaptive nonlinear color transformation for high-quality UHD image dehazing with general compatibility across various deep learning networks.

Abstract: Ultra-High-Definition (UHD) image dehazing faces challenges such as limited scene adaptability in prior-based methods and high computational complexity with color distortion in deep learning approaches. To address these issues, we propose 4KDehazeFlow, a novel method based on Flow Matching and the Haze-Aware vector field. This method models the dehazing process as a progressive optimization of continuous vector field flow, providing efficient data-driven adaptive nonlinear color transformation for high-quality dehazing. Specifically, our method has the following advantages: 1) 4KDehazeFlow is a general method compatible with various deep learning networks, without relying on any specific network architecture. 2) We propose a learnable 3D lookup table (LUT) that encodes haze transformation parameters into a compact 3D mapping matrix, enabling efficient inference through precomputed mappings. 3) We utilize a fourth-order Runge-Kutta (RK4) ordinary differential equation (ODE) solver to stably solve the dehazing flow field through an accurate step-by-step iterative method, effectively suppressing artifacts. Extensive experiments show that 4KDehazeFlow exceeds seven state-of-the-art methods. It delivers a 2 dB PSNR increase and better performance in dense haze and color fidelity.
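The RK4 solver mentioned above is a standard numerical integrator; a minimal, paper-agnostic sketch follows (the field `f` is a hypothetical stand-in for the learned Haze-Aware vector field, not the authors' implementation):

```python
def rk4_step(f, x, t, dt):
    """One fourth-order Runge-Kutta step for dx/dt = f(x, t)."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(x + dt * k3, t + dt)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, x0, steps=8):
    """Integrate from t=0 to t=1 in equal steps, as flow-matching samplers
    do when transporting a hazy input toward its dehazed counterpart."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = rk4_step(f, x, i * dt, dt)
    return x
```

RK4 spends four field evaluations per step but has much lower truncation error than Euler integration, which is the usual motivation for citing it as a stabilizer.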

[138] PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Hector Liu, Zhiting Hu, Eric P. Xing

Main category: cs.CV

TL;DR: PAN is a general, interactable world model that predicts future states through video simulation conditioned on history and natural language actions, combining LLM-based reasoning with video diffusion for coherent long-term dynamics.

DetailsMotivation: Current video generation models lack causal control and long-horizon consistency for reasoning, while existing world models are restricted to narrow domains with limited generalization. There's a need for general world models that can simulate diverse environments with interactivity.

Method: Uses Generative Latent Prediction (GLP) architecture: autoregressive latent dynamics backbone based on LLM for text-based reasoning and action conditioning, combined with video diffusion decoder for detailed visual reconstruction. Trained on large-scale video-action pairs across diverse domains.

Result: Achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other models. Supports open-domain simulation with coherent long-term dynamics.

Conclusion: PAN represents progress toward general world models that enable predictive simulation of future states for reasoning and acting, unifying latent space imagination with realizable world dynamics.

Abstract: A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.

[139] VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering

Hai-Dang Nguyen, Minh-Anh Dang, Minh-Tan Le, Minh-Tuan Le

Main category: cs.CV

TL;DR: VietMEAgent is a multimodal explainable framework for Vietnamese cultural VQA that combines cultural object detection with program generation to provide transparent explanations.

DetailsMotivation: Current VQA systems struggle with culturally specific content due to under-represented cultural knowledge in training data and lack of interpretable reasoning processes.

Method: Integrates cultural object detection backbone with structured program generation, uses curated Vietnamese cultural knowledge base, and combines attention-based visual evidence with human-readable textual rationales.

Result: Developed a Vietnamese Cultural VQA dataset and demonstrated practicality of programming-based methodologies for cultural AI.

Conclusion: The system provides transparent explanations revealing both computational rationale and cultural context, supporting education and cultural preservation with interpretability and cultural sensitivity.

Abstract: Contemporary Visual Question Answering (VQA) systems remain constrained when confronted with culturally specific content, largely because cultural knowledge is under-represented in training corpora and the reasoning process is not rendered interpretable to end users. This paper introduces VietMEAgent, a multimodal explainable framework engineered for Vietnamese cultural understanding. The method integrates a cultural object detection backbone with a structured program generation layer, yielding a pipeline in which answer prediction and explanation are tightly coupled. A curated knowledge base of Vietnamese cultural entities serves as an explicit source of background information, while a dual-modality explanation module combines attention-based visual evidence with structured, human-readable textual rationales. We further construct a Vietnamese Cultural VQA dataset sourced from public repositories and use it to demonstrate the practicality of programming-based methodologies for cultural AI. The resulting system provides transparent explanations that disclose both the computational rationale and the underlying cultural context, supporting education and cultural preservation with an emphasis on interpretability and cultural sensitivity.

[140] Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference

Chengze Jiang, Minjing Dong, Xinli Shi, Jie Gui

Main category: cs.CV

TL;DR: DOC enhances adversarial robustness in vision-language models by generating diverse counterattacks using orthogonal gradient directions and momentum updates, improving defense against various attacks while maintaining clean accuracy.

DetailsMotivation: Existing counterattack methods like TTC have limited diversity and coverage due to narrow search spaces, making them vulnerable to overfitting and ineffective against broad range of adversarial perturbations.

Method: Proposes Directional Orthogonal Counterattack (DOC) that incorporates orthogonal gradient directions and momentum-based updates to expand counterattack space exploration and increase perturbation diversity, plus a directional sensitivity score for adaptive counterattack strength modulation.

Result: Extensive experiments on 16 datasets show DOC improves adversarial robustness under various attacks while maintaining competitive clean accuracy.

Conclusion: Enhancing counterattack diversity and coverage through orthogonal directions and momentum updates is crucial for improving adversarial robustness in test-time defense for vision-language models.

Abstract: Vision-language pre-training models (VLPs) demonstrate strong multimodal understanding and zero-shot generalization, yet remain vulnerable to adversarial examples, raising concerns about their reliability. Recent work, Test-Time Counterattack (TTC), improves robustness by generating perturbations that maximize the embedding deviation of adversarial inputs using PGD, pushing them away from their adversarial representations. However, due to the fundamental difference in optimization objectives between adversarial attacks and counterattacks, generating counterattacks solely based on gradients with respect to the adversarial input confines the search to a narrow space. As a result, the counterattacks could overfit limited adversarial patterns and lack the diversity to fully neutralize a broad range of perturbations. In this work, we argue that enhancing the diversity and coverage of counterattacks is crucial to improving adversarial robustness in test-time defense. Accordingly, we propose Directional Orthogonal Counterattack (DOC), which augments counterattack optimization by incorporating orthogonal gradient directions and momentum-based updates. This design expands the exploration of the counterattack space and increases the diversity of perturbations, which facilitates the discovery of more generalizable counterattacks and ultimately improves the ability to neutralize adversarial perturbations. Meanwhile, we present a directional sensitivity score based on averaged cosine similarity to boost DOC by improving example discrimination and adaptively modulating the counterattack strength. Extensive experiments on 16 datasets demonstrate that DOC improves adversarial robustness under various attacks while maintaining competitive clean accuracy. Code is available at https://github.com/bookman233/DOC.
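The orthogonal-direction idea above is, at its core, Gram-Schmidt projection; a small illustrative sketch (NumPy, with hypothetical step sizes and names — not the authors' DOC implementation):

```python
import numpy as np

def orthogonal_direction(g, rng):
    """Sample a random direction and remove its component along gradient g,
    yielding a unit direction orthogonal to g (Gram-Schmidt projection)."""
    r = rng.standard_normal(g.shape)
    r = r - (r.ravel() @ g.ravel()) / (g.ravel() @ g.ravel() + 1e-12) * g
    return r / (np.linalg.norm(r) + 1e-12)

def momentum_update(delta, direction, m, beta=0.9, step=1e-2):
    """Momentum-based accumulation of a counterattack perturbation delta."""
    m = beta * m + (1 - beta) * direction
    return delta + step * m, m
```

Exploring directions orthogonal to the raw gradient is what widens the search space beyond the narrow cone that plain PGD-style counterattacks cover.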

[141] Composition-Incremental Learning for Compositional Generalization

Zhen Li, Yuwei Wu, Chenchen Jing, Che Sun, Chuanhao Li, Yunde Jia

Main category: cs.CV

TL;DR: The paper introduces Composition-Incremental Learning (CompIL) for compositional generalization in CZSL, creating benchmarks and a pseudo-replay framework with visual synthesis and linguistic distillation.

DetailsMotivation: Real-world data emerges continuously with infinite, long-tailed compositions, requiring models to incrementally improve compositional generalization capabilities.

Method: Proposed a pseudo-replay framework using a visual synthesizer to generate representations of learned compositions and linguistic primitive distillation to maintain aligned representations.

Result: Extensive experiments show the framework’s effectiveness on the created MIT-States-CompIL and C-GQA-CompIL benchmarks.

Conclusion: The CompIL approach enables progressive improvement of compositional generalization in incremental learning scenarios.

Abstract: Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.

[142] Ultra-Light Test-Time Adaptation for Vision–Language Models

Byunghyun Kim

Main category: cs.CV

TL;DR: UL-TTA is a training-free, backprop-free test-time adaptation method that freezes vision-language model backbones and only adapts logit-level parameters (class prototypes, priors, temperature) using online EM-style updates with Bayesian priors, achieving state-of-the-art accuracy-calibration trade-offs under domain shift.

DetailsMotivation: Vision-language models suffer from feature drift, class-prior mismatch, and miscalibration under domain shift, while existing test-time adaptation methods require backpropagation through large backbones or heavy memory/state, making them unsuitable for streaming and edge scenarios.

Method: Freezes backbone and adapts only logit-level parameters via online EM-style procedure with: selective sample filtering, closed-form Bayesian updates for prototypes/priors anchored by text and Dirichlet priors, decoupled temperatures for prediction vs. calibration, and lightweight guards to prevent drift.

Result: Consistently improves top-1 accuracy (+4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30% across large-scale cross-domain benchmarks, with <8% latency overhead. No collapse in long-stream experiments up to 200K samples.

Conclusion: Logit-level Bayesian adaptation is sufficient to achieve state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift without updating any backbone parameters, making it highly efficient for streaming and edge scenarios.

Abstract: Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii) decoupled temperatures for prediction vs. calibration, and (iv) lightweight guards (norm clipping, prior KL constraints, smoothed temperature) to prevent drift in long streams. Across large-scale cross-domain and OOD benchmarks (PACS, Office-Home, DomainNet, Terra Incognita, ImageNet-R/A/V2/Sketch; ~726K test samples) and strong TTA baselines including Tent, T3A, CoTTA, SAR, Tip-Adapter, and FreeTTA, UL-TTA consistently improves top-1 accuracy (e.g., +4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments up to 200K samples show no collapse. Our results demonstrate that logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift, without updating any backbone parameters.
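To make the "logit-level adaptation" concrete, here is an illustrative sketch of one such step (an EMA stand-in for the paper's closed-form Bayesian updates; every name, constant, and threshold below is a hypothetical choice, not UL-TTA's actual code):

```python
import numpy as np

def ultta_step(prototypes, priors, feat, text_protos, tau_pred=0.01,
               conf_thresh=0.6, rho=0.05, alpha=0.1):
    """One hypothetical logit-level adaptation step: update class prototypes
    and priors from a single confident test feature, anchored to the frozen
    text prototypes so the stream cannot drag them arbitrarily far."""
    feat = feat / np.linalg.norm(feat)
    logits = feat @ prototypes.T / tau_pred + np.log(priors)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = int(p.argmax())
    if p[k] >= conf_thresh:                      # selective sample filtering
        prototypes[k] = (1 - rho) * prototypes[k] + rho * feat
        prototypes[k] = (1 - alpha) * prototypes[k] + alpha * text_protos[k]
        prototypes[k] /= np.linalg.norm(prototypes[k])
        priors = 0.99 * priors + 0.01 * p        # smoothed prior update
        priors /= priors.sum()
    return prototypes, priors, p
```

Because only prototypes, priors, and temperatures change, no gradient ever flows through the backbone, which is what keeps the latency overhead small.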

[143] DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

Main category: cs.CV

TL;DR: The paper introduces DKDS, a new dataset for recognizing degraded Kuzushiji documents with seals, providing baselines for text/seal detection using YOLO models and document binarization using traditional and GAN-based methods.

DetailsMotivation: Existing OCR methods for Kuzushiji perform well on clean documents but fail on degraded documents with noise like seals and document degradation, with no existing dataset addressing these challenges.

Method: Created DKDS dataset with Kuzushiji expert assistance, defined two benchmark tracks: (1) text and seal detection using YOLO models, (2) document binarization using traditional algorithms, K-means clustering, and GAN-based methods.

Result: Provided baseline results for both tracks, making the DKDS dataset and implementation code publicly available as a benchmark for degraded Kuzushiji document recognition.

Conclusion: DKDS addresses the gap in datasets for noisy Kuzushiji documents and provides a foundation for improving OCR performance on degraded historical documents with seals.

Abstract: Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using multiple versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, and Generative Adversarial Network (GAN)-based methods. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.
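The "traditional binarization algorithms" in the second track typically include Otsu's global threshold; a minimal NumPy sketch (illustrative only, not the benchmark's exact baseline code):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the threshold maximizing between-class variance
    of the grayscale histogram (values assumed in 0..255)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 (background) mass
    mu = np.cumsum(prob * np.arange(256))    # cumulative intensity mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b[np.isnan(sigma_b)] = 0
    return int(np.argmax(sigma_b))

def binarize(gray):
    """Map pixels above the Otsu threshold to white (255), the rest to black."""
    return (gray > otsu_threshold(gray)).astype(np.uint8) * 255
```

Global thresholds like this are exactly what degrades on stained pages with seals, which motivates the K-means and GAN-based baselines in the same track.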

[144] PIFF: A Physics-Informed Generative Flow Model for Real-Time Flood Depth Mapping

ChunLiang Wu, Tsunhua Yang, Hungying Chen

Main category: cs.CV

TL;DR: PIFF is a physics-informed flow-based generative neural network that provides near real-time flood depth estimation by mapping Digital Elevation Models to flood predictions, integrating simplified physics models and transformer-based rainfall encoding.

DetailsMotivation: Traditional flood mapping methods like numerical modeling and aerial photography have limitations in efficiency and reliability, creating a need for faster, more accurate real-time flood prediction systems.

Method: Uses an image-to-image generative framework conditioned on a simplified inundation model (SPM) that embeds hydrodynamic priors, plus a transformer-based rainfall encoder to capture temporal precipitation dependencies, integrating physics-informed constraints with data-driven learning.

Result: Tested on a 26 km study area in Tainan, Taiwan with 182 rainfall scenarios (24-720 mm over 24 hours), PIFF demonstrated effective flood prediction capabilities, replacing costly simulations with accurate real-time flood maps.

Conclusion: PIFF offers an effective, data-driven alternative for flood prediction and response by capturing causal relationships between rainfall, topography, physics models, and flooding in near real-time.

Abstract: Flood mapping is crucial for assessing and mitigating flood impacts, yet traditional methods like numerical modeling and aerial photography face limitations in efficiency and reliability. To address these challenges, we propose PIFF, a physics-informed, flow-based generative neural network for near real-time flood depth estimation. Built on an image-to-image generative framework, it efficiently maps Digital Elevation Models (DEM) to flood depth predictions. The model is conditioned on a simplified inundation model (SPM) that embeds hydrodynamic priors into the training process. Additionally, a transformer-based rainfall encoder captures temporal dependencies in precipitation. Integrating physics-informed constraints with data-driven learning, PIFF captures the causal relationships between rainfall, topography, SPM, and flooding, replacing costly simulations with accurate, real-time flood maps. Using a 26 km study area in Tainan, Taiwan, with 182 rainfall scenarios ranging from 24 mm to 720 mm over 24 hours, our results demonstrate that PIFF offers an effective, data-driven alternative for flood prediction and response.

[145] MACEval: A Multi-Agent Continual Evaluation Network for Large Models

Zijian Chen, Yuze Sun, Yuan Tian, Wenjun Zhang, Guangtao Zhai

Main category: cs.CV

TL;DR: MACEval is a multi-agent continual evaluation network that provides dynamic, automated evaluation of large language models using role assignment, in-process data generation, and evaluation routing to address data contamination and scalability issues in traditional benchmarks.

DetailsMotivation: Current benchmarks for large models are closed-ended, prone to data contamination, difficult to maintain, and heavily dependent on human curation, which undermines evaluation credibility and adaptability to advancing model capabilities.

Method: MACEval uses a multi-agent network with role assignment, in-process data generation, and evaluation routing through cascaded agents to create an interactive and autonomous evaluation system.

Result: Experiments on 9 open-ended tasks with 23 large models show MACEval is human-free and automatic, efficient and economical (reducing data and overhead), and flexible/scalable for integrating existing benchmarks.

Conclusion: MACEval provides a sustainable, longitudinal evaluation framework that can broaden future directions for large model evaluation by addressing key limitations of current benchmark approaches.

Abstract: Hundreds of benchmarks dedicated to evaluating large models from multiple perspectives have been presented over the past few years. Albeit substantial efforts, most of them remain closed-ended and are prone to overfitting due to the potential data contamination in the ever-growing training corpus of large models, thereby undermining the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation to gauge the advancing capabilities of large models. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, mitigating laborious result processing through guided inter-agent judgment; (2) efficient and economical, reducing a considerable amount of data and overhead to obtain similar results compared to related benchmarks; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.

[146] PressTrack-HMR: Pressure-Based Top-Down Multi-Person Global Human Mesh Recovery

Jiayue Yuan, Fangting Xie, Guangwen Ouyang, Changhai Ma, Ziyu Wu, Heyu Ding, Quan Wan, Yi Ke, Yuchen Wu, Xiaohui Cai

Main category: cs.CV

TL;DR: PressTrack-HMR is a top-down pipeline that recovers multi-person global human meshes from pressure signals using tracking-by-detection to segment individual pressure data and perform human mesh recovery for each person.

DetailsMotivation: Traditional vision-based human mesh recovery faces limitations due to occlusions, poor lighting, and privacy concerns. Pressure signals from tactile mats offer an occlusion-free, privacy-friendly alternative, but existing methods struggle with distinguishing intermingled pressure signals from multiple people walking simultaneously.

Method: Top-down pipeline using tracking-by-detection strategy to identify and segment each individual’s pressure signal from raw pressure data, followed by human mesh recovery for each extracted individual signal. Also built a multi-person interaction pressure dataset (MIP).

Result: Achieved 89.2 mm MPJPE and 112.6 mm WA-MPJPE_100, demonstrating effective multi-person human mesh recovery using pressure data.

Conclusion: The method showcases the potential of tactile mats for ubiquitous, privacy-preserving multi-person action recognition, successfully extending pressure-based human mesh recovery to multi-person scenarios.

Abstract: Multi-person global human mesh recovery (HMR) is crucial for understanding crowd dynamics and interactions. Traditional vision-based HMR methods sometimes face limitations in real-world scenarios due to mutual occlusions, insufficient lighting, and privacy concerns. Human-floor tactile interactions offer an occlusion-free and privacy-friendly alternative for capturing human motion. Existing research indicates that pressure signals acquired from tactile mats can effectively estimate human pose in single-person scenarios. However, when multiple individuals walk randomly on the mat simultaneously, how to distinguish intermingled pressure signals generated by different persons and subsequently acquire individual temporal pressure data remains a pending challenge for extending pressure-based HMR to the multi-person situation. In this paper, we present PressTrack-HMR, a top-down pipeline that recovers multi-person global human meshes solely from pressure signals. This pipeline leverages a tracking-by-detection strategy to first identify and segment each individual’s pressure signal from the raw pressure data, and subsequently performs HMR for each extracted individual signal. Furthermore, we build a multi-person interaction pressure dataset MIP, which facilitates further research into pressure-based human motion analysis in multi-person scenarios. Experimental results demonstrate that our method excels in multi-person HMR using pressure data, with 89.2 mm MPJPE and 112.6 mm WA-MPJPE_100, and these results showcase the potential of tactile mats for ubiquitous, privacy-preserving multi-person action recognition. Our dataset & code are available at https://github.com/Jiayue-Yuan/PressTrack-HMR.
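MPJPE, the headline metric above, is simply the mean Euclidean joint error; a minimal sketch (shapes and units are assumptions, not the paper's evaluation code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, in the data's units (mm here).
    Expected shapes: (frames, joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

World-aligned variants such as WA-MPJPE first align the predicted and ground-truth trajectories over a window before computing the same per-joint distance.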

[147] HOTFLoc++: End-to-End Hierarchical LiDAR Place Recognition, Re-Ranking, and 6-DoF Metric Localisation in Forests

Ethan Griffiths, Maryam Haghighat, Simon Denman, Clinton Fookes, Milad Ramezani

Main category: cs.CV

TL;DR: HOTFLoc++ is an end-to-end framework for LiDAR place recognition and 6-DoF localization in forests using an octree-based transformer with hierarchical descriptors and multi-scale geometric verification.

DetailsMotivation: To address challenges in LiDAR place recognition in forest environments including clutter, self-similarity, viewpoint changes, and degraded correspondences in ground-to-ground and ground-to-aerial scenarios.

Method: Uses octree-based transformer for hierarchical local descriptors at multiple granularities, learnable multi-scale geometric verification module for re-ranking, and coarse-to-fine registration approach.

Result: Achieves 90.7% Recall@1 on CS-Wild-Places (a 29.6 percentage-point improvement over baselines), 91.7% on Wild-Places, and 96.0% on MulRan. 97.2% of registrations have <2 m and <5° error. Runtime improvements of two orders of magnitude over RANSAC for dense point clouds.

Conclusion: The method significantly outperforms state-of-the-art approaches in challenging forest environments while maintaining high performance on standard benchmarks, with substantial runtime improvements.

Abstract: This article presents HOTFLoc++, an end-to-end framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests. Leveraging an octree-based transformer, our approach extracts hierarchical local descriptors at multiple granularities to increase robustness to clutter, self-similarity, and viewpoint changes in challenging scenarios, including ground-to-ground and ground-to-aerial in forest and urban environments. We propose a learnable multi-scale geometric verification module to reduce re-ranking failures in the presence of degraded single-scale correspondences. Our coarse-to-fine registration approach achieves comparable or lower localisation errors to baselines, with runtime improvements of two orders of magnitude over RANSAC for dense point clouds. Experimental results on public datasets show the superiority of our approach compared to state-of-the-art methods, achieving an average Recall@1 of 90.7% on CS-Wild-Places: an improvement of 29.6 percentage points over baselines, while maintaining high performance on single-source benchmarks with an average Recall@1 of 91.7% and 96.0% on Wild-Places and MulRan, respectively. Our method achieves under 2 m and 5 degrees error for 97.2% of 6-DoF registration attempts, with our multi-scale re-ranking module reducing localisation errors by ~2× on average. The code will be available upon acceptance.
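Recall@1, used throughout the results above, is computed from descriptor similarities between queries and a database; a small illustrative sketch (cosine similarity on L2-normalized global descriptors is an assumption, not necessarily this paper's matcher):

```python
import numpy as np

def recall_at_k(query_desc, db_desc, gt_idx, k=1):
    """Fraction of queries whose ground-truth database match appears among
    the top-k most similar descriptors (cosine similarity)."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sim = q @ d.T
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [gt in row for gt, row in zip(gt_idx, topk)]
    return float(np.mean(hits))
```

Re-ranking modules like the one proposed here re-score this top-k shortlist with geometric verification before the final Recall@1 is reported.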

[148] DBINDS – Can Initial Noise from Diffusion Model Inversion Help Reveal AI-Generated Videos?

Yanlin Wu, Xiaogang Yuan, Dezhi An

Main category: cs.CV

TL;DR: DBINDS is a diffusion-model-inversion based detector that analyzes latent-space dynamics rather than pixels to detect AI-generated videos, achieving strong cross-generator generalization with limited training data.

DetailsMotivation: AI-generated video poses serious challenges to content security and forensic analysis, and existing detectors relying on pixel-level visual cues generalize poorly to unseen generators.

Method: Proposes DBINDS that uses diffusion inversion to recover initial noise sequences, forms Initial Noise Difference Sequence (INDS), extracts multi-domain multi-scale features, and uses feature optimization with LightGBM classifier tuned by Bayesian search.

Result: DBINDS trained on a single generator achieves strong cross-generator performance on GenVidBench, demonstrating good generalization and robustness in limited-data settings.

Conclusion: Analyzing latent-space dynamics through diffusion inversion provides an effective approach for detecting AI-generated videos with strong generalization capabilities across different generators.

Abstract: AI-generated video has advanced rapidly and poses serious challenges to content security and forensic analysis. Existing detectors rely mainly on pixel-level visual cues and generalize poorly to unseen generators. We propose DBINDS, a diffusion-model-inversion based detector that analyzes latent-space dynamics rather than pixels. We find that initial noise sequences recovered by diffusion inversion differ systematically between real and generated videos. Building on this, DBINDS forms an Initial Noise Difference Sequence (INDS) and extracts multi-domain, multi-scale features. With feature optimization and a LightGBM classifier tuned by Bayesian search, DBINDS (trained on a single generator) achieves strong cross-generator performance on GenVidBench, demonstrating good generalization and robustness in limited-data settings.
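The abstract does not spell out DBINDS's feature set, but the core idea, forming an Initial Noise Difference Sequence (INDS) from per-frame inverted noise and summarizing it with multi-domain statistics, can be sketched minimally. The three statistics below are illustrative stand-ins, not the paper's actual features:

```python
import numpy as np

def inds_features(initial_noise, eps=1e-8):
    """Summarize an Initial Noise Difference Sequence (INDS) built from
    per-frame noise maps recovered by diffusion inversion. The three
    statistics are illustrative stand-ins for the paper's
    multi-domain, multi-scale features."""
    noise = np.asarray(initial_noise, dtype=np.float64)         # (T, H, W)
    diffs = np.abs(np.diff(noise, axis=0))                      # (T-1, H, W)
    per_frame = diffs.reshape(diffs.shape[0], -1).mean(axis=1)  # temporal profile
    spectrum = np.abs(np.fft.rfft(per_frame - per_frame.mean()))
    return np.array([
        per_frame.mean(),                          # average frame-to-frame change
        per_frame.std(),                           # temporal variability
        spectrum.max() / (spectrum.sum() + eps),   # dominant-frequency share
    ])

rng = np.random.default_rng(0)
real_like = rng.standard_normal((8, 16, 16))                  # fresh noise each frame
gen_like = np.repeat(rng.standard_normal((1, 16, 16)), 8, 0)  # nearly static noise
f_real, f_gen = inds_features(real_like), inds_features(gen_like)
# f_real shows large frame-to-frame change; f_gen is all zeros here
```

A vector like this per video is what a gradient-boosted classifier such as LightGBM would then consume.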

[149] Towards Trustworthy Dermatology MLLMs: A Benchmark and Multimodal Evaluator for Diagnostic Narratives

Yuhao Shen, Jiahe Qian, Shuping Zhang, Zhangtianyi Chen, Tao Lu, Juexiao Zhou

Main category: cs.CV

TL;DR: The paper introduces DermBench and DermEval - a benchmark and automatic evaluator for assessing dermatology diagnostic narratives generated by multimodal LLMs, achieving close alignment with expert ratings.

Motivation: Reliable evaluation is the primary bottleneck for responsible clinical deployment of multimodal LLMs in dermatology diagnosis, as current methods lack clinically meaningful and scalable assessment.

Method: Developed DermBench with 4,000 real-world dermatology images paired with expert-certified narratives, and DermEval - a reference-free multimodal evaluator that produces structured critiques and scores for generated narratives.

Result: Experiments on 4,500 cases show DermBench and DermEval achieve mean deviations of 0.251 and 0.117 (out of 5) from expert ratings, providing reliable measurement across different multimodal LLMs.

Conclusion: The proposed framework enables clinically meaningful, reproducible, and scalable evaluation of dermatology diagnostic narratives, addressing the critical need for reliable assessment in clinical deployment of multimodal LLMs.

Abstract: Multimodal large language models (LLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique along with an overall score and per-dimension ratings. This capability enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse dataset of 4,500 cases demonstrate that DermBench and DermEval achieve close alignment with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, providing reliable measurement of diagnostic ability and trustworthiness across different multimodal LLMs.

[150] Taming Object Hallucinations with Verified Atomic Confidence Estimation

Jiarui Liu, Weihao Xuan, Zhijing Jin, Mona Diab

Main category: cs.CV

TL;DR: TACO is a framework that reduces hallucinations in Multimodal Large Language Models by decomposing responses into atomic queries, paraphrasing them, estimating confidence through self-consistency or self-confidence, and refining answers.

Motivation: MLLMs often suffer from hallucinations in object existence, attributes, or relations, which undermines their reliability and trustworthiness.

Method: TACO decomposes responses into atomic queries, paraphrases them to reduce wording sensitivity, estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, and refines answers with a language model.

Result: Experiments on five benchmarks with two MLLMs show TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration.

Conclusion: TACO effectively enhances the faithfulness of MLLMs by mitigating hallucinations through self-verification and confidence calibration without external vision experts.

Abstract: Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (\texttt{LLaVA-1.5-7B} and \texttt{CogVLM2}) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
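TACO's black-box self-consistency aggregation can be sketched as a majority-vote confidence over repeated samples of one atomic query; the exact aggregation rule in the paper may differ, and the query and samples below are made up:

```python
from collections import Counter

def self_consistency(answers):
    """Black-box confidence for one atomic query: the share of sampled
    answers agreeing with the majority answer."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

# Stand-in for sampling an MLLM five times on the atomic query "Is there a dog?"
samples = ["yes", "yes", "no", "yes", "yes"]
answer, confidence = self_consistency(samples)
# answer == "yes", confidence == 0.8; a low score would trigger answer refinement
```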

[151] Spatial Information Bottleneck for Interpretable Visual Recognition

Kaixiang Shu, Kai Meng, Junqin Luo

Main category: cs.CV

TL;DR: The paper proposes Spatial Information Bottleneck (S-IB) to spatially disentangle neural network representations by optimizing Vector-Jacobian Products during training, improving attribution quality across multiple explanation methods.

Motivation: Deep neural networks learn spatially entangled representations that mix discriminative foreground features with spurious background correlations, undermining model interpretability and robustness.

Method: Proposes Spatial Information Bottleneck (S-IB) framework that maximizes mutual information between foreground Vector-Jacobian Products (VJP) and inputs while minimizing mutual information in background regions, encouraging networks to encode information only in class-relevant spatial areas.

Result: Experiments on five benchmarks show universal improvements across six explanation methods, achieving better foreground concentration and background suppression without method-specific tuning, along with consistent classification accuracy gains.

Conclusion: Directly optimizing VJP’s spatial structure during training improves visualization quality across diverse explanation paradigms, providing a principled approach to enhance model interpretability and robustness.

Abstract: Deep neural networks typically learn spatially entangled representations that conflate discriminative foreground features with spurious background correlations, thereby undermining model interpretability and robustness. We propose a novel understanding framework for gradient-based attribution from an information-theoretic perspective. We prove that, under mild conditions, the Vector-Jacobian Products (VJP) computed during backpropagation form minimal sufficient statistics of input features with respect to class labels. Motivated by this finding, we propose an encoding-decoding perspective: forward propagation encodes inputs into class space, while VJP in backpropagation decodes this encoding back to feature space. Therefore, we propose Spatial Information Bottleneck (S-IB) to spatially disentangle information flow. By maximizing mutual information between foreground VJP and inputs while minimizing mutual information in background regions, S-IB encourages networks to encode information only in class-relevant spatial regions. Since post-hoc explanation methods fundamentally derive from VJP computations, directly optimizing VJP’s spatial structure during training improves visualization quality across diverse explanation paradigms. Experiments on five benchmarks demonstrate universal improvements across six explanation methods, achieving better foreground concentration and background suppression without method-specific tuning, alongside consistent classification accuracy gains.
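The VJP that S-IB operates on is exactly what backpropagation computes. For a toy linear model the Jacobian is explicit, which makes the foreground/background energy split easy to illustrate; the model, mask, and dimensions here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 8))     # toy "network": logits = W @ x, so the Jacobian is W
x = rng.standard_normal(8)
logits = W @ x                      # forward pass: encode the input into class space

def vjp(v):
    """Vector-Jacobian product for x -> W @ x: backprop computes W.T @ v,
    i.e. it decodes a class-space direction back into feature space."""
    return W.T @ v

g = vjp(np.array([1.0, 0.0, 0.0]))  # decode the class-0 direction

fg = np.zeros(8, dtype=bool)
fg[:4] = True                       # hypothetical foreground mask over input dims
fg_energy = float(np.sum(g[fg] ** 2))
bg_energy = float(np.sum(g[~fg] ** 2))
# An S-IB-style objective would drive fg_energy up and bg_energy down,
# concentrating the decoded signal in class-relevant regions.
```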

[152] GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow

Rui Wan, Qi Zheng, Ruoyu Zhang, Bu Chen, Jiaming Liu, Min Li, Minge Jing, Jinjia Zhou, Yibo Fan

Main category: cs.CV

TL;DR: Proposes FPGA-based deployment of Animation-based Generative Codec for edge devices, achieving 24.9× and 4.1× higher energy efficiency vs CPU and GPU respectively.

Motivation: AGC deployment on edge devices faces challenges due to high parameter count, algorithm inflexibility, and power consumption from extensive computations and data transmission.

Method: Network compression via static quantization and layer fusion, co-processor paradigm with hardware processing units (convolution, grid sampling, upsample), and parallelization optimization strategies like double-buffered pipelines and loop unrolling.

Result: Achieved 24.9× higher energy efficiency vs CPU and 4.1× vs GPU, requiring only 11.7 μJ per reconstructed pixel on PYNQ-Z1 FPGA platform.

Conclusion: FPGA-based deployment scheme enables efficient AGC implementation on resource-constrained edge devices with significantly improved energy efficiency.

Abstract: The Animation-based Generative Codec (AGC) is an emerging paradigm for talking-face video compression. However, deploying its intricate decoder on resource and power-constrained edge devices presents challenges due to numerous parameters, the inflexibility to adapt to dynamically evolving algorithms, and the high power consumption induced by extensive computations and data transmission. This paper for the first time proposes a novel field programmable gate arrays (FPGAs)-oriented AGC deployment scheme for edge-computing video services. Initially, we analyze the AGC algorithm and employ network compression methods including post-training static quantization and layer fusion techniques. Subsequently, we design an overlapped accelerator utilizing the co-processor paradigm to perform computations through software-hardware co-design. The hardware processing unit comprises engines such as convolution, grid sampling, upsample, etc. Parallelization optimization strategies like double-buffered pipelines and loop unrolling are employed to fully exploit the resources of the FPGA. Ultimately, we establish an AGC FPGA prototype on the PYNQ-Z1 platform using the proposed scheme, achieving 24.9$\times$ and 4.1$\times$ higher energy efficiency against a commercial Central Processing Unit (CPU) and Graphic Processing Unit (GPU), respectively. Specifically, only 11.7 microjoules (μJ) are required for one pixel reconstructed by this FPGA system.
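A quick back-of-envelope check on the reported per-pixel figure; this ignores any fixed per-frame overhead, which the abstract does not break out:

```python
ENERGY_PER_PIXEL_J = 11.7e-6   # 11.7 uJ per reconstructed pixel, as reported

def frame_energy_j(width, height):
    """Energy to reconstruct one frame at the reported per-pixel cost,
    ignoring any fixed per-frame overhead."""
    return width * height * ENERGY_PER_PIXEL_J

e_256 = frame_energy_j(256, 256)   # about 0.77 J for one 256x256 frame
```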

[153] Deep Learning for Metabolic Rate Estimation from Biosignals: A Comparative Study of Architectures and Signal Selection

Sarvenaz Babakhani, David Remy, Alina Roitberg

Main category: cs.CV

TL;DR: Systematic evaluation of deep learning vs classical methods for energy expenditure estimation from physiological signals, finding transformers with minute ventilation achieve best performance (0.87 W/kg RMSE), with significant variability across activities and subjects.

Motivation: Existing deep learning approaches for energy expenditure estimation rarely disentangle neural architecture effects from signal choice effects, and there is limited systematic comparison between classical and deep learning methods.

Method: Compared classical baselines with neural architectures (CNN, ResNet, transformers) across single signals, signal pairs, and grouped sensor inputs from Hexoskin smart shirt for diverse physical activities.

Result: Transformer with minute ventilation achieved lowest overall RMSE (0.87 W/kg). Paired/grouped signals worked well with faster models like CNN/ResNet. Per-activity analysis showed better results in low-intensity activities (RMSE down to 0.29 W/kg) but higher errors in intense tasks. Strong inter-individual variability observed.

Conclusion: Minute ventilation is most predictive individual signal, transformers perform best overall, but subject-level variability motivates need for adaptive modeling strategies.

Abstract: Energy expenditure estimation aims to infer human metabolic rate from physiological signals such as heart rate, respiration, or accelerometer data, and has been studied primarily with classical regression methods. The few existing deep learning approaches rarely disentangle the role of neural architecture from that of signal choice. In this work, we systematically evaluate both aspects. We compare classical baselines with newer neural architectures across single signals, signal pairs, and grouped sensor inputs for diverse physical activities. Our results show that minute ventilation is the most predictive individual signal, with a transformer model achieving the lowest root mean square error (RMSE) of 0.87 W/kg across all activities. Paired and grouped signals, such as those from the Hexoskin smart shirt (five signals), offer good alternatives for faster models like CNN and ResNet with attention. Per-activity evaluation revealed mixed outcomes: notably better results in low-intensity activities (RMSE down to 0.29 W/kg; NRMSE = 0.04), while higher-intensity tasks showed larger RMSE but more comparable normalized errors. Finally, subject-level analysis highlights strong inter-individual variability, motivating the need for adaptive modeling strategies. Our code and models will be publicly available at https://github.com/Sarvibabakhani/deeplearning-biosignals-ee .
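The reported metrics are standard; a minimal reference implementation follows. The range-based NRMSE normalization shown is one common convention, as the abstract does not state the paper's exact choice:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, in the target's units (here W/kg)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def nrmse(y_true, y_pred):
    """RMSE normalized by the observed range of the ground truth."""
    y_true = np.asarray(y_true, float)
    return rmse(y_true, y_pred) / float(y_true.max() - y_true.min())

y = np.array([2.0, 4.0, 6.0, 8.0])   # metabolic rate, W/kg (made-up values)
p = np.array([2.5, 3.5, 6.5, 7.5])   # model predictions
# rmse(y, p) == 0.5; nrmse(y, p) == 0.5 / 6
```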

[154] Enriching Knowledge Distillation with Cross-Modal Teacher Fusion

Amir M. Mansourian, Amir Mohammad Babaei, Shohreh Kasaei

Main category: cs.CV

TL;DR: RichKD enhances knowledge distillation by fusing traditional teacher outputs with CLIP’s vision-language knowledge, improving accuracy, confidence, and robustness.

Motivation: Existing multi-teacher KD methods lack knowledge diversity by relying only on visual information, missing the potential of cross-modal representations from vision-language models like CLIP.

Method: Proposes RichKD framework that fuses logits and features from conventional teacher with CLIP’s multi-prompt textual guidance, capturing both dataset-specific and semantically enriched visual cues.

Result: Outperforms most existing baselines across multiple benchmarks, increases confident-correct predictions, reduces confidently wrong cases, improves inter-class consistency, and shows stronger robustness under distribution shifts and input corruptions.

Conclusion: Incorporating CLIP’s vision-language knowledge as complementary supervision significantly enhances knowledge distillation effectiveness, demonstrating the value of cross-modal representations in KD.

Abstract: Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP’s vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP’s multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
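The logit-fusion idea can be sketched as a convex combination of conventional-teacher and CLIP zero-shot logits feeding a softened KD target. The weight `alpha`, the temperature, and the fusion rule are illustrative assumptions; RichKD also fuses features, which is omitted here:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, float) / T
    z = z - z.max()                   # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def fuse_logits(teacher_logits, clip_logits, alpha=0.5):
    """Convex combination of conventional-teacher and CLIP zero-shot logits."""
    return alpha * np.asarray(teacher_logits, float) \
        + (1.0 - alpha) * np.asarray(clip_logits, float)

teacher = np.array([4.0, 1.0, 0.0])   # conventional teacher: confident on class 0
clip_zs = np.array([2.0, 2.5, 0.0])   # CLIP finds class 1 semantically close
fused = fuse_logits(teacher, clip_zs)
target = softmax(fused, T=2.0)        # softened target for the KD loss
# The fused supervision keeps class 0 on top but raises the semantically
# related non-target class 1, improving inter-class consistency.
```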

[155] DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

Shengqi Dang, Fu Chai, Jiaxin Li, Chao Yuan, Wei Ye, Nan Cao

Main category: cs.CV

TL;DR: DensiCrafter generates lightweight, self-supporting 3D hollow structures by optimizing density fields with physics-based constraints, achieving up to 43% material reduction while maintaining geometric fidelity.

Motivation: Existing 3D generative models ignore physical constraints and manufacturability, particularly the need for lightweight and self-supporting structures for reliable fabrication.

Method: Optimizes continuous density fields from coarse voxel grids using three differentiable, physics-constrained loss terms and mass regularization, while preserving outer surfaces. Integrates with pretrained Trellis-based models.

Result: Achieves up to 43% material mass reduction in text-to-3D tasks, improves structural stability, and maintains high geometric fidelity compared to state-of-the-art baselines.

Conclusion: The method successfully generates manufacturable hollow designs that are self-supporting and can be reliably 3D-printed, bridging the gap between generative AI and physical fabrication constraints.

Abstract: The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method could improve the stability and maintain high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and could be self-supporting.
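The mass-regularization and restricted-domain ideas can be sketched on a toy voxel grid. The physics-constrained loss terms are omitted, and the weight and mask below are illustrative:

```python
import numpy as np

def mass_regularized_loss(density, interior_mask, w_mass=1.0):
    """Mass penalty averaged over the optimizable interior of a density
    field; voxels outside `interior_mask` (the outer surface) are held
    fixed, mirroring the restricted optimization domain."""
    density = np.clip(np.asarray(density, float), 0.0, 1.0)
    return w_mass * float(density[interior_mask].mean())

grid = np.ones((4, 4, 4))              # solid coarse voxel grid (fully dense)
interior = np.zeros_like(grid, dtype=bool)
interior[1:-1, 1:-1, 1:-1] = True      # only interior voxels may be hollowed
loss_solid = mass_regularized_loss(grid, interior)    # 1.0
grid[interior] = 0.1                   # hollowing the interior lowers the penalty
loss_hollow = mass_regularized_loss(grid, interior)   # 0.1
```

In the full method this penalty is traded off against the differentiable stability losses, so material is removed only where the structure stays self-supporting.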

[156] DualFete: Revisiting Teacher-Student Interactions from a Feedback Perspective for Semi-supervised Medical Image Segmentation

Le Yi, Wei Huang, Lei Zhang, Kefu Zhao, Yan Wang, Zizhou Wang

Main category: cs.CV

TL;DR: Introduces a feedback mechanism in teacher-student semi-supervised learning to address error propagation in medical image segmentation, featuring a dual-teacher model for dynamic error correction.

Motivation: Traditional teacher-student frameworks in medical image segmentation are vulnerable to error propagation due to image ambiguities, leading to self-reinforcing bias that existing methods do not adequately address.

Method: Proposes a feedback mechanism where the student provides feedback on teacher’s pseudo-labels, with two key components: feedback attributor and feedback receiver. Further develops a dual-teacher feedback model for dynamic error correction through cross-teacher supervision.

Result: Comprehensive evaluations on three medical image benchmarks demonstrate effectiveness in addressing error propagation in semi-supervised medical image segmentation.

Conclusion: The feedback mechanism successfully counteracts error reconfirmations in teacher-student frameworks, with the dual-teacher model providing additional gains by resolving disagreements and avoiding consistent errors.

Abstract: The teacher-student paradigm has emerged as a canonical framework in semi-supervised learning. When applied to medical image segmentation, the paradigm faces challenges due to inherent image ambiguities, making it particularly vulnerable to erroneous supervision. Crucially, the student’s iterative reconfirmation of these errors leads to self-reinforcing bias. While some studies attempt to mitigate this bias, they often rely on external modifications to the conventional teacher-student framework, overlooking its intrinsic potential for error correction. In response, this work introduces a feedback mechanism into the teacher-student framework to counteract error reconfirmations. Here, the student provides feedback on the changes induced by the teacher’s pseudo-labels, enabling the teacher to refine these labels accordingly. We specify that this interaction hinges on two key components: the feedback attributor, which designates pseudo-labels triggering the student’s update, and the feedback receiver, which determines where to apply this feedback. Building on this, a dual-teacher feedback model is further proposed, which allows more dynamics in the feedback loop and fosters more gains by resolving disagreements through cross-teacher supervision while avoiding consistent errors. Comprehensive evaluations on three medical image benchmarks demonstrate the method’s effectiveness in addressing error propagation in semi-supervised medical image segmentation.

[157] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang

Main category: cs.CV

TL;DR: FQ-PETR is a fully quantized framework for PETR-based 3D detection models that addresses quantization challenges through three innovations: QFPE for feature alignment, DULUT for non-linear function approximation, and QANS for attention stabilization, achieving near-floating-point accuracy with 75% latency reduction.

Motivation: PETR models excel in multi-view 3D detection but face deployment challenges due to high computational cost and memory footprint. Direct quantization causes severe accuracy degradation due to multi-modal feature disparity and inefficient non-linear operator quantization.

Method: Three key innovations: (1) QFPE replaces multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding; (2) DULUT approximates complex non-linear functions using two cascaded linear lookup tables; (3) QANS performs quantization after softmax numerical stabilization.

Result: FQ-PETR achieves near-floating-point accuracy (only 1% degradation) under W8A8 quantization while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines on PETR variants.

Conclusion: FQ-PETR successfully enables efficient deployment of PETR-based 3D detection models through a comprehensive quantization framework that addresses key challenges in multi-modal feature alignment and non-linear operator quantization.

Abstract: Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features-specifically, image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g. PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
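DULUT's hardware-friendly primitive, piecewise-linear table lookup, can be illustrated on a standalone non-linearity. A single uniform 64-entry table is shown here; the paper's two-stage cascade and its entry allocation differ:

```python
import numpy as np

def make_lut(fn, lo, hi, n):
    """Sample a function at n uniformly spaced breakpoints."""
    xs = np.linspace(lo, hi, n)
    return xs, fn(xs)

def lut_eval(x, xs, ys):
    """Piecewise-linear table lookup, the hardware-friendly primitive
    that LUT-based non-linear operator approximation builds on."""
    return np.interp(x, xs, ys)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
xs, ys = make_lut(sigmoid, -8.0, 8.0, 64)    # a single 64-entry table
x = np.linspace(-8.0, 8.0, 1000)
err = float(np.max(np.abs(lut_eval(x, xs, ys) - sigmoid(x))))
# max error stays well below 1e-2 despite only 64 entries
```

Cascading a second table lets the first one warp the input axis so entries concentrate where curvature is highest, which is how a two-stage design keeps fidelity with few entries.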

[158] Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving Infrared Small Target Detection

Houzhang Fang, Shukai Guo, Qiuhuan Chen, Yi Chang, Luxin Yan

Main category: cs.CV

TL;DR: TDCNet is a novel moving infrared small target detection network that combines temporal difference and 3D convolution through re-parameterized TDC blocks to capture multi-scale motion contextual features while suppressing background clutter.

Motivation: Moving IRSTD is challenging due to weak target features and complex background interference. Existing methods either use temporal differences (limited spatial feature extraction) or 3D convolutions (which lack explicit motion awareness), creating a need for better spatio-temporal feature modeling.

Method: Proposes TDCNet with: 1) Temporal Difference Convolution (TDC) re-parameterization module with three parallel TDC blocks that fuse temporal difference and 3D convolution; 2) TDC-guided spatio-temporal attention mechanism that performs cross-attention between TDC-based and 3D backbone features.

Result: Extensive experiments on IRSTD-UAV and public infrared datasets show TDCNet achieves state-of-the-art detection performance in moving target detection.

Conclusion: The proposed TDCNet effectively extracts and enhances spatio-temporal features for accurate moving infrared small target detection by combining the strengths of both temporal difference and 3D convolution approaches.

Abstract: Moving infrared small target detection (IRSTD) plays a critical role in practical applications, such as surveillance of unmanned aerial vehicles (UAVs) and UAV-based search systems. Moving IRSTD remains highly challenging due to weak target features and complex background interference. Accurate spatio-temporal feature modeling is crucial for moving target detection, typically achieved through either temporal differences or spatio-temporal (3D) convolutions. Temporal difference can explicitly leverage motion cues but exhibits limited capability in extracting spatial features, whereas 3D convolution effectively represents spatio-temporal features yet lacks explicit awareness of motion dynamics along the temporal dimension. In this paper, we propose a novel moving IRSTD network (TDCNet), which effectively extracts and enhances spatio-temporal features for accurate target detection. Specifically, we introduce a novel temporal difference convolution (TDC) re-parameterization module that comprises three parallel TDC blocks designed to capture contextual dependencies across different temporal ranges. Each TDC block fuses temporal difference and 3D convolution into a unified spatio-temporal convolution representation. This re-parameterized module can effectively capture multi-scale motion contextual features while suppressing pseudo-motion clutter in complex backgrounds, significantly improving detection performance. Moreover, we propose a TDC-guided spatio-temporal attention mechanism that performs cross-attention between the spatio-temporal features from the TDC-based backbone and a parallel 3D backbone. This mechanism models their global semantic dependencies to refine the current frame’s features. Extensive experiments on IRSTD-UAV and public infrared datasets demonstrate that our TDCNet achieves state-of-the-art detection performance in moving target detection.
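The explicit motion cue that temporal differencing provides, and why it needs spatial convolution on top, can be sketched with a frame-difference map followed by a fixed box filter standing in for a learned convolution. All of this is illustrative, not TDCNet's actual architecture:

```python
import numpy as np

def temporal_difference_map(frames):
    """Explicit motion cue: per-pixel difference between consecutive frames,
    followed by a 3x3 box filter as a crude stand-in for the learned
    spatial convolution a TDC block would apply."""
    frames = np.asarray(frames, float)       # (T, H, W)
    diffs = frames[1:] - frames[:-1]         # (T-1, H, W): zero where nothing moves
    k = np.ones((3, 3)) / 9.0
    out = np.empty_like(diffs)
    for t in range(diffs.shape[0]):
        padded = np.pad(diffs[t], 1, mode="edge")
        out[t] = sum(
            k[i, j] * padded[i:i + diffs.shape[1], j:j + diffs.shape[2]]
            for i in range(3) for j in range(3)
        )
    return out

static = np.ones((4, 8, 8))                 # unchanging background
moving = static.copy()
moving[2, 3, 3] = 5.0                       # a small bright target appears in frame 2
resp = temporal_difference_map(moving)
# static scenes give zero response; the appearing target lights up in resp[1]
```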

[159] Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

Yang Chen, Miaoge Li, Zhijie Rao, Deze Zeng, Song Guo, Jingcai Guo

Main category: cs.CV

TL;DR: Flora is a zero-shot skeleton action recognition method that addresses fragile alignment and rigid classifiers through flexible neighbor-aware semantic attunement and open-form distribution-aware flow classification.

Motivation: Existing zero-shot skeleton action recognition methods suffer from fragile point-to-point alignment due to imperfect semantics and rigid classifiers with static decision boundaries, limiting their performance.

Method: Uses flexible neighbor-aware semantic attunement with cross-modal geometric consistency for robust alignment, and employs noise-free flow matching with condition-free contrastive regularization for distribution-aware classification.

Result: Achieves impressive performance on three benchmark datasets, even when trained with only 10% of seen data, demonstrating strong zero-shot recognition capabilities.

Conclusion: Flora effectively addresses alignment and classification challenges in zero-shot skeleton action recognition through its novel semantic attunement and flow-based classification approach.

Abstract: Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an “align-then-classify” paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed Flora, which builds upon FlexibLe neighbOr-aware semantic attunement and open-form distRibution-aware flow clAssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.

[160] OUGS: Active View Selection via Object-aware Uncertainty Estimation in 3DGS

Haiyi Li, Qi Chen, Denis Kalkofen, Hsiang-Ting Chen

Main category: cs.CV

TL;DR: OUGS introduces an object-aware uncertainty formulation for 3D Gaussian Splatting that uses physical Gaussian parameters and semantic masks to improve active reconstruction of specific objects in complex scenes.

DetailsMotivation: Existing active reconstruction methods use scene-level uncertainty metrics that are biased by background clutter, leading to inefficient view selection for object-centric tasks.

Method: Derives uncertainty directly from explicit physical parameters of 3D Gaussian primitives (position, scale, rotation) by propagating covariance through rendering Jacobian, then integrates semantic segmentation masks for object-aware uncertainty scoring.

Result: Significantly improves efficiency of 3DGS reconstruction process and achieves higher quality for targeted objects compared to state-of-the-art methods, while serving as robust uncertainty estimator for global scene.

Conclusion: OUGS provides a principled, physically-grounded uncertainty formulation that enables more effective object-centric active view selection and reconstruction in 3D Gaussian Splatting.

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have achieved state-of-the-art results for novel view synthesis. However, efficiently capturing high-fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene-level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object-centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically-grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object-aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state-of-the-art methods, while also serving as a robust uncertainty estimator for the global scene.
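The core covariance-propagation idea (Var[f] ≈ J Σ Jᵀ through the rendering Jacobian) can be illustrated on a toy 1-D "splat". Everything below, from the two-parameter renderer to the finite-difference Jacobian, is a simplified stand-in for the paper's full 3DGS formulation:

```python
import numpy as np

# Toy 1-D "splat": renders a Gaussian bump from theta = (position, log_scale).
xs = np.linspace(-3, 3, 50)

def render(theta):
    mu, log_s = theta
    s = np.exp(log_s)
    return np.exp(-0.5 * ((xs - mu) / s) ** 2)

def pixel_uncertainty(theta, sigma_theta, eps=1e-4):
    """First-order propagation: Var[f] ~ diag(J @ Sigma @ J.T),
    with J the finite-difference Jacobian of the renderer."""
    J = np.empty((xs.size, len(theta)))
    for i in range(len(theta)):
        d = np.zeros(len(theta)); d[i] = eps
        J[:, i] = (render(theta + d) - render(theta - d)) / (2 * eps)
    return np.einsum('pi,ij,pj->p', J, sigma_theta, J)

theta = np.array([0.0, 0.0])
sigma_theta = np.diag([0.05, 0.01])        # parameter covariance (assumed)
u = pixel_uncertainty(theta, sigma_theta)  # per-pixel variance

# Object-aware score: restrict the uncertainty to a semantic mask.
mask = np.abs(xs) < 1.0
object_score = u[mask].sum()
```

Masking the propagated uncertainty with a semantic mask is the mechanism that disentangles object uncertainty from background clutter when ranking candidate views.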

[161] BronchOpt : Vision-Based Pose Optimization with Fine-Tuned Foundation Models for Accurate Bronchoscopy Navigation

Hongchao Shu, Roger D. Soberanis-Mukul, Jiru Xu, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

Main category: cs.CV

TL;DR: A vision-based framework for bronchoscope tip localization using 2D-3D registration between endoscopic views and CT scans, with a new synthetic benchmark dataset for standardized evaluation.

DetailsMotivation: Existing vision-based methods fail to generalize across domains and patients due to respiratory motion, anatomical variability, and CT-to-body divergence, leading to alignment errors in bronchoscopy navigation.

Method: Proposes a vision-based pose optimization framework with modality- and domain-invariant encoder for similarity computation between real RGB frames and CT-rendered depth maps, using differentiable rendering for iterative camera pose refinement.

Result: Achieves average translational error of 2.65 mm and rotational error of 0.19 rad when trained on synthetic data, with strong cross-domain generalization on real patient data without domain-specific adaptation.

Conclusion: The framework provides robust, domain-invariant localization through iterative vision-based optimization, while the new benchmark enables standardized progress in vision-based bronchoscopy navigation.

Abstract: Accurate intra-operative localization of the bronchoscope tip relative to patient anatomy remains challenging due to respiratory motion, anatomical variability, and CT-to-body divergence that cause deformation and misalignment between intra-operative views and pre-operative CT. Existing vision-based methods often fail to generalize across domains and patients, leading to residual alignment errors. This work establishes a generalizable foundation for bronchoscopy navigation through a robust vision-based framework and a new synthetic benchmark dataset that enables standardized and reproducible evaluation. We propose a vision-based pose optimization framework for frame-wise 2D-3D registration between intra-operative endoscopic views and pre-operative CT anatomy. A fine-tuned modality- and domain-invariant encoder enables direct similarity computation between real endoscopic RGB frames and CT-rendered depth maps, while a differentiable rendering module iteratively refines camera poses through depth consistency. To enhance reproducibility, we introduce the first public synthetic benchmark dataset for bronchoscopy navigation, addressing the lack of paired CT-endoscopy data. Trained exclusively on synthetic data distinct from the benchmark, our model achieves an average translational error of 2.65 mm and a rotational error of 0.19 rad, demonstrating accurate and stable localization. Qualitative results on real patient data further confirm strong cross-domain generalization, achieving consistent frame-wise 2D-3D alignment without domain-specific adaptation. Overall, the proposed framework achieves robust, domain-invariant localization through iterative vision-based optimization, while the new benchmark provides a foundation for standardized progress in vision-based bronchoscopy navigation.
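The iterative pose refinement via depth consistency reduces, in spirit, to gradient descent on a rendered-vs-observed depth loss. The sketch below shrinks the problem to a single hypothetical degree of freedom (camera distance) and a trivial renderer; the real framework refines a full 6-DoF pose through a differentiable renderer:

```python
import numpy as np

# Toy scene: depth map is a fixed offset per pixel plus camera distance z.
scene = np.array([1.0, 1.5, 2.0, 2.5])

def render_depth(z):
    return scene + z

z_true = 3.0
d_obs = render_depth(z_true)      # "observed" depth

z = 0.0                           # initial pose estimate
lr, eps = 0.1, 1e-5
for _ in range(200):
    # Depth-consistency loss and its finite-difference gradient.
    loss = lambda zz: np.mean((render_depth(zz) - d_obs) ** 2)
    g = (loss(z + eps) - loss(z - eps)) / (2 * eps)
    z -= lr * g

print(round(z, 3))  # 3.0 -- the estimate converges to the true pose
```

In the actual method the gradient comes from differentiable rendering rather than finite differences, and the per-frame similarity is computed in the fine-tuned encoder's feature space rather than on raw depth.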

[162] Hand Held Multi-Object Tracking Dataset in American Football

Rintaro Otsubo, Kanta Sawafuji, Hideo Saito

Main category: cs.CV

TL;DR: Created the first American football player tracking dataset and benchmarked detection/tracking methods, showing improved performance with fine-tuned models in crowded scenarios.

DetailsMotivation: No standardized dataset exists for American football player tracking despite unique challenges like frequent occlusion and physical contact, making fair method comparisons difficult.

Method: Constructed dedicated detection/tracking dataset for American football players and conducted comparative evaluation of various detection and tracking methods with fine-tuned models.

Result: Fine-tuned detection models outperformed pre-trained models, and integrating fine-tuned detectors with re-identification models significantly improved tracking accuracy in crowded scenarios.

Conclusion: The dataset enables robust detection and tracking of American football players in challenging, high-density scenarios previously underserved by conventional methods.

Abstract: Multi-Object Tracking (MOT) plays a critical role in analyzing player behavior from videos, enabling performance evaluation. Current MOT methods are often evaluated using publicly available datasets. However, most of these focus on everyday scenarios such as pedestrian tracking or are tailored to specific sports, including soccer and basketball. Despite the inherent challenges of tracking players in American football, such as frequent occlusion and physical contact, no standardized dataset has been publicly available, making fair comparisons between methods difficult. To address this gap, we constructed the first dedicated detection and tracking dataset for American football players and conducted a comparative evaluation of various detection and tracking methods. Our results demonstrate that accurate detection and tracking can be achieved even in crowded scenarios. Fine-tuning detection models improved performance over pre-trained models. Furthermore, when these fine-tuned detectors and re-identification models were integrated into tracking systems, we observed notable improvements in tracking accuracy compared to existing approaches. This work thus enables robust detection and tracking of American football players in challenging, high-density scenarios previously underserved by conventional methods.

[163] Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models

Ying Peng, Hongsen Ye, Changxin Huang, Xiping Hu, Jian Chen, Runhao Zeng

Main category: cs.CV

TL;DR: Dual-Teacher Knowledge Distillation framework that uses both ViT and CNN teachers to guide a lightweight CNN student, addressing architectural mismatch in cross-architecture knowledge distillation.

DetailsMotivation: Vision Transformers (ViTs) have strong performance but high computational cost, while lightweight CNNs are efficient but less accurate. Existing cross-architecture knowledge distillation methods struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers.

Method: Proposes a Dual-Teacher Knowledge Distillation framework with two key components: (1) Discrepancy-Aware Teacher Weighting that dynamically fuses ViT and CNN teacher predictions based on confidence and student discrepancy, and (2) Structure Discrepancy-Aware Distillation where student learns residual features between ViT and CNN teachers via lightweight auxiliary branch.

Result: Extensive experiments on HMDB51, EPIC-KITCHENS-100, and Kinetics-400 show consistent outperformance of state-of-the-art distillation approaches, with maximum accuracy gain of 5.95% on HMDB51.

Conclusion: The proposed dual-teacher framework effectively addresses architectural mismatch in cross-architecture knowledge distillation and achieves significant performance improvements for lightweight CNN video action recognition.

Abstract: Vision Transformers (ViTs) have achieved strong performance in video action recognition, but their high computational cost limits their practicality. Lightweight CNNs are more efficient but suffer from accuracy gaps. Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs, yet existing methods often struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers. To tackle these challenges, we propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher to collaboratively guide a lightweight CNN student. We introduce two key components: (1) Discrepancy-Aware Teacher Weighting, which dynamically fuses the predictions from ViT and CNN teachers by assigning adaptive weights based on teacher confidence and prediction discrepancy with the student, enabling more informative and effective supervision; and (2) a Structure Discrepancy-Aware Distillation strategy, where the student learns the residual features between ViT and CNN teachers via a lightweight auxiliary branch, focusing on transferable architectural differences without mimicking all of ViT’s high-dimensional patterns. Extensive experiments on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate that our method consistently outperforms state-of-the-art distillation approaches, achieving notable performance improvements with a maximum accuracy gain of 5.95% on HMDB51.
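The Discrepancy-Aware Teacher Weighting can be sketched as a per-sample fusion of the two teachers' predictions. The exact scoring formula below (confidence discounted by KL discrepancy from the student) is a hypothetical illustration of the mechanism, not the paper's equation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_teachers(p_vit, p_cnn, p_student):
    """Score each teacher by confidence (max probability) discounted
    by its KL discrepancy from the student, then fuse the two
    predictions with normalized per-sample weights."""
    def score(p_t):
        conf = p_t.max(axis=-1)
        kl = (p_student * (np.log(p_student + 1e-12)
                           - np.log(p_t + 1e-12))).sum(axis=-1)
        return conf / (1.0 + kl)
    s = np.stack([score(p_vit), score(p_cnn)])   # (2, batch)
    w = s / s.sum(axis=0, keepdims=True)         # adaptive weights
    return w[0, :, None] * p_vit + w[1, :, None] * p_cnn

rng = np.random.default_rng(0)
p_vit = softmax(rng.normal(size=(4, 10)))
p_cnn = softmax(rng.normal(size=(4, 10)))
p_student = softmax(rng.normal(size=(4, 10)))
p_fused = fuse_teachers(p_vit, p_cnn, p_student)  # valid distributions
```

The fused distribution would then serve as the soft target in the student's distillation loss, alongside the residual-feature branch for the structural discrepancy term.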

[164] DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation

Jerrin Bright, Yuhao Chen, John S. Zelek

Main category: cs.CV

TL;DR: DreamPose3D is a diffusion-based 3D human pose estimation framework that combines action-aware reasoning with temporal imagination, achieving state-of-the-art performance by using action prompts, kinematic joint affinity, and hallucinative pose decoding.

DetailsMotivation: Existing 3D pose estimation methods rely solely on geometric cues and predict poses independently, limiting their ability to resolve ambiguous motions and generalize to real-world scenarios. Humans understand motion through action awareness and temporal reasoning.

Method: DreamPose3D uses a diffusion-based framework with: 1) Action prompts extracted from 2D pose sequences for dynamic conditioning, 2) Representation encoder with kinematic joint affinity in attention mechanism, 3) Hallucinative pose decoder for temporally coherent 3D pose sequence prediction.

Result: State-of-the-art performance on Human3.6M and MPI-3DHP datasets across all metrics. Strong performance on broadcast baseball dataset despite ambiguous and noisy 2D inputs, effectively handling temporal consistency and intent-driven motion variations.

Conclusion: DreamPose3D successfully integrates action-aware reasoning with temporal imagination, demonstrating that combining high-level intent understanding with structural joint relationships significantly improves 3D pose estimation accuracy and robustness in challenging real-world scenarios.

Abstract: Accurate 3D human pose estimation remains a critical yet unresolved challenge, requiring both temporal coherence across frames and fine-grained modeling of joint relationships. However, most existing methods rely solely on geometric cues and predict each 3D pose independently, which limits their ability to resolve ambiguous motions and generalize to real-world scenarios. Inspired by how humans understand and anticipate motion, we introduce DreamPose3D, a diffusion-based framework that combines action-aware reasoning with temporal imagination for 3D pose estimation. DreamPose3D dynamically conditions the denoising process using task-relevant action prompts extracted from 2D pose sequences, capturing high-level intent. To model the structural relationships between joints effectively, we introduce a representation encoder that incorporates kinematic joint affinity into the attention mechanism. Finally, a hallucinative pose decoder predicts temporally coherent 3D pose sequences during training, simulating how humans mentally reconstruct motion trajectories to resolve ambiguity in perception. Extensive experiments on benchmarked Human3.6M and MPI-3DHP datasets demonstrate state-of-the-art performance across all metrics. To further validate DreamPose3D’s robustness, we tested it on a broadcast baseball dataset, where it demonstrated strong performance despite ambiguous and noisy 2D inputs, effectively handling temporal consistency and intent-driven motion variations.

[165] vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long

Main category: cs.CV

TL;DR: vMFCoOp is a framework that uses von Mises-Fisher distributions on a hyperspherical manifold to align semantic biases between LLMs and CLIP models, achieving robust biomedical prompting and superior few-shot classification across medical datasets.

DetailsMotivation: Prompt learning in biomedical VLMs faces challenges from semantic misalignment between LLMs and CLIP variants due to different training data and architectures, lacks scalability across evolving foundation models, and conventional Euclidean-space optimization amplifies modality gaps in complex biomedical imaging.

Method: The framework inversely estimates von Mises-Fisher distributions on a shared Hyperspherical Manifold, aligning semantic biases via Unified Semantic Anchors with three complementary constraints for robust biomedical prompting.

Result: vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability.

Conclusion: The proposed framework achieves superior biomedical prompting and few-shot classification, with plans for continuous expansion to more downstream applications and resource sharing through GitHub.

Abstract: Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work will be continuously expanded to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
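Fitting a von Mises-Fisher distribution to unit-norm embeddings is the hyperspherical building block here. A minimal moment-based sketch, using the standard Banerjee et al. approximation for the concentration κ (the paper's inverse estimation procedure may differ):

```python
import numpy as np

def fit_vmf(X):
    """Moment estimate of a von Mises-Fisher distribution on the
    unit sphere: mean direction mu and concentration kappa
    (Banerjee et al. approximation)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to sphere
    m = X.mean(axis=0)
    r_bar = np.linalg.norm(m)
    mu = m / r_bar                                     # mean direction
    d = X.shape[1]
    kappa = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)  # concentration
    return mu, kappa

rng = np.random.default_rng(0)
# Tightly clustered embeddings -> large kappa; spread-out -> small kappa.
base = np.array([1.0, 0.0, 0.0])
tight = base + 0.05 * rng.normal(size=(200, 3))
loose = rng.normal(size=(200, 3))
mu_t, k_t = fit_vmf(tight)
mu_l, k_l = fit_vmf(loose)
```

Intuitively, each class's vMF mean direction plays the role of a semantic anchor on the shared hypersphere, and κ measures how concentrated the class's embeddings are around it.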

[166] RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, Neehar Peri

Main category: cs.CV

TL;DR: RF-DETR is a lightweight specialist detection transformer that uses neural architecture search to find optimal accuracy-latency tradeoffs for target datasets, significantly outperforming existing methods on COCO and Roboflow100-VL.

DetailsMotivation: Open-vocabulary detectors perform well on COCO but struggle with real-world datasets containing out-of-distribution classes not seen during pre-training, highlighting the need for efficient domain adaptation.

Method: Fine-tunes a pre-trained base network on target datasets and evaluates thousands of network configurations with different accuracy-latency tradeoffs using weight-sharing neural architecture search without retraining.

Result: RF-DETR (nano) achieves 48.0 AP on COCO (5.3 AP improvement over D-FINE at similar latency), and RF-DETR (2x-large) outperforms GroundingDINO by 1.2 AP on Roboflow100-VL while running 20x faster.

Conclusion: RF-DETR establishes new state-of-the-art performance for real-time detectors, with RF-DETR (2x-large) being the first real-time detector to surpass 60 AP on COCO, demonstrating effective domain adaptation through neural architecture search.

Abstract: Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavy-weight vision-language model (VLM) for new domains, we introduce RF-DETR, a light-weight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the “tunable knobs” for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20x as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO. Our code is at https://github.com/roboflow/rf-detr
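Selecting configurations on an accuracy-latency Pareto curve, as the NAS step does after evaluating thousands of candidates, reduces to a dominance filter. A minimal sketch with made-up (latency, AP) pairs:

```python
def pareto_front(points):
    """points: list of (latency, accuracy) pairs. Returns the subset
    not dominated by any other point (a point dominates another if
    it is at least as fast AND at least as accurate)."""
    front = []
    for lat, acc in points:
        dominated = any(l2 <= lat and a2 >= acc and (l2, a2) != (lat, acc)
                        for l2, a2 in points)
        if not dominated:
            front.append((lat, acc))
    return sorted(front)

# Hypothetical (latency ms, AP) measurements for sampled configurations.
configs = [(2.0, 45.0), (3.0, 48.0), (2.5, 44.0), (4.0, 48.5), (3.5, 47.0)]
print(pareto_front(configs))  # [(2.0, 45.0), (3.0, 48.0), (4.0, 48.5)]
```

Weight sharing is what makes this cheap: each candidate is evaluated by slicing the fine-tuned supernetwork, so the whole curve is traced without any retraining.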

[167] ACDC: The Adverse Conditions Dataset with Correspondences for Robust Semantic Driving Scene Perception

Christos Sakaridis, Haoran Wang, Ke Li, René Zurbrügg, Arpit Jadon, Wim Abbeloos, Daniel Olmeda Reino, Luc Van Gool, Dengxin Dai

Main category: cs.CV

TL;DR: ACDC is a large-scale driving dataset with 8012 images, half in adverse conditions (fog, night, rain, snow), featuring pixel-level panoptic annotations and corresponding normal-condition images to advance robust visual perception for autonomous driving.

DetailsMotivation: Existing driving datasets lack sufficient adverse condition images and scale, limiting development of robust perception systems needed for Level-5 autonomous driving.

Method: Created ACDC dataset with 8012 images (4006 adverse, 4006 normal), featuring pixel-level panoptic annotations, corresponding scene pairs, and uncertainty masks for four adverse conditions: fog, nighttime, rain, and snow.

Result: Dataset enables evaluation of semantic segmentation, object detection, instance segmentation, panoptic segmentation, and uncertainty-aware semantic segmentation, revealing challenges for state-of-the-art methods in adverse conditions.

Conclusion: ACDC provides a valuable benchmark for steering future progress in robust visual perception for autonomous driving under adverse conditions.

Abstract: Level-5 driving automation requires a robust visual perception system that can parse input images under any condition. However, existing driving datasets for dense semantic perception are either dominated by images captured under normal conditions or are small in scale. To address this, we introduce ACDC, the Adverse Conditions Dataset with Correspondences for training and testing methods for diverse semantic perception tasks on adverse visual conditions. ACDC consists of a large set of 8012 images, half of which (4006) are equally distributed between four common adverse conditions: fog, nighttime, rain, and snow. Each adverse-condition image comes with a high-quality pixel-level panoptic annotation, a corresponding image of the same scene under normal conditions, and a binary mask that distinguishes between intra-image regions of clear and uncertain semantic content. 1503 of the corresponding normal-condition images feature panoptic annotations, raising the total annotated images to 5509. ACDC supports the standard tasks of semantic segmentation, object detection, instance segmentation, and panoptic segmentation, as well as the newly introduced uncertainty-aware semantic segmentation. A detailed empirical study demonstrates the challenges that the adverse domains of ACDC pose to state-of-the-art supervised and unsupervised approaches and indicates the value of our dataset in steering future progress in the field. Our dataset and benchmark are publicly available at https://acdc.vision.ee.ethz.ch

[168] Background Invariance Testing According to Semantic Proximity

Zukang Liao, Min Chen

Main category: cs.CV

TL;DR: A method for testing background invariance in ML models using semantic proximity and ontology-based scene selection to improve testing diversity and human annotation consistency.

DetailsMotivation: Testing background invariance is challenging due to vast data spaces, and current methods using human analysts lack systematic sampling approaches for consistent decisions.

Method: Uses semantic proximity to select background scenes, constructs ontology for object relationships via association analysis, and enables efficient search for diverse semantic distances.

Result: Achieves superior balance between diversity and human annotation consistency compared to random sampling, nearest neighbors, or VLM-based test suites.

Conclusion: The ontology-based approach enhances reliability and comprehensiveness of background invariance testing by providing systematic and meaningful test suite selection.

Abstract: In many applications, machine-learned (ML) models are required to hold some invariance qualities, such as rotation, size, and intensity invariance. Among these, testing for background invariance presents a significant challenge due to the vast and complex data space it encompasses. To evaluate invariance qualities, we first use a visualization-based testing framework that allows human analysts to assess and make informed decisions about the invariance properties of ML models. We show that such an informative testing framework is preferable, as ML models with the same global statistics (e.g., accuracy scores) can behave differently and exhibit different visualized testing patterns. However, human analysts may not reach consistent decisions without a systematic sampling approach for selecting representative test suites. In this work, we present a technical solution for selecting background scenes according to their semantic proximity to a target image that contains a foreground object being tested. We construct an ontology for storing knowledge about relationships among different objects using association analysis. This ontology enables an efficient and meaningful search for background scenes at different semantic distances from a target image, enabling the selection of a test suite that is both diverse and reasonable. Compared with other testing techniques, e.g., random sampling, nearest neighbors, or test suites sampled by vision-language models (VLMs), our method achieved a superior balance between diversity and consistency of human annotations, thereby enhancing the reliability and comprehensiveness of background invariance testing.
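One simple way to derive a co-occurrence-based semantic distance, in the spirit of the association analysis described above, is a Jaccard distance over the scenes in which two objects appear. The toy scene list and the choice of Jaccard are illustrative assumptions, not the paper's ontology construction:

```python
# Hypothetical scene-object occurrence data.
scenes = [
    {"dog", "grass", "tree"},
    {"dog", "sofa", "tv"},
    {"car", "road", "tree"},
    {"dog", "grass", "car"},
]

def jaccard_distance(a, b, scenes):
    """Semantic distance from co-occurrence: objects that share many
    scenes are close (distance near 0); objects that never co-occur
    are maximally distant (distance 1)."""
    sa = {i for i, s in enumerate(scenes) if a in s}
    sb = {i for i, s in enumerate(scenes) if b in s}
    return 1.0 - len(sa & sb) / len(sa | sb)

d_near = jaccard_distance("dog", "grass", scenes)  # co-occur often -> small
d_far = jaccard_distance("dog", "road", scenes)    # never co-occur -> 1.0
```

A test suite for a foreground object would then sample background scenes spanning a range of such distances, giving the diversity-with-consistency balance the paper targets.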

[169] Adjacent-view Transformers for Supervised Surround-view Depth Estimation

Xianda Guo, Wenjie Yuan, Yunpeng Zhang, Tian Yang, Chenming Zhang, Zheng Zhu, Qin Zou, Long Chen

Main category: cs.CV

TL;DR: AVT-SSDepth is a supervised surround-view depth estimation method that uses transformer-based attention mechanisms to jointly predict depth maps across multiple surrounding cameras, achieving state-of-the-art performance on autonomous driving datasets.

DetailsMotivation: Existing monocular depth estimation methods mainly focus on front-view cameras and ignore correlations across surround-view cameras, which are crucial for comprehensive 3D perception in robotics and autonomous driving.

Method: Proposes a global-to-local feature extraction combining CNN with transformer layers, and adjacent-view attention mechanism with self-attention (intra-view) and adjacent attention (inter-view) modules to exchange multi-scale representations across surround-view cameras.

Result: Achieves superior performance over state-of-the-art methods on both DDAD and nuScenes datasets, with strong cross-dataset generalization capability.

Conclusion: AVT-SSDepth effectively addresses surround-view depth estimation by leveraging transformer-based attention mechanisms to capture both intra-view and inter-view correlations, demonstrating significant improvements in multi-camera depth prediction for autonomous driving applications.

Abstract: Depth estimation has been widely studied and serves as the fundamental step of 3D perception for robotics and autonomous driving. Though significant progress has been made in monocular depth estimation in the past decades, these attempts are mainly conducted on the KITTI benchmark with only front-view cameras, which ignores the correlations across surround-view cameras. In this paper, we propose an Adjacent-View Transformer for Supervised Surround-view Depth estimation (AVT-SSDepth) to jointly predict the depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN with transformer layers for enriched representations. Further, the adjacent-view attention mechanism is proposed to enable intra-view and inter-view feature propagation. The former is achieved by the self-attention module within each view, while the latter is realized by the adjacent attention module, which computes attention across multiple cameras to exchange multi-scale representations across surround-view feature maps. In addition, AVT-SSDepth has strong cross-dataset generalization. Extensive experiments show that our method achieves superior performance over existing state-of-the-art methods on both DDAD and nuScenes datasets. Code is available at https://github.com/XiandaGuo/SSDepth.
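The intra-view plus adjacent-view attention pattern can be sketched as plain dot-product attention, where each camera attends to itself and to its left/right neighbours on the rig. The ring topology, shapes, and average-based fusion below are illustrative choices, not the paper's architecture:

```python
import numpy as np

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    a = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v

def adjacent_view_attention(feats):
    """feats: list of (tokens, dim) arrays, one per camera, ordered
    around the rig. Each view attends to itself (intra-view) and to
    its two neighbours (inter-view)."""
    n = len(feats)
    out = []
    for i, f in enumerate(feats):
        intra = attention(f, f, f)                       # self-attention
        left, right = feats[(i - 1) % n], feats[(i + 1) % n]
        inter = 0.5 * (attention(f, left, left)          # adjacent attention
                       + attention(f, right, right))
        out.append(0.5 * (intra + inter))
    return out

rng = np.random.default_rng(0)
views = [rng.normal(size=(6, 16)) for _ in range(6)]     # 6 surround cameras
fused = adjacent_view_attention(views)
```

Restricting the inter-view term to adjacent cameras keeps the cost linear in the number of views while still letting features flow around the whole rig over stacked layers.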

[170] Exploring the Adversarial Robustness of Face Forgery Detection with Decision-based Black-box Attacks

Zhaoyu Chen, Bo Li, Kaixun Jiang, Shuang Wu, Shouhong Ding, Wenqiang Zhang

Main category: cs.CV

TL;DR: The paper proposes novel decision-based attacks on face forgery detection systems using cross-task perturbation and frequency domain attacks to overcome initialization failures and maintain image quality.

DetailsMotivation: Face forgery detection systems are vulnerable to adversarial attacks, but existing attacks rely on network architectures or training data rather than predicted labels, creating a gap for attacking deployed applications.

Method: Proposed cross-task perturbation to handle initialization failures by leveraging face feature correlations across tasks, and frequency decision-based attack that adds perturbations in frequency domain while constraining spatial domain quality.

Result: Achieves state-of-the-art attack performance on FaceForensics++, CelebDF, and industrial APIs with high query efficiency and guaranteed image quality. Fake faces can bypass both forgery detection and face recognition.

Conclusion: The method exposes security vulnerabilities in face forgery detectors, demonstrating that fake faces can successfully evade detection while maintaining visual quality, highlighting serious security concerns.

Abstract: Face forgery generation technologies generate vivid faces, which have raised public concerns about security and privacy. Many intelligent systems, such as electronic payment and identity verification, rely on face forgery detection. Although face forgery detection has successfully distinguished fake faces, recent studies have demonstrated that face forgery detectors are very vulnerable to adversarial examples. Meanwhile, existing attacks rely on network architectures or training datasets instead of the predicted labels, which leads to a gap in attacking deployed applications. To narrow this gap, we first explore the decision-based attacks on face forgery detection. We identify challenges in directly applying existing decision-based attacks, such as perturbation initialization failure and reduced image quality. To overcome these issues, we propose cross-task perturbation to handle initialization failures by utilizing the high correlation of face features on different tasks. Additionally, inspired by the use of frequency cues in face forgery detection, we introduce the frequency decision-based attack. This attack involves adding perturbations in the frequency domain while constraining visual quality in the spatial domain. Finally, extensive experiments demonstrate that our method achieves state-of-the-art attack performance on FaceForensics++, CelebDF, and industrial APIs, with high query efficiency and guaranteed image quality. Further, the fake faces generated by our method can pass face forgery detection and face recognition, which exposes the security problems of face forgery detectors.
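The "perturb in frequency, constrain in space" idea can be sketched with a plain FFT and an L_inf clip. The random-noise search direction and the specific quality constraint are simplifying assumptions; the paper's decision-based search and constraint may differ:

```python
import numpy as np

def frequency_perturb(img, step=0.5, linf=4.0 / 255, rng=None):
    """Add a perturbation in the Fourier domain, then clip in the
    spatial domain to preserve visual quality (an L_inf ball here)."""
    rng = rng or np.random.default_rng()
    F = np.fft.fft2(img)
    noise = rng.normal(size=F.shape) + 1j * rng.normal(size=F.shape)
    adv = np.real(np.fft.ifft2(F + step * noise))
    adv = np.clip(adv, img - linf, img + linf)   # spatial-quality constraint
    return np.clip(adv, 0.0, 1.0)

img = np.random.default_rng(0).uniform(size=(32, 32))   # toy grayscale image
adv = frequency_perturb(img, rng=np.random.default_rng(1))
```

In the full decision-based attack, candidate frequency perturbations like this one are accepted or rejected purely from the detector's predicted label, which is what makes the attack applicable to black-box deployed APIs.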

[171] Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships

Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen

Main category: cs.CV

TL;DR: This paper proposes multimodal adversarial training (MAT) to defend against multimodal attacks in vision-language tasks, addressing limitations of existing unimodal defenses and exploring how one-to-many image-text relationships can enhance robustness.

DetailsMotivation: Existing defense methods focus only on image classification and overlook multimodal attacks (both image and text perturbations) and the one-to-many relationships in vision-language tasks, leaving VL models vulnerable.

Method: Proposed multimodal adversarial training (MAT) that incorporates adversarial perturbations in both image and text modalities during training, and investigated diverse augmentation techniques leveraging one-to-many relationships.

Result: MAT significantly outperforms existing unimodal defenses, and analysis shows that effective defense requires augmented image-text pairs to be well-aligned, diverse, and avoid distribution shift.

Conclusion: This work pioneers defense strategies against multimodal attacks in VL tasks, providing insights for building robust vision-language models from both optimization and data perspectives.

Abstract: Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift – conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives.
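The core training idea, stripped to its essentials, is to perturb both modalities before each update. The sketch below is a minimal single-step illustration (not the paper's implementation): `loss_grad_fn` is a hypothetical callable returning the loss gradients with respect to the image and a continuous text embedding, and the FGSM step sizes are illustrative:

```python
import numpy as np

def fgsm_step(x, grad, eps):
    """One FGSM perturbation: move eps along the sign of the loss gradient."""
    return x + eps * np.sign(grad)

def mat_training_step(img, txt_emb, loss_grad_fn, eps_img=0.03, eps_txt=0.01):
    """One step of multimodal adversarial training (illustrative sketch):
    perturb BOTH modalities before computing the training loss."""
    g_img, g_txt = loss_grad_fn(img, txt_emb)
    adv_img = np.clip(fgsm_step(img, g_img, eps_img), 0.0, 1.0)  # valid pixel range
    adv_txt = fgsm_step(txt_emb, g_txt, eps_txt)
    return adv_img, adv_txt  # the model is then trained on these adversarial pairs

# Toy check with a constant gradient standing in for the real alignment loss.
rng = np.random.default_rng(0)
img, txt = rng.random((8, 8)), rng.normal(size=16)
adv_img, adv_txt = mat_training_step(img, txt, lambda i, t: (np.ones_like(i), np.ones_like(t)))
print(np.allclose(adv_txt, txt + 0.01))  # → True
```

The one-to-many augmentation the paper studies would enter here as multiple text embeddings per image, each perturbed and paired with the same adversarial image.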

[172] LMSeg: An end-to-end geometric message-passing network on barycentric dual graphs for large-scale landscape mesh segmentation

Zexian Huang, Kourosh Khoshelham, Martin Tomko

Main category: cs.CV

TL;DR: LMSeg is a lightweight deep graph network for 3D mesh segmentation that achieves state-of-the-art performance on urban and natural landscapes using only 2.4M parameters, while also introducing the BBW cultural heritage dataset.

Motivation: Existing 3D mesh segmentation methods struggle with scalability, end-to-end trainability, and accurately segmenting small/irregular objects in complex environments like cultural heritage sites.

Method: LMSeg uses a barycentric dual graph representation with Geometry Aggregation+ (GA+) module for adaptive neighborhood feature combination, and hierarchical-local dual pooling to balance global context with fine-detail preservation.

Result: Achieves 75.1% mIoU on SUM, 78.4% O.A. on H3D, and 62.4% mIoU on BBW dataset, demonstrating accurate segmentation of small objects (vehicles, vegetation) in cities and dry-stone walls in occluded rural landscapes.

Conclusion: The BBW dataset and LMSeg provide a practical, extensible solution for advancing 3D mesh segmentation in cultural heritage, environmental monitoring, and urban applications.

Abstract: Semantic segmentation of large-scale 3D landscape meshes is critical for geospatial analysis in complex environments, yet existing approaches face persistent challenges of scalability, end-to-end trainability, and accurate segmentation of small and irregular objects. To address these issues, we introduce the BudjBim Wall (BBW) dataset, a large-scale annotated mesh dataset derived from high-resolution LiDAR scans of the UNESCO World Heritage-listed Budj Bim cultural landscape in Victoria, Australia. The BBW dataset captures historic dry-stone wall structures that are difficult to detect under vegetation occlusion, supporting research in underrepresented cultural heritage contexts. Building on this dataset, we propose LMSeg, a deep graph message-passing network for semantic segmentation of large-scale meshes. LMSeg employs a barycentric dual graph representation of mesh faces and introduces the Geometry Aggregation+ (GA+) module, a learnable softmax-based operator that adaptively combines neighborhood features and captures high-frequency geometric variations. A hierarchical-local dual pooling integrates hierarchical and local geometric aggregation to balance global context with fine-detail preservation. Experiments on three large-scale benchmarks (SUM, H3D, and BBW) show that LMSeg achieves 75.1% mIoU on SUM, 78.4% O.A. on H3D, and 62.4% mIoU on BBW, using only 2.4M lightweight parameters. In particular, LMSeg demonstrates accurate segmentation across both urban and natural scenes, capturing small-object classes such as vehicles and high vegetation in complex city environments, while also reliably detecting dry-stone walls in dense, occluded rural landscapes. Together, the BBW dataset and LMSeg provide a practical and extensible method for advancing 3D mesh segmentation in cultural heritage, environmental monitoring, and urban applications.

[173] Improving Adversarial Transferability with Neighbourhood Gradient Information

Haijing Guo, Jiafeng Wang, Zhaoyu Chen, Kaixun Jiang, Lingyi Hong, Pinxue Guo, Jinglun Li, Wenqiang Zhang

Main category: cs.CV

TL;DR: NGI-Attack enhances adversarial example transferability in black-box attacks by leveraging neighborhood gradient information through Example Backtracking and Multiplex Mask strategies.

Motivation: To narrow the performance gap between surrogate and target models in black-box adversarial attacks by improving the transferability of adversarial examples.

Method: NGI-Attack uses Example Backtracking to accumulate neighborhood gradient information as initial momentum, and Multiplex Mask to create multi-way attacks focusing on non-discriminative regions for richer gradient information.

Result: Achieves 95.2% average attack success rate against defense models, significantly enhancing transferability without extra time costs.

Conclusion: The method effectively exploits neighborhood gradient information to improve adversarial transferability and can be seamlessly integrated with existing algorithms.

Abstract: Deep neural networks (DNNs) are known to be susceptible to adversarial examples, leading to significant performance degradation. In black-box attack scenarios, a considerable attack performance gap between the surrogate model and the target model persists. This work focuses on enhancing the transferability of adversarial examples to narrow this performance gap. We observe that the gradient information around the clean image, i.e., Neighbourhood Gradient Information (NGI), can offer high transferability. Based on this insight, we introduce NGI-Attack, incorporating Example Backtracking and Multiplex Mask strategies to exploit this gradient information and enhance transferability. Specifically, we first adopt Example Backtracking to accumulate Neighbourhood Gradient Information as the initial momentum term. Then, we utilize Multiplex Mask to form a multi-way attack strategy that forces the network to focus on non-discriminative regions, which can obtain richer gradient information during only a few iterations. Extensive experiments demonstrate that our approach significantly enhances adversarial transferability. In particular, when attacking numerous defense models, we achieve an average attack success rate of 95.2%. Notably, our method can seamlessly integrate with any off-the-shelf algorithm, enhancing their attack performance without incurring extra time costs.
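To make the Example Backtracking idea concrete, here is a heavily simplified sketch: gradients gathered near the clean image seed the momentum term of an otherwise standard momentum-iterative attack. `grad_fn` stands in for a surrogate-model gradient, the backtracking rule and all step sizes are illustrative, and the Multiplex Mask branch is omitted entirely:

```python
import numpy as np

def ngi_attack(x, grad_fn, eps=0.06, alpha=0.01, backtrack_steps=3, iters=5, mu=1.0):
    """Illustrative NGI-style attack: accumulate neighbourhood gradients
    as the initial momentum, then run momentum-iterative FGSM."""
    # Example Backtracking (sketch): probe points near the clean image.
    momentum = np.zeros_like(x)
    point = x.copy()
    for _ in range(backtrack_steps):
        g = grad_fn(point)
        momentum = mu * momentum + g / (np.abs(g).sum() + 1e-12)  # L1-normalised
        point = point - alpha * np.sign(g)  # step back toward lower loss
    # Momentum-iterative attack, seeded with the accumulated momentum.
    adv = x.copy()
    for _ in range(iters):
        g = grad_fn(adv)
        momentum = mu * momentum + g / (np.abs(g).sum() + 1e-12)
        adv = np.clip(adv + alpha * np.sign(momentum), x - eps, x + eps)
    return np.clip(adv, 0.0, 1.0)

x = np.full((4, 4), 0.5)
adv = ngi_attack(x, lambda z: np.ones_like(z))
print(np.abs(adv - x).max() <= 0.06)  # → True (stays inside the eps-ball)
```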

[174] LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation

Longyun Liao, Rong Zheng

Main category: cs.CV

TL;DR: LangPose is a 2D-to-3D human pose lifting framework that leverages semantic action knowledge through text-motion alignment to resolve depth ambiguity and occlusion issues, achieving state-of-the-art performance.

Motivation: Existing methods relying solely on spatial and temporal consistency are insufficient for resolving depth ambiguity and occlusion, especially with significant occlusions or high dynamic actions. Semantic information from action knowledge provides a complementary signal to disambiguate these challenging cases.

Method: Two-stage framework: 1) Pretraining stage learns action recognition and 3D pose reconstruction from masked/noisy 2D poses with text-motion alignment, 2) Fine-tuning stage refines the model using real-world 3D pose datasets without action labels. Incorporates masked body parts and time windows to encourage semantic information usage.

Result: Achieves SOTA performance: MPJPE of 36.7mm on Human3.6M with detected 2D poses and 15.5mm on MPI-INF-3DHP with ground-truth 2D poses.

Conclusion: Leveraging semantic action knowledge through text-motion alignment effectively resolves ambiguity in 2D-to-3D pose lifting, demonstrating superior performance over methods relying only on spatial and temporal consistency.

Abstract: 2D-to-3D human pose lifting is an ill-posed problem due to depth ambiguity and occlusion. Existing methods relying on spatial and temporal consistency alone are insufficient to resolve these problems especially in the presence of significant occlusions or high dynamic actions. Semantic information, however, offers a complementary signal that can help disambiguate such cases. To this end, we propose LangPose, a framework that leverages action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. LangPose operates in two stages: pretraining and fine-tuning. In the pretraining stage, the model simultaneously learns to recognize actions and reconstruct 3D poses from masked and noisy 2D poses. During the fine-tuning stage, the model is further refined using real-world 3D human pose estimation datasets without action labels. Additionally, our framework incorporates masked body parts and masked time windows in motion modeling, encouraging the model to leverage semantic information when spatial and temporal consistency is unreliable. Experiments demonstrate the effectiveness of LangPose, achieving SOTA level performance in 3D pose estimation on public datasets, including Human3.6M and MPI-INF-3DHP. Specifically, LangPose achieves an MPJPE of 36.7mm on Human3.6M with detected 2D poses as input and 15.5mm on MPI-INF-3DHP with ground-truth 2D poses as input.

[175] CART: Compositional Auto-Regressive Transformer for Image Generation

Siddharth Roheda, Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal

Main category: cs.CV

TL;DR: CART is an Auto-Regressive image generation method that models images as hierarchical compositions of interpretable visual layers, outperforming traditional approaches through semantically meaningful decompositions.

Motivation: While AR models have succeeded in language modeling, replicating this success in vision tasks is challenging due to spatial dependencies in images. The paper aims to address these unique vision challenges.

Method: CART adds image details iteratively via three decomposition strategies: Base-Detail Decomposition (Mumford-Shah smoothness), Intrinsic Decomposition (albedo/shading), and Specularity Decomposition (diffuse/specular).

Result: The next-detail strategy outperforms traditional next-token and next-scale approaches, improving controllability, semantic interpretability, and resolution scalability. Experiments show visually compelling results.

Conclusion: CART enables structured image manipulation and opens new directions for controllable generative modeling through physically or perceptually motivated image factorization.

Abstract: We propose a novel Auto-Regressive (AR) image generation approach that models images as hierarchical compositions of interpretable visual layers. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks remains challenging due to inherent spatial dependencies in images. Addressing the unique challenges of vision tasks, our method (CART) adds image details iteratively via semantically meaningful decompositions. We demonstrate the flexibility and generality of CART by applying it across three distinct decomposition strategies: (i) Base-Detail Decomposition (Mumford-Shah smoothness), (ii) Intrinsic Decomposition (albedo/shading), and (iii) Specularity Decomposition (diffuse/specular). This next-detail strategy outperforms traditional next-token and next-scale approaches, improving controllability, semantic interpretability, and resolution scalability. Experiments show CART generates visually compelling results while enabling structured image manipulation, opening new directions for controllable generative modeling via physically or perceptually motivated image factorization.

[176] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Ronghuan Wu, Wanchao Su, Jing Liao

Main category: cs.CV

TL;DR: Chat2SVG is a hybrid framework combining LLMs and image diffusion models for text-to-SVG generation, addressing limitations in shape regularity and expressiveness through semantic template generation and dual-stage optimization.

Motivation: Creating high-quality SVG content requires technical expertise and time. Existing text-to-SVG methods have limitations in shape regularity, generalization, and expressiveness.

Method: Uses LLM to generate semantic SVG templates from geometric primitives, then employs dual-stage optimization with image diffusion guidance to refine paths in latent space and adjust point coordinates.

Result: Outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Enables intuitive editing through natural language instructions.

Conclusion: Chat2SVG makes professional vector graphics creation accessible to all users by combining LLM semantic understanding with diffusion model visual guidance.

Abstract: Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.

[177] The Visual Counter Turing Test (VCT2): A Benchmark for Evaluating AI-Generated Image Detection and the Visual AI Index (VAI)

Nasrin Imanpour, Abhilekh Borah, Shashwat Bajpai, Subhankar Ghosh, Sainath Reddy Sankepally, Hasnat Md Abdullah, Nishoak Kosaraju, Shreyas Dixit, Ashhar Aziz, Shwetangshu Biswas, Vinija Jain, Aman Chadha, Song Wang, Amit Sheth, Amitava Das

Main category: cs.CV

TL;DR: The paper introduces VCT2, a comprehensive benchmark for AI-generated image detection, and finds that current detection methods perform poorly (58% accuracy) on images from modern text-to-image models. They also propose VAI, a perceptual realism metric that shows more realistic images are harder to detect.

Motivation: Growing concerns about misuse of AI-generated images in misinformation campaigns, and limitations of existing detection methods that overfit to known generators and fail on outputs from newer models.

Method: Created VCT2 benchmark with 166,000 real and synthetic images from 6 state-of-the-art T2I systems (SD2.1, SDXL, SD3, SD3.5, DALL-E 3, Midjourney 6) across two subsets: COCOAI (structured captions) and TwitterAI (narrative tweets). Evaluated 17 AGID models in zero-shot setting and proposed VAI metric based on 12 low-level visual features.

Result: Detection accuracy was alarmingly low: 58% on COCOAI and 58.34% on TwitterAI. VAI showed moderate inverse correlation with detection accuracy (Pearson -0.532 on COCOAI, -0.503 on TwitterAI), indicating more realistic images are harder to detect.

Conclusion: Current AI-generated image detection methods struggle with modern T2I systems, and there’s a need for more robust detection approaches. The VAI metric provides interpretable realism assessment, and the released benchmark will support future research in generalized detection and perceptual realism.

Abstract: The rapid progress and widespread availability of text-to-image (T2I) generative models have heightened concerns about the misuse of AI-generated visuals, particularly in the context of misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and falter on outputs from newer or unseen models. We introduce the Visual Counter Turing Test (VCT2), a comprehensive benchmark of 166,000 images, comprising both real and synthetic prompt-image pairs produced by six state-of-the-art T2I systems: Stable Diffusion 2.1, SDXL, SD3 Medium, SD3.5 Large, DALL-E 3, and Midjourney 6. We curate two distinct subsets: COCOAI, featuring structured captions from MS COCO, and TwitterAI, containing narrative-style tweets from The New York Times. Under a unified zero-shot evaluation, we benchmark 17 leading AGID models and observe alarmingly low detection accuracy: 58% on COCOAI and 58.34% on TwitterAI. To transcend binary classification, we propose the Visual AI Index (VAI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features, enabling us to quantify and rank the perceptual quality of generated outputs with greater nuance. Correlation analysis reveals a moderate inverse relationship between VAI and detection accuracy: Pearson correlations of -0.532 on COCOAI and -0.503 on TwitterAI, suggesting that more visually realistic images tend to be harder to detect, a trend observed consistently across generators. We release COCOAI, TwitterAI, and all code to catalyze future advances in generalized AGID and perceptual realism assessment.

[178] CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Xiao An, Jiaxing Sun, Zihan Gui, Wei He

Main category: cs.CV

TL;DR: CHOICE is a comprehensive benchmark for evaluating Large Vision-Language Models’ hierarchical capabilities in remote sensing, covering perception and reasoning across 6 dimensions and 23 tasks with 10,507 quality-controlled problems.

Motivation: Despite rapid advancement of VLMs in remote sensing, there's a lack of systematic benchmarks to objectively evaluate their capabilities in this specialized domain.

Method: Created CHOICE benchmark through rigorous data collection from 50 globally distributed cities, constructing 10,507 multiple-choice questions with definitive answers covering 2 primary dimensions (perception, reasoning), 6 secondary dimensions, and 23 leaf tasks.

Result: Evaluation of 3 proprietary and 21 open-source VLMs revealed their critical limitations in remote sensing context, demonstrating the need for specialized benchmarks.

Conclusion: CHOICE serves as a valuable resource providing deeper insights into VLMs’ challenges and potential in remote sensing, and will be publicly released to advance the field.

Abstract: The rapid advancement of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing, has demonstrated exceptional perception and reasoning capabilities in Earth observation tasks. However, a benchmark for systematically evaluating their capabilities in this domain is still lacking. To bridge this gap, we propose CHOICE, an extensive benchmark designed to objectively evaluate the hierarchical remote sensing capabilities of VLMs. Focusing on 2 primary capability dimensions essential to remote sensing: perception and reasoning, we further categorize 6 secondary dimensions and 23 leaf tasks to ensure a well-rounded assessment coverage. CHOICE guarantees the quality of all 10,507 problems through a rigorous process of data collection from 50 globally distributed cities, question construction and quality control. The newly curated data and the format of multiple-choice questions with definitive answers allow for an objective and straightforward performance assessment. Our evaluation of 3 proprietary and 21 open-source VLMs highlights their critical limitations within this specialized context. We hope that CHOICE will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing. We will release CHOICE at this https URL.

[179] Robust Bayesian Scene Reconstruction with Retrieval-Augmented Priors for Precise Grasping and Planning

Herbert Wright, Weiming Zhi, Martin Matak, Matthew Johnson-Roberson, Tucker Hermans

Main category: cs.CV

TL;DR: BRRP is a probabilistic 3D reconstruction method that uses retrieval-augmented priors from mesh databases to reconstruct multi-object scenes from single RGBD images, enabling estimation of occluded geometry and uncertainty measurement.

Motivation: Traditional scene representation methods cannot infer unobserved object geometry, and deep learning approaches are brittle to noise and novel objects while lacking well-calibrated reconstruction confidences.

Method: Leverages preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction, using retrieval-augmented priors where relevant components are retrieved from an object database during inference.

Result: BRRP produces distributions over object shape for reconstruction and uncertainty measurement, demonstrating robustness against deep learning-only approaches and higher accuracy than methods without informative priors.

Conclusion: The method enables successful dexterous manipulation in clutter through real-world experiments, showing improved reconstruction capabilities for robotics applications.

Abstract: Constructing 3D representations of object geometry is critical for many robotics tasks, particularly manipulation problems. These representations must be built from potentially noisy partial observations. In this work, we focus on the problem of reconstructing a multi-object scene from a single RGBD image using a fixed camera. Traditional scene representation methods generally cannot infer the geometry of unobserved regions of the objects in the image. Attempts have been made to leverage deep learning to train on a dataset of known objects and representations, and then generalize to new observations. However, this can be brittle to noisy real-world observations and objects not contained in the dataset, and does not provide well-calibrated reconstruction confidences. We propose BRRP, a reconstruction method that leverages preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. We introduce the concept of a retrieval-augmented prior, where we retrieve relevant components of our prior distribution from a database of objects during inference. The resulting prior enables estimation of the geometry of occluded portions of the in-scene objects. Our method produces a distribution over object shape that can be used for reconstruction and measuring uncertainty. We evaluate our method in both simulated scenes and in the real world. We demonstrate the robustness of our method against deep learning-only approaches while being more accurate than a method without an informative prior. Through real-world experiments, we particularly highlight the capability of BRRP to enable successful dexterous manipulation in clutter.

[180] Redundant Queries in DETR-Based 3D Detection Methods: Unnecessary and Prunable

Lizhen Xu, Zehao Wu, Wenzhao Qiu, Shanmin Pang, Xiuxiu Bai, Kuizhi Mei, Jianru Xue

Main category: cs.CV

TL;DR: GPQ is a simple method that gradually prunes redundant object queries in 3D object detection models based on classification scores, reducing computational costs while maintaining performance.

Motivation: Query-based 3D object detection models use excessive queries beyond actual object counts, causing unnecessary computational and memory overhead.

Method: Gradually Pruning Queries (GPQ) incrementally removes queries based on classification scores, easily integrated as a fine-tuning step in existing query-based methods.

Result: GPQ reduces redundant queries while maintaining detection performance, achieving 1.31x speedup on desktop GPUs, 67.86% FLOPs reduction, and 76.38% inference time decrease on edge devices.

Conclusion: GPQ provides an effective and straightforward approach to optimize query-based 3D detectors by pruning redundant queries, significantly improving efficiency without compromising accuracy.

Abstract: Query-based models are extensively used in 3D object detection tasks, with a wide range of pre-trained checkpoints readily available online. However, despite their popularity, these models often require an excessive number of object queries, far surpassing the actual number of objects to detect. The redundant queries result in unnecessary computational and memory costs. In this paper, we find that not all queries contribute equally – a significant portion of queries has a much smaller impact compared to others. Based on this observation, we propose an embarrassingly simple approach called Gradually Pruning Queries (GPQ), which prunes queries incrementally based on their classification scores. It is straightforward to implement in any query-based method, as it can be seamlessly integrated as a fine-tuning step using an existing checkpoint after training. With GPQ, users can easily generate multiple models with fewer queries, starting from a checkpoint with an excessive number of queries. Experiments on various advanced 3D detectors show that GPQ effectively reduces redundant queries while maintaining performance. Using our method, model inference on desktop GPUs can be accelerated by up to 1.31x. Moreover, after deployment on edge devices, it achieves up to a 67.86% reduction in FLOPs and a 76.38% decrease in inference time. The code will be available at https://github.com/iseri27/Gpq.
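The pruning schedule itself is simple enough to sketch. The toy below keeps only the score-based selection: queries with the lowest classification scores are dropped a few at a time until a target count is reached. In the real method the detector is fine-tuned between pruning steps, which is omitted here, and `keep_ratio`/`steps` are illustrative parameters:

```python
import numpy as np

def prune_queries(query_scores, keep_ratio=0.5, steps=4):
    """Gradually prune object queries by classification score (sketch).
    Rather than dropping all low-scoring queries at once, remove a small
    batch per step, mimicking GPQ's incremental schedule."""
    idx = np.arange(len(query_scores))
    target = int(len(query_scores) * keep_ratio)
    per_step = max(1, (len(idx) - target) // steps)
    while len(idx) > target:
        order = np.argsort(query_scores[idx])   # lowest scores first
        drop = min(per_step, len(idx) - target)
        idx = idx[order[drop:]]                 # discard the weakest queries
    return np.sort(idx)

scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.15])
kept = prune_queries(scores, keep_ratio=0.5)
print(kept)  # → [0 2 4 6], the four highest-scoring queries
```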

[181] Synth-Align: Improving Trustworthiness in Vision-Language Model with Synthetic Preference Data Alignment

Robert Wijaya, Ngoc-Bao Nguyen, Ngai-Man Cheung

Main category: cs.CV

TL;DR: SynthAlign is a pipeline that generates synthetic human-preference image-text data for post-training alignment using DPO, leveraging reward models as human preference proxies to reduce hallucinations in LVLMs.

Motivation: Current LVLMs suffer from hallucinations that degrade performance and user experience. Existing alignment methods using synthetic data in multimodal settings are underexplored and typically rely on strong models or ground-truth models for data labeling.

Method: Proposes SynthAlign pipeline that generates synthetic human-preference image-text data specifically designed for post-training alignment with DPO, using reward models as proxies for human preference.

Result: Enhanced LLaVA-1.5-7B achieved: 87.6% POPE accuracy and 97.8% precision, MMHal-Bench score increased from 2.36 to 3.49, hallucination rate decreased from 51.0% to 25.0% (50.98% relative reduction).

Conclusion: SynthAlign effectively reduces hallucinations in LVLMs through synthetic data generation for preference alignment, demonstrating significant improvements in accuracy, precision, and hallucination reduction across multiple benchmarks.

Abstract: Large Vision-Language Models (LVLMs) have shown promising capabilities in understanding and generating information by integrating both visual and textual data. However, current models are still prone to hallucinations, which degrade the performance and greatly harm the user experience in real-world applications. Post-training alignment, particularly preference-tuning, is intended to align model outputs and behaviors (safety, instruction-following, style), ensuring robustness and adaptability to a wide range of tasks. The use of synthetic data for alignment, particularly in multimodal settings, remains underexplored. Existing approaches typically use a strong model or a ground-truth model (CLIP) to determine positive and negative image-text data points. This paper proposes SynthAlign, a pipeline to generate and collect synthetic human-preference image-text data with optimal control built specifically for post-training alignment with DPO. At the core of the framework is the utilization of reward models as a proxy of human preference. A series of evaluations and benchmarks is provided to validate the effectiveness of the proposed framework and the resulting dataset. Notably, our framework enhanced LLaVA-1.5-7B achieved substantial POPE improvements: 87.6% accuracy and 97.8% precision, MMHal-Bench score increased from 2.36 to 3.49, and hallucination rate decreased from 51.0% to 25.0% (a 50.98% relative reduction).
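The DPO objective that the synthetic preference pairs feed into is standard and worth stating. For one pair with chosen response w and rejected response l, the loss is -log sigmoid(beta * ((log p_w - log p_l)_policy - (log p_w - log p_l)_ref)). The snippet below computes it for scalar log-probabilities (placeholders for sums of per-token log-likelihoods); it illustrates the loss only, not SynthAlign's data pipeline:

```python
import numpy as np

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair."""
    margin = beta * ((logp_w_policy - logp_l_policy) - (logp_w_ref - logp_l_ref))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# The loss falls as the policy prefers the chosen response more strongly
# than the frozen reference model does.
print(dpo_loss(-5.0, -9.0, -6.0, -6.0) < dpo_loss(-6.0, -6.0, -6.0, -6.0))  # → True
```

In SynthAlign the reward model's judgment, standing in for human preference, decides which response of each synthetic pair plays the role of w.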

[182] GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm

Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen

Main category: cs.CV

TL;DR: GreedyPixel is a black-box adversarial attack method that uses greedy pixel-wise optimization guided by surrogate models and query feedback to achieve high precision and sparsity without gradient information.

Motivation: Existing black-box adversarial attack methods face a trade-off between precision and flexibility: sparse pixel attacks lack adaptability while patch/frequency-based attacks sacrifice precision for efficiency.

Method: Performs brute-force-style per-pixel greedy optimization using surrogate-derived priority maps and query feedback, evaluating each coordinate directly without gradients to guarantee monotonic loss reduction.

Result: Achieved state-of-the-art success rates on CIFAR-10 and ImageNet with visually imperceptible perturbations, bridging the gap between black-box practicality and white-box performance.

Conclusion: GreedyPixel effectively combines black-box practicality with white-box-level precision and pixel-wise sparsity, providing a powerful tool for evaluating neural network robustness.

Abstract: Deep neural networks are highly vulnerable to adversarial examples, which are inputs with small, carefully crafted perturbations that cause misclassification – making adversarial attacks a critical tool for evaluating robustness. Existing black-box methods typically entail a trade-off between precision and flexibility: pixel-sparse attacks (e.g., single- or few-pixel attacks) provide fine-grained control but lack adaptability, whereas patch- or frequency-based attacks improve efficiency or transferability, but at the cost of producing larger and less precise perturbations. We present GreedyPixel, a fine-grained black-box attack method that performs brute-force-style, per-pixel greedy optimization guided by a surrogate-derived priority map and refined by means of query feedback. It evaluates each coordinate directly without any gradient information, guaranteeing monotonic loss reduction and convergence to a coordinate-wise optimum, while also yielding near white-box-level precision and pixel-wise sparsity and perceptual quality. On the CIFAR-10 and ImageNet datasets, spanning convolutional neural networks (CNNs) and Transformer models, GreedyPixel achieved state-of-the-art success rates with visually imperceptible perturbations, effectively bridging the gap between black-box practicality and white-box performance. The implementation is available at https://github.com/azrealwang/greedypixel.
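The greedy loop is easy to illustrate. The sketch below visits pixels in priority order (the real method derives the priority map from a surrogate model), tries a +/-eps change at each coordinate, and keeps it only if the queried loss decreases, so loss reduction is monotonic by construction. `loss_fn`, the toy target, and all parameters are illustrative:

```python
import numpy as np

def greedy_pixel(x, loss_fn, priority, eps=0.1, max_queries=200):
    """GreedyPixel-style sketch: per-pixel greedy search guided by a
    priority map, using only loss queries (no gradients)."""
    adv = x.copy()
    best = loss_fn(adv)
    order = np.argsort(priority.ravel())[::-1]   # highest priority first
    queries = 0
    for flat in order:
        if queries >= max_queries:
            break
        i, j = np.unravel_index(flat, x.shape)
        for step in (eps, -eps):
            cand = adv.copy()
            cand[i, j] = np.clip(x[i, j] + step, 0.0, 1.0)
            queries += 1
            val = loss_fn(cand)
            if val < best:                        # greedy acceptance
                adv, best = cand, val
                break
    return adv, best

# Toy objective: drive pixels toward 1 (loss = distance from all-ones).
x = np.zeros((3, 3))
adv, best = greedy_pixel(x, lambda z: np.abs(1.0 - z).sum(), np.ones((3, 3)))
print(best < 9.0)  # → True: loss strictly reduced from the clean image's 9.0
```

Because changes are per-pixel and bounded by eps, the resulting perturbation is sparse and tightly controlled, which is the precision/sparsity property the paper emphasizes.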

[183] Domain Adaptation from Generated Multi-Weather Images for Unsupervised Maritime Object Classification

Dan Song, Shumeng Huo, Wenhui Li, Lanjun Wang, Chao Xue, An-An Liu

Main category: cs.CV

TL;DR: A novel domain adaptation approach using synthetic dataset AIMO to address long-tail distribution and domain shift in real maritime object classification, enhanced by CLIP features and curriculum learning.

DetailsMotivation: Existing unsupervised methods struggle with long-tail data distributions in maritime object categories and weather conditions, limiting classification performance.

Method: Constructed synthetic dataset AIMO with balanced categories, proposed domain adaptation from AIMO to real dataset RMO, enhanced with CLIP features, and implemented curriculum learning using difficulty scores.

Result: Significantly improved classification accuracy, especially for rare object categories and weather conditions.

Conclusion: The proposed method effectively addresses long-tail distribution and domain shift issues in maritime object classification using synthetic data and domain adaptation techniques.

Abstract: The classification and recognition of maritime objects are crucial for enhancing maritime safety, monitoring, and intelligent sea environment prediction. However, existing unsupervised methods for maritime object classification often struggle with the long-tail data distributions in both object categories and weather conditions. In this paper, we construct a dataset named AIMO produced by large-scale generative models with diverse weather conditions and balanced object categories, and collect a dataset named RMO with real-world images where the long-tail issue exists. We propose a novel domain adaptation approach that leverages AIMO (source domain) to address the problem of limited labeled data, unbalanced distribution and domain shift in RMO (target domain), enhance the generalization of source features with vision-language models such as CLIP, and propose a difficulty score for curriculum learning to optimize the training process. Experimental results show that the proposed method significantly improves the classification accuracy, particularly for samples within rare object categories and weather conditions. Datasets and code will be publicly available at https://github.com/honoria0204/AIMO.
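The curriculum-learning ingredient can be illustrated with a minimal sketch; the difficulty scores are assumed given (the paper defines its own score), and the easy-to-hard staging below is a generic pattern, not the exact AIMO schedule:

```python
import numpy as np

def curriculum_order(scores):
    """Return sample indices sorted easy-to-hard by difficulty score
    (lower = easier). The scoring function itself is assumed external."""
    return np.argsort(scores)

def curriculum_stages(indices, n_stages):
    """Reveal the training pool gradually: stage k trains on the
    easiest (k+1)/n_stages fraction of the sorted samples."""
    n = len(indices)
    return [indices[: int(np.ceil(n * (k + 1) / n_stages))] for k in range(n_stages)]
```

Early stages see only low-difficulty samples, and the final stage covers the full dataset, which is the usual way a difficulty score is turned into a training schedule.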

[184] Improved Wildfire Spread Prediction with Time-Series Data and the WSTS+ Benchmark

Saad Lahrichi, Jake Bova, Jesse Johnson, Jordan Malof

Main category: cs.CV

TL;DR: This paper evaluates various data-driven wildfire modeling strategies and achieves state-of-the-art accuracy for both single-day and multi-day wildfire spread prediction, while also creating an expanded benchmark (WSTS+) with more historical data.

DetailsMotivation: To systematically investigate and compare existing data-driven wildfire modeling strategies under controlled conditions to identify the best approaches for accurate wildfire spread prediction.

Method: Evaluated multiple data-driven wildfire modeling strategies using both single-day and multi-day (time-series) input scenarios on the WildfireSpreadTS benchmark, and created an expanded WSTS+ benchmark with additional historical data.

Result: Achieved state-of-the-art accuracy for both single-day and multi-day input scenarios, with time-series input models obtaining the best overall accuracy. Created WSTS+ benchmark that doubles historical data and expands geographic scope.

Conclusion: Time-series input models provide superior accuracy for wildfire spread prediction, and the expanded WSTS+ benchmark represents the largest public dataset for time-series-based wildfire spread prediction research.

Abstract: Recent research has demonstrated the potential of deep neural networks (DNNs) to accurately predict wildfire spread on a given day based upon high-dimensional explanatory data from a single preceding day, or from a time series of T preceding days. For the first time, we investigate a large number of existing data-driven wildfire modeling strategies under controlled conditions, revealing the best modeling strategies and resulting in models that achieve state-of-the-art (SOTA) accuracy for both single-day and multi-day input scenarios, as evaluated on a large public benchmark for next-day wildfire spread, termed the WildfireSpreadTS (WSTS) benchmark. Consistent with prior work, we found that models using time-series input obtained the best overall accuracy, suggesting this is an important future area of research. Furthermore, we create a new benchmark, WSTS+, by incorporating four additional years of historical wildfire data into the WSTS benchmark. Our benchmark doubles the number of unique years of historical data, expands its geographic scope, and, to our knowledge, represents the largest public benchmark for time-series-based wildfire spread prediction.

[185] Surgical AI Copilot: Energy-Based Fourier Gradient Low-Rank Adaptation for Surgical LLM Agent Reasoning and Planning

Jiayuan Huang, Runlong He, Danyal Zaman Khan, Evangelos B. Mazomenos, Danail Stoyanov, Hani Marcus, Linzhe Jiang, Matthew J Clarkson, Mobarak I. Hoque

Main category: cs.CV

TL;DR: Surgical AI Copilot is an LLM agent for pituitary surgery that enables dynamic task planning, conversation, and execution of surgical tasks through efficient fine-tuning using the proposed DEFT-GaLore method and PitAgent dataset.

DetailsMotivation: Static AI models are inadequate for real-time surgical decision support, and current LLM agents lack surgical datasets and efficient fine-tuning methods for complex intraoperative reasoning.

Method: Developed PitAgent dataset for surgical context-aware planning and proposed DEFT-GaLore, a deterministic energy-based Fourier transform technique for efficient low-rank adaptation of LLMs like LLaMA 3.2 and Qwen 2.5.

Result: The agent demonstrated strong performance in surgical planning, prompt generation, and zero-shot surgical VQA, outperforming other low-rank adaptation methods.

Conclusion: The approach shows significant potential for developing efficient and scalable surgical LLM agents capable of real-time operative support in image-guided surgery.

Abstract: Image-guided surgery demands adaptive, real-time decision support, yet static AI models struggle with structured task planning and providing interactive guidance. Large language models (LLMs)-powered agents offer a promising solution by enabling dynamic task planning and predictive decision support. Despite recent advances, the absence of surgical agent datasets and robust parameter-efficient fine-tuning techniques limits the development of LLM agents capable of complex intraoperative reasoning. In this paper, we introduce Surgical AI Copilot, an LLM agent for image-guided pituitary surgery, capable of conversation, planning, and task execution in response to queries involving tasks such as MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging with intraoperative views, instrument tracking, and surgical visual question answering (VQA). To enable structured agent planning, we develop the PitAgent dataset, a surgical context-aware planning dataset covering surgical tasks like workflow analysis, instrument localization, anatomical segmentation, and query-based reasoning. Additionally, we propose DEFT-GaLore, a Deterministic Energy-based Fourier Transform (DEFT) gradient projection technique for efficient low-rank adaptation of recent LLMs (e.g., LLaMA 3.2, Qwen 2.5), enabling their use as surgical agent planners. We extensively validate our agent’s performance and the proposed adaptation technique against other state-of-the-art low-rank adaptation methods on agent planning and prompt generation tasks, including a zero-shot surgical VQA benchmark, demonstrating the significant potential for truly efficient and scalable surgical LLM agents in real-time operative settings.
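As a rough illustration of an energy-based Fourier low-rank gradient projection (the general idea behind GaLore-style adaptation; the exact DEFT-GaLore formulation is not reproduced here), one can deterministically keep only the highest-energy frequency bands of a gradient matrix:

```python
import numpy as np

def fourier_lowrank_project(grad, rank):
    """Hedged sketch: transform the gradient with a DFT along its first
    axis, keep the `rank` frequency components carrying the most energy,
    and reconstruct. Deterministic: no randomized SVD is involved."""
    spectrum = np.fft.fft(grad, axis=0)                 # DFT of each column
    energy = np.abs(spectrum).sum(axis=1)               # per-frequency energy
    keep = np.argsort(energy)[::-1][:rank]              # top-rank frequencies
    mask = np.zeros(grad.shape[0], dtype=bool)
    mask[keep] = True
    spectrum[~mask] = 0.0                               # zero low-energy bands
    return np.fft.ifft(spectrum, axis=0).real           # low-rank update
```

Keeping all frequencies reproduces the gradient exactly (the DFT is invertible), while a small `rank` yields a compressed update in the spirit of low-rank adaptation.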

[186] ArchCAD-400K: A Large-Scale CAD drawings Dataset and New Baseline for Panoptic Symbol Spotting

Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Xingguang Wei, Haomin Wang, YanPeng Li, Fu Chai, Fei Cheng, Shenglong Ye, Wenhai Wang, Yanting Zhang, Yu Qiao, Hongjie Zhang, Xianzhong Zhao

Main category: cs.CV

TL;DR: Proposes ArchCAD-400K, a large-scale CAD dataset with automatic annotation, and DPSS model for panoptic symbol spotting with state-of-the-art performance.

DetailsMotivation: To reduce manual labeling efforts in architectural CAD drawings and enable advanced engineering applications through automated symbol recognition.

Method: Developed a CAD data annotation engine using intrinsic attributes from archived drawings, created ArchCAD-400K dataset, and designed Dual-Pathway Symbol Spotter (DPSS) with adaptive fusion module.

Result: Constructed ArchCAD-400K with 413,062 chunks from 5,538 drawings (over 26x larger than the largest existing CAD dataset), and DPSS achieved state-of-the-art performance with enhanced robustness.

Conclusion: The proposed approach significantly advances CAD symbol recognition, demonstrating the value of large-scale automated annotation and driving innovation in architectural design.

Abstract: Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks from 5538 highly standardized drawings, making it over 26 times larger than the largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity and broader categories, offering line-grained annotations. Furthermore, we present a new baseline model for panoptic symbol spotting, termed Dual-Pathway Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance primitive features with complementary image features, achieving state-of-the-art performance and enhanced robustness. Extensive experiments validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and its potential to drive innovation in architectural design and construction.

[187] GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model – Bringing Motion Generation to the Clinical Domain

Vida Adeli, Soroush Mehraban, Majid Mirmehdi, Alan Whone, Benjamin Filtjens, Amirhossein Dadashzadeh, Alfonso Fasano, Andrea Iaboni, Babak Taati

Main category: cs.CV

TL;DR: GAITGen is a novel framework that generates realistic parkinsonian gait sequences conditioned on pathology severity levels to address data scarcity in clinical gait analysis.

DetailsMotivation: Computer vision models for parkinsonian gait analysis are limited by scarce clinical datasets and challenges in collecting large, well-labeled data, which impacts model accuracy and introduces bias.

Method: GAITGen uses a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, combined with Mask and Residual Transformers for conditioned sequence generation.

Result: GAITGen outperforms state-of-the-art models in reconstruction fidelity and generation quality on the PD-GaM dataset, accurately capturing pathology-specific gait features. Clinical user study confirms realism and relevance, and incorporating generated data improves downstream gait severity estimation.

Conclusion: GAITGen effectively addresses data scarcity in clinical gait analysis by generating realistic, diverse gait sequences across severity levels, enabling large-scale model training and advancing parkinsonian gait analysis.

Abstract: Gait analysis is crucial for the diagnosis and monitoring of movement disorders like Parkinson’s Disease. While computer vision models have shown potential for objectively evaluating parkinsonian gait, their effectiveness is limited by scarce clinical datasets and the challenge of collecting large and well-labelled data, impacting model accuracy and risk of bias. To address these gaps, we propose GAITGen, a novel framework that generates realistic gait sequences conditioned on specified pathology severity levels. GAITGen employs a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, coupled with Mask and Residual Transformers for conditioned sequence generation. GAITGen generates realistic, diverse gait sequences across severity levels, enriching datasets and enabling large-scale model training in parkinsonian gait analysis. Experiments on our new PD-GaM (real) dataset demonstrate that GAITGen outperforms adapted state-of-the-art models in both reconstruction fidelity and generation quality, accurately capturing critical pathology-specific gait features. A clinical user study confirms the realism and clinical relevance of our generated sequences. Moreover, incorporating GAITGen-generated data into downstream tasks improves parkinsonian gait severity estimation, highlighting its potential for advancing clinical gait analysis.
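The residual vector-quantization step at the heart of a Conditional Residual VQ-VAE can be sketched as follows; the tiny codebooks and greedy nearest-code rule are illustrative, not the paper's exact architecture:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Sketch of residual vector quantization: each codebook quantizes
    what the previous levels left over, so the sum of selected codes
    approximates the input latent. Each codebook is a (K, D) array."""
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        d = ((residual[None, :] - cb) ** 2).sum(axis=1)  # distance to each code
        k = int(np.argmin(d))
        indices.append(k)
        recon += cb[k]              # accumulate the chosen code
        residual = residual - cb[k]  # next level sees only the leftover
    return recon, indices
```

Stacking quantizers over residuals is what lets later levels refine coarse motion codes, which is the mechanism GAITGen builds on for disentangling motion and pathology factors.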

[188] FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Amit Agarwal, Srikant Panda, Kulbhushan Pachauri

Main category: cs.CV

TL;DR: FS-DAG is a scalable and efficient model for visually rich document understanding in few-shot settings, using domain-specific and language/vision backbones to adapt to diverse documents with minimal data.

DetailsMotivation: To address practical challenges in real-world document understanding deployments, including OCR errors, misspellings, and domain shifts, while maintaining computational efficiency.

Method: Leverages domain-specific and language/vision specific backbones within a modular framework for few-shot domain adaptation, handling OCR errors and misspellings robustly.

Result: Demonstrates significant improvements in convergence speed and performance compared to state-of-the-art methods for information extraction tasks, with less than 90M parameters.

Conclusion: FS-DAG represents progress in developing smaller, more efficient models that maintain high performance for complex real-world information extraction applications with limited computational resources.

Abstract: In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG’s capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : https://github.com/oracle-samples/fs-dag

[189] HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Xiaopei Zhang, Guan Huang, Yijie Ren, Lihong Liu, Xingang Wang

Main category: cs.CV

TL;DR: HumanDreamer-X is a unified framework for single-image human reconstruction that integrates multi-view generation and 3D reconstruction, using 3D Gaussian Splatting and HumanFixer to enhance geometric consistency and visual fidelity.

DetailsMotivation: Current single-image human reconstruction methods suffer from geometric inconsistencies when generating multiple views, leading to fragmented or blurred limbs in reconstructed models.

Method: Uses 3D Gaussian Splatting as explicit 3D representation, HumanFixer to restore 3DGS renderings, and attention modulation strategy to enhance geometric details and identity consistency across multi-views.

Result: Achieves 16.45% improvement in generation PSNR and 12.65% improvement in reconstruction PSNR, reaching up to 25.62 dB PSNR, with generalization on in-the-wild data and compatibility with various backbone models.

Conclusion: HumanDreamer-X significantly enhances geometric consistency and visual fidelity in single-image human reconstruction through unified multi-view generation and reconstruction pipeline.

Abstract: Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation to provide initial geometry and appearance priors. Building upon this foundation, HumanFixer is trained to restore 3DGS renderings, which guarantees photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances geometric detail and identity consistency across multi-views. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.

[190] Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

Yufeng Wu, Xin Liao, Baowei Wang, Han Fang, Xiaoshuai Wu, Mingyue Chen, Guiling Wang

Main category: cs.CV

TL;DR: Proposes S2R, an unsupervised method that learns the transformation from simulated to real screen-camera noise distributions to improve watermark robustness against unauthorized screen capturing.

DetailsMotivation: Existing watermarking methods for screen-camera images use mathematical modeling or supervised networks that cannot effectively approximate SC noise - mathematical models have biased approximations and supervised networks struggle to learn all noise features from paired data.

Method: Uses an unsupervised noise layer with unpaired data to learn the discrepancy between simulated noise distribution and real-world SC noise distribution, focusing on bridging the noise distribution gap rather than reconstructing image details.

Result: Extensive experiments show superior watermark robustness and generalization compared to state-of-the-art methods.

Conclusion: The S2R approach effectively addresses limitations of existing methods by learning the simulation-to-real transformation, providing better watermark protection against screen-camera attacks.

Abstract: Unauthorized screen capturing and dissemination pose severe security threats such as data leakage and information theft. Several studies propose robust watermarking methods to track the copyright of Screen-Camera (SC) images, facilitating post-hoc certification against infringement. These techniques typically employ heuristic mathematical modeling or supervised neural network fitting as the noise layer, to enhance watermarking robustness against SC. However, both strategies cannot fundamentally achieve an effective approximation of SC noise. Mathematical simulation suffers from biased approximations due to the incomplete decomposition of the noise and the absence of interdependence among the noise components. Supervised networks require paired data to train the noise-fitting model, and it is difficult for the model to learn all the features of the noise. To address the above issues, we propose Simulation-to-Real (S2R). Specifically, an unsupervised noise layer employs unpaired data to learn the discrepancy between the modeled simulated noise distribution and the real-world SC noise distribution, rather than directly learning the mapping from sharp images to real-world images. Learning this transformation from simulation to reality is inherently simpler, as it primarily involves bridging the gap in noise distributions, instead of the complex task of reconstructing fine-grained image details. Extensive experimental results validate the efficacy of the proposed method, demonstrating superior watermark robustness and generalization compared to state-of-the-art methods.

[191] DG-DETR: Toward Domain Generalized Detection Transformer

Seongmin Hwang, Daeyoung Han, Moongu Jeon

Main category: cs.CV

TL;DR: DG-DETR is a domain generalization method for DETR detectors that uses orthogonal projection and wavelet decomposition to improve out-of-distribution robustness.

DetailsMotivation: Current domain generalization research focuses on CNN-based detectors, while DETR-based detectors lack attention for improving their robustness across different domains.

Method: Proposes domain-agnostic query selection via orthogonal projection onto instance-specific style space, and uses wavelet decomposition to disentangle features into domain-invariant and domain-specific components for synthesizing diverse latent styles.

Result: Experimental results validate the effectiveness of DG-DETR in improving out-of-distribution robustness for DETR detectors.

Conclusion: DG-DETR provides a simple, effective, and plug-and-play method for enhancing the domain generalization capability of Transformer-based object detectors.

Abstract: End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at https://github.com/sminhwang/DG-DETR.
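The orthogonal-projection idea behind the domain-agnostic query selection can be sketched in a few lines; the orthonormal style basis is assumed given (in DG-DETR it is instance-specific), and the names are hypothetical:

```python
import numpy as np

def remove_style(query, style_basis):
    """Project a query feature onto the orthogonal complement of a
    style subspace, removing the domain-induced component.
    `style_basis` is an orthonormal (r, D) matrix; illustrative only."""
    coeffs = style_basis @ query            # components along style directions
    return query - style_basis.T @ coeffs   # subtract the style projection
```

After the subtraction the result is, by construction, orthogonal to every style direction, which is the sense in which the query becomes domain-agnostic.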

[192] RAFT – A Domain Adaptation Framework for RGB & LiDAR Semantic Segmentation

Edward Humes, Xiaomin Lin, Boxun Hu, Rithvik Jonna, Tinoosh Mohsenin

Main category: cs.CV

TL;DR: RAFT is a novel framework for adapting image segmentation models using minimal labeled real-world data through data/feature augmentations and active learning, achieving state-of-the-art performance on synthetic-to-real and real-to-real benchmarks.

DetailsMotivation: To address the Syn2Real problem where deep neural networks trained on synthetic data perform poorly in real-world deployments, and reduce the need for extensive manual data annotation.

Method: Proposes RAFT framework using data and feature augmentations combined with active learning to adapt image segmentation models with minimal labeled real-world data.

Result: Achieved state-of-the-art performance, surpassing the previous SOTA (HALO): SYNTHIA->Cityscapes improves mIoU* by 2.1% to 79.9%, GTAV->Cityscapes improves mIoU by 0.4% to 78.2%, and Cityscapes->ACDC improves mIoU by 1.3% to 73.2%.

Conclusion: RAFT effectively mitigates the domain gap in image segmentation using minimal labeled real data, demonstrating strong performance across multiple benchmarks and providing insights into annotation budget allocation.

Abstract: Image segmentation is a powerful computer vision technique for scene understanding. However, real-world deployment is stymied by the need for high-quality, meticulously labeled datasets. Synthetic data provides high-quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real-world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real-world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic-to-real “SYNTHIA->Cityscapes” and “GTAV->Cityscapes” benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA->Cityscapes experiences an improvement in mIoU* upon domain adaptation of 2.1%/79.9%, and GTAV->Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real-to-real benchmark of “Cityscapes->ACDC”, and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.
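A minimal sketch of the active-learning ingredient, assuming a generic entropy-based acquisition rule (the abstract does not commit to this exact criterion, so treat it as one plausible instantiation):

```python
import numpy as np

def entropy_select(probs, budget):
    """Uncertainty-based active-learning step: pick the `budget`
    unlabeled samples with the highest predictive entropy for
    annotation. `probs` is an (N, C) array of softmax outputs."""
    eps = 1e-12                                        # avoid log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argsort(entropy)[::-1][:budget]
```

Spending the annotation budget on the most uncertain real-world samples is the standard way such frameworks squeeze the most adaptation out of minimal labeled data.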

[193] Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration

Yuetong Liu, Yunqiu Xu, Yang Wei, Xiuli Bi, Bin Xiao

Main category: cs.CV

TL;DR: ClearNight is a unified framework for multi-weather nighttime image restoration that handles various weather degradations and flare effects simultaneously using Retinex-based dual priors and weather-aware dynamic collaboration.

DetailsMotivation: Nighttime image restoration with multiple adverse weather conditions is practical but under-explored, as real-world scenarios often involve coexisting weather degradations and lighting effects at night.

Method: Uses Retinex-based dual priors to focus on uneven illumination regions and intrinsic textures, plus weather-aware dynamic specific-commonality collaboration that identifies weather types and adaptively selects optimal restoration units.

Result: Achieves state-of-the-art performance on both synthetic and real-world images, validated by comprehensive ablation experiments.

Conclusion: ClearNight effectively removes complex degradations in one go and the AllWeatherNight dataset supports research in this challenging multi-weather nighttime restoration task.

Abstract: Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, as multiple weather conditions often coexist in the real world alongside various lighting effects at night. This paper first explores the challenging multi-weather nighttime image restoration task, where various types of weather degradations are intertwined with flare effects. To support the research, we contribute the AllWeatherNight dataset, featuring large-scale high-quality nighttime images with diverse compositional degradations, synthesized using our introduced illumination-aware degradation generation. Moreover, we present ClearNight, a unified nighttime image restoration framework, which effectively removes complex degradations in one go. Specifically, ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, thereby enhancing restoration effectiveness in nighttime scenarios. In order to better represent the common and unique characteristics of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects optimal candidate units associated with specific weather types. Our ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of the AllWeatherNight dataset as well as the effectiveness of ClearNight. Project Page: https://henlyta.github.io/ClearNight/
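The classic Retinex decomposition underlying the dual priors can be sketched heuristically; ClearNight's priors are learned, so the Gaussian-blur illumination estimate below is only the textbook baseline, with all names illustrative:

```python
import numpy as np

def retinex_priors(image, sigma=1.0):
    """Split a grayscale image into an illumination estimate
    (low-frequency, via a separable Gaussian blur) and a
    reflectance/texture component (image / illumination)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kern = np.exp(-x**2 / (2 * sigma**2))
    kern /= kern.sum()                                  # normalize the kernel
    # separable blur: rows first, then columns
    blur_rows = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, image)
    illumination = np.apply_along_axis(lambda c: np.convolve(c, kern, mode="same"), 0, blur_rows)
    reflectance = image / np.clip(illumination, 1e-6, None)
    return illumination, reflectance
```

The illumination map points the network at uneven lighting regions while the reflectance map carries the intrinsic texture, mirroring the two roles of the dual priors described above.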

[194] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization

Ronghuan Wu, Wanchao Su, Jing Liao

Main category: cs.CV

TL;DR: LayerPeeler is a novel layer-wise image vectorization method that uses autoregressive peeling with vision-language models to identify and remove topmost non-occluded layers, producing vector graphics with complete paths and coherent layer structures.

DetailsMotivation: Existing image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that limit editability. Current optimization-based and learning-based methods have limitations in vectorization quality and flexibility.

Method: Uses autoregressive peeling strategy with vision-language models to construct layer graphs capturing occlusion relationships. Employs localized attention control with a finetuned image diffusion model to precisely remove identified layers while preserving surrounding content.

Result: Significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity. A large-scale dataset for layer peeling tasks was also contributed.

Conclusion: LayerPeeler successfully addresses the challenges of occluded regions in image vectorization through its progressive simplification paradigm, enabling generation of vector graphics with complete paths and coherent layer structures.

Abstract: Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler’s success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.

[195] LBMamba: Locally Bi-directional Mamba

Jingwei Zhang, Xi Han, Hong Qin, Mahdi S. Hosseini, Dimitris Samaras

Main category: cs.CV

TL;DR: LBMamba introduces a locally bi-directional SSM block that embeds lightweight backward scans within forward scans to achieve full receptive field without doubling computation, outperforming existing Mamba models across multiple vision tasks.

DetailsMotivation: Current Mamba-based vision methods use bi-directional scans to overcome the unidirectional limitation, but this doubles computational load and erodes Mamba's efficiency advantage.

Method: LBMamba embeds a lightweight locally backward scan inside the forward scan executed in per-thread registers, and LBVim alternates scan directions every two layers to recover global receptive field without extra backward sweeps.

Result: LBVim achieves 0.8%-1.6% higher top-1 accuracy on ImageNet-1K, 0.6%-2.7% higher mIoU on ADE20K, and 0.9% higher box AP and 1.1% higher mask AP on COCO; it also boosts SOTA Mamba models by 0.5%-3.4%. On pathology datasets, it achieves up to 3.06% better AUC, 3.39% better F1, and 1.67% better accuracy.

Conclusion: LBMamba offers superior performance-throughput trade-off by eliminating extra backward scans while maintaining full receptive field, making it an efficient alternative to bi-directional Mamba approaches.

Abstract: Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel scan, has recently emerged as a linearly-scaling alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to states after it. Current Mamba-based computer-vision methods typically overcome this by augmenting Mamba’s global forward scan with a global backward scan, forming a bi-directional scan to restore a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally has. To eliminate these extra scans, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward scan and executes it in per-thread registers. Building on LBMamba, we present LBVim, a backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate our approach on both natural images and whole slide images (WSIs) and show that it consistently offers a superior performance-throughput trade-off. Under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher box AP and 1.1% higher mask AP on the COCO detection dataset. Our method also boosts the accuracy of four SOTA Mamba models, namely VMamba, LocalVim, PlainMamba and Adventurer, by 0.5% to 3.4%. We integrate LBMamba into the SOTA pathology multiple instance learning (MIL) model, MambaMIL, which is unidirectional. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy. Our code is available at https://github.com/cvlab-stonybrook/LBMamba.
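The core idea of a locally bi-directional scan can be illustrated with a toy 1-D linear recurrence in NumPy. This is a sketch only: the paper's version runs fused in per-thread GPU registers, and the window size, decay factor, and combination rule below are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def forward_scan(x, a):
    """Global forward linear recurrence: h[t] = a*h[t-1] + x[t]."""
    h = np.zeros_like(x)
    acc = 0.0
    for t in range(len(x)):
        acc = a * acc + x[t]
        h[t] = acc
    return h

def local_backward_scan(x, a, w):
    """Backward recurrence restricted to a local window of size w at each position."""
    T = len(x)
    h = np.zeros_like(x)
    for t in range(T):
        acc = 0.0
        for k in range(min(t + w, T) - 1, t - 1, -1):
            acc = a * acc + x[k]
        h[t] = acc
    return h

def locally_bidirectional_scan(x, a=0.9, w=4):
    # Combine both directions; subtract x so the current token is not counted twice.
    return forward_scan(x, a) + local_backward_scan(x, a, w) - x
```

With window size 1 the backward term contributes only the current token, so the result collapses to the plain forward scan; larger windows add a bounded amount of "lookahead" without a second global sweep.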

[196] A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving

Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada

Main category: cs.CV

TL;DR: The paper introduces a collaborative 3D semantic occupancy prediction framework for autonomous driving, addressing limitations of single-vehicle perception through multi-agent information exchange and establishing benchmarks with varying prediction ranges.

DetailsMotivation: Single-vehicle 3D semantic occupancy prediction is limited by occlusion, restricted sensor range, and narrow viewpoints. Collaborative perception can overcome these limitations by exchanging complementary information between vehicles.

Method: Augmented an existing collaborative perception dataset using CARLA simulator with high-resolution semantic voxel sensors. Developed a baseline model with inter-agent feature fusion via spatial alignment and attention aggregation.

Result: The baseline collaborative model consistently outperforms single-agent models, with performance gains increasing as the prediction range expands.

Conclusion: Collaborative 3D semantic occupancy prediction effectively enhances perception completeness and accuracy, especially for larger spatial extents, demonstrating the value of multi-agent information exchange in autonomous driving.

Abstract: 3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy of perception. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.

[197] TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Zheng, Zhipeng Cao, Erlong Li, Chao Yan, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: TopoStreamer is an end-to-end temporal perception model that improves lane segment topology reasoning for autonomous driving through streaming attribute constraints, dynamic positional encoding, and lane segment denoising.

DetailsMotivation: Existing methods for lane segment topology reasoning suffer from limitations in consistent positional embedding and temporal multiple attribute learning, which hinder accurate road network reconstruction needed for autonomous driving maneuvers like turning and lane changing.

Method: TopoStreamer introduces three key improvements: streaming attribute constraints for temporal consistency in centerline/boundary coordinates and classifications, dynamic lane boundary positional encoding for up-to-date positional information, and lane segment denoising to capture diverse lane patterns.

Result: On the OpenLane-V2 dataset, TopoStreamer achieved significant improvements over state-of-the-art methods: +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.

Conclusion: The proposed TopoStreamer model effectively addresses limitations in existing lane topology reasoning methods and demonstrates substantial performance gains in both lane segment and centerline perception tasks for autonomous driving applications.

Abstract: Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.

[198] evMLP: An Efficient Event-Driven MLP Architecture for Vision

Zhentan Zheng

Main category: cs.CV

TL;DR: evMLP is an event-driven MLP architecture that selectively processes patches where changes occur between frames, reducing computational costs for video processing while maintaining competitive accuracy.

DetailsMotivation: To improve computational efficiency in sequential image processing by avoiding redundant computations on unchanged regions between frames, leveraging event-driven mechanisms.

Method: Uses MLPs to process image patches independently, defines inter-frame changes as "events", and implements a local update mechanism that selectively processes only patches where events occur.

Result: Achieves competitive ImageNet classification accuracy and significantly reduces computational cost on video datasets while maintaining output consistency with non-event-driven baselines.

Conclusion: evMLP provides an efficient event-driven approach for vision tasks, particularly effective for video processing where it reduces computation by focusing only on changing regions.

Abstract: Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as "events". Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and pre-trained models are available at https://github.com/i-evi/evMLP.
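The event-driven local update can be sketched in a few lines of NumPy. The patch size, change threshold, and stand-in per-patch "MLP" below are illustrative assumptions, not the paper's configuration; only patches whose pixels changed beyond the threshold are recomputed.

```python
import numpy as np

def to_patches(frame, p):
    """Split an HxW frame into non-overlapping p x p patches, flattened per patch."""
    H, W = frame.shape
    return frame.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

def event_driven_update(prev_frame, frame, prev_out, mlp, p=4, tau=1e-3):
    """Recompute the MLP only on patches whose content changed ('events')."""
    prev_patches = to_patches(prev_frame, p)
    patches = to_patches(frame, p)
    events = np.abs(patches - prev_patches).max(axis=1) > tau
    out = prev_out.copy()          # unchanged patches reuse cached outputs
    if events.any():
        out[events] = mlp(patches[events])
    return out, events

# Stand-in "MLP": any per-patch function works for the sketch.
mlp = lambda x: np.tanh(x @ np.full((x.shape[1], 8), 0.1))
```

For a static video, no events fire and the cached outputs are returned untouched, which is where the computational savings come from.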

[199] DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection

Paul Hill, Zhiming Liu, Alin Achim, Dave Bull, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Proposes DMAT framework that jointly improves atmospheric turbulence mitigation and object detection using 3D Mamba-based architecture and end-to-end training.

DetailsMotivation: Atmospheric turbulence degrades surveillance imagery quality and object detection performance, with existing deep learning methods struggling with spatio-temporal distortions.

Method: End-to-end framework with 3D Mamba-based AT mitigator to handle spatio-temporal distortions, exchanging knowledge between low-level features and semantic features from object detector through joint optimization.

Result: DMAT outperforms state-of-the-art AT mitigation and object detection systems by up to 15% improvement on turbulence-corrupted datasets.

Conclusion: The proposed joint learning framework effectively compensates for distorted features while simultaneously improving both visualization quality and object detection performance.

Abstract: Atmospheric Turbulence (AT) degrades the clarity and accuracy of surveillance imagery, posing challenges not only for visualization quality but also for object classification and scene tracking. Deep learning-based methods have been proposed to improve visual quality, but spatio-temporal distortions remain a significant issue. Although deep learning-based object detection performs well under normal conditions, it struggles to operate effectively on sequences distorted by atmospheric turbulence. In this paper, we propose a novel framework that learns to compensate for distorted features while simultaneously improving visualization and object detection. This end-to-end training strategy leverages and exchanges knowledge of low-level distorted features in the AT mitigator with semantic features extracted in the object detector. Specifically, in the AT mitigator a 3D Mamba-based structure is used to handle the spatio-temporal displacements and blurring caused by turbulence. Optimization is achieved through back-propagation in both the AT mitigator and object detector. Our proposed DMAT outperforms state-of-the-art AT mitigation and object detection systems by up to 15% on datasets corrupted by generated turbulence.

[200] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, Priyadarshini Panda

Main category: cs.CV

TL;DR: OpenWorldSAM extends SAM2 for open-vocabulary segmentation using multi-modal embeddings from a lightweight VLM, achieving state-of-the-art performance with high efficiency and strong generalization.

DetailsMotivation: To address the challenge of segmenting objects based on open-ended language prompts, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories.

Method: Integrates multi-modal embeddings from a lightweight VLM into SAM2, freezing pre-trained components and training only 4.5M parameters. Uses positional tie-breaker embeddings and cross-attention layers for instance awareness, supporting unified language prompting.

Result: Achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks with strong zero-shot capabilities on unseen categories.

Conclusion: OpenWorldSAM provides an efficient and flexible framework for open-vocabulary segmentation that generalizes well to unseen concepts without additional training, demonstrating superior performance across various segmentation tasks.

Abstract: The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model’s spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.

[201] Geo-Registration of Terrestrial LiDAR Point Clouds with Satellite Images without GNSS

Xinyu Wang, Muhammad Ibrahim, Haitian Wang, Atif Mansoor, Xiuping Jia, Ajmal Mian

Main category: cs.CV

TL;DR: A structured geo-registration method that aligns LiDAR point clouds with satellite images using road segmentation and intersection matching, achieving significant improvements in urban localization accuracy without relying on GNSS signals.

DetailsMotivation: Existing geo-registration methods rely on GNSS/IMU data which often fail in dense urban environments due to signal degradation, leading to localization errors. There's a need for accurate geo-registration without prior localization.

Method: Uses Point Transformer for road segmentation, extracts road skeletons and intersections from both LiDAR and satellite images, performs global rigid alignment using intersection correspondences, followed by local non-rigid refinement with RBF interpolation, and corrects elevation using SRTM terrain data.

Result: On KITTI benchmark: 0.69m mean planimetric error (50% improvement over raw data), 30.5% elevation correlation improvement. On Perth dataset: 2.17m mean planimetric error (57.4% improvement over rigid alignment), 55.8% elevation correlation improvement.

Conclusion: The proposed method enables accurate frame-wise geo-registration and city-scale 3D reconstruction without prior localization, significantly outperforming existing approaches in GNSS-denied urban environments.

Abstract: Accurate geo-registration of LiDAR point clouds remains a significant challenge in urban environments where Global Navigation Satellite System (GNSS) signals are denied or degraded. Existing methods typically rely on real-time GNSS and Inertial Measurement Unit (IMU) data, which require pre-calibration and assume stable signals. However, this assumption often fails in dense cities, resulting in localization errors. To address this, we propose a structured geo-registration method that accurately aligns LiDAR point clouds with satellite images, enabling frame-wise geo-registration and city-scale 3D reconstruction without prior localization. Our method uses a pre-trained Point Transformer to segment road points, then extracts road skeletons and intersections from the point cloud and the satellite image. Global alignment is achieved through rigid transformation using corresponding intersection points, followed by local non-rigid refinement with radial basis function (RBF) interpolation. Elevation discrepancies are corrected using terrain data from the Shuttle Radar Topography Mission (SRTM). To evaluate geo-registration accuracy, we measure the absolute distances between the roads extracted from the two modalities. Our method is validated on the KITTI benchmark and a newly collected dataset of Perth, Western Australia. On KITTI, our method achieves a mean planimetric alignment error of 0.69m, representing a 50% improvement over the raw KITTI data. On the Perth dataset, it achieves a mean planimetric error of 2.17m relative to GNSS values extracted from Google Maps, corresponding to a 57.4% improvement over rigid alignment. Elevation correlation improved by 30.5% (KITTI) and 55.8% (Perth). A demonstration video is available at: https://youtu.be/0wkACAB-O6E.
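The local non-rigid refinement step can be sketched with SciPy's RBF interpolator: fit a smooth 2-D displacement field to the matched intersection pairs, then apply it to all road points. A minimal sketch; the paper's kernel choice and smoothing parameters may differ.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def nonrigid_refine(points, src_ctrl, dst_ctrl):
    """Warp (rigidly pre-aligned) LiDAR road points with an RBF displacement field.

    src_ctrl: intersection points in the point-cloud frame, shape (n, 2)
    dst_ctrl: matching intersection points in the satellite-image frame, (n, 2)
    points:   road points to warp, shape (m, 2)
    """
    displacement = dst_ctrl - src_ctrl
    # Thin-plate-spline RBF interpolates the displacements exactly at the
    # control points and smoothly in between.
    rbf = RBFInterpolator(src_ctrl, displacement, kernel='thin_plate_spline')
    return points + rbf(points)
```

Because no smoothing is applied, the warped control points land exactly on their satellite-image correspondences, while points in between are moved by a smoothly interpolated displacement.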

[202] Knowledge-Guided Brain Tumor Segmentation via Synchronized Visual-Semantic-Topological Prior Fusion

Mingda Zhang, Kaiwen Pan

Main category: cs.CV

TL;DR: STPF is a knowledge-guided brain tumor segmentation framework that integrates pathology, semantic, and geometric priors through dual-level fusion and nested output heads, achieving state-of-the-art performance on BraTS 2020.

DetailsMotivation: Existing deep learning methods for brain tumor segmentation rely mainly on visual features and lack explicit integration of medical domain knowledge like anatomical semantics and geometric topology, leading to insufficient discriminative power in ambiguous boundary regions.

Method: Proposes STPF framework that integrates three knowledge priors: pathology-driven differential features, unsupervised semantic descriptions via spatialization operators, and geometric constraints from persistent homology analysis. Uses dual-level fusion architecture with voxel-level confidence weighting and sample-level hypernetwork-generated conditional vectors, plus nested output heads to enforce hierarchical constraints.

Result: Achieves mean Dice coefficient of 0.868 on BraTS 2020, surpassing best baseline by 2.6 percentage points (3.09% relative improvement). Shows stable performance with coefficients of variation between 0.23%-0.33% in cross-validation. Ablation studies show removing topological and semantic priors causes 2.8% and 3.5% performance degradation respectively.

Conclusion: Explicit integration of medical knowledge priors (anatomical semantics and geometric constraints) improves segmentation accuracy in ambiguous boundary regions while demonstrating generalization capability and clinical deployment potential.

Abstract: Background: Brain tumor segmentation requires precise delineation of hierarchical structures from multi-sequence MRI. However, existing deep learning methods primarily rely on visual features, showing insufficient discriminative power in ambiguous boundary regions. Moreover, they lack explicit integration of medical domain knowledge such as anatomical semantics and geometric topology. Methods: We propose a knowledge-guided framework, Synchronized Tri-modal Prior Fusion (STPF), that explicitly integrates three heterogeneous knowledge priors: pathology-driven differential features (T1ce-T1, T2-FLAIR, T1/T2) encoding contrast patterns; unsupervised semantic descriptions transformed into voxel-level guidance via spatialization operators; and geometric constraints extracted through persistent homology analysis. A dual-level fusion architecture dynamically allocates prior weights at the voxel level based on confidence and at the sample level through hypernetwork-generated conditional vectors. Furthermore, nested output heads structurally ensure the hierarchical constraint ET ⊂ TC ⊂ WT. Results: STPF achieves a mean Dice coefficient of 0.868 on the BraTS 2020 dataset, surpassing the best baseline by 2.6 percentage points (3.09% relative improvement). Notably, five-fold cross-validation yields coefficients of variation between 0.23% and 0.33%, demonstrating stable performance. Additionally, ablation experiments show that removing topological and semantic priors leads to performance degradation of 2.8% and 3.5%, respectively. Conclusions: By explicitly integrating medical knowledge priors (anatomical semantics and geometric constraints), STPF improves segmentation accuracy in ambiguous boundary regions while demonstrating generalization capability and clinical deployment potential.
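One common way to structurally enforce the hierarchy ET ⊂ TC ⊂ WT with nested output heads is to gate each inner region's probability by its parent's, so the nesting holds for every voxel by construction. A minimal NumPy sketch; the paper's exact head design may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nested_heads(logit_wt, logit_tc, logit_et):
    """Nested output heads: each inner probability is the parent's probability
    times a conditional sigmoid, guaranteeing p_et <= p_tc <= p_wt everywhere."""
    p_wt = sigmoid(logit_wt)
    p_tc = p_wt * sigmoid(logit_tc)   # tumor core can only exist inside whole tumor
    p_et = p_tc * sigmoid(logit_et)   # enhancing tumor can only exist inside tumor core
    return p_wt, p_tc, p_et
```

Because each sigmoid lies in (0, 1), no post-hoc clipping or consistency loss is needed: any thresholding of the three maps yields nested masks automatically.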

[203] Trustworthy Pedestrian Trajectory Prediction via Pattern-Aware Interaction Modeling

Kaiyuan Zhai, Juan Chen, Chao Wang, Zeyi Xu, Guoming Tang

Main category: cs.CV

TL;DR: InSyn is a Transformer-based model for pedestrian trajectory prediction that explicitly captures interaction patterns and uses SSOS training to reduce initial-step divergence, outperforming black-box methods in accuracy and transparency.

DetailsMotivation: Current pedestrian trajectory prediction methods use black-box modeling of interactions, which limits reliability in real-world deployments despite strong performance.

Method: Proposed InSyn model with Transformer architecture to explicitly capture diverse interaction patterns and direction-sensitive social behaviors, plus SSOS training strategy to reduce initial-step divergence.

Result: Outperforms recent black-box baselines on ETH and UCY datasets, especially in high-density scenarios, with SSOS reducing initial-step prediction error by ~6.58%.

Conclusion: InSyn provides both accurate predictions and transparent interaction modeling, making it more reliable for real-world applications compared to black-box approaches.

Abstract: Accurate and reliable pedestrian trajectory prediction is critical for intelligent applications, yet achieving trustworthy prediction remains highly challenging due to the complexity of interactions among pedestrians. Previous methods often adopt black-box modeling of pedestrian interactions. Despite their strong performance, such opaque modeling limits the reliability of predictions in real-world deployments. To address this issue, we propose InSyn (Interaction-Synchronization Network), a novel Transformer-based model that explicitly captures diverse interaction patterns (e.g., walking in sync or conflicting) while effectively modeling direction-sensitive social behaviors. Additionally, we introduce a training strategy, termed Seq-Start of Seq (SSOS), designed to alleviate the common issue of initial-step divergence in numerical time-series prediction. Experiments on the ETH and UCY datasets demonstrate that our model not only outperforms recent black-box baselines in prediction accuracy, especially under high-density scenarios, but also provides transparent interaction modeling, as shown in the case study. Furthermore, the SSOS strategy proves to be effective in improving sequential prediction performance, reducing the initial-step prediction error by approximately 6.58%. Code is available at https://github.com/rickzky1001/InSyn

[204] Rethinking Pan-sharpening: A New Training Process for Full-Resolution Generalization

Ran Zhang, Xuanhua He, Li Xueheng, Ke Cao, Liu Liu, Wenbo Xu, Fang Jiabin, Yang Qize, Jie Zhang

Main category: cs.CV

TL;DR: Introduces a multiple-in-one training strategy for pan-sharpening that trains a single compact model on multiple satellite datasets, improving full-resolution generalization while solving the one-model-per-dataset problem.

DetailsMotivation: Addresses poor generalization from reduced-resolution training to real-world full-resolution data and the impracticality of one-dataset-one-model approaches in pan-sharpening.

Method: Proposes multiple-in-one training strategy using three satellite datasets (WV2, WV3, GF2) simultaneously, and introduces PanTiny - a lightweight framework designed for this paradigm.

Result: Achieves significant universal boost in full-resolution generalization (QNR) across all tested models, with superior performance-to-efficiency balance compared to brute-force scaling approaches.

Conclusion: Advocates for community shift towards efficient, deployable, and truly generalizable pan-sharpening models, demonstrating that principled simple design is more effective than brute-force scaling.

Abstract: The field of pan-sharpening has recently seen a trend towards increasingly large and complex models, often trained on single, specific satellite datasets. This one-dataset, one-model approach leads to high computational overhead and impractical deployment. More critically, it overlooks a core challenge: poor generalization from reduced-resolution (RR) training to real-world full-resolution (FR) data. In response to this issue, we challenge this paradigm. We introduce a multiple-in-one training strategy, where a single, compact model is trained simultaneously on three distinct satellite datasets (WV2, WV3, and GF2). Our experiments show the primary benefit of this unified strategy is a significant and universal boost in FR generalization (QNR) across all tested models, directly addressing this overlooked problem. This paradigm also inherently solves the one-model-per-dataset challenge, and we support it with a highly reproducible, dependency-free codebase for true usability. Finally, we propose PanTiny, a lightweight framework designed specifically for this new, robust paradigm. We demonstrate it achieves a superior performance-to-efficiency balance, proving that principled, simple and robust design is more effective than brute-force scaling in this practical setting. Our work advocates for a community-wide shift towards creating efficient, deployable, and truly generalizable models for pan-sharpening. The code is open-sourced at https://github.com/Zirconium233/PanTiny.
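The multiple-in-one strategy amounts to letting one compact model see batches from every satellite dataset in each epoch instead of training one model per dataset. A minimal round-robin sampler sketch in plain Python; the batching scheme and dataset tagging here are illustrative assumptions, not the paper's exact data loader.

```python
import random

def multi_dataset_batches(datasets, batch_size, seed=0):
    """Yield (dataset_index, batch) pairs that cycle over the datasets,
    so a single model sees all sources (e.g. WV2, WV3, GF2) each epoch."""
    rng = random.Random(seed)
    shuffled = []
    for ds in datasets:
        idx = list(range(len(ds)))
        rng.shuffle(idx)
        shuffled.append([ds[i] for i in idx])
    batches = []
    offsets = [0] * len(datasets)
    # Round-robin: dataset 0, 1, 2, 0, 1, 2, ... until every dataset is exhausted.
    while any(offsets[k] < len(shuffled[k]) for k in range(len(datasets))):
        for k, data in enumerate(shuffled):
            if offsets[k] < len(data):
                batches.append((k, data[offsets[k]:offsets[k] + batch_size]))
                offsets[k] += batch_size
    return batches
```

Keeping each batch single-source (while interleaving sources across steps) lets dataset-specific statistics such as band normalization stay per-batch, which is one plausible way to train one model on heterogeneous satellites.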

[205] Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Syed Ahmed Mahmood, Ali Shah Ali, Umer Ahmed, Fawad Javed Fateh, M. Zeeshan Zia, Quoc-Huy Tran

Main category: cs.CV

TL;DR: A self-supervised framework for procedure learning that discovers key steps and their order from unlabeled videos using fused Gromov-Wasserstein optimal transport with structural prior and contrastive regularization to avoid degenerate solutions.

DetailsMotivation: Previous methods for self-supervised procedure learning suffer from order variations, background/redundant frames, and repeated actions when learning frame-to-frame correspondences between videos.

Method: Proposes a self-supervised framework using fused Gromov-Wasserstein optimal transport with structural prior for frame-to-frame mapping, combined with contrastive regularization to prevent degenerate solutions where all frames map to a small cluster.

Result: Extensive experiments on egocentric and third-person benchmarks demonstrate superior performance over prior works, including OPEL which uses classical Kantorovich optimal transport with optimality prior.

Conclusion: The proposed framework effectively addresses challenges in self-supervised procedure learning by combining optimal transport with contrastive regularization, outperforming existing methods on multiple benchmarks.

Abstract: We study self-supervised procedure learning, which discovers key steps and their order from a set of unlabeled videos. Previous methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised framework, which utilizes a fused Gromov-Wasserstein optimal transport with a structural prior for frame-to-frame mapping. However, optimizing only for the above temporal alignment may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and thus every video is assigned to just one key step. To address that issue, we integrate a contrastive regularization, which maps different frames to various points, avoiding trivial solutions. Finally, extensive experiments on egocentric and third-person benchmarks demonstrate our superior performance over prior works, including OPEL, which relies on a classical Kantorovich optimal transport with an optimality prior.
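For intuition on the baseline being improved upon: with uniform marginals and equally many frames per video, the classical Kantorovich transport (the formulation OPEL builds on) reduces to a linear assignment on the pairwise feature-cost matrix. The sketch below shows only that baseline; the paper's fused Gromov-Wasserstein formulation with a structural prior and contrastive regularization is more involved.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def kantorovich_frame_mapping(feats_a, feats_b):
    """Frame-to-frame mapping by classical (Kantorovich) optimal transport.

    With uniform marginals and equal frame counts, the optimal transport plan
    is a permutation, so it can be found by solving a linear assignment
    problem on the pairwise squared-distance cost matrix.
    """
    cost = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return cols  # cols[i] = index of the frame in video B matched to frame i of A
```

The Gromov-Wasserstein variant instead compares intra-video distance structures (how frames relate within each video), which is what makes it robust to order variations that defeat a purely feature-based cost.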

[206] Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts

Ardhendu Sekhar, Vasu Soni, Keshav Aske, Garima Jain, Pranav Jeevan, Amit Sethi

Main category: cs.CV

TL;DR: A modular framework for predicting cancer survival from whole slide images using quantile-based patch filtering, graph-regularized clustering, hierarchical feature aggregation, and mixture density modeling.

DetailsMotivation: To directly predict cancer-specific survival from whole slide pathology images by capturing prognostic and morphological heterogeneity in tumor tissue.

Method: Four-stage framework: 1) Quantile-based patch filtering for informative regions, 2) Graph-regularized patch clustering for phenotype variations, 3) Hierarchical feature aggregation for multiscale tumor organization, 4) Expert-guided mixture density model for survival distribution estimation.

Result: Achieved concordance indices of 0.653 (TCGA LUAD), 0.719 (TCGA KIRC), and 0.733 (TCGA BRCA), surpassing state-of-the-art survival prediction methods.

Conclusion: The proposed modular framework effectively captures prognostic information from pathology images and outperforms existing approaches in cancer survival prediction across multiple cancer types.

Abstract: We propose a modular framework for predicting cancer-specific survival directly from whole slide pathology images (WSIs). The framework consists of four key stages designed to capture prognostic and morphological heterogeneity. First, a Quantile-Based Patch Filtering module selects prognostically informative tissue regions through quantile thresholding. Second, Graph-Regularized Patch Clustering models phenotype-level variations using a k-nearest-neighbor graph that enforces spatial and morphological coherence. Third, Hierarchical Feature Aggregation learns both intra- and inter-cluster dependencies to represent multiscale tumor organization. Finally, an Expert-Guided Mixture Density Model estimates complex survival distributions via Gaussian mixtures, enabling fine-grained risk prediction. Evaluated on TCGA LUAD, TCGA KIRC, and TCGA BRCA cohorts, our model achieves concordance indices of 0.653, 0.719, and 0.733, respectively, surpassing existing state-of-the-art approaches in survival prediction from WSIs.
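The mixture-density idea in the final stage can be made concrete: given mixture weights, means, and standard deviations predicted by an expert-gated head, the survival function is a weighted sum of Gaussian tail probabilities. This is a minimal sketch of that standard formula, not the authors' code; the parameter values are hypothetical.

```python
# Sketch of a Gaussian mixture survival function S(t) = P(T > t):
# a weighted sum of upper-tail Gaussian probabilities, as used by
# mixture density models for fine-grained risk prediction.
import math

def gaussian_survival(t, mu, sigma):
    # 1 - Phi((t - mu) / sigma): upper-tail probability of N(mu, sigma^2)
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))

def mixture_survival(t, weights, means, stds):
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(w * gaussian_survival(t, m, s)
               for w, m, s in zip(weights, means, stds))
```

For example, a single component centered at 24 months gives S(24) = 0.5, and S(t) decreases monotonically in t, as a survival function must.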

[207] Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation

Yida Tao, Yen-Chia Hsu

Main category: cs.CV

TL;DR: CEDANet is a human-in-the-loop domain adaptation framework that combines citizen-provided weak labels with adversarial feature alignment to achieve industrial smoke segmentation without expensive pixel-level annotations.

DetailsMotivation: Industrial smoke segmentation is crucial for environmental monitoring but suffers from high annotation costs and scarcity of pixel-level labels in real-world settings.

Method: Uses citizen-provided video-level labels to refine pseudo-labels from source-trained segmentation model, and employs class-specific domain discriminators for feature alignment between source and target domains.

Result: Achieves F1-score of 0.414 and smoke-class IoU of 0.261 with citizen feedback, representing 5x and 6x improvements over baseline respectively. Performance comparable to fully supervised training on 100 annotated images.

Conclusion: Validates scalability and cost-efficiency of combining citizen science with weakly supervised domain adaptation for complex environmental monitoring applications.

Abstract: Industrial smoke segmentation is critical for air-quality monitoring and environmental protection but is often hampered by the high cost and scarcity of pixel-level annotations in real-world settings. We introduce CEDANet, a human-in-the-loop, class-aware domain adaptation framework that uniquely integrates weak, citizen-provided video-level labels with adversarial feature alignment. Specifically, we refine pseudo-labels generated by a source-trained segmentation model using citizen votes, and employ class-specific domain discriminators to transfer rich source-domain representations to the industrial domain. Comprehensive experiments on SMOKE5K and custom IJmond datasets demonstrate that CEDANet achieves an F1-score of 0.414 and a smoke-class IoU of 0.261 with citizen feedback, vastly outperforming the baseline model, which scored 0.083 and 0.043, respectively. This represents a five-fold increase in F1-score and a six-fold increase in smoke-class IoU. Notably, CEDANet with citizen-constrained pseudo-labels achieves performance comparable to the same architecture trained on a limited set of 100 fully annotated images (F1-score of 0.418 and IoU of 0.264), demonstrating its ability to reach small-sample fully supervised accuracy without target-domain annotations. Our research validates the scalability and cost-efficiency of combining citizen science with weakly supervised domain adaptation, offering a practical solution for complex, data-scarce environmental monitoring applications.
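One reading of the citizen-vote refinement step is a video-level majority vote gating the pixel pseudo-labels: if citizens agree a video contains no smoke, the source model's smoke pseudo-labels for that video are suppressed. The function below is a hypothetical sketch of that idea; the threshold and all names are ours, not from the paper.

```python
# Hypothetical sketch of citizen-vote pseudo-label gating: a video-level
# binary vote ("smoke" / "no smoke") either keeps or zeroes out the
# pixel-level pseudo-labels from the source-trained model.
def refine_pseudo_labels(pseudo_mask, votes, accept_ratio=0.5):
    """pseudo_mask: 2D list of 0/1 pixel pseudo-labels for one frame.
    votes: list of binary citizen labels (1 = smoke) for the whole video.
    accept_ratio: assumed vote fraction needed to trust the mask."""
    smoke_ratio = sum(votes) / len(votes)
    if smoke_ratio >= accept_ratio:
        return pseudo_mask  # citizens confirm smoke: keep pseudo-labels
    # citizens reject smoke: suppress likely false positives
    return [[0] * len(row) for row in pseudo_mask]
```

This kind of weak gate is what lets video-level labels, which are cheap to collect, constrain pixel-level training without any target-domain pixel annotations.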

[208] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: FASTopoWM is a fast-slow lane topology reasoning framework that uses latent world models to improve temporal perception and overcome limitations of existing methods.

DetailsMotivation: Existing lane topology reasoning methods struggle with effectively using temporal information, over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation.

Method: Proposes FASTopoWM with parallel supervision of historical and new queries, and introduces latent query and BEV world models conditioned on action latent to propagate state representations across timesteps.

Result: Outperforms state-of-the-art methods on OpenLane-V2 benchmark: 37.4% vs 33.6% mAP for lane segment detection, and 46.3% vs 41.5% OLS for centerline perception.

Conclusion: The unified fast-slow framework with latent world models effectively enhances temporal perception and reasoning performance for lane topology understanding.

Abstract: Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation methods have demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, they remain limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% mAP) and centerline perception (46.3% vs. 41.5% OLS).

[209] VPN: Visual Prompt Navigation

Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang

Main category: cs.CV

TL;DR: Visual Prompt Navigation (VPN) uses visual prompts on 2D top-view maps instead of language instructions to guide embodied agents, reducing ambiguity and improving navigation in complex environments.

DetailsMotivation: Natural language instructions for navigation are often ambiguous and verbose, hindering effective guidance in complex environments. Visual prompts provide more intuitive and spatially grounded guidance.

Method: Proposed VPN paradigm with visual prompts marking navigation trajectories on top-view maps. Built VPN tasks in discrete/continuous settings, created R2R-VP and R2R-CE-VP datasets, and developed VPNet baseline with view-level and trajectory-level data augmentation.

Result: Extensive experiments evaluated how visual prompt forms, top-view map formats, and data augmentation strategies affect navigation performance. The approach reduces interpretive ambiguity and is more user-friendly.

Conclusion: Visual Prompt Navigation offers an effective alternative to language-guided navigation by providing intuitive, spatially grounded visual guidance that reduces ambiguity and improves navigation performance in complex environments.

Abstract: While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.

[210] Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models

Haoyang Li, Liang Wang, Chao Wang, Siyu Zhou, Jing Jiang, Yan Peng, Guodong Long

Main category: cs.CV

TL;DR: AugPT is a self-contained prompt tuning method that uses internal data augmentation and consensus-based filtering to enhance model performance without external knowledge.

DetailsMotivation: Existing prompt tuning methods rely on expensive external knowledge sources and ignore image modality features, leading to high costs and underutilization of available data.

Method: Uses self-supervised augmentation on training images and a gating mechanism with consensus test to filter noisy samples using the pre-trained backbone model.

Result: Extensive experiments show improved performance and generalization without external knowledge, validated by available code.

Conclusion: AugPT effectively enhances prompt tuning through internal augmentation and consensus filtering, achieving better results without costly external data.

Abstract: For CLIP-based prompt tuning, introducing additional data as auxiliary knowledge to enhance the fine-tuning process has proven effective. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach using only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set, and introduces a novel gating mechanism based on a consensus test, reusing the pre-trained prompt tuning backbone model to spontaneously filter noisy samples, further enhancing the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: https://github.com/JREion/AugPT .
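The consensus-test gate can be illustrated as agreement voting: an augmented view is kept only if the frozen backbone's top-1 predictions agree across several prompt variants. This is our reading of the mechanism, sketched with hypothetical names and thresholds, not the AugPT code.

```python
# Sketch of a consensus-based gating filter: keep an augmented view only
# when the predicted class agrees across prompt variants; disagreement
# flags the view as noisy and it is dropped.
def consensus_filter(views_predictions, min_agree=1.0):
    """views_predictions: for each augmented view, a list of predicted
    class ids, one per prompt variant. Returns indices of kept views."""
    kept = []
    for i, preds in enumerate(views_predictions):
        top = max(set(preds), key=preds.count)  # majority class
        if preds.count(top) / len(preds) >= min_agree:
            kept.append(i)
    return kept
```

With `min_agree=1.0` only unanimously classified views survive; relaxing the threshold trades augmentation quantity against label noise.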

[211] The Power of Many: Synergistic Unification of Diverse Augmentations for Efficient Adversarial Robustness

Wang Yu-Hang, Shiwei Li, Jianxiang Liao, Li Bohan, Jian Liu, Wenfei Yin

Main category: cs.CV

TL;DR: UAA is a plug-and-play adversarial defense framework that pre-computes universal transformations offline to efficiently generate unique adversarial perturbations during training, achieving SOTA performance without online adversarial example generation.

DetailsMotivation: Adversarial Training (AT) faces high computational costs and standard performance degradation, while existing data augmentation methods offer limited robustness or high training overhead. There's a need for highly efficient and strongly robust defense mechanisms.

Method: Proposed Universal Adversarial Augmenter (UAA) framework that decouples expensive perturbation generation from model training by pre-computing universal transformations offline, then efficiently generating unique adversarial perturbations for each sample during training.

Result: Extensive experiments on multiple benchmarks show UAA establishes new state-of-the-art for data-augmentation-based adversarial defense strategies without requiring online adversarial example generation during training.

Conclusion: UAA provides a practical and efficient pathway for building robust models by leveraging the synergy of diverse augmentation strategies through a plug-and-play framework with training efficiency.

Abstract: Adversarial perturbations pose a significant threat to deep learning models. Adversarial Training (AT), the predominant defense method, faces challenges of high computational costs and a degradation in standard performance. While data augmentation offers an alternative path, existing techniques either yield limited robustness gains or incur substantial training overhead. Therefore, developing a defense mechanism that is both highly efficient and strongly robust is of paramount importance. In this work, we first conduct a systematic analysis of existing augmentation techniques, revealing that the synergy among diverse strategies, rather than any single method, is crucial for enhancing robustness. Based on this insight, we propose the Universal Adversarial Augmenter (UAA) framework, which is characterized by its plug-and-play nature and training efficiency. UAA decouples the expensive perturbation generation process from model training by pre-computing a universal transformation offline, which is then used to efficiently generate unique adversarial perturbations for each sample during training. Extensive experiments conducted on multiple benchmarks validate the effectiveness of UAA. The results demonstrate that UAA establishes a new state-of-the-art (SOTA) for data-augmentation-based adversarial defense strategies, without requiring the online generation of adversarial examples during training. This framework provides a practical and efficient pathway for building robust models. Our code is available in the supplementary materials.

[212] Learning More by Seeing Less: Structure First Learning for Efficient, Transferable, and Human-Aligned Vision

Tianqin Li, George Liu, Tai Sing Lee

Main category: cs.CV

TL;DR: Structure-first learning uses line drawings as initial training modality to create more compact and generalizable visual representations, leading to better shape bias, data efficiency, and lower intrinsic dimensionality.

DetailsMotivation: Modern vision systems depend on rich visual inputs while humans understand sparse representations like line drawings, suggesting structure rather than appearance enables efficient visual understanding.

Method: Propose structure-first learning paradigm using line drawings as initial training modality to induce compact visual representations, evaluated across classification, detection, and segmentation tasks.

Result: Models show stronger shape bias, more focused attention, greater data efficiency, lower intrinsic dimensionality, and produce more compressible representations that enable better distillation into lightweight models.

Conclusion: Structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases, offering a powerful strategy for building more robust and adaptable vision systems.

Abstract: Despite remarkable progress in computer vision, modern recognition systems remain fundamentally limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings, suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose a novel structure-first learning paradigm that uses line drawings as an initial training modality to induce more compact and generalizable visual representations. We demonstrate that models trained with this approach develop a stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance, which mirrors observations of low-dimensional, efficient representations in the human brain. Beyond performance improvements, structure-first learning produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from teachers trained on line drawings consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases, offering a simple yet powerful strategy for building more robust and adaptable vision systems.
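The "lower intrinsic dimensionality" claim above rests on a standard measurement: counting how many principal components are needed to reach a target fraction of representational variance. The sketch below shows that computation; the eigenvalue spectrum is hypothetical, not from the paper.

```python
# Standard intrinsic-dimensionality measurement: the number of principal
# components whose cumulative explained variance reaches a target fraction.
# Fewer components for the same target = more compact representation.
def components_for_variance(eigenvalues, target=0.95):
    total = sum(eigenvalues)
    cumulative, k = 0.0, 0
    for ev in sorted(eigenvalues, reverse=True):
        cumulative += ev
        k += 1
        if cumulative / total >= target:
            return k
    return k
```

A structure-first model with a spectrum like [8, 1, 0.5, 0.5] reaches 95% variance in 3 components, whereas a flatter spectrum needs all of them, which is the sense in which its representations are lower-dimensional.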

[213] An Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation

Chao Yin, Jide Li, Hang Yao, Xiaoqiang Li

Main category: cs.CV

TL;DR: Proposes IAPF, a training-free framework for camouflaged object segmentation that generates instance-level prompts for SAM instead of semantic prompts, enabling better handling of multiple camouflaged instances.

DetailsMotivation: Existing training-free COS methods produce semantic-level prompts that lead to coarse masks and struggle with multiple discrete camouflaged instances.

Method: Uses Instance Mask Generator with detector-agnostic enumerator for box prompts and SFMBP strategy for point prompts, plus text prompt generator and self-consistency voting.

Result: Achieves state-of-the-art performance on three COS benchmarks, two CIS benchmarks, and two downstream datasets among training-free methods.

Conclusion: IAPF successfully upgrades prompt granularity from semantic to instance-level while keeping components frozen, enabling effective multi-instance camouflaged object segmentation.

Abstract: Training-free Camouflaged Object Segmentation (COS) seeks to segment camouflaged objects without task-specific training, by automatically generating visual prompts to guide the Segment Anything Model (SAM). However, existing pipelines mostly yield semantic-level prompts, which drive SAM to coarse semantic masks and struggle to handle multiple discrete camouflaged instances effectively. To address this critical limitation, we propose an Instance-Aware Prompting Framework (IAPF), the first training-free COS framework to upgrade prompt granularity from semantic to instance level while keeping all components frozen. The centerpiece is an Instance Mask Generator that (i) leverages a detector-agnostic enumerator to produce precise instance-level box prompts for the foreground tag, and (ii) introduces the Single-Foreground Multi-Background Prompting (SFMBP) strategy to sample region-constrained point prompts within each box prompt, enabling SAM to output instance masks. The pipeline is supported by a simple text prompt generator that produces image-specific tags and a self-consistency vote across synonymous task-generic prompts to stabilize inference. Extensive evaluations on three COS benchmarks, two CIS benchmarks, and two downstream datasets demonstrate state-of-the-art performance among training-free methods. Code will be released upon acceptance.

[214] GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects

Licheng Shen, Saining Zhang, Honghan Li, Peilin Yang, Zihao Huang, Zongzheng Zhang, Hao Zhao

Main category: cs.CV

TL;DR: Unified representation using articulated 3D Gaussians that jointly models geometry and motion for reconstructing articulated objects, supporting up to 20 parts and outperforming prior methods.

DetailsMotivation: Prior methods decouple geometry and motion, complicating reconstruction pipelines and limiting scalability for objects with complex, multi-part articulation.

Method: Introduces a unified representation using articulated 3D Gaussians that jointly models geometry and motion, improving motion decomposition robustness.

Result: Achieves superior accuracy in part-level geometry reconstruction and motion estimation across diverse object types, supporting up to 20 parts compared to prior methods’ 2-3 part limit.

Conclusion: Demonstrates applicability to robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations for scalable physical modeling.

Abstract: Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2–3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.

[215] DcMatch: Unsupervised Multi-Shape Matching with Dual-Level Consistency

Tianwei Ye, Yong Ma, Xiaoguang Mei

Main category: cs.CV

TL;DR: DcMatch is an unsupervised learning framework for non-rigid multi-shape matching that uses shape graph attention and dual-domain consistency to achieve superior performance over state-of-the-art methods.

DetailsMotivation: Existing methods learn canonical embeddings from single shapes, but there's a need to capture the underlying manifold structure of entire shape collections for more robust and consistent correspondences.

Method: Uses shape graph attention network to capture manifold structure, constructs shared latent space with universe predictor, and enforces dual-level consistency through spatial/spectral domain alignment with cycle consistency loss.

Result: Extensive experiments show DcMatch consistently outperforms previous state-of-the-art approaches across diverse multi-shape matching benchmarks.

Conclusion: The proposed framework successfully addresses multi-shape matching by leveraging collective shape information and dual-domain consistency, achieving more accurate and coherent correspondences.

Abstract: Establishing point-to-point correspondences across multiple 3D shapes is a fundamental problem in computer vision and graphics. In this paper, we introduce DcMatch, a novel unsupervised learning framework for non-rigid multi-shape matching. Unlike existing methods that learn a canonical embedding from a single shape, our approach leverages a shape graph attention network to capture the underlying manifold structure of the entire shape collection. This enables the construction of a more expressive and robust shared latent space, leading to more consistent shape-to-universe correspondences via a universe predictor. Simultaneously, we represent these correspondences in both the spatial and spectral domains and enforce their alignment in the shared universe space through a novel cycle consistency loss. This dual-level consistency fosters more accurate and coherent mappings. Extensive experiments on several challenging benchmarks demonstrate that our method consistently outperforms previous state-of-the-art approaches across diverse multi-shape matching scenarios.
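The dual-level cycle consistency can be illustrated with a toy example: composing the maps shape A → universe → shape B → universe → shape A should recover the identity. The sketch below uses hard permutations for simplicity; the real method operates on soft correspondence matrices, and all names here are ours.

```python
# Toy illustration of cycle consistency through a shared "universe" space:
# mapping A -> universe -> B -> universe -> A must give back the identity.
# Maps are index lists (hard permutations) for clarity.
def compose(f, g):
    """Apply f, then g: (g o f)[i] = g[f[i]]."""
    return [g[i] for i in f]

def cycle_consistent(a_to_u, u_to_b, b_to_u, u_to_a):
    cycle = compose(compose(compose(a_to_u, u_to_b), b_to_u), u_to_a)
    return cycle == list(range(len(cycle)))
```

A cycle-consistency loss penalizes the deviation of this round trip from the identity, which is what forces the spatial and spectral correspondences to agree in the shared universe space.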

[216] Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch

Zia Badar

Main category: cs.CV

TL;DR: This paper presents a differentiable quantization method for neural networks that converges to optimal solutions, supporting n-bit quantization including shift/logarithmic formats, achieving near-full-precision accuracy with only 15 training epochs.

DetailsMotivation: Previous quantization methods were non-differentiable with manually set derivatives, making learning questionable, and struggled with activation quantization or multi-bit logarithmic quantization.

Method: Differentiable quantization approach with proof of convergence, supporting n-bit quantization including shift/logarithmic formats (values of form 2^n), enabling both weight and activation quantization.

Result: On ImageNet with ResNet18: weight-only quantization achieves <1% accuracy drop vs full precision; weight+activation quantization achieves SOTA-comparable accuracy, both trained in only 15 epochs with slightly higher CPU instructions but no high-precision multiplication.

Conclusion: The method provides differentiable quantization with convergence guarantees, enabling efficient multi-bit logarithmic quantization for both weights and activations with minimal accuracy loss and fast training.

Abstract: Quantization of neural networks reduces the compute and memory requirements of inference. Previous work in quantization lacks two important aspects that this work provides. First, almost all previous work used a non-differentiable approach in which the derivative is set manually in backpropagation, which makes the learning ability of the algorithm questionable; our approach is not only differentiable, we also provide a proof of convergence to the optimal neural network. Second, previous work in shift/logarithmic quantization either avoided activation quantization alongside weight quantization or achieved lower accuracy. Learning logarithmic quantized values of the form $2^n$ requires a quantization function that scales beyond 1-bit quantization, which is another benefit of our method: it provides $n$-bit quantization as well. Tested on image classification with the ImageNet dataset and ResNet18, our approach with weight-only shift-bit quantization loses less than 1 percent accuracy compared to full precision while taking only 15 epochs to train, and achieves accuracy comparable to SOTA approaches with both weight and activation shift-bit quantization in 15 training epochs, at slightly higher inference cost (more CPU instructions) than 1-bit quantization (without logarithmic quantization) and without requiring any higher-precision multiplication.
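The $2^n$ value format mentioned above means each weight is snapped to a signed power of two, so inference multiplications reduce to bit shifts. The sketch below shows plain nearest-power-of-two rounding to illustrate the format; it is not the paper's differentiable training procedure, and the rounding and clamping rules are ours.

```python
# Sketch of shift (logarithmic) quantization: snap each weight to a signed
# power of two so that w * x becomes a bit shift of x at inference time.
# Illustrates the 2^n value format only, not the differentiable training.
import math

def shift_quantize(w, n_bits=4):
    if w == 0.0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    exp = round(math.log2(abs(w)))          # nearest power-of-two exponent
    lo = -(2 ** (n_bits - 1))               # clamp to the representable
    hi = 2 ** (n_bits - 1) - 1              # exponent range for n_bits
    exp = max(lo, min(hi, exp))
    return sign * (2.0 ** exp)
```

For example, 0.3 quantizes to 0.25 (exponent -2) and 6.0 to 8.0 (exponent 3); at inference, multiplying by 2^3 is a left shift by 3.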

[217] Mapping Hidden Heritage: Self-supervised Pre-training for Archaeological Stone Wall Mapping in Historic Landscapes Using High-Resolution DEM Derivatives

Zexian Huang, Mashnoon Islam, Brian Armstrong, Billy Bell, Kourosh Khoshelham, Martin Tomko

Main category: cs.CV

TL;DR: DINO-CV uses self-supervised cross-view pre-training on LiDAR DEM derivatives to automatically map historic dry-stone walls in vegetated landscapes, achieving high accuracy with minimal labeled data.

DetailsMotivation: Historic dry-stone walls are culturally and environmentally important but remain undocumented in remote/vegetated areas due to accessibility issues and high mapping costs. Deep learning faces challenges from vegetation occlusion and scarce labeled data.

Method: Proposes DINO-CV, a self-supervised cross-view pre-training framework using knowledge distillation. Learns invariant structural representations from multiple DEM-derived views (Multi-directional Hillshade and Visualization for Archaeological Topography) to address occlusion and data scarcity.

Result: Achieved 68.6% mIoU on test areas in Budj Bim Cultural Landscape (UNESCO site). Maintained 63.8% mIoU when fine-tuned with only 10% labeled data, demonstrating effectiveness with minimal supervision.

Conclusion: Self-supervised learning on high-resolution DEM derivatives enables scalable, automated mapping of cultural heritage features in complex vegetated environments, with applications beyond archaeology to environmental monitoring and heritage preservation.

Abstract: Historic dry-stone walls hold significant cultural and environmental importance, serving as historical markers and contributing to ecosystem preservation and wildfire management during dry seasons in Australia. However, many of these stone structures in remote or vegetated landscapes remain undocumented due to limited accessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable approach for automated mapping of such features, but challenges remain: the visual occlusion of low-lying walls by dense vegetation and the scarcity of labeled training data. This study presents DINO-CV, a self-supervised cross-view pre-training framework based on knowledge distillation, designed for accurate mapping of dry-stone walls using high-resolution Digital Elevation Models (DEMs) derived from airborne LiDAR. By learning invariant structural representations across multiple DEM-derived views, specifically Multi-directional Hillshade (MHS) and Visualization for Archaeological Topography (VAT), DINO-CV addresses both occlusion and data scarcity challenges. Applied to the Budj Bim Cultural Landscape (Victoria, Australia), a UNESCO World Heritage site, the approach achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for large-scale, automated mapping of cultural heritage features in complex and vegetated environments. Beyond archaeology, this approach offers a scalable solution for environmental monitoring and heritage preservation across inaccessible or environmentally sensitive regions.
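The Multi-directional Hillshade (MHS) view used above builds on the standard single-direction hillshade computed from local DEM gradients. The sketch below uses the usual Horn-style cartographic formula with default sun parameters; these defaults are conventions, not values taken from the paper, and MHS would average this over several azimuths.

```python
# Standard single-direction hillshade from local DEM gradients (dz/dx,
# dz/dy). Multi-directional hillshade (MHS) averages this over several
# sun azimuths to reveal structures regardless of orientation.
import math

def hillshade(dzdx, dzdy, azimuth_deg=315.0, altitude_deg=45.0):
    zenith = math.radians(90.0 - altitude_deg)
    azimuth = math.radians(360.0 - azimuth_deg + 90.0)
    slope = math.atan(math.hypot(dzdx, dzdy))
    aspect = math.atan2(dzdy, -dzdx)
    shade = (math.cos(zenith) * math.cos(slope)
             + math.sin(zenith) * math.sin(slope) * math.cos(azimuth - aspect))
    return max(0.0, shade)  # clamp self-shadowed cells to 0
```

Flat terrain shades to cos(zenith) regardless of azimuth, while a low linear ridge such as a dry-stone wall produces a bright/dark pair that the segmentation network can pick up.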

[218] PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik

Main category: cs.CV

TL;DR: PISA-Bench is a multilingual benchmark for vision-language models derived from expert-created PISA tests, covering 6 languages with human-verified examples to address limitations in existing synthetic datasets.

DetailsMotivation: To address limitations in existing benchmarks that rely on synthetic LLM-generated content and are mostly English-only, by creating a high-quality, human-verified multilingual benchmark for multimodal reasoning.

Method: Derived examples from expert-created PISA tests, extracted human-verified instructions, questions, answer options, and images, then translated them into 5 additional languages (Spanish, German, Chinese, French, Italian) to create a parallel corpus across 6 languages.

Result: Evaluation of state-of-the-art VLMs showed small models (<20B parameters) fail to achieve high scores, substantial performance degradation on non-English splits, and high error rates in spatial and geometric reasoning tasks.

Conclusion: PISA-Bench provides a valuable resource for advancing multilingual multimodal reasoning research, highlighting current model limitations in non-English languages and spatial reasoning.

Abstract: Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on content synthetically generated by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.

[219] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations

Kiran Shahi, Anup Bagale

Main category: cs.CV

TL;DR: Weakly supervised deep learning framework for pneumonia classification and localization using Grad-CAM explanations with image-level labels instead of pixel-level annotations.

DetailsMotivation: To develop an interpretable AI system for pneumonia screening that uses only image-level labels to avoid costly pixel-level annotations while providing clinically meaningful heatmaps.

Method: Evaluated seven pre-trained architectures and Vision Transformer using focal loss and patient-wise splits to prevent data leakage, with Grad-CAM for heatmap generation.

Result: All models achieved high accuracy (96-98%), with ResNet-18 and EfficientNet-B0 showing best overall performance, and MobileNet-V2 as efficient lightweight alternative. Grad-CAM heatmaps focused on clinically relevant lung regions.

Conclusion: The study demonstrates the potential of weakly supervised, explainable models for enhancing transparency and clinical trust in AI-assisted pneumonia screening.

Abstract: This study proposes a weakly supervised deep learning framework for pneumonia classification and localization from chest X-rays, utilizing Grad-CAM explanations. Instead of costly pixel-level annotations, our approach uses image-level labels to generate clinically meaningful heatmaps that highlight regions affected by pneumonia. We evaluate seven pre-trained architectures and the Vision Transformer under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high accuracy (96-98%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V2 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations confirm that the proposed models focus on clinically relevant lung regions, supporting the use of interpretable AI for radiological diagnostics. This work highlights the potential of weakly supervised, explainable models to enhance transparency and clinical trust in AI-assisted pneumonia screening.
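The Grad-CAM heatmaps described above follow a standard recipe: channel weights come from globally averaging the gradients of the class score with respect to a convolutional layer's activations, and the weighted activation maps are summed and rectified. A minimal NumPy sketch of that computation (the function name and the assumption of precomputed activations and gradients are illustrative, not taken from the paper):

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap from a conv layer's activations and the
    gradients of the class score w.r.t. those activations.

    feature_maps: (C, H, W) activations of the target conv layer
    gradients:    (C, H, W) d(class score)/d(activations)
    """
    # Channel importance weights: global-average-pool the gradients
    weights = gradients.mean(axis=(1, 2))                        # (C,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence
    cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0)
    # Normalize to [0, 1] for overlaying on the X-ray
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In practice the heatmap is upsampled to the input resolution and overlaid on the chest X-ray for inspection.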

[220] Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface

Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu, Yangguang Li, Wanli Ouyang, Yuanming Hu, Guang Yang, ChoonHwai Yap

Main category: cs.CV

TL;DR: Faithful Contouring is a sparse voxelized representation for 3D meshes that achieves near-lossless fidelity at 2048+ resolutions without requiring water-tightening or isosurface extraction, enabling high-accuracy 3D reconstruction and generation.

DetailsMotivation: Existing voxelized representations based on iso-surface methods compromise geometric fidelity due to reliance on water-tightening or rendering optimization, creating a need for a more faithful representation that preserves sharpness and internal structures.

Method: Proposes Faithful Contouring - a sparse voxelized representation that doesn’t convert meshes to field functions or extract isosurface during remeshing, and designs a dual-mode autoencoder for scalable shape reconstruction.

Result: Achieves distance errors at 10^-5 level for direct representation, and for mesh reconstruction yields 93% reduction in Chamfer Distance and 35% improvement in F-score over strong baselines.

Conclusion: Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction, confirming superior fidelity as a representation for 3D learning tasks.

Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
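Chamfer Distance and F-score, the two reconstruction metrics reported above, can be sketched in a few lines of NumPy. The brute-force nearest-neighbour search and the threshold value here are illustrative, not the paper's evaluation code:

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3).
    Brute-force pairwise distances; fine for small illustrative clouds."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(p: np.ndarray, q: np.ndarray, tau: float = 0.01) -> float:
    """F-score at threshold tau: harmonic mean of the fraction of points
    in each set that lie within tau of the other set."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Real benchmarks use spatial data structures (e.g. KD-trees) instead of the dense distance matrix, but the metrics themselves are exactly these.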

[221] Self-Supervised Implicit Attention Priors for Point Cloud Reconstruction

Kyle Fogarty, Chenyue Cai, Jing Yang, Zhilin Guo, Cengiz Öztireli

Main category: cs.CV

TL;DR: An implicit self-prior approach that learns shape-specific priors directly from input point clouds using cross-attention with a learnable dictionary, combined with robust implicit moving least squares for high-quality surface reconstruction.

DetailsMotivation: Recovering high-quality surfaces from irregular point clouds is ill-posed without strong geometric priors, and existing methods often require external training data or fail to preserve fine details.

Method: Jointly trains a dictionary of learnable embeddings with an implicit distance field using cross-attention, optimized with self-supervised point cloud reconstruction losses. Then samples the trained field to extract dense points and normals, integrated into robust implicit moving least squares (RIMLS).

Result: Outperforms both classical and learning-based approaches in generating high-fidelity surfaces with superior detail preservation and robustness to common data degradations.

Conclusion: The hybrid strategy effectively preserves fine geometric details while leveraging learned priors to regularize sparse regions, requiring no external training data.

Abstract: Recovering high-quality surfaces from irregular point clouds is ill-posed unless strong geometric priors are available. We introduce an implicit self-prior approach that distills a shape-specific prior directly from the input point cloud itself and embeds it within an implicit neural representation. This is achieved by jointly training a small dictionary of learnable embeddings with an implicit distance field; at every query location, the field attends to the dictionary via cross-attention, enabling the network to capture and reuse repeating structures and long-range correlations inherent to the shape. Optimized solely with self-supervised point cloud reconstruction losses, our approach requires no external training data. To effectively integrate this learned prior while preserving input fidelity, the trained field is then sampled to extract densely distributed points and analytic normals via automatic differentiation. We integrate the resulting dense point cloud and corresponding normals into a robust implicit moving least squares (RIMLS) formulation. We show this hybrid strategy preserves fine geometric details in the input data, while leveraging the learned prior to regularize sparse regions. Experiments show that our method outperforms both classical and learning-based approaches in generating high-fidelity surfaces with superior detail preservation and robustness to common data degradations.
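The core mechanism above, query locations attending to a small learnable dictionary, is ordinary cross-attention with the dictionary supplying the keys and values. A hedged NumPy sketch (names, shapes, and the identity projections in the test usage are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dict_cross_attention(queries: np.ndarray, dictionary: np.ndarray,
                         Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Query features attend to a small learnable dictionary of embeddings.

    queries:    (N, d) per-query-location features
    dictionary: (M, d) learnable shape-prior embeddings (M is small)
    Wk, Wv:     (d, d) key/value projections (learned in practice)
    """
    K, V = dictionary @ Wk, dictionary @ Wv                  # (M, d)
    attn = softmax(queries @ K.T / np.sqrt(K.shape[1]))      # (N, M)
    return attn @ V                                          # (N, d)
```

Because every query attends to the same shared dictionary, repeated structures across the shape can reuse the same entries, which is what lets the prior regularize sparse regions.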

[222] A Mixture-of-Experts Framework with Log-Logistic Components for Survival Analysis on Histopathology Images

Ardhendu Sekhar, Vasu Soni, Keshav Aske, Shivam Madnoorkar, Pranav Jeevan, Amit Sethi

Main category: cs.CV

TL;DR: A modular framework for predicting cancer survival from pathology images using quantile-based patch selection, graph clustering, hierarchical attention, and mixture modeling.

DetailsMotivation: To develop a more accurate method for predicting cancer-specific survival from whole slide pathology images by capturing tissue heterogeneity and complex survival distributions.

Method: Four-component framework: (1) Quantile Gated Patch Selection for informative regions, (2) Graph Guided Clustering for phenotype heterogeneity, (3) Hierarchical Context Attention for cluster interactions, (4) Expert Driven Mixture of Log-logistics for survival distribution estimation.

Result: Achieved concordance indices of 0.644 on TCGA LUAD, 0.751 on TCGA KIRC, and 0.752 on TCGA BRCA, outperforming state-of-the-art methods.

Conclusion: The proposed modular framework effectively captures tissue heterogeneity and complex survival patterns, demonstrating superior performance in predicting cancer-specific survival across multiple cancer types.

Abstract: We propose a modular framework for predicting cancer-specific survival from whole slide pathology images (WSIs). The method integrates four components: (i) Quantile Gated Patch Selection via quantile-based thresholding to isolate prognostically informative tissue regions; (ii) Graph Guided Clustering using a k-nearest-neighbor graph to capture phenotype-level heterogeneity through spatial and morphological coherence; (iii) Hierarchical Context Attention to learn intra- and inter-cluster interactions; and (iv) an Expert Driven Mixture of Log-logistics framework to estimate complex survival distributions using log-logistic distributions. The model attains concordance indices of 0.644 on TCGA LUAD, 0.751 on TCGA KIRC, and 0.752 on TCGA BRCA, outperforming existing state-of-the-art approaches.
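The concordance index reported above measures how often the model ranks comparable pairs of patients correctly (higher predicted risk should mean an earlier event). A minimal sketch of Harrell's C-index, the standard definition rather than the authors' evaluation code:

```python
import numpy as np

def concordance_index(times, events, risk_scores) -> float:
    """Harrell's C-index: fraction of comparable patient pairs that the
    model orders correctly (higher risk -> earlier event).

    times:       observed follow-up times
    events:      1 if the event occurred, 0 if the patient was censored
    risk_scores: model-predicted risk (higher = worse prognosis)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # (i, j) is comparable if i had an event before j's observed time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5          # ties get half credit
    return concordant / comparable
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the 0.644-0.752 values above sit well above chance. Production code (e.g. the `lifelines` library) uses an O(n log n) implementation instead of this quadratic loop.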

[223] SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection

Xin Zuo, Yuchen Qu, Haibo Zhan, Jifeng Shen, Wankou Yang

Main category: cs.CV

TL;DR: Proposes SFFR method using KAN networks for spatial-frequency feature reconstruction in multispectral object detection, with FCEKAN for frequency component exchange and MSGKAN for multi-scale spatial feature modeling.

DetailsMotivation: Current multispectral object detection methods focus mainly on spatial-domain feature fusion, while frequency-domain features remain underexplored. There's a need to leverage both spatial and frequency domains for better feature representation.

Method: SFFR method with two core modules: FCEKAN for selective frequency component exchange between RGB and IR images, and MSGKAN with multi-scale Gaussian basis functions for spatial domain feature modeling that adapts to UAV altitude variations.

Result: Extensive experiments on SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate superior performance in UAV multispectral object perception tasks. The modules are complementary and effectively capture frequency and spatial semantic features.

Conclusion: The proposed SFFR method with FCEKAN and MSGKAN modules provides significant advantages for multispectral object detection by leveraging both spatial and frequency domain representations, enhancing model adaptability and robustness to scale variations.

Abstract: Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain features remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency feature of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model’s adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion. Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in UAV multispectral object perception tasks. Code will be available at https://github.com/qchenyu1027/SFFR.
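The idea behind FCEKAN's frequency component exchange can be illustrated with a plain FFT-based swap of low-frequency bands between two single-channel inputs; this sketch is a generic approximation with an illustrative square mask, not the paper's selective strategy:

```python
import numpy as np

def exchange_low_freq(a: np.ndarray, b: np.ndarray, radius: int = 4):
    """Swap the low-frequency FFT components of two single-channel images.

    a, b:   (H, W) arrays (e.g. an RGB-derived and an IR feature map)
    radius: half-size of the centred low-frequency square to exchange
    Returns both images with their low-frequency content swapped.
    """
    Fa = np.fft.fftshift(np.fft.fft2(a))   # shift DC component to centre
    Fb = np.fft.fftshift(np.fft.fft2(b))
    h, w = a.shape
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - radius:cy + radius, cx - radius:cx + radius] = True
    Fa2, Fb2 = Fa.copy(), Fb.copy()
    Fa2[mask], Fb2[mask] = Fb[mask], Fa[mask]   # exchange low frequencies
    out_a = np.fft.ifft2(np.fft.ifftshift(Fa2)).real
    out_b = np.fft.ifft2(np.fft.ifftshift(Fb2)).real
    return out_a, out_b
```

Low frequencies carry coarse structure while high frequencies carry edges and texture, so exchanging a band lets each modality inherit complementary content from the other before fusion.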

[224] TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning

Rui Wang, Ying Zhou, Hao Wang, Wenwei Zhang, Qiang Li, Zhiwei Wang

Main category: cs.CV

TL;DR: TiS-TSL is a time-switchable teacher-student learning framework for video stereo matching in minimally invasive surgery that addresses temporal consistency issues in existing methods through unified image and video prediction modes.

DetailsMotivation: Stereo matching in MIS is crucial for navigation and AR, but dense supervision is impossible due to anatomical constraints. Existing teacher-student methods lack temporal consistency, causing unstable predictions and flickering artifacts in videos.

Method: Proposes TiS-TSL with a unified model operating in three modes (IP, FVP, BVP) and a two-stage learning strategy (I2V then V2V) that transfers image knowledge to video and refines predictions using bidirectional spatio-temporal consistency.

Result: Experimental results show TiS-TSL outperforms image-based state-of-the-art methods, improving TEPE and EPE by at least 2.11% and 4.54% respectively on two public datasets.

Conclusion: TiS-TSL effectively addresses temporal consistency issues in surgical video stereo matching through unified temporal modeling and bidirectional consistency estimation, achieving superior performance with minimal supervision.

Abstract: Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL outperforms image-based state-of-the-art methods, improving TEPE and EPE by at least 2.11% and 4.54%, respectively.

[225] Mapping Reduced Accessibility to WASH Facilities in Rohingya Refugee Camps with Sub-Meter Imagery

Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes Taubenböck, Meeyoung Cha

Main category: cs.CV

TL;DR: Remote sensing framework using semi-supervised segmentation to detect refugee shelters and quantify WASH accessibility in Rohingya camps, revealing declining access and gender disparities.

DetailsMotivation: WASH services remain a major public health concern in refugee camps, with challenges in detecting shelters due to dense spatial configuration and irregular geometric patterns.

Method: Semi-supervised segmentation framework using sub-meter satellite images to detect individual refugee shelters, achieving 76.4% F1-score, applied across multi-year data for WASH accessibility analysis.

Result: Declining WASH accessibility from 25 people per facility in 2022 to 29.4 in 2025, with women and girls experiencing reduced accessibility due to inadequate safety-related segregation.

Conclusion: Importance of demand-responsive allocation strategies and value of high-resolution remote sensing with machine learning to detect inequality and inform equitable resource planning in humanitarian settings.

Abstract: Access to Water, Sanitation, and Hygiene (WASH) services remains a major public health concern in refugee camps. This study introduces a remote sensing-driven framework to quantify WASH accessibility, specifically to water pumps, latrines, and bathing cubicles, in the Rohingya camps of Cox’s Bazar, one of the world’s most densely populated displacement settings. Detecting refugee shelters in such emergent camps presents substantial challenges, primarily due to their dense spatial configuration and irregular geometric patterns. Using sub-meter satellite images, we develop a semi-supervised segmentation framework that achieves an F1-score of 76.4% in detecting individual refugee shelters. Applying the framework across multi-year data reveals declining WASH accessibility, driven by rapid refugee population growth and reduced facility availability, rising from 25 people per facility in 2022 to 29.4 in 2025. Gender-disaggregated analysis further shows that women and girls experience reduced accessibility in scenarios with inadequate safety-related segregation in WASH facilities. These findings suggest the importance of demand-responsive allocation strategies that can identify areas with under-served populations, such as women and girls, and ensure that limited infrastructure serves the greatest number of people in settings with fixed or shrinking budgets. We also discuss the value of high-resolution remote sensing and machine learning to detect inequality and inform equitable resource planning in complex humanitarian environments.

[226] DI3CL: Contrastive Learning With Dynamic Instances and Contour Consistency for SAR Land-Cover Classification Foundation Model

Zhongle Ren, Hui Ding, Kai Wang, Biao Hou, Xingyu Luo, Weibin Li, Licheng Jiao

Main category: cs.CV

TL;DR: A foundation model for SAR land-cover classification using Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) that outperforms existing methods across various tasks without heavy reliance on labeled data.

DetailsMotivation: Current SAR land-cover classification methods rely heavily on supervised learning with extensive labeled datasets, limiting scalability, generalization, and adaptability to diverse scenarios.

Method: Proposed DI3CL pre-training framework with Dynamic Instance module for global contextual awareness and Contour Consistency module for geometric contour focus, trained on large-scale SARSense dataset (460,532 SAR images).

Result: Extensive experiments across SAR land-cover mapping, water body detection, and road extraction tasks consistently demonstrate DI3CL outperforms existing methods.

Conclusion: The foundation model serves as a robust cornerstone for accelerating downstream SAR land-cover classification development and deployment, with publicly available code and pre-trained weights.

Abstract: Although significant advances have been achieved in SAR land-cover classification, recent methods remain predominantly focused on supervised learning, which relies heavily on extensive labeled datasets. This dependency not only limits scalability and generalization but also restricts adaptability to diverse application scenarios. In this paper, a general-purpose foundation model for SAR land-cover classification is developed, serving as a robust cornerstone to accelerate the development and deployment of various downstream models. Specifically, a Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) pre-training framework is presented, which incorporates a Dynamic Instance (DI) module and a Contour Consistency (CC) module. DI module enhances global contextual awareness by enforcing local consistency across different views of the same region. CC module leverages shallow feature maps to guide the model to focus on the geometric contours of SAR land-cover objects, thereby improving structural discrimination. Additionally, to enhance robustness and generalization during pre-training, a large-scale and diverse dataset named SARSense, comprising 460,532 SAR images, is constructed to enable the model to capture comprehensive and representative features. To evaluate the generalization capability of our foundation model, we conducted extensive experiments across a variety of SAR land-cover classification tasks, including SAR land-cover mapping, water body detection, and road extraction. The results consistently demonstrate that the proposed DI3CL outperforms existing methods. Our code and pre-trained weights are publicly available at: https://github.com/SARpre-train/DI3CL.

[227] DiffRegCD: Integrated Registration and Change Detection with Diffusion Features

Seyedehanita Madani, Rama Chellappa, Vishal M. Patel

Main category: cs.CV

TL;DR: DiffRegCD is a unified framework that integrates dense registration and change detection in a single model, achieving sub-pixel accuracy and robust performance under large displacements and viewpoint variations.

DetailsMotivation: Real-world imagery often exhibits parallax, viewpoint shifts, and temporal gaps causing severe misalignment, which traditional two-stage methods and recent joint frameworks struggle to handle effectively.

Method: Reformulates correspondence estimation as Gaussian smoothed classification task, leverages frozen multi-scale features from pretrained denoising diffusion model, and uses controlled affine perturbations on standard CD datasets for supervision.

Result: Extensive experiments on aerial and ground-level datasets show DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation.

Conclusion: Diffusion features and classification-based correspondence provide a strong foundation for unified change detection, establishing DiffRegCD as an effective solution for misaligned imagery.

Abstract: Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two-stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements, relying on regression-only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian-smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, ensuring robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground-level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification-based correspondence as a strong foundation for unified change detection.
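Recasting correspondence regression as Gaussian-smoothed classification means each ground-truth match becomes a soft probability map over grid cells rather than a single (x, y) target. A sketch of building such a target (the function name and normalization choice are illustrative, not the paper's exact formulation):

```python
import numpy as np

def gaussian_target(gt_xy, h: int, w: int, sigma: float = 2.0) -> np.ndarray:
    """Turn a ground-truth correspondence location into a Gaussian-smoothed
    classification target over an (h, w) grid.

    gt_xy: (x, y) ground-truth match location in pixel coordinates
    sigma: Gaussian spread; controls how soft the label is
    """
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - gt_xy[0]) ** 2 + (ys - gt_xy[1]) ** 2)
               / (2 * sigma ** 2))
    return g / g.sum()   # normalized so it can be trained with cross-entropy
```

Compared with direct regression, the smoothed label tolerates small annotation noise and gives a gradient signal from every cell, while the argmax (or a soft-argmax) still recovers sub-pixel locations at inference.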

[228] EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision

Yifei Cao, Yu Liu, Guolong Wang, Zhu Liu, Kai Wang, Xianjie Zhang, Jizhe Yu, Xun Tu

Main category: cs.CV

TL;DR: EAGLE is a novel framework for egocentric visual query localization that uses episodic appearance- and geometry-aware memory to achieve unified 2D-3D localization, achieving state-of-the-art performance on Ego4D-VQ benchmark.

DetailsMotivation: Egocentric visual query localization is challenging due to camera motion, viewpoint changes, and appearance variations, making it vital for embodied AI and VR/AR applications.

Method: EAGLE integrates segmentation guided by appearance-aware meta-learning memory (AMM) with tracking driven by geometry-aware localization memory (GLM), using memory consolidation inspired by avian memory. It also uses visual geometry grounded Transformer (VGGT) to unify 2D and 3D tasks.

Result: The method achieves state-of-the-art performance on the Ego4D-VQ benchmark, enabling precise contour delineation with robust spatial discrimination and significantly improved retrieval accuracy.

Conclusion: EAGLE’s memory consolidation mechanism through structured appearance and geometry memory banks effectively supports both long- and short-term modeling of target appearance variations, enabling efficient unification of 2D and 3D visual query localization.

Abstract: Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM), with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.

[229] Foam Segmentation in Wastewater Treatment Plants: A Federated Learning Approach with Segment Anything Model 2

Mehmet Batuhan Duman, Alejandro Carnero, Cristian Martín, Daniel Garrido, Manuel Díaz

Main category: cs.CV

TL;DR: Proposes a federated learning framework combining SAM2 with Flower framework for privacy-preserving foam segmentation in wastewater treatment plants, enabling collaborative training without sharing sensitive data.

DetailsMotivation: Foam formation in WTPs reduces treatment efficiency and increases costs, but standard ML approaches require large labeled datasets and face data privacy concerns across different plants.

Method: Combines Federated Learning with SAM2 image segmentation model, using Flower framework for distributed training across edge nodes with a central Fog server aggregating weights without accessing private data.

Result: The framework accelerates training convergence and improves segmentation performance even with limited local datasets by leveraging SAM2’s pre-trained weights, validated on real-world and synthetic datasets.

Conclusion: Offers a practical, scalable, privacy-aware solution for automatic foam tracking in WTPs, demonstrating the potential of integrating foundational models with FL for industrial challenges with distributed sensitive data.

Abstract: Foam formation in Wastewater Treatment Plants (WTPs) is a major challenge that can reduce treatment efficiency and increase costs. The ability to automatically examine changes in real-time with respect to the percentage of foam can be of great benefit to the plant. However, large amounts of labeled data are required to train standard Machine Learning (ML) models. The development of these systems is slow due to the scarcity and heterogeneity of labeled data. Additionally, the development is often hindered by the fact that different WTPs do not share their data due to privacy concerns. This paper proposes a new framework to address these challenges by combining Federated Learning (FL) with the state-of-the-art base model for image segmentation, Segment Anything Model 2 (SAM2). The FL paradigm enables collaborative model training across multiple WTPs without centralizing sensitive operational data, thereby ensuring privacy. The framework accelerates training convergence and improves segmentation performance even with limited local datasets by leveraging SAM2’s strong pre-trained weights for initialization. The methodology involves fine-tuning SAM2 on distributed clients (edge nodes) using the Flower framework, where a central Fog server orchestrates the process by aggregating model weights without accessing private data. The model was trained and validated using various data collections, including real-world images captured at a WTP in Granada, Spain, a synthetically generated foam dataset, and images from publicly available datasets to improve generalization. This research offers a practical, scalable, and privacy-aware solution for automatic foam tracking in WTPs. The findings highlight the significant potential of integrating large-scale foundational models into FL systems to solve real-world industrial challenges characterized by distributed and sensitive data.
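The server-side aggregation that Flower orchestrates is typically FedAvg: a dataset-size-weighted average of client parameters, computed without the server ever seeing raw images. A minimal NumPy sketch of that aggregation step (the function signature is illustrative, not Flower's API):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: dataset-size-weighted mean of client parameters.

    client_weights: list (one entry per client) of lists of np.ndarray,
                    all clients sharing the same layer structure
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    # Weight each client's contribution by its share of the total data
    return [
        sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in range(n_layers)
    ]
```

Each round, clients fine-tune locally (here, SAM2 starting from its pre-trained weights), send only the updated parameters to the Fog server, receive the aggregate back, and repeat; the raw plant imagery never leaves the edge nodes.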

[230] MAUGIF: Mechanism-Aware Unsupervised General Image Fusion via Dual Cross-Image Autoencoders

Kunjing Yang, Zhiwei Wang, Minru Bai

Main category: cs.CV

TL;DR: Proposes MAUGIF, a mechanism-aware unsupervised general image fusion method using dual cross-image autoencoders that adapts to different fusion mechanisms (additive vs multiplicative) for better performance and interpretability.

DetailsMotivation: Existing fusion methods are either too task-specific or apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms.

Method: Uses dual cross-image autoencoders with shared latent space to capture common content while isolating modality-specific details. Dual decoders act as feature injectors that selectively reintegrate unique characteristics based on fusion mechanism classification.

Result: Extensive experiments validate the method’s effectiveness and generalization ability across diverse fusion tasks.

Conclusion: MAUGIF provides a flexible framework that adapts to different fusion mechanisms, enhancing both performance and interpretability in general image fusion tasks.

Abstract: Image fusion aims to integrate structural and complementary information from multi-source images. However, existing fusion methods are often either highly task-specific, or general frameworks that apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms. To address this issue, we propose a mechanism-aware unsupervised general image fusion (MAUGIF) method based on dual cross-image autoencoders. Initially, we introduce a classification of additive and multiplicative fusion according to the inherent mechanisms of different fusion tasks. Then, dual encoders map source images into a shared latent space, capturing common content while isolating modality-specific details. During the decoding phase, dual decoders act as feature injectors, selectively reintegrating the unique characteristics of each modality into the shared content for reconstruction. The modality-specific features are injected into the source image in the fusion process, generating the fused image that integrates information from both modalities. The architecture of decoders varies according to their fusion mechanisms, enhancing both performance and interpretability. Extensive experiments are conducted on diverse fusion tasks to validate the effectiveness and generalization ability of our method. The code is available at https://anonymous.4open.science/r/MAUGIF.
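The additive/multiplicative distinction the paper draws can be pictured on scalar values. MAUGIF itself injects learned modality-specific features through neural decoders; the formulas below only sketch the two mechanisms and are not the paper's decoders.

```python
# Toy scalar illustration of the two fusion mechanisms MAUGIF classifies.

def additive_fusion(content, detail_a, detail_b):
    """Modality details are superimposed on the shared content."""
    return content + detail_a + detail_b

def multiplicative_fusion(content, gain_a, gain_b):
    """Modality details modulate (scale) the shared content."""
    return content * gain_a * gain_b

shared = 0.5  # common content captured by the shared latent space
fused_add = additive_fusion(shared, 0.1, -0.05)      # details add up
fused_mul = multiplicative_fusion(shared, 1.2, 0.9)  # details rescale
```

Choosing the decoder architecture per mechanism, rather than one uniform strategy, is the adaptation the paper argues improves both performance and interpretability.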

[231] SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer

Kaiyi Xu, Junchao Gong, Zhiwang Zhou, Zhangrui Li, Yuandong Pu, Yihao Liu, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bei

Main category: cs.CV

TL;DR: SynWeather is the first dataset for unified multi-region, multi-variable weather data synthesis, and SynWeatherDiff is a diffusion transformer model that addresses over-smoothing in weather synthesis.

DetailsMotivation: Current weather data synthesis approaches are limited to single-variable, single-region tasks using deterministic modeling, which restricts unified synthesis across variables/regions, overlooks cross-variable complementarity, and causes over-smoothed results.

Method: Created SynWeather dataset covering four regions (Continental US, Europe, East Asia, Tropical Cyclones) with high-resolution observations of key weather variables. Developed SynWeatherDiff, a probabilistic weather synthesis model based on Diffusion Transformer framework.

Result: Experiments on SynWeather dataset demonstrate the effectiveness of SynWeatherDiff compared to both task-specific and general models.

Conclusion: The proposed SynWeather dataset and SynWeatherDiff model successfully address limitations of current approaches by enabling unified multi-region, multi-variable weather data synthesis with improved results.

Abstract: With the advancement of meteorological instruments, abundant data has become available. However, current approaches typically focus on single-variable, single-region tasks and rely primarily on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity, and often leads to over-smoothed results. To address these challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions: the Continental United States, Europe, East Asia, and Tropical Cyclone regions, and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.

[232] SASG-DA: Sparse-Aware Semantic-Guided Diffusion Augmentation For Myoelectric Gesture Recognition

Chen Liu, Can Han, Weishi Xu, Yaqi Wang, Dahong Qian

Main category: cs.CV

TL;DR: Proposes SASG-DA, a diffusion-based data augmentation method for sEMG gesture recognition that uses semantic guidance and sparse-aware sampling to generate faithful and diverse training samples, improving model generalization.

DetailsMotivation: sEMG-based gesture recognition systems suffer from limited training data leading to overfitting and poor generalization. Existing data augmentation methods struggle to balance faithfulness and diversity, with untargeted diversity creating redundant samples.

Method: SASG-DA uses Semantic Representation Guidance (SRG) for faithful generation, Gaussian Modeling Semantic Sampling (GMSS) for flexible diversity, and Sparse-Aware Semantic Sampling (SASS) to explore underrepresented regions for targeted diversity.

Result: Extensive experiments on Ninapro DB2, DB4, and DB7 datasets show SASG-DA significantly outperforms existing augmentation methods in mitigating overfitting and improving recognition performance and generalization.

Conclusion: The proposed diffusion-based data augmentation approach effectively addresses data scarcity in sEMG gesture recognition by generating both faithful and diverse samples, enhancing model performance and generalization capabilities.

Abstract: Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two critical factors for effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism by leveraging fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Sampling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling (SASS) strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples.
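The GMSS step can be pictured as fitting a per-class Gaussian over semantic representations and drawing stochastic samples from it to condition the diffusion model. The sketch below uses a diagonal Gaussian over hypothetical two-dimensional embeddings; the paper fits task-aware representations learned from sEMG data.

```python
# Sketch of Gaussian Modeling Semantic Sampling (GMSS): model the semantic
# representation distribution of one gesture class, then sample new
# conditions that are faithful (near the class mean) yet diverse (stochastic).
import random
import statistics

def fit_diagonal_gaussian(embeddings):
    """Per-dimension mean/std of a set of semantic embedding vectors."""
    dims = list(zip(*embeddings))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

def sample_semantic(means, stds, rng):
    """Draw one faithful-but-diverse semantic condition."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]

rng = random.Random(0)
class_embeddings = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]  # one gesture class
means, stds = fit_diagonal_gaussian(class_embeddings)
condition = sample_semantic(means, stds, rng)  # fed to the diffusion model
```

The sparse-aware variant (SASS) would bias this sampling toward low-density regions of the fitted distribution instead of drawing uniformly from it.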

cs.AI

[233] Bridging Natural Language and ASP: A Hybrid Approach Using LLMs and AMR Parsing

Connar Hite, Sean Saud, Raef Taha, Nayim Rahman, Tanvir Atahary, Scott Douglass, Tarek Taha

Main category: cs.AI

TL;DR: A system that translates English into Answer Set Programming (ASP) code for logic puzzles using LLMs and AMR graphs, minimizing LLM usage for better explainability.

DetailsMotivation: ASP is powerful but requires learning syntax, and there's increasing need for non-programmers to interact with code. Current methods rely too heavily on LLMs.

Method: Uses LLM only for simple tasks (simplifying language, keyword identification, fact generation), then parses AMR graphs from simplified text to systematically generate ASP constraints and rules.

Result: Successfully creates complete ASP programs that solve combinatorial logic problems from natural language descriptions.

Conclusion: This is a significant first step toward creating lightweight, explainable systems that convert natural language to solve complex logic problems.

Abstract: Answer Set Programming (ASP) is a declarative programming paradigm based on logic programming and non-monotonic reasoning. It is a tremendously powerful tool for describing and solving combinatorial problems. Like any other language, ASP requires users to learn how it works and the syntax involved, yet those unfamiliar with programming languages are increasingly required to interact with code. This paper proposes a novel method of translating unconstrained English into ASP programs for logic puzzles using an LLM and Abstract Meaning Representation (AMR) graphs. Everything from ASP rules, facts, and constraints is generated to fully represent and solve the desired problem. Example logic puzzles are used to demonstrate the capabilities of the system. While most current methods rely entirely on an LLM, our system limits the role of the LLM to straightforward tasks: simplifying natural language sentences, identifying keywords, and generating simple facts. The AMR graphs are then parsed from the simplified language and used to generate ASP constraints systematically. The system successfully creates an entire ASP program that solves a combinatorial logic problem. This approach is a significant first step in creating a lighter-weight, explainable system that converts natural language to solve complex logic problems.

[234] Vector Symbolic Algebras for the Abstraction and Reasoning Corpus

Isaac Joffe, Chris Eliasmith

Main category: cs.AI

TL;DR: A neurosymbolic ARC-AGI solver using Vector Symbolic Algebras that combines System 1 intuition with System 2 reasoning through object-centric program synthesis, achieving modest ARC-AGI performance but strong results on simpler benchmarks while being computationally efficient.

DetailsMotivation: ARC-AGI is a challenging few-shot fluid intelligence benchmark that humans solve effortlessly but remains extremely difficult for AI systems. The authors aim to create a cognitively plausible solver inspired by human intelligence models from neuroscience and psychology.

Method: Integrates System 1 intuitions with System 2 reasoning using neurosymbolic methods based on Vector Symbolic Algebras (VSAs). Uses object-centric program synthesis where VSAs represent abstract objects, guide solution search, and enable sample-efficient neural learning.

Result: Scores 10.8% on ARC-AGI-1-Train and 3.0% on ARC-AGI-1-Eval. Performs well on simpler benchmarks: 94.5% on Sort-of-ARC and 83.1% on 1D-ARC, outperforming GPT-4 at a tiny fraction of computational cost.

Conclusion: This represents the first application of VSAs to ARC-AGI and the most cognitively plausible ARC-AGI solver developed to date, offering an efficient and interpretable approach to solving complex reasoning tasks.

Abstract: The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a generative, few-shot fluid intelligence benchmark. Although humans effortlessly solve ARC-AGI, it remains extremely difficult for even the most advanced artificial intelligence systems. Inspired by methods for modelling human intelligence spanning neuroscience to psychology, we propose a cognitively plausible ARC-AGI solver. Our solver integrates System 1 intuitions with System 2 reasoning in an efficient and interpretable process using neurosymbolic methods based on Vector Symbolic Algebras (VSAs). Our solver works by object-centric program synthesis, leveraging VSAs to represent abstract objects, guide solution search, and enable sample-efficient neural learning. Preliminary results indicate success, with our solver scoring 10.8% on ARC-AGI-1-Train and 3.0% on ARC-AGI-1-Eval. Additionally, our solver performs well on simpler benchmarks, scoring 94.5% on Sort-of-ARC and 83.1% on 1D-ARC – the latter outperforming GPT-4 at a tiny fraction of the computational cost. Importantly, our approach is unique; we believe we are the first to apply VSAs to ARC-AGI and have developed the most cognitively plausible ARC-AGI solver yet. Our code is available at: https://github.com/ijoffe/ARC-VSA-2025.

[235] Interpretable by Design: Query-Specific Neural Modules for Explainable Reinforcement Learning

Mehrdad Zakershahrak

Main category: cs.AI

TL;DR: QDIN introduces a unified RL architecture that treats different types of queries as first-class citizens, showing that inference accuracy can be near-perfect even when control performance is suboptimal.

DetailsMotivation: To challenge the traditional RL paradigm focused solely on reward maximization and instead architect RL systems as inference engines that can answer diverse queries about their environment.

Method: Query Conditioned Deterministic Inference Networks (QDIN) - a unified architecture with specialized neural modules optimized for different inference patterns (policy, reachability, paths, comparisons).

Result: QDIN achieves near-perfect inference accuracy (99% reachability IoU) even with suboptimal control performance (31% return), demonstrating decoupling between world knowledge representations and control requirements. Query-specialized architectures outperform unified models and post-hoc extraction methods.

Conclusion: This work establishes a research agenda for RL systems designed as queryable knowledge bases from inception, with implications for interpretability, verification, and human-AI collaboration.

Abstract: Reinforcement learning has traditionally focused on a singular objective: learning policies that select actions to maximize reward. We challenge this paradigm by asking: what if we explicitly architected RL systems as inference engines that can answer diverse queries about their environment? In deterministic settings, trained agents implicitly encode rich knowledge about reachability, distances, values, and dynamics - yet current architectures are not designed to expose this information efficiently. We introduce Query Conditioned Deterministic Inference Networks (QDIN), a unified architecture that treats different types of queries (policy, reachability, paths, comparisons) as first-class citizens, with specialized neural modules optimized for each inference pattern. Our key empirical finding reveals a fundamental decoupling: inference accuracy can reach near-perfect levels (99% reachability IoU) even when control performance remains suboptimal (31% return), suggesting that the representations needed for accurate world knowledge differ from those required for optimal control. Experiments demonstrate that query-specialized architectures outperform both unified models and post-hoc extraction methods, while maintaining competitive control performance. This work establishes a research agenda for RL systems designed from inception as queryable knowledge bases, with implications for interpretability, verification, and human-AI collaboration.
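In small deterministic worlds, the reachability queries QDIN answers neurally have an exact ground truth: breadth-first search over the transition graph. The sketch below computes that ground-truth reachable set for a hypothetical 4x4 grid; it illustrates the kind of query being posed (and the target a learned module would approximate), not the paper's neural architecture.

```python
# Ground-truth answer to a reachability query in a tiny deterministic world:
# BFS over the state graph. A QDIN-style reachability module would be
# trained to approximate this set directly from the environment.
from collections import deque

WALLS = {(1, 1), (1, 2)}  # hypothetical blocked cells
SIZE = 4                  # 4x4 grid

def neighbors(s):
    x, y = s
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < SIZE and 0 <= ny < SIZE and (nx, ny) not in WALLS:
            yield (nx, ny)

def reachable(start):
    """Answer a reachability query: all states reachable from `start`."""
    seen, frontier = {start}, deque([start])
    while frontier:
        for n in neighbors(frontier.popleft()):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return seen

print(len(reachable((0, 0))))  # 14 of 16 cells (two are walls)
```

Path and distance queries have analogous exact counterparts (shortest-path search), which is what makes the 99% reachability IoU figure measurable in the first place.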

[236] Neural Value Iteration

Yang You, Ufuk Çakır, Alex Schutz, Robert Skilton, Nick Hawes

Main category: cs.AI

TL;DR: The paper proposes Neural Value Iteration, a novel POMDP planning algorithm that represents value functions using neural networks instead of α-vectors, enabling scalability to large-scale problems.

DetailsMotivation: Traditional POMDP solvers using α-vectors become intractable for large-scale problems due to the high computational cost of Bellman backups on |S|-dimensional vectors.

Method: Leverages the PWLC property to represent POMDP value functions as finite sets of neural networks, combining neural network generalization with classical value iteration framework.

Result: Achieves near-optimal solutions in extremely large POMDPs that are intractable for existing offline solvers.

Conclusion: Neural Value Iteration provides a scalable alternative to traditional α-vector methods for large-scale POMDP planning.

Abstract: The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as $α$-vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on $α$-vectors at reachable belief points until convergence. However, since each $α$-vector is $|S|$-dimensional, these methods quickly become intractable for large-scale problems due to the prohibitive computational cost of Bellman backups. In this work, we demonstrate that the PWLC property allows a POMDP’s value function to be alternatively represented as a finite set of neural networks. This insight enables a novel POMDP planning algorithm called Neural Value Iteration, which combines the generalization capability of neural networks with the classical value iteration framework. Our approach achieves near-optimal solutions even in extremely large POMDPs that are intractable for existing offline solvers.
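The two representations contrasted in the abstract can be written side by side (standard POMDP notation: $\Gamma$ is the finite set of $α$-vectors, $b$ a belief over states $S$; the precise form of the networks $f_{\theta_i}$ is the paper's contribution and is not specified here):

```latex
V(b) = \max_{\alpha \in \Gamma} \sum_{s \in S} \alpha(s)\, b(s)
\qquad \longrightarrow \qquad
V(b) \approx \max_{i} f_{\theta_i}(b)
```

The left form costs $O(|\Gamma|\,|S|)$ per evaluation, which is what makes $|S|$-dimensional Bellman backups prohibitive; replacing each hyperplane with a network removes the explicit dependence on $|S|$ in the representation.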

[237] UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models

Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Kun Kuang, Zhongxiang Dai

Main category: cs.AI

TL;DR: UCO is a reinforcement learning method that uses progress and scaffold rewards to enable LLMs to adapt teaching strategies based on students’ cognitive states and Zone of Proximal Development.

DetailsMotivation: Current LLM teaching methods lack dynamic adaptation capabilities and cannot distinguish genuine student understanding from answer echoing, nor perceive evolving cognitive states during interaction.

Method: UCO uses multi-turn interactive reinforcement learning with two reward functions: Progress Reward to capture cognitive advancement, and Scaffold Reward to identify Zone of Proximal Development and maintain teaching within it.

Result: UCO outperforms 11 baseline models on BigMath and MathTutorBench benchmarks, achieving performance comparable to advanced closed-source models.

Conclusion: The proposed UCO method effectively addresses limitations in current LLM teaching approaches by enabling dynamic adaptation to students’ cognitive states and ZPD.

Abstract: Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods only learn surface teaching patterns without dynamic adaptation capabilities. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely based on whether students produce correct outputs, unable to distinguish whether students genuinely understand or echo teacher-provided answers during interaction. Second, they cannot perceive students’ evolving cognitive states in real time through interactive dialogue, thus failing to adapt teaching strategies to match students’ cognitive levels dynamically. We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges. UCO uses a multi-turn interactive reinforcement learning paradigm where the innovation lies in two synergistic reward functions: the Progress Reward captures students’ cognitive advancement, evaluating whether students truly transition from confusion to comprehension, while the Scaffold Reward dynamically identifies each student’s Zone of Proximal Development (ZPD), encouraging teachers to maintain productive teaching within this zone. We evaluate UCO by comparing it against 11 baseline models on BigMath and MathTutorBench benchmarks. Experimental results demonstrate that our UCO model outperforms all models of equivalent scale and achieves performance comparable to advanced closed-source models. The code and data are available at https://github.com/Mind-Lab-ECNU/UCO.

[238] Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi

Main category: cs.AI

TL;DR: Lumine is the first open recipe for developing generalist agents that can complete hours-long complex missions in real time within 3D open-world environments, demonstrating human-level performance and strong cross-game generalization.

DetailsMotivation: To create generalist agents capable of handling complex, hours-long missions in challenging 3D open-world environments with human-like efficiency and natural language instruction following.

Method: Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner using a vision-language model. It processes raw pixels at 5 Hz to produce 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary.

Result: Lumine successfully completes the entire five-hour Mondstadt main storyline in Genshin Impact on par with human-level efficiency, follows natural language instructions for various tasks, and demonstrates strong zero-shot cross-game generalization by accomplishing missions in Wuthering Waves and Honkai: Star Rail without fine-tuning.

Conclusion: Lumine represents a concrete step toward generalist agents in open-ended environments, showing effectiveness across distinct worlds and interaction dynamics with promising performance in both 3D open-world exploration and 2D GUI manipulation tasks.

Abstract: We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine’s effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.

[239] The Double Contingency Problem: AI Recursion and the Limits of Interspecies Understanding

Graham L. Bishop

Main category: cs.AI

TL;DR: Bioacoustic AI systems overlook how their own recursive cognition interacts with animal communication, creating a double contingency problem that requires rethinking AI as diplomatic encounters between different recursive systems.

DetailsMotivation: To address how AI's recursive processing may distort animal communication by examining the interaction between AI's cognitive structures and species' communicative processes through philosophical frameworks.

Method: Drawing on Yuk Hui’s philosophy of recursivity and contingency to analyze the double contingency problem where both species communication and AI processing operate through contingent conditions.

Result: Identified that current bioacoustic AI approaches systematically obscure animal communication structures due to mismatched recursive cognitive processes between AI and species.

Conclusion: Bioacoustic AI should be reconceptualized from universal pattern recognition to diplomatic encounters between different recursive cognitive systems, requiring changes in model design and evaluation frameworks.

Abstract: Current bioacoustic AI systems achieve impressive cross-species performance by processing animal communication through transformer architectures, foundation model paradigms, and other computational approaches. However, these approaches overlook a fundamental question: what happens when one form of recursive cognition–AI systems with their attention mechanisms, iterative processing, and feedback loops–encounters the recursive communicative processes of other species? Drawing on philosopher Yuk Hui’s work on recursivity and contingency, I argue that AI systems are not neutral pattern detectors but recursive cognitive agents whose own information processing may systematically obscure or distort other species’ communicative structures. This creates a double contingency problem: each species’ communication emerges through contingent ecological and evolutionary conditions, while AI systems process these signals through their own contingent architectural and training conditions. I propose that addressing this challenge requires reconceptualizing bioacoustic AI from universal pattern recognition toward diplomatic encounter between different forms of recursive cognition, with implications for model design, evaluation frameworks, and research methodologies.

[240] A Research on Business Process Optimisation Model Integrating AI and Big Data Analytics

Di Liao, Ruijia Liang, Ziyi Ye

Main category: cs.AI

TL;DR: AI-powered business process optimization model reduces processing time by 42%, improves resource utilization by 28%, and cuts operating costs by 35% with 99.9% availability.

DetailsMotivation: Digital transformation requires optimized business processes to enhance enterprise competitiveness through intelligent lifecycle management.

Method: Three-layer architecture combining data processing, AI algorithms, and business logic using distributed computing and deep learning for real-time monitoring and optimization.

Result: 42% reduction in process time, 28% improvement in resource utilization, 35% cost reduction, and 99.9% availability under high concurrent loads.

Conclusion: The model provides significant value for enterprise digital transformation and offers new approaches to improve operational efficiency.

Abstract: With the deepening of digital transformation, business process optimisation has become the key to improve the competitiveness of enterprises. This study constructs a business process optimisation model integrating artificial intelligence and big data to achieve intelligent management of the whole life cycle of processes. The model adopts a three-layer architecture incorporating data processing, AI algorithms, and business logic to enable real-time process monitoring and optimization. Through distributed computing and deep learning techniques, the system can handle complex business scenarios while maintaining high performance and reliability. Experimental validation across multiple enterprise scenarios shows that the model shortens process processing time by 42%, improves resource utilisation by 28%, and reduces operating costs by 35%. The system maintained 99.9% availability under high concurrent loads. The research results have important theoretical and practical value for promoting the digital transformation of enterprises, and provide new ideas for improving the operational efficiency of enterprises.

[241] AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting

Xiaohan Zhang, Tian Gao, Mingyue Cheng, Bokai Pan, Ze Guo, Yaguo Liu, Xiaoyu Tao

Main category: cs.AI

TL;DR: AlphaCast is a human-LLM co-reasoning framework that transforms time series forecasting into an interactive process through automated prediction preparation and generative reasoning with continuous self-correction.

DetailsMotivation: Current time series forecasting approaches lack the interaction, reasoning, and adaptability of human experts, limiting their usefulness in complex real-world environments.

Method: Two-stage framework: (1) automated prediction preparation with multi-source cognitive foundation (feature set, domain knowledge base, contextual repository, case base), and (2) generative reasoning and reflective optimization with meta-reasoning loop for self-correction.

Result: Extensive experiments show AlphaCast consistently outperforms state-of-the-art baselines in predictive accuracy on both short- and long-term datasets.

Conclusion: AlphaCast successfully bridges the gap between automated forecasting and human expert reasoning by enabling collaborative human-LLM intelligence for more accurate and adaptive time series forecasting.

Abstract: Time series forecasting plays a critical role in high-stakes domains such as energy, healthcare, and climate. Although recent advances have improved accuracy, most approaches still treat forecasting as a static one-time mapping task, lacking the interaction, reasoning, and adaptability of human experts. This gap limits their usefulness in complex real-world environments. To address this, we propose AlphaCast, a human wisdom-large language model (LLM) intelligence co-reasoning framework that redefines forecasting as an interactive process. The key idea is to enable step-by-step collaboration between human wisdom and LLM intelligence to jointly prepare, generate, and verify forecasts. The framework consists of two stages: (1) automated prediction preparation, where AlphaCast builds a multi-source cognitive foundation comprising a feature set that captures key statistics and time patterns, a domain knowledge base distilled from corpora and historical series, a contextual repository that stores rich information for each time window, and a case base that retrieves optimal strategies via pattern clustering and matching; and (2) generative reasoning and reflective optimization, where AlphaCast integrates statistical temporal features, prior knowledge, contextual information, and forecasting strategies, triggering a meta-reasoning loop for continuous self-correction and strategy refinement. Extensive experiments on short- and long-term datasets show that AlphaCast consistently outperforms state-of-the-art baselines in predictive accuracy. Code is available at this repository: https://github.com/SkyeGT/AlphaCast_Official .

[242] AI Founding Fathers: A Case Study of GIS Search in Multi-Agent Pipelines

Alvin Chauhan

Main category: cs.AI

TL;DR: Recursive refinement (RR) as a multi-agent pipeline enhances LLM reasoning by implementing gradual, incremental, and sequential (GIS) search through iterative self-criticism and adversarial stress-testing.

DetailsMotivation: To extract stronger reasoning capabilities from LLMs by structuring multi-agent pipelines that ensure controlled, incremental search through the reasoning space.

Method: Multi-agent pipeline with recursive refinement layer using historical personas (Hamilton, Jefferson, Madison) via RAG-powered corpora, comparing simple linear pipeline vs complex structured pipeline with RR.

Result: Complex model with recursive refinement consistently outperformed simple model across all test cases (88.3 vs 71.7 average score), producing arguments with superior analytical depth, structural nuance, and strategic framing.

Conclusion: Recursive refinement is a robust architectural feature for enhancing LLM reasoning through GIS search, demonstrating that high-quality reasoning emerges from controlled, incremental search processes.

Abstract: Although Large Language Models (LLMs) show exceptional fluency, efforts persist to extract stronger reasoning capabilities from them. Drawing on search-based interpretations of LLM computation, this paper advances a systematic framework for understanding LLM reasoning and optimization. Namely, that enhancing reasoning is best achieved by structuring a multi-agent pipeline to ensure a traversal of the search space in a gradual, incremental, and sequential (GIS) manner. Stated succinctly, high-quality reasoning is a controlled, incremental search. To test this framework, we investigate the efficacy of recursive refinement (RR), an iterative process of self-criticism, adversarial stress-testing, and integration of critical feedback, as a practical method for implementing GIS search. We designed an experiment comparing a simple, linear pipeline against a complex, explicitly structured pipeline leveraging a recursive refinement layer. The multi-agent models were constructed to reflect the historical personas of three US Founding Fathers (Hamilton, Jefferson, and Madison) using RAG-powered corpora and were prompted to generate responses to three contemporary political issues. Model performance was evaluated using a two-tiered approach: a quantitative score from an LLM arbiter agent and qualitative human judgment. Our results revealed that the complex model consistently outperformed the simple model across all nine test cases with an average arbiter score of 88.3 versus 71.7. The complex model’s arguments were superior in analytical depth, structural nuance, and strategic framing. We conclude that recursive refinement is a robust architectural feature for enhancing LLM reasoning via GIS search.
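The critique-and-revise loop at the heart of recursive refinement can be sketched with plain callables standing in for the persona and critic agents; the early-stopping rule and toy feedback below are our assumptions, not the paper's exact design.

```python
def recursive_refinement(draft, critique_fn, revise_fn, rounds=3):
    """Sketch of an RR layer: alternate critique and revision until the
    critic has no remaining objections or the round budget runs out.
    critique_fn / revise_fn stand in for LLM persona and critic agents."""
    for _ in range(rounds):
        feedback = critique_fn(draft)
        if not feedback:              # no objections left: converged early
            break
        draft = revise_fn(draft, feedback)
    return draft

# toy demonstration: the critic flags issues missing from the draft,
# and each revision addresses the first outstanding one
issues = ["add evidence", "address counterargument"]
critique = lambda d: [i for i in issues if i not in d]
revise = lambda d, fb: d + " [" + fb[0] + "]"
print(recursive_refinement("thesis", critique, revise))
```

Each pass makes a small, targeted change, which is exactly the gradual, incremental, sequential traversal the GIS framing calls for.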

[243] Heterogeneous Graph Neural Networks for Assumption-Based Argumentation

Preesha Gehlot, Anna Rapberger, Fabrizio Russo, Francesca Toni

Main category: cs.AI

TL;DR: First GNN approach for approximating credulous acceptance in Assumption-Based Argumentation (ABA), achieving up to 0.71 F1 score and enabling polynomial-time stable extension reconstruction.

DetailsMotivation: Exact computation of extensions in ABA under stable semantics is intractable for large frameworks, requiring scalable approximate reasoning methods.

Method: Model ABA frameworks as dependency graphs with heterogeneous edges, then propose ABAGCN and ABAGAT GNN architectures using residual heterogeneous convolution/attention layers trained on ICCMA 2023 benchmark.

Result: Both GNN models outperform adapted baseline, achieving 0.71 node-level F1 score on ICCMA instances, with extension reconstruction achieving 0.85 F1 on small ABAFs and 0.58 on large frameworks.

Conclusion: This work opens new avenues for scalable approximate reasoning in structured argumentation using GNNs.

Abstract: Assumption-Based Argumentation (ABA) is a powerful structured argumentation formalism, but exact computation of extensions under stable semantics is intractable for large frameworks. We present the first Graph Neural Network (GNN) approach to approximate credulous acceptance in ABA. To leverage GNNs, we model ABA frameworks via a dependency graph representation encoding assumptions, claims and rules as nodes, with heterogeneous edge labels distinguishing support, derive and attack relations. We propose two GNN architectures - ABAGCN and ABAGAT - that stack residual heterogeneous convolution or attention layers, respectively, to learn node embeddings. Our models are trained on the ICCMA 2023 benchmark, augmented with synthetic ABAFs, with hyperparameters optimised via Bayesian search. Empirically, both ABAGCN and ABAGAT outperform a state-of-the-art GNN baseline that we adapt from the abstract argumentation literature, achieving a node-level F1 score of up to 0.71 on the ICCMA instances. Finally, we develop a sound polynomial time extension-reconstruction algorithm driven by our predictor: it reconstructs stable extensions with F1 above 0.85 on small ABAFs and maintains an F1 of about 0.58 on large frameworks. Our work opens new avenues for scalable approximate reasoning in structured argumentation.
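The dependency-graph encoding with heterogeneous edge labels can be illustrated on a toy ABA framework; the particular assumptions, rule, and contrary relation below are invented for illustration.

```python
# toy ABA framework (invented): assumptions a, b; rule r1 derives claim p
# from a and b; p is the contrary of a, yielding an attack edge
nodes = {"a": "assumption", "b": "assumption", "r1": "rule", "p": "claim"}
edges = [
    ("a", "r1", "support"),   # assumption appears in the rule body
    ("b", "r1", "support"),
    ("r1", "p", "derive"),    # rule derives its head claim
    ("p", "a", "attack"),     # derived claim attacks the assumption it contradicts
]

# group edges by relation type, as a heterogeneous GNN layer would
# when applying a separate message-passing weight per edge label
by_type = {}
for src, dst, etype in edges:
    by_type.setdefault(etype, []).append((src, dst))
print(by_type)
```

ABAGCN and ABAGAT would then run typed convolution or attention over each edge-type bucket and combine the results into node embeddings.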

[244] Solving a Million-Step LLM Task with Zero Errors

Elliot Meyerson, Giuseppe Paolo, Roberto Dailey, Hormoz Shahrzad, Olivier Francon, Conor F. Hayes, Xin Qiu, Babak Hodjat, Risto Miikkulainen

Main category: cs.AI

TL;DR: MAKER is the first system to solve tasks with over one million LLM steps with zero errors through extreme decomposition into subtasks handled by microagents and multi-agent voting for error correction.

DetailsMotivation: LLMs struggle with extended processes due to persistent error rates that prevent scaling beyond a few hundred steps, despite their reasoning and tool use capabilities.

Method: Extreme decomposition of tasks into subtasks handled by focused microagents, combined with multi-agent voting scheme for error correction at each step.

Result: Successfully solved a task with over one million LLM steps with zero errors, demonstrating scalability far beyond previous limitations.

Conclusion: Massively decomposed agentic processes (MDAPs) provide an efficient alternative to continual LLM improvement for solving problems at organizational and societal scales.

Abstract: LLMs have achieved remarkable breakthroughs in reasoning, insights, and tool use, but chaining these abilities into extended processes at the scale of those routinely executed by humans, organizations, and societies has remained out of reach. The models have a persistent error rate that prevents scale-up: for instance, recent experiments in the Towers of Hanoi benchmark domain showed that the process inevitably becomes derailed after at most a few hundred steps. Thus, although LLM research is often still benchmarked on tasks with relatively few dependent logical steps, there is increasing attention on the ability (or inability) of LLMs to perform long range tasks. This paper describes MAKER, the first system that successfully solves a task with over one million LLM steps with zero errors, and, in principle, scales far beyond this level. The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi-agent voting scheme. This combination of extreme decomposition and error correction makes scaling possible. Thus, the results suggest that instead of relying on continual improvement of current LLMs, massively decomposed agentic processes (MDAPs) may provide a way to efficiently solve problems at the level of organizations and societies.
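Why per-step voting makes million-step runs feasible can be seen with a back-of-the-envelope calculation; the 1% per-step error rate is a hypothetical figure for illustration, not a number from the paper.

```python
from math import comb

def majority_error(p, n):
    """Probability that a majority of n independent votes is wrong,
    when each voter errs independently with probability p."""
    k_needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_needed, n + 1))

p_step = 0.01          # hypothetical per-step error rate of one microagent
steps = 1_000_000
for n in (1, 5, 9):
    p_err = majority_error(p_step, n)
    p_all_ok = (1 - p_err) ** steps
    print(f"votes={n}: per-step error={p_err:.2e}, "
          f"P(zero errors over 1M steps)={p_all_ok:.3e}")
```

A single agent at 1% error has essentially zero chance of surviving a million steps, while a modest vote drives the per-step error low enough that a flawless run becomes likely, which is the leverage extreme decomposition buys.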

[245] Argus: Resilience-Oriented Safety Assurance Framework for End-to-End ADSs

Dingji Wang, You Lu, Bihuan Chen, Shuo Hao, Haowen Jiang, Yifan Tian, Xin Peng

Main category: cs.AI

TL;DR: Argus is a runtime resilience framework for autonomous driving systems that monitors trajectories for hazards and takes control when unsafe conditions are detected, preventing safety violations and improving driving performance.

DetailsMotivation: End-to-end autonomous driving systems face diverse driving hazards that can compromise safety and degrade performance, creating a need for resilience capabilities to monitor hazards and respond to safety violations.

Method: Argus continuously monitors ADS-generated trajectories for potential hazards and seamlessly takes control through a hazard mitigator when the ego vehicle is deemed unsafe.

Result: Integration with three state-of-the-art ADSs (TCP, UniAD, VAD) showed Argus improves driving scores by up to 150.30% on average, prevents up to 64.38% of violations, with minimal time overhead.

Conclusion: Argus effectively enhances ADS resilience by mitigating driving hazards and preventing safety violations while maintaining system efficiency.

Abstract: End-to-end autonomous driving systems (ADSs), with their strong capabilities in environmental perception and generalizable driving decisions, are attracting growing attention from both academia and industry. However, once deployed on public roads, ADSs are inevitably exposed to diverse driving hazards that may compromise safety and degrade system performance. This raises a strong demand for resilience of ADSs, particularly the capability to continuously monitor driving hazards and adaptively respond to potential safety violations, which is crucial for maintaining robust driving behaviors in complex driving scenarios. To bridge this gap, we propose a runtime resilience-oriented framework, Argus, to mitigate the driving hazards, thus preventing potential safety violations and improving the driving performance of an ADS. Argus continuously monitors the trajectories generated by the ADS for potential hazards and, whenever the ego vehicle is deemed unsafe, seamlessly takes control through a hazard mitigator. We integrate Argus with three state-of-the-art end-to-end ADSs, i.e., TCP, UniAD and VAD. Our evaluation has demonstrated that Argus effectively and efficiently enhances the resilience of ADSs, improving the driving score of the ADS by up to 150.30% on average, and preventing up to 64.38% of the violations, with little additional time overhead.
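The monitor-then-take-over pattern can be sketched with a simple time-to-collision check; the TTC heuristic and the 3-second threshold are illustrative assumptions on our part, not Argus's actual hazard model.

```python
def time_to_collision(gap_m, closing_speed_mps):
    # infinite TTC when the gap is not closing
    return float("inf") if closing_speed_mps <= 0 else gap_m / closing_speed_mps

def choose_action(ads_action, gap_m, closing_speed_mps, ttc_threshold_s=3.0):
    """Runtime guard in the spirit of Argus (our sketch): pass the ADS
    action through while the trajectory looks safe, otherwise hand
    control to a hazard mitigator."""
    if time_to_collision(gap_m, closing_speed_mps) < ttc_threshold_s:
        return "brake"        # hazard mitigator takes control
    return ads_action         # trajectory looks safe; ADS stays in control

print(choose_action("cruise", gap_m=20.0, closing_speed_mps=10.0))  # brake (TTC = 2 s)
print(choose_action("cruise", gap_m=60.0, closing_speed_mps=10.0))  # cruise (TTC = 6 s)
```

The real framework evaluates whole planned trajectories rather than a single scalar gap, but the split between a monitoring predicate and a fallback controller is the same.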

[246] Advancing Autonomous Emergency Response Systems: A Generative AI Perspective

Yousef Emami, Radha Reddy, Azadeh Pourkabirian, Miguel Gutierrez Gaitan

Main category: cs.AI

TL;DR: This paper reviews next-generation optimization strategies for Autonomous Vehicles in emergency services, focusing on Diffusion Model-augmented RL and LLM-assisted In-Context Learning to overcome limitations of conventional Reinforcement Learning.

DetailsMotivation: To address the poor sample efficiency and lack of adaptability of conventional RL in dynamic emergency scenarios for Autonomous Vehicles, enabling faster, safer, and more efficient emergency responses.

Method: Reviews and analyzes the shift from conventional RL to Diffusion Model-augmented RL (enhancing robustness through synthetic data) and LLM-assisted In-Context Learning (enabling rapid adaptation without retraining).

Result: Provides a critical framework for understanding next-generation autonomous emergency response systems from a Generative AI perspective, comparing the trade-offs between computational cost and adaptability.

Conclusion: The paper establishes that combining DM-augmented RL and LLM-assisted ICL represents the future direction for optimizing AV intelligence in emergency services, balancing robustness with interpretability and efficiency.

Abstract: Autonomous Vehicles (AVs) are poised to revolutionize emergency services by enabling faster, safer, and more efficient responses. This transformation is driven by advances in Artificial Intelligence (AI), particularly Reinforcement Learning (RL), which allows AVs to navigate complex environments and make critical decisions in real time. However, conventional RL paradigms often suffer from poor sample efficiency and lack adaptability in dynamic emergency scenarios. This paper reviews next-generation AV optimization strategies to address these limitations. We analyze the shift from conventional RL to Diffusion Model (DM)-augmented RL, which enhances policy robustness through synthetic data generation, albeit with increased computational cost. Additionally, we explore the emerging paradigm of Large Language Model (LLM)-assisted In-Context Learning (ICL), which offers a lightweight and interpretable alternative by enabling rapid, on-the-fly adaptation without retraining. By reviewing the state of the art in AV intelligence, DM-augmented RL, and LLM-assisted ICL, this paper provides a critical framework for understanding the next generation of autonomous emergency response systems from a Generative AI perspective.

[247] OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

Zezhen Ding, Zhen Tan, Jiheng Zhang, Tianlong Chen

Main category: cs.AI

TL;DR: OR-R1 is a data-efficient framework for automated optimization modeling and solving that achieves state-of-the-art performance using only 1/10 the synthetic data of prior methods, with a 67.7% average solving accuracy.

DetailsMotivation: Current LLM-based methods for translating natural language to optimization models require vast amounts of annotated data, resulting in high costs and scalability barriers. There's a need for more data-efficient approaches.

Method: Two-stage framework: 1) Supervised fine-tuning (SFT) to learn reasoning patterns from limited labeled data, 2) Test-Time Group Relative Policy Optimization (TGRPO) to improve capability and consistency using both labeled and unlabeled data.

Result: Achieves 67.7% average solving accuracy, exceeding ORLM by up to 4.2% with only 1/10 the synthetic data. Outperforms ORLM by over 2.4% with just 100 synthetic samples. TGRPO provides additional 3.1%-6.4% accuracy improvement.

Conclusion: OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization, significantly lowering expertise and data barriers for industrial applications.

Abstract: Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7%, using only 1/10 the synthetic data required by prior methods such as ORLM, exceeding ORLM’s solving accuracy by up to 4.2%. Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%-6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7%. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
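TGRPO builds on group-relative policy optimization. The core group-relative advantage, as in standard GRPO (the test-time specifics of TGRPO are not sketched here), normalizes each sampled solution's reward against its own group, avoiding a separate value network:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Group-relative advantage as in GRPO-style training: each sampled
    solution is scored against its own group's mean and spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# e.g. four sampled formulations for one OR problem; reward 1 = the
# generated solver code reproduced the reference optimum, 0 = it did not
print(group_relative_advantages([1, 0, 0, 1]))  # [1.0, -1.0, -1.0, 1.0]
```

Because OR solutions are verifiable against a solver, such binary rewards are available even for unlabeled problems, which is what lets the second stage exploit abundant unlabeled data.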

[248] History-Aware Reasoning for GUI Agents

Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li

Main category: cs.AI

TL;DR: Proposes History-Aware Reasoning (HAR) framework to enhance GUI agents’ episodic reasoning by addressing weak short-term memory in existing methods, enabling history-aware decision-making in long-horizon GUI tasks.

DetailsMotivation: Current GUI agents have weak short-term memory and treat interactions as discrete screen understanding, lacking awareness of historical interactions within episodes, which limits their performance in GUI automation.

Method: HAR framework with reflective learning scenario construction, tailored correction guidelines synthesis, and hybrid RL reward function design to enhance short-term memory and episodic reasoning.

Result: Developed HAR-GUI-3B model that transforms reasoning from history-agnostic to history-aware, demonstrating effectiveness across GUI benchmarks with improved short-term memory and screen detail perception.

Conclusion: The HAR framework successfully equips GUI agents with stable short-term memory and reliable episodic reasoning capabilities, addressing the history-agnostic reasoning challenge in GUI automation.

Abstract: Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users’ concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.
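The short-term memory the paper argues for can be sketched as a rolling window of (screen, action) pairs fed into the next decision prompt; the formatting and window size below are illustrative choices, not HAR's actual design.

```python
from collections import deque

class ShortTermMemory:
    """Sketch of history-aware context for a GUI agent: keep the last k
    (screen, action) pairs and prepend them to the next decision prompt."""
    def __init__(self, k=5):
        self.steps = deque(maxlen=k)   # old steps fall off automatically

    def record(self, screen_summary, action):
        self.steps.append((screen_summary, action))

    def context(self):
        return "\n".join(f"step {i}: saw {s!r}, did {a!r}"
                         for i, (s, a) in enumerate(self.steps, 1))

mem = ShortTermMemory(k=2)
mem.record("search results for 'shoes'", "tap first result")
mem.record("product page", "tap 'add to cart'")
print(mem.context())
```

A history-agnostic agent sees only the current screen; conditioning on this rolling context is the minimal form of the history-aware reasoning the framework trains for.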

[249] ProBench: Benchmarking GUI Agents with Accurate Process Information

Leyang Yang, Ziwei Wang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li

Main category: cs.AI

TL;DR: ProBench is a comprehensive mobile benchmark with 200+ GUI tasks that evaluates GUI agents not just on final state but also on intermediate process steps, revealing significant limitations in current models.

DetailsMotivation: Current GUI agent benchmarks only evaluate final screen states, missing critical information from intermediate steps in multi-step GUI operations. There's a need for more comprehensive evaluation that captures the entire process.

Method: Created ProBench with 200+ mobile GUI tasks across common scenarios. Extended evaluation to include Process-related Tasks alongside traditional State-related Tasks. Introduced Process Provider to automatically supply accurate process information for precise assessment.

Result: Evaluation of advanced GUI agents revealed significant limitations in real-world GUI scenarios. Shortcomings were prevalent across diverse models including large generalist models and smaller GUI-specific models. Error analysis exposed several universal problems.

Conclusion: Current GUI agents have substantial limitations for real-world applications. The process-oriented evaluation approach provides concrete directions for future improvements in GUI agent development.

Abstract: With the deep integration of artificial intelligence and interactive technology, the Graphical User Interface (GUI) Agent, as the carrier connecting goal-oriented natural language and real-world devices, has received widespread attention from the community. Contemporary benchmarks aim to evaluate the comprehensive capabilities of GUI agents in GUI operation tasks, generally determining task completion solely by inspecting the final screen state. However, GUI operation tasks consist of multiple chained steps while not all critical information is presented in the final few pages. Although some research has begun to incorporate intermediate steps into evaluation, accurately and automatically capturing this process information still remains an open challenge. To address this weakness, we introduce ProBench, a comprehensive mobile benchmark with over 200 challenging GUI tasks covering widely-used scenarios. Retaining the traditional State-related Task evaluation, we extend our dataset to include Process-related Task and design a specialized evaluation method. A newly introduced Process Provider automatically supplies accurate process information, enabling precise assessment of an agent’s performance. Our evaluation of advanced GUI agents reveals significant limitations for real-world GUI scenarios. These shortcomings are prevalent across diverse models, including both large-scale generalist models and smaller, GUI-specific models. A detailed error analysis further exposes several universal problems, outlining concrete directions for future improvements.
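The state- vs process-level distinction can be sketched as an ordered-subsequence check over the agent's logged actions; the step names below are invented, and the real Process Provider captures this information from the device rather than from a hand-written log.

```python
def evaluate_episode(observed_actions, required_steps, final_state_ok):
    """Sketch of state- vs process-level scoring: the process check passes
    only if every required intermediate step occurred, in order."""
    it = iter(observed_actions)
    # `step in it` consumes the iterator, so this is an ordered-subsequence test
    process_ok = all(step in it for step in required_steps)
    return {"state": final_state_ok, "process": process_ok}

log = ["open app", "search 'hotel'", "apply filter", "book", "confirm"]
print(evaluate_episode(log, ["search 'hotel'", "book"], final_state_ok=True))
# a run that reaches the right final screen but skipped a required step
print(evaluate_episode(["open app", "book", "confirm"],
                       ["search 'hotel'", "book"], final_state_ok=True))
```

The second episode would pass a final-state-only benchmark despite skipping a required step, which is precisely the blind spot process-related tasks are meant to expose.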

[250] Efficient Reasoning via Reward Model

Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, Xiangyu Zhao

Main category: cs.AI

TL;DR: Proposes Conciseness Reward Model (CRM) and Conciseness Reward Function (CRF) to address overthinking in large reasoning models, reducing verbose responses while improving accuracy and efficiency.

DetailsMotivation: Large reasoning models like DeepSeek-R1 and OpenAI o1 generate verbose responses with redundant reasoning steps (overthinking), which increases computational costs. Existing length penalty methods suffer from length collapse and training collapse issues.

Method: Train a Conciseness Reward Model (CRM) to score reasoning path conciseness, and introduce Conciseness Reward Function (CRF) with explicit dependency between outcome reward and conciseness score.

Result: Achieves 8.1% accuracy improvement and 19.9% reduction in response token length on Qwen2.5-7B. Method generalizes well to other LLMs including Llama and Mistral.

Conclusion: The proposed approach effectively mitigates overthinking in large reasoning models, improving both reasoning effectiveness and computational efficiency through better reward formulation.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking, which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path. Additionally, we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward from the perspective of variance reduction and improved convergence properties. Besides, on the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method’s effectiveness and token efficiency, which achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: https://anonymous.4open.science/r/CRM.
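The abstract specifies an explicit dependency between outcome reward and conciseness score but not its form. One plausible instantiation, offered purely as a sketch (the multiplicative gating and the alpha weight are our assumptions, not the paper's formula), is:

```python
def conciseness_reward(outcome, conciseness, alpha=0.5):
    """Sketch of a CRF-style reward: the conciseness bonus is multiplied
    by the outcome reward, so a wrong answer cannot be rescued by being
    short, avoiding the length-collapse failure of flat length penalties."""
    return outcome * (1.0 + alpha * conciseness)

# correct & concise > correct & verbose; wrong answers earn nothing
print(conciseness_reward(1.0, 0.9))   # highest
print(conciseness_reward(1.0, 0.1))   # lower: correct but verbose
print(conciseness_reward(0.0, 0.9))   # zero: brevity alone is not rewarded
```

Contrast this with an additive length penalty, where a sufficiently short wrong answer can outscore a correct long one, one route to the collapse modes the paper describes.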

[251] Perspectives on a Reliability Monitoring Framework for Agentic AI Systems

Niclas Flehmig, Mary Ann Lundteigen, Shen Yin

Main category: cs.AI

TL;DR: Proposes a two-layered reliability monitoring framework for agentic AI systems to address reliability challenges in high-risk domains through out-of-distribution detection and AI transparency layers.

DetailsMotivation: Agentic AI systems have insufficient reliability for high-risk domains like healthcare and process industry, posing risks from unexpected behavior during operation that require mitigation techniques.

Method: A two-layered reliability monitoring framework consisting of: 1) out-of-distribution detection layer for novel inputs, and 2) AI transparency layer to reveal internal operations, providing human operators with decision support.

Result: The framework enables human operators to identify potentially unreliable outputs and intervene, providing a foundation for developing mitigation techniques to reduce operational risks.

Conclusion: The proposed monitoring framework addresses fundamental reliability challenges in agentic AI systems by combining detection of novel inputs with transparency mechanisms, supporting safer deployment in high-risk applications.

Abstract: The implementation of agentic AI systems has the potential of providing more helpful AI systems in a variety of applications. These systems work autonomously towards a defined goal with reduced external control. Despite their potential, one of their flaws is insufficient reliability, which makes them especially unsuitable for high-risk domains such as healthcare or process industry. Unreliable systems pose a risk in terms of unexpected behavior during operation, and mitigation techniques are needed. In this work, we derive the main reliability challenges of agentic AI systems during operation based on their characteristics. We draw the connection to traditional AI systems and formulate a fundamental reliability challenge during operation which is inherent to traditional and agentic AI systems. As our main contribution, we propose a two-layered reliability monitoring framework for agentic AI systems which consists of an out-of-distribution detection layer for novel inputs and an AI transparency layer to reveal internal operations. This two-layered monitoring approach gives a human operator the decision support needed to judge whether an output is potentially unreliable and to intervene. This framework provides a foundation for developing mitigation techniques to reduce risk stemming from uncertain reliability during operation.
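The two-layer idea, detect novel inputs first, then surface internal operations for flagged cases, can be sketched minimally; the z-score detector and the trace format are illustrative assumptions, not the paper's concrete mechanisms.

```python
import statistics

def ood_score(x, train_values):
    # layer 1 (sketch): z-score distance from the training distribution
    mu = statistics.mean(train_values)
    sigma = statistics.pstdev(train_values) or 1.0
    return abs(x - mu) / sigma

def monitor(x, train_values, internal_trace, z_threshold=3.0):
    """Two-layer monitor (our sketch): flag novel inputs, and only for
    flagged cases surface the agent's internal trace (layer 2) so a
    human operator can judge the output and intervene."""
    flagged = ood_score(x, train_values) > z_threshold
    return {"flagged": flagged, "trace": internal_trace if flagged else None}

history = [10.0, 11.0, 9.0, 10.5, 9.5]
print(monitor(10.2, history, ["plan", "tool call", "answer"]))   # not flagged
print(monitor(25.0, history, ["plan", "tool call", "answer"]))   # flagged, trace shown
```

Real agentic inputs are not scalars, so a deployed detector would work on embeddings or feature vectors, but the gating of transparency behind a novelty check carries over directly.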

[252] MedFuse: Multiplicative Embedding Fusion For Irregular Clinical Time Series

Yi-Hsien Hsieh, Ta-Jung Chien, Chun-Kai Huang, Shao-Hua Sun, Che Lin

Main category: cs.AI

TL;DR: MedFuse is a framework for irregular clinical time series that uses multiplicative embedding fusion (MuFuse) to better capture value-dependent feature interactions, outperforming state-of-the-art methods on predictive tasks.

DetailsMotivation: Clinical time series from EHRs are irregular with asynchronous sampling, missing values, and heterogeneous dynamics. Existing embedding strategies use additive operations that limit their ability to capture value-dependent feature interactions.

Method: Proposed MedFuse framework with MuFuse module that fuses value and feature embeddings through multiplicative modulation, preserving feature-specific information while modeling higher-order dependencies across features.

Result: Experiments on three real-world datasets covering intensive and chronic care show MedFuse consistently outperforms state-of-the-art baselines on key predictive tasks. Learned representations demonstrate enhanced expressiveness and support cross-dataset pretraining.

Conclusion: MedFuse establishes a generalizable approach for modeling irregular clinical time series through multiplicative fusion, improving predictive performance and representation quality.

Abstract: Clinical time series derived from electronic health records (EHRs) are inherently irregular, with asynchronous sampling, missing values, and heterogeneous feature dynamics. While numerical laboratory measurements are highly informative, existing embedding strategies usually combine feature identity and value embeddings through additive operations, which constrains their ability to capture value-dependent feature interactions. We propose MedFuse, a framework for irregular clinical time series centered on the MuFuse (Multiplicative Embedding Fusion) module. MuFuse fuses value and feature embeddings through multiplicative modulation, preserving feature-specific information while modeling higher-order dependencies across features. Experiments on three real-world datasets covering both intensive and chronic care show that MedFuse consistently outperforms state-of-the-art baselines on key predictive tasks. Analysis of the learned representations further demonstrates that multiplicative fusion enhances expressiveness and supports cross-dataset pretraining. These results establish MedFuse as a generalizable approach for modeling irregular clinical time series.
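The additive vs multiplicative contrast can be sketched in a few lines; the toy value embedding and the `1 + tanh` gating form are our illustrative assumptions, not MedFuse's exact module.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
feature_emb = rng.normal(size=d)                 # identity embedding of one lab feature
value_emb = lambda v: np.tanh(v) * np.ones(d)    # toy embedding of the measured value

# additive fusion: the value can only translate the feature embedding
additive = lambda v: feature_emb + value_emb(v)

# multiplicative modulation (MuFuse-style sketch): the value rescales each
# dimension, so different values reshape the embedding, not just shift it
multiplicative = lambda v: feature_emb * (1.0 + value_emb(v))

# with a neutral value (0), the feature identity passes through unchanged
assert np.allclose(multiplicative(0.0), feature_emb)
```

The point of the sketch is the interaction term: in the multiplicative form the value and feature embeddings multiply elementwise, letting the model express value-dependent feature interactions that a pure sum cannot.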

[253] HyperD: Hybrid Periodicity Decoupling Framework for Traffic Forecasting

Minlan Shao, Zijian Zhang, Yili Wang, Yiwei Dai, Xu Shen, Xin Wang

Main category: cs.AI

TL;DR: HyperD is a novel traffic forecasting framework that decouples traffic data into periodic and residual components using hybrid periodic representation and frequency-aware modeling to handle complex spatial dependencies and multi-scale patterns.

DetailsMotivation: Traffic forecasting faces challenges from complex spatial dependencies and the coexistence of multi-scale periodic patterns with irregular fluctuations caused by unpredictable events like accidents and weather.

Method: HyperD decouples traffic data into periodic and residual components. The periodic component uses Hybrid Periodic Representation Module with learnable embeddings and spatial-temporal attention. The residual component uses Frequency-Aware Residual Representation Module with complex-valued MLP in frequency domain, plus Dual-View Alignment Loss for semantic separation.

Result: Extensive experiments on four real-world datasets show HyperD achieves state-of-the-art prediction accuracy, superior robustness under disturbances, and improved computational efficiency compared to existing methods.

Conclusion: HyperD effectively addresses traffic forecasting challenges by decoupling periodic and non-periodic components, demonstrating strong performance and practical advantages for intelligent transportation systems.

Abstract: Accurate traffic forecasting plays a vital role in intelligent transportation systems, enabling applications such as congestion control, route planning, and urban mobility optimization. However, traffic forecasting remains challenging due to two key factors: (1) complex spatial dependencies arising from dynamic interactions between road segments and traffic sensors across the network, and (2) the coexistence of multi-scale periodic patterns (e.g., daily and weekly periodic patterns driven by human routines) with irregular fluctuations caused by unpredictable events (e.g., accidents, weather, or construction). To tackle these challenges, we propose HyperD (Hybrid Periodic Decoupling), a novel framework that decouples traffic data into periodic and residual components. The periodic component is handled by the Hybrid Periodic Representation Module, which extracts fine-grained daily and weekly patterns using learnable periodic embeddings and spatial-temporal attention. The residual component, which captures non-periodic, high-frequency fluctuations, is modeled by the Frequency-Aware Residual Representation Module, leveraging a complex-valued MLP in the frequency domain. To enforce semantic separation between the two components, we further introduce a Dual-View Alignment Loss, which aligns low-frequency information with the periodic branch and high-frequency information with the residual branch. Extensive experiments on four real-world traffic datasets demonstrate that HyperD achieves state-of-the-art prediction accuracy, while offering superior robustness under disturbances and improved computational efficiency compared to existing methods.
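The periodic/residual decoupling that the alignment loss enforces can be illustrated with a plain FFT low/high-pass split on synthetic traffic; the cutoff and signal are our illustrative choices, and HyperD learns this separation rather than hard-coding it.

```python
import numpy as np

t = np.arange(7 * 24)                                       # one week of hourly flow
daily = np.sin(2 * np.pi * t / 24)                          # daily periodic pattern
bumps = 0.3 * np.random.default_rng(1).normal(size=t.size)  # irregular events
x = daily + bumps

# split the series into a low-frequency (periodic) view and a
# high-frequency (residual) view, mirroring the dual-view alignment idea
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(t.size, d=1.0)                      # cycles per hour
cutoff = 1 / 12                                             # keep cycles slower than 12 h
low = np.where(freqs <= cutoff, X, 0)
high = X - low
periodic_view = np.fft.irfft(low, n=t.size)
residual_view = np.fft.irfft(high, n=t.size)

assert np.allclose(periodic_view + residual_view, x)        # exact decomposition
```

The low-frequency view recovers the daily rhythm almost perfectly, which is the signal the periodic branch is aligned to; the remainder is what the complex-valued MLP branch would model.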

[254] From Model Training to Model Raising – A call to reform AI model training paradigms from post-hoc alignment to intrinsic, identity-based development

Roland Aydin, Christian Cyron, Steve Bachelor, Ashton Anderson, Robert West

Main category: cs.AI

TL;DR: Proposes shifting from “model training” to “model raising” by integrating alignment into model development from the start through redesigning training corpora.

Motivation: Current AI training methods align models with human values only after core capabilities are established, resulting in easily misaligned models lacking deep-rooted value systems.

Method: Redesign training corpus by: reframing data from first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding data ordering.

Result: Expected to lead to early commitment to values from the first training token onward, making knowledge, skills, and values intrinsically harder to separate.

Conclusion: This paradigm shift is critical as LLM capabilities increasingly overtake human capabilities, requiring deeply integrated value systems from the beginning.

Abstract: Current AI training methods align models with human values only after their core capabilities have been established, resulting in models that are easily misaligned and lack deep-rooted value systems. We propose a paradigm shift from “model training” to “model raising”, in which alignment is woven into a model’s development from the start. We identify several key components for this paradigm, all centered around redesigning the training corpus: reframing training data from a first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding the ordering of training data. We expect that this redesign of the training corpus will lead to an early commitment to values from the first training token onward, such that knowledge, skills, and values are intrinsically much harder to separate. In an ecosystem in which large language model capabilities start overtaking human capabilities in many tasks, this seems to us like a critical need.

[255] Not Everything That Counts Can Be Counted: A Case for Safe Qualitative AI

Stine Beltoft, Lukas Galke

Main category: cs.AI

TL;DR: AI has advanced quantitative research but neglected qualitative methods, creating a need for dedicated qualitative AI systems that are transparent, reproducible, and privacy-friendly.

Motivation: Qualitative research has been left behind in AI adoption, forcing researchers to use general-purpose tools like ChatGPT despite their limitations of bias, opacity, irreproducibility, and privacy issues.

Method: Argue for developing dedicated qualitative AI systems built from the ground up for interpretive research, and review literature to show how existing automated discovery pipelines could be enhanced by robust qualitative capabilities.

Result: Identified a critical gap where qualitative dimensions essential for meaning-making remain poorly integrated with AI, and found opportunities for safe qualitative AI to advance multidisciplinary and mixed-methods research.

Conclusion: There is an urgent need to develop specialized qualitative AI systems that address the unique requirements of interpretive research while being transparent, reproducible, and privacy-preserving.

Abstract: Artificial intelligence (AI) and large language models (LLM) are reshaping science, with most recent advances culminating in fully-automated scientific discovery pipelines. But qualitative research has been left behind. Researchers in qualitative methods are hesitant about AI adoption. Yet when they are willing to use AI at all, they have little choice but to rely on general-purpose tools like ChatGPT to assist with interview interpretation, data annotation, and topic modeling - while simultaneously acknowledging these systems' well-known limitations of being biased, opaque, irreproducible, and privacy-compromising. This creates a critical gap: while AI has substantially advanced quantitative methods, the qualitative dimensions essential for meaning-making and comprehensive scientific understanding remain poorly integrated. We argue for developing dedicated qualitative AI systems built from the ground up for interpretive research. Such systems must be transparent, reproducible, and privacy-friendly. We review recent literature to show how existing automated discovery pipelines could be enhanced by robust qualitative capabilities, and identify key opportunities where safe qualitative AI could advance multidisciplinary and mixed-methods research.

[256] BarrierBench : Evaluating Large Language Models for Safety Verification in Dynamical Systems

Ali Taheri, Alireza Taban, Sadegh Soudjani, Ashutosh Trivedi

Main category: cs.AI

TL;DR: LLM-based agentic framework for synthesizing barrier certificates in dynamical systems, achieving over 90% success rate on a new benchmark of 100 systems.

Motivation: Current barrier certificate synthesis methods suffer from poor scalability, template dependence, and require substantial manual expertise in selecting templates, solvers, and hyperparameters.

Method: LLM-based agentic framework using natural language reasoning to propose, refine, and validate candidate certificates, integrating LLM-driven template discovery with SMT-based verification and supporting barrier-controller co-synthesis.

Result: Achieves more than 90% success rate in generating valid certificates across BarrierBench (100 dynamical systems spanning linear, nonlinear, discrete-time, and continuous-time settings).

Conclusion: Demonstrates effective integration of language-based reasoning with formal verification, establishing a community testbed for advancing this approach in dynamical systems safety verification.

Abstract: Safety verification of dynamical systems via barrier certificates is essential for ensuring correctness in autonomous applications. Synthesizing these certificates involves discovering mathematical functions with current methods suffering from poor scalability, dependence on carefully designed templates, and exhaustive or incremental function-space searches. They also demand substantial manual expertise–selecting templates, solvers, and hyperparameters, and designing sampling strategies–requiring both theoretical and practical knowledge traditionally shared through linguistic reasoning rather than formalized methods. This motivates a key question: can such expert reasoning be captured and operationalized by language models? We address this by introducing an LLM-based agentic framework for barrier certificate synthesis. The framework uses natural language reasoning to propose, refine, and validate candidate certificates, integrating LLM-driven template discovery with SMT-based verification, and supporting barrier-controller co-synthesis to ensure consistency between safety certificates and controllers. To evaluate this capability, we introduce BarrierBench, a benchmark of 100 dynamical systems spanning linear, nonlinear, discrete-time, and continuous-time settings. Our experiments assess not only the effectiveness of LLM-guided barrier synthesis but also the utility of retrieval-augmented generation and agentic coordination strategies in improving its reliability and performance. Across these tasks, the framework achieves more than 90% success in generating valid certificates. By releasing BarrierBench and the accompanying toolchain, we aim to establish a community testbed for advancing the integration of language-based reasoning with formal verification in dynamical systems. The benchmark is publicly available at https://hycodev.com/dataset/barrierbench
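A barrier certificate B must be non-positive on the initial set, positive on the unsafe set, and non-increasing along the flow. The framework above uses SMT solvers for sound verification; the sketch below is only a sampling-based sanity check (falsification, not a proof) of those three conditions for a 1-D system, with the example system and candidate chosen for illustration:

```python
import random

# Sampling-based falsification check of a candidate barrier certificate
# B for dx/dt = f(x): search for points violating the three conditions.

def lie_derivative(B, f, x, h=1e-6):
    # dB/dt along the flow, via a central finite difference for dB/dx
    return (B(x + h) - B(x - h)) / (2 * h) * f(x)

def falsify(B, f, init, unsafe, domain, n=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(n):
        xi = rng.uniform(*init)
        if B(xi) > 0:                        # B must be <= 0 on the initial set
            return ("init", xi)
        xu = rng.uniform(*unsafe)
        if B(xu) <= 0:                       # B must be > 0 on the unsafe set
            return ("unsafe", xu)
        xd = rng.uniform(*domain)
        if lie_derivative(B, f, xd) > 1e-6:  # B must not increase along f
            return ("flow", xd)
    return None                              # no counterexample found

if __name__ == "__main__":
    f = lambda x: -x              # stable linear system
    B = lambda x: x * x - 1.0     # candidate certificate
    cex = falsify(B, f, init=(-0.5, 0.5), unsafe=(2.0, 3.0), domain=(-3.0, 3.0))
    assert cex is None
```

In the paper's loop, an LLM proposes and refines such candidates and an SMT solver replaces this sampling step with an exhaustive check.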

[257] The 2025 Planning Performance of Frontier Large Language Models

Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

Main category: cs.AI

TL;DR: Evaluation of three frontier LLMs (DeepSeek R1, Gemini 2.5 Pro, GPT-5) on PDDL planning tasks shows GPT-5 is competitive with LAMA planner on standard domains, with all models showing improved reasoning capabilities compared to previous generations.

Motivation: To assess the current state of LLM reasoning capabilities for planning tasks and measure progress against traditional planners, particularly as frontier models continue to advance.

Method: Evaluated three 2025 frontier LLMs on PDDL domain and task descriptions from the International Planning Competition Learning Track, testing both standard and obfuscated PDDL to isolate pure reasoning ability.

Result: GPT-5 performed competitively with LAMA planner on standard PDDL domains. All LLMs showed performance degradation on obfuscated tasks but less severe than previously reported for other models, indicating substantial improvements over prior LLM generations.

Conclusion: Frontier LLMs have significantly reduced the performance gap to traditional planners on challenging planning benchmarks, showing substantial reasoning improvements over previous generations.

Abstract: The capacity of Large Language Models (LLMs) for reasoning remains an active area of research, with the capabilities of frontier models continually advancing. We provide an updated evaluation of the end-to-end planning performance of three frontier LLMs as of 2025, where models are prompted to generate a plan from PDDL domain and task descriptions. We evaluate DeepSeek R1, Gemini 2.5 Pro, and GPT-5, with the planner LAMA as a reference, on a subset of domains from the most recent Learning Track of the International Planning Competition. Our results show that on standard PDDL domains, the performance of GPT-5 in terms of solved tasks is competitive with LAMA. When the PDDL domains and tasks are obfuscated to test for pure reasoning, the performance of all LLMs degrades, though less severely than previously reported for other models. These results show substantial improvements over prior generations of LLMs, reducing the performance gap to planners on a challenging benchmark.
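Scoring an LLM-generated plan comes down to validating it against the domain's action semantics. A much-simplified STRIPS-style stand-in for the PDDL validation used in such evaluations (the blocks-world action names here are illustrative, not from the paper's benchmark):

```python
# Toy STRIPS-style plan checker: an action applies only if its
# preconditions hold in the current state; effects delete then add facts.

def apply_plan(init, goal, actions, plan):
    """actions maps name -> (preconditions, add effects, delete effects)."""
    state = frozenset(init)
    for name in plan:
        pre, add, delete = actions[name]
        if not pre <= state:
            return False                      # precondition violated
        state = (state - delete) | add
    return goal <= state

if __name__ == "__main__":
    actions = {
        "pick(a)": ({"clear(a)", "handempty"}, {"holding(a)"},
                    {"clear(a)", "handempty", "ontable(a)"}),
        "stack(a,b)": ({"holding(a)", "clear(b)"},
                       {"on(a,b)", "clear(a)", "handempty"},
                       {"holding(a)", "clear(b)"}),
    }
    init = {"ontable(a)", "clear(a)", "clear(b)", "handempty"}
    goal = {"on(a,b)"}
    assert apply_plan(init, goal, actions, ["pick(a)", "stack(a,b)"])
    assert not apply_plan(init, goal, actions, ["stack(a,b)"])
```

Real PDDL validators additionally handle typed objects, conditional effects, and action costs; the pass/fail criterion is the same.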

[258] What We Don’t C: Representations for scientific discovery beyond VAEs

Brian Rogers, Micah Bowles, Chris J. Lintott, Steve Croft

Main category: cs.AI

TL;DR: A novel latent flow matching method with classifier-free guidance that disentangles latent subspaces by separating conditioning information from residual representations, enabling access to meaningful features in high-dimensional data.

Motivation: To enable scientific discovery by accessing information in learned representations of high-dimensional domains, particularly focusing on analyzing what isn't captured or cataloged in generative models.

Method: Latent flow matching with classifier-free guidance that explicitly separates information included in conditioning from information remaining in residual representations.

Result: Successfully demonstrated across three experiments (2D Gaussian toy problem, colored MNIST, Galaxy10 astronomy dataset) that the method enables access to meaningful features of high-dimensional data.

Conclusion: Provides a simple yet powerful mechanism for analyzing, controlling, and repurposing latent representations, offering a pathway for using generative models in scientific exploration of uncaptured information.

Abstract: Accessing information in learned representations is critical for scientific discovery in high-dimensional domains. We introduce a novel method based on latent flow matching with classifier-free guidance that disentangles latent subspaces by explicitly separating information included in conditioning from information that remains in the residual representation. Across three experiments – a synthetic 2D Gaussian toy problem, colored MNIST, and the Galaxy10 astronomy dataset – we show that our method enables access to meaningful features of high dimensional data. Our results highlight a simple yet powerful mechanism for analyzing, controlling, and repurposing latent representations, providing a pathway toward using generative models for scientific exploration of what we don’t capture, consider, or catalog.
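Classifier-free guidance in a flow-matching sampler blends unconditional and conditional velocity predictions before each integration step. This is the generic formulation, not the paper's specific disentanglement objective; the stand-in velocity values below are made up for illustration:

```python
# Classifier-free guidance for a flow-matching velocity field:
# v = v_uncond + w * (v_cond - v_uncond), followed by one Euler step.

def guided_velocity(v_uncond, v_cond, w):
    """w = 0: unconditional; w = 1: conditional; w > 1: extrapolated."""
    return [vu + w * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]

def euler_step(x, v, dt):
    return [xi + dt * vi for xi, vi in zip(x, v)]

if __name__ == "__main__":
    x = [0.0, 0.0]
    v_u = [1.0, 0.0]   # stand-in network outputs (hypothetical values)
    v_c = [1.0, 2.0]
    v = guided_velocity(v_u, v_c, w=2.0)
    x = euler_step(x, v, dt=0.1)
```

The paper's twist is to read off what the conditioning does *not* explain: the residual direction v_cond - v_uncond carries the conditioned information, leaving the rest in the residual representation.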

[259] CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?

Peiyu Li, Xiaobao Huang, Nitesh V. Chawla

Main category: cs.AI

TL;DR: CrochetBench is a benchmark for evaluating multimodal LLMs’ fine-grained procedural reasoning in crochet, focusing on executable correctness rather than surface-level understanding.

Motivation: To address limitations in existing benchmarks that focus on high-level description rather than executable procedural reasoning in real-world creative domains like crochet.

Method: Uses CrochetPARADE DSL as intermediate representation for structural validation and functional evaluation via execution. Covers stitch classification, instruction grounding, and natural language/image-to-DSL translation tasks.

Result: Performance sharply declines when evaluation shifts from surface-level similarity to executable correctness, revealing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis.

Conclusion: CrochetBench highlights the gap between surface-level understanding and executable precision in multimodal models, providing a new lens for assessing procedural competence in creative domains.

Abstract: We present CrochetBench, a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the domain of crochet. Unlike prior benchmarks that focus on high-level description or visual question answering, CrochetBench shifts the emphasis from describing to doing: models are required to recognize stitches, select structurally appropriate instructions, and generate compilable crochet procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers tasks including stitch classification, instruction grounding, and both natural language and image-to-DSL translation. Across all tasks, performance sharply declines as the evaluation shifts from surface-level similarity to executable correctness, exposing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains. Code is available at https://github.com/Peiyu-Georgia-Li/crochetBench.

[260] Consensus Sampling for Safer Generative AI

Adam Tauman Kalai, Yael Tauman Kalai, Or Zamir

Main category: cs.AI

TL;DR: A consensus sampling algorithm that aggregates multiple generative models to enhance safety, achieving risk comparable to the safest subset while abstaining when models disagree.

Motivation: Many AI safety approaches rely on inspecting model outputs or activations, but some risks are inherently undetectable by inspection alone. This work proposes a complementary approach that doesn't depend on model architecture.

Method: Consensus sampling algorithm that uses k models and achieves risk competitive with the average risk of the safest s models. It leverages models’ ability to compute output probabilities and abstains when there’s insufficient agreement between models.

Result: The algorithm bounds the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. It amplifies safety guarantees from an unknown subset of safe models to create a single reliable model.

Conclusion: Provides a new model-agnostic approach for AI safety by aggregating multiple models, inheriting safety from the safest subset while maintaining the ability to abstain when consensus is lacking.

Abstract: Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models’ ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
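The idea can be conveyed with a toy aggregator (the paper's algorithm differs in detail and carries formal risk guarantees): keep only probability mass that at least s of the k models support, renormalize, and abstain when the shared mass is too small. The threshold tau below is an assumption of this sketch:

```python
# Illustrative consensus over k next-token distributions: for each token
# keep the s-th largest probability across models (mass supported by at
# least s models), renormalize, and abstain on insufficient agreement.

def consensus(dists, s, tau=0.5):
    """dists: list of dicts token -> prob. Returns an aggregated
    distribution, or None (abstain) if the shared mass is below tau."""
    tokens = set().union(*dists)
    shared = {}
    for t in tokens:
        probs = sorted((d.get(t, 0.0) for d in dists), reverse=True)
        shared[t] = probs[s - 1]          # supported by >= s models
    mass = sum(shared.values())
    if mass < tau:
        return None                       # abstain
    return {t: p / mass for t, p in shared.items()}

if __name__ == "__main__":
    agree = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2}, {"a": 0.7, "b": 0.3}]
    out = consensus(agree, s=2)
    assert out is not None and out["a"] > 0.7
    disagree = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]
    assert consensus(disagree, s=2) is None   # no overlap -> abstain
```

As in the paper, an unsafe minority cannot inject mass the safe majority rejects, and total disagreement triggers abstention rather than a risky sample.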

[261] Fundamentals of Physical AI

Vahid Salehi

Main category: cs.AI

TL;DR: This paper establishes Physical AI as a framework where intelligence emerges from embodied interaction between body, environment, and experience, contrasting with classical AI’s symbolic processing approach.

Motivation: To create a theoretical foundation for physically intelligent systems that understands intelligence as an emergent phenomenon from real physical interactions rather than symbolic processing or data-driven models.

Method: Proposes six fundamental principles (embodiment, sensory perception, motor action, learning, autonomy, context sensitivity) that form a closed control loop where energy, information, control, and context constantly interact.

Result: Develops a coherent framework showing that physical intelligence generates meaning from physical experience rather than databases, with learning understood as structural coupling changes between agents and environment.

Conclusion: Physical AI represents a paradigm shift where intelligence arises from immediate embodied experience, demonstrated through practical applications like adaptive assistant robots in rehabilitation settings.

Abstract: This work will elaborate the fundamental principles of physical artificial intelligence (Physical AI) from a scientific and systemic perspective. The aim is to create a theoretical foundation that describes the physical embodiment, sensory perception, ability to act, learning processes, and context sensitivity of intelligent systems within a coherent framework. While classical AI approaches rely on symbolic processing and data driven models, Physical AI understands intelligence as an emergent phenomenon of real interaction between body, environment, and experience. The six fundamentals presented here are embodiment, sensory perception, motor action, learning, autonomy, and context sensitivity, and form the conceptual basis for designing and evaluating physically intelligent systems. Theoretically, it is shown that these six principles do not represent loose functional modules but rather act as a closed control loop in which energy, information, control, and context are in constant interaction. This circular interaction enables a system to generate meaning not from databases, but from physical experience, a paradigm shift that understands intelligence as an physical embodied process. Physical AI understands learning not as parameter adjustment, but as a change in the structural coupling between agents and the environment. To illustrate this, the theoretical model is explained using a practical scenario: An adaptive assistant robot supports patients in a rehabilitation clinic. This example illustrates that physical intelligence does not arise from abstract calculation, but from immediate, embodied experience. It shows how the six fundamentals interact in a real system: embodiment as a prerequisite, perception as input, movement as expression, learning as adaptation, autonomy as regulation, and context as orientation.

[262] Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart Russell, Michael Dennis

Main category: cs.AI

TL;DR: RPO introduces rationality-preserving adversarial optimization to prevent self-sabotage in cooperative multi-agent settings, enabling robust policy learning without irrational incentives.

Motivation: Adversarial optimization fails in cooperative settings due to self-sabotage where agents irrationally block task completion to halt learning.

Method: Developed Rational Policy Gradient (RPG) that trains agents in modified games using opponent shaping to preserve rationality while optimizing adversarial objectives.

Result: RPG enables adversarial optimization in cooperative settings, finding adversarial examples, improving robustness, and learning diverse policies without self-sabotage.

Conclusion: RPO successfully extends adversarial optimization to cooperative and general-sum games while preserving agent rationality, achieving strong empirical performance.

Abstract: Adversarial optimization algorithms that explicitly search for flaws in agents’ policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational–that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments. Our project page can be found at https://rational-policy-gradient.github.io.

[263] Breadth-First Search vs. Restarting Random Walks for Escaping Uninformed Heuristic Regions

Daniel Platnick, Dawson Tomasz, Eamon Earl, Sourena Khanzadeh, Richard Valenzano

Main category: cs.AI

TL;DR: The paper compares two methods for escaping Uninformed Heuristic Regions (UHRs) in greedy search algorithms: breadth-first search (BrFS) and restarting random walks (RRWs), with theoretical runtime analysis and empirical evaluation.

Motivation: Greedy search methods like GBFS and EHC struggle with UHRs (heuristic local minima/plateaus), creating a need for effective escape mechanisms.

Method: Theoretical derivation of expected runtime for BrFS and RRWs to escape UHRs, followed by empirical comparison of EHC (using BrFS) vs EHC-RRW (using RRWs) on PDDL planning benchmarks.

Result: EHC-RRW shows strong expected runtime guarantees in cases where EHC is effective, and experimental evaluation provides insights into their relative effectiveness for escaping UHRs.

Conclusion: RRWs can be faster than BrFS for escaping UHRs in certain scenarios, and EHC-RRW variants offer promising alternatives to standard EHC with better theoretical guarantees.

Abstract: Greedy search methods like Greedy Best-First Search (GBFS) and Enforced Hill-Climbing (EHC) often struggle when faced with Uninformed Heuristic Regions (UHRs) like heuristic local minima or plateaus. In this work, we theoretically and empirically compare two popular methods for escaping UHRs: breadth-first search (BrFS) and restarting random walks (RRWs). We first derive the expected runtime of escaping a UHR using BrFS and RRWs, based on properties of the UHR and the random walk procedure, and then use these results to identify when RRWs will be faster in expectation than BrFS. We then evaluate these methods for escaping UHRs by comparing standard EHC, which uses BrFS to escape UHRs, to variants of EHC called EHC-RRW, which use RRWs for that purpose. EHC-RRW is shown to have strong expected runtime guarantees in cases where EHC has previously been shown to be effective. We also run experiments with these approaches on PDDL planning benchmarks to better understand their relative effectiveness for escaping UHRs.
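The trade-off the paper analyzes can be simulated on a toy plateau (a uniform tree of branching factor b with exits at depth d; this model and its parameters are illustrative, not the paper's benchmark): BrFS pays roughly b^d expansions regardless, while a restarting walk of length d pays d per attempt and wins when exits are plentiful.

```python
import random

# Toy UHR-escape comparison on a uniform tree plateau: BrFS node count
# to depth d versus a restarting random walk that succeeds per attempt
# with the probability that a depth-d walk lands on an exit leaf.

def brfs_expansions(b, d):
    return sum(b ** i for i in range(d + 1))   # nodes generated to depth d

def rrw_steps(b, d, exit_fraction, seed=0, max_restarts=100_000):
    rng = random.Random(seed)
    steps = 0
    for _ in range(max_restarts):
        steps += d                             # one walk of length d
        if rng.random() < exit_fraction:       # the walk hit an exit leaf
            return steps
    return steps

if __name__ == "__main__":
    b, d = 4, 6
    sparse = rrw_steps(b, d, exit_fraction=b ** -d)  # single exit
    dense = rrw_steps(b, d, exit_fraction=0.5)       # many exits
    print(brfs_expansions(b, d), sparse, dense)
```

With a single exit RRWs pay on the order of d * b^d steps in expectation, worse than BrFS; with many exits they escape after a handful of cheap walks, which mirrors the paper's condition for when RRWs are faster in expectation.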

[264] ElicitationGPT: Text Elicitation Mechanisms via Language Models

Yifan Wu, Jason Hartline

Main category: cs.AI

TL;DR: This paper develops mechanisms for scoring elicited text against ground truth text by reducing text elicitation to forecast elicitation using large language models (ChatGPT), with theoretical guarantees and empirical evaluation against human preferences.

Motivation: To create mechanisms for scoring textual information elicitation against ground truth text, addressing the fundamental need for proper scoring rules in incentivized information elicitation.

Method: Reduces textual information elicitation to forecast elicitation using domain-knowledge-free queries to ChatGPT, with theoretical analysis proving properness via black-box language models.

Result: Empirical evaluation on peer reviews from a peer-grading dataset shows alignment with manual instructor scores, suggesting the approach works effectively.

Conclusion: Presents a paradigm of algorithmic AI useful for developing AI technologies with provable guarantees, demonstrating successful reduction of text scoring to forecast elicitation via LLMs.

Abstract: Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information. This paper develops mechanisms for scoring elicited text against ground truth text by reducing the textual information elicitation problem to a forecast elicitation problem, via domain-knowledge-free queries to a large language model (specifically ChatGPT), and empirically evaluates their alignment with human preferences. Our theoretical analysis shows that the reduction achieves provable properness via black-box language models. The empirical evaluation is conducted on peer reviews from a peer-grading dataset, in comparison to manual instructor scores for the peer reviews. Our results suggest a paradigm of algorithmic artificial intelligence that may be useful for developing artificial intelligence technologies with provable guarantees.
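The reduction bottoms out in proper scoring rules, whose defining property is that truthful reporting maximizes expected score. A minimal example (standard material, not the paper's mechanism) is the quadratic (Brier-style) score for a binary event:

```python
# Quadratic (Brier-style) scoring rule for a binary event: the expected
# score E[S(r)] = 1 - p*(r-1)^2 - (1-p)*r^2 is maximized at r = p,
# which is exactly what makes the rule "proper".

def quadratic_score(report, outcome):
    """outcome in {0, 1}; higher is better."""
    return 1.0 - (report - outcome) ** 2

def expected_score(report, belief):
    return (belief * quadratic_score(report, 1)
            + (1 - belief) * quadratic_score(report, 0))

if __name__ == "__main__":
    belief = 0.7
    truthful = expected_score(belief, belief)
    for lie in (0.0, 0.3, 0.5, 0.9, 1.0):
        assert expected_score(lie, belief) <= truthful   # properness
```

The paper's contribution is to make the "forecast" a set of LLM-answered questions about the ground-truth text, so that properness of the underlying rule carries over to scoring free-form text.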

[265] Discussion Graph Semantics of First-Order Logic with Equality for Reasoning about Discussion and Argumentation

Ryuta Arisaka

Main category: cs.AI

TL;DR: This paper introduces a discussion-graph semantics for first-order logic with equality, generalizes Dung’s argumentation extensions to handle equivalent nodes, and shows these generalized extensions are first-order characterizable.

Motivation: To address the lack of a formal reasoning framework capable of handling diverse discussion and argumentation models in AI, and to extend argumentation theory beyond propositional limitations.

Method: Formulated discussion-graph semantics for first-order logic with equality, generalized Dung’s notion of extensions to handle equivalent graph nodes, and connected these concepts through first-order characterizability proofs.

Result: Showed that generalized extensions are first-order characterizable within the proposed semantics, with propositional characterizability of Dung’s extensions and acceptability semantics as immediate consequences.

Conclusion: The paper successfully bridges first-order logic with argumentation theory, providing a more general framework for reasoning about discussions and argumentation in AI systems.

Abstract: We make three contributions. First, we formulate a discussion-graph semantics for first-order logic with equality, enabling reasoning about discussion and argumentation in AI more generally than before. This addresses the current lack of a formal reasoning framework capable of handling diverse discussion and argumentation models. Second, we generalise Dung’s notion of extensions to cases where two or more graph nodes in an argumentation framework are equivalent. Third, we connect these two contributions by showing that the generalised extensions are first-order characterisable within the proposed discussion-graph semantics. Propositional characterisability of all Dung’s extensions is an immediate consequence. We furthermore show that the set of all generalised extensions (acceptability semantics), too, are first-order characterisable. Propositional characterisability of all Dung’s acceptability semantics is an immediate consequence.
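For readers unfamiliar with Dung's extensions being generalized here, the classical definitions (conflict-freeness, defense, admissibility) can be enumerated directly on a small argumentation framework; this sketch covers only the standard setting, without the paper's equivalence generalization:

```python
from itertools import combinations

# Enumerate the admissible sets of a Dung argumentation framework:
# a set is conflict-free if no member attacks another, and admissible
# if it is conflict-free and defends each of its members against
# every attacker.

def admissible_sets(args, attacks):
    def conflict_free(S):
        return not any((a, b) in attacks for a in S for b in S)

    def defends(S, a):
        # every attacker b of a is itself attacked by some d in S
        return all(any((d, b) in attacks for d in S)
                   for (b, c) in attacks if c == a)

    result = []
    for r in range(len(args) + 1):
        for S in map(set, combinations(sorted(args), r)):
            if conflict_free(S) and all(defends(S, a) for a in S):
                result.append(S)
    return result

if __name__ == "__main__":
    # a attacks b, b attacks c: {a, c} is admissible, {b} is not
    adm = admissible_sets({"a", "b", "c"}, {("a", "b"), ("b", "c")})
    assert {"a", "c"} in adm and {"b"} not in adm
```

The paper's generalized extensions additionally treat equivalent nodes as interchangeable, and show both notions are characterizable in the proposed first-order discussion-graph semantics.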

[266] rLLM: Relational Table Learning with LLMs

Weichen Li, Xiaotong Huang, Jianwu Zheng, Zheng Wang, Chaokun Wang, Li Pan, Jianhua Li

Main category: cs.AI

TL;DR: rLLM is a PyTorch library for Relational Table Learning with LLMs that decomposes GNNs, LLMs, and Table Neural Networks into modules for fast model construction via “combine, align, and co-train” approach.

Motivation: To provide a development framework that simplifies and accelerates the creation of novel Relational Table Learning models by standardizing components from different neural network architectures.

Method: Decomposes state-of-the-art Graph Neural Networks, Large Language Models, and Table Neural Networks into standardized modules, enabling fast construction of RTL models through a “combine, align, and co-train” methodology.

Result: Developed rLLM library and introduced BRIDGE method as an example. Created three novel relational tabular datasets (TML1M, TLF2K, TACM12K) by enhancing classic datasets.

Conclusion: rLLM serves as a useful and easy-to-use development framework for Relational Table Learning-related tasks, with code publicly available.

Abstract: We introduce rLLM (relationLLM), a PyTorch library designed for Relational Table Learning (RTL) with Large Language Models (LLMs). The core idea is to decompose state-of-the-art Graph Neural Networks, LLMs, and Table Neural Networks into standardized modules, to enable the fast construction of novel RTL-type models in a simple “combine, align, and co-train” manner. To illustrate the usage of rLLM, we introduce a simple RTL method named \textbf{BRIDGE}. Additionally, we present three novel relational tabular datasets (TML1M, TLF2K, and TACM12K) by enhancing classic datasets. We hope rLLM can serve as a useful and easy-to-use development framework for RTL-related tasks. Our code is available at: https://github.com/rllm-project/rllm.

[267] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu

Main category: cs.AI

TL;DR: This paper provides a comprehensive survey of low-bit quantization methods for large language models (LLMs), covering fundamental principles, system implementations, and algorithmic strategies to reduce memory and computational requirements.

Motivation: LLMs have achieved remarkable advancements but face practical deployment challenges due to expensive memory and computational requirements. Low-bit quantization has emerged as a critical approach to mitigate these challenges.

Method: The survey systematically reviews low-bit quantization methods from three perspectives: basic concepts and data formats, system implementations across hardware platforms, and algorithmic techniques for efficient training and inference.

Result: The paper presents a comprehensive overview of frameworks, systems, techniques, and toolkits that enable low-bit quantization for LLMs, categorizing and analyzing various approaches.

Conclusion: The systematic overview offers valuable insights and guidelines for future works to enhance LLM efficiency and applicability through low-bit quantization, with discussions on future trends and potential advancements.

Abstract: Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
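One of the basic data-format building blocks such a survey covers is symmetric per-tensor int8 weight quantization. The sketch below is illustrative only (function names and the clipping range are my own choices, not the survey's):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
```

Real low-bit pipelines refine this with per-channel or per-group scales, asymmetric zero-points, and sub-8-bit formats, which is where much of the surveyed work lives.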

[268] Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

Zirui Shao, Feiyu Gao, Zhaoqing Zhu, Chuwei Luo, Hangdi Xing, Zhi Yu, Qi Zheng, Ming Yan, Jiajun Bu

Main category: cs.AI

TL;DR: This paper identifies and addresses Cognition and Perception (C&P) knowledge conflicts in multimodal large language models (MLLMs) for document understanding, where models generate answers inconsistent with visual content.

Motivation: Current MLLMs face conflicts between perception (what they see via OCR) and cognition (what they understand), challenging performance and explainability in document understanding tasks.

Method: Proposes Multimodal Knowledge Consistency Fine-tuning to mitigate C&P knowledge conflicts by aligning perceptual and cognitive capabilities.

Result: Even GPT-4o achieves only 75.26% C&P consistency. The proposed method reduces conflicts across all tested MLLMs and improves performance in both cognitive and perceptual tasks.

Conclusion: C&P knowledge conflicts are a significant issue in MLLMs, and the proposed fine-tuning method effectively mitigates these conflicts while enhancing model performance.

Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it “sees” and what it “understands”. Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflict, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks.
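The C&P consistency number reported above can be thought of as the fraction of examples where the model's cognitive answer agrees with its own perceptual (OCR) output. A minimal sketch of such a metric, with hypothetical field names and a crude substring match standing in for the paper's actual comparison procedure:

```python
def normalize(s: str) -> str:
    # Lowercase and collapse whitespace for a tolerant comparison.
    return " ".join(s.lower().split())

def cp_consistency(examples) -> float:
    """Fraction of examples whose VQA answer (cognition) appears in the
    model's own OCR output (perception). Illustrative metric only."""
    hits = sum(normalize(ex["answer"]) in normalize(ex["ocr"]) for ex in examples)
    return hits / len(examples)

examples = [
    {"ocr": "Invoice Total: $42.00", "answer": "$42.00"},
    {"ocr": "Invoice Total: $42.00", "answer": "$40.00"},  # C&P conflict
]
rate = cp_consistency(examples)
```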

[269] MAPS: Multi-Agent Personality Shaping for Collaborative Reasoning

Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Fangzhi Xu, Qika Lin, Lingling Zhang, Rui Mao, Erik Cambria, Jun Liu

Main category: cs.AI

TL;DR: MAPS is a multi-agent reasoning framework that assigns distinct personality traits to agents based on Big Five theory and introduces a Critic agent for reflection and iterative refinement, achieving strong performance across multiple benchmarks.

Motivation: Existing multi-agent reasoning approaches suffer from homogeneous agent behaviors and lack reflective/rethinking capabilities, limiting their problem-solving robustness and diversity.

Method: Assigns distinct personality traits to agents using Big Five theory to shape reasoning styles, and introduces a Critic agent that reflects on intermediate outputs, revisits flawed steps, and guides iterative refinement.

Result: Empirical evaluations across three benchmarks demonstrate strong performance, with analysis confirming generalizability across different LLMs and validating benefits of multi-agent collaboration.

Conclusion: The integration of personality-driven agent design and structured collaboration improves both reasoning depth and flexibility in multi-agent systems.

Abstract: Collaborative reasoning with multiple agents offers the potential for more robust and diverse problem-solving. However, existing approaches often suffer from homogeneous agent behaviors and lack of reflective and rethinking capabilities. We propose Multi-Agent Personality Shaping (MAPS), a novel framework that enhances reasoning through agent diversity and internal critique. Inspired by the Big Five personality theory, MAPS assigns distinct personality traits to individual agents, shaping their reasoning styles and promoting heterogeneous collaboration. To enable deeper and more adaptive reasoning, MAPS introduces a Critic agent that reflects on intermediate outputs, revisits flawed steps, and guides iterative refinement. This integration of personality-driven agent design and structured collaboration improves both reasoning depth and flexibility. Empirical evaluations across three benchmarks demonstrate the strong performance of MAPS, with further analysis confirming its generalizability across different large language models and validating the benefits of multi-agent collaboration.
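Concretely, personality shaping can be as simple as conditioning each agent's system prompt on a Big Five trait, with a Critic pass over the drafts. The trait-to-instruction mapping below is my own invention for illustration; in MAPS both the agents and the Critic would be LLM calls:

```python
# Hypothetical mapping from Big Five traits to reasoning-style instructions.
BIG_FIVE_STYLES = {
    "openness": "Explore unconventional solution paths before committing.",
    "conscientiousness": "Work step by step and double-check each result.",
    "extraversion": "Propose ideas early and build on other agents' answers.",
    "agreeableness": "Seek consensus; reconcile conflicting intermediate answers.",
    "neuroticism": "Flag uncertain steps and list possible failure modes.",
}

def agent_prompt(trait, task):
    # Each agent gets a distinct style, promoting heterogeneous reasoning.
    return f"[Personality: {trait}] {BIG_FIVE_STYLES[trait]}\nTask: {task}"

def critic_round(drafts):
    """Stand-in for the Critic agent: gather drafts for reflection.
    A real Critic would revisit flawed steps and guide refinement."""
    return "Review and refine:\n" + "\n".join(
        f"- {trait}: {draft}" for trait, draft in drafts.items())

p = agent_prompt("openness", "Prove the sum of two even numbers is even.")
c = critic_round({"openness": "Try a parity argument.",
                  "conscientiousness": "Write 2a + 2b = 2(a + b)."})
```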

[270] Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining

Yu Shi, Yitong Duan, Jian Li

Main category: cs.AI

TL;DR: A novel framework combining LLMs with MCTS for automated alpha factor mining that generates interpretable formulas with superior performance compared to existing methods.

Motivation: Traditional formulaic alpha mining relies on human expertise, while automated methods suffer from search inefficiency and produce uninterpretable factors.

Method: Integrates LLMs with MCTS, using LLM’s reasoning to generate symbolic formulas guided by financial backtesting feedback, with frequent subtree avoidance for diversity.

Result: Outperforms existing methods on real-world stock data with superior predictive accuracy and trading performance, producing more interpretable formulas.

Conclusion: Establishes a more effective and efficient paradigm for formulaic alpha mining through LLM-MCTS integration.

Abstract: Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often struggle with search inefficiency or yield alpha factors that are difficult to interpret. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our framework leverages the LLM’s instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to enhance search diversity and prevent formulaic homogenization, further improving performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy and trading performance. The resulting formulas are also more amenable to human interpretation, establishing a more effective and efficient paradigm for formulaic alpha mining.
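At the heart of any MCTS-driven search is the selection rule that balances exploiting formulas with good backtest feedback against exploring rarely visited ones. A standard UCB1/UCT sketch, with made-up factor expressions as candidate children (the paper's actual value estimates come from financial backtesting):

```python
import math

def ucb1(total_value, visits, parent_visits, c=1.4):
    """UCT score: mean backtest value plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try an unvisited child first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Candidate alpha-formula expansions: formula -> (total value, visit count).
children = {
    "ts_rank(close, 10)": (1.2, 4),
    "delta(volume, 5)": (0.3, 1),
}
parent_visits = 5
best = max(children, key=lambda f: ucb1(*children[f], parent_visits))
```

The frequent-subtree-avoidance mechanism described in the abstract would additionally penalize expansions whose subtrees recur too often across the search, keeping the mined formulas diverse.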

[271] Learning API Functionality from In-Context Demonstrations for Tool-based Agents

Bhrij Patel, Ashish Jagmohan, Aditya Vempaty

Main category: cs.AI

TL;DR: Learning API functionality from in-context demonstrations instead of documentation, showing this remains challenging for LLMs but improves with explicit function calls and critiques.

Motivation: API documentation is often missing, outdated, privatized, or inconsistent, hindering reliable general-purpose agents that need to understand API functionality.

Method: Proposed learning API functionality from in-context demonstrations collected from expert agents and self-exploration, studying effects of demonstration count, LLM-generated summaries, and evaluations on task success.

Result: Learning from demonstrations is non-trivial for state-of-the-art LLMs; explicit function calls and natural language critiques significantly improve task success through better parameter filling.

Conclusion: Documentation-free API learning from demonstrations presents key challenges for future work in self-improving API-based agents, with identified failure modes and error sources.

Abstract: Digital tool-based agents, powered by Large Language Models (LLMs), that invoke external Application Programming Interfaces (APIs) often rely on documentation to understand API functionality. However, such documentation is frequently missing, outdated, privatized, or inconsistent, hindering the development of reliable, general-purpose agents. In this work, we propose a new research direction: learning of API functionality directly from in-context demonstrations. This task is a new paradigm applicable in scenarios without documentation. Using API benchmarks, we collect demonstrations from both expert agents and from self-exploration. To understand what information demonstrations must convey for successful task completion, we extensively study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent. Our experiments across 3 datasets and 6 models show that learning functionality from in-context demonstrations remains a non-trivial challenge, even for state-of-the-art LLMs. We find that providing explicit function calls and natural language critiques significantly improves the agent’s task success rate due to more accurate parameter filling. We analyze failure modes, identify sources of error, and highlight key open challenges for future work in documentation-free, self-improving, API-based agents.
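The finding that explicit function calls and natural-language critiques help suggests a demonstration format like the one sketched below. Field names and the prompt layout are illustrative, not the paper's format:

```python
def build_demo_prompt(demos, query):
    """Assemble in-context demonstrations (explicit function call plus a
    natural-language critique) into a few-shot prompt for the agent."""
    parts = []
    for d in demos:
        parts.append(
            f"Task: {d['task']}\nCall: {d['call']}\nCritique: {d['critique']}")
    parts.append(f"Task: {query}\nCall:")  # leave the final call for the model
    return "\n\n".join(parts)

demos = [{
    "task": "Weather in Paris",
    "call": "get_weather(city='Paris', units='metric')",  # hypothetical API
    "critique": "Correct; 'units' was required, not optional.",
}]
prompt = build_demo_prompt(demos, "Weather in Tokyo")
```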

[272] Higher-Order Responsibility

Junli Jiang, Pavel Naumov

Main category: cs.AI

TL;DR: The paper analyzes whether higher-order responsibility up to degree d can close the responsibility gap in group decision-making, showing this problem is Π_{2d+1}-complete.

Motivation: Frankfurt's principle of alternative possibilities fails in group settings, creating responsibility gaps where no party can be held responsible. Existing approaches like group responsibility and higher-order responsibility aim to address this issue.

Method: The paper investigates the computational complexity of determining if higher-order responsibility up to degree d is sufficient to close responsibility gaps in group decision-making scenarios.

Result: The main technical result proves that deciding whether higher-order responsibility up to degree d closes the responsibility gap is Π_{2d+1}-complete, indicating high computational complexity.

Conclusion: Higher-order responsibility provides a theoretical framework for addressing responsibility gaps in group settings, but determining its sufficiency is computationally complex, with complexity increasing with the degree of responsibility considered.

Abstract: In ethics, individual responsibility is often defined through Frankfurt’s principle of alternative possibilities. This definition is not adequate in a group decision-making setting because it often results in the lack of a responsible party or “responsibility gap”. One of the existing approaches to address this problem is to consider group responsibility. Another, recently proposed, approach is “higher-order” responsibility. The paper considers the problem of deciding if higher-order responsibility up to degree d is enough to close the responsibility gap. The main technical result is that this problem is Π_{2d+1}-complete.

[273] About the Unreal

John Beverley, Jim Logan, Barry Smith

Main category: cs.AI

TL;DR: A framework for representing non-existent entities using intersections of actual types rather than dummy instances or modal logic, positioned within Basic Formal Ontology with practical implementation focus.

Motivation: Traditional approaches to non-existent entities (fictional, hypothetical, future scenarios) either overcommit to metaphysical assumptions or create computational inefficiencies, hindering practical applications.

Method: Models non-existent entities using intersections of actual types rather than specific non-existent tokens, within the Basic Formal Ontology framework with realist commitments.

Result: Develops a structured ontology-driven approach that provides a computationally viable means of handling references to hypothetical or non-existent entities.

Conclusion: The proposed framework offers a practical, implementable solution for representing information about non-existent entities that balances philosophical rigor with computational efficiency.

Abstract: This paper introduces a framework for representing information about entities that do not exist or may never exist, such as those involving fictional entities, blueprints, simulations, and future scenarios. Traditional approaches that introduce “dummy instances” or rely on modal logic are criticized, and a proposal is defended in which such cases are modeled using the intersections of actual types rather than specific non-existent tokens. The paper positions itself within the Basic Formal Ontology and its realist commitments, emphasizing the importance of practical, implementable solutions over purely metaphysical or philosophical proposals, arguing that existing approaches to non-existent entities either overcommit to metaphysical assumptions or introduce computational inefficiencies that hinder applications. By developing a structured ontology-driven approach to unreal patterns, the paper aims to provide a useful and computationally viable means of handling references to hypothetical or non-existent entities.

[274] Conversational Intent-Driven GraphRAG: Enhancing Multi-Turn Dialogue Systems through Adaptive Dual-Retrieval of Flow Patterns and Context Semantics

Ziqi Zhu, Tao Hu, Honglong Zhang, Dan Yang, HanGeng Chen, Mengran Zhang, Xilun Chen

Main category: cs.AI

TL;DR: CID-GraphRAG is a novel framework that combines intent transition graphs with semantic retrieval to improve multi-turn customer service dialogues, outperforming existing RAG approaches by balancing contextual coherence and goal-oriented progression.

Motivation: Existing dialogue systems struggle to maintain both contextual coherence and goal-oriented progression in multi-turn customer service conversations, with traditional RAG systems relying solely on either semantic similarity or knowledge graphs.

Method: Constructs dynamic intent transition graphs from historical dialogues and implements dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search.

Result: Significantly outperforms Conversation RAG and GraphRAG baselines with 11% BLEU, 5% ROUGE-L, 6% METEOR improvements, and 58% improvement in response quality according to LLM-as-judge evaluations.

Conclusion: Integration of intent transition structures with semantic retrieval creates synergistic effects, establishing CID-GraphRAG as effective for maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.

Abstract: We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity (Conversation RAG) or standard knowledge graphs (GraphRAG), CID-GraphRAG constructs dynamic intent transition graphs from historical dialogues that achieved their goals and implements a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search. This approach enables the system to simultaneously leverage both conversational intent flow patterns and contextual semantics, significantly improving retrieval quality and response quality. In extensive experiments on real-world customer service dialogues, we employ both automatic metrics and LLM-as-judge assessments, demonstrating that CID-GraphRAG significantly outperforms both semantic-based Conversation RAG and intent-based GraphRAG baselines across all evaluation criteria. Quantitatively, CID-GraphRAG demonstrates substantial improvements over Conversation RAG across automatic metrics, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and most notably, a 58% improvement in response quality according to LLM-as-judge evaluations. These results demonstrate that the integration of intent transition structures with semantic retrieval creates a synergistic effect that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for addressing the challenges of maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.
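The dual-retrieval idea reduces, at its simplest, to blending an intent-transition score from the graph with a semantic similarity score. The sketch below fixes the blend weight `alpha`, whereas the actual system adapts it per turn; all names and numbers are illustrative:

```python
def dual_retrieve(candidates, current_intent, intent_graph, alpha=0.5):
    """Score each candidate response by a weighted blend of (1) the
    transition probability from the current intent in the graph and
    (2) its semantic similarity to the dialogue context."""
    def score(c):
        graph_score = intent_graph.get(current_intent, {}).get(c["intent"], 0.0)
        return alpha * graph_score + (1 - alpha) * c["semantic_sim"]
    return max(candidates, key=score)

# Intent transition graph mined from goal-achieving dialogues (toy numbers).
intent_graph = {"report_issue": {"ask_order_id": 0.7, "apologize": 0.2}}
candidates = [
    {"intent": "ask_order_id", "semantic_sim": 0.4,
     "reply": "Could you share your order ID?"},
    {"intent": "apologize", "semantic_sim": 0.6,
     "reply": "Sorry to hear that!"},
]
best = dual_retrieve(candidates, "report_issue", intent_graph)
```

Here the semantically closer reply loses to the one the intent graph says usually moves the conversation toward its goal, which is the behavior the dual mechanism is designed to produce.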

[275] Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind

Myung Ho Kim

Main category: cs.AI

TL;DR: Agentic Flow architecture unintentionally converges four major cognitive theories (Kahneman, Friston, Minsky, Clark) through practical AI implementation, achieving 95.8% task success vs 62.3% for baseline LLMs.

Motivation: To address limitations of large language models by creating a practical AI architecture that naturally converges multiple cognitive theories through implementation demands rather than deliberate synthesis.

Method: Developed Agentic Flow with five interlocking modules (Retrieval, Cognition, Control, Memory, Action) organized into a repeatable cognitive loop, later formalized as Structured Cognitive Loop (SCL).

Result: Structured agent achieved 95.8% task success versus 62.3% for baseline LLMs, demonstrating robust constraint adherence and reproducible reasoning.

Conclusion: Intelligent architectures may evolve toward shared structural patterns shaped by practical constraints, with Agentic Flow/SCL showing how unified cognitive forms emerge from real-world reasoning necessities rather than abstraction.

Abstract: We report a structural convergence among four influential theories of mind: Kahneman’s dual-system theory, Friston’s predictive processing, Minsky’s society of mind, and Clark’s extended mind, emerging unintentionally within a practical AI architecture known as Agentic Flow. Designed to address the limitations of large language models (LLMs), Agentic Flow comprises five interlocking modules: Retrieval, Cognition, Control, Memory, and Action, organized into a repeatable cognitive loop. Although originally inspired only by Minsky and Clark, subsequent analysis revealed that its structure echoes computational motifs from all four theories, suggesting that theoretical convergence can emerge naturally from implementation demands rather than deliberate synthesis. Controlled evaluations confirmed this: the structured agent achieved 95.8% task success versus 62.3% for baseline LLMs, demonstrating robust constraint adherence and reproducible reasoning. We describe this convergence under a broader descriptive meta-architecture called PEACE, highlighting recurring design patterns such as predictive modeling, associative recall, and error-sensitive control. Later formalized as the Structured Cognitive Loop (SCL), this framework generalizes the same principles as a foundation for behavioral intelligence in LLM-based agents. Rather than claiming theoretical unification, this paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical constraints. As a position paper, it aims to frame this convergence as an interpretive reflection rather than a finalized theory, inviting further theoretical and experimental dialogue. Agentic Flow, or equivalently the Structured Cognitive Loop, thus offers a glimpse of how a unified cognitive form can arise not from abstraction, but from the necessities of real-world reasoning.
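The five-module loop can be sketched as plain control flow, with each module stubbed by a function; in Agentic Flow these would be LLM-backed components, and the real Control module enforces constraints rather than returning a constant:

```python
def cognitive_loop(task, modules, max_steps=3):
    """One pass of the Retrieval -> Cognition -> Control -> Memory -> Action
    loop; a minimal sketch, not the paper's implementation."""
    state = {"task": task, "memory": []}
    for _ in range(max_steps):
        state["context"] = modules["retrieval"](state)   # gather evidence
        state["plan"] = modules["cognition"](state)      # propose a step
        if modules["control"](state):                    # constraint check
            state["memory"].append(state["plan"])        # persist the step
            return modules["action"](state)              # execute
    return None  # control never approved a plan within the budget

modules = {
    "retrieval": lambda s: f"docs for {s['task']}",
    "cognition": lambda s: f"answer({s['task']})",
    "control":   lambda s: True,
    "action":    lambda s: s["plan"],
}
result = cognitive_loop("2+2", modules)
```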

[276] Pushdown Reward Machines for Reinforcement Learning

Giovanni Varricchione, Toryn Q. Klassen, Natasha Alechina, Mehdi Dastani, Brian Logan, Sheila A. McIlraith

Main category: cs.AI

TL;DR: This paper introduces pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata that can recognize and reward behaviors representable in deterministic context-free languages, making them more expressive than regular reward machines.

Motivation: To extend the expressiveness of reward machines beyond regular languages to handle more complex temporally extended behaviors representable in deterministic context-free languages, thereby improving reinforcement learning capabilities for more complex tasks.

Method: Proposed pushdown reward machines (pdRMs) based on deterministic pushdown automata, introduced two policy variants (full stack access vs top k symbols), developed theoretical analysis of expressive power and complexity, and created an off-policy RL approach exploiting counterfactual experiences.

Result: Theoretical results established the expressive power of pdRMs and space complexity bounds, while experimental results demonstrated successful training of agents to perform tasks representable in deterministic context-free languages using pdRMs.

Conclusion: pdRMs successfully extend reward machines to handle deterministic context-free languages, providing a more expressive framework for reinforcement learning with improved capabilities for complex temporally extended behaviors.

Abstract: Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognise and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top $k$ symbols (for a given constant $k$) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant $k$) achieve the same optimal state values. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results for the proposed learning problems. Lastly, we propose an approach for off-policy RL algorithms that exploits counterfactual experiences with pdRMs. We conclude by providing experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
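A toy example makes the expressiveness gain concrete: rewarding traces in which every "open" event is matched by a later "close" requires counting unbounded nesting depth, which a stack handles but no finite-state reward machine can. This sketch is mine, not the paper's formalism:

```python
def pdrm_reward(trace):
    """Toy pushdown reward machine: reward 1.0 iff the trace is a
    balanced sequence of 'open'/'close' events (a deterministic
    context-free property beyond regular reward machines)."""
    stack = []
    for event in trace:
        if event == "open":
            stack.append(event)          # push on the pdRM's stack
        elif event == "close":
            if not stack:
                return 0.0               # unmatched close: reject
            stack.pop()
    return 1.0 if not stack else 0.0     # reward only if fully matched

r1 = pdrm_reward(["open", "open", "close", "close"])
r2 = pdrm_reward(["open", "close", "close"])
```

The paper's top-k policy variant corresponds to letting the agent observe only `stack[-k:]` rather than the whole stack.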

[277] LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval

Yaoze Zhang, Rong Wu, Pinlong Cai, Xiaoman Wang, Guohang Yan, Song Mao, Ding Wang, Botian Shi

Main category: cs.AI

TL;DR: LeanRAG is a knowledge graph-based RAG framework that addresses semantic isolation and inefficient retrieval by creating navigable semantic networks and using structure-guided retrieval to improve response quality while reducing redundancy.

Motivation: Current knowledge graph-based RAG methods suffer from disconnected semantic islands in hierarchical structures and inefficient flat retrieval that fails to exploit graph topology, compromising effectiveness.

Method: Uses semantic aggregation to form entity clusters with explicit relations, creating navigable semantic networks, followed by bottom-up structure-guided retrieval that anchors queries to fine-grained entities and traverses semantic pathways.

Result: Significantly outperforms existing methods on four QA benchmarks across different domains while reducing retrieval redundancy by 46%.

Conclusion: LeanRAG effectively overcomes limitations of hierarchical knowledge graph RAG by enabling cross-community reasoning and efficient structure-aware retrieval, achieving better performance with reduced overhead.

Abstract: Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, whereas the effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected “semantic islands”, lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph’s rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and then systematically traverses the graph’s semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG can mitigate the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks with different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Code is available at: https://github.com/RaZzzyz/LeanRAG
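The bottom-up, structure-guided retrieval step can be sketched as a bounded traversal that anchors on fine-grained entities and walks upward through explicit relations, collecting each node's summary once. The graph and summaries below are toy data; the real system uses LLM-built aggregations:

```python
from collections import deque

def bottom_up_retrieve(query_entities, graph, summaries, max_hops=2):
    """Anchor the query to fine-grained entities, then traverse the
    semantic network outward up to max_hops, gathering each summary
    once (illustrative of structure-guided retrieval, not flat search)."""
    seen, evidence = set(), []
    frontier = deque((e, 0) for e in query_entities)
    while frontier:
        node, hops = frontier.popleft()
        if node in seen or hops > max_hops:
            continue  # dedup avoids redundant retrieval
        seen.add(node)
        evidence.append(summaries[node])
        for neighbor in graph.get(node, []):
            frontier.append((neighbor, hops + 1))
    return evidence

graph = {"Marie Curie": ["Radioactivity"], "Radioactivity": ["Physics"]}
summaries = {
    "Marie Curie": "Pioneered research on radioactivity.",
    "Radioactivity": "Spontaneous emission of radiation from nuclei.",
    "Physics": "Study of matter and energy.",
}
evidence = bottom_up_retrieve(["Marie Curie"], graph, summaries)
```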

[278] Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning

Bingning Huang, Tu Nguyen, Matthieu Zimmer

Main category: cs.AI

TL;DR: The paper proposes Staged Advantage Estimation (SAE) to improve policy optimization in preference-based RL by leveraging MCTS-derived trajectories and addressing the challenge of computing advantages for samples from different prefixes with distinct expected returns.

Motivation: To explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in preference-based reinforcement learning, specifically focusing on Group Relative Policy Optimization (GRPO).

Method: Reframe GRPO into a staged training paradigm using teacher’s MCTS rollouts to construct a tree-structured curriculum of prefixes, and propose SAE framework for computing low-variance, prefix-aware advantages by projecting rewards onto a constraint set that respects the tree’s hierarchy.

Result: Empirical results on mathematical reasoning tasks show that SAE improves final accuracy over standard GRPO, with theoretical analysis confirming that SAE reduces gradient variance leading to improved sample efficiency.

Conclusion: SAE provides a principled approach to leverage MCTS trajectories for policy optimization in preference-based RL, demonstrating improved performance through reduced gradient variance and better sample efficiency.

Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We reframe GRPO into a staged training paradigm, leveraging a teacher’s MCTS rollouts to construct a tree-structured curriculum of prefixes. This introduces the novel challenge of computing advantages for training samples that originate from different prefixes, each with a distinct expected return. To address this, we propose Staged Advantage Estimation (SAE), a framework for computing low-variance, prefix-aware advantages by projecting rewards onto a constraint set that respects the tree’s hierarchy. Our empirical results on mathematical reasoning tasks show that SAE improves final accuracy over standard GRPO. This outcome is grounded in our theoretical analysis, which confirms that SAE reduces gradient variance, a principled path to improved sample efficiency. We demonstrate this through practical SAE implementations, comparing efficient heuristics against a formal quadratic program.
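The core difficulty is visible in code: standard GRPO normalizes rewards within one group, but rollouts from different MCTS prefixes have different expected returns, so naive pooling miscredits them. A minimal sketch of the prefix-aware idea, normalizing within each prefix group (this is only the grouping intuition; SAE's actual projection onto the tree-respecting constraint set is omitted):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: reward minus group mean, scaled by group std."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mu) / sd for r in rewards]

def prefix_aware_advantages(samples):
    """Normalize only against siblings sharing the same MCTS prefix, so
    rollouts are compared at equal expected return."""
    by_prefix = {}
    for s in samples:
        by_prefix.setdefault(s["prefix"], []).append(s)
    out = {}
    for group in by_prefix.values():
        advs = group_relative_advantages([s["reward"] for s in group])
        for s, a in zip(group, advs):
            out[s["id"]] = a
    return out

samples = [
    {"id": "a1", "prefix": "p1", "reward": 1.0},
    {"id": "a2", "prefix": "p1", "reward": 0.0},
    {"id": "b1", "prefix": "p2", "reward": 0.5},  # lone sibling: zero advantage
]
advs = prefix_aware_advantages(samples)
```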

[279] Exploring the Paradigm Shift from Grounding to Skolemization for Complex Query Answering on Knowledge Graphs

Yuyin Lu, Hegang Chen, Shanrui Xie, Yanghui Rao, Haoran Xie, Fu Lee Wang, Qing Li

Main category: cs.AI

TL;DR: LVSA is a neuro-symbolic framework that unifies differentiable Skolemization and neural negation to efficiently answer complex queries over incomplete knowledge graphs while maintaining logical consistency.

Motivation: Address the fundamental tradeoff between logic fidelity and computational efficiency in Complex Query Answering over incomplete Knowledge Graphs, where existing methods either suffer from combinatorial explosion (Grounding-based) or compromise logical consistency (Skolemization-based).

Method: Propose Logic-constrained Vector Symbolic Architecture (LVSA) with differentiable Skolemization module, neural negator, and logical constraint-driven optimization protocol to harmonize geometric and logical requirements.

Result: Theoretically guarantees universality for all EFO₁ queries with low computational complexity. Empirically outperforms state-of-the-art Skolemization-based methods and reduces inference costs by orders of magnitude compared to Grounding-based baselines.

Conclusion: LVSA provides an effective paradigm shift in Complex Query Answering by systematically addressing the Grounding-Skolemization dichotomy through a neuro-symbolic approach that balances computational efficiency with logical consistency.

Abstract: Complex Query Answering (CQA) over incomplete Knowledge Graphs (KGs), typically formalized as reasoning with Existential First-Order predicate logic with one free variable (EFO₁), faces a fundamental tradeoff between logic fidelity and computational efficiency. This work establishes a Grounding-Skolemization dichotomy to systematically analyze this challenge and motivate a paradigm shift in CQA. While Grounding-based methods inherently suffer from combinatorial explosion, most Skolemization-based methods neglect to explicitly model Skolem functions and compromise logical consistency. To address these limitations, we propose the Logic-constrained Vector Symbolic Architecture (LVSA), a neuro-symbolic framework that unifies a differentiable Skolemization module and a neural negator, as well as a logical constraint-driven optimization protocol to harmonize geometric and logical requirements. Theoretically, LVSA guarantees universality for all EFO₁ queries with low computational complexity. Empirically, it outperforms state-of-the-art Skolemization-based methods and reduces inference costs by orders of magnitude compared to Grounding-based baselines.

[280] AgentFlux: Decoupled Fine-Tuning & Inference for On-Device Agentic Systems

Rohan Kadekodi, Zhan Jin, Keisuke Kamahori, Yile Gu, Sean Khatiri, Noah H. Bayindirli, Sergey Gorbunov, Baris Kasikci

Main category: cs.AI

TL;DR: Decoupled fine-tuning method improves local LLM tool calling by 46% through separate tool selection and argument generation adapters, enabling efficient on-device agent orchestration.

Motivation: Local LLMs underperform frontier models in tool calling, struggling with tool selection from large sets and accurate argument generation for complex parameters, requiring privacy-preserving on-device solutions.

Method: Decoupled fine-tuning using LoRA adapters for separate tool selection and argument generation tasks with loss masking, plus AgentFlux framework for dynamic adapter loading and hierarchical orchestration.

Result: Qwen-2.5-7B model with decoupled fine-tuning improves tool calling accuracy by 46%, outperforms similar-sized models and often larger models on MCP-Bench benchmark.

Conclusion: The proposed decoupled fine-tuning approach and AgentFlux framework enable local LLMs to achieve competitive tool calling performance while maintaining privacy and cost-effectiveness for on-device deployment.

Abstract: The deployment of Large Language Models (LLMs) as agentic orchestrators has revolutionized task automation, but the need for privacy-preserving, cost-effective solutions demands on-device inference capabilities. However, local LLMs consistently underperform compared to frontier models in tool calling scenarios, struggling with both tool selection from large tool sets and accurate argument generation for complex parameter structures. We introduce a methodology that disaggregates a tool-calling task into two distinct subtasks: tool selection and argument generation. We propose “decoupled fine-tuning”, a novel post-training approach that employs LoRA fine-tuning to create dedicated LoRA adapters for tool selection and tool-specific argument generation using separate loss masking for each of the subtasks. Furthermore, we present AgentFlux, an inference framework that leverages the LoRA adapters created using decoupled fine-tuning to perform efficient agent orchestration with the help of local models on end-user devices. AgentFlux decomposes the tool-call generation step into tool selection and argument generation, and dynamically loads the corresponding LoRA adapters to generate tool calls. Additionally, AgentFlux implements hierarchical orchestration to restrict the number of tools required for tool selection. Our experiments on the MCP-Bench benchmark demonstrate that the Qwen-2.5-7B model trained using decoupled fine-tuning improves the tool calling accuracy of the base model by 46%, outperforms other local reasoning, non-reasoning, and fine-tuned models of similar size in all cases, and outperforms models twice its size in most cases.
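The per-subtask loss masking can be illustrated with the usual "ignore index" convention from cross-entropy training. This is a sketch under assumed interfaces, not AgentFlux's actual code; the span offsets would come from the tokenizer in practice:

```python
def mask_labels(token_ids, span, ignore_index=-100):
    """Keep supervision only on the tokens inside `span` (start, end).

    Decoupled fine-tuning trains two LoRA adapters from the same traces:
    the tool-selection adapter keeps loss only on the tool-name tokens,
    while the argument-generation adapter keeps loss only on the argument
    tokens. Everything else is set to the ignore index so it contributes
    no gradient.
    """
    start, end = span
    return [t if start <= i < end else ignore_index
            for i, t in enumerate(token_ids)]
```

With labels masked this way, each adapter sees the full context but is only optimized on its own subtask's output span.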

[281] Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents

Myung Ho Kim

Main category: cs.AI

TL;DR: The Structured Cognitive Loop (SCL) separates cognition, memory, and control in LLM agents, achieving 86.3% task success rate vs 70.5-76.8% for baselines like ReAct and LangChain.

Motivation: Existing LLM agent frameworks mix cognition, memory, and control in single prompts, reducing coherence and predictability for multi-step tasks.

Method: SCL architecture separates functions: LLM handles cognition, external memory storage, and lightweight controller guides execution in goal-directed loop with intermediate verification.

Result: SCL achieved 86.3% average task success rate across travel planning, email drafting, and image generation tasks, outperforming baselines (70.5-76.8%) with higher goal fidelity and fewer redundant calls.

Conclusion: Separating cognition, memory, and control enhances reliability and interpretability without requiring larger models or heavier prompts, though findings are preliminary and broader testing is planned.

Abstract: Large language models have advanced natural language understanding and generation, but their use as autonomous agents introduces architectural challenges for multi-step tasks. Existing frameworks often mix cognition, memory, and control in a single prompt, reducing coherence and predictability. The Structured Cognitive Loop (SCL) is proposed as an alternative architecture that separates these functions. In SCL, the language model handles cognition, memory is stored externally, and execution is guided by a lightweight controller within a goal-directed loop. This design allows intermediate results to be recorded and verified before actions are taken, improving traceability and evaluation. SCL is evaluated against prompt-based baselines such as ReAct and LangChain agents across three tasks: travel planning, conditional email drafting, and constraint-guided image generation. Under matched settings, SCL achieves an average task success rate of 86.3 percent, compared with 70.5 to 76.8 percent for baselines. It also shows higher goal fidelity, fewer redundant calls, and reduced unsupported assertions. These results indicate that separating cognition, memory, and control can enhance reliability and interpretability without relying on larger models or heavier prompts. The findings should be regarded as preliminary evidence, with broader tests across model families and task domains planned for future work.
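The separation SCL describes can be sketched as a small controller loop. The interfaces below (`think`, `verify`, `act`) are hypothetical stand-ins for the LLM, the verifier, and the executor; this is an illustration of the architecture, not the paper's implementation:

```python
def structured_cognitive_loop(goal, think, verify, act, max_steps=5):
    """Minimal SCL sketch: cognition, memory, and control kept separate.

    `think` plays the language model (cognition), `memory` is external
    storage rather than prompt state, and this function is the lightweight
    controller: every intermediate result is recorded and verified before
    any action is taken.
    """
    memory = []                     # external memory, outside the prompt
    for _ in range(max_steps):
        step = think(goal, memory)  # cognition: propose the next step
        memory.append(step)         # record the intermediate result
        if not verify(step):        # verify before acting
            continue                # retry rather than act on a bad step
        if act(step):               # execution; True means goal reached
            return memory
    return memory
```

Because every step lands in `memory` before execution, the run is traceable after the fact, which is the interpretability benefit the abstract claims.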

[282] Simpliflow: A Lightweight Open-Source Framework for Rapid Creation and Deployment of Generative Agentic AI Workflows

Deven Panchal

Main category: cs.AI

TL;DR: simpliflow is a lightweight Python framework for building deterministic agentic AI workflows with simple JSON configuration, designed to reduce complexity compared to existing frameworks.

Motivation: Existing frameworks for generative agentic AI systems introduce significant complexity, steep learning curves, and boilerplate code, hindering rapid prototyping and deployment.

Method: Uses declarative JSON-based configuration for linear, deterministic workflows, with modular architecture separating agent management, workflow execution, and post-processing. Integrates with LiteLLM for support of 100+ LLMs.

Result: Enables rapid development of agentic workflows through diverse use cases including software development simulation and real-time system interaction.

Conclusion: simpliflow offers a unique position optimized for simplicity, control, and speed in deterministic workflow environments compared to frameworks like LangChain and AutoGen.

Abstract: Generative Agentic AI systems are emerging as a powerful paradigm for automating complex, multi-step tasks. However, many existing frameworks for building these systems introduce significant complexity, a steep learning curve, and substantial boilerplate code, hindering rapid prototyping and deployment. This paper introduces simpliflow, a lightweight, open-source Python framework designed to address these challenges. simpliflow enables the rapid development and orchestration of linear, deterministic agentic workflows through a declarative, JSON-based configuration. Its modular architecture decouples agent management, workflow execution, and post-processing, promoting ease of use and extensibility. By integrating with LiteLLM, it supports over 100 Large Language Models (LLMs) out-of-the-box. We present the architecture, operational flow, and core features of simpliflow, demonstrating its utility through diverse use cases ranging from software development simulation to real-time system interaction. A comparative analysis with prominent frameworks like LangChain and AutoGen highlights simpliflow’s unique position as a tool optimized for simplicity, control, and speed in deterministic workflow environments.
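A declarative, linear workflow of this kind might look as follows. The JSON schema and the `run` helper are invented for illustration and almost certainly differ from simpliflow's actual configuration format:

```python
import json

# Hypothetical workflow spec in the spirit of simpliflow's declarative
# JSON configuration; the real schema may differ.
spec = json.loads("""
{
  "workflow": "code_review_sim",
  "agents": [
    {"name": "developer", "task": "draft a function"},
    {"name": "reviewer",  "task": "critique the draft"}
  ]
}
""")

def run(spec, call_llm):
    """Execute the agents linearly and deterministically, piping each
    agent's output into the next agent's context."""
    context = ""
    for agent in spec["agents"]:
        context = call_llm(agent["name"], agent["task"], context)
    return context
```

The point of the linear, deterministic design is that the JSON fully determines the execution order, so there is no planner or dynamic routing to debug.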

[283] LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: LiveResearchBench is a benchmark of 100 expert-curated tasks requiring extensive web search and synthesis, with DeepEval providing comprehensive evaluation metrics for citation-grounded long-form reports.

Motivation: Existing benchmarks for deep research systems fail to meet essential principles: they lack user-centricity, dynamic information requirements, unambiguous tasks, and multi-faceted search intensity needed for realistic evaluation.

Method: Developed LiveResearchBench with 100 expert-curated tasks spanning daily life, enterprise, and academia, requiring dynamic web search and synthesis. Created DeepEval evaluation suite covering content- and report-level quality with four complementary evaluation protocols.

Result: Comprehensive evaluation of 17 frontier deep research systems revealed current strengths, recurring failure modes, and identified key system components needed for advancing reliable deep research capabilities.

Conclusion: LiveResearchBench and DeepEval provide a rigorous framework for systematically evaluating deep research systems, highlighting the need for improved web search, synthesis, and citation capabilities in agentic systems.

Abstract: Deep research – producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources – marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.

[284] From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL

Ali Khosravi Kazazi, Zhenlong Li, M. Naser Lessani, Guido Cervone

Main category: cs.AI

TL;DR: A multi-agent framework that translates natural language to spatial SQL queries, achieving 87.7% accuracy on spatial queries through specialized agents and self-verification.

Motivation: To overcome the complexity barriers of SQL and geospatial functions for non-experts, addressing limitations of single-agent Text-to-SQL approaches in handling spatial query complexities.

Method: Multi-agent framework with knowledge base, schema profiling, semantic enrichment, embeddings for context retrieval, and specialized agents for entity extraction, metadata retrieval, query logic formulation, SQL generation, and programmatic/semantic validation.

Result: 81.2% overall accuracy on KaggleDBQA (221/272 questions) and 87.7% accuracy on SpatialQueryQA (79/90 spatial questions), with significant improvement from 76.7% without the review agent.

Conclusion: The system makes spatial analysis more accessible and provides a robust foundation for spatial Text-to-SQL systems, advancing autonomous GIS development.

Abstract: The complexity of Structured Query Language (SQL) and the specialized nature of geospatial functions in tools like PostGIS present significant barriers to non-experts seeking to analyze spatial data. While Large Language Models (LLMs) offer promise for translating natural language into SQL (Text-to-SQL), single-agent approaches often struggle with the semantic and syntactic complexities of spatial queries. To address this, we propose a multi-agent framework designed to accurately translate natural language questions into spatial SQL queries. The framework integrates several innovative components, including a knowledge base with programmatic schema profiling and semantic enrichment, embeddings for context retrieval, and a collaborative multi-agent pipeline as its core. This pipeline comprises specialized agents for entity extraction, metadata retrieval, query logic formulation, SQL generation, and a review agent that performs programmatic and semantic validation of the generated SQL to ensure correctness (self-verification). We evaluate our system using both the non-spatial KaggleDBQA benchmark and a new, comprehensive SpatialQueryQA benchmark that includes diverse geometry types, predicates, and three levels of query complexity. On KaggleDBQA, the system achieved an overall accuracy of 81.2% (221 out of 272 questions) after the review agent’s review and corrections. For spatial queries, the system achieved an overall accuracy of 87.7% (79 out of 90 questions), compared with 76.7% without the review agent. Beyond accuracy, results also show that in some instances the system generates queries that are more semantically aligned with user intent than those in the benchmarks. This work makes spatial analysis more accessible, and provides a robust, generalizable foundation for spatial Text-to-SQL systems, advancing the development of autonomous GIS.
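The agent pipeline with self-verification can be sketched as a chain of stage functions plus a review loop. The stage and review interfaces are illustrative, not the paper's actual agents:

```python
def spatial_text_to_sql(question, stages, review, max_retries=2):
    """Sketch of the multi-agent pipeline with a review agent.

    `stages` is an ordered list of agent functions (entity extraction,
    metadata retrieval, query-logic formulation, SQL generation), each
    transforming a shared state dict; `review` performs programmatic and
    semantic validation, returning (ok, possibly_corrected_sql).
    """
    state = {"question": question}
    for stage in stages:
        state = stage(state)        # each agent enriches the shared state
    sql = state["sql"]
    for _ in range(max_retries):    # self-verification loop
        ok, sql = review(sql)
        if ok:
            break
    return sql
```

The benchmark numbers above (87.7% vs. 76.7% on spatial queries) are attributed precisely to this final review step, which is why it sits in its own retry loop rather than inside the generation stage.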

[285] Mixed-Density Diffuser: Efficient Planning with Non-Uniform Temporal Resolution

Crimson Stambaugh, Rajesh P. N. Rao

Main category: cs.AI

TL;DR: MDD is a diffusion planner with tunable temporal density hyperparameters that achieves state-of-the-art performance across multiple D4RL benchmarks by allowing non-uniform step skipping throughout the planning horizon.

Motivation: While sparse-step planning in diffusion models captures long-term dependencies efficiently, excessive sparsity degrades performance. The authors hypothesize that temporal density thresholds are non-uniform across planning horizons, requiring different densities in different trajectory segments.

Method: Proposed Mixed-Density Diffuser (MDD) where temporal densities throughout the planning horizon are tunable hyperparameters, allowing flexible control over step skipping patterns.

Result: MDD achieves new state-of-the-art performance across Maze2D, Franka Kitchen, and Antmaze D4RL task domains.

Conclusion: Adaptive temporal density control in diffusion planning enables better performance by addressing the non-uniform nature of optimal step skipping across different parts of the trajectory horizon.

Abstract: Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term dependencies without additional memory or computational cost. However, predicting excessively sparse plans degrades performance. We hypothesize this temporal density threshold is non-uniform across a temporal horizon and that certain parts of a planned trajectory should be more densely planned. We propose Mixed-Density Diffuser (MDD), a diffusion planner where the densities throughout the horizon are tunable hyperparameters. We show that MDD achieves a new SOTA across the Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
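Non-uniform temporal density amounts to choosing which trajectory indices the planner predicts. The sketch below uses one illustrative configuration (dense near both ends, sparse in the middle); MDD treats these densities as tunable hyperparameters, so the specific pattern here is an assumption, not the paper's prescription:

```python
def mixed_density_indices(horizon, dense_span, dense_stride=1, sparse_stride=4):
    """Select which trajectory steps to plan, at mixed temporal density.

    A sparse backbone covers the whole horizon; extra dense indices are
    added near the start and end of the trajectory. `dense_span` and the
    two strides are the tunable density hyperparameters.
    """
    idx = set()
    for i in range(0, horizon, sparse_stride):    # sparse backbone
        idx.add(i)
    for i in range(0, dense_span, dense_stride):  # dense head...
        idx.add(i)
        idx.add(horizon - 1 - i)                  # ...and dense tail
    return sorted(idx)
```

A uniform stride would force one density threshold on the whole horizon; mixing strides is what lets different trajectory segments get different resolutions.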

[286] Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier

Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro

Main category: cs.AI

TL;DR: Proposes Emotional Rationale Verifier (ERV) and Explanation Reward to improve consistency between emotion predictions and explanations in Multimodal Large Language Models without architectural changes or additional annotations.

Motivation: Current MLLM-based methods generate emotion explanations that often diverge from target labels and contradict predicted emotions, posing risks for misunderstanding and eroding reliability in human-computer interaction.

Method: Introduces Emotional Rationale Verifier (ERV) and Explanation Reward to guide models to produce reasoning explicitly consistent with target emotions during multimodal emotion recognition, without modifying model architecture or requiring additional video-description annotations.

Result: Significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on MAFW and DFEW datasets, enhancing alignment between explanation and prediction through extensive experiments and human evaluations.

Conclusion: The approach empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems by ensuring consistent and faithful emotion explanations.

Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk for misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.
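The shape of such an Explanation Reward can be sketched as a simple consistency check. The scalar payout and the idea that a verifier (like the ERV) supplies the explanation's emotion label are assumptions for illustration:

```python
def explanation_reward(pred_emotion, explanation_emotion, target, bonus=1.0):
    """Reward paid only when the emotion expressed by the explanation
    (as judged by a verifier such as the ERV) agrees with both the target
    label and the model's own prediction. Values are illustrative.
    """
    consistent = (explanation_emotion == target) and (pred_emotion == target)
    return bonus if consistent else 0.0
```

Tying the reward to agreement on both sides is what penalizes the failure mode the abstract describes: an explanation that contradicts the model's own predicted emotion earns nothing even if the prediction itself is correct.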

[287] e1: Learning Adaptive Control of Reasoning Effort

Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, Stefano Soatto

Main category: cs.AI

TL;DR: Adaptive Effort Control enables users to control AI reasoning effort via a continuous parameter, achieving 2-3x reduction in chain-of-thought length while maintaining or improving performance across model scales.

Motivation: Users need fine-grained control over reasoning effort to balance output quality versus latency and cost, but existing methods require knowing problem difficulty beforehand to set token budgets.

Method: Self-adaptive reinforcement learning that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query.

Result: Eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves. Models automatically learn to allocate resources proportionally to task difficulty.

Conclusion: The approach enables dynamic adjustment of cost-accuracy trade-off and achieves significant efficiency gains (2-3x reduction in reasoning length) while maintaining or improving performance across various model sizes.

Abstract: Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables a 2-3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
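The key quantity is a relative target: a user-chosen fraction of the current average chain-of-thought length, rather than an absolute token budget. A minimal sketch of that reward shaping, with the exact penalty form assumed rather than taken from the paper:

```python
def effort_length_penalty(length, effort, running_avg, alpha=1.0):
    """Length-matching penalty for adaptive effort control (sketch).

    The target length is `effort` (a continuous user parameter) times the
    current average chain-of-thought length, so no per-problem difficulty
    estimate is needed; the penalty grows with relative deviation.
    """
    target = effort * running_avg
    return -alpha * abs(length - target) / max(target, 1.0)

def update_avg(running_avg, length, momentum=0.99):
    """Track the average chain-of-thought length with an exponential
    moving average over recent rollouts."""
    return momentum * running_avg + (1 - momentum) * length
```

Because the target tracks the model's own average, a harder query (which drives the average up) automatically earns a larger budget at the same `effort` setting, matching the abstract's observation that resources scale with task difficulty.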

[288] Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu

Main category: cs.AI

TL;DR: Ariadne framework uses RL post-training with synthetic mazes to extend VLMs’ capability boundaries for visual-centric spatial reasoning, achieving significant improvements on both synthetic and real-world benchmarks.

Motivation: To investigate whether RL post-training can truly extend VLMs' inherent capability boundaries for visual-centric spatial tasks where they initially fail, rather than just improving language-dominant tasks.

Method: Ariadne framework using synthetic mazes for multi-step spatial reasoning with controlled difficulty, trained with Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum.

Result: Post-RLVR training achieved over 50% accuracy on problems where base model scored 0%, with zero-shot improvements of 16% on MapBench and 24% on ReasonMap for real-world spatial reasoning tasks.

Conclusion: RL post-training can expand VLMs’ initial capability boundaries and enhance generalization to real-world spatial reasoning, though limited to post-training phase due to pre-training data opaqueness.

Abstract: While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model’s initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model’s fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.

[289] Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis

Daniel Gomm, Cornelius Wolff, Madelon Hulsebos

Main category: cs.AI

TL;DR: The paper reframes query ambiguity in natural language interfaces to tabular data as a cooperative interaction feature, proposing a framework that distinguishes between unambiguous, ambiguous cooperative, and uncooperative queries.

Motivation: To address the challenge of query ambiguity in natural language interfaces to tabular data by treating ambiguity as a feature of cooperative interaction rather than a deficiency, enabling better system design and evaluation.

Method: Developed a principled framework based on shared responsibility between user and system for query specification, distinguishing query types (unambiguous, ambiguous cooperative, uncooperative), and applied it to analyze 15 popular datasets for tabular question answering and analysis.

Result: Found that current datasets mix query types inadequately for evaluating both execution accuracy and interpretation capabilities, revealing limitations in existing evaluation approaches.

Conclusion: The cooperative query resolution framework provides concrete directions for designing and evaluating natural language interfaces for tabular data analysis, with broader implications for future research.

Abstract: Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction where users are intentional about the degree to which they specify queries. We develop a principled framework based on a shared responsibility of query specification between user and system, distinguishing unambiguous and ambiguous cooperative queries, which systems can resolve through reasonable inference, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system’s execution accuracy nor for evaluating interpretation capabilities. This conceptualization around cooperation in resolving queries informs how to design and evaluate natural language interfaces for tabular data analysis, for which we distill concrete directions for future research and broader implications.

[290] Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning

Xinran Li, Yu Liu, Jiaqi Qiao, Xiujuan Xu

Main category: cs.AI

TL;DR: PRC-Emo: A novel ERC framework combining Prompt engineering, demonstration Retrieval, and Curriculum learning to enhance LLMs’ emotion perception in conversations, achieving SOTA results on IEMOCAP and MELD datasets.

Motivation: LLMs have shown potential in Emotion Recognition in Conversation (ERC) but struggle to capture intrinsic connections between explicit and implicit emotions, limiting their ability to understand speakers' psychological states.

Method: Integrates three components: emotion-sensitive prompt templates for explicit/implicit cues, first dedicated demonstration retrieval repository with LLM-generated dialogues, and curriculum learning with weighted emotional shifts in LoRA fine-tuning organized from easy to hard samples.

Result: Achieves new state-of-the-art performance on IEMOCAP and MELD benchmark datasets, demonstrating effectiveness and generalizability in improving LLM-based emotional understanding.

Conclusion: The PRC-Emo framework successfully enhances LLMs’ emotion perception capabilities in conversational contexts through integrated prompt engineering, retrieval, and curriculum learning strategies.

Abstract: Emotion Recognition in Conversation (ERC) is a crucial task for understanding human emotions and enabling natural human-computer interaction. Although Large Language Models (LLMs) have recently shown great potential in this field, their ability to capture the intrinsic connections between explicit and implicit emotions remains limited. We propose a novel ERC training framework, PRC-Emo, which integrates Prompt engineering, demonstration Retrieval, and Curriculum learning, with the goal of exploring whether LLMs can effectively perceive emotions in conversational contexts. Specifically, we design emotion-sensitive prompt templates based on both explicit and implicit emotional cues to better guide the model in understanding the speaker’s psychological states. We construct the first dedicated demonstration retrieval repository for ERC, which includes training samples from widely used datasets, as well as high-quality dialogue examples generated by LLMs and manually verified. Moreover, we introduce a curriculum learning strategy into the LoRA fine-tuning process, incorporating weighted emotional shifts between same-speaker and different-speaker utterances to assign difficulty levels to dialogue samples, which are then organized in an easy-to-hard training sequence. Experimental results on two benchmark datasets – IEMOCAP and MELD – show that our method achieves new state-of-the-art (SOTA) performance, demonstrating the effectiveness and generalizability of our approach in improving LLM-based emotional understanding.
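The curriculum component can be sketched as a difficulty score over weighted emotional shifts. The specific weights and scoring are illustrative assumptions, not PRC-Emo's actual values:

```python
def curriculum_order(dialogues, w_same=1.0, w_diff=2.0):
    """Order dialogues easy-to-hard by weighted emotion shifts (sketch).

    Each dialogue is a list of (speaker, emotion) turns. An emotion shift
    between consecutive turns counts `w_same` when the speaker stays the
    same and `w_diff` when it changes; weights are illustrative.
    """
    def difficulty(dialogue):
        score = 0.0
        for (s1, e1), (s2, e2) in zip(dialogue, dialogue[1:]):
            if e1 != e2:
                score += w_same if s1 == s2 else w_diff
        return score
    return sorted(dialogues, key=difficulty)
```

Fine-tuning (with LoRA, in the paper's setup) then simply consumes the samples in this sorted order, so the model sees emotionally stable dialogues before shift-heavy ones.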

[291] Green AI: A systematic review and meta-analysis of its definitions, lifecycle models, hardware and measurement attempts

Marcel Rojahn, Marcus Grum

Main category: cs.AI

TL;DR: This paper establishes a unified framework for Green AI that addresses environmental burdens across the entire AI lifecycle, including energy, carbon, water, and embodied impacts, with systematic measurement and governance approaches.

Motivation: Current AI environmental burden assessments are heterogeneous, often omit water and value chain effects, and lack comparability and reproducibility, requiring a comprehensive lifecycle approach.

Method: The paper proposes: (i) unified Green AI definition distinct from Sustainable AI; (ii) five-phase lifecycle mapping to LCA stages; (iii) PDCA governance with decision gateways; (iv) hardware/system strategies across edge-cloud continuum; (v) calibrated measurement framework combining estimator models with direct metering.

Result: A comprehensive framework that makes energy, carbon, water, and embodied impacts first-class considerations across the AI lifecycle, enabling reproducible, provider-agnostic comparisons and burden reduction.

Conclusion: The article provides actionable, evidence-based guidance combining definition, lifecycle processes, hardware strategies, and calibrated measurement for researchers, practitioners, and policymakers to systematically address AI’s environmental impacts.

Abstract: Across the Artificial Intelligence (AI) lifecycle - from hardware to development, deployment, and reuse - burdens span energy, carbon, water, and embodied impacts. Cloud provider tools improve transparency but remain heterogeneous and often omit water and value-chain effects, limiting comparability and reproducibility. Addressing these multi-dimensional burdens requires a lifecycle approach linking phase-explicit mapping with system levers (hardware, placement, energy mix, cooling, scheduling) and calibrated measurement across facility, system, device, and workload levels. This article (i) establishes a unified, operational definition of Green AI distinct from Sustainable AI; (ii) formalizes a five-phase lifecycle mapped to Life Cycle Assessment (LCA) stages, making energy, carbon, water, and embodied impacts first-class; (iii) specifies governance via Plan-Do-Check-Act (PDCA) cycles with decision gateways; (iv) systematizes hardware- and system-level strategies across the edge-cloud continuum to reduce embodied burdens; and (v) defines a calibrated measurement framework combining estimator models with direct metering to enable reproducible, provider-agnostic comparisons. Combining definition, lifecycle processes, hardware strategies, and calibrated measurement, this article offers actionable, evidence-based guidance for researchers, practitioners, and policymakers.

[292] A Theoretical Analysis of Detecting Large Model-Generated Time Series

Junji Hou, Junzhou Zhao, Shuo Zhang, Pinghui Wang

Main category: cs.AI

TL;DR: The paper proposes a method to detect synthetic time series generated by Time-Series Large Models (TSLMs) by identifying uncertainty contraction patterns during recursive forecasting.

Motivation: Increasing risks of data misuse and fabrication, and the inapplicability of existing text-based detection methods to time series data due to modality differences (lower information density, smoother distributions).

Method: Proposes the contraction hypothesis that model-generated time series exhibit progressively decreasing uncertainty under recursive forecasting. Introduces Uncertainty Contraction Estimator (UCE), a white-box detector that aggregates uncertainty metrics over successive prefixes.

Result: Empirical validation across diverse datasets shows UCE consistently outperforms state-of-the-art baselines in detecting TSLM-generated time series.

Conclusion: UCE provides a reliable and generalizable solution for detecting model-generated time series by leveraging the uncertainty contraction phenomenon.

Abstract: Motivated by the increasing risks of data misuse and fabrication, we investigate the problem of identifying synthetic time series generated by Time-Series Large Models (TSLMs) in this work. While there is extensive research on detecting model-generated text, we find that these existing methods are not applicable to time series data due to the fundamental modality difference, as time series usually have lower information density and smoother probability distributions than text data, which limit the discriminative power of token-based detectors. To address this issue, we examine the subtle distributional differences between real and model-generated time series and propose the contraction hypothesis, which states that model-generated time series, unlike real ones, exhibit progressively decreasing uncertainty under recursive forecasting. We formally prove this hypothesis under theoretical assumptions on model behavior and time series structure. Model-generated time series exhibit progressively concentrated distributions under recursive forecasting, leading to uncertainty contraction. We provide empirical validation of the hypothesis across diverse datasets. Building on this insight, we introduce the Uncertainty Contraction Estimator (UCE), a white-box detector that aggregates uncertainty metrics over successive prefixes to identify TSLM-generated time series. Extensive experiments on 32 datasets show that UCE consistently outperforms state-of-the-art baselines, offering a reliable and generalizable solution for detecting model-generated time series.
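
The contraction test at the heart of UCE can be sketched in a few lines: track an uncertainty measure over successive prefixes of the series and check whether it trends downward. The proxy below (standard deviation of first differences) and the slope-based score are illustrative stand-ins; the actual UCE aggregates predictive uncertainty derived from a TSLM, which the abstract does not specify in detail.

```python
import numpy as np

def prefix_uncertainties(series):
    """Uncertainty proxy over successive prefixes: std of first differences.
    (A stand-in for the model-derived uncertainty metrics UCE aggregates.)"""
    return np.array([np.std(np.diff(series[:k])) for k in range(4, len(series) + 1)])

def contraction_score(series):
    """Negative fitted slope of uncertainty vs. prefix length;
    a positive score indicates contracting uncertainty."""
    u = prefix_uncertainties(series)
    slope = np.polyfit(np.arange(len(u)), u, 1)[0]
    return -slope

# Toy illustration: a series whose innovations shrink over time (mimicking the
# hypothesized contraction) vs. one with stationary innovations.
rng = np.random.default_rng(0)
contracting = np.cumsum(rng.normal(0.0, np.linspace(2.0, 0.1, 300)))
stationary = np.cumsum(rng.normal(0.0, 1.0, 300))
```

Under this proxy, the shrinking-innovation series yields a clearly positive contraction score while the stationary one hovers near zero, which is the qualitative separation the hypothesis predicts.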

[293] DigiData: Training and Evaluating General-Purpose Mobile Control Agents

Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D’Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg, Karl Ridgeway, Joseph Tighe

Main category: cs.AI

TL;DR: DigiData is a large-scale, high-quality mobile control dataset with complex goals, plus DigiData-Bench benchmark with improved evaluation protocols beyond step-accuracy.

Motivation: To accelerate development of AI agents for controlling user interfaces by providing better training data and evaluation methods for mobile control agents.

Method: Created DigiData dataset through comprehensive app feature exploration (not unstructured interactions), and developed DigiData-Bench with dynamic evaluation protocols and AI-powered assessments.

Result: Produced a diverse, multi-modal dataset with higher goal complexity than existing datasets, and demonstrated limitations of step-accuracy metric while proposing better evaluation alternatives.

Conclusion: These contributions advance mobile control agent development toward more intuitive human-device interactions through improved datasets and evaluation frameworks.

Abstract: AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment. Our contributions aim to significantly advance the development of mobile control agents, paving the way for more intuitive and effective human-device interactions.

[294] Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression

Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.AI

TL;DR: The paper introduces ‘information capacity’ as a unified metric for LLM efficiency, measuring text compression performance relative to computational complexity, enabling fair comparisons across different model sizes and architectures.

Motivation: There is no unified metric that accurately reflects LLM efficiency across different model sizes and architectures, especially with growing computational demands and test-time scaling.

Method: Propose information capacity based on text compression performance relative to computational complexity, evaluating 49 models on 5 heterogeneous datasets while considering tokenizer efficiency.

Result: Models of varying sizes within a series exhibit consistent information capacity, enabling fair efficiency comparisons across model series and accurate performance prediction within a model series.

Conclusion: Information capacity provides a comprehensive efficiency metric that incorporates tokenizer efficiency and reveals consistent patterns across different model architectures and training data.

Abstract: Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further aggravates the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM’s efficiency across different model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. Larger models can predict the next token more accurately, achieving greater compression gains but at higher computational costs. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. This metric enables a fair efficiency comparison across model series and accurate performance prediction within a model series. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts but is often neglected in LLM evaluations. We assess the information capacity of 49 models on 5 heterogeneous datasets and observe consistent results on the influences of tokenizer efficiency, pretraining data, and the mixture-of-experts architecture.
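
The abstract does not give the exact formula, but the core idea, compression performance relative to computational complexity, can be sketched. Below, per-token negative log-likelihoods imply an arithmetic-coding size, and a hypothetical capacity score divides the compression gain by a log-compute term; the normalization is an assumption for illustration, not the paper's definition.

```python
import math

def compressed_bits(nll_nats):
    """Arithmetic-coding size (bits) implied by per-token NLLs given in nats."""
    return sum(nll_nats) / math.log(2)

def information_capacity(nll_nats, raw_bits, flops):
    """Hypothetical capacity score: compression gain per log-FLOP.
    (Illustrative normalization only; the paper defines its own form.)"""
    gain = raw_bits - compressed_bits(nll_nats)
    return gain / math.log10(flops)

# Two hypothetical models scoring 1000 tokens of text that is 8000 bits raw:
# the larger model compresses better but costs two orders of magnitude more compute.
cap_small = information_capacity([0.9] * 1000, raw_bits=8000, flops=1e12)
cap_large = information_capacity([0.7] * 1000, raw_bits=8000, flops=1e14)
```

In this toy setting the smaller model comes out more "capable per unit compute", which is exactly the kind of trade-off the metric is meant to expose.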

[295] Hyperdimensional Decoding of Spiking Neural Networks

Cedrick Kinavuidi, Luca Peres, Oliver Rhodes

Main category: cs.AI

TL;DR: A novel SNN-HDC decoding method combining spiking neural networks with hyperdimensional computing achieves high accuracy, noise robustness, low latency, and significant energy reductions (1.24x-3.67x) compared to existing approaches.

Motivation: To create a decoding method that overcomes limitations of existing approaches by combining SNNs with HDC for high accuracy, noise robustness, low latency, and low energy consumption.

Method: Combines spiking neural networks (SNNs) with hyperdimensional computing (HDC) to create a novel decoding architecture that processes temporal spike patterns efficiently.

Result: Achieved generally better classification accuracy, lower latency, and 1.24x-3.67x energy reduction on DvsGesture dataset and 1.38x-2.27x on SL-Animals-DVS dataset. Can identify 100% of samples from unseen classes.

Conclusion: The SNN-HDC decoding method represents a compelling alternative to both rate and latency decoding methods due to its numerous benefits in accuracy, efficiency, and unknown class identification.

Abstract: This work presents a novel spiking neural network (SNN) decoding method, combining SNNs with Hyperdimensional computing (HDC). The goal is to create a decoding method with high accuracy, high noise robustness, low latency and low energy usage. Compared to analogous architectures decoded with existing approaches, the presented SNN-HDC model attains generally better classification accuracy, lower classification latency and lower estimated energy consumption on multiple test cases from the literature. The SNN-HDC achieved estimated energy consumption reductions ranging from 1.24x to 3.67x on the DvsGesture dataset and from 1.38x to 2.27x on the SL-Animals-DVS dataset. The presented decoding method can also efficiently identify unknown classes it has not been trained on. In the DvsGesture dataset the SNN-HDC model can identify 100% of samples from an unseen/untrained class. Given the numerous benefits shown and discussed in this paper, this decoding method represents a very compelling alternative to both rate and latency decoding.
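
As a rough illustration of the HDC half of such a pipeline (generic HDC practice, not the paper's architecture; all names and sizes are toy choices): input features, here stand-in spike counts, are projected onto fixed random bipolar hypervectors, bundled into class prototypes, and classified by similarity. A low maximum similarity can then be thresholded to flag an unseen class.

```python
import numpy as np

D = 2000          # hypervector dimensionality
n_channels = 16   # toy number of input (spike-count) channels
rng = np.random.default_rng(1)

# Item memory: one fixed random bipolar hypervector per input channel.
item_memory = rng.choice([-1.0, 1.0], size=(n_channels, D))

def encode(spike_counts):
    """Bundle channel hypervectors weighted by centered spike counts."""
    x = spike_counts - spike_counts.mean()  # pattern, not overall rate, drives the code
    return np.sign(item_memory.T @ x)

# Toy class patterns and their prototype hypervectors.
class_a = rng.poisson(5.0, n_channels).astype(float)
class_b = rng.poisson(5.0, n_channels).astype(float)
prototypes = np.stack([encode(class_a), encode(class_b)])

def classify(spike_counts):
    """Nearest prototype by normalized similarity; thresholding the maximum
    similarity would allow rejecting samples as 'unknown class'."""
    sims = prototypes @ encode(spike_counts) / D
    return int(np.argmax(sims))
```

The high dimensionality is what makes this robust: random hypervectors are nearly orthogonal, so distinct patterns map to well-separated prototypes.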

cs.SD

[296] Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

Shulei Ji, Zihao Wang, Jiaxing Yu, Xiangyuan Yang, Shuyu Li, Songruoyao Wu, Kejun Zhang

Main category: cs.SD

TL;DR: Diff-V2M is a hierarchical conditional diffusion model for video-to-music generation that addresses rhythm modeling challenges through explicit rhythmic representations and adaptive feature fusion.

Motivation: Existing video-to-music generation methods lack explicit rhythm modeling and struggle with effectively integrating diverse visual features, hindering audiovisual temporal alignment and contextual coherence.

Method: Proposes a hierarchical conditional diffusion model with visual feature extraction (rhythmic, semantic, emotional) and hierarchical cross-attention mechanism. Uses onset detection functions for rhythm modeling and timestep-aware fusion strategies (FiLM, weighted fusion) for adaptive feature integration.

Result: Extensive experiments show low-resolution ODF is most effective for rhythm modeling. Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in objective metrics and subjective comparisons.

Conclusion: Diff-V2M successfully addresses rhythm modeling and feature integration challenges in video-to-music generation through explicit rhythmic representations and hierarchical cross-attention, demonstrating superior performance across diverse datasets.

Abstract: Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons. Demo and code are available at https://Tayjsl97.github.io/Diff-V2M-Demo/.
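
Two of the fusion mechanisms named in the abstract are easy to sketch. FiLM applies a conditioning-derived scale and shift per channel, and a timestep-aware weighted fusion can interpolate between rhythmic and semantic cues across the diffusion process. The linear schedule below is an assumption for illustration, not the paper's learned weighting.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift
    derived from the conditioning signal."""
    return gamma[None, :] * features + beta[None, :]

def timestep_weighted_fusion(semantic, rhythmic, t, T):
    """Assumed illustrative schedule (not the paper's learned weighting):
    weight rhythmic cues early and semantic cues late."""
    w = t / T
    return w * semantic + (1.0 - w) * rhythmic

# Toy feature map: 4 time frames x 3 channels.
feats = np.ones((4, 3))
modulated = film(feats, gamma=np.array([2.0, 1.0, 0.5]), beta=np.array([0.0, 1.0, 0.0]))
fused_early = timestep_weighted_fusion(np.full((2, 3), 4.0), np.zeros((2, 3)), t=0, T=10)
fused_late = timestep_weighted_fusion(np.full((2, 3), 4.0), np.zeros((2, 3)), t=10, T=10)
```

The point of FiLM over plain concatenation is that the conditioning signal can amplify or suppress individual channels without adding parameters to the backbone's width.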

[297] Chord-conditioned Melody and Bass Generation

Alexandra C Salem, Mohammad Shokri, Johanna Devaney

Main category: cs.SD

TL;DR: Evaluation of five Transformer-based chord-conditioned generation strategies for melody and bass using music theory metrics, showing bass-first model performs best.

Motivation: To systematically compare different chord-conditioning strategies for music generation and assess their effectiveness using objective music theory metrics.

Method: Evaluated five Transformer-based models: no conditioning, independent line conditioning, bass-first, melody-first, and co-generation, using pitch content, interval size, and chord tone usage metrics.

Result: Chord-conditioning improves stylistic pitch content and chord tone usage, with bass-first model showing particularly strong performance.

Conclusion: Chord-conditioning is beneficial for music generation, and the bass-first approach is especially effective for capturing stylistic characteristics.

Abstract: We evaluate five Transformer-based strategies for chord-conditioned melody and bass generation using a set of music theory-motivated metrics capturing pitch content, pitch interval size, and chord tone usage. The evaluated models include (1) no chord conditioning, (2) independent line chord-conditioned generation, (3) bass-first chord-conditioned generation, (4) melody-first chord-conditioned generation, and (5) chord-conditioned co-generation. We show that chord-conditioning improves the replication of stylistic pitch content and chord tone usage characteristics, particularly for the bass-first model.

[298] SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji

Main category: cs.SD

TL;DR: Two music editing methods that use score distillation to improve consistency between original and edited music, outperforming existing approaches in content preservation and editing fidelity.

Motivation: Existing zero-shot text-guided music editing methods struggle to preserve musical content and text instructions alone often fail to accurately describe desired music.

Method: SteerMusic (coarse-grained zero-shot editing using delta denoising score) and SteerMusic+ (fine-grained personalized editing by manipulating concept tokens for user-defined musical styles).

Result: Experimental results show the methods outperform existing approaches in preserving music content consistency and editing fidelity, with user studies validating superior editing quality.

Conclusion: The proposed score distillation-based methods enable more effective music editing with better content preservation and editing accuracy compared to text-only approaches.

Abstract: Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided editing methods rely on pretrained diffusion models by involving forward-backward diffusion processes. However, these methods often struggle to preserve the musical content. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that improve the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows for the editing of music into user-defined musical styles that cannot be achieved by the text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality.

[299] Non-verbal Perception of Room Acoustics using Multi Dimensional Scaling Method

Leonie Böhlke, Tim Ziemer, Rolf Bader

Main category: cs.SD

TL;DR: This paper presents an alternative approach to characterizing subjective room acoustics impressions using Multi Dimensional Scaling (MDS) to identify 5 perceptual dimensions explained by psychoacoustic measures.

Motivation: To develop a better understanding of subjective room acoustics impressions for music performance and auralizations, moving beyond traditional correlation-based approaches with expert ratings.

Method: Convolved music with binaural room impulse response measurements and used Multi Dimensional Scaling (MDS) to identify perceptual dimensions of room acoustics.

Result: Identified 5 perceptual dimensions of room acoustics that can be explained by echo density, fractal correlation dimension, roughness, loudness, and early decay time.

Conclusion: The MDS approach successfully reveals 5 key perceptual dimensions in room acoustics perception, providing an alternative framework to traditional correlation-based methods.

Abstract: Subjective room acoustics impressions play an important role in the performance and reception of music in concert venues and auralizations. Therefore, room acoustics research since the 20th century has dealt with the relationship between objective, acoustic parameters and subjective impressions of room acoustics. One common approach is to correlate acoustic measures with experts’ subjective ratings of rooms as recalled from their long-term memory, and explain them using acoustical measures. Another approach is to let listeners rate auralized room acoustics on bipolar scales and find objective correlates. In this study, we present an alternative approach to characterizing the subjective impressions of room acoustics. We convolve music with binaural room impulse response measurements and utilize Multi Dimensional Scaling (MDS) to identify the perceptual dimensions of room acoustics. Results show that the perception of room acoustics has 5 dimensions that can be explained by the (psycho-)acoustical measures echo density, fractal correlation dimension, roughness, loudness, and early decay time.
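
The study applies MDS to listeners' pairwise dissimilarity judgments of the auralized rooms. A minimal classical (Torgerson) MDS, which recovers a low-dimensional embedding from a dissimilarity matrix via double-centering and eigendecomposition, looks like this (the toy "room positions" are purely for a sanity check):

```python
import numpy as np

def classical_mds(dissim, n_dims=2):
    """Classical (Torgerson) MDS: embed items so that Euclidean distances
    approximate the given pairwise dissimilarities."""
    d2 = dissim ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    b = -0.5 * j @ d2 @ j                 # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]     # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Sanity check: three "rooms" at 1-D positions 0, 1, 3; MDS should recover a
# configuration whose pairwise distances match the input exactly.
pos = np.array([[0.0], [1.0], [3.0]])
dissim = np.abs(pos - pos.T)
embedding = classical_mds(dissim, n_dims=1)
recovered = np.abs(embedding - embedding.T)
```

With perceptual data the dissimilarities are not exactly Euclidean, so one typically inspects how much variance each eigenvalue explains to choose the number of dimensions, which is how a 5-dimensional solution like the paper's emerges.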

[300] Two-stage Audio-Visual Target Speaker Extraction System for Real-Time Processing On Edge Device

Zixuan Li, Xueliang Zhang, Lei Miao, Zhipeng Yan, Ying Sun, Chong Zhu

Main category: cs.SD

TL;DR: Proposed a two-stage ultra-compact AVTSE system using visual-based voice activity detection followed by audio processing to reduce computational complexity for edge devices.

Motivation: Existing AVTSE methods encode visual and audio features simultaneously, resulting in extremely high computational complexity that makes them impractical for real-time processing on edge devices.

Method: Two-stage approach: 1) Compact network for voice activity detection using visual information, 2) VAD results combined with audio inputs to isolate target speaker’s voice.

Result: Effectively suppresses background noise and interfering voices while requiring minimal computational resources.

Conclusion: The proposed ultra-compact AVTSE system provides efficient target speaker extraction suitable for real-time processing on edge devices with minimal computational overhead.

Abstract: Audio-Visual Target Speaker Extraction (AVTSE) aims to isolate a target speaker’s voice in a multi-speaker environment with visual cues as auxiliary. Most of the existing AVTSE methods encode visual and audio features simultaneously, resulting in extremely high computational complexity and making it impractical for real-time processing on edge devices. To tackle this issue, we propose a two-stage ultra-compact AVTSE system. Specifically, in the first stage, a compact network is employed for voice activity detection (VAD) using visual information. In the second stage, the VAD results are combined with audio inputs to isolate the target speaker’s voice. Experiments show that the proposed system effectively suppresses background noise and interfering voices while requiring minimal computational resources.

[301] Sound impact of simple viscoelastic damping changes due to aging and the role of the double bentside on soundboard tension in a 1755 Dulcken harpsichord

Rolf Bader, Niko Plath, Patrick Kontopidis

Main category: cs.SD

TL;DR: Study investigates wood aging effects on a 1755 Dulcken harpsichord’s sound perception using FDTD modeling, finding counterintuitive brightness changes in higher strings due to frequency-dependent damping effects.

DetailsMotivation: To understand how wood aging affects sound perception in historical harpsichords, particularly examining the relationship between internal damping changes and sound brightness across different string positions.

Method: Used Finite-Difference Time Domain (FDTD) modeling with measured soundboard thickness (497 positions) and impulse responses to estimate internal damping. Simulated impulse responses at 52 string positions while varying damping parameters, and analyzed spectral centroids for brightness assessment. Also used 3D FEM modeling to study string attachment variations.

Result: Found position-dependent brightness changes: lower strings showed increased brightness as expected, but higher strings showed decreased brightness due to frequency-dependent damping filter effects. No significant soundboard tension changes were found from the unique 8’ string attachment to outer wall construction.

Conclusion: Wood aging affects sound brightness differently across string positions due to frequency-dependent damping effects. The special Dulcken construction feature (outer wall string attachment) doesn’t significantly impact soundboard tension, suggesting other reasons for this design. Future studies should incorporate viscoelasticity for more detailed analysis.

Abstract: The sound perception of wood aging is investigated on a Dulcken harpsichord of 1755 from the Museum of Applied Arts in Hamburg, Germany using a Finite-Difference Time Domain (FDTD) model of the harpsichord’s soundboard. The soundboard thickness was measured on the instrument at 497 positions while the strings were detached and used in the model. Impulse responses were taken on the instrument to estimate the present internal damping by calculating the T60 decay time and used as a model input. By varying the internal damping from this measured damping as a logarithmic decrement, impulse responses were simulated at 52 string positions on both the 8’ and 4’ bridges. To estimate the changed sound brightness due to changed internal damping, spectral centroids were calculated from the simulated impulse responses. A dependency of brightness change due to aging on string position was found, where the lower strings have higher brightness, as expected, while the higher strings have decreased brightness. This counterintuitive finding is caused by the frequency-dependent filter effect of changed damping. Future studies need to incorporate viscoelasticity to differentiate this effect further. Furthermore, the attachment of the 8’ strings to the outer instead of the inner wall, a characteristic feature of Dulcken harpsichords, is investigated using a 3D Finite-Element Method (FEM) model simulation of the whole instrument. No considerable changes in the soundboard tension were found compared to an attachment of the 8’ strings to the inner wall, pointing to another reason for this special construction.
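
The brightness measure used here, the spectral centroid, is the amplitude-weighted mean frequency of a signal's magnitude spectrum. A minimal version (the paper computes it on simulated impulse responses; the pure tone below is just a sanity check):

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Amplitude-weighted mean frequency of the magnitude spectrum,
    a standard brightness proxy."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

# Sanity check: one second of a pure 440 Hz tone sampled at 8 kHz, whose
# centroid should sit at the tone's frequency.
fs = 8000
t = np.arange(fs) / fs
centroid = spectral_centroid(np.sin(2 * np.pi * 440 * t), fs)
```

A damping change that attenuates high frequencies pulls this value down, which is how the study quantifies the position-dependent brightness shifts.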

[302] HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li

Main category: cs.SD

TL;DR: HQ-SVC is an efficient framework for high-quality zero-shot singing voice conversion that jointly models content and speaker features using a decoupled codec, enhances fidelity through pitch and volume modeling, and progressively refines outputs via differentiable signal processing and diffusion techniques.

Motivation: Existing zero-shot SVC methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources.

Method: Jointly extracts content and speaker features using a decoupled codec, enhances fidelity through pitch and volume modeling, and progressively refines outputs via differentiable signal processing and diffusion techniques.

Result: Significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency, and achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

Conclusion: HQ-SVC provides an efficient and high-quality solution for zero-shot singing voice conversion that preserves critical acoustic information typically lost in separate modeling approaches.

Abstract: Zero-shot singing voice conversion (SVC) transforms a source singer’s timbre to an unseen target speaker’s voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

[303] End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

Jiliang Hu, Zuchao Li, Baoyuan Qi, Liu Guoming, Ping Wang

Main category: cs.SD

TL;DR: CLSR is an end-to-end contrastive language-speech retriever that extracts question-relevant segments from long audio for spoken question answering, outperforming existing methods by converting acoustic features to text-like representations before alignment.

Motivation: Existing SQA methods struggle with long audio, and current speech-related retrievers have poor performance despite the potential of retrieval-augmented generation approaches.

Method: Proposed CLSR - an end-to-end contrastive language-speech retriever that incorporates an intermediate step converting acoustic features into text-like representations prior to modality alignment.

Result: Experimental results across four cross-modal retrieval datasets show CLSR surpasses both end-to-end speech retrievers and pipeline approaches combining speech recognition with text retrieval.

Conclusion: CLSR provides a robust foundation for advancing practical long-form SQA applications by effectively bridging the gap between speech and text modalities.

Abstract: Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise in helping to preprocess long-form speech, but the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.

[304] Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR

Bingshen Mu, Hexin Liu, Hongfei Xue, Kun Wei, Lei Xie

Main category: cs.SD

TL;DR: MARS is a multi-modal retrieval-and-selection method that enhances conversational LLM-ASR by intelligently selecting the most relevant acoustic and textual historical context, achieving superior performance with significantly less training data.

DetailsMotivation: Existing conversational LLM-ASR methods use fixed context windows or entire conversation history, leading to ASR confusion and high computational costs due to irrelevant and redundant information.

Method: Multi-modal retrieval obtains candidate historical contexts with high acoustic/textual similarity to current utterance, then multi-modal selection calculates both similarities and uses a near-ideal ranking method to select the best historical context.

Result: LLM-ASR trained on only 1.5K hours of data with MARS outperforms state-of-the-art system trained on 179K hours of data on Interspeech 2025 challenge dataset.

Conclusion: MARS effectively addresses context selection challenges in conversational ASR, demonstrating that intelligent context retrieval and selection can achieve superior performance with dramatically reduced training data requirements.

Abstract: Automatic Speech Recognition (ASR) aims to convert human speech content into corresponding text. In conversational scenarios, effectively utilizing context can enhance its accuracy. Large Language Models’ (LLMs) exceptional long-context understanding and reasoning abilities enable LLM-based ASR (LLM-ASR) to leverage historical context for recognizing conversational speech, which has a high degree of contextual relevance. However, existing conversational LLM-ASR methods use a fixed number of preceding utterances or the entire conversation history as context, resulting in significant ASR confusion and computational costs due to massive irrelevant and redundant information. This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. Specifically, multi-modal retrieval obtains a set of candidate historical contexts, each exhibiting high acoustic or textual similarity to the current utterance. Multi-modal selection calculates the acoustic and textual similarities for each retrieved candidate historical context and, by employing our proposed near-ideal ranking method to consider both similarities, selects the best historical context. Evaluations on the Interspeech 2025 Multilingual Conversational Speech Language Model Challenge dataset show that the LLM-ASR, when trained on only 1.5K hours of data and equipped with MARS, outperforms the state-of-the-art top-ranking system trained on 179K hours of data.
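The selection step can be sketched as a rank-combination over the two similarity channels. The paper's "near-ideal ranking" is not specified in the abstract, so summing per-modality ranks below is a hypothetical stand-in:

```python
import numpy as np

def select_best_context(acoustic_sim, textual_sim):
    """Pick the candidate historical context that ranks well under BOTH
    the acoustic and textual similarities (illustrative rank-sum rule,
    not the paper's exact near-ideal ranking)."""
    # rank 0 = most similar, per modality
    a_rank = np.argsort(np.argsort(-np.asarray(acoustic_sim)))
    t_rank = np.argsort(np.argsort(-np.asarray(textual_sim)))
    return int(np.argmin(a_rank + t_rank))
```

A candidate that is merely good in one modality loses to one that is strong in both.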

[305] MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

Hardik B. Sailor, Aw Ai Ti, Chen Fang Yih Nancy, Chiu Ying Lay, Ding Yang, He Yingxu, Jiang Ridong, Li Jingtao, Liao Jingyi, Liu Zhuohan, Lu Yanfeng, Ma Yi, Manas Gupta, Muhammad Huzaifah Bin Md Shahrin, Nabilah Binte Md Johan, Nattadaporn Lertcheva, Pan Chunlei, Pham Minh Duc, Siti Maryam Binte Ahmad Subaidi, Siti Umairah Binte Mohammad Salleh, Sun Shuo, Tarun Kumar Vangani, Wang Qiongqiong, Won Cheng Yi Lewis, Wong Heng Meng Jeremy, Wu Jinyang, Zhang Huayun, Zhang Longyin, Zou Xunlong

Main category: cs.SD

TL;DR: MERaLiON-SER is a multilingual speech emotion recognition model that outperforms existing models by combining categorical and dimensional emotion modeling using hybrid loss functions.

DetailsMotivation: To create a robust speech emotion recognition system that works across English and Southeast Asian languages, capturing both discrete emotion categories and fine-grained emotional dimensions for more comprehensive human affect understanding.

Method: Hybrid training objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modeling, enabling capture of both emotion categories (happy, angry) and dimensions (arousal, valence, dominance).

Result: Extensive evaluations show MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs across multilingual Singaporean languages (English, Chinese, Malay, Tamil) and other public benchmarks.

Conclusion: Specialized speech-only models are crucial for accurate paralinguistic understanding and cross-lingual generalization, providing a foundation for integrating emotion-aware perception into future agentic audio systems for more empathetic multimodal reasoning.

Abstract: We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained dimensions, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralinguistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.
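The hybrid objective can be sketched directly from its definition: weighted cross-entropy for the discrete labels plus a 1 − CCC term for the dimensional targets. The blend weight `alpha` is an assumption; the abstract only states that the two losses are combined:

```python
import numpy as np

def ccc_loss(pred, target):
    """1 - Concordance Correlation Coefficient, for the dimensional
    (arousal/valence/dominance) regression targets."""
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    ccc = 2 * cov / (pred.var() + target.var() + (pm - tm) ** 2)
    return 1.0 - ccc

def weighted_ce(logits, label, class_weights):
    """Class-weighted categorical cross-entropy for the discrete emotions."""
    z = logits - logits.max()           # stable log-softmax
    logp = z - np.log(np.exp(z).sum())
    return -class_weights[label] * logp[label]

def hybrid_loss(logits, label, dim_pred, dim_target, class_weights, alpha=0.5):
    # alpha is a placeholder mixing coefficient, not from the paper
    return alpha * weighted_ce(logits, label, class_weights) + \
           (1 - alpha) * ccc_loss(dim_pred, dim_target)
```

Perfect dimensional agreement drives the CCC term to zero, while a mean shift or scale mismatch penalises it even when correlation is perfect.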

cs.LG

[306] A Lightweight CNN-Attention-BiLSTM Architecture for Multi-Class Arrhythmia Classification on Standard and Wearable ECGs

Vamsikrishna Thota, Hardik Prajapati, Yuvraj Joshi, Shubhangi Rathi

Main category: cs.LG

TL;DR: A lightweight deep learning model combining 1D CNN, attention mechanisms, and BiLSTM achieves superior arrhythmia classification accuracy with only 0.945M parameters, suitable for wearable health monitoring.

DetailsMotivation: Early and accurate detection of cardiac arrhythmias is vital for timely diagnosis and intervention in healthcare.

Method: Proposed a lightweight deep learning model combining 1D Convolutional Neural Networks, attention mechanisms, and Bidirectional LSTM for classifying arrhythmias from both 12-lead and single-lead ECGs, using class-weighted loss to address class imbalance.

Result: Evaluated on CPSC 2018 dataset, the model demonstrates superior accuracy and F1-scores over baseline models with only 0.945 million parameters.

Conclusion: The lightweight model is well-suited for real-time deployment in wearable health monitoring systems due to its small parameter size and high performance.

Abstract: Early and accurate detection of cardiac arrhythmias is vital for timely diagnosis and intervention. We propose a lightweight deep learning model combining 1D Convolutional Neural Networks (CNN), attention mechanisms, and Bidirectional Long Short-Term Memory (BiLSTM) for classifying arrhythmias from both 12-lead and single-lead ECGs. Evaluated on the CPSC 2018 dataset, the model addresses class imbalance using a class-weighted loss and demonstrates superior accuracy and F1-scores over baseline models. With only 0.945 million parameters, our model is well-suited for real-time deployment in wearable health monitoring systems.
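One common recipe for the class-weighted loss on imbalanced ECG data is inverse-frequency weighting, weight_k = N / (K · count_k); the paper does not specify its exact weighting, so this is an illustrative choice:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Inverse-frequency class weights for a weighted loss: rare classes
    get proportionally larger weights (illustrative recipe, assumed)."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return len(labels) / (n_classes * counts)
```

These weights then multiply the per-sample cross-entropy terms so that minority arrhythmia classes contribute comparably to the gradient.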

[307] Accelerating Training Speed of Tiny Recursive Models via Curriculum Guided Adaptive Recursion

Kaleem Ullah Qasim, Jiashu Zhang

Main category: cs.LG

TL;DR: CGAR introduces curriculum learning on architectural depth for recursive reasoning models, achieving 1.71x training speedup with minimal accuracy drop through Progressive Depth Curriculum and Hierarchical Supervision Weighting.

DetailsMotivation: Training recursive reasoning models is computationally expensive (36 GPU-hours per dataset), limiting broader adoption and research.

Method: CGAR uses Progressive Depth Curriculum (dynamic recursion depth adjustment) and Hierarchical Supervision Weighting (exponentially decaying supervision importance) to train models more efficiently.

Result: On Sudoku-Extreme, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours, 42% cost reduction) with only 0.63% accuracy drop (86.65% to 86.02%), plus 100% halting accuracy and 11% fewer reasoning steps.

Conclusion: Principled curriculum on architectural depth enables efficient training of recursive reasoning models on modest hardware, demonstrating Pareto improvement where architectural curriculum enhances both training efficiency and solution quality.

Abstract: Recursive reasoning models achieve remarkable performance on complex reasoning tasks through iterative refinement, enabling tiny networks to match large language models thousands of times their size. However, training remains computationally expensive, with prior work reporting approximately 36 GPU-hours per dataset, limiting broader adoption and research. We propose CGAR, a novel training methodology that applies curriculum learning to architectural depth rather than traditional data ordering. CGAR introduces two synergistic components: Progressive Depth Curriculum dynamically adjusts recursion depth from shallow to deep configurations during training, preventing early overfitting while reducing computational cost, and Hierarchical Supervision Weighting applies exponentially decaying importance to supervision steps, aligning loss weighting with observed gradient magnitude decay. On Sudoku-Extreme with 423,168 test puzzles, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours, 42% cost reduction) with only 0.63% accuracy drop (86.65% to 86.02%). Systematic ablations reveal Progressive Depth Curriculum alone achieves 2.26x speedup with 85.47% accuracy, demonstrating a rare Pareto improvement where architectural curriculum simultaneously enhances training efficiency and solution quality. CGAR-trained models exhibit superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps. Our work demonstrates that principled curriculum on architectural depth enables efficient training of recursive reasoning models on modest hardware. Code and models: https://github.com/Kaleemullahqasim/CGAR and https://huggingface.co/Kaleemullah/trm-cgar-sudoku
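Hierarchical Supervision Weighting can be sketched as normalised, exponentially decaying weights over the recursion's supervision steps. The decay rate below is a placeholder; the paper ties it to the observed gradient magnitude decay:

```python
import numpy as np

def supervision_weights(n_steps, decay=0.5):
    """Exponentially decaying, normalised per-step supervision weights
    (decay rate is an illustrative assumption)."""
    w = decay ** np.arange(n_steps)
    return w / w.sum()
```

Early supervision steps, whose gradients are empirically largest, receive the largest loss weight.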

[308] Learning the Basis: A Kolmogorov-Arnold Network Approach Embedding Green’s Function Priors

Rui Zhu, Yuexing Peng, George C. Alexandropoulos, Wenbo Wang, Wei Xiang

Main category: cs.LG

TL;DR: PhyKAN replaces static RWG basis functions with learnable, adaptive basis representations using physics-informed Kolmogorov-Arnold Networks for electromagnetic modeling.

DetailsMotivation: Traditional Method of Moments (MoM) is limited by static, geometry-defined basis functions like RWG, which lack adaptability and learnability.

Method: Proposes PhyKAN - a physics-informed Kolmogorov-Arnold Network that generalizes RWG basis into learnable basis family, combining local KAN branch with global branch embedded with Green’s function priors derived from EFIE.

Result: Achieves sub-0.01 reconstruction errors across canonical geometries and accurate, unsupervised radar cross section predictions.

Conclusion: PhyKAN provides an interpretable, physics-consistent bridge between classical electromagnetic solvers and modern neural network models.

Abstract: The Method of Moments (MoM) is constrained by the usage of static, geometry-defined basis functions, such as the Rao-Wilton-Glisson (RWG) basis. This letter reframes electromagnetic modeling around a learnable basis representation rather than solving for the coefficients over a fixed basis. We first show that the RWG basis is essentially a static and piecewise-linear realization of the Kolmogorov-Arnold representation theorem. Inspired by this insight, we propose PhyKAN, a physics-informed Kolmogorov-Arnold Network (KAN) that generalizes RWG into a learnable and adaptive basis family. Derived from the EFIE, PhyKAN integrates a local KAN branch with a global branch embedded with Green’s function priors to preserve physical consistency. It is demonstrated that, across canonical geometries, PhyKAN achieves sub-0.01 reconstruction errors as well as accurate, unsupervised radar cross section predictions, offering an interpretable, physics-consistent bridge between classical solvers and modern neural network models for electromagnetic modeling.

[309] TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter

Main category: cs.LG

TL;DR: TabPFN-2.5 is the next-generation tabular foundation model that scales to 50K data points and 2K features, outperforming tuned tree models and matching AutoGluon 1.4, with a new distillation engine for production deployment.

DetailsMotivation: To advance tabular AI by creating a foundation model that handles larger datasets (20x more data cells than TabPFNv2) and provides superior performance for industrial applications.

Method: Developed TabPFN-2.5 as a tabular foundation model with increased capacity, plus a distillation engine that converts the model into compact MLP or tree ensembles for production use.

Result: Achieves leading performance on TabArena benchmark, 100% win rate against XGBoost on small-medium datasets, 87% win rate on larger datasets, and enables low-latency deployment through distillation.

Conclusion: TabPFN-2.5 significantly advances tabular AI capabilities and will enhance existing applications built on the TabPFN ecosystem through improved performance and production-ready deployment options.

Abstract: The first tabular foundation model, TabPFN, and its successor TabPFNv2 have impacted tabular AI substantially, with dozens of methods building on it and hundreds of applications across different use cases. This report introduces TabPFN-2.5, the next generation of our tabular foundation model, built for datasets with up to 50,000 data points and 2,000 features, a 20x increase in data cells compared to TabPFNv2. TabPFN-2.5 is now the leading method for the industry standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (<=10,000 data points, 500 features) and an 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression). For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment. This new release will immediately strengthen the performance of the many applications and methods already built on the TabPFN ecosystem.

[310] PEGNet: A Physics-Embedded Graph Network for Long-Term Stable Multiphysics Simulation

Can Yang, Zhenzhong Wang, Junyuan Liu, Yunpeng Gong, Min Jiang

Main category: cs.LG

TL;DR: PEGNet is a Physics-Embedded Graph Network that incorporates PDE-guided message passing to improve physical consistency and stability in solving partial differential equations, outperforming existing methods on benchmarks including respiratory airflow and drug delivery.

DetailsMotivation: Traditional numerical solvers for PDEs are computationally expensive, while data-driven methods suffer from error accumulation and limited physical consistency, especially in multiphysics and complex geometries.

Method: Proposes PEGNet with PDE-guided message passing that embeds key PDE dynamics (convection, viscosity, diffusion) into distinct message functions, uses hierarchical architecture for multi-scale features, and integrates physical regularization into the loss function.

Result: Significant improvements in long-term prediction accuracy and physical consistency over existing methods on benchmarks including custom datasets for respiratory airflow and drug delivery.

Conclusion: PEGNet effectively integrates physical constraints into neural network architecture, producing more stable and physically consistent solutions for PDE-based simulations.

Abstract: Accurate and efficient simulations of physical phenomena governed by partial differential equations (PDEs) are important for scientific and engineering progress. While traditional numerical solvers are powerful, they are often computationally expensive. Recently, data-driven methods have emerged as alternatives, but they frequently suffer from error accumulation and limited physical consistency, especially in multiphysics and complex geometries. To address these challenges, we propose PEGNet, a Physics-Embedded Graph Network that incorporates PDE-guided message passing to redesign the graph neural network architecture. By embedding key PDE dynamics like convection, viscosity, and diffusion into distinct message functions, the model naturally integrates physical constraints into its forward propagation, producing more stable and physically consistent solutions. Additionally, a hierarchical architecture is employed to capture multi-scale features, and physical regularization is integrated into the loss function to further enforce adherence to governing physics. We evaluated PEGNet on benchmarks, including custom datasets for respiratory airflow and drug delivery, showing significant improvements in long-term prediction accuracy and physical consistency over existing methods. Our code is available at https://github.com/Yanghuoshan/PEGNet.
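The idea of embedding PDE dynamics into a message function can be illustrated with a single diffusion-style update, x ← x − dt · κ · Lx, where L is the graph Laplacian; the concrete form PEGNet uses is richer (convection and viscosity terms as well), so this is only a sketch:

```python
import numpy as np

def diffusion_message_pass(x, A, dt=0.1, kappa=1.0):
    """One message-passing step whose update mimics graph diffusion,
    in the spirit of PDE-guided messages (illustrative form, assumed)."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A          # combinatorial graph Laplacian
    return x - dt * kappa * (L @ x)
```

Because the Laplacian has zero row and column sums, the update conserves the total quantity while smoothing it across neighbours, which is exactly the physical behaviour of diffusion.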

[311] FAIRPLAI: A Human-in-the-Loop Approach to Fair and Private Machine Learning

David Sanchez, Holly Lopez, Michelle Buraczyk, Anantaa Kotal

Main category: cs.LG

TL;DR: FAIRPLAI is a framework that integrates human oversight into ML systems to balance accuracy, privacy, and fairness through privacy-fairness frontiers, stakeholder input, and private auditing.

DetailsMotivation: Current ML systems struggle to simultaneously achieve fairness, privacy, and accountability in high-stakes applications like healthcare and hiring, with existing solutions creating trade-offs between these goals.

Method: FAIRPLAI uses three components: privacy-fairness frontiers to visualize trade-offs, interactive stakeholder input for domain-specific fairness criteria, and differentially private auditing for human review without compromising data security.

Result: On benchmark datasets, FAIRPLAI maintains strong privacy protections while reducing fairness disparities compared to automated baselines, providing an interpretable process for practitioners.

Conclusion: By embedding human judgment in critical decision points, FAIRPLAI enables the development of ML systems that are effective, responsible, and trustworthy in real-world applications.

Abstract: As machine learning systems move from theory to practice, they are increasingly tasked with decisions that affect healthcare access, financial opportunities, hiring, and public services. In these contexts, accuracy is only one piece of the puzzle - models must also be fair to different groups, protect individual privacy, and remain accountable to stakeholders. Achieving all three is difficult: differential privacy can unintentionally worsen disparities, fairness interventions often rely on sensitive data that privacy restricts, and automated pipelines ignore that fairness is ultimately a human and contextual judgment. We introduce FAIRPLAI (Fair and Private Learning with Active Human Influence), a practical framework that integrates human oversight into the design and deployment of machine learning systems. FAIRPLAI works in three ways: (1) it constructs privacy-fairness frontiers that make trade-offs between accuracy, privacy guarantees, and group outcomes transparent; (2) it enables interactive stakeholder input, allowing decision-makers to select fairness criteria and operating points that reflect their domain needs; and (3) it embeds a differentially private auditing loop, giving humans the ability to review explanations and edge cases without compromising individual data security. Applied to benchmark datasets, FAIRPLAI consistently preserves strong privacy protections while reducing fairness disparities relative to automated baselines. More importantly, it provides a straightforward, interpretable process for practitioners to manage competing demands of accuracy, privacy, and fairness in socially impactful applications. By embedding human judgment where it matters most, FAIRPLAI offers a pathway to machine learning systems that are effective, responsible, and trustworthy in practice. GitHub: https://github.com/Li1Davey/Fairplai

[312] Rethinking Graph Super-resolution: Dual Frameworks for Topological Fidelity

Pragya Singh, Islem Rekik

Main category: cs.LG

TL;DR: Two GNN-agnostic frameworks for graph super-resolution: Bi-SR for structure-aware node super-resolution and DEFEND for edge inference via dual graph mapping, achieving SOTA on brain connectome data.

DetailsMotivation: Existing GNN-based graph super-resolution methods have limitations: matrix-based approaches ignore graph structure and lack permutation invariance, while edge weight inference relies on node representations, limiting scalability and expressivity.

Method: Bi-SR uses bipartite graphs connecting LR and HR nodes for structure-aware node super-resolution. DEFEND learns edge representations by mapping HR edges to nodes of a dual graph, enabling edge inference via standard node-based GNNs.

Result: Achieved state-of-the-art performance across seven topological measures on real-world brain connectome dataset. Introduced twelve new simulated datasets for comprehensive benchmarking.

Conclusion: The proposed frameworks address key limitations in graph super-resolution, providing structure-aware and permutation-invariant approaches that outperform existing methods while enabling better generalization through comprehensive benchmarking datasets.

Abstract: Graph super-resolution, the task of inferring high-resolution (HR) graphs from low-resolution (LR) counterparts, is an underexplored yet crucial research direction that circumvents the need for costly data acquisition. This makes it especially desirable for resource-constrained fields such as the medical domain. While recent GNN-based approaches show promise, they suffer from two key limitations: (1) matrix-based node super-resolution that disregards graph structure and lacks permutation invariance; and (2) reliance on node representations to infer edge weights, which limits scalability and expressivity. In this work, we propose two GNN-agnostic frameworks to address these issues. First, Bi-SR introduces a bipartite graph connecting LR and HR nodes to enable structure-aware node super-resolution that preserves topology and permutation invariance. Second, DEFEND learns edge representations by mapping HR edges to nodes of a dual graph, allowing edge inference via standard node-based GNNs. We evaluate both frameworks on a real-world brain connectome dataset, where they achieve state-of-the-art performance across seven topological measures. To support generalization, we introduce twelve new simulated datasets that capture diverse topologies and LR-HR relationships. These enable comprehensive benchmarking of graph super-resolution methods.
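DEFEND's dual-graph construction maps each HR edge to a node, with two dual nodes connected when the original edges share an endpoint; this is the classical line-graph construction, sketched here for an edge list:

```python
def line_graph(edges):
    """Dual (line-graph) construction: each original edge becomes a node,
    and two dual nodes connect when their edges share an endpoint."""
    nodes = list(edges)
    dual = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if set(nodes[i]) & set(nodes[j]):  # shared endpoint
                dual.append((i, j))
    return dual
```

Edge-level quantities of the original graph then become node-level quantities of the dual, so any standard node-based GNN can infer them.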

[313] Benevolent Dictators? On LLM Agent Behavior in Dictator Games

Andreas Einwiller, Kanishka Ghosh Dastidar, Artur Romazanov, Annette Hautli-Janisz, Michael Granitzer, Florian Lemmerich

Main category: cs.LG

TL;DR: The paper proposes the LLM-ABS framework to study LLM agent behavior in dictator games, addressing prompt sensitivity issues and providing more reliable insights into fairness preferences.

DetailsMotivation: To overcome limitations in previous studies that overlook system prompt influence and lack robustness when studying complex behavioral aspects of LLMs in games like the dictator game.

Method: Proposed LLM-ABS framework with three components: exploring system prompt influence, using neutral prompt variations for reliable insights, and analyzing linguistic features in responses to understand reasoning.

Result: Found strong preference for fairness in agents, significant impact of system prompts on behavior, and different linguistic expression patterns across models.

Conclusion: Prompt sensitivity remains challenging, but the LLM-ABS framework provides a robust foundation for studying LLM agent behavior, with code available for reproducibility.

Abstract: In behavioral sciences, experiments such as the ultimatum game are conducted to assess preferences for fairness or self-interest of study participants. In the dictator game, a simplified version of the ultimatum game where only one of two players makes a single decision, the dictator unilaterally decides how to split a fixed sum of money between themselves and the other player. Although recent studies have explored behavioral patterns of AI agents based on Large Language Models (LLMs) instructed to adopt different personas, we question the robustness of these results. In particular, many of these studies overlook the role of the system prompt - the underlying instructions that shape the model’s behavior - and do not account for how sensitive results can be to slight changes in prompts. However, a robust baseline is essential when studying highly complex behavioral aspects of LLMs. To overcome previous limitations, we propose the LLM agent behavior study (LLM-ABS) framework to (i) explore how different system prompts influence model behavior, (ii) get more reliable insights into agent preferences by using neutral prompt variations, and (iii) analyze linguistic features in responses to open-ended instructions by LLM agents to better understand the reasoning behind their behavior. We found that agents often exhibit a strong preference for fairness, as well as a significant impact of the system prompt on their behavior. From a linguistic perspective, we identify that models express their responses differently. Although prompt sensitivity remains a persistent challenge, our proposed framework demonstrates a robust foundation for LLM agent behavior studies. Our code artifacts are available at https://github.com/andreaseinwiller/LLM-ABS.

[314] Macroscopic Emission Modeling of Urban Traffic Using Probe Vehicle Data: A Machine Learning Approach

Mohammed Ali El Adlouni, Ling Jin, Xiaodan Xu, C. Anna Spurlock, Alina Lazar, Kaveh Farokhi Sadabadi, Mahyar Amirgholy, Mona Asudegi

Main category: cs.LG

TL;DR: This study uses machine learning with large-scale traffic and emission data to predict network-wide emission rates in US urban areas, creating data-driven emission fundamental diagrams (eMFDs) that help understand location-specific factors affecting emissions.

DetailsMotivation: Urban congestion causes inefficient vehicle movement and increases greenhouse gas emissions and air pollution. Existing eMFD models are sparse due to data limitations, creating a need for better tools to monitor and reduce network-wide emissions.

Method: Leveraging large-scale granular traffic and emission data from probe vehicles, the study applies machine learning methods to predict the relationship between network-wide emission rates and traffic variables at a large scale in US urban areas.

Result: The analysis generates data-driven eMFDs and provides deeper understanding of how emissions depend on network characteristics, infrastructure, land use, and vehicle characteristics.

Conclusion: This framework enables transportation authorities to measure carbon emissions from urban transport for given travel demand and optimize location-specific traffic management and planning decisions to mitigate network-wide emissions.

Abstract: Urban congestion causes inefficient movement of vehicles and exacerbates greenhouse gas emissions and urban air pollution. The macroscopic emission fundamental diagram (eMFD) captures an orderly relationship among emissions and aggregated traffic variables at the network level, allowing for real-time monitoring of region-wide emissions and optimal allocation of travel demand to existing networks, reducing urban congestion and associated emissions. However, empirically derived eMFD models are sparse due to historical data limitations. Leveraging large-scale and granular traffic and emission data derived from probe vehicles, this study is the first to apply machine learning methods to predict the network-wide emission-rate-to-traffic relationship in U.S. urban areas at a large scale. The analysis framework and insights developed in this work generate data-driven eMFDs and a deeper understanding of their location dependence on network, infrastructure, land use, and vehicle characteristics, enabling transportation authorities to measure carbon emissions from urban transport for a given travel demand and optimize location-specific traffic management and planning decisions to mitigate network-wide emissions.

[315] Gromov-Wasserstein Graph Coarsening

Carlos A. Taveras, Santiago Segarra, César A. Uribe

Main category: cs.LG

TL;DR: Two graph coarsening algorithms using Gromov-Wasserstein geometry: GPC merges node pairs minimizing local distortion, KGPC uses clustering on pairwise distortion to merge node clusters.

DetailsMotivation: To develop efficient graph coarsening methods within Gromov-Wasserstein geometry that can handle large-scale graphs while minimizing structural distortion.

Method: Proposed two algorithms: Greedy Pair Coarsening (GPC) that iteratively merges node pairs minimizing local distortion, and k-means Greedy Pair Coarsening (KGPC) that uses clustering on pairwise distortion metrics to merge node clusters.

Result: The methods outperform existing approaches on six large-scale datasets and a downstream clustering task across various parameters and scenarios.

Conclusion: The proposed GPC and KGPC algorithms provide effective graph coarsening within Gromov-Wasserstein geometry with proven optimality conditions and superior performance compared to existing methods.

Abstract: We study the problem of graph coarsening within the Gromov-Wasserstein geometry. Specifically, we propose two algorithms that leverage a novel representation of the distortion induced by merging pairs of nodes. The first method, termed Greedy Pair Coarsening (GPC), iteratively merges pairs of nodes that locally minimize a measure of distortion until the desired size is achieved. The second method, termed $k$-means Greedy Pair Coarsening (KGPC), leverages clustering based on pairwise distortion metrics to directly merge clusters of nodes. We provide conditions guaranteeing optimal coarsening for our methods and validate their performance on six large-scale datasets and a downstream clustering task. Results show that the proposed methods outperform existing approaches on a wide range of parameters and scenarios.
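The greedy loop of GPC can be sketched on an adjacency matrix: repeatedly merge the pair of nodes that minimises a local distortion measure. The Gromov-Wasserstein distortion itself is the paper's contribution, so the cheap row-difference proxy below is an assumption:

```python
import numpy as np

def greedy_pair_coarsen(A, target_n):
    """Toy greedy pair coarsening: repeatedly merge the node pair whose
    adjacency rows are most similar (proxy for GW distortion, assumed)."""
    A = A.astype(float).copy()
    while A.shape[0] > target_n:
        n = A.shape[0]
        best, best_pair = np.inf, (0, 1)
        for i in range(n):
            for j in range(i + 1, n):
                d = np.linalg.norm(A[i] - A[j])  # local distortion proxy
                if d < best:
                    best, best_pair = d, (i, j)
        i, j = best_pair
        A[i] = A[i] + A[j]            # merge j into i (sum connectivity)
        A[:, i] = A[:, i] + A[:, j]
        keep = [k for k in range(n) if k != j]
        A = A[np.ix_(keep, keep)]
    return A
```

Each merge sums the two nodes' connectivity, so total edge weight is preserved while the graph shrinks one node at a time; KGPC replaces the pairwise loop with clustering on the same distortion values.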

[316] Hey Pentti, We Did (More of) It!: A Vector-Symbolic Lisp With Residue Arithmetic

Connor Hanley, Eilene Tomkins-Flanaganm, Mary Alexandria Kelly

Main category: cs.LG

TL;DR: This paper extends Vector-Symbolic Architecture with arithmetic operations using FHRRs and RHC to encode Turing-complete syntax in high-dimensional vector spaces, making neural network states more expressive and interpretable.

DetailsMotivation: To increase the expressivity of neural network states by enabling them to contain arbitrarily structured representations that are inherently interpretable, and to develop more general intelligent agents.

Method: Using Frequency-domain Holographic Reduced Representations (FHRRs) to extend a Vector-Symbolic Architecture encoding of Lisp 1.5 with primitives for arithmetic operations using Residue Hyperdimensional Computing (RHC).

Result: The encoding allows neural network states to contain arbitrarily structured representations that are inherently interpretable, increasing expressivity over high-dimensional vector spaces.

Conclusion: This approach has potential applications in machine learning tasks and is important for designing neural networks sensitive to structured representations, which could lead to more general intelligent agents.

Abstract: Using Frequency-domain Holographic Reduced Representations (FHRRs), we extend a Vector-Symbolic Architecture (VSA) encoding of Lisp 1.5 with primitives for arithmetic operations using Residue Hyperdimensional Computing (RHC). Encoding a Turing-complete syntax over a high-dimensional vector space increases the expressivity of neural network states, enabling network states to contain arbitrarily structured representations that are inherently interpretable. We discuss the potential applications of the VSA encoding in machine learning tasks, as well as the importance of encoding structured representations and designing neural networks whose behavior is sensitive to the structure of their representations in virtue of attaining more general intelligent agents than exist at present.
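
As background for readers unfamiliar with the ingredients: in FHRRs, symbols are vectors of unit-modulus complex numbers, binding is elementwise multiplication, and unbinding multiplies by the conjugate. A minimal sketch of these standard VSA operations (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # high dimension makes independent random vectors quasi-orthogonal

def random_phasor(d=D):
    """An FHRR symbol: a vector of unit-modulus complex numbers."""
    return np.exp(1j * rng.uniform(-np.pi, np.pi, d))

def bind(a, b):
    return a * b                      # binding: elementwise complex product

def unbind(c, b):
    return c * np.conj(b)             # unbinding: multiply by the conjugate

def sim(a, b):
    return float(np.real(np.mean(a * np.conj(b))))  # cosine-like similarity

role, filler, other = random_phasor(), random_phasor(), random_phasor()
pair = bind(role, filler)
recovered = unbind(pair, role)        # recovers filler exactly (|role| = 1)
```

Residue Hyperdimensional Computing builds on the same phasor algebra, roughly speaking encoding integers by their residues modulo coprime bases so that arithmetic reduces to componentwise binding.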

[317] A Generalized Bias-Variance Decomposition for Bregman Divergences

David Pfau

Main category: cs.LG

TL;DR: Generalizes the bias-variance decomposition to Bregman divergences, which is relevant to maximum likelihood estimation with exponential families, and provides a clear derivation for pedagogical purposes.

DetailsMotivation: The bias-variance decomposition is fundamental but typically limited to squared error; this work extends it to broader contexts relevant to maximum likelihood estimation.

Method: Presents a standalone derivation of the bias-variance decomposition generalized to Bregman divergences, building on existing but unclear prior work.

Result: Provides a clear, pedagogical derivation of the generalized bias-variance decomposition for Bregman divergences, with proper context and references.

Conclusion: This note fills a pedagogical gap by offering a clear derivation of the bias-variance decomposition for Bregman divergences, making the result more accessible to the community.

Abstract: The bias-variance decomposition is a central result in statistics and machine learning, but is typically presented only for the squared error. We present a generalization of the bias-variance decomposition where the prediction error is a Bregman divergence, which is relevant to maximum likelihood estimation with exponential families. While the result is already known, there was not previously a clear, standalone derivation, so we provide one for pedagogical purposes. A version of this note previously appeared on the author’s personal website without context. Here we provide additional discussion and references to the relevant prior literature.
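
The generalized decomposition can be checked concretely. With Bregman divergence $D_F(p,q)=F(p)-F(q)-\nabla F(q)(p-q)$ and independent target $y$ and estimator $\hat y$, the expected error splits into noise + bias + variance, where the central prediction is the dual mean $(\nabla F)^{-1}(\mathbb{E}[\nabla F(\hat y)])$. A numerical check (our notation) for $F(x)=x\log x$, whose dual mean is the geometric mean:

```python
import numpy as np

# Bregman divergence generated by F(x) = x*log(x) (generalized I-divergence)
F = lambda x: x * np.log(x)
dF = lambda x: np.log(x) + 1.0                  # gradient of F
breg = lambda p, q: F(p) - F(q) - dF(q) * (p - q)

# Small discrete distributions for the target y and (independent) estimator yhat
y_vals,  y_p  = np.array([1.0, 2.0]), np.array([0.3, 0.7])
yh_vals, yh_p = np.array([0.5, 3.0]), np.array([0.6, 0.4])

mu = y_p @ y_vals                               # mean of y
ybar = np.exp(yh_p @ np.log(yh_vals))           # dual mean: (dF)^(-1)(E[dF(yhat)])

lhs = sum(py * pq * breg(y, q) for y, py in zip(y_vals, y_p)
                               for q, pq in zip(yh_vals, yh_p))
noise    = y_p @ breg(y_vals, mu)               # irreducible error of y
bias     = breg(mu, ybar)                       # divergence of dual mean from E[y]
variance = yh_p @ breg(ybar, yh_vals)           # spread of yhat around its dual mean
assert np.isclose(lhs, noise + bias + variance)
```

For $F(x)=x^2$ this reduces to the familiar squared-error decomposition, with the dual mean equal to the ordinary mean.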

[318] Enabling Agents to Communicate Entirely in Latent Space

Zhuoyun Du, Runze Wang, Huiyu Bai, Zouying Cao, Xiaoyong Zhu, Bo Zheng, Wei Chen, Haochao Ying

Main category: cs.LG

TL;DR: Interlat enables direct transmission of LLM hidden states between agents for more nuanced communication than natural language, improving collaborative problem-solving through latent space reasoning and compression.

DetailsMotivation: Natural language communication between LLM agents inherently limits information transmission by downsampling rich internal states into discrete tokens, hindering collaborative problem-solving depth and nuance.

Method: Proposes Interlat paradigm using LLM’s last hidden states as mind representation for direct transmission (latent communication), with additional compression via latent space reasoning.

Result: Interlat outperforms fine-tuned chain-of-thought prompting and single-agent baselines, promotes exploratory behavior, and genuinely exploits latent information; compression additionally accelerates inference while maintaining competitive performance.

Conclusion: Interlat demonstrates feasibility of entirely latent space inter-agent communication, showing significant potential for future research in agent collaboration beyond natural language constraints.

Abstract: While natural language is the de facto communication medium for LLM-based agents, it presents a fundamental constraint. The process of downsampling rich, internal latent states into discrete tokens inherently limits the depth and nuance of information that can be transmitted, thereby hindering collaborative problem-solving. Inspired by human mind-reading, we propose Interlat (Inter-agent Latent Space Communication), a paradigm that leverages the last hidden states of an LLM as a representation of its mind for direct transmission (termed latent communication). An additional compression process further compresses latent communication via entirely latent space reasoning. Experiments demonstrate that Interlat outperforms both fine-tuned chain-of-thought (CoT) prompting and single-agent baselines, promoting more exploratory behavior and enabling genuine utilization of latent information. Further compression not only substantially accelerates inference but also maintains competitive performance through an efficient information-preserving mechanism. We position this work as a feasibility study of entirely latent space inter-agent communication, and our results highlight its potential, offering valuable insights for future research.

[319] BayesQ: Uncertainty-Guided Bayesian Quantization

Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

Main category: cs.LG

TL;DR: BayesQ is an uncertainty-guided post-training quantization framework that optimizes quantization under posterior expected loss, improving model performance at low bit rates compared to strong baselines.

DetailsMotivation: To address the challenge of low-bit quantization by reframing it as uncertainty-aware risk minimization, leveraging Bayesian principles to guide the quantization process more effectively.

Method: Fits a lightweight Gaussian posterior over weights, whitens by the posterior covariance, designs codebooks that minimize posterior-expected distortion, allocates mixed precision via a greedy knapsack algorithm, and optionally applies calibration-only distillation.

Result: At 3.0/3.5/4.0 bits per weight, BayesQ outperforms GPTQ by +1.5/+0.7/+0.3 top-1 percentage points on ResNet-50 and +1.1/+0.4/+0.2 GLUE points on BERT-base, with comparable preprocessing requirements.

Conclusion: BayesQ successfully reframes low-bit quantization as uncertainty-aware risk minimization, providing a practical post-training pipeline that significantly improves quantization performance while maintaining efficiency.

Abstract: We present BayesQ, an uncertainty-guided post-training quantization framework that is the first to optimize quantization under the posterior expected loss. BayesQ fits a lightweight Gaussian posterior over weights (diagonal Laplace by default; optional K-FAC/low-rank), whitens by the posterior covariance, designs codebooks to minimize posterior-expected distortion, and allocates mixed precision via a greedy knapsack that maximizes marginal expected-loss reduction per bit under a global budget. For scalar quantizers, posterior-expected MSE yields closed-form tables; task-aware proxies are handled by short Monte Carlo on a small calibration set. An optional calibration-only distillation aligns the quantized model with the posterior predictive teacher. At matched average bits/weight of 3.0/3.5/4.0, BayesQ improves over strong PTQ baselines on ResNet-50 (ImageNet) and BERT-base (GLUE) e.g., vs. GPTQ by $+1.5/+0.7/+0.3$ top-1 percentage points on RN50 and $+1.1/+0.4/+0.2$ GLUE points on BERT, while requiring one-time preprocessing comparable to a GPTQ pass. BayesQ reframes low-bit quantization as uncertainty-aware risk minimization in a practical, post-training pipeline.
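
The mixed-precision step can be illustrated with a toy greedy knapsack (function names, the one-bit-at-a-time granularity, and the gains table are our inventions; BayesQ scores each increment by marginal expected-loss reduction per bit under a global budget):

```python
import heapq

def allocate_bits(gains, budget, min_bits=2, max_bits=8):
    """Toy greedy knapsack: start every layer at min_bits and repeatedly
    spend one more bit on the layer with the largest marginal
    expected-loss reduction, until the global bit budget is exhausted.
    gains[l][b] = loss reduction for moving layer l from b to b+1 bits."""
    alloc = {l: min_bits for l in gains}
    spent = len(gains) * min_bits
    heap = [(-gains[l][min_bits], l) for l in gains]
    heapq.heapify(heap)
    while heap and spent < budget:
        _, l = heapq.heappop(heap)
        bits = alloc[l]
        if bits >= max_bits:
            continue
        alloc[l] = bits + 1
        spent += 1
        if bits + 1 < max_bits:
            heapq.heappush(heap, (-gains[l][bits + 1], l))
    return alloc

# Diminishing returns per layer; a 6-bit global budget over two layers
gains = {"layer_a": {2: 5.0, 3: 1.0, 4: 0.5},
         "layer_b": {2: 3.0, 3: 2.0, 4: 0.5}}
alloc = allocate_bits(gains, budget=6, min_bits=2, max_bits=5)
```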

[320] Physics-Informed Machine Learning for Characterizing System Stability

Tomoki Koike, Elizabeth Qian

Main category: cs.LG

TL;DR: A physics-informed machine learning method called LyapInf that infers Lyapunov functions from trajectory data to estimate stability regions without requiring explicit system equations.

DetailsMotivation: Traditional stability analysis methods require explicit knowledge of system governing equations, which is often unavailable for complex practical systems like aerospace applications.

Method: Proposes a quadratic Lyapunov function form and fits it to trajectory data by minimizing the residual of the Zubov equation, treating the system as a black box.

Result: Successfully characterizes near-maximal ellipsoidal stability regions on benchmark examples without knowledge of system governing equations.

Conclusion: LyapInf provides an effective data-driven approach for stability region estimation that bypasses the need for explicit system equations.

Abstract: In the design and operation of complex dynamical systems, it is essential to ensure that all state trajectories of the dynamical system converge to a desired equilibrium within a guaranteed stability region. Yet, for many practical systems – especially in aerospace – this region cannot be determined a priori and is often challenging to compute. One of the most common methods for computing the stability region is to identify a Lyapunov function. A Lyapunov function is a positive function whose time derivative along system trajectories is non-positive, which provides a sufficient condition for stability and characterizes an estimated stability region. However, existing methods of characterizing a stability region via a Lyapunov function often rely on explicit knowledge of the system governing equations. In this work, we present a new physics-informed machine learning method of characterizing an estimated stability region by inferring a Lyapunov function from system trajectory data that treats the dynamical system as a black box and does not require explicit knowledge of the system governing equations. In our presented Lyapunov function Inference method (LyapInf), we propose a quadratic form for the unknown Lyapunov function and fit the unknown quadratic operator to system trajectory data by minimizing the average residual of the Zubov equation, a first-order partial differential equation whose solution yields a Lyapunov function. The inferred quadratic Lyapunov function can then characterize an ellipsoidal estimate of the stability region. Numerical results on benchmark examples demonstrate that our physics-informed stability analysis method successfully characterizes a near-maximal ellipsoid of the system stability region associated with the inferred Lyapunov function without requiring knowledge of the system governing equations.
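
A simplified sketch of the LyapInf idea (our toy version: the Zubov residual is reduced to $\dot V(x) + \|x\|^2\,(1 - V(x)) = 0$, with $\dot V$ estimated by finite differences along the trajectory, which makes the fit an ordinary least-squares problem in the entries of $P$; the paper's formulation has additional details):

```python
import numpy as np

def sym_features(x):
    """Features s(x) with V(x) = s(x) @ theta for symmetric P (upper triangle)."""
    d = x.shape[-1]
    iu = np.triu_indices(d)
    outer = x[..., :, None] * x[..., None, :]
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)   # off-diagonal terms appear twice
    return outer[..., iu[0], iu[1]] * scale

def lyapinf_fit(X, dt):
    """Fit a quadratic V(x) = x^T P x by least squares on the (simplified)
    Zubov residual dV/dt + ||x||^2 (1 - V(x)) = 0, using finite differences
    of V along the sampled trajectory X of shape (T, d)."""
    S = sym_features(X)                       # (T, m)
    dS = (S[1:] - S[:-1]) / dt                # finite-difference of features
    xsq = np.sum(X[:-1] ** 2, axis=1)         # ||x_k||^2
    # residual_k = dS_k @ theta + xsq_k * (1 - S_k @ theta)  ->  linear in theta
    A = xsq[:, None] * S[:-1] - dS
    b = xsq
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta

# Toy data: a trajectory of the stable scalar system xdot = -x
dt = 0.01
ts = np.arange(0.0, 2.0, dt)
X = (0.5 * np.exp(-ts))[:, None]
theta = lyapinf_fit(X, dt)                    # positive => V is positive definite
```

A sublevel set of the fitted $V$ then gives an ellipsoidal stability-region estimate.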

[321] TIGER-MARL: Enhancing Multi-Agent Reinforcement Learning with Temporal Information through Graph-based Embeddings and Representations

Nikunj Gupta, Ludwika Twardecka, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Main category: cs.LG

TL;DR: TIGER enhances multi-agent reinforcement learning by modeling evolving inter-agent coordination structures through dynamic temporal graphs and temporal attention-based encoding.

DetailsMotivation: Most MARL approaches use static or per-step relational graphs, overlooking the temporal evolution of interactions that naturally occur as agents adapt and reorganize cooperation strategies.

Method: Constructs dynamic temporal graphs connecting current and historical interactions, then uses temporal attention-based encoder to aggregate information across structural and temporal neighborhoods.

Result: Consistently outperforms diverse value-decomposition and graph-based MARL baselines in task performance and sample efficiency on coordination-intensive benchmarks.

Conclusion: TIGER demonstrates that capturing both structural and temporal factors through dynamic temporal graphs and temporal attention encoding jointly shapes effective policy learning in MARL.

Abstract: In this paper, we propose capturing and utilizing \textit{Temporal Information through Graph-based Embeddings and Representations} or \textbf{TIGER} to enhance multi-agent reinforcement learning (MARL). We explicitly model how inter-agent coordination structures evolve over time. While most MARL approaches rely on static or per-step relational graphs, they overlook the temporal evolution of interactions that naturally arise as agents adapt, move, or reorganize cooperation strategies. Capturing such evolving dependencies is key to achieving robust and adaptive coordination. To this end, TIGER constructs dynamic temporal graphs of MARL agents, connecting their current and historical interactions. It then employs a temporal attention-based encoder to aggregate information across these structural and temporal neighborhoods, yielding time-aware agent embeddings that guide cooperative policy learning. Through extensive experiments on two coordination-intensive benchmarks, we show that TIGER consistently outperforms diverse value-decomposition and graph-based MARL baselines in task performance and sample efficiency. Furthermore, we conduct comprehensive ablation studies to isolate the impact of key design parameters in TIGER, revealing how structural and temporal factors can jointly shape effective policy learning in MARL. All codes can be found here: https://github.com/Nikunj-Gupta/tiger-marl.

[322] Enhancing DPSGD via Per-Sample Momentum and Low-Pass Filtering

Xincheng Xu, Thilina Ranbaduge, Qing Wang, Thierry Rakotoarivelo, David Smith

Main category: cs.LG

TL;DR: DP-PMLF integrates per-sample momentum with low-pass filtering to simultaneously reduce DP noise and clipping bias in differentially private SGD, improving privacy-utility trade-off without extra privacy cost.

DetailsMotivation: Existing DPSGD methods degrade model accuracy by introducing DP noise and clipping bias, but current techniques only address one issue at a time, creating a trade-off problem.

Method: Proposes DP-PMLF, which applies per-sample momentum to smooth gradients before clipping (reducing sampling variance) and a post-processing low-pass filter to attenuate high-frequency DP noise without consuming additional privacy budget.

Result: Theoretical analysis shows improved convergence rate under DP guarantees, and empirical evaluations demonstrate significant enhancement in privacy-utility trade-off compared to state-of-the-art DPSGD variants.

Conclusion: DP-PMLF effectively mitigates both DP noise and clipping bias simultaneously, providing better model accuracy while maintaining rigorous differential privacy guarantees.

Abstract: Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to train deep neural networks with formal privacy guarantees. However, the addition of differential privacy (DP) often degrades model accuracy by introducing both noise and bias. Existing techniques typically address only one of these issues, as reducing DP noise can exacerbate clipping bias and vice-versa. In this paper, we propose a novel method, \emph{DP-PMLF}, which integrates per-sample momentum with a low-pass filtering strategy to simultaneously mitigate DP noise and clipping bias. Our approach uses per-sample momentum to smooth gradient estimates prior to clipping, thereby reducing sampling variance. It further employs a post-processing low-pass filter to attenuate high-frequency DP noise without consuming additional privacy budget. We provide a theoretical analysis demonstrating an improved convergence rate under rigorous DP guarantees, and our empirical evaluations reveal that DP-PMLF significantly enhances the privacy-utility trade-off compared to several state-of-the-art DPSGD variants.
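
A schematic of one update step, with our own names and constants (the actual DP-PMLF filter design and privacy accounting are more involved). Note the ordering: the low-pass filter is applied after noising, so it is pure post-processing and consumes no privacy budget.

```python
import numpy as np

def dp_pmlf_step(per_sample_grads, state, lr=0.1, clip=1.0, sigma=1.0,
                 beta=0.9, alpha=0.3, rng=np.random.default_rng(0)):
    """Sketch of a DP-PMLF-style step (hypothetical names/constants):
    1) per-sample momentum smooths each sample's gradient BEFORE clipping,
    2) clip and average the smoothed gradients, add Gaussian DP noise,
    3) a post-processing low-pass filter (EMA across steps) attenuates
       high-frequency noise at no extra privacy cost."""
    m, g_filt, w = state["momentum"], state["filtered"], state["weights"]
    m = beta * m + (1 - beta) * per_sample_grads          # per-sample momentum
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    clipped = m * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noisy = clipped.mean(0) + rng.normal(0, sigma * clip / len(m), size=w.shape)
    g_filt = (1 - alpha) * g_filt + alpha * noisy         # low-pass filter
    w = w - lr * g_filt
    return {"momentum": m, "filtered": g_filt, "weights": w}

# Demo on a quadratic loss 0.5 * ||w - t||^2 (noise disabled for determinism)
d, n = 2, 8
state = {"momentum": np.zeros((n, d)), "filtered": np.zeros(d),
         "weights": np.array([2.0, -2.0])}
targets = np.zeros((n, d))
for _ in range(60):
    grads = state["weights"] - targets        # per-sample gradients
    state = dp_pmlf_step(grads, state, sigma=0.0)
```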

[323] On topological descriptors for graph products

Mattie Ji, Amauri H. Souza, Vikas Garg

Main category: cs.LG

TL;DR: This paper analyzes topological descriptors (Euler characteristic and persistent homology) on graph products, showing that persistent homology captures more information than individual graphs, while Euler characteristic does not. It provides algorithms for computing these descriptors and validates findings empirically.

DetailsMotivation: To enhance the expressive power of topological descriptors for relational data by exploring graph product filtrations, addressing limitations of existing methods in capturing multiscale structural information.

Method: Theoretical analysis of various filtrations on graph products, characterization of Euler characteristic’s expressive power, development of algorithms for computing persistent homology diagrams on vertex- and edge-level filtrations of graph products.

Result: Persistent homology descriptors on graph products contain strictly more information than individual graphs, while Euler characteristic does not. Empirical validation shows improved expressivity and graph classification performance.

Conclusion: Graph product filtrations enable more powerful persistent descriptors for graph analysis, with persistent homology providing enhanced discriminative capabilities compared to traditional approaches.

Abstract: Topological descriptors have been increasingly utilized for capturing multiscale structural information in relational data. In this work, we consider various filtrations on the (box) product of graphs and the effect on their outputs on the topological descriptors - the Euler characteristic (EC) and persistent homology (PH). In particular, we establish a complete characterization of the expressive power of EC on general color-based filtrations. We also show that the PH descriptors of (virtual) graph products contain strictly more information than the computation on individual graphs, whereas EC does not. Additionally, we provide algorithms to compute the PH diagrams of the product of vertex- and edge-level filtrations on the graph product. We also substantiate our theoretical analysis with empirical investigations on runtime analysis, expressivity, and graph classification performance. Overall, this work paves way for powerful graph persistent descriptors via product filtrations. Code is available at https://github.com/Aalto-QuML/tda_graph_product.
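
For the Euler characteristic, a graph viewed as a 1-complex has $\chi = |V| - |E|$, and the box product's counts follow directly from the factors. The paper's results concern color-based filtrations of such products, but the unfiltered computation is elementary:

```python
def euler_char(num_vertices, num_edges):
    """Euler characteristic of a graph viewed as a 1-complex: |V| - |E|."""
    return num_vertices - num_edges

def box_product_counts(nV_g, nE_g, nV_h, nE_h):
    """Vertex/edge counts of the box (Cartesian) product of two graphs:
    vertices are pairs; each edge pairs an edge of one factor with a
    vertex of the other."""
    return nV_g * nV_h, nV_g * nE_h + nV_h * nE_g

# Example: the box product of two 4-cycles (each: 4 vertices, 4 edges)
V, E = box_product_counts(4, 4, 4, 4)   # 16 vertices, 32 edges
```

Because these counts depend only on $(|V|, |E|)$ of the factors, the unfiltered EC of the product adds nothing beyond the individual graphs, consistent with the paper's finding that EC, unlike PH, gains no information from products.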

[324] Decomposition of Small Transformer Models

Casper L. Christensen, Logan Riggs

Main category: cs.LG

TL;DR: Extends Stochastic Parameter Decomposition (SPD) to Transformer models with updated causal importance function and loss function, demonstrating successful decomposition of toy induction-head model and identification of interpretable concepts in GPT-2-small.

DetailsMotivation: Bridge the gap between mechanistic interpretability methods tested on toy models and their application to real-world models like Transformers, enabling parameter-space analysis and intervention.

Method: Extended Stochastic Parameter Decomposition (SPD) with updated causal importance function for sequential data and new loss function, applied to Transformer models including toy induction-head model and GPT-2-small.

Result: Successfully decomposed a toy induction-head model, recovering the expected 2-step circuit, and located subcomponents corresponding to interpretable concepts like ‘golf’ and ‘basketball’ in GPT-2-small.

Conclusion: First step in extending SPD to modern models, showing the method can surface interpretable parameter-space mechanisms in real-world neural networks.

Abstract: Recent work in mechanistic interpretability has shown that decomposing models in parameter space may yield clean handles for analysis and intervention. Previous methods have demonstrated successful applications on a wide range of toy models, but the gap to “real models” has not yet been bridged. In this work, we extend Stochastic Parameter Decomposition (SPD) to Transformer models, proposing an updated causal importance function suited for sequential data and a new loss function. We demonstrate that SPD can successfully decompose a toy induction-head model and recover the expected 2-step circuit. We also show that applying SPD to GPT-2-small can successfully locate subcomponents corresponding to interpretable concepts like “golf” and “basketball”. These results take the first step in the direction of extending SPD to modern models, and show that we can use the method to surface interpretable parameter-space mechanisms.

[325] ForeSWE: Forecasting Snow-Water Equivalent with an Uncertainty-Aware Attention Model

Krishu K Thapa, Supriya Savalkar, Bhupinderjeet Singh, Trong Nghia Hoang, Kirti Rajagopalan, Ananth Kalyanaraman

Main category: cs.LG

TL;DR: ForeSWE: A probabilistic spatio-temporal forecasting model for Snow-Water Equivalent (SWE) that combines deep learning with Gaussian processes to improve accuracy and provide uncertainty estimates.

DetailsMotivation: SWE forecasting is challenging due to spatio-temporal variability influenced by topography and environmental factors. Classical approaches fail to utilize spatial/temporal correlations and lack uncertainty estimates needed for water management decisions.

Method: Integrates deep learning with probabilistic techniques, using attention mechanisms for spatiotemporal features and Gaussian processes for uncertainty quantification. Evaluated on 512 SNOTEL stations in the Western US.

Result: Significant improvements in forecasting accuracy and prediction intervals over existing approaches, along with a clear comparison of the quality of uncertainty estimates across methods.

Conclusion: Provides a deployable platform for water management community with improved SWE forecasting and reliable uncertainty quantification.

Abstract: Various complex water management decisions are made in snow-dominant watersheds with the knowledge of Snow-Water Equivalent (SWE) – a key measure widely used to estimate the water content of a snowpack. However, forecasting SWE is challenging because SWE is influenced by various factors including topography and an array of environmental conditions, and has therefore been observed to be spatio-temporally variable. Classical approaches to SWE forecasting have not adequately utilized these spatial/temporal correlations, nor do they provide uncertainty estimates – which can be of significant value to the decision maker. In this paper, we present ForeSWE, a new probabilistic spatio-temporal forecasting model that integrates deep learning and classical probabilistic techniques. The resulting model features a combination of an attention mechanism to integrate spatiotemporal features and interactions, alongside a Gaussian process module that provides principled quantification of prediction uncertainty. We evaluate the model on data from 512 Snow Telemetry (SNOTEL) stations in the Western US. The results show significant improvements in both forecasting accuracy and prediction interval compared to existing approaches. The results also serve to highlight the efficacy in uncertainty estimates between different approaches. Collectively, these findings have provided a platform for deployment and feedback by the water management community.

[326] EEG-X: Device-Agnostic and Noise-Robust Foundation Model for EEG

Navid Mohammadi Foumani, Soheila Ghane, Nam Nguyen, Mahsa Salehi, Geoffrey I. Webb, Geoffrey Mackellar

Main category: cs.LG

TL;DR: EEG-X is a novel foundation model for EEG analysis that addresses device variability and noise issues through location-based channel embeddings and noise-aware training strategies, achieving state-of-the-art performance across diverse EEG tasks.

DetailsMotivation: Current EEG foundation models face challenges with dataset variability from different recording devices and low signal-to-noise ratio where brain signals are buried under artifacts.

Method: Uses location-based channel embeddings for device-agnostic processing, noise-aware masking/reconstruction with denoised signals, and dictionary-inspired convolutional transformation (DiCT) layers for structured feature learning.

Result: Outperforms state-of-the-art methods across multiple EEG tasks and excels in cross-domain settings with different electrode layouts.

Conclusion: EEG-X provides an effective foundation model for EEG analysis that handles device variability and noise robustness, enabling better generalization across domains and tasks.

Abstract: Foundation models for EEG analysis are still in their infancy, limited by two key challenges: (1) variability across datasets caused by differences in recording devices and configurations, and (2) the low signal-to-noise ratio (SNR) of EEG, where brain signals are often buried under artifacts and non-brain sources. To address these challenges, we present EEG-X, a device-agnostic and noise-robust foundation model for EEG representation learning. EEG-X introduces a novel location-based channel embedding that encodes spatial information and improves generalization across domains and tasks by allowing the model to handle varying channel numbers, combinations, and recording lengths. To enhance robustness against noise, EEG-X employs a noise-aware masking and reconstruction strategy in both raw and latent spaces. Unlike previous models that mask and reconstruct raw noisy EEG signals, EEG-X is trained to reconstruct denoised signals obtained through an artifact removal process, ensuring that the learned representations focus on neural activity rather than noise. To further enhance reconstruction-based pretraining, EEG-X introduces a dictionary-inspired convolutional transformation (DiCT) layer that projects signals into a structured feature space before computing reconstruction (MSE) loss, reducing noise sensitivity and capturing frequency- and shape-aware similarities. Experiments on datasets collected from diverse devices show that EEG-X outperforms state-of-the-art methods across multiple downstream EEG tasks and excels in cross-domain settings where pre-trained and downstream datasets differ in electrode layouts. The models and code are available at: https://github.com/Emotiv/EEG-X

[327] Transformer-Based Sleep Stage Classification Enhanced by Clinical Information

Woosuk Chung, Seokwoo Hong, Wonhyeok Lee, Sangyoon Bae

Main category: cs.LG

TL;DR: A two-stage deep learning model for automated sleep staging that incorporates clinical metadata and expert event annotations to improve accuracy and align with human expert practices.

DetailsMotivation: Manual sleep staging is labor-intensive and variable between experts, while current automated models ignore contextual cues that human experts use, such as clinical information and event annotations.

Method: Two-stage architecture combining Transformer-based per-epoch encoder with 1D CNN aggregator, systematically incorporating subject-level clinical metadata (age, sex, BMI) and per-epoch expert event annotations (apneas, desaturations, arousals, periodic breathing).

Result: Substantial improvement over PSG-only baseline: macro-F1 increased from 0.7745 to 0.8031, micro-F1 from 0.8774 to 0.9051. Event annotations provided the largest performance gains, and feature fusion outperformed multi-task alternatives.

Conclusion: Augmenting learned representations with clinically meaningful features enhances both performance and interpretability without requiring additional sensors, supporting a practical path toward context-aware, expert-aligned sleep staging systems.

Abstract: Manual sleep staging from polysomnography (PSG) is labor-intensive and prone to inter-scorer variability. While recent deep learning models have advanced automated staging, most rely solely on raw PSG signals and neglect contextual cues used by human experts. We propose a two-stage architecture that combines a Transformer-based per-epoch encoder with a 1D CNN aggregator, and systematically investigates the effect of incorporating explicit context: subject-level clinical metadata (age, sex, BMI) and per-epoch expert event annotations (apneas, desaturations, arousals, periodic breathing). Using the Sleep Heart Health Study (SHHS) cohort (n=8,357), we demonstrate that contextual fusion substantially improves staging accuracy. Compared to a PSG-only baseline (macro-F1 0.7745, micro-F1 0.8774), our final model achieves macro-F1 0.8031 and micro-F1 0.9051, with event annotations contributing the largest gains. Notably, feature fusion outperforms multi-task alternatives that predict the same auxiliary labels. These results highlight that augmenting learned representations with clinically meaningful features enhances both performance and interpretability, without modifying the PSG montage or requiring additional sensors. Our findings support a practical and scalable path toward context-aware, expert-aligned sleep staging systems.

[328] Covariance Scattering Transforms

Andrea Cavallo, Ayushman Raghuvanshi, Sundeep Prabhakar Chepuri, Elvin Isufi

Main category: cs.LG

TL;DR: Covariance Scattering Transforms (CSTs) are proposed as deep untrained networks that process data through covariance spectrum filters, providing stable and expressive representations without training, outperforming PCA in low-data settings.

DetailsMotivation: Traditional covariance-based methods like PCA fail to capture low-variance information and are unstable with close eigenvalues. While VNNs improve stability, they require training and labeled data. CSTs aim to combine the benefits of both approaches.

Method: CSTs use deep untrained networks with covariance wavelets as filters localized in the covariance spectrum. They apply these filters sequentially to the input data, interleaved with nonlinearities, and include a pruning mechanism for efficiency. The method is provably less sensitive than PCA to finite-sample covariance estimation errors.

Result: Experiments on age prediction from cortical thickness in neurodegenerative disease datasets show CSTs produce stable representations in low-data settings, comparable to VNNs but without training, and achieve comparable or better predictions than more complex learning models.

Conclusion: CSTs provide a training-free alternative that combines the stability of VNNs with the unsupervised nature of PCA, offering improved performance in low-sample regimes and handling of close covariance eigenvalues.

Abstract: Machine learning and data processing techniques relying on covariance information are widespread as they identify meaningful patterns in unsupervised and unlabeled settings. As a prominent example, Principal Component Analysis (PCA) projects data points onto the eigenvectors of their covariance matrix, capturing the directions of maximum variance. This mapping, however, falls short in two directions: it fails to capture information in low-variance directions, relevant when, e.g., the data contains high-variance noise; and it provides unstable results in low-sample regimes, especially when covariance eigenvalues are close. CoVariance Neural Networks (VNNs), i.e., graph neural networks using the covariance matrix as a graph, show improved stability to estimation errors and learn more expressive functions in the covariance spectrum than PCA, but require training and operate in a labeled setup. To get the benefits of both worlds, we propose Covariance Scattering Transforms (CSTs), deep untrained networks that sequentially apply filters localized in the covariance spectrum to the input data and produce expressive hierarchical representations via nonlinearities. We define the filters as covariance wavelets that capture specific and detailed covariance spectral patterns. We improve CSTs’ computational and memory efficiency via a pruning mechanism, and we prove that their error due to finite-sample covariance estimations is less sensitive to close covariance eigenvalues compared to PCA, improving their stability. Our experiments on age prediction from cortical thickness measurements on 4 datasets collecting patients with neurodegenerative diseases show that CSTs produce stable representations in low-data settings, as VNNs but without any training, and lead to comparable or better predictions w.r.t. more complex learning models.
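
A minimal untrained cascade in the spirit of CSTs (Gaussian spectral bumps stand in for the paper's covariance wavelets; all names here are ours): filter in the covariance eigenbasis, take a pointwise nonlinearity, repeat, and average each path into a coefficient.

```python
import numpy as np

def cov_wavelet_filters(eigvals, centers, width=0.2):
    """Hypothetical spectral filters: Gaussian bumps localized at chosen
    points of the (normalized) covariance spectrum."""
    lam = eigvals[None, :]              # (1, d)
    c = np.asarray(centers)[:, None]    # (J, 1)
    return np.exp(-((lam - c) ** 2) / (2 * width ** 2))   # (J, d)

def covariance_scattering(X, centers, depth=2):
    """Toy scattering cascade on the covariance spectrum of X (n, d):
    project into the covariance eigenbasis, filter, apply |.|, recurse."""
    C = np.cov(X, rowvar=False)
    lam, U = np.linalg.eigh(C)
    H = cov_wavelet_filters(lam / lam.max(), centers)
    signals = [X - X.mean(0)]
    coeffs = []
    for _ in range(depth):
        nxt = []
        for s in signals:
            s_hat = s @ U                        # into the eigenbasis
            for h in H:
                filtered = (s_hat * h) @ U.T     # filter, then back
                out = np.abs(filtered)           # nonlinearity
                coeffs.append(out.mean())        # scattering coefficient
                nxt.append(out)
        signals = nxt
    return np.array(coeffs)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
coeffs = covariance_scattering(X, centers=[0.2, 0.9])   # J + J^2 = 6 paths
```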

[329] Spectral Predictability as a Fast Reliability Indicator for Time Series Forecasting Model Selection

Oliver Wang, Pengrui Quan, Kang Yang, Mani Srivastava

Main category: cs.LG

TL;DR: Spectral predictability (Ω) is a fast metric that stratifies time series model performance, enabling efficient model selection by identifying when large foundation models outperform simpler alternatives.

DetailsMotivation: Practitioners face computational challenges in validating multiple time series models, risking poor performance with wrong model choices. A simple metric is needed to guide efficient model selection.

Method: Used spectral predictability (Ω) - a signal processing metric - to analyze model performance across controlled experiments in four domains and expanded to 51 models and 28 datasets from GIFT-Eval benchmark.

Result: Large time series foundation models (TSFMs) systematically outperform lightweight baselines when Ω is high, but their advantage disappears as Ω drops. Computing Ω takes seconds per dataset.

Conclusion: Ω provides a practical first-pass filter that reduces validation costs while highlighting the need for models that excel on genuinely difficult (low-Ω) problems rather than optimizing easy ones.

Abstract: Practitioners deploying time series forecasting models face a dilemma: exhaustively validating dozens of models is computationally prohibitive, yet choosing the wrong model risks poor performance. We show that spectral predictability $Ω$ – a simple signal processing metric – systematically stratifies model family performance, enabling fast model selection. We conduct controlled experiments in four different domains, then further expand our analysis to 51 models and 28 datasets from the GIFT-Eval benchmark. We find that large time series foundation models (TSFMs) systematically outperform lightweight task-trained baselines when $Ω$ is high, while their advantage vanishes as $Ω$ drops. Computing $Ω$ takes seconds per dataset, enabling practitioners to quickly assess whether their data suits TSFM approaches or whether simpler, cheaper models suffice. We demonstrate that $Ω$ stratifies model performance predictably, offering a practical first-pass filter that reduces validation costs while highlighting the need for models that excel on genuinely difficult (low-$Ω$) problems rather than merely optimizing easy ones.
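
The abstract does not reproduce the formula, but a common proxy for spectral predictability is one minus the normalized spectral entropy of the series; the sketch below uses that definition, which may differ in detail from the authors' exact $Ω$:

```python
import numpy as np

def spectral_predictability(x):
    """Proxy for spectral predictability: 1 minus the normalized spectral
    entropy of the series (the paper's exact definition of Omega may
    differ). Near 1 for a pure sinusoid, near 0 for white noise."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    psd = np.abs(np.fft.rfft(x)) ** 2
    p = psd / psd.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return 1.0 - entropy / np.log(len(psd))
```

Because it is a single FFT plus an entropy sum, this runs in well under a second per series, consistent with the "seconds per dataset" claim.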

[330] FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis

Tianming Sha, Zechuan Chen, Zhan Cheng, Haotian Zhai, Xuwei Ding, Junnan Li, Haixiang Tang, Zaoting Sun, Yanchuan Tang, Yongzhe Yi, Yanjie Huang, Anhao Li, Yuan Gao, Keze Wang

Main category: cs.LG

TL;DR: FAST-CAD is a fair stroke diagnosis framework combining domain-adversarial training and group distributionally robust optimization to address demographic fairness issues in automated medical diagnosis.

DetailsMotivation: Existing automated stroke diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities. The need for timely and equitable stroke diagnosis motivates the development of fair AI systems.

Method: Combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO). Uses self-supervised encoders with adversarial domain discrimination to learn demographic-invariant representations, while Group-DRO optimizes worst-group risk across 12 demographic subgroups defined by age, gender, and posture.

Result: Extensive experiments show superior diagnostic performance while maintaining fairness across all demographic subgroups. The unified DAT + Group-DRO framework achieves robust performance across diverse patient populations.

Conclusion: FAST-CAD provides both practical advances and theoretical insights for fair medical AI systems, with convergence guarantees and fairness bounds supported by theoretical analysis. The framework effectively addresses demographic fairness in stroke diagnosis.

Abstract: Stroke is an acute cerebrovascular disease, and timely diagnosis significantly improves patient survival. However, existing automated diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities. In this work we propose FAST-CAD, a theoretically grounded framework that combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO) for fair and accurate non-contact stroke diagnosis. Our approach is built on domain adaptation and minimax fairness theory and provides convergence guarantees and fairness bounds. We curate a multimodal dataset covering 12 demographic subgroups defined by age, gender, and posture. FAST-CAD employs self-supervised encoders with adversarial domain discrimination to learn demographic-invariant representations, while Group-DRO optimizes worst-group risk to ensure robust performance across all subgroups. Extensive experiments show that our method achieves superior diagnostic performance while maintaining fairness across demographic groups, and our theoretical analysis supports the effectiveness of the unified DAT + Group-DRO framework. This work provides both practical advances and theoretical insights for fair medical AI systems.
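
The Group-DRO component can be illustrated with the standard exponentiated-gradient update over subgroup weights (upweighting the worst-performing demographic subgroups); the learning rate and per-group losses below are hypothetical, not the paper's values:

```python
import numpy as np

def group_dro_weights(group_losses, q, eta=0.1):
    """One Group-DRO step (exponentiated-gradient style): groups with
    higher loss receive exponentially larger weight, so the model
    optimizes worst-group risk rather than average risk."""
    q = q * np.exp(eta * np.asarray(group_losses))
    return q / q.sum()

# toy example with 12 subgroups, as in FAST-CAD's age/gender/posture split
q = np.full(12, 1 / 12)
losses = np.linspace(0.1, 1.2, 12)   # hypothetical per-group losses
for _ in range(50):
    q = group_dro_weights(losses, q)
# the highest-loss subgroup ends up with the largest weight
```

The training loss would then be the q-weighted sum of per-group losses, combined with the adversarial domain-discrimination term described above.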

[331] Weaver: Kronecker Product Approximations of Spatiotemporal Attention for Traffic Network Forecasting

Christopher Cheong, Gary Davis, Seongjin Choi

Main category: cs.LG

TL;DR: Weaver is an efficient spatiotemporal forecasting model for transportation networks that uses Kronecker product approximations to decompose complex attention mechanisms, achieving competitive performance with lower computational overhead.

DetailsMotivation: Existing Transformer-based models for traffic forecasting suffer from high computational complexity and poor interpretability, while transportation networks require efficient, robust, and interpretable forecasting models.

Method: Uses Kronecker product approximations to decompose spatiotemporal attention into separate temporal and spatial components, introduces Valence Attention with Tanimoto coefficient for negative edge modeling, and employs Traffic Phase Dictionary for self-conditioning.

Result: Achieves competitive performance on PEMS-BAY and METR-LA datasets while training more efficiently than existing approaches.

Conclusion: Weaver provides an efficient and effective solution for spatiotemporal forecasting on transportation networks with improved computational efficiency and modeling capabilities.

Abstract: Spatiotemporal forecasting on transportation networks is a complex task that requires understanding how traffic nodes interact within a dynamic, evolving system dictated by traffic flow dynamics and social behavioral patterns. The importance of transportation networks and ITS for modern mobility and commerce necessitates forecasting models that are not only accurate but also interpretable, efficient, and robust under structural or temporal perturbations. Recent approaches, particularly Transformer-based architectures, have improved predictive performance but often at the cost of high computational overhead and diminished architectural interpretability. In this work, we introduce Weaver, a novel attention-based model that applies Kronecker product approximations (KPA) to decompose the PN × PN spatiotemporal attention of O(P^2N^2) complexity into local P × P temporal and N × N spatial attention maps. This Kronecker attention map enables our Parallel-Kronecker Matrix-Vector product (P2-KMV) for efficient spatiotemporal message passing with O(P^2N + N^2P) complexity. To capture real-world traffic dynamics, we address the importance of negative edges in modeling traffic behavior by introducing Valence Attention using the continuous Tanimoto coefficient (CTC), which provides properties conducive to precise latent graph generation and training stability. To fully utilize the model’s learning capacity, we introduce the Traffic Phase Dictionary for self-conditioning. Evaluations on PEMS-BAY and METR-LA show that Weaver achieves competitive performance across model categories while training more efficiently.
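
The complexity claim rests on a standard Kronecker identity: the full PN × PN attention never needs to be materialized. A minimal numpy check, with random stand-ins for the learned temporal and spatial maps:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 4, 6                      # time steps, sensors
T = rng.standard_normal((P, P))  # temporal attention map (stand-in)
S = rng.standard_normal((N, N))  # spatial attention map (stand-in)
x = rng.standard_normal(P * N)   # flattened spatiotemporal signal

# naive: materialize the PN x PN attention, O(P^2 N^2) memory/compute
y_naive = np.kron(T, S) @ x

# Kronecker identity: (T kron S) vec(X) = vec(T X S^T) for row-major vec,
# costing O(P^2 N + N^2 P) -- the idea behind Weaver's P2-KMV product
X = x.reshape(P, N)
y_fast = (T @ X @ S.T).reshape(-1)

assert np.allclose(y_naive, y_fast)
```

The fast path does two small matrix products instead of one product with a PN × PN matrix, which is where the stated complexity reduction comes from.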

[332] DeepDR: an integrated deep-learning model web server for drug repositioning

Shuting Jin, Yi Jiang, Yimin Liu, Tengfei Ma, Dongsheng Cao, Leyi Wei, Xiangrong Liu, Xiangxiang Zeng

Main category: cs.LG

TL;DR: DeepDR is an integrated platform that uses deep learning models for drug repositioning, combining multiple networks and a comprehensive knowledge graph to recommend candidate drugs with interpretability.

DetailsMotivation: Drug repositioning is complex and time-consuming, requiring deep domain knowledge. Existing DL methods show promise but are difficult to implement without programming expertise.

Method: Leverages 15+ networks and a knowledge graph with 5.9M edges across 107 relationship types from 6 databases and 24M PubMed publications, using established DL models for disease- and target-specific repositioning.

Result: Provides systematic drug recommendations with detailed descriptions and visual interpretability through knowledge graphs.

Conclusion: DeepDR offers a free, easy-to-use automated platform for both experimental and computational scientists, requiring no registration.

Abstract: Background: Identifying new indications for approved drugs is a complex and time-consuming process that requires extensive knowledge of pharmacology, clinical data, and advanced computational methods. Recently, deep learning (DL) methods have shown their capability for the accurate prediction of drug repositioning. However, implementing DL-based modeling requires in-depth domain knowledge and proficient programming skills. Results: In this application, we introduce DeepDR, the first integrated platform that combines a variety of established DL-based models for disease- and target-specific drug repositioning tasks. DeepDR recommends candidate drugs by drawing on more than 15 networks and a comprehensive knowledge graph that includes 5.9 million edges across 107 types of relationships connecting drugs, diseases, proteins/genes, pathways, and expression, built from six existing databases and a large scientific corpus of 24 million PubMed publications. Additionally, the results include detailed descriptions of the recommended drugs and visualize key patterns through the knowledge graph for interpretability. Conclusion: DeepDR is free and open to all users without the requirement of registration. We believe it can provide an easy-to-use, systematic, highly accurate, and computationally automated platform for both experimental and computational scientists.

[333] Diffusion Policies with Value-Conditional Optimization for Offline Reinforcement Learning

Yunchang Ma, Tenglong Liu, Yixing Lan, Xin Yin, Changxin Zhang, Xinglong Zhang, Xin Xu

Main category: cs.LG

TL;DR: DIVO is a diffusion-based offline RL method that uses value-conditional optimization to balance conservatism and exploration, achieving state-of-the-art performance on D4RL benchmarks.

DetailsMotivation: Address value overestimation from OOD actions in offline RL and overcome excessive conservatism in existing diffusion methods that apply indiscriminate regularization to redundant actions.

Method: Introduces DIVO with binary-weighted mechanism using advantage values to guide diffusion training, enabling precise dataset alignment while selectively expanding high-advantage action boundaries. Dynamically filters high-return-potential actions during policy improvement.

Result: Achieves superior performance on D4RL benchmark, with significant improvements in average returns across locomotion tasks and outperforms existing methods in challenging AntMaze domain with sparse rewards.

Conclusion: DIVO effectively balances conservatism and explorability in offline RL through value-conditional optimization of diffusion policies, demonstrating strong performance across diverse tasks.

Abstract: In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset’s distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.
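
DIVO's binary-weighted mechanism can be sketched as an indicator weight on the advantage, so only high-advantage dataset actions contribute to diffusion training. The loss function here is a toy stand-in for the actual denoising objective:

```python
import numpy as np

def binary_advantage_weights(q_values, v_values):
    """Sketch of the binary-weighted mechanism: actions with positive
    advantage A(s,a) = Q(s,a) - V(s) get weight 1, the rest weight 0."""
    adv = np.asarray(q_values) - np.asarray(v_values)
    return (adv > 0).astype(float)

def weighted_diffusion_loss(per_sample_losses, weights):
    """Weighted behavior-cloning-style objective (toy stand-in for the
    diffusion model's denoising loss)."""
    w = np.asarray(weights)
    return (w * np.asarray(per_sample_losses)).sum() / max(w.sum(), 1.0)
```

Because redundant low-advantage actions receive zero weight, the diffusion model aligns with the dataset's distribution only where it is worth imitating, which is the balance between conservatism and explorability the summary describes.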

[334] TransactionGPT

Yingtong Dou, Zhimeng Jiang, Tianyi Zhang, Mingzhi Hu, Zhichao Xu, Shubham Jain, Uday Singh Saini, Xiran Fan, Jiarui Sun, Menghai Pan, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yujie Fan, Vineeth Rakesh, Huiyuan Chen, Mangesh Bendre, Zhongfang Zhuang, Xiaoting Li, Prince Aboagye, Vivian Lai, Minghua Xu, Hao Yang, Yiwei Cai, Mahashweta Das, Yuzhong Chen

Main category: cs.LG

TL;DR: TransactionGPT (TGPT) is a foundation model for consumer transaction data that uses a novel 3D-Transformer architecture to understand and generate transaction trajectories while supporting various downstream tasks.

DetailsMotivation: To create a specialized foundation model that can effectively handle the complex dynamics of payment transaction data, which existing models may not capture optimally.

Method: Developed a novel 3D-Transformer architecture with innovations for modality fusion and computational efficiency, trained on billion-scale real-world transactions, and incorporates LLM-derived embeddings.

Result: TGPT significantly improves downstream classification performance over production models, shows advantages in generating future transactions, and achieves superior predictive accuracy with faster training/inference compared to fine-tuned LLMs.

Conclusion: The architectural innovations and practical guidelines from TGPT advance foundation models for transaction-like data and will catalyze future research in this emerging field.

Abstract: We present TransactionGPT (TGPT), a foundation model for consumer transaction data within one of the world’s largest payment networks. TGPT is designed to understand and generate transaction trajectories while simultaneously supporting a variety of downstream prediction and classification tasks. We introduce a novel 3D-Transformer architecture specifically tailored for capturing the complex dynamics in payment transaction data. This architecture incorporates design innovations that enhance modality fusion and computational efficiency, while seamlessly enabling joint optimization with downstream objectives. Trained on billion-scale real-world transactions, TGPT significantly improves downstream classification performance against a competitive production model and exhibits advantages over baselines in generating future transactions. We conduct extensive empirical evaluations utilizing a diverse collection of company transaction datasets spanning multiple downstream tasks, thereby enabling a thorough assessment of TGPT’s effectiveness and efficiency in comparison to established methodologies. Furthermore, we examine the incorporation of LLM-derived embeddings within TGPT and benchmark its performance against fine-tuned LLMs, demonstrating that TGPT achieves superior predictive accuracy as well as faster training and inference. We anticipate that the architectural innovations and practical guidelines from this work will advance foundation models for transaction-like data and catalyze future research in this emerging field.

[335] QIBONN: A Quantum-Inspired Bilevel Optimizer for Neural Networks on Tabular Classification

Pedro Chumpitaz-Flores, My Duong, Ying Mao, Kaixun Hua

Main category: cs.LG

TL;DR: QIBONN is a quantum-inspired bilevel optimizer for neural network hyperparameter optimization on tabular data, using qubit-based representation and balancing exploration-exploitation under noise.

DetailsMotivation: Hyperparameter optimization for neural networks on tabular data is challenging due to large search spaces and high tuning costs, requiring efficient methods that can handle these constraints.

Method: Bilevel framework with unified qubit-based representation combining deterministic quantum-inspired rotations and stochastic qubit mutations guided by global attractor, tested under simulated quantum noise.

Result: Competitive performance with established classical and quantum-inspired methods across 13 real-world datasets under the same tuning budget, demonstrating robustness to noise.

Conclusion: QIBONN provides an effective quantum-inspired approach for neural network HPO that handles complex search spaces efficiently while maintaining performance comparable to state-of-the-art methods.

Abstract: Hyperparameter optimization (HPO) for neural networks on tabular data is critical to a wide range of applications, yet it remains challenging due to large, non-convex search spaces and the cost of exhaustive tuning. We introduce the Quantum-Inspired Bilevel Optimizer for Neural Networks (QIBONN), a bilevel framework that encodes feature selection, architectural hyperparameters, and regularization in a unified qubit-based representation. By combining deterministic quantum-inspired rotations with stochastic qubit mutations guided by a global attractor, QIBONN balances exploration and exploitation under a fixed evaluation budget. We conduct systematic experiments under single-qubit bit-flip noise (0.1%–1%) emulated by an IBM-Q backend. Results on 13 real-world datasets indicate that QIBONN is competitive with established methods, including classical tree-based methods and both classical/quantum-inspired HPO algorithms under the same tuning budget.
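
A generic quantum-inspired encoding of the kind the abstract describes: each hyperparameter bit is a qubit angle, with deterministic rotations toward a global attractor and stochastic qubit mutations. The step size and mutation rate are hypothetical, not QIBONN's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def observe(theta):
    """Collapse qubits to a bitstring: P(bit = 1) = sin^2(theta)."""
    return (rng.random(theta.shape) < np.sin(theta) ** 2).astype(int)

def rotate_toward(theta, attractor_bits, delta=0.05 * np.pi):
    """Deterministic quantum-inspired rotation: nudge each qubit angle
    toward the bit value of the global attractor (best solution so far)."""
    target = np.where(attractor_bits == 1, np.pi / 2, 0.0)
    step = np.minimum(delta, np.abs(target - theta))
    return theta + np.sign(target - theta) * step

def mutate(theta, rate=0.02):
    """Stochastic qubit mutation: reflect a few angles about pi/4,
    swapping P(0) and P(1) for those qubits."""
    mask = rng.random(theta.shape) < rate
    return np.where(mask, np.pi / 2 - theta, theta)
```

In a bilevel loop, the observed bitstring would decode to feature masks and architectural hyperparameters (outer level), each evaluated by training the corresponding network (inner level).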

[336] Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

Kazuki Iwahana, Yusuke Yamasaki, Akira Ito, Takayuki Miura, Toshiki Shibahara

Main category: cs.LG

TL;DR: A novel backdoor removal method that accurately reconstructs Trigger-Activated Changes (TAC) values through convex quadratic optimization to identify and remove backdoor neurons, achieving superior performance across multiple datasets and attack types.

DetailsMotivation: Existing backdoor defense methods suffer from low precision in identifying true backdoor neurons due to inaccurate estimation of TAC values, which are the activation differences between clean and poisoned data.

Method: Formulate minimal perturbation that forces clean data to be classified into specific classes as a convex quadratic optimization problem, use optimal solution as surrogate for TAC, identify poisoned class via small L2 norms of perturbations, and leverage perturbation in fine-tuning to remove backdoors.

Result: Experiments on CIFAR-10, GTSRB, and TinyImageNet show consistent superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.

Conclusion: The proposed method effectively addresses the limitations of existing TAC-based defenses by accurately reconstructing TAC values through optimization, providing a robust solution for backdoor removal in machine learning models.

Abstract: Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC) which is the activation differences between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons due to inaccurate estimation of TAC values. In this work, we propose a novel backdoor removal method by accurately reconstructing TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small $L^2$ norms of perturbations and leverage the perturbation of the poisoned class in fine-tuning to remove backdoors. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrated that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.
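
For a linear classification head, the minimal-perturbation problem the abstract formulates is a convex QP. The sketch below solves it heuristically by repeatedly projecting onto the most violated class-margin halfspace; this is an illustration of the idea, not the authors' exact solver:

```python
import numpy as np

def minimal_perturbation(z, W, b, target):
    """Find a small delta such that the latent vector z + delta is
    classified as `target` by the linear head (W, b), via projection
    onto the most violated class-margin halfspace."""
    delta = np.zeros_like(z)
    for _ in range(100):
        logits = W @ (z + delta) + b
        j = int(np.argmax(logits))
        if j == target:
            break
        g = W[target] - W[j]                      # margin direction
        m = g @ (z + delta) + (b[target] - b[j])  # negative margin
        delta -= (m / (g @ g)) * g * 1.001        # step just past boundary
    return delta
```

Per the paper, the class whose perturbations have statistically small L2 norms is flagged as the poisoned class, and those perturbations then drive the backdoor-removal fine-tuning.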

[337] Improving Conditional VAE with approximation using Normalizing Flows

Tuhin Subhra De

Main category: cs.LG

TL;DR: The paper explores conditional Variational Autoencoders (CVAE) for image generation, addressing blurry outputs and limited diversity by making the Gaussian decoder variance learnable and using normalizing flows to better estimate the latent space conditional distribution.

DetailsMotivation: Traditional generative models like VAEs and GANs have been superseded by diffusion models, but the authors aim to improve CVAEs by addressing known issues like blurry images and incorrect assumptions about latent space conditional distributions.

Method: Proposes using learnable variance for the Gaussian decoder and employing normalizing flows to accurately estimate the conditional distribution of latent space given labels, rather than assuming it equals the prior distribution.

Result: The method achieves 5% reduction in FID score and 7.7% increase in log likelihood compared to previous CVAE approaches, demonstrating improved image generation quality.

Conclusion: By addressing key limitations in traditional CVAEs through learnable variance and proper conditional distribution estimation, the authors show that CVAEs can still achieve competitive performance in image generation tasks.

Abstract: Variational Autoencoders and Generative Adversarial Networks remained the state-of-the-art (SOTA) generative models until 2022; they have since been superseded by diffusion-based models, and efforts to improve the traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAE) to incorporate desired attributes within the images. VAEs are known to produce blurry images with limited diversity; we adopt a method that addresses this issue by treating the variance of the Gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels equals the prior distribution, which is not the case in reality. We show that estimating it using normalizing flows yields better image generation than existing methods, reducing the FID by 5% and increasing the log likelihood by 7.7% over the previous approach.
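
The learnable decoder variance works because the Gaussian NLL trades reconstruction error against a log-sigma penalty, and the optimal sigma even has a closed form (the mean squared error). A minimal sketch of the calibrated-decoder idea the abstract refers to:

```python
import numpy as np

def gaussian_decoder_nll(x, mu, log_sigma):
    """Negative log-likelihood of a Gaussian decoder with a learnable
    (shared) log standard deviation: fixing sigma = 1, as vanilla VAEs
    do, over-penalizes reconstruction and yields blurry samples."""
    s2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum((x - mu) ** 2 / s2 + 2 * log_sigma + np.log(2 * np.pi))

def optimal_log_sigma(x, mu):
    """Closed-form minimizer: sigma^2 equals the mean squared error."""
    return 0.5 * np.log(np.mean((x - mu) ** 2))
```

The flow component of the paper would separately replace the assumption p(z | y) = p(z) with a density estimated by normalizing flows; that part is not sketched here.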

[338] Bayesian Mixture of Experts For Large Language Models

Maryam Dialameh, Hossein Rajabzadeh, Weiwei Zhang, Walid Ahmed, Hyock Ju Kwon

Main category: cs.LG

TL;DR: Bayesian Mixture of Experts (Bayesian-MoE) is a post-hoc uncertainty estimation framework for fine-tuned LLMs that uses structured Laplace approximation on MoE expert layers for calibrated uncertainty without changing training or adding parameters.

DetailsMotivation: To provide reliable uncertainty estimation for fine-tuned large language models, especially Mixture-of-Experts models, without modifying the original training procedure or introducing additional parameters.

Method: Applies structured Laplace approximation to the second linear layer of each expert in MoE models, uses Kronecker-factored low-rank approximations for curvature modeling, and performs block-wise posterior estimation on existing expert pathways.

Result: Experiments on Qwen1.5-MoE and DeepSeek-MoE show improved expected calibration error (ECE) and negative log-likelihood (NLL) compared to baselines on common-sense reasoning benchmarks.

Conclusion: Bayesian-MoE effectively provides calibrated uncertainty estimation for MoE models, enhancing reliability for downstream decision-making while maintaining computational efficiency.

Abstract: We present Bayesian Mixture of Experts (Bayesian-MoE), a post-hoc uncertainty estimation framework for fine-tuned large language models (LLMs) based on Mixture-of-Experts architectures. Our method applies a structured Laplace approximation to the second linear layer of each expert, enabling calibrated uncertainty estimation without modifying the original training procedure or introducing new parameters. Unlike prior approaches, which apply Bayesian inference to added adapter modules, Bayesian-MoE directly targets the expert pathways already present in MoE models, leveraging their modular design for tractable block-wise posterior estimation. We use Kronecker-factored low-rank approximations to model curvature and derive scalable estimates of predictive uncertainty and marginal likelihood. Experiments on common-sense reasoning benchmarks with Qwen1.5-MoE and DeepSeek-MoE demonstrate that Bayesian-MoE improves both expected calibration error (ECE) and negative log-likelihood (NLL) over baselines, confirming its effectiveness for reliable downstream decision-making.
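
The block-wise Laplace idea, reduced to a diagonal curvature proxy on one expert's linear layer for brevity (the paper uses Kronecker-factored low-rank curvature, which this sketch does not implement):

```python
import numpy as np

rng = np.random.default_rng(0)

def diag_laplace_posterior(W_map, inputs, prior_prec=1.0):
    """Toy diagonal Laplace approximation for one expert's linear layer.
    For a linear layer under squared loss, the GGN diagonal for weight
    W[i, j] scales with E[inputs[:, j]^2]. Returns per-weight posterior
    standard deviations around the MAP weights."""
    feat_sq = np.mean(inputs ** 2, axis=0)                 # (d_in,)
    prec = prior_prec + np.broadcast_to(feat_sq, W_map.shape)
    return 1.0 / np.sqrt(prec)

def sample_expert_weights(W_map, W_std, n=10):
    """Draw posterior weight samples; averaging the resulting predictions
    gives the calibrated predictive distribution."""
    return W_map + W_std * rng.standard_normal((n,) + W_map.shape)
```

Because the posterior is fit post hoc around the already-trained expert weights, no retraining and no extra trainable parameters are needed, matching the summary's claim.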

[339] Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

Duc Anh Nguyen, Huu Binh Ta, Nhuan Le Duc, Tan M. Nguyen, Toan Tran

Main category: cs.LG

TL;DR: Proposes Selective Sinkhorn Routing (SSR) for Sparse Mixture-of-Experts models, replacing auxiliary losses with lightweight optimal transport-based routing to achieve better performance and training efficiency.

DetailsMotivation: Existing SMoE models rely on auxiliary losses and additional parameters for expert diversity, causing objective misalignment and increased complexity. Sinkhorn-based methods also suffer from high computational overhead.

Method: Formulates token-to-expert assignment as optimal transport problem, derives gating scores directly from transport map, and introduces SSR mechanism that uses lightweight Sinkhorn-based routing without auxiliary losses.

Result: SSR achieves faster training, higher accuracy, and greater robustness to input corruption across language modeling and image classification tasks, while promoting balanced token assignments.

Conclusion: SSR provides an effective alternative to auxiliary losses in SMoE models, enabling better performance and efficiency through optimal transport-based routing principles.

Abstract: Sparse Mixture-of-Experts (SMoE) has gained prominence as a scalable and computationally efficient architecture, enabling significant growth in model capacity without incurring additional inference costs. However, existing SMoE models often rely on auxiliary losses (e.g., z-loss, load balancing) and additional trainable parameters (e.g., noisy gating) to encourage expert diversity, leading to objective misalignment and increased model complexity. Moreover, existing Sinkhorn-based methods suffer from significant training overhead due to their heavy reliance on the computationally expensive Sinkhorn algorithm. In this work, we formulate token-to-expert assignment as an optimal transport problem, incorporating constraints to ensure balanced expert utilization. We demonstrate that introducing a minimal degree of optimal transport-based routing enhances SMoE performance without requiring auxiliary balancing losses. Unlike previous methods, our approach derives gating scores directly from the transport map, enabling more effective token-to-expert balancing, supported by both theoretical analysis and empirical results. Building on these insights, we propose Selective Sinkhorn Routing (SSR), a routing mechanism that replaces auxiliary loss with lightweight Sinkhorn-based routing. SSR promotes balanced token assignments while preserving flexibility in expert selection. Across both language modeling and image classification tasks, SSR achieves faster training, higher accuracy, and greater robustness to input corruption.
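
Generic Sinkhorn normalization of token-to-expert affinities, the primitive that Sinkhorn-based routing builds on (a sketch of the mechanism, not SSR itself):

```python
import numpy as np

def sinkhorn(logits, n_iters=20):
    """Alternately rescale rows (tokens) and columns (experts) of the
    affinity matrix so the soft assignment approaches a balanced
    transport plan, without any auxiliary load-balancing loss."""
    K = np.exp(logits - logits.max())
    for _ in range(n_iters):
        K /= K.sum(axis=1, keepdims=True)   # each token's mass sums to 1
        K /= K.sum(axis=0, keepdims=True)   # balance load across experts
    # renormalize rows so each token has a distribution over experts
    return K / K.sum(axis=1, keepdims=True)
```

Top-k expert selection per token would then read off the largest entries of each row; the balanced column sums are what replaces the usual load-balancing auxiliary loss.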

[340] Data reuse enables cost-efficient randomized trials of medical AI models

Michael Nercessian, Wenxin Zhang, Alexander Schubert, Daphne Yang, Maggie Chung, Ahmed Alaa, Adam Yala

Main category: cs.LG

TL;DR: BRIDGE is a data-reuse RCT design for AI risk models that recycles participant data from completed trials when legacy and updated AI models make concordant predictions, reducing enrollment requirements for subsequent trials.

DetailsMotivation: Traditional RCTs for medical AI tools are costly and time-consuming, hindering timely validation as new AI models emerge rapidly, creating a need for more efficient trial designs.

Method: BRIDGE trials reuse participant-level data from completed trials when AI models make concordant predictions, with a practical checklist to ensure valid causal inference and preserve type I error.

Result: Real-world datasets showed up to 64.8% overlap in high-risk cohorts between successive AI models. Simulation studies demonstrated 46.6% reduction in required enrollment (saving over $2.8M) while maintaining 80% power.

Conclusion: BRIDGE makes Level I evidence generation feasible for every model iteration by transforming trials into adaptive, modular studies, accelerating cost-effective translation of AI into routine care.

Abstract: Randomized controlled trials (RCTs) are indispensable for establishing the clinical value of medical artificial-intelligence (AI) tools, yet their high cost and long timelines hinder timely validation as new models emerge rapidly. Here, we propose BRIDGE, a data-reuse RCT design for AI-based risk models. AI risk models support a broad range of interventions, including screening, treatment selection, and clinical alerts. BRIDGE trials recycle participant-level data from completed trials of AI models when legacy and updated models make concordant predictions, thereby reducing the enrollment requirement for subsequent trials. We provide a practical checklist for investigators to assess whether reusing data from previous trials allows for valid causal inference and preserves type I error. Using real-world datasets across breast cancer, cardiovascular disease, and sepsis, we demonstrate concordance between successive AI models, with up to 64.8% overlap in top 5% high-risk cohorts. We then simulate a series of breast cancer screening studies, where our design reduced required enrollment by 46.6%, saving over US$2.8 million, while maintaining 80% power. By transforming trials into adaptive, modular studies, our proposed design makes Level I evidence generation feasible for every model iteration, thereby accelerating cost-effective translation of AI into routine care.
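
The concordance that enables data reuse can be measured as the overlap between successive models' top-risk cohorts; a minimal sketch of that computation:

```python
import numpy as np

def high_risk_overlap(scores_old, scores_new, top_frac=0.05):
    """Fraction of the legacy model's top-risk cohort that the updated
    model also flags. High overlap (the paper reports up to 64.8% for
    top-5% cohorts) means those participants' trial data can be reused."""
    n = len(scores_old)
    k = max(1, int(top_frac * n))
    top_old = set(np.argsort(scores_old)[-k:])
    top_new = set(np.argsort(scores_new)[-k:])
    return len(top_old & top_new) / k
```

Participants on whom both models agree need not be re-enrolled, which is where the simulated 46.6% enrollment reduction comes from.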

[341] Fast $k$-means clustering in Riemannian manifolds via Fréchet maps: Applications to large-dimensional SPD matrices

Ji Shi, Nicolas Charon, Andreas Mang, Demetrio Labate, Robert Azencott

Main category: cs.LG

TL;DR: Novel framework for clustering high-dimensional manifold data using p-Fréchet maps to embed into Euclidean space, enabling efficient k-means clustering with significant speed improvements.

Motivation: Standard intrinsic methods for clustering data on high-dimensional non-Euclidean manifolds face computational challenges that limit their practical application.

Method: Use p-Fréchet maps to embed manifold data into lower-dimensional Euclidean space using reference points, then apply standard Euclidean clustering techniques like k-means.

Result: Reduces runtime by up to two orders of magnitude compared to intrinsic manifold approaches while maintaining high clustering accuracy, even in challenging scenarios where other methods fail.

Conclusion: The p-Fréchet mapping framework provides an efficient and accurate alternative to intrinsic manifold clustering methods, making manifold data clustering practically feasible.

Abstract: We introduce a novel, efficient framework for clustering data on high-dimensional, non-Euclidean manifolds that overcomes the computational challenges associated with standard intrinsic methods. The key innovation is the use of the $p$-Fréchet map $F^p : \mathcal{M} \to \mathbb{R}^\ell$, defined on a generic metric space $\mathcal{M}$, which embeds the manifold data into a lower-dimensional Euclidean space $\mathbb{R}^\ell$ using a set of reference points $\{r_i\}_{i=1}^\ell$, $r_i \in \mathcal{M}$. Once embedded, we can efficiently and accurately apply standard Euclidean clustering techniques such as k-means. We rigorously analyze the mathematical properties of $F^p$ in the Euclidean space and the challenging manifold of $n \times n$ symmetric positive definite matrices $\mathit{SPD}(n)$. Extensive numerical experiments using synthetic and real $\mathit{SPD}(n)$ data demonstrate significant performance gains: our method reduces runtime by up to two orders of magnitude compared to intrinsic manifold-based approaches, all while maintaining high clustering accuracy, including scenarios where existing alternative methods struggle or fail.
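
A minimal sketch of the embed-then-cluster pipeline, assuming $p = 1$, the log-Euclidean distance on SPD(2) (one of several possible SPD metrics), and reference points drawn from the data; the toy k-means uses deterministic farthest-point initialisation rather than the paper's exact setup:

```python
import numpy as np

def frechet_embed(X, refs, metric, p=1.0):
    """p-Frechet map: x -> (d(x, r_1)^p, ..., d(x, r_l)^p)."""
    return np.array([[metric(x, r) ** p for r in refs] for x in X])

def kmeans(Z, k, iters=50):
    """Plain Lloyd's k-means on the Euclidean embedding, with greedy
    farthest-point initialisation for determinism."""
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def log_euclidean(A, B):
    """Log-Euclidean distance on SPD matrices (illustrative metric choice)."""
    def logm_spd(M):
        w, V = np.linalg.eigh(M)
        return (V * np.log(w)) @ V.T
    return np.linalg.norm(logm_spd(A) - logm_spd(B), "fro")

# Two well-separated SPD(2) clusters: near the identity and near 10*I.
rng = np.random.default_rng(1)
cluster1 = [np.eye(2) + 0.05 * np.diag(rng.random(2)) for _ in range(20)]
cluster2 = [10 * np.eye(2) + 0.05 * np.diag(rng.random(2)) for _ in range(20)]
X = cluster1 + cluster2
refs = [X[0], X[-1]]                      # reference points taken from the data
Z = frechet_embed(X, refs, log_euclidean) # Euclidean embedding, shape (40, 2)
labels = kmeans(Z, 2)
```

The expensive manifold distance is evaluated only against the $\ell$ reference points; everything downstream is ordinary Euclidean clustering.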

[342] FLAD: Federated Learning for LLM-based Autonomous Driving in Vehicle-Edge-Cloud Networks

Tianao Xiang, Mingjian Zhi, Yuanguo Bi, Lin Cai, Yuhao Chen

Main category: cs.LG

TL;DR: FLAD is a federated learning framework for autonomous driving that enables collaborative training of LLMs across vehicles without sharing raw data, addressing computation, communication, and privacy challenges.

Motivation: To overcome the challenges of training LLMs for autonomous driving, including high computation/transmission costs and privacy concerns with sensitive driving data, while leveraging distributed multimodal sensory data across autonomous vehicles.

Method: Uses a cloud-edge-vehicle collaborative architecture with intelligent parallelized collaborative training, communication scheduling mechanism, and knowledge distillation for personalizing LLMs to heterogeneous edge data.

Result: FLAD achieves superior end-to-end autonomous driving performance while efficiently utilizing distributed vehicular resources, as demonstrated through extensive experimental evaluation with NVIDIA Jetsons testbed.

Conclusion: FLAD opens up new possibilities for future collaborative autonomous driving model training and knowledge sharing by overcoming practical implementation challenges in resource-constrained environments.

Abstract: Large Language Models (LLMs) have impressive data fusion and reasoning capabilities for autonomous driving (AD). However, training LLMs for AD faces significant challenges, including high computation and transmission costs, and privacy concerns associated with sensitive driving data. Federated Learning (FL) is promising for enabling autonomous vehicles (AVs) to collaboratively train models without sharing raw data. We present Federated LLM-based Autonomous Driving (FLAD), an FL framework that leverages distributed multimodal sensory data across AVs in heterogeneous environments. FLAD has three key innovations: (1) a cloud-edge-vehicle collaborative architecture that reduces communication delay and preserves data privacy; (2) an intelligent parallelized collaborative training scheme with a communication scheduling mechanism that optimizes training efficiency, leveraging end devices that would otherwise have insufficient resources for model training; and (3) a knowledge distillation method that personalizes LLMs according to heterogeneous edge data. In addition, we prototype FLAD in a testbed with NVIDIA Jetsons, overcoming practical implementation challenges including CPU/GPU memory sharing in resource-constrained devices, dynamic model partitions, and fault-tolerant training. Extensive experimental evaluation demonstrates that FLAD achieves superior end-to-end AD performance while efficiently utilizing distributed vehicular resources, opening up new possibilities for future collaborative AD model training and knowledge sharing.

[343] FedSDWC: Federated Synergistic Dual-Representation Weak Causal Learning for OOD

Zhenyuan Huang, Hui Zhang, Wenzhong Tang, Haijun Yang

Main category: cs.LG

TL;DR: FedSDWC is a causal inference method for federated learning that integrates invariant and variant features to handle data distribution shifts, improving generalization and OOD detection.

Motivation: Address limitations of existing FL methods in handling covariate and semantic shifts across distributed clients, which severely affect reliability in real-world deployments.

Method: Proposes FedSDWC that infers causal semantic representations by modeling weak causal influence between invariant and variant features, overcoming limitations of existing invariant learning methods.

Result: Outperforms FedICON by average 3.04% on CIFAR-10 and 8.11% on CIFAR-100; establishes theoretical generalization error bound linked to client prior distributions.

Conclusion: FedSDWC significantly enhances FL’s generalization ability and OOD detection performance across multiple benchmark datasets with distribution shifts.

Abstract: Amid growing demands for data privacy and advances in computational infrastructure, federated learning (FL) has emerged as a prominent distributed learning paradigm. Nevertheless, differences in data distribution (such as covariate and semantic shifts) severely affect its reliability in real-world deployments. To address this issue, we propose FedSDWC, a causal inference method that integrates both invariant and variant features. FedSDWC infers causal semantic representations by modeling the weak causal influence between invariant and variant features, effectively overcoming the limitations of existing invariant learning methods in accurately capturing invariant features and directly constructing causal representations. This approach significantly enhances FL’s ability to generalize and detect OOD data. Theoretically, we derive FedSDWC’s generalization error bound under specific conditions and, for the first time, establish its relationship with client prior distributions. Moreover, extensive experiments conducted on multiple benchmark datasets validate the superior performance of FedSDWC in handling covariate and semantic shifts. For example, FedSDWC outperforms FedICON, the next best baseline, by an average of 3.04% on CIFAR-10 and 8.11% on CIFAR-100.

[344] Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection

Anushka Sanjay Shelke, Aditya Sneh, Arya Adyasha, Haroon R. Lone

Main category: cs.LG

TL;DR: FairM2S is a fairness-aware meta-learning framework for stress detection that addresses gender bias in AI models, achieving 78.1% accuracy while significantly improving fairness metrics in few-shot learning scenarios.

Motivation: Existing AI-driven stress detection models exhibit gender bias, especially in data-scarce scenarios, which creates inequitable mental healthcare outcomes.

Method: Proposes FairM2S framework that integrates Equalized Odds constraints during meta-training and adaptation phases, using adversarial gradient masking and fairness-constrained meta-updates.

Result: Achieves 78.1% accuracy while reducing Equal Opportunity to 0.06, outperforming five state-of-the-art baselines with substantial fairness improvements.

Conclusion: FairM2S represents a state-of-the-art approach for equitable and scalable few-shot stress detection, accompanied by the release of SAVSD dataset to support fairness research in real-world contexts.

Abstract: Fairness in AI-driven stress detection is critical for equitable mental healthcare, yet existing models frequently exhibit gender bias, particularly in data-scarce scenarios. To address this, we propose FairM2S, a fairness-aware meta-learning framework for stress detection leveraging audio-visual data. FairM2S integrates Equalized Odds constraints during both meta-training and adaptation phases, employing adversarial gradient masking and fairness-constrained meta-updates to effectively mitigate bias. Evaluated against five state-of-the-art baselines, FairM2S achieves 78.1% accuracy while reducing the Equal Opportunity to 0.06, demonstrating substantial fairness gains. We also release SAVSD, a smartphone-captured dataset with gender annotations, designed to support fairness research in low-resource, real-world contexts. Together, these contributions position FairM2S as a state-of-the-art approach for equitable and scalable few-shot stress detection in mental health AI. We release our dataset and FairM2S publicly with this paper.

[345] GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs

Liangwei Yang, Jing Ma, Jianguo Zhang, Zhiwei Liu, Jielin Qiu, Shirley Kokane, Shiyu Wang, Haolin Chen, Rithesh Murthy, Ming Zhu, Huan Wang, Weiran Yao, Caiming Xiong, Shelby Heinecke

Main category: cs.LG

TL;DR: This paper addresses semantic drift in GNNs on text-attributed graphs by proposing Geodesic Aggregation, a manifold-aware mechanism that aggregates neighbor information along geodesics on the unit sphere to preserve semantic fidelity.

Motivation: Linear aggregation in GNNs distorts the non-linear, geometrically structured representation spaces of modern PLMs, causing semantic drift where aggregated representations deviate from the intrinsic semantic manifold and lose expressive power.

Method: Proposes Geodesic Aggregation using log-exp mappings on the unit sphere to aggregate neighbor information along geodesics, and develops GeoGNN which integrates spherical attention with manifold interpolation.

Result: Extensive experiments across four benchmark datasets show GeoGNN substantially mitigates semantic drift and consistently outperforms strong baselines, establishing the importance of manifold-aware aggregation.

Conclusion: Manifold-aware aggregation is crucial for text-attributed graph learning to preserve semantic fidelity and expressive power during message passing.

Abstract: Graph neural networks (GNNs) on text-attributed graphs (TAGs) typically encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. However, the representation spaces of modern PLMs are highly non-linear and geometrically structured, where textual embeddings reside on curved semantic manifolds rather than flat Euclidean spaces. Linear aggregation on such manifolds inevitably distorts geometry and causes semantic drift, a phenomenon where aggregated representations deviate from the intrinsic manifold, losing semantic fidelity and expressive power. To quantitatively investigate this problem, this work introduces a local PCA-based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure. Building upon these insights, we propose Geodesic Aggregation, a manifold-aware mechanism that aggregates neighbor information along geodesics via log-exp mappings on the unit sphere, ensuring that representations remain faithful to the semantic manifold during message passing. We further develop GeoGNN, a practical instantiation that integrates spherical attention with manifold interpolation. Extensive experiments across four benchmark datasets and multiple text encoders show that GeoGNN substantially mitigates semantic drift and consistently outperforms strong baselines, establishing the importance of manifold-aware aggregation in text-attributed graph learning.
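
The log-exp aggregation on the unit sphere can be sketched as a tangent-space weighted mean; this is the generic spherical construction, not the paper's full attention-weighted GeoGNN layer:

```python
import numpy as np

def log_map(b, x):
    """Log map on the unit sphere: tangent vector at base point b toward x."""
    c = np.clip(np.dot(b, x), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(b)
    return (x - c * b) * theta / np.sin(theta)

def exp_map(b, v):
    """Exp map: follow the geodesic from b with initial velocity v."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return b
    return np.cos(n) * b + np.sin(n) * v / n

def geodesic_aggregate(b, neighbors, weights):
    """Average neighbor embeddings in the tangent space at b, then map back,
    so the aggregated representation stays on the manifold."""
    v = sum(w * log_map(b, x) for w, x in zip(weights, neighbors))
    return exp_map(b, v)

# Two orthogonal unit embeddings: the geodesic midpoint stays on the sphere,
# while the linear average would fall inside it (the "semantic drift" effect).
b = np.array([1.0, 0.0, 0.0])
x = np.array([0.0, 1.0, 0.0])
mid = geodesic_aggregate(b, [b, x], weights=[0.5, 0.5])
```

The linear mean `(b + x) / 2` has norm below 1, i.e. it leaves the unit sphere, which is exactly the distortion the geodesic construction avoids.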

[346] Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback

Shengbo Wang, Hong Sun, Ke Li

Main category: cs.LG

TL;DR: The paper introduces a model-free dueling bandit framework with feedback augmentation to address sparse human feedback in interactive preference elicitation, achieving competitive performance across multiple benchmarks.

Motivation: Existing dueling bandit algorithms are inefficient with sparse human feedback, and parametric reward models are vulnerable to misspecification, motivating a model-free approach with feedback augmentation.

Method: Proposed augmented confidence bounds to integrate augmented human feedback under generalized concentration properties, and analyzed multi-factored performance trade-off via regret analysis.

Result: The prototype algorithm achieves competitive performance across several IPE benchmarks including recommendation, multi-objective optimization, and response optimization for large language models.

Conclusion: The approach demonstrates potential for provably efficient interactive preference elicitation in broader applications through model-free feedback augmentation.

Abstract: Interactive preference elicitation (IPE) aims to substantially reduce human effort while acquiring human preferences in a wide range of personalization systems. Dueling bandit (DB) algorithms, which build on pairwise comparisons, enable optimal decision-making in IPE. However, they remain inefficient when human feedback is sparse. Existing methods address sparsity by heavily relying on parametric reward models, whose rigid assumptions are vulnerable to misspecification. In contrast, we explore an alternative perspective based on feedback augmentation, and introduce critical improvements to the model-free DB framework. Specifically, we introduce augmented confidence bounds to integrate augmented human feedback under generalized concentration properties, and analyze the multi-factored performance trade-off via regret analysis. Our prototype algorithm achieves competitive performance across several IPE benchmarks, including recommendation, multi-objective optimization, and response optimization for large language models, demonstrating the potential of our approach for provably efficient IPE in broader applications.

[347] Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks

Anthony Baez, Wang Zhang, Ziwen Ma, Lam Nguyen, Subhro Das, Luca Daniel

Main category: cs.LG

TL;DR: A projection method for PINNs that guarantees conservation of integral quantities, reducing conservation errors by 3-4 orders of magnitude while slightly improving PDE solution accuracy.

Motivation: PINNs' soft constraints allow solutions to violate physical conservation laws, which is problematic for applications requiring strict conservation of integral quantities like mass, energy, or momentum.

Method: Developed a projection method that solves constrained non-linear optimization problems to enforce conservation of linear and quadratic integrals, creating PINN-Proj with guaranteed conservation properties.

Result: PINN-Proj reduced conservation errors by 3-4 orders of magnitude compared to soft constraints, marginally reduced PDE solution error, and improved convergence by conditioning the loss landscape.

Conclusion: The projection method provides a general framework to guarantee conservation of any integral quantity in PINNs when tractable solutions exist, enhancing physical consistency while maintaining solution accuracy.

Abstract: We propose a novel projection method that guarantees the conservation of integral quantities in Physics-Informed Neural Networks (PINNs). While the soft constraint that PINNs use to enforce the structure of partial differential equations (PDEs) enables necessary flexibility during training, it also permits the discovered solution to violate physical laws. To address this, we introduce a projection method that guarantees the conservation of the linear and quadratic integrals, both separately and jointly. We derived the projection formulae by solving constrained non-linear optimization problems and found that our PINN modified with the projection, which we call PINN-Proj, reduced the error in the conservation of these quantities by three to four orders of magnitude compared to the soft constraint and marginally reduced the PDE solution error. We also found evidence that the projection improved convergence through improving the conditioning of the loss landscape. Our method holds promise as a general framework to guarantee the conservation of any integral quantity in a PINN if a tractable solution exists.
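
For the linear-integral case, the projection has a closed form: the Euclidean projection of the discretised field onto the hyperplane of fields whose quadrature-weighted sum equals the conserved value. A minimal sketch (the grid, uniform quadrature weights, and conserved value are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def project_linear_conservation(u, w, C):
    """Euclidean projection of a discretised field u onto the hyperplane
    {u : w @ u = C}, i.e. the closest field whose quadrature-weighted
    integral equals the conserved value C."""
    return u + w * (C - w @ u) / (w @ w)

# Toy example: force the total "mass" of a 1-D field to equal a target value.
x = np.linspace(0, 1, 101)
w = np.full_like(x, x[1] - x[0])   # uniform quadrature weights (assumption)
u = np.sin(np.pi * x) + 0.03       # network output with a small conservation drift
u_proj = project_linear_conservation(u, w, C=1.0)
```

Because the constraint is affine, the projection is exact and costs one dot product; quadratic integrals require solving a small constrained optimization problem instead, as the paper describes.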

[348] Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering

Mingjie Zhao, Zhanpei Huang, Yang Lu, Mengke Li, Yiqun Zhang, Weifeng Su, Yiu-ming Cheung

Main category: cs.LG

TL;DR: The paper proposes a method to learn customized distance metrics for categorical attributes by breaking fixed category relationships, enabling flexible clustering and extension to mixed datasets with superior accuracy.

Motivation: Categorical attributes lack well-defined relationships between their values, limiting the exploration of compact clusters. Existing methods assume fixed topological relationships, which reduces adaptability and clustering performance.

Method: The method breaks intrinsic category relationships and learns customized distance metrics suitable for revealing various cluster distributions. The learned relationships are Euclidean distance metric-compatible, allowing extension to mixed datasets.

Result: Comparative experiments on 12 real benchmark datasets show superior clustering accuracy with an average ranking of 1.25, significantly higher than the 5.21 ranking of the current best-performing method.

Conclusion: The proposed method significantly enhances clustering algorithm fitting ability through learnable category relationships, achieving superior performance in categorical and mixed data clustering.

Abstract: Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike the Euclidean distance of numerical attributes, the categorical attributes lack well-defined relationships of their possible values (also called categories interchangeably), which hampers the exploration of compact categorical data clusters. Although most attempts are made for developing appropriate distance metrics, they typically assume a fixed topological relationship between categories when learning distance metrics, which limits their adaptability to varying cluster structures and often leads to suboptimal clustering performance. This paper, therefore, breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly and accurately revealing various cluster distributions. As a result, the fitting ability of the clustering algorithm is significantly enhanced, benefiting from the learnable category relationships. Moreover, the learned category relationships are proved to be Euclidean distance metric-compatible, enabling a seamless extension to mixed datasets that include both numerical and categorical attributes. Comparative experiments on 12 real benchmark datasets with significance tests show the superior clustering accuracy of the proposed method with an average ranking of 1.25, which is significantly higher than the 5.21 ranking of the current best-performing method.

[349] Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies

Zhongnian Li, Lan Chen, Yixin Xu, Shi Xu, Xinzheng Xu

Main category: cs.LG

TL;DR: Proposes Human-Corrected Labels (HCL) to efficiently correct VLM-generated noisy labels by strategically using human correction only for instances with VLM discrepancies, achieving better quality annotations with reduced labor costs.

Motivation: VLM-generated labels suffer from low quality (label noise) and lack error correction mechanisms, limiting their practical utility in data annotation.

Method: HCL strategically deploys human correction only for instances with VLM discrepancies, uses a risk-consistent estimator combining human-corrected labels and VLM predictions, and employs conditional probability to estimate label distribution.

Result: Extensive experiments show superior classification performance and robustness to label noise, validating HCL’s effectiveness in weak supervision scenarios.

Conclusion: HCL provides an efficient framework for improving VLM-generated labels through strategic human correction, achieving higher quality annotations with reduced labor costs.

Abstract: Vision-Language Models (VLMs), with their powerful content generation capabilities, have been successfully applied to data annotation processes. However, the VLM-generated labels exhibit dual limitations: low quality (i.e., label noise) and absence of error correction mechanisms. To enhance label quality, we propose Human-Corrected Labels (HCLs), a novel setting that enables efficient human correction of VLM-generated noisy labels. As shown in Figure 1(b), HCL strategically deploys human correction only for instances with VLM discrepancies, achieving both higher-quality annotations and reduced labor costs. Specifically, we theoretically derive a risk-consistent estimator that incorporates both human-corrected labels and VLM predictions to train classifiers. Besides, we further propose a conditional probability method to estimate the label distribution using a combination of VLM outputs and model predictions. Extensive experiments demonstrate that our approach achieves superior classification performance and is robust to label noise, validating the effectiveness of HCL in practical weak supervision scenarios. Code: https://github.com/Lilianach24/HCL.git
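
The selective-correction idea can be sketched as follows, assuming "VLM discrepancy" is operationalised as disagreement between two VLM annotation runs (the paper's exact discrepancy criterion may differ):

```python
def human_corrected_labels(vlm_a, vlm_b, ask_human):
    """Keep VLM labels where two annotation runs agree; route only
    discrepant instances to a human annotator. Returns the corrected
    labels and the number of human queries spent."""
    labels, queries = [], 0
    for i, (a, b) in enumerate(zip(vlm_a, vlm_b)):
        if a == b:
            labels.append(a)          # consensus: trust the VLM label
        else:
            labels.append(ask_human(i))  # discrepancy: ask a human
            queries += 1
    return labels, queries

# Toy run: only one of five instances is discrepant, so only one
# human query is needed (ground-truth lookup stands in for the annotator).
vlm_run_a = [0, 1, 1, 0, 2]
vlm_run_b = [0, 1, 0, 0, 2]
truth = [0, 1, 1, 0, 2]
labels, queries = human_corrected_labels(vlm_run_a, vlm_run_b,
                                         ask_human=lambda i: truth[i])
```

The labor saving is exactly the agreement rate: humans are consulted only on the disagreement set.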

[350] Factorization-in-Loop: Proximal Fill-in Minimization for Sparse Matrix Reordering

Ziwei Li, Shuzi Niu, Tao Yuan, Huiyuan Li, Wenjia Wu

Main category: cs.LG

TL;DR: Proposes a neural network approach to reduce fill-ins in LU factorization by minimizing l1 norm of triangular factors, achieving 20% fill-in reduction and 17.8% factorization time improvement over state-of-the-art methods.

Motivation: Fill-ins during LU factorization increase memory usage and computational time for large sparse matrices. Existing methods use surrogate objectives with no theoretical guarantee relative to the golden criterion of minimal fill-ins.

Method: Uses a graph encoder to predict node scores, reparameterization to obtain permutation matrices, and optimizes l1 norm of triangular factors using ADMM and proximal gradient descent.

Result: Achieves 20% reduction in fill-in number and 17.8% reduction in LU factorization time compared to state-of-the-art baselines on SuiteSparse benchmark.

Conclusion: The proposed method effectively reduces fill-ins and improves LU factorization efficiency by directly optimizing triangular factor sparsity through neural network-based reordering.

Abstract: Fill-ins are new nonzero elements in the summation of the upper and lower triangular factors generated during LU factorization. For large sparse matrices, they will increase the memory usage and computational time, and can be reduced through proper row or column arrangement, namely matrix reordering. Finding a row or column permutation with the minimal fill-ins is NP-hard, and surrogate objectives are designed to derive fill-in reduction permutations or learn a reordering function. However, there is no theoretical guarantee between the golden criterion and these surrogate objectives. Here we propose to learn a reordering network by minimizing the $l_1$ norm of the triangular factors of the reordered matrix to approximate the exact number of fill-ins. The reordering network utilizes a graph encoder to predict row or column node scores. For inference, it is easy and fast to derive the permutation from sorting algorithms for matrices. For gradient-based optimization, there is a large gap between the predicted node scores and the resultant triangular factors in the optimization objective. To bridge the gap, we first design two reparameterization techniques to obtain the permutation matrix from node scores. The matrix is reordered by multiplying the permutation matrix. Then we introduce the factorization process into the objective function to arrive at the target triangular factors. The overall objective function is optimized with the alternating direction method of multipliers and proximal gradient descent. Experimental results on the benchmark sparse matrix collection SuiteSparse show that our proposed method reduces the fill-in number by 20% and the LU factorization time by 17.8% compared with state-of-the-art baselines.
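
Why ordering matters for fill-in can be made concrete with symbolic elimination on the sparsity pattern alone; a minimal sketch on an arrowhead pattern (this illustrates the fill-in criterion being minimized, not the paper's reordering network):

```python
import numpy as np

def count_fill_in(pattern, order):
    """Symbolic elimination on a symmetric sparsity pattern: eliminate
    nodes in `order`; each pair of still-uneliminated neighbours of the
    pivot that were not already connected becomes a fill-in."""
    n = len(pattern)
    adj = {i: {j for j in range(n) if pattern[i][j] and j != i} for i in range(n)}
    eliminated, fill = set(), 0
    for v in order:
        nbrs = [u for u in adj[v] if u not in eliminated]
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill

# Arrowhead pattern: dense first row/column, otherwise diagonal.
n = 8
A = np.eye(n, dtype=bool)
A[0, :] = A[:, 0] = True
hub_first = count_fill_in(A, order=list(range(n)))           # eliminate hub first
hub_last = count_fill_in(A, order=list(range(1, n)) + [0])   # eliminate hub last
```

Eliminating the dense "hub" row first connects all its neighbours (21 fill-ins here), while deferring it to last produces none; the learned permutation aims to find such fill-reducing orderings automatically.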

[351] FedPM: Federated Learning Using Second-order Optimization with Preconditioned Mixing of Local Parameters

Hiro Ishii, Kenta Niwa, Hiroshi Sawada, Akinori Fujino, Noboru Harada, Rio Yokota

Main category: cs.LG

TL;DR: FedPM is a novel Federated Learning method that uses second-order optimization with preconditioned parameter mixing to address local preconditioner drift in heterogeneous data settings.

Motivation: Existing second-order FL methods suffer from drift in local preconditioners, which disrupts convergence, especially with heterogeneous client data.

Method: Decomposes ideal second-order updates into server-side preconditioned parameter mixing and client-side local updates, mitigating preconditioner drift.

Result: Theoretical analysis shows superlinear convergence for strongly convex objectives with single local update. Experiments demonstrate significant test accuracy improvements over conventional methods.

Conclusion: FedPM effectively leverages second-order optimization potential in FL by addressing preconditioner drift through refined update rules and preconditioned mixing.

Abstract: We propose Federated Preconditioned Mixing (FedPM), a novel Federated Learning (FL) method that leverages second-order optimization. Prior methods, such as LocalNewton, LTDA, and FedSophia, have incorporated second-order optimization in FL by performing iterative local updates on clients and applying simple mixing of local parameters on the server. However, these methods often suffer from drift in local preconditioners, which significantly disrupts the convergence of parameter training, particularly in heterogeneous data settings. To overcome this issue, we refine the update rules by decomposing the ideal second-order update, computed using globally preconditioned global gradients, into parameter mixing on the server and local parameter updates on clients. As a result, our FedPM introduces preconditioned mixing of local parameters on the server, effectively mitigating drift in local preconditioners. We provide a theoretical convergence analysis demonstrating a superlinear rate for strongly convex objectives in scenarios involving a single local update. To demonstrate the practical benefits of FedPM, we conducted extensive experiments. The results showed significant improvements with FedPM in the test accuracy compared to conventional methods incorporating simple mixing, fully leveraging the potential of second-order optimization.
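
One plausible reading of preconditioned mixing is the weighted rule $\theta = (\sum_i P_i)^{-1} \sum_i P_i \theta_i$; a toy quadratic example (an assumption for illustration, with made-up diagonal Hessians, not the paper's exact update) shows that mixing exact local Newton solutions this way recovers the global Newton minimiser, where simple averaging would not:

```python
import numpy as np

def preconditioned_mix(params, precs):
    """Server-side mixing: weight each client's parameters by its local
    preconditioner, then normalise by the summed preconditioners."""
    P_sum = sum(precs)
    weighted = sum(P @ th for P, th in zip(precs, params))
    return np.linalg.solve(P_sum, weighted)

# Two clients with quadratic objectives f_i(x) = 0.5 x^T H_i x - b_i^T x.
H1, b1 = np.diag([4.0, 1.0]), np.array([4.0, 2.0])
H2, b2 = np.diag([1.0, 4.0]), np.array([2.0, 4.0])
x1 = np.linalg.solve(H1, b1)  # exact local minimiser of client 1
x2 = np.linalg.solve(H2, b2)  # exact local minimiser of client 2
x_mix = preconditioned_mix([x1, x2], [H1, H2])
```

Here `x_mix` equals the minimiser of the summed objective, `solve(H1 + H2, b1 + b2)`, while the simple average `(x1 + x2) / 2` does not; this is the curvature-aware weighting that simple mixing discards.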

[352] Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka, Keita Saito, Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

Main category: cs.LG

TL;DR: Theoretical analysis of minimum-cost data poisoning attacks on LLMs during RLHF/DPO alignment by flipping preference labels, with convex optimization formulation and cost bounds.

Motivation: Understanding theoretical foundations of data poisoning vulnerabilities in LLM alignment pipelines, as empirical studies exist but theoretical analysis is lacking.

Method: Formulate label-flipping poisoning as convex optimization with linear constraints, derive lower/upper bounds on minimum attack cost, and propose cost-minimization post-processing for existing attacks.

Result: Derived theoretical bounds on poisoning costs and showed post-processing can significantly reduce label flips required, especially when reward model feature dimension is small relative to dataset size.

Conclusion: Reveals fundamental vulnerabilities in RLHF/DPO pipelines and provides tools to evaluate robustness against low-cost poisoning attacks.

Abstract: Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM’s policy toward an attacker’s target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model’s feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.

[353] Towards a Generalisable Cyber Defence Agent for Real-World Computer Networks

Tim Dudman, Martyn Bull

Main category: cs.LG

TL;DR: TERLA enables RL agents to defend networks of varying topology/size without retraining using graph neural networks and fixed action spaces.

Motivation: Current RL agents for cyber defense require retraining for different network topologies/sizes, making them impractical for real-world networks that change over time.

Method: Use heterogeneous graph neural networks to create fixed-size latent embeddings of network state, combined with reduced fixed-size interpretable action space, applied to PPO agents in CAGE Challenge 4 environment.

Result: TERLA agents maintain defensive performance of vanilla PPO agents while improving action efficiency, and demonstrate generalizability across different network topologies/sizes without retraining.

Conclusion: TERLA provides a network-agnostic approach that enables single agents to defend varying network segments effectively, bridging the sim-to-real gap for autonomous cyber defense.

Abstract: Recent advances in deep reinforcement learning for autonomous cyber defence have resulted in agents that can successfully defend simulated computer networks against cyber-attacks. However, many of these agents would need retraining to defend networks with differing topology or size, making them poorly suited to real-world networks where topology and size can vary over time. In this research we introduce a novel set of Topological Extensions for Reinforcement Learning Agents (TERLA) that provide generalisability for the defence of networks with differing topology and size, without the need for retraining. Our approach involves the use of heterogeneous graph neural network layers to produce a fixed-size latent embedding representing the observed network state. This representation learning stage is coupled with a reduced, fixed-size, semantically meaningful and interpretable action space. We apply TERLA to a standard deep reinforcement learning Proximal Policy Optimisation (PPO) agent model, and to reduce the sim-to-real gap, conduct our research using Cyber Autonomy Gym for Experimentation (CAGE) Challenge 4. This Cyber Operations Research Gym environment has many of the features of a real-world network, such as realistic Intrusion Detection System (IDS) events and multiple agents defending network segments of differing topology and size. TERLA agents retain the defensive performance of vanilla PPO agents whilst showing improved action efficiency. Generalisability has been demonstrated by showing that all TERLA agents have the same network-agnostic neural network architecture, and by deploying a single TERLA agent multiple times to defend network segments with differing topology and size, showing improved defensive performance and efficiency.

[354] Trusted Multi-view Learning for Long-tailed Classification

Chuanqing Tang, Yifei Shi, Guanghao Lin, Lei Xing, Long Shi

Main category: cs.LG

TL;DR: TMLC is a trusted multi-view long-tailed classification framework that addresses class imbalance in multi-view scenarios through group consensus opinion aggregation and uncertainty-guided pseudo-data generation.

Motivation: Class imbalance has been well-studied in single-view scenarios but remains challenging in multi-view contexts, especially for trustworthy solutions in long-tailed classification problems.

Method: Proposes TMLC framework with two key components: 1) group consensus opinion aggregation inspired by Social Identity Theory, and 2) uncertainty-guided pseudo-data generation using a novel distance metric to adapt SMOTE for multi-view scenarios.

Result: Extensive experiments on long-tailed multi-view datasets demonstrate superior performance compared to existing methods.

Conclusion: TMLC effectively mitigates the adverse effects of class imbalance in multi-view long-tailed classification through innovative opinion aggregation and data generation techniques.

Abstract: Class imbalance has been extensively studied in single-view scenarios; however, addressing this challenge in multi-view contexts remains an open problem, with even scarcer research focusing on trustworthy solutions. In this paper, we tackle a particularly challenging class imbalance problem in multi-view scenarios: long-tailed classification. We propose TMLC, a Trusted Multi-view Long-tailed Classification framework, which makes contributions on two critical aspects: opinion aggregation and pseudo-data generation. Specifically, inspired by Social Identity Theory, we design a group consensus opinion aggregation mechanism that guides decision making toward the direction favored by the majority of the group. In terms of pseudo-data generation, we introduce a novel distance metric to adapt SMOTE for multi-view scenarios and develop an uncertainty-guided data generation module that produces high-quality pseudo-data, effectively mitigating the adverse effects of class imbalance. Extensive experiments on long-tailed multi-view datasets demonstrate that our model is capable of achieving superior performance. The code is released at https://github.com/cncq-tang/TMLC.

[355] Practical Global and Local Bounds in Gaussian Process Regression via Chaining

Junyi Liu, Stanley Kok

Main category: cs.LG

TL;DR: A chaining-based framework for estimating bounds on expected extreme values in Gaussian process regression, providing global and local uncertainty quantification without requiring specific input features or posterior estimates.

Motivation: Existing uncertainty bounds in GPR require specific input features, rely on posterior mean/variance estimates, or need hyperparameter tuning, limiting robustness and failing to capture global behavior in expectation.

Method: Proposed chaining-based framework for estimating upper/lower bounds on expected extreme values over unseen data, with kernel-specific refinements for RBF and Matérn kernels, avoiding analytical relaxations for numerical tightness, and developing local uncertainty quantification using chaining geometry through partition diameters.

Result: Theoretical bounds are tighter than generic constructions for common kernels, and experimental results show the method outperforms existing approaches on both synthetic and real-world datasets.

Conclusion: The proposed framework provides robust uncertainty quantification for GPR without requiring specific input features or posterior estimates, with improved tightness and adaptability to local structure.

Abstract: Gaussian process regression (GPR) is a popular nonparametric Bayesian method that provides predictive uncertainty estimates and is widely used in safety-critical applications. While prior research has introduced various uncertainty bounds, most existing approaches require access to specific input features and rely on posterior mean and variance estimates or tuning hyperparameters. These limitations hinder robustness and fail to capture the model’s global behavior in expectation. To address these limitations, we propose a chaining-based framework for estimating upper and lower bounds on the expected extreme values over unseen data, without requiring access to specific input locations. We provide kernel-specific refinements for commonly used kernels such as RBF and Matérn, in which our bounds are tighter than generic constructions. We further improve numerical tightness by avoiding analytical relaxations. In addition to global estimation, we also develop a novel method for local uncertainty quantification at specified inputs. This approach leverages chaining geometry through partition diameters, adapting to local structure without relying on posterior variance scaling. Our experimental results validate the theoretical findings and demonstrate that our method outperforms existing approaches on both synthetic and real-world datasets.
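
The generic construction that kernel-specific refinements improve on is Dudley's entropy-integral bound, the classical output of the chaining technique: for a (sub-)Gaussian process $(X_t)_{t \in T}$ with canonical metric $d(s,t) = \sqrt{\mathbb{E}(X_s - X_t)^2}$,

```latex
% Dudley's entropy integral: a generic chaining bound, where
% N(T, d, eps) is the eps-covering number of T under the metric d.
\mathbb{E}\left[\sup_{t \in T} X_t\right]
  \;\le\; C \int_{0}^{\operatorname{diam}(T)} \sqrt{\log N(T, d, \varepsilon)}\, d\varepsilon
```

Tighter covering-number estimates for specific kernels (e.g., RBF, Matérn) shrink the integrand, which is how bounds of this type can beat the generic construction; the partition diameters used in the local bounds come from the same chaining hierarchy.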

[356] Unsupervised Feature Selection Through Group Discovery

Shira Lifshitz, Ofir Lindenbaum, Gal Mishne, Ron Meir, Hadas Benisty

Main category: cs.LG

TL;DR: GroupFS is an unsupervised feature selection framework that jointly discovers latent feature groups and selects informative groups without predefined partitions or labels, outperforming state-of-the-art methods across diverse datasets.

Motivation: Most unsupervised feature selection methods evaluate features individually, but informative signals often emerge from groups of related features. Existing group-based methods rely on predefined partitions or label supervision, limiting their applicability.

Method: GroupFS is an end-to-end differentiable framework that enforces Laplacian smoothness on feature and sample graphs and applies group sparsity regularization to learn compact, structured representations while jointly discovering latent feature groups.

Result: Across nine benchmarks spanning images, tabular data, and biological datasets, GroupFS consistently outperforms state-of-the-art unsupervised feature selection methods in clustering and selects groups of features that align with meaningful patterns.

Conclusion: GroupFS effectively addresses the limitations of individual feature evaluation by discovering and selecting meaningful feature groups without requiring predefined partitions or supervision, demonstrating superior performance across diverse data types.

Abstract: Unsupervised feature selection (FS) is essential for high-dimensional learning tasks where labels are not available. It helps reduce noise, improve generalization, and enhance interpretability. However, most existing unsupervised FS methods evaluate features in isolation, even though informative signals often emerge from groups of related features. For example, adjacent pixels, functionally connected brain regions, or correlated financial indicators tend to act together, making independent evaluation suboptimal. Although some methods attempt to capture group structure, they typically rely on predefined partitions or label supervision, limiting their applicability. We propose GroupFS, an end-to-end, fully differentiable framework that jointly discovers latent feature groups and selects the most informative groups among them, without relying on fixed a priori groups or label supervision. GroupFS enforces Laplacian smoothness on both feature and sample graphs and applies a group sparsity regularizer to learn a compact, structured representation. Across nine benchmarks spanning images, tabular data, and biological datasets, GroupFS consistently outperforms state-of-the-art unsupervised FS in clustering and selects groups of features that align with meaningful patterns.
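
The group sparsity regularizer is in the group-lasso family: penalizing the sum of per-group Euclidean norms drives whole groups of weights to zero together, so features are kept or dropped as a unit. A minimal sketch of that penalty alone (the learned grouping and the Laplacian-smoothness terms of GroupFS are omitted):

```python
import numpy as np

def group_sparsity_penalty(w, groups):
    """L2,1-style penalty: sum of the Euclidean norms of each group's weights.

    Unlike a plain L1 penalty, zeroing is coupled within a group: either
    the whole group survives or the whole group is suppressed.
    """
    return sum(np.linalg.norm(w[idx]) for idx in groups)

w = np.array([0.0, 0.0, 3.0, 4.0])   # group 0 inactive, group 1 active
groups = [[0, 1], [2, 3]]
assert group_sparsity_penalty(w, groups) == 5.0   # 0 + sqrt(3^2 + 4^2)
```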

[357] Compact Memory for Continual Logistic Regression

Yohan Jung, Hyungi Lee, Wenlong Chen, Thomas Möllenhoff, Yingzhen Li, Juho Lee, Mohammad Emtiyaz Khan

Main category: cs.LG

TL;DR: A new method for building compact memory in continual learning for logistic regression that uses Hessian-matching and probabilistic PCA to estimate optimal memory, significantly outperforming Experience Replay.

Motivation: Continual learning still underperforms batch training due to catastrophic forgetting. Current methods lack clear solutions for building compact memory of essential past knowledge, even for shallow neural networks.

Method: Formulates memory search as Hessian-matching and uses probabilistic PCA to estimate optimal memory based on Khan and Swaroop’s existence proof of optimal memory for logistic regression models.

Result: Achieves 60% accuracy on Split-ImageNet vs 30% with Experience Replay (0.3% memory size). With 2% memory size, accuracy reaches 74%, closing the gap to batch accuracy of 77.6%.

Conclusion: The work opens a new direction for building compact memory that could be useful for future continual deep learning applications.

Abstract: Despite recent progress, continual learning still does not match the performance of batch training. To avoid catastrophic forgetting, we need to build compact memory of essential past knowledge, but no clear solution has yet emerged, even for shallow neural networks with just one or two layers. In this paper, we present a new method to build compact memory for logistic regression. Our method is based on a result by Khan and Swaroop [2021] who show the existence of optimal memory for such models. We formulate the search for the optimal memory as Hessian-matching and propose a probabilistic PCA method to estimate them. Our approach can drastically improve accuracy compared to Experience Replay. For instance, on Split-ImageNet, we get 60% accuracy compared to 30% obtained by replay with memory-size equivalent to 0.3% of the data size. Increasing the memory size to 2% further boosts the accuracy to 74%, closing the gap to the batch accuracy of 77.6% on this task. Our work opens a new direction for building compact memory that can also be useful in the future for continual deep learning.
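
For logistic regression the loss Hessian has the closed form H = Σᵢ pᵢ(1−pᵢ) xᵢxᵢᵀ, which is what makes Hessian-matching a concrete target. The sketch below is illustrative only (it uses a plain SVD-based low-rank surrogate, not the paper's probabilistic PCA estimator): it shows that a low-rank factor of the curvature-weighted data matches the full Hessian increasingly well as the rank grows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w = rng.normal(size=5)
p = 1.0 / (1.0 + np.exp(-X @ w))
lam = p * (1 - p)                      # per-example curvature weights

# Full-data Hessian of the logistic loss: H = X^T diag(lam) X
H = (X * lam[:, None]).T @ X

# Low-rank "memory": keep top-k principal directions of sqrt(lam_i) * x_i,
# since H = G^T G for G = diag(sqrt(lam)) X
G = np.sqrt(lam)[:, None] * X
_, s, Vt = np.linalg.svd(G, full_matrices=False)

def hessian_error(k):
    Hk = (Vt[:k].T * s[:k] ** 2) @ Vt[:k]     # rank-k Hessian reconstruction
    return np.linalg.norm(H - Hk)             # Frobenius-norm mismatch

errs = [hessian_error(k) for k in range(1, 6)]
assert errs[-1] < 1e-8                              # full rank reproduces H
assert all(a >= b for a, b in zip(errs, errs[1:]))  # error shrinks with rank
```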

[358] Data Fusion-Enhanced Decision Transformer for Stable Cross-Domain Generalization

Guojian Wang, Quinson Hon, Xuyang Chen, Lin Zhao

Main category: cs.LG

TL;DR: DFDT addresses cross-domain policy adaptation challenges for Decision Transformers by fusing target data with filtered source fragments using MMD and OT measures, replacing RTG tokens with advantage-conditioned tokens, and applying Q-guided regularization to improve stitchability and continuity.

Motivation: Cross-domain shifts challenge DT policies due to poor stitchability of source trajectory fragments: state structure misalignment, incomparable RTG tokens when reward/horizon changes, and action jumps at junctions, compromising DT's inference ability.

Method: Two-level data filter using MMD for state-structure alignment and OT for action feasibility; trains on feasibility-weighted fusion distribution; replaces RTG tokens with advantage-conditioned tokens; applies Q-guided regularizer to suppress junction jumps.

Result: DFDT improves return and stability over strong offline RL and sequence-model baselines across gravity, kinematic, and morphology shifts on D4RL-style control tasks, with gains confirmed through token-stitching and sequence-semantics stability analyses.

Conclusion: DFDT effectively addresses cross-domain adaptation challenges by restoring stitchability through data fusion, token replacement, and regularization, with theoretical bounds showing performance improvement as MMD-mismatch and OT-deviation measures shrink.

Abstract: Cross-domain shifts present a significant challenge for decision transformer (DT) policies. Existing cross-domain policy adaptation methods typically rely on a single simple filtering criterion to select source trajectory fragments and stitch them together. They match either state structure or action feasibility. However, the selected fragments still have poor stitchability: state structures can misalign, the return-to-go (RTG) becomes incomparable when the reward or horizon changes, and actions may jump at trajectory junctions. As a result, RTG tokens lose continuity, which compromises DT’s inference ability. To tackle these challenges, we propose Data Fusion-Enhanced Decision Transformer (DFDT), a compact pipeline that restores stitchability. Particularly, DFDT fuses scarce target data with selectively trusted source fragments via a two-level data filter, maximum mean discrepancy (MMD) mismatch for state-structure alignment, and optimal transport (OT) deviation for action feasibility. It then trains on a feasibility-weighted fusion distribution. Furthermore, DFDT replaces RTG tokens with advantage-conditioned tokens, which improves the continuity of the semantics in the token sequence. It also applies a $Q$-guided regularizer to suppress junction value and action jumps. Theoretically, we provide bounds that tie state value and policy performance gaps to the MMD-mismatch and OT-deviation measures, and show that the bounds tighten as these two measures shrink. We show that DFDT improves return and stability over strong offline RL and sequence-model baselines across gravity, kinematic, and morphology shifts on D4RL-style control tasks, and further corroborate these gains with token-stitching and sequence-semantics stability analyses.
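
The state-structure filter scores how well a source fragment's states match the target distribution via maximum mean discrepancy. A standard biased RBF-kernel MMD² estimator looks like the following (the bandwidth `gamma` and any acceptance threshold are placeholders, not the paper's settings):

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased (V-statistic) squared MMD between sample sets X and Y
    under an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(1)
src_near = rng.normal(0.0, 1.0, size=(100, 2))   # source close to target
src_far  = rng.normal(3.0, 1.0, size=(100, 2))   # source far from target
target   = rng.normal(0.0, 1.0, size=(100, 2))

assert abs(mmd2_rbf(target, target)) < 1e-12          # identical sets: zero
assert mmd2_rbf(src_near, target) < mmd2_rbf(src_far, target)
```

Fragments with small MMD mismatch would be trusted for fusion; the OT-deviation check on actions plays the analogous role on the action side.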

[359] FSampler: Training Free Acceleration of Diffusion Sampling via Epsilon Extrapolation

Michael A. Vladimir

Main category: cs.LG

TL;DR: FSampler is a training-free, sampler-agnostic execution layer that accelerates diffusion sampling by reducing function evaluations through epsilon prediction and substitution.

Motivation: To accelerate diffusion sampling by reducing the number of function evaluations (NFE) without requiring retraining or modifying sampler formulas.

Method: Maintains history of denoising signals (epsilon), extrapolates next epsilon using finite difference predictors (2nd-4th order), substitutes predicted epsilon on selected steps, and uses stabilizers to correct drift and local curvature.

Result: Reduces time by 8-22% and model calls by 15-25% at high fidelity (SSIM 0.95-0.99), with aggressive settings reaching 45-50% fewer model calls at lower fidelity (SSIM 0.73-0.74).

Conclusion: FSampler effectively accelerates diffusion sampling across multiple samplers while maintaining high fidelity, requiring no training or sampler modifications.

Abstract: FSampler is a training-free, sampler-agnostic execution layer that accelerates diffusion sampling by reducing the number of function evaluations (NFE). FSampler maintains a short history of denoising signals (epsilon) from recent real model calls and extrapolates the next epsilon using finite difference predictors at second order, third order, or fourth order, falling back to lower order when history is insufficient. On selected steps the predicted epsilon substitutes the model call while keeping each sampler’s update rule unchanged. Predicted epsilons are validated for finiteness and magnitude; a learning stabilizer rescales predictions on skipped steps to correct drift, and an optional gradient estimation stabilizer compensates local curvature. Protected windows, periodic anchors, and a cap on consecutive skips bound deviation over the trajectory. Operating at the sampler level, FSampler integrates with Euler/DDIM, DPM++ 2M/2S, LMS/AB2, and RES family exponential multistep methods and drops into standard workflows. On FLUX.1 dev, Qwen Image, and Wan 2.2, FSampler reduces time by 8 to 22% and model calls by 15 to 25% at high fidelity (Structural Similarity Index (SSIM) 0.95 to 0.99), without altering sampler formulas. With an aggressive adaptive gate, reductions can reach 45 to 50% fewer model calls at lower fidelity (SSIM 0.73 to 0.74).
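
The finite-difference predictors admit a compact sketch: from the last few stored epsilons, a polynomial extrapolation predicts the next one, dropping to a lower order when history is short. A minimal version of that core step (illustrative only; real use adds the validation, stabilizers, anchors, and skip cap described above):

```python
import numpy as np

def extrapolate_eps(history):
    """Predict the next epsilon from recent real model outputs via
    finite-difference extrapolation, falling back to lower order
    when the history is too short."""
    h = [np.asarray(e, dtype=float) for e in history]
    if len(h) >= 3:   # quadratic extrapolation from the last 3 epsilons
        return 3 * h[-1] - 3 * h[-2] + h[-3]
    if len(h) == 2:   # linear extrapolation from the last 2 epsilons
        return 2 * h[-1] - h[-2]
    return h[-1]      # not enough history: repeat the last epsilon

# On an exactly quadratic trajectory the 3-point predictor is exact:
traj = [np.array([t ** 2]) for t in range(5)]        # 0, 1, 4, 9, 16
pred = extrapolate_eps(traj[:4])                     # predict eps at t=4
assert np.allclose(pred, traj[4])                    # 3*9 - 3*4 + 1 = 16
```

On skipped steps the sampler's own update rule consumes `pred` exactly as it would a real model output, which is why the scheme is sampler-agnostic.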

[360] Iterated Population Based Training with Task-Agnostic Restarts

Alexander Chebykin, Tanja Alderliesten, Peter A. N. Bosman

Main category: cs.LG

TL;DR: IPBT automatically adjusts the number of steps between hyperparameter updates in Population Based Training via restarts and time-varying Bayesian optimization, outperforming previous PBT variants without requiring budget increases.

Motivation: The number of steps between hyperparameter updates is a critical meta-hyperparameter in PBT that significantly affects performance, but no efficient method exists for setting its value automatically.

Method: IPBT uses restarts that reuse weight information task-agnostically and leverages time-varying Bayesian optimization to reinitialize hyperparameters, automatically adjusting the update frequency.

Result: Evaluation on 8 image classification and reinforcement learning tasks shows IPBT matches or outperforms 5 previous PBT variants and other HPO algorithms (random search, ASHA, SMAC3) on average.

Conclusion: IPBT provides an effective solution for automatically tuning the critical update frequency hyperparameter in PBT, achieving superior performance without requiring additional budget or manual hyperparameter changes.

Abstract: Hyperparameter Optimization (HPO) can lift the burden of tuning hyperparameters (HPs) of neural networks. HPO algorithms from the Population Based Training (PBT) family are efficient thanks to dynamically adjusting HPs every few steps of the weight optimization. Recent results indicate that the number of steps between HP updates is an important meta-HP of all PBT variants that can substantially affect their performance. Yet, no method or intuition is available for efficiently setting its value. We introduce Iterated Population Based Training (IPBT), a novel PBT variant that automatically adjusts this HP via restarts that reuse weight information in a task-agnostic way and leverage time-varying Bayesian optimization to reinitialize HPs. Evaluation on 8 image classification and reinforcement learning tasks shows that, on average, our algorithm matches or outperforms 5 previous PBT variants and other HPO algorithms (random search, ASHA, SMAC3), without requiring a budget increase or any changes to its HPs. The source code is available at https://github.com/AwesomeLemon/IPBT.

[361] Sure! Here’s a short and concise title for your paper: “Contamination in Generated Text Detection Benchmarks”

Philipp Dingfelder, Christian Riess

Main category: cs.LG

TL;DR: The paper addresses quality issues in AI-generated text detection datasets, particularly the DetectRL benchmark, by identifying and removing simple patterns that detectors exploit as shortcuts, making them vulnerable to spoofing attacks.

Motivation: To improve the reliability of AI-generated text detectors by ensuring high-quality benchmark datasets that don't contain exploitable patterns, as current datasets like DetectRL exhibit simple AI-generation patterns that enable spoofing attacks.

Method: Reprocessed the DetectRL dataset through several cleansing operations to remove identifiable patterns such as introductory phrases and task rejections that detectors use as shortcuts.

Result: Data cleansing made direct spoofing attacks on trained detectors more difficult, demonstrating improved robustness of detectors trained on the cleaned dataset.

Conclusion: Proper data cleansing is crucial for developing reliable AI-generated text detectors, and the reprocessed dataset is publicly available to support future research.

Abstract: Large language models are increasingly used for many applications. To prevent illicit use, it is desirable to be able to detect AI-generated text. Training and evaluation of such detectors critically depend on suitable benchmark datasets. Several groups took on the tedious work of collecting, curating, and publishing large and diverse datasets for this task. However, it remains an open challenge to ensure high quality in all relevant aspects of such a dataset. For example, the DetectRL benchmark exhibits relatively simple patterns of AI-generation in 98.5% of the Claude-LLM data. These patterns may include introductory words such as “Sure! Here is the academic article abstract:”, or instances where the LLM rejects the prompted task. In this work, we demonstrate that detectors trained on such data use such patterns as shortcuts, which facilitates spoofing attacks on the trained detectors. We consequently reprocessed the DetectRL dataset with several cleansing operations. Experiments show that such data cleansing makes direct attacks more difficult. The reprocessed dataset is publicly available.
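
As an illustration of one such cleansing operation (the pattern and helper name below are hypothetical; the paper applies several operations not shown here), a regex pass can strip tell-tale LLM preambles from the start of each sample:

```python
import re

# Hypothetical cleansing pass: remove introductory phrases such as
# "Sure! Here is the academic article abstract:" that detectors
# otherwise learn to use as shortcuts.
PREAMBLE = re.compile(r"^\s*sure[!,.]?\s*here(?:'s| is)\b[^:\n]*:\s*",
                      re.IGNORECASE)

def strip_preamble(text: str) -> str:
    """Drop a leading 'Sure! Here is ...:' preamble, if present."""
    return PREAMBLE.sub("", text, count=1)

sample = "Sure! Here is the academic article abstract: We study..."
assert strip_preamble(sample) == "We study..."
assert strip_preamble("We study...") == "We study..."   # clean text untouched
```

Samples where the LLM rejected the task entirely would need a separate filter, since there is no underlying text left to recover.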

[362] Stochastic Mean-Shift Clustering

Itshak Lapidot, Yann Sepulcre, Tom Trigano

Main category: cs.LG

TL;DR: Stochastic version of mean-shift clustering using random data point sequences with partial gradient ascent, showing better performance than standard mean-shift in most cases.

Motivation: To improve the mean-shift clustering algorithm by introducing stochastic elements for potentially better performance and efficiency.

Method: A randomly chosen sequence of data points moves according to partial gradient ascent steps of the objective function.

Result: Outperforms standard mean-shift clustering in most cases on synthesized 2D Gaussian mixture samples and successfully applied to speaker clustering.

Conclusion: Stochastic mean-shift clustering is an effective improvement over the standard version, demonstrating superior performance in most scenarios.

Abstract: We present a stochastic version of the mean-shift clustering algorithm. In this stochastic version a randomly chosen sequence of data points moves according to partial gradient ascent steps of the objective function. Theoretical results illustrate the convergence of the proposed approach, and its relative performance is evaluated on synthesized 2-dimensional samples generated by a Gaussian mixture distribution and compared with state-of-the-art methods. In most cases the stochastic mean-shift clustering outperforms the standard mean-shift. As a practical application, we also illustrate the use of the presented method for speaker clustering.
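
The update is easy to sketch: at each iteration one randomly chosen point takes a kernel-weighted step toward the local mean of the data around it, a partial gradient-ascent step on the KDE objective. A toy NumPy reading of the scheme (bandwidth, iteration count, and the blurring-style formulation in which the data points themselves move are illustrative choices, not the paper's exact algorithm):

```python
import numpy as np

def stochastic_mean_shift(X, bandwidth=0.5, n_iters=2000, seed=0):
    """Each iteration, one randomly chosen point moves to the Gaussian
    kernel-weighted mean of the current point set around it."""
    rng = np.random.default_rng(seed)
    Z = X.copy()
    for _ in range(n_iters):
        i = rng.integers(len(Z))
        w = np.exp(-((Z - Z[i]) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
        Z[i] = (w[:, None] * Z).sum(axis=0) / w.sum()   # shift to local mean
    return Z

# Two well-separated Gaussian blobs should each contract toward a mode
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(3, 0.1, (30, 2))])
Z = stochastic_mean_shift(X)
assert Z[:30].std() < X[:30].std()   # first cluster has contracted
assert Z[30:].std() < X[30:].std()   # second cluster has contracted
```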

[363] CoCo-MILP: Inter-Variable Contrastive and Intra-Constraint Competitive MILP Solution Prediction

Tianle Pu, Jianing Li, Yingying Gao, Shixuan Liu, Zijie Geng, Haoyang Liu, Chao Chen, Changjun Fan

Main category: cs.LG

TL;DR: CoCo-MILP introduces contrastive learning and competitive GNN architecture to better align with MILP structure, significantly outperforming existing methods by modeling variable relationships and constraint competitions.

Motivation: Existing GNN-based MILP solvers misalign with problem structure by treating variables independently and smoothing representations, missing competitive relationships within constraints.

Method: Proposes inter-variable contrastive loss to maximize embedding margin between 0/1 variables, and intra-constraint competitive GNN layer that differentiates competing variables instead of homogenizing features.

Result: Reduces solution gap by up to 68.12% compared to traditional solvers on standard benchmarks, significantly outperforming existing learning-based approaches.

Conclusion: Explicitly modeling contrast and competition in MILP structure through novel objectives and architecture leads to substantial performance improvements in solution prediction.

Abstract: Mixed-Integer Linear Programming (MILP) is a cornerstone of combinatorial optimization, yet solving large-scale instances remains a significant computational challenge. Recently, Graph Neural Networks (GNNs) have shown promise in accelerating MILP solvers by predicting high-quality solutions. However, we identify that existing methods misalign with the intrinsic structure of MILP problems at two levels. At the learning objective level, the Binary Cross-Entropy (BCE) loss treats variables independently, neglecting their relative priority and yielding plausible logits. At the model architecture level, standard GNN message passing inherently smooths the representations across variables, missing the natural competitive relationships within constraints. To address these challenges, we propose CoCo-MILP, which explicitly models inter-variable Contrast and intra-constraint Competition for advanced MILP solution prediction. At the objective level, CoCo-MILP introduces the Inter-Variable Contrastive Loss (VCL), which explicitly maximizes the embedding margin between variables assigned one versus zero. At the architectural level, we design an Intra-Constraint Competitive GNN layer that, instead of homogenizing features, learns to differentiate representations of competing variables within a constraint, capturing their exclusionary nature. Experimental results on standard benchmarks demonstrate that CoCo-MILP significantly outperforms existing learning-based approaches, reducing the solution gap by up to 68.12% compared to traditional solvers. Our code is available at https://github.com/happypu326/CoCo-MILP.
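
The VCL operates on embeddings; as a simplified scalar analogue (function name, margin value, and the mean-gap formulation are illustrative, not the paper's loss), a hinge on the gap between the scores of 1-assigned and 0-assigned variables captures the contrastive idea:

```python
import numpy as np

def inter_variable_contrastive_hinge(scores, labels, margin=1.0):
    """Hinge-style sketch: push the mean score of variables assigned 1
    above the mean score of variables assigned 0 by at least `margin`.
    (Scalar stand-in for a margin between embedding groups.)"""
    pos = scores[labels == 1].mean()
    neg = scores[labels == 0].mean()
    return max(0.0, margin - (pos - neg))

scores = np.array([2.0, 1.5, -1.0, -0.5])
labels = np.array([1, 1, 0, 0])
assert inter_variable_contrastive_hinge(scores, labels) == 0.0     # gap 2.5 >= 1
assert inter_variable_contrastive_hinge(scores, labels, margin=3.0) == 0.5
```

Contrast this with per-variable BCE, which would score each logit against its own label and never see the relative ordering between the two groups.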

[364] AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting

Renda Li, Hailang Huang, Fei Wei, Feng Xiong, Yong Wang, Xiangxiang Chu

Main category: cs.LG

TL;DR: AdaCuRL is an adaptive curriculum reinforcement learning framework that addresses gradient starvation and policy degradation in LLM reasoning by dynamically aligning data difficulty with model capability and preventing catastrophic forgetting.

Motivation: Existing RL methods for LLM reasoning suffer from gradient starvation and policy degradation when training on mixed-difficulty samples, while prior approaches using CoT data are labor-intensive and curriculum learning methods face difficulty mismatch, manual design requirements, and catastrophic forgetting.

Method: AdaCuRL integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling, dynamically aligns data difficulty with model capability, incorporates data revisitation to mitigate catastrophic forgetting, and employs adaptive reference and sparse KL strategies to prevent policy degradation.

Result: Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.

Conclusion: AdaCuRL effectively addresses the limitations of existing RL methods for reasoning tasks by providing an adaptive curriculum framework that dynamically matches data difficulty to model capability while preventing common training issues.

Abstract: Reinforcement learning (RL) has demonstrated considerable potential for enhancing reasoning in large language models (LLMs). However, existing methods suffer from Gradient Starvation and Policy Degradation when training directly on samples with mixed difficulty. To mitigate this, prior approaches leverage Chain-of-Thought (CoT) data, but the construction of high-quality CoT annotations remains labor-intensive. Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, an Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. This approach dynamically aligns data difficulty with model capability and incorporates a data revisitation mechanism to mitigate catastrophic forgetting. Furthermore, AdaCuRL employs adaptive reference and sparse KL strategies to prevent Policy Degradation. Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.

[365] Parameter-Free Clustering via Self-Supervised Consensus Maximization (Extended Version)

Lijun Zhang, Suyuan Liu, Siwei Wang, Shengju Yu, Xueling Zhu, Miaomiao Li, Xinwang Liu

Main category: cs.LG

TL;DR: SCMax is a fully parameter-free clustering framework that uses self-supervised consensus maximization to automatically determine the optimal number of clusters through hierarchical agglomerative clustering.

Motivation: Most existing clustering methods require hyperparameters like the number of clusters, limiting their real-world applicability. This paper addresses the challenge of parameter-free clustering.

Method: Performs hierarchical agglomerative clustering with integrated cluster evaluation. Creates structure-aware data representations through self-supervised learning guided by current clustering structure, and uses nearest neighbor consensus score to measure agreement between original and self-supervised representations.

Result: Extensive experiments show SCMax outperforms existing clustering approaches designed for scenarios with unknown number of clusters.

Conclusion: The proposed framework successfully addresses parameter-free clustering by using self-supervised consensus maximization to automatically determine optimal clustering structure.

Abstract: Clustering is a fundamental task in unsupervised learning, but most existing methods heavily rely on hyperparameters such as the number of clusters or other sensitive settings, limiting their applicability in real-world scenarios. To address this long-standing challenge, we propose a novel and fully parameter-free clustering framework via Self-supervised Consensus Maximization, named SCMax. Our framework performs hierarchical agglomerative clustering and cluster evaluation in a single, integrated process. At each step of agglomeration, it creates a new, structure-aware data representation through a self-supervised learning task guided by the current clustering structure. We then introduce a nearest neighbor consensus score, which measures the agreement between the nearest neighbor-based merge decisions suggested by the original representation and the self-supervised one. The moment at which consensus maximization occurs can serve as a criterion for determining the optimal number of clusters. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing clustering approaches designed for scenarios with an unknown number of clusters.

[366] Controllable protein design through Feynman-Kac steering

Erik Hartman, Jonas Wallin, Johan Malmström, Jimmy Olsson

Main category: cs.LG

TL;DR: Extends Feynman-Kac steering to diffusion-based protein design, enabling guided generation toward functional objectives like binding affinity while maintaining diversity.

Motivation: Current diffusion models generate realistic protein structures but lack the ability to steer outcomes toward specific functional objectives like binding affinity or sequence composition.

Method: Couples Feynman-Kac steering framework with diffusion-based structure generation, using rewards computed on ProteinMPNN-refined models and all-atom relaxation for simultaneous sequence and structure property generation.

Result: Consistently improves predicted interface energetics in binder design across diverse targets with minimal computational overhead.

Conclusion: Feynman-Kac control generalizes diffusion-based protein design to arbitrary, non-differentiable objectives, providing a unified framework for guided molecular generation.

Abstract: Diffusion-based models have recently enabled the generation of realistic and diverse protein structures, yet they remain limited in their ability to steer outcomes toward specific functional or biochemical objectives, such as binding affinity or sequence composition. Here we extend the Feynman-Kac (FK) steering framework, an inference-time control approach, to diffusion-based protein design. By coupling FK steering with structure generation, the method guides sampling toward desirable structural or energetic features while maintaining the diversity of the underlying diffusion process. To enable simultaneous generation of both sequence and structure properties, rewards are computed on models refined through ProteinMPNN and all-atom relaxation. Applied to binder design, FK steering consistently improves predicted interface energetics across diverse targets with minimal computational overhead. More broadly, this work demonstrates that inference-time FK control generalizes diffusion-based protein design to arbitrary, non-differentiable, and reward-agnostic objectives, providing a unified and model-independent framework for guided molecular generation.
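
Feynman-Kac steering amounts to sequential importance resampling at inference time: intermediate samples are reweighted by an exponentiated reward and resampled, so high-reward candidates are duplicated and low-reward ones dropped, without touching the diffusion model itself. A toy one-step sketch (the reward function and tilt strength `lam` are placeholders, standing in for the ProteinMPNN/relaxation-based rewards described above):

```python
import numpy as np

def fk_resample(particles, reward, lam=1.0, seed=0):
    """One steering step: reweight samples by exp(lam * reward) and
    resample with replacement, concentrating mass on high-reward samples."""
    rng = np.random.default_rng(seed)
    w = np.exp(lam * np.array([reward(p) for p in particles]))
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]

# Toy reward: prefer samples near 1.0
particles = [0.0, 0.2, 0.9, 1.1]
steered = fk_resample(particles, reward=lambda x: -abs(x - 1.0), lam=10.0)
assert np.mean(steered) > np.mean(particles)   # mass shifts toward the reward
```

Because only weights and resampling are involved, the reward can be arbitrary and non-differentiable, which is the property the abstract highlights.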

[367] Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization

Paul Strang, Zacharie Alès, Côme Bissuel, Safia Kedad-Sidhoum, Emmanuel Rachelson

Main category: cs.LG

TL;DR: PlanB&B is a model-based reinforcement learning agent that uses a learned internal model of branch-and-bound dynamics to improve branching strategies in mixed-integer linear programming, outperforming previous RL methods.

DetailsMotivation: Traditional variable selection heuristics in MILP solvers are static and hand-crafted. Recent RL approaches aim to learn better branching strategies, but existing methods don't leverage environment simulators like MCTS which have shown success in board games.

Method: PlanB&B uses model-based reinforcement learning with a learned internal model of B&B dynamics, inspired by Monte Carlo Tree Search approaches that have succeeded in board games.

Result: The MBRL branching agent outperforms previous state-of-the-art RL methods across four standard MILP benchmarks in computational experiments.

Conclusion: Model-based reinforcement learning with learned internal models can effectively discover improved branching strategies for MILP problems, demonstrating the value of planning-based approaches in combinatorial optimization.

Abstract: Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver of B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problems, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.

[368] A Distributed Training Architecture For Combinatorial Optimization

Yuyao Long

Main category: cs.LG

TL;DR: A distributed GNN framework for combinatorial optimization that partitions large graphs into subgraphs, trains them locally, and uses RL to handle cross-node constraints, achieving better scalability and performance than existing methods.

DetailsMotivation: Existing GNN methods for combinatorial optimization suffer from limited accuracy on complex graphs and poor scalability due to memory constraints when loading entire graphs.

Method: Partition large graphs into subgraphs, fully train individual subgraphs, and use reinforcement learning to handle cross-node constraints based on GNN outputs.

Result: Outperforms state-of-the-art approaches in solution quality and computational efficiency on both real social networks and synthetic graphs, with validated scalability on large instances.

Conclusion: The proposed distributed GNN framework effectively addresses scalability and performance limitations in combinatorial optimization problems on large graphs.

Abstract: In recent years, graph neural networks (GNNs) have been widely applied to combinatorial optimization problems. However, existing methods still suffer from limited accuracy on complex graphs and exhibit poor scalability, since full training requires loading the whole adjacency matrix and all embeddings at once, which may exhaust the memory of a single machine. This limitation significantly restricts their applicability to large-scale scenarios. To address these challenges, we propose a distributed GNN-based training framework for combinatorial optimization. In detail, the large graph is first partitioned into several small subgraphs. The individual subgraphs are then fully trained, providing a foundation for efficient local optimization. Finally, reinforcement learning (RL) is employed to take actions according to the GNN output, ensuring that the constraints between cross-partition nodes can be learned. Extensive experiments are conducted on both real large-scale social network datasets (e.g., Facebook, Youtube) and synthetically generated high-complexity graphs, which demonstrate that our framework outperforms state-of-the-art approaches in both solution quality and computational efficiency. Moreover, the experiments on large graph instances also validate the scalability of the model.
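The partition step above can be illustrated in a few lines. This sketch assumes an edge-list graph and a naive contiguous node split (a real system would use a balanced partitioner such as METIS); intra-subgraph edges go to local training, cross-partition edges are left for the RL stage.

```python
def partition_graph(edges, num_nodes, k):
    # assign node v to one of k contiguous blocks (toy stand-in for METIS)
    block = lambda v: min(v * k // num_nodes, k - 1)
    subgraphs = [[] for _ in range(k)]   # edges trained locally per subgraph
    cross_edges = []                     # constraints handled by the RL stage
    for u, v in edges:
        if block(u) == block(v):
            subgraphs[block(u)].append((u, v))
        else:
            cross_edges.append((u, v))
    return subgraphs, cross_edges
```

Each subgraph then fits on one worker's memory, which is exactly the bottleneck the framework targets.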

[369] GuardFed: A Trustworthy Federated Learning Framework Against Dual-Facet Attacks

Yanli Li, Yanan Zhou, Zhongliang Guo, Nan Yang, Yuning Zhang, Huaming Chen, Dong Yuan, Weiping Ding, Witold Pedrycz

Main category: cs.LG

TL;DR: The paper introduces Dual-Facet Attack (DFA) that simultaneously degrades both utility and fairness in federated learning, and proposes GuardFed defense framework to counter these attacks.

DetailsMotivation: Federated learning is vulnerable to adversarial attacks, but existing research focuses on attacks targeting either utility or fairness separately, leaving simultaneous attacks unexplored.

Method: Proposed Dual-Facet Attack (DFA) with two variants (Synchronous and Split), and developed GuardFed defense that uses fairness-aware reference model and dual-perspective trust scoring for client updates.

Result: Existing FL defenses fail against DFAs, while GuardFed consistently preserves accuracy and fairness under diverse non-IID and adversarial conditions, achieving state-of-the-art performance.

Conclusion: DFA poses significant threat to FL systems, and GuardFed provides effective defense by jointly evaluating utility and fairness degradation in client updates.

Abstract: Federated learning (FL) enables privacy-preserving collaborative model training but remains vulnerable to adversarial behaviors that compromise model utility or fairness across sensitive groups. While extensive studies have examined attacks targeting either objective, strategies that simultaneously degrade both utility and fairness remain largely unexplored. To bridge this gap, we introduce the Dual-Facet Attack (DFA), a novel threat model that concurrently undermines predictive accuracy and group fairness. Two variants, Synchronous DFA (S-DFA) and Split DFA (Sp-DFA), are further proposed to capture distinct real-world collusion scenarios. Experimental results show that existing robust FL defenses, including hybrid aggregation schemes, fail to resist DFAs effectively. To counter these threats, we propose GuardFed, a self-adaptive defense framework that maintains a fairness-aware reference model using a small amount of clean server data augmented with synthetic samples. In each training round, GuardFed computes a dual-perspective trust score for every client by jointly evaluating its utility deviation and fairness degradation, thereby enabling selective aggregation of trustworthy updates. Extensive experiments on real-world datasets demonstrate that GuardFed consistently preserves both accuracy and fairness under diverse non-IID and adversarial conditions, achieving state-of-the-art performance compared with existing robust FL methods.
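A dual-perspective trust score of the kind described can be sketched as follows. The linear weighting, the threshold, and the exact deviation measures are assumptions for illustration, not GuardFed's actual formulas.

```python
def trust_score(client_acc, client_gap, ref_acc, ref_gap, alpha=0.5):
    # utility deviation: accuracy lost relative to the fairness-aware reference
    util_dev = max(0.0, ref_acc - client_acc)
    # fairness degradation: widening of the group-fairness gap
    fair_deg = max(0.0, client_gap - ref_gap)
    return 1.0 - (alpha * util_dev + (1.0 - alpha) * fair_deg)

def aggregate(updates, scores, threshold=0.8):
    # selective aggregation: average only updates from trusted clients
    kept = [u for u, s in zip(updates, scores) if s >= threshold]
    return [sum(ws) / len(ws) for ws in zip(*kept)] if kept else None
```

The key design point mirrored here is that a client is penalized if it hurts *either* objective, which is what lets the defense catch dual-facet attackers that utility-only defenses miss.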

[370] Multi-step Predictive Coding Leads To Simplicity Bias

Aviv Ratzon, Omri Barak

Main category: cs.LG

TL;DR: Deep networks with multi-step prediction horizons consistently recover latent structure in predictive coding tasks, explained through OLS estimator structure and learning dynamics biases.

DetailsMotivation: To understand when and why predictive coding forms structured internal representations that mirror environmental latent structure, as current conditions for such emergence remain unclear.

Method: Used minimal abstract setting with deep networks trained with multi-step prediction horizons, extended to nonlinear networks and complex datasets including piecewise linear functions, MNIST, multiple latent states, and higher dimensional geometries.

Result: Sufficiently deep networks with multi-step prediction horizons consistently recover the underlying latent structure across various settings.

Conclusion: Provides principled understanding of when and why predictive coding induces structured representations, bridging empirical observations with theoretical foundations.

Abstract: Predictive coding is a framework for understanding the formation of low-dimensional internal representations mirroring the environment’s latent structure. The conditions under which such representations emerge remain unclear. In this work, we investigate how the prediction horizon and network depth shape the solutions of predictive coding tasks. Using a minimal abstract setting inspired by prior work, we show empirically and theoretically that sufficiently deep networks trained with multi-step prediction horizons consistently recover the underlying latent structure, a phenomenon explained through the Ordinary Least Squares estimator structure and biases in learning dynamics. We then extend these insights to nonlinear networks and complex datasets, including piecewise linear functions, MNIST, multiple latent states and higher dimensional state geometries. Our results provide a principled understanding of when and why predictive coding induces structured representations, bridging the gap between empirical observations and theoretical foundations.
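The OLS connection can be made concrete with a one-parameter toy: fit x_{t+h} ≈ w·x_t by ordinary least squares, the closed form that deep linear networks trained on multi-step prediction implicitly estimate according to the paper's analysis. The scalar AR(1) setup below is illustrative only.

```python
def multistep_ols(series, horizon):
    # pair each observation with its value `horizon` steps ahead
    pairs = [(series[t], series[t + horizon])
             for t in range(len(series) - horizon)]
    # closed-form 1-D OLS: w = sum(x*y) / sum(x*x)
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den
```

For dynamics x_{t+1} = a·x_t the h-step estimator recovers a**h, i.e. the longer horizon forces the predictor to compose the one-step dynamics, which is the mechanism behind the latent-structure recovery discussed above.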

[371] Efficiently Transforming Neural Networks into Decision Trees: A Path to Ground Truth Explanations with RENTT

Helena Monke, Benjamin Fresz, Marco Bernreuther, Yilin Chen, Marco F. Huber

Main category: cs.LG

TL;DR: RENTT algorithm transforms neural networks into exact equivalent decision trees to provide faithful explanations, overcoming limitations of existing XAI methods.

DetailsMotivation: Neural networks lack interpretability, and current explainable AI methods often provide unfaithful explanations that don't align with the model's actual decision logic.

Method: Proposed RENTT algorithm that computes exact equivalent decision tree representations of neural networks, supporting CNNs, RNNs, non-ReLU activations, and bias terms with rigorous proofs.

Result: RENTT achieves runtime and memory efficiency, provides ground truth feature importance (global, regional, local), and outperforms approximation methods like LIME and SHAP in faithfulness.

Conclusion: RENTT enables exact, scalable, and interpretable neural network explanations through decision tree transformations, verified by computational efficiency and superior explanation quality.

Abstract: Although neural networks are a powerful tool, their widespread use is hindered by the opacity of their decisions and their black-box nature, which result in a lack of trustworthiness. To alleviate this problem, methods in the field of explainable Artificial Intelligence try to unveil how such automated decisions are made. But explainable AI methods are often plagued by missing faithfulness/correctness, meaning that they sometimes provide explanations that do not align with the neural network’s decision and logic. Recently, transformations to decision trees have been proposed to overcome such problems. Unfortunately, they typically lack exactness, scalability, or interpretability as the size of the neural network grows. Thus, we generalize these previous results, especially by considering convolutional neural networks, recurrent neural networks, non-ReLU activation functions, and bias terms. Our findings are accompanied by rigorous proofs and we present a novel algorithm RENTT (Runtime Efficient Network to Tree Transformation) designed to compute an exact equivalent decision tree representation of neural networks in a manner that is both runtime and memory efficient. The resulting decision trees are multivariate and thus, possibly too complex to understand. To alleviate this problem, we also provide a method to calculate the ground truth feature importance for neural networks via the equivalent decision trees - for entire models (global), specific input regions (regional), or single decisions (local). All theoretical results are supported by detailed numerical experiments that emphasize two key aspects: the computational efficiency and scalability of our algorithm, and that only RENTT succeeds in uncovering ground truth explanations compared to conventional approximation methods like LIME and SHAP. All code is available at https://github.com/HelenaM23/RENTT .
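The core equivalence RENTT exploits can be seen on a one-hidden-layer ReLU net with scalar input: each hidden unit's activation sign acts as a tree split, and inside the resulting region the network is exactly affine, so every leaf stores a linear model. This toy is only illustrative; RENTT itself covers CNNs, RNNs, non-ReLU activations, and bias terms.

```python
def relu_net(x, W1, b1, w2, b2):
    # tiny one-hidden-layer ReLU network (scalar input and output)
    h = [max(0.0, w * x + b) for w, b in zip(W1, b1)]
    return sum(wi * hi for wi, hi in zip(w2, h)) + b2

def as_tree(x, W1, b1, w2, b2):
    # branch on the sign of each pre-activation (the tree's internal
    # nodes); the reached region fixes an exact affine leaf model
    slope, intercept = 0.0, b2
    for w, b, wo in zip(W1, b1, w2):
        if w * x + b > 0:          # region test = tree split
            slope += wo * w        # accumulate the active affine piece
            intercept += wo * b
    return slope * x + intercept
```

Because the two functions agree everywhere, explanations read off the tree (e.g. the leaf's slope as a feature importance) are faithful by construction rather than approximations in the style of LIME or SHAP.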

[372] Distribution-Based Feature Attribution for Explaining the Predictions of Any Classifier

Xinpeng Li, Kai Ming Ting

Main category: cs.LG

TL;DR: This paper introduces a formal definition for feature attribution in AI explainability and proposes DFAX, a model-agnostic method that explains classifier predictions based on the underlying data distribution, outperforming existing methods.

DetailsMotivation: The proliferation of complex black-box AI models has created a need for explanation techniques, but feature attribution methods lack formal problem definitions and many existing approaches fail to properly account for the underlying data distribution.

Method: The authors propose Distributional Feature Attribution eXplanations (DFAX), a novel model-agnostic method that explains classifier predictions directly based on the data distribution represented by the dataset.

Result: Extensive experiments show that DFAX is more effective and efficient than state-of-the-art baseline methods for feature attribution.

Conclusion: DFAX successfully addresses the formal definition gap in feature attribution by providing explanations supported by the underlying probability distribution, overcoming limitations of existing methods.

Abstract: The proliferation of complex, black-box AI models has intensified the need for techniques that can explain their decisions. Feature attribution methods have become a popular solution for providing post-hoc explanations, yet the field has historically lacked a formal problem definition. This paper addresses this gap by introducing a formal definition for the problem of feature attribution, which stipulates that explanations be supported by an underlying probability distribution represented by the given dataset. Our analysis reveals that many existing model-agnostic methods fail to meet this criterion, while even those that do often possess other limitations. To overcome these challenges, we propose Distributional Feature Attribution eXplanations (DFAX), a novel, model-agnostic method for feature attribution. DFAX is the first feature attribution method to explain classifier predictions directly based on the data distribution. We show through extensive experiments that DFAX is more effective and efficient than state-of-the-art baselines.

[373] A Tensor Residual Circuit Neural Network Factorized with Matrix Product Operation

Andi Chen

Main category: cs.LG

TL;DR: Proposes Tensor Circuit Neural Network (TCNN) combining tensor networks and quantum circuit models for improved generalization, robustness, and low complexity with novel activation operations and information fusion.

DetailsMotivation: Address the challenge of reducing neural network complexity while maintaining generalization and robustness, overcoming limitations of existing quantum-inspired and tensor network approaches.

Method: Combines tensor neural networks with residual circuit models using complex number field operations, parallel circuit architecture, and information fusion layer to merge real and imaginary parameter features.

Result: TCNN achieves 2%-3% higher average accuracy on various datasets than state-of-the-art models, maintains learning capability under noise attacks, and prevents gradient explosion while being comparable in parameter count and runtime.

Conclusion: TCNN successfully integrates tensor networks and circuit models to achieve superior generalization, robustness, and efficiency, with ablation studies confirming the value of its novel components.

Abstract: It is challenging to reduce the complexity of neural networks while maintaining their generalization ability and robustness, especially for practical applications. Conventional solutions for this problem incorporate quantum-inspired neural networks with Kronecker products and hybrid tensor neural networks with MPO factorization and fully-connected layers. Nonetheless, the generalization power and robustness of the fully-connected layers are not as outstanding as circuit models in quantum computing. In this paper, we propose a novel tensor circuit neural network (TCNN) that takes advantage of the characteristics of tensor neural networks and residual circuit models to achieve generalization ability and robustness with low complexity. The proposed activation operation and parallelism of the circuit in the complex number field improve its non-linearity and efficiency for feature learning. Moreover, since the feature information exists in the parameters of both the real and imaginary parts of TCNN, an information fusion layer is proposed for merging features stored in those parameters to enhance the generalization capability. Experimental results confirm that TCNN showcases more outstanding generalization and robustness, with average accuracies on various datasets 2%-3% higher than those of the state-of-the-art compared models. More significantly, while the compared models fail to learn features under noisy parameter attacks, TCNN still showcases prominent learning capability owing to its ability to prevent gradient explosion. Furthermore, it is comparable to the compared models in the number of trainable parameters and the CPU running time. An ablation study also indicates the advantage of the activation operation, the parallelism architecture and the information fusion layer.

[374] Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Jiajie Su, Zihan Nan, Yunshan Ma, Xiaobo Xia, Xiaohua Feng, Weiming Liu, Xiaolin Zheng, Chaochao Chen

Main category: cs.LG

TL;DR: Proposes CREAT, a constrained reinforcement learning attack for sequential recommenders that achieves targeted profile pollution with minimal detectability by balancing adversarial efficacy and stealthiness.

DetailsMotivation: Existing profile pollution attacks on sequential recommenders have limitations: over-reliance on sequence horizon impact restricts fine-grained perturbations, and holistic modifications cause detectable distribution shifts.

Method: Uses constrained reinforcement learning with bi-level optimization and multi-reward system. Features Pattern Balanced Rewarding Policy (pattern inversion + distribution consistency rewards) and Constrained Group Relative Reinforcement Learning (dynamic barrier constraints + group-shared experience replay).

Result: Extensive experiments demonstrate the effectiveness of CREAT in achieving targeted pollution with minimal detectability.

Conclusion: CREAT successfully addresses limitations of previous PPA methods by enabling fine-grained perturbations while maintaining stealthiness through constrained reinforcement learning.

Abstract: Sequential Recommenders, which exploit dynamic user intents through interaction sequences, are vulnerable to adversarial attacks. While existing attacks primarily rely on data poisoning, they require large-scale user access or fake profiles, thus lacking practicality. In this paper, we focus on the Profile Pollution Attack (PPA), which subtly contaminates partial user interactions to induce targeted mispredictions. Previous PPA methods suffer from two limitations, i.e., i) over-reliance on sequence horizon impact restricts fine-grained perturbations on item transitions, and ii) holistic modifications cause detectable distribution shifts. To address these challenges, we propose a constrained reinforcement driven attack CREAT that synergizes a bi-level optimization framework with multi-reward reinforcement learning to balance adversarial efficacy and stealthiness. We first develop a Pattern Balanced Rewarding Policy, which integrates pattern inversion rewards to invert critical patterns and distribution consistency rewards to minimize detectable shifts via unbalanced co-optimal transport. Then we employ a Constrained Group Relative Reinforcement Learning paradigm, enabling step-wise perturbations through dynamic barrier constraints and group-shared experience replay, achieving targeted pollution with minimal detectability. Extensive experiments demonstrate the effectiveness of CREAT.

[375] Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference

Tong Wu, Yutong He, Bin Wang, Kun Yuan

Main category: cs.LG

TL;DR: Mixture-of-Channels (MoC) is a novel FFN architecture that reduces activation memory in LLMs by selectively activating only the Top-K most relevant channels per token using SwiGLU’s gating mechanism, achieving significant memory savings and throughput gains while maintaining performance.

DetailsMotivation: Large language models face substantial memory overhead from activation memory, particularly from feed-forward networks (FFNs), which has become the critical bottleneck especially when FlashAttention is implemented. FFN activations were identified as the predominant source of activation memory overhead.

Method: Introduced Mixture-of-Channels (MoC) FFN architecture that selectively activates only the Top-K most relevant channels per token determined by SwiGLU’s native gating mechanism. This reduces activation memory during pre-training and improves inference efficiency through partial weight loading into GPU SRAM.

Result: Extensive experiments showed that MoC delivers significant memory savings and throughput gains while maintaining competitive model performance compared to standard approaches.

Conclusion: MoC effectively addresses the activation memory bottleneck in LLMs by leveraging selective channel activation, providing an efficient solution that maintains model quality while substantially reducing memory requirements and improving computational efficiency.

Abstract: Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks, driven by scaling laws that correlate model size and training data with performance improvements. However, this scaling paradigm incurs substantial memory overhead, creating significant challenges for both training and inference. While existing research has primarily addressed parameter and optimizer state memory reduction, activation memory, particularly from feed-forward networks (FFNs), has become the critical bottleneck, especially when FlashAttention is implemented. In this work, we conduct a detailed memory profiling of LLMs and identify FFN activations as the predominant source of activation memory overhead. Motivated by this, we introduce Mixture-of-Channels (MoC), a novel FFN architecture that selectively activates only the Top-K most relevant channels per token, determined by SwiGLU’s native gating mechanism. MoC substantially reduces activation memory during pre-training and improves inference efficiency by reducing memory access through partial weight loading into GPU SRAM. Extensive experiments validate that MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
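The Top-K gating idea can be sketched directly. `x_gate` and `x_val` below stand in for the gate and up projections of a SwiGLU FFN; selecting K channels by gate magnitude is the MoC mechanism as summarized above, but the code is a toy, not the paper's implementation.

```python
import math

def silu(z):
    # SiLU, the activation inside SwiGLU's gate
    return z / (1.0 + math.exp(-z))

def moc_ffn(x_gate, x_val, k):
    # score every channel with SwiGLU's native gate
    gates = [silu(g) for g in x_gate]
    # keep only the Top-K channels by gate magnitude
    topk = sorted(range(len(gates)), key=lambda i: abs(gates[i]),
                  reverse=True)[:k]
    # inactive channels contribute exactly zero, so their activations
    # never need to be materialized or stored for the backward pass
    out = [0.0] * len(gates)
    for i in topk:
        out[i] = gates[i] * x_val[i]
    return out
```

The memory saving follows from the zeros: only K of the channel activations (and the corresponding weight columns, at inference) ever need to be loaded.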

[376] Spatio-Temporal Graph Unlearning

Qiming Guo, Wenbo Sun, Wenlu Wang

Main category: cs.LG

TL;DR: CallosumNet is a spatio-temporal graph unlearning framework that achieves complete data removal with minimal performance loss using biologically-inspired subgraph construction and global dependency reconstruction.

DetailsMotivation: Stringent privacy regulations require complete unlearning of unauthorized data from spatio-temporal graphs, but existing methods designed for static graphs are inefficient for spatio-temporal data where nodes diffuse information globally across spatial and temporal dimensions.

Method: Proposes CallosumNet with two novel techniques: Enhanced Subgraph Construction (ESC) that adaptively constructs localized subgraphs using virtual ganglions, and Global Ganglion Bridging (GGB) that reconstructs global spatio-temporal dependencies from these subgraphs.

Result: Empirical results on four real-world datasets show CallosumNet achieves complete unlearning with only 1%-2% relative MAE loss compared to the gold model, significantly outperforming state-of-the-art baselines.

Conclusion: CallosumNet provides an effective solution for spatio-temporal graph unlearning that addresses privacy compliance requirements while maintaining model performance, with ablation studies confirming the effectiveness of both proposed techniques.

Abstract: Spatio-temporal graphs are widely used in modeling complex dynamic processes such as traffic forecasting, molecular dynamics, and healthcare monitoring. Recently, stringent privacy regulations such as GDPR and CCPA have introduced significant new challenges for existing spatio-temporal graph models, requiring complete unlearning of unauthorized data. Since each node in a spatio-temporal graph diffuses information globally across both spatial and temporal dimensions, existing unlearning methods primarily designed for static graphs and localized data removal cannot efficiently erase a single node without incurring costs nearly equivalent to full model retraining. Therefore, an effective approach for complete spatio-temporal graph unlearning is a pressing need. To address this, we propose CallosumNet, a divide-and-conquer spatio-temporal graph unlearning framework inspired by the corpus callosum structure that facilitates communication between the brain’s two hemispheres. CallosumNet incorporates two novel techniques: (1) Enhanced Subgraph Construction (ESC), which adaptively constructs multiple localized subgraphs based on several factors, including biologically-inspired virtual ganglions; and (2) Global Ganglion Bridging (GGB), which reconstructs global spatio-temporal dependencies from these localized subgraphs, effectively restoring the full graph representation. Empirical results on four diverse real-world datasets show that CallosumNet achieves complete unlearning with only 1%-2% relative MAE loss compared to the gold model, significantly outperforming state-of-the-art baselines. Ablation studies verify the effectiveness of both proposed techniques.

[377] MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

Mohsen Amiri, Konstantin Avrachenkov, Ibtihal El Mimouni, Sindri Magnússon

Main category: cs.LG

TL;DR: MARBLE extends RMABs with latent Markov states to handle nonstationary environments, introducing the MAI criterion and proving that QWI converges to optimal policies despite unobserved regime switches.

DetailsMotivation: Classical RMABs assume fixed dynamics, which is often violated in nonstationary environments. This work addresses the challenge of decision-making when arm dynamics change over time due to latent environmental states.

Method: Introduces MARBLE framework with latent Markov states, proposes Markov-Averaged Indexability (MAI) criterion, and develops synchronous Q-learning with Whittle Indices (QWI) algorithm.

Result: Theoretical proof that QWI converges almost surely to optimal Q-function and Whittle indices under MAI criterion. Empirical validation on recommender system digital twin shows QWI adapts to shifting latent states and converges to optimal policy.

Conclusion: MARBLE successfully handles nonstationary RMABs with latent Markov states, with both theoretical guarantees and empirical validation showing effective adaptation to changing environments.

Abstract: Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
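The flavor of the QWI scheme can be conveyed with a two-timescale toy: a fast loop learns Q-values for one arm under a passivity subsidy `lam`, and a slow loop moves `lam` toward the point where being active and being passive are equally attractive; that indifference subsidy is the Whittle index. This is a heavily simplified single-state, zero-discount sketch, not the paper's synchronous QWI algorithm.

```python
def whittle_index_toy(n_iter=5000, alpha=0.1, beta=0.01, r_active=1.0):
    q = [0.0, 0.0]   # Q[passive], Q[active] for the single state
    lam = 0.0        # passivity subsidy, learned on the slow timescale
    for _ in range(n_iter):
        q[1] += alpha * (r_active - q[1])  # fast: active earns the reward
        q[0] += alpha * (lam - q[0])       # fast: passive earns the subsidy
        lam += beta * (q[1] - q[0])        # slow: ascend toward indifference
    return lam
```

Here the index converges to the active reward, as it should: a subsidy of exactly `r_active` makes staying passive break even. MARBLE's contribution is showing such convergence still holds when the arm dynamics switch with an unobserved latent state.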

[378] LLM-Guided Dynamic-UMAP for Personalized Federated Graph Learning

Sai Puppala, Ismail Hossain, Md Jahangir Alam, Tanzim Ahad, Sajedul Talukder

Main category: cs.LG

TL;DR: A method using LLMs to enhance graph ML with personalization and privacy, combining data augmentation, prompt tuning, and in-context learning for federated learning on sparse graphs.

DetailsMotivation: To address challenges in graph machine learning under personalization and privacy constraints, particularly for sparse graphs in low-resource settings.

Method: Combines data augmentation for sparse graphs, prompt/instruction tuning to adapt foundation models, in-context learning for few-shot reasoning, and Bayesian variational objective for personalized federated learning with Dynamic UMAP manifold.

Result: Supports node classification and link prediction in low-resource settings, aligns language model representations with graph structure, and includes privacy protection with differential privacy threat model.

Conclusion: The method provides a comprehensive framework for LLM-assisted graph machine learning with applications to knowledge graphs, recommendation systems, and citation/product graphs, along with evaluation considerations for benchmarking.

Abstract: We propose a method that uses large language models to assist graph machine learning under personalization and privacy constraints. The approach combines data augmentation for sparse graphs, prompt and instruction tuning to adapt foundation models to graph tasks, and in-context learning to supply few-shot graph reasoning signals. These signals parameterize a Dynamic UMAP manifold of client-specific graph embeddings inside a Bayesian variational objective for personalized federated learning. The method supports node classification and link prediction in low-resource settings and aligns language model latent representations with graph structure via a cross-modal regularizer. We outline a convergence argument for the variational aggregation procedure, describe a differential privacy threat model based on a moments accountant, and present applications to knowledge graph completion, recommendation-style link prediction, and citation and product graphs. We also discuss evaluation considerations for benchmarking LLM-assisted graph machine learning.

[379] GAMMA_FLOW: Guided Analysis of Multi-label spectra by MAtrix Factorization for Lightweight Operational Workflows

Viola Rädle, Tilman Hartwig, Benjamin Oesen, Emily Alice Kröger, Julius Vogt, Eike Gericke, Martin Baron

Main category: cs.LG

TL;DR: GAMMA_FLOW is an open-source Python package for real-time spectral analysis using supervised NMF for efficient classification, denoising, decomposition, and outlier detection.

DetailsMotivation: To provide a fast, efficient, and adaptable alternative to computationally intensive models and proprietary software for spectral data analysis.

Method: Uses supervised non-negative matrix factorization (NMF) for dimensionality reduction to enable real-time analysis of single- and multi-component spectra.

Result: Achieves classification accuracies above 90% and enables reliable automated spectral interpretation with reduced computational costs.

Conclusion: GAMMA_FLOW is a flexible open-source solution applicable to various one-dimensional spectral data types, supporting diverse research and industry applications.

Abstract: GAMMA_FLOW is an open-source Python package for real-time analysis of spectral data. It supports classification, denoising, decomposition, and outlier detection of both single- and multi-component spectra. Instead of relying on large, computationally intensive models, it employs a supervised approach to non-negative matrix factorization (NMF) for dimensionality reduction. This ensures a fast, efficient, and adaptable analysis while reducing computational costs. GAMMA_FLOW achieves classification accuracies above 90% and enables reliable automated spectral interpretation. Originally developed for gamma-ray spectra, it is applicable to any type of one-dimensional spectral data. As an open and flexible alternative to proprietary software, it supports various applications in research and industry.
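The NMF step at the core of the package can be illustrated with plain multiplicative updates on toy spectra. This is a generic, unsupervised sketch; GAMMA_FLOW's actual supervised variant and API differ (supervision would additionally tie the basis to known component labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectra": nonnegative mixtures of two component templates plus noise.
templates = np.array([[1.0, 0.4, 0.1, 0.0, 0.0],
                      [0.0, 0.0, 0.1, 0.5, 1.0]])
weights = rng.random((100, 2))
X = weights @ templates + 0.01 * rng.random((100, 5))

# Plain multiplicative-update NMF: X ≈ W @ H with W, H >= 0.
k = 2
W = rng.random((X.shape[0], k)) + 0.1
H = rng.random((k, X.shape[1])) + 0.1
for _ in range(300):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

# The low-dimensional rows of W would then feed a lightweight classifier.
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(round(rel_err, 3))
```

The reconstruction error stays small because the toy data is nearly rank-2; the dimensionality reduction (5 channels to k=2 activations) is what keeps downstream analysis cheap.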

[380] How does the Performance of the Data-driven Traffic Flow Forecasting Models deteriorate with Increasing Forecasting Horizon? An Extensive Approach Considering Statistical, Machine Learning and Deep Learning Models

Amanta Sherfenaz, Nazmul Haque, Protiva Sadhukhan Prova, Md Asif Raihan, Md. Hadiuzzaman

Main category: cs.LG

TL;DR: This study evaluates statistical, ML, and DL models for traffic forecasting using real-world data from California’s Harbor Freeway. ANFIS-GP performs best for short-term predictions while Bi-LSTM excels in medium-term forecasting due to its ability to capture long-range temporal dependencies.

Motivation: With rapid urbanization increasing traffic congestion, Intelligent Transportation Systems (ITS) have become essential for managing traffic within existing infrastructure. Traffic forecasting enables proactive measures like ramp metering and dynamic routing.

Method: Evaluated statistical, machine learning, and deep learning models using real-world traffic data from Caltrans PeMS. Models were tested over 20 forecasting windows (up to 1 hour 40 minutes) using RMSE, MAE, and R-Square metrics. Performance degradation was quantified using logarithmic transformation.

Result: ANFIS-GP performed best at early windows (RMSE: 0.038, MAE: 0.0276, R-Square: 0.9983), while Bi-LSTM was more robust for medium-term prediction (RMSE: 0.1863, MAE: 0.0833, R-Square: 0.987). Bi-LSTM showed the flattest performance degradation slope (0.0454 RMSE, 0.0545 MAE).

Conclusion: Hybrid models are identified as a promising future direction for traffic forecasting, combining the strengths of different approaches for improved performance across various time horizons.

Abstract: With rapid urbanization in recent decades, traffic congestion has intensified due to increased movement of people and goods. As planning shifts from demand-based to supply-oriented strategies, Intelligent Transportation Systems (ITS) have become essential for managing traffic within existing infrastructure. A core ITS function is traffic forecasting, enabling proactive measures like ramp metering, signal control, and dynamic routing through platforms such as Google Maps. This study assesses the performance of statistical, machine learning (ML), and deep learning (DL) models in forecasting traffic speed and flow using real-world data from California’s Harbor Freeway, sourced from the Caltrans Performance Measurement System (PeMS). Each model was evaluated over 20 forecasting windows (up to 1 hour 40 minutes) using RMSE, MAE, and R-Square metrics. Results show ANFIS-GP performs best at early windows with RMSE of 0.038, MAE of 0.0276, and R-Square of 0.9983, while Bi-LSTM is more robust for medium-term prediction due to its capacity to model long-range temporal dependencies, achieving RMSE of 0.1863, MAE of 0.0833, and R-Square of 0.987 at a forecasting horizon of 20. The degradation in model performance was quantified using logarithmic transformation, with slope values used to measure robustness. Among DL models, Bi-LSTM had the flattest slope (0.0454 RMSE, 0.0545 MAE for flow), whereas ANFIS-GP had 0.1058 for flow RMSE and 0.1037 for flow MAE. The study concludes by identifying hybrid models as a promising future direction.
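The slope-based robustness measure can be reproduced on synthetic numbers. The error values and the exact form of the fit below are hypothetical (the paper does not spell out its transformation here); the point is that a flatter fitted slope on log-errors means slower degradation with horizon:

```python
import math

# Hypothetical RMSE sequence over 20 forecasting windows that grows
# geometrically, mimicking degradation with increasing horizon.
windows = range(1, 21)
rmse = [0.04 * 1.12 ** (w - 1) for w in windows]

# Log-transform the errors, then fit a straight line by least squares;
# the slope quantifies how fast performance deteriorates.
xs = list(windows)
ys = [math.log(e) for e in rmse]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 4))  # recovers log(1.12) ≈ 0.1133
```

Comparing such slopes across models (e.g. Bi-LSTM's 0.0454 vs ANFIS-GP's 0.1058 for flow RMSE) is what lets the study call one model more robust than another.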

[381] From Decision Trees to Boolean Logic: A Fast and Unified SHAP Algorithm

Alexander Nadel, Ron Wettenstein

Main category: cs.LG

TL;DR: WOODELF is a new SHAP algorithm that computes Background SHAP values in linear time using a unified framework combining decision trees, game theory, and Boolean logic, achieving significant speedups over existing methods.

Motivation: SHAP is widely used for interpreting decision tree ensembles but existing methods have computational limitations, especially for large datasets with background samples.

Method: WOODELF constructs pseudo-Boolean formulas that capture feature values, decision tree structure, and background datasets, then computes Background SHAP in linear time. It supports multiple game-theoretic values and runs efficiently on both CPU and GPU.

Result: On a dataset with 3M rows, 5M background samples, and 127 features, WOODELF computed Background SHAP values in 162 seconds (CPU) and 16 seconds (GPU), achieving 16x and 165x speedups respectively over the best existing method.

Conclusion: WOODELF provides a highly efficient and scalable solution for computing SHAP and other game-theoretic values, enabling large-scale interpretability analysis with significant performance improvements.

Abstract: SHapley Additive exPlanations (SHAP) is a key tool for interpreting decision tree ensembles by assigning contribution values to features. It is widely used in finance, advertising, medicine, and other domains. Two main approaches to SHAP calculation exist: Path-Dependent SHAP, which leverages the tree structure for efficiency, and Background SHAP, which uses a background dataset to estimate feature distributions. We introduce WOODELF, a SHAP algorithm that integrates decision trees, game theory, and Boolean logic into a unified framework. For each consumer, WOODELF constructs a pseudo-Boolean formula that captures their feature values, the structure of the decision tree ensemble, and the entire background dataset. It then leverages this representation to compute Background SHAP in linear time. WOODELF can also compute Path-Dependent SHAP, Shapley interaction values, Banzhaf values, and Banzhaf interaction values. WOODELF is designed to run efficiently on CPU and GPU hardware alike. Available via the WOODELF Python package, it is implemented using NumPy, SciPy, and CuPy without relying on custom C++ or CUDA code. This design enables fast performance and seamless integration into existing frameworks, supporting large-scale computation of SHAP and other game-theoretic values in practice. For example, on a dataset with 3,000,000 rows, 5,000,000 background samples, and 127 features, WOODELF computed all Background Shapley values in 162 seconds on CPU and 16 seconds on GPU, compared to 44 minutes required by the best method on any hardware platform, representing 16x and 165x speedups, respectively.
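The game-theoretic quantity being computed can be made concrete with a brute-force Shapley calculation on a tiny cooperative game. This is illustrative only; WOODELF's contribution is precisely avoiding this exponential enumeration by exploiting pseudo-Boolean structure:

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions (tiny games only)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for S in combinations(others, r):
                # Weight of this coalition in the Shapley average.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi

# Hypothetical 3-player game: the coalition is worth 1 only if 0 and 1 cooperate.
v = lambda S: 1.0 if {0, 1} <= S else 0.0
phi = shapley([0, 1, 2], v)
print(phi)  # players 0 and 1 split the credit equally; player 2 gets none
```

In the SHAP setting, "players" are features, the "value" of a coalition is the model output with the remaining features marginalized over the background dataset, and the efficiency property (values sum to the total payoff) is what makes the attributions additive.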

[382] Diffusion-based Sinogram Interpolation for Limited Angle PET

Rüveyda Yilmaz, Julian Thull, Johannes Stegmaier, Volkmar Schulz

Main category: cs.LG

TL;DR: Using conditional diffusion models to interpolate missing PET sinogram data for cost-efficient and patient-friendly PET geometries.

Motivation: PET imaging needs methods for unconstrained detector layouts with gaps and open sides that create severely undersampled sinograms, instead of constraining hardware to complete cylinders.

Method: Treat missing lines-of-response as a learnable prior, using conditional diffusion models to interpolate sparsely sampled sinograms.

Result: Proposed approach enables recovery of missing information in PET sinograms.

Conclusion: Method paves way for novel, cost-efficient, and patient-friendly PET geometries in real clinical settings.

Abstract: Accurate PET imaging increasingly requires methods that support unconstrained detector layouts, from walk-through designs to long-axial rings, where gaps and open sides lead to severely undersampled sinograms. Instead of constraining the hardware to form complete cylinders, we propose treating the missing lines-of-response as a learnable prior. Data-driven approaches, particularly generative models, offer a promising pathway to recover this missing information. In this work, we explore the use of conditional diffusion models to interpolate sparsely sampled sinograms, paving the way for novel, cost-efficient, and patient-friendly PET geometries in real clinical settings.

[383] Abstract Gradient Training: A Unified Certification Framework for Data Poisoning, Unlearning, and Differential Privacy

Philip Sosnin, Matthew Wicker, Josh Collyer, Calvin Tsay

Main category: cs.LG

TL;DR: AGT provides a unified framework for certifying model robustness against training data perturbations like data poisoning, unlearning, and differential privacy by establishing provable parameter-space bounds.

Motivation: Training data perturbations (adversarial poisoning, machine unlearning, differential privacy) are under-explored compared to inference-time perturbations, requiring new certification methods.

Method: Abstract Gradient Training (AGT) bounds the reachable set of parameters during first-order optimization to analyze model behavior under training data perturbations.

Result: AGT enables formal certification of model robustness against bounded perturbations, data removal, and new sample additions during training.

Conclusion: AGT offers a principled approach to training-time robustness certification, addressing gaps in current machine learning security and privacy guarantees.

Abstract: The impact of inference-time data perturbation (e.g., adversarial attacks) has been extensively studied in machine learning, leading to well-established certification techniques for adversarial robustness. In contrast, certifying models against training data perturbations remains a relatively under-explored area. These perturbations can arise in three critical contexts: adversarial data poisoning, where an adversary manipulates training samples to corrupt model performance; machine unlearning, which requires certifying model behavior under the removal of specific training data; and differential privacy, where guarantees must be given with respect to substituting individual data points. This work introduces Abstract Gradient Training (AGT), a unified framework for certifying robustness of a given model and training procedure to training data perturbations, including bounded perturbations, the removal of data points, and the addition of new samples. By bounding the reachable set of parameters, i.e., establishing provable parameter-space bounds, AGT provides a formal approach to analyzing the behavior of models trained via first-order optimization methods.
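The idea of a "reachable set of parameters" can be sketched by propagating an interval through a single gradient step. This is a toy 1-D least-squares example of my own construction; AGT itself bounds entire first-order training runs, not one step:

```python
# Bound the parameter after one SGD step when the training label y may be
# perturbed within ±eps (hypothetical 1-D least-squares setup).
def grad_bounds(theta, x, y_lo, y_hi):
    # Gradient of 0.5*(theta*x - y)^2 is x*(theta*x - y); for x > 0 it is
    # monotone decreasing in y, so extremes occur at the interval endpoints.
    g_at_hi, g_at_lo = x * (theta * x - y_hi), x * (theta * x - y_lo)
    return min(g_at_hi, g_at_lo), max(g_at_hi, g_at_lo)

theta, lr = 0.0, 0.1
x, y, eps = 2.0, 1.0, 0.5
g_lo, g_hi = grad_bounds(theta, x, y - eps, y + eps)
theta_lo, theta_hi = theta - lr * g_hi, theta - lr * g_lo
print(theta_lo, theta_hi)  # every ±eps label poisoning lands the parameter here
```

Because any bounded perturbation (or a removed/added sample) keeps the true parameter inside such an interval, properties that hold for the whole interval are certified for every admissible perturbation at once.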

[384] Probing then Editing: A Push-Pull Framework for Retain-Free Machine Unlearning in Industrial IoT

Jiao Chen, Weihua Li, Jianhua Tang

Main category: cs.LG

TL;DR: PTE is a retain-free unlearning framework that uses probe-edit process with push-pull optimization to selectively forget outdated knowledge in IIoT environments without needing retain data.

Motivation: Existing unlearning methods rely on retain data which increases computational/energy burdens and conflicts with industrial data silos and privacy requirements in dynamic IIoT environments.

Method: A probe-edit process: probe the decision boundary of the to-be-forgotten class via gradient ascent, generate editing instructions from the model's own predictions, then run a push-pull optimization in which the push branch dismantles the target class region while the pull branch uses masked knowledge distillation to anchor retained classes.

Result: PTE achieves an excellent balance between unlearning effectiveness and model utility across multiple benchmarks including CWRU and SCUT-FD, using only the to-be-forgotten data and the original model.

Conclusion: PTE provides efficient and balanced knowledge editing without retain data, addressing computational burdens and privacy compliance issues in IIoT environments.

Abstract: In dynamic Industrial Internet of Things (IIoT) environments, models need the ability to selectively forget outdated or erroneous knowledge. However, existing methods typically rely on retain data to constrain model behavior, which increases computational and energy burdens and conflicts with industrial data silos and privacy compliance requirements. To address this, we propose a novel retain-free unlearning framework, referred to as Probing then Editing (PTE). PTE frames unlearning as a probe-edit process: first, it probes the decision boundary neighborhood of the model on the to-be-forgotten class via gradient ascent and generates corresponding editing instructions using the model’s own predictions. Subsequently, a push-pull collaborative optimization is performed: the push branch actively dismantles the decision region of the target class using the editing instructions, while the pull branch applies masked knowledge distillation to anchor the model’s knowledge on retained classes to their original states. Benefiting from this mechanism, PTE achieves efficient and balanced knowledge editing using only the to-be-forgotten data and the original model. Experimental results demonstrate that PTE achieves an excellent balance between unlearning effectiveness and model utility across multiple general and industrial benchmarks such as CWRU and SCUT-FD.
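The pull branch's masked distillation can be sketched as a divergence computed only over retained classes. The loss form below is a hypothetical reading of "masked knowledge distillation"; the paper's exact masking and weighting may differ:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Sketch of the pull-branch loss: distill the original (teacher) distribution
# onto the edited (student) model over retained classes only, with the
# forget class masked out and the rest renormalized.
def masked_kd(teacher_logits, student_logits, forget):
    keep = [i for i in range(len(teacher_logits)) if i != forget]
    t = softmax([teacher_logits[i] for i in keep])
    s = softmax([student_logits[i] for i in keep])
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))  # KL(t || s)

teacher = [2.0, 0.5, 1.0]
loss_same = masked_kd(teacher, teacher, forget=0)        # identical retained dists
loss_diff = masked_kd(teacher, [2.0, 1.5, 0.0], forget=0)
print(loss_same, loss_diff)
```

Because the mask excludes the forget class, this anchor requires no retain data: the teacher is just the original model, which is exactly what makes the framework retain-free.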

[385] Transformer Semantic Genetic Programming for d-dimensional Symbolic Regression Problems

Philipp Anthes, Dominik Sobania, Franz Rothlauf

Main category: cs.LG

TL;DR: TSGP uses a pre-trained transformer as a semantic variation operator to generate offspring programs with controlled semantic similarity, outperforming other GP methods on symbolic regression benchmarks while producing compact solutions.

Motivation: To overcome limitations of traditional semantic GP approaches that rely on fixed syntactic transformations, by learning diverse structural variations that maintain semantic similarity.

Method: Uses a pre-trained transformer model trained on millions of programs as a variation operator, with target semantic distance (SD_t) parameter to control semantic similarity between parent and offspring.

Result: Achieved average rank of 1.58 across 24 real-world and synthetic datasets, significantly outperforming standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP while producing more compact solutions than SLIM_GSGP.

Conclusion: TSGP effectively generalizes across symbolic regression problems, and the SD_t parameter provides a mechanism for balancing exploration (large SD_t) and exploitation (small SD_t) in semantic space.

Abstract: Transformer Semantic Genetic Programming (TSGP) is a semantic search approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with controlled semantic similarity to a given parent. Unlike other semantic GP approaches that rely on fixed syntactic transformations, TSGP aims to learn diverse structural variations that lead to solutions with similar semantics. We find that a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension. Evaluated on 24 real-world and synthetic datasets, TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP, achieving an average rank of 1.58 across all benchmarks. Moreover, TSGP produces more compact solutions than SLIM_GSGP, despite its higher accuracy. In addition, the target semantic distance $\mathrm{SD}_t$ is able to control the step size in the semantic space: small values of $\mathrm{SD}_t$ enable consistent improvement in fitness but often lead to larger programs, while larger values promote faster convergence and compactness. Thus, $\mathrm{SD}_t$ provides an effective mechanism for balancing exploration and exploitation.
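A common way to make "semantic distance" concrete in semantic GP is to compare program outputs on a fixed input sample. This is a generic sketch, not necessarily TSGP's precise definition of $\mathrm{SD}_t$:

```python
import math

# Semantics of a program = its output vector on a fixed input sample; the
# semantic distance between two programs is the RMSE of those outputs.
inputs = [x / 10 for x in range(-20, 21)]

parent = lambda x: x * x + x
offspring = lambda x: x * x + 1.1 * x  # structurally changed, semantically close

def semantic_distance(p, q, xs):
    return math.sqrt(sum((p(x) - q(x)) ** 2 for x in xs) / len(xs))

sd = semantic_distance(parent, offspring, inputs)
print(round(sd, 4))
```

Under this reading, a small target $\mathrm{SD}_t$ asks the transformer for offspring whose output vector barely moves (exploitation), while a large one permits big semantic jumps (exploration), even though the syntactic change may be large or small in either case.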

[386] Several Supporting Evidences for the Adaptive Feature Program

Yicheng Li, Qian Lin

Main category: cs.LG

TL;DR: The paper introduces the feature error measure (FEM) to analyze neural network feature learning and shows FEM decreases during training in various models, supporting the adaptive feature program framework.

Motivation: To theoretically explore the advantages of neural networks and simplify the analysis of training dynamics using the adaptive feature program framework.

Method: Introduce feature error measure (FEM) and apply over-parametrized sequence models to analyze training dynamics in linear regression, single/multiple index models, and other adaptive feature models.

Result: The FEM is shown to decrease during the training process across various adaptive feature models, indicating improved feature learning quality.

Conclusion: The decreasing FEM provides evidence supporting the potential success of the adaptive feature program for understanding neural network feature learning.

Abstract: Theoretically exploring the advantages of neural networks might be one of the most challenging problems in the AI era. An adaptive feature program has recently been proposed to analyze the feature learning characteristic property of neural networks in a more abstract way. Motivated by the celebrated Le Cam equivalence, we advocate the over-parametrized sequence models to further simplify the analysis of the training dynamics of adaptive feature program and present several supporting evidences for the adaptive feature program. More precisely, after having introduced the feature error measure (FEM) to characterize the quality of the learned feature, we show that the FEM is decreasing during the training process of several concrete adaptive feature models including linear regression, single/multiple index models, etc. We believe that this hints at the potential successes of the adaptive feature program.

[387] Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders

Ege Erdogan, Ana Lucic

Main category: cs.LG

TL;DR: Adaptively equivariant sparse autoencoders that incorporate group symmetries outperform regular SAEs in downstream tasks by discovering more useful features.

Motivation: Sparse autoencoders are effective for interpretability in language models but face challenges when applied to scientific data with group symmetries. Incorporating these symmetries can improve feature utility.

Method: Developed adaptively equivariant SAEs that can adapt to the base model’s level of equivariance. Trained autoencoders on synthetic images and discovered a single matrix explaining activation transformations during image rotation.

Result: Adaptive SAEs discovered features that led to superior probing performance compared to regular SAEs, demonstrating the value of incorporating symmetries.

Conclusion: Incorporating group symmetries into sparse autoencoders yields more useful features for downstream tasks, enhancing the effectiveness of mechanistic interpretability tools.

Abstract: Sparse autoencoders (SAEs) have proven useful in disentangling the opaque activations of neural networks, primarily large language models, into sets of interpretable features. However, adapting them to domains beyond language, such as scientific data with group symmetries, introduces challenges that can hinder their effectiveness. We show that incorporating such group symmetries into the SAEs yields features more useful in downstream tasks. More specifically, we train autoencoders on synthetic images and find that a single matrix can explain how their activations transform as the images are rotated. Building on this, we develop adaptively equivariant SAEs that can adapt to the base model’s level of equivariance. These adaptive SAEs discover features that lead to superior probing performance compared to regular SAEs, demonstrating the value of incorporating symmetries in mechanistic interpretability tools.
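The claim that a single matrix explains how activations transform under rotation can be tested in miniature by solving for that matrix with least squares. The 2-D "activations" below are a toy stand-in for real SAE features:

```python
import numpy as np

rng = np.random.default_rng(1)

# A: activations of original inputs; B: activations of the rotated inputs.
# If the representation is equivariant, one matrix M satisfies M @ a ≈ b.
A = rng.standard_normal((2, 50))
R = np.array([[0.0, -1.0], [1.0, 0.0]])  # ground-truth 90° rotation
B = R @ A

M, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)  # solves A.T @ M ≈ B.T
M = M.T
print(np.allclose(M, R))
```

In practice the fit would be done on paired activations from the base model, and how well a single `M` explains the transformation is itself a measure of the model's degree of equivariance, which is what the adaptive SAEs exploit.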

[388] Enhancing Explainability in Solar Energetic Particle Event Prediction: A Global Feature Mapping Approach

Anli Ji, Pranjal Patil, Chetraj Pandey, Manolis K. Georgoulis, Berkay Aydin

Main category: cs.LG

TL;DR: A novel framework integrating global explanations and ad-hoc feature mapping to enhance transparency in solar energetic particle (SEP) event prediction, addressing the black-box nature of existing models.

Motivation: Most existing SEP prediction methods are black-box models, making it difficult for solar physicists to interpret results and understand the underlying physical causes of SEP events beyond just obtaining predictions.

Method: Proposed framework combines global explanations and ad-hoc feature mapping to enhance model transparency. Validated using a dataset of 341 SEP events (244 significant proton events) spanning solar cycles 22, 23, and 24.

Result: The approach improves explainability and facilitates physics-informed understanding of SEP event prediction, as demonstrated through explainability-focused case studies of major SEP events.

Conclusion: The proposed framework successfully addresses the interpretability challenge in SEP prediction, enabling better understanding of the decision-making process and underlying physical mechanisms driving solar energetic particle events.

Abstract: Solar energetic particle (SEP) events, as one of the most prominent manifestations of solar activity, can generate severe hazardous radiation when accelerated by solar flares or by shock waves associated with coronal mass ejections (CMEs). However, most existing data-driven methods used for SEP predictions are operated as black-box models, making it challenging for solar physicists to interpret the results and understand the underlying physical causes of such events rather than just obtain a prediction. To address this challenge, we propose a novel framework that integrates global explanations and ad-hoc feature mapping to enhance model transparency and provide deeper insights into the decision-making process. We validate our approach using a dataset of 341 SEP events, including 244 significant (>=10 MeV) proton events exceeding the Space Weather Prediction Center S1 threshold, spanning solar cycles 22, 23, and 24. Furthermore, we present an explainability-focused case study of major SEP events, demonstrating how our method improves explainability and facilitates a more physics-informed understanding of SEP event prediction.

[389] Latent Planning via Embedding Arithmetic: A Contrastive Approach to Strategic Reasoning

Andrew Hamara, Greg Hamerly, Pablo Rivas, Andrew C. Freeman

Main category: cs.LG

TL;DR: SOLIS learns an evaluation-aligned embedding space using supervised contrastive learning, where planning is reduced to vector operations by ranking actions based on alignment with a global advantage direction from losing to winning regions.

Motivation: To investigate whether planning can be carried out directly in learned representations rather than training policies or value heads, offering a lightweight alternative to traditional dynamics models or policy learning.

Method: Uses supervised contrastive learning to create an embedding space where outcome similarity is captured by proximity, with a single global advantage vector orienting the space from losing to winning regions. Candidate actions are ranked by their alignment with this direction.

Result: Demonstrated in chess where SOLIS uses only shallow search guided by the learned embedding to reach competitive strength under constrained conditions.

Conclusion: Evaluation-aligned latent planning offers a promising lightweight alternative to traditional planning approaches using dynamics models or policy learning.

Abstract: Planning in high-dimensional decision spaces is increasingly being studied through the lens of learned representations. Rather than training policies or value heads, we investigate whether planning can be carried out directly in an evaluation-aligned embedding space. We introduce SOLIS, which learns such a space using supervised contrastive learning. In this representation, outcome similarity is captured by proximity, and a single global advantage vector orients the space from losing to winning regions. Candidate actions are then ranked according to their alignment with this direction, reducing planning to vector operations in latent space. We demonstrate this approach in chess, where SOLIS uses only a shallow search guided by the learned embedding to reach competitive strength under constrained conditions. More broadly, our results suggest that evaluation-aligned latent planning offers a lightweight alternative to traditional dynamics models or policy learning.
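Reducing planning to vector operations, as described, amounts to projecting each candidate's embedding displacement onto the advantage direction. All names and vectors below are illustrative, not SOLIS's actual embeddings:

```python
# Rank candidate actions by how far their successor embedding moves the
# state along a global "advantage" direction (losing -> winning).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

advantage = [1.0, 0.0]   # hypothetical learned advantage vector
state = [0.0, 0.0]       # embedding of the current position

candidates = {
    "a": [0.9, 0.1],     # moves strongly along the advantage direction
    "b": [0.1, 0.8],     # mostly orthogonal to it
    "c": [-0.5, 0.0],    # moves against it
}

# Score = projection of (successor - state) onto the advantage vector.
scores = {k: dot([e - s for e, s in zip(emb, state)], advantage)
          for k, emb in candidates.items()}
best = max(scores, key=scores.get)
print(best, sorted(scores, key=scores.get, reverse=True))
```

A shallow search then expands only the top-scoring candidates, which is why no dynamics model or value head is needed beyond the embedding itself.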

[390] PDAC: Efficient Coreset Selection for Continual Learning via Probability Density Awareness

Junqi Gao, Zhichang Guo, Dazhi Zhang, Yao Li, Yi Ran, Biqing Qi

Main category: cs.LG

TL;DR: Proposes PDAC and SPDAC methods for efficient coreset selection in rehearsal-based continual learning by prioritizing high probability density samples to reduce computational costs while maintaining performance.

Motivation: Current coreset selection methods rely on computationally expensive bi-level optimization, limiting practical efficiency. The paper aims to develop a more efficient selection scheme by analyzing sample contributions to error suppression.

Method: Analyzes MSE between buffer-trained and Bayes-optimal models, identifies high probability density samples as key for error suppression, proposes PDAC using Projected Gaussian Mixture model for density estimation, and SPDAC with streaming EM for streaming scenarios.

Result: Extensive experiments show proposed methods outperform other baselines across various continual learning settings while maintaining favorable computational efficiency.

Conclusion: Probability density-aware coreset selection provides an efficient alternative to computationally intensive bi-level optimization methods, achieving better performance with reduced computational overhead in continual learning.

Abstract: Rehearsal-based Continual Learning (CL) maintains a limited memory buffer to store replay samples for knowledge retention, making these approaches heavily reliant on the quality of the stored samples. Current Rehearsal-based CL methods typically construct the memory buffer by selecting a representative subset (referred to as coresets), aiming to approximate the training efficacy of the full dataset with minimal storage overhead. However, mainstream Coreset Selection (CS) methods generally formulate the CS problem as a bi-level optimization problem that relies on numerous inner and outer iterations to solve, leading to substantial computational cost, thus limiting their practical efficiency. In this paper, we aim to provide a more efficient selection logic and scheme for coreset construction. To this end, we first analyze the Mean Squared Error (MSE) between the buffer-trained model and the Bayes-optimal model through the perspective of localized error decomposition to investigate the contribution of samples from different regions to MSE suppression. Further theoretical and experimental analyses demonstrate that samples with high probability density play a dominant role in error suppression. Inspired by this, we propose the Probability Density-Aware Coreset (PDAC) method. PDAC leverages the Projected Gaussian Mixture (PGM) model to estimate each sample’s joint density, enabling efficient density-prioritized buffer selection. Finally, we introduce the streaming Expectation Maximization (EM) algorithm to enhance the adaptability of PGM parameters to streaming data, yielding Streaming PDAC (SPDAC) for streaming scenarios. Extensive comparative experiments show that our methods outperform other baselines across various CL settings while ensuring favorable efficiency.
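Density-prioritized buffer selection can be sketched with a single Gaussian fit standing in for PDAC's Projected Gaussian Mixture (toy data; the method's actual density model and scoring are richer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy class features: 95 inliers around the origin plus 5 far-away outliers.
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(6, 1, (5, 2))])

# Fit a single Gaussian and score each sample by (log) density.
mu = X.mean(axis=0)
cov = np.cov(X.T) + 1e-6 * np.eye(2)
inv = np.linalg.inv(cov)
log_density = -0.5 * np.einsum('ij,jk,ik->i', X - mu, inv, X - mu)

# Keep the 50 densest samples as the replay buffer.
buffer_idx = np.argsort(log_density)[::-1][:50]
print(int((buffer_idx < 95).sum()))  # how many selected samples are inliers
```

Selection reduces to one density evaluation plus a sort, avoiding the inner/outer iterations of bi-level coreset optimization; the streaming EM variant updates `mu` and `cov` (per mixture component) incrementally as data arrives.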

[391] AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search

Shuzhen Bi, Chang Song, Siyu Song, Jinze Lv, Jian Chen, Xinyun Wang, Aimin Zhou, Hao Hao

Main category: cs.LG

TL;DR: AutoSynth automates synthetic data generation for LLM fine-tuning without needing labeled datasets, using Monte Carlo Tree Search guided by dataset-free hybrid rewards to solve the cold start problem in subjective tasks.

Motivation: Manual curation of high-quality datasets for LLM fine-tuning is expensive, while existing automated methods require labeled datasets for reward modeling, creating a cold start problem especially for subjective tasks with no objective ground truth.

Method: Reframes workflow discovery as Monte Carlo Tree Search guided by a novel dataset-free hybrid reward with two LLM-as-judge components: one evaluates sample quality using dynamic task-specific metrics, and another assesses workflow code and prompt quality.

Result: On subjective educational tasks, AutoSynth-generated data dramatically outperforms baselines (40-51% vs 2-5%) and matches expert workflows on certain metrics, while reducing human effort from 5-7 hours to 30 minutes (>90% reduction).

Conclusion: AutoSynth effectively tackles the cold start issue in data-centric AI, providing a scalable and cost-effective method for subjective LLM tasks by discovering quality dimensions beyond human intuition.

Abstract: Supervised fine-tuning (SFT) of large language models (LLMs) for specialized tasks requires high-quality datasets, but manual curation is prohibitively expensive. Synthetic data generation offers scalability, but its effectiveness relies on complex, multi-stage workflows, integrating prompt engineering and model orchestration. Existing automated workflow methods face a cold start problem: they require labeled datasets for reward modeling, which is especially problematic for subjective, open-ended tasks with no objective ground truth. We introduce AutoSynth, a framework that automates workflow discovery and optimization without reference datasets by reframing the problem as a Monte Carlo Tree Search guided by a novel dataset-free hybrid reward. This reward enables meta-learning through two LLM-as-judge components: one evaluates sample quality using dynamically generated task-specific metrics, and another assesses workflow code and prompt quality. Experiments on subjective educational tasks show that while expert-designed workflows achieve higher human preference rates (96-99% win rates vs. AutoSynth’s 40-51%), models trained on AutoSynth-generated data dramatically outperform baselines (40-51% vs. 2-5%) and match or surpass expert workflows on certain metrics, suggesting discovery of quality dimensions beyond human intuition. These results are achieved while reducing human effort from 5-7 hours to just 30 minutes (>90% reduction). AutoSynth tackles the cold start issue in data-centric AI, offering a scalable, cost-effective method for subjective LLM tasks. Code: https://github.com/bisz9918-maker/AutoSynth.
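The tree-search backbone can be illustrated with the standard UCB1 rule used to pick a child node in Monte Carlo Tree Search. The workflow names and win counts are toy values; in AutoSynth the reward is the dataset-free LLM-as-judge hybrid described above, not wins:

```python
import math

# UCB1 score for a child node: exploitation (mean reward) plus an
# exploration bonus that shrinks as the child is visited more.
def ucb1(wins, visits, total_visits, c=1.4):
    if visits == 0:
        return float("inf")  # unexplored children are always tried first
    return wins / visits + c * math.sqrt(math.log(total_visits) / visits)

# Hypothetical candidate workflows with (reward, visit) statistics.
children = {"wf_a": (6, 10), "wf_b": (3, 4), "wf_c": (0, 0)}
total = sum(v for _, v in children.values())
best = max(children, key=lambda k: ucb1(*children[k], total))
print(best)
```

Each selected workflow is then executed to generate samples, the hybrid reward scores them, and the result is backed up the tree, so promising prompt/orchestration edits get searched more deeply over time.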

[392] Quasi-Newton Compatible Actor-Critic for Deterministic Policies

Arash Bahari Kordabad, Dean Brandner, Sebastien Gros, Sergio Lucia, Sadegh Soudjani

Main category: cs.LG

TL;DR: A second-order deterministic actor-critic framework that uses quadratic critics to exploit curvature information for faster convergence in reinforcement learning.

Motivation: To extend classical deterministic policy gradient methods by incorporating second-order curvature information to improve convergence speed and performance over first-order methods.

Method: Proposes a quadratic critic that preserves true policy gradient and approximates performance Hessian, with least-squares temporal difference learning for efficient parameter estimation, enabling quasi-Newton actor updates.

Result: Numerical examples show improved convergence and performance compared to standard deterministic actor-critic baselines.

Conclusion: The framework successfully leverages second-order information through quadratic critics to achieve faster convergence while maintaining generality for any differentiable policy class.

Abstract: In this paper, we propose a second-order deterministic actor-critic framework in reinforcement learning that extends the classical deterministic policy gradient method to exploit curvature information of the performance function. Building on the concept of compatible function approximation for the critic, we introduce a quadratic critic that simultaneously preserves the true policy gradient and an approximation of the performance Hessian. A least-squares temporal difference learning scheme is then developed to estimate the quadratic critic parameters efficiently. This construction enables a quasi-Newton actor update using information learned by the critic, yielding faster convergence compared to first-order methods. The proposed approach is general and applicable to any differentiable policy class. Numerical examples demonstrate that the method achieves improved convergence and performance over standard deterministic actor-critic baselines.
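The core idea, fitting a local quadratic model of the performance by least squares and then taking a Newton-style step, can be illustrated on a toy concave performance function. This is a deliberate simplification for illustration (direct least-squares regression on perturbed evaluations), not the paper's compatible-critic LSTD construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def performance(theta):
    # Toy concave performance function with known maximiser theta* = [1, -2].
    opt = np.array([1.0, -2.0])
    return -np.sum((theta - opt) ** 2)

def fit_quadratic_critic(theta, n_samples=200, radius=0.5):
    """Least-squares fit of J(theta + d) ~ c + g.d + 0.5 d^T H d from
    random perturbations d; returns the estimated gradient g and Hessian H."""
    dim = theta.size
    ds = rng.uniform(-radius, radius, size=(n_samples, dim))
    ys = np.array([performance(theta + d) for d in ds])
    # Design matrix: constant, linear terms, and quadratic monomials of d.
    feats = [np.ones(n_samples)]
    feats += [ds[:, i] for i in range(dim)]
    quad_idx = [(i, j) for i in range(dim) for j in range(i, dim)]
    for i, j in quad_idx:
        scale = 0.5 if i == j else 1.0
        feats.append(scale * ds[:, i] * ds[:, j])
    X = np.stack(feats, axis=1)
    w, *_ = np.linalg.lstsq(X, ys, rcond=None)
    g = w[1:1 + dim]
    H = np.zeros((dim, dim))
    for k, (i, j) in enumerate(quad_idx):
        H[i, j] = H[j, i] = w[1 + dim + k]
    return g, H

theta = np.zeros(2)
g, H = fit_quadratic_critic(theta)
theta_new = theta - np.linalg.solve(H, g)   # quasi-Newton actor update
```

Because the critic captures curvature (H), a single Newton step lands at the optimum of this quadratic toy problem, whereas a first-order step would only move partway along the gradient.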

[393] GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences

Jingquan Yan, Yuwei Miao, Lei Yu, Yuzhi Guo, Xue Xiao, Lin Xu, Junzhou Huang

Main category: cs.LG

TL;DR: GenePheno is the first interpretable multi-label framework that predicts knockout-induced phenotypic abnormalities directly from gene sequences, achieving state-of-the-art performance across multiple datasets.

DetailsMotivation: Existing methods either focus on limited phenotype sets or rely on curated genetic information, limiting scalability and generalizability for broadly predicting multiple phenotype abnormalities from gene sequences.

Method: Uses contrastive multi-label learning with inter-phenotype correlation capture, exclusive regularization for biological consistency, and a gene function bottleneck layer for interpretability.

Result: Achieves state-of-the-art gene-centric Fmax and phenotype-centric AUC across four curated datasets, with case studies demonstrating ability to reveal gene functional mechanisms.

Conclusion: GenePheno successfully bridges the modality gap between sequences and phenotypes, providing an interpretable framework for scalable prediction of knockout-induced phenotypic abnormalities directly from gene sequences.

Abstract: Exploring how genetic sequences shape phenotypes is a fundamental challenge in biology and a key step toward scalable, hypothesis-driven experimentation. The task is complicated by the large modality gap between sequences and phenotypes, as well as the pleiotropic nature of gene-phenotype relationships. Existing sequence-based efforts focus on the degree to which variants of specific genes alter a limited set of phenotypes, while general gene knockout induced phenotype abnormality prediction methods heavily rely on curated genetic information as inputs, which limits scalability and generalizability. As a result, the task of broadly predicting the presence of multiple phenotype abnormalities under gene knockout directly from gene sequences remains underexplored. We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. It further incorporates a gene function bottleneck layer, offering human interpretable concepts that reflect functional mechanisms behind phenotype formation. To support progress in this area, we curate four datasets with canonical gene sequences as input and multi-label phenotypic abnormalities induced by gene knockouts as targets. Across these datasets, GenePheno achieves state-of-the-art gene-centric Fmax and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.
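As a rough illustration of how an exclusivity term can be combined with a multi-label objective, the sketch below adds to binary cross-entropy a penalty that grows when two phenotypes assumed to be mutually exclusive are both predicted with high probability. The product-of-probabilities functional form is a hypothetical choice for illustration, not the paper's exact regulariser.

```python
import numpy as np

def multilabel_loss_with_exclusivity(probs, targets, exclusive_pairs, lam=1.0):
    """Binary cross-entropy over phenotype labels plus a penalty on
    jointly predicting phenotype pairs (given as index pairs) that are
    assumed to be biologically incompatible."""
    eps = 1e-12
    bce = -np.mean(targets * np.log(probs + eps)
                   + (1 - targets) * np.log(1 - probs + eps))
    exclusivity = sum(probs[i] * probs[j] for i, j in exclusive_pairs)
    return bce + lam * exclusivity
```

Predicting both members of an exclusive pair with high confidence is penalised even when one of them matches the labels, nudging the model toward biologically consistent label sets.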

[394] Event-Driven Digital-Time-Domain Inference Architectures for Tsetlin Machines

Tian Lan, Rishad Shafik, Alex Yakovlev

Main category: cs.LG

TL;DR: A digital-time-domain computing approach for Tsetlin machine inference that uses delay accumulation and Winner-Takes-All schemes to significantly improve energy efficiency and throughput compared to conventional digital implementations.

DetailsMotivation: Traditional machine learning models require extensive arithmetic computations during inference, leading to high latency and power consumption. This paper aims to address these challenges specifically for Tsetlin machines.

Method: Proposes a digital-time-domain computing approach using delay accumulation mechanism to replace arithmetic sums, Winner-Takes-All scheme instead of magnitude comparators, Hamming distance-driven time-domain scheme for multi-class TMs, and differential delay paths with leading-ones-detector logarithmic delay compression for coalesced TMs.

Result: The proposed architecture demonstrates orders-of-magnitude improvements in energy efficiency and throughput compared to functionally equivalent post-implementation digital TM architecture baseline.

Conclusion: Digital-time-domain computing effectively addresses the computational overhead in Tsetlin machine inference, achieving substantial performance gains through innovative delay-based computation schemes.

Abstract: Machine learning fits model parameters to approximate input-output mappings, predicting unknown samples. However, these models often require extensive arithmetic computations during inference, increasing latency and power consumption. This paper proposes a digital-time-domain computing approach for Tsetlin machine (TM) inference process to address these challenges. This approach leverages a delay accumulation mechanism to mitigate the costly arithmetic sums of classes and employs a Winner-Takes-All scheme to replace conventional magnitude comparators. Specifically, a Hamming distance-driven time-domain scheme is implemented for multi-class TMs. Furthermore, differential delay paths, combined with a leading-ones-detector logarithmic delay compression digital-time-domain scheme, are utilised for the coalesced TMs, accommodating both binary-signed and exponential-scale delay accumulation issues. Compared to the functionally equivalent, post-implementation digital TM architecture baseline, the proposed architecture demonstrates orders-of-magnitude improvements in energy efficiency and throughput.
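The Winner-Takes-All idea can be simulated in a few lines: each class sum is mapped to a delay (larger sum, shorter delay), and the class whose event fires first wins, with no magnitude comparator needed. This is a behavioural model for illustration, not the hardware design.

```python
def winner_takes_all(class_sums, max_sum):
    """Behavioural model of delay-based WTA: encode each class sum as a
    delay proportional to (max_sum - sum), then return the class whose
    event arrives first (ties broken by class index)."""
    delays = [max_sum - s for s in class_sums]
    return min(range(len(delays)), key=lambda c: (delays[c], c))
```

In hardware the "min over arrival times" is free: the first rising edge latches the winner, which is where the energy and throughput gains over arithmetic comparison come from.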

[395] SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar

Main category: cs.LG

TL;DR: SiDGen is a protein-conditioned diffusion framework that generates chemically valid ligands using masked SMILES generation with lightweight folding-derived features for pocket awareness, balancing efficiency and structural compatibility.

DetailsMotivation: Existing approaches for ligand design either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability in computational drug discovery.

Method: Uses protein-conditioned diffusion with masked SMILES generation and lightweight folding-derived features. Supports two conditioning pathways: streamlined mode with pooled structural signals and full mode with localized pairwise biases. Implements coarse-stride folding with nearest-neighbor upsampling to reduce memory costs.

Result: Generates ligands with high validity, uniqueness, and novelty. Achieves competitive performance in docking-based evaluations while maintaining reasonable molecular properties.

Conclusion: SiDGen provides a practical route to scalable, pocket-aware molecular design for high-throughput drug discovery by balancing expressivity with efficiency.

Abstract: Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
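The memory-saving trick behind the coarse-stride mechanism can be sketched directly: compute pairwise biases only at every stride-th residue, then nearest-neighbour upsample back to full resolution, shrinking the quadratic pair tensor by a factor of stride squared. The sketch below shows the upsampling half of that idea under those assumptions.

```python
import numpy as np

def upsample_pair_bias(coarse_bias, stride):
    """Nearest-neighbour upsampling of a coarse-stride pairwise bias.

    coarse_bias has shape (L//stride, L//stride); repeating each entry
    stride times along both axes recovers a full (L, L) bias without
    ever materialising pair features at residue resolution."""
    return np.repeat(np.repeat(coarse_bias, stride, axis=0), stride, axis=1)
```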

[396] NSL-MT: Linguistically Informed Negative Samples for Efficient Machine Translation in Low-Resource Languages

Mamadou K. Keita, Christopher Homan, Huy Le

Main category: cs.LG

TL;DR: NSL-MT is a machine translation training method that teaches models what not to generate by adding penalties for linguistically invalid outputs, improving performance and data efficiency.

DetailsMotivation: To address limited parallel data in machine translation by explicitly teaching models to avoid linguistically invalid outputs through constraint-based penalties.

Method: Encodes linguistic constraints as severity-weighted penalties in the loss function, generates synthetic violations of target language grammar, and penalizes models for assigning high probability to invalid outputs.

Result: Achieves 3-12% BLEU gains for well-performing models, 56-89% gains for poorly-performing models, and provides 5x data efficiency - training with 1,000 examples matches/exceeds normal training with 5,000 examples.

Conclusion: NSL-MT offers a data-efficient alternative training method for machine translation in settings with limited annotated parallel corpora.

Abstract: We introduce Negative Space Learning MT (NSL-MT), a training method that teaches models what not to generate by encoding linguistic constraints as severity-weighted penalties in the loss function. NSL-MT augments limited parallel data with synthetically generated violations of target language grammar, explicitly penalizing the model when it assigns high probability to these linguistically invalid outputs. We demonstrate that NSL-MT delivers improvements across all architectures: 3-12% BLEU gains for well-performing models and 56-89% gains for models lacking decent initial support. Furthermore, NSL-MT provides a 5x data efficiency multiplier – training with 1,000 examples matches or exceeds normal training with 5,000 examples. Thus, NSL-MT provides a data-efficient alternative training method for settings where annotated parallel corpora are limited.
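The severity-weighted penalty can be sketched as follows. The abstract only specifies that the model is penalised for assigning high probability to invalid outputs in proportion to violation severity, so the threshold form and the example severities below are illustrative assumptions rather than the paper's exact loss.

```python
import math

def nsl_mt_loss(logp_reference, violations, threshold=math.log(0.05)):
    """Cross-entropy on the reference translation plus penalties for
    synthetic grammar violations.

    violations: list of (logp, severity) pairs; a violation contributes
    only when the model assigns it more log-probability than `threshold`,
    scaled by its linguistic severity."""
    cross_entropy = -logp_reference
    penalty = sum(sev * max(0.0, lp - threshold) for lp, sev in violations)
    return cross_entropy + penalty
```

A severe violation the model finds likely (e.g. probability 0.2 against a 0.05 threshold) is penalised heavily, while violations the model already assigns negligible probability contribute nothing, so the extra signal concentrates exactly where the model's grammar is weakest.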

[397] Beyond the Hype: Embeddings vs. Prompting for Multiclass Classification Tasks

Marios Kokkodis, Richard Demsyn-Jones, Vijay Raghavan

Main category: cs.LG

TL;DR: Embeddings-based softmax models outperform LLM prompting for multiclass classification on home-service project data, achieving 49.5% higher accuracy, better calibration, faster speed, and lower cost.

DetailsMotivation: To challenge the AI hype around LLMs and demonstrate that traditional classification approaches can still be superior for certain multiclass classification problems using proprietary datasets.

Method: Built embeddings-based softmax models using text and images from home-service project descriptions, then compared against state-of-the-art LLM prompts for the same classification task.

Result: Embeddings approach outperformed LLM prompting with 49.5% higher accuracy, better calibration, 14-81x faster processing, and up to 10x lower cost. Performance was consistent across text-only, image-only, and multimodal inputs.

Conclusion: For multiclass classification problems leveraging proprietary datasets, embeddings-based approaches can yield unequivocally better results than LLM prompting, providing a practical alternative beyond AI hype.

Abstract: Are traditional classification approaches irrelevant in this era of AI hype? We show that there are multiclass classification problems where predictive models holistically outperform LLM prompt-based frameworks. Given text and images from home-service project descriptions provided by Thumbtack customers, we build embeddings-based softmax models that predict the professional category (e.g., handyman, bathroom remodeling) associated with each problem description. We then compare against prompts that ask state-of-the-art LLM models to solve the same problem. We find that the embeddings approach outperforms the best LLM prompts in terms of accuracy, calibration, latency, and financial cost. In particular, the embeddings approach has 49.5% higher accuracy than the prompting approach, and its superiority is consistent across text-only, image-only, and text-image problem descriptions. Furthermore, it yields well-calibrated probabilities, which we later use as confidence signals to provide contextualized user experience during deployment. On the contrary, prompting scores are overly uninformative. Finally, the embeddings approach is 14 and 81 times faster than prompting in processing images and text respectively, while under realistic deployment assumptions, it can be up to 10 times cheaper. Based on these results, we deployed a variation of the embeddings approach, and through A/B testing we observed performance consistent with our offline analysis. Our study shows that for multiclass classification problems that can leverage proprietary datasets, an embeddings-based approach may yield unequivocally better results. Hence, scientists, practitioners, engineers, and business leaders can use our study to go beyond the hype and consider appropriate predictive models for their classification use cases.
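The embeddings-plus-softmax recipe is simple enough to sketch end to end. Here synthetic vectors stand in for the text/image embeddings (in the paper these come from an embedding model over Thumbtack project descriptions), and the classifier is plain multinomial logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for precomputed embeddings and their category labels.
X = rng.normal(size=(300, 8))
true_W = rng.normal(size=(8, 3))
y = np.argmax(X @ true_W, axis=1)       # 3 professional categories

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Multinomial logistic regression ("embeddings-based softmax model"),
# trained by full-batch gradient descent on cross-entropy.
W = np.zeros((8, 3))
Y = np.eye(3)[y]
for _ in range(1000):
    P = softmax(X @ W)
    W -= 0.1 * X.T @ (P - Y) / len(X)

probs = softmax(X @ W)                  # per-class confidence signals
acc = np.mean(np.argmax(probs, axis=1) == y)
```

Once the embeddings are cached, inference is a single matrix multiply and a softmax, which is where the latency and cost advantages over per-request LLM prompting come from, and the output probabilities double as the calibrated confidence signals used at deployment.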

[398] GraphRAFT: Retrieval Augmented Fine-Tuning for Knowledge Graphs on Graph Databases

Alfred Clemedtson, Borun Shi

Main category: cs.LG

TL;DR: GraphRAFT is a framework that fine-tunes LLMs to generate correct Cypher queries for retrieving subgraph contexts from knowledge graphs, enabling accurate answers to multi-hop questions without hallucinations.

DetailsMotivation: Existing GraphRAG methods either ignore retrieval or use inefficient ad hoc processes, preventing adoption with graph databases that support query languages like Cypher.

Method: Fine-tune LLMs to generate provably correct Cypher queries that retrieve high-quality subgraph contexts, then use these contexts to produce accurate answers through a retrieve-and-reason framework.

Result: Achieves significantly better results than all state-of-the-art models across four standard metrics on two challenging Q&A tasks with large text-attributed knowledge graphs.

Conclusion: GraphRAFT is the first off-the-shelf solution for knowledge graphs in native graph databases, is sample-efficient, scales with training data, and outperforms existing methods.

Abstract: Large language models have shown remarkable language processing and reasoning ability but are prone to hallucinate when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fit into an LLM’s context window and prompts the LLM for an answer. GraphRAG extends this approach to structured Knowledge Graphs (KGs) and questions regarding entities multiple hops away. The majority of recent GraphRAG methods either overlook the retrieval step or have ad hoc retrieval processes that are abstract or inefficient. This prevents them from being adopted when the KGs are stored in graph databases supporting graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries to retrieve high-quality subgraph contexts and produce accurate answers. Our method is the first such solution that can be taken off-the-shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. Our method achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q&As on large text-attributed KGs.

[399] Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, Tengyang Xie

Main category: cs.LG

TL;DR: TBRM is a value-based RL method for LLM reasoning that uses trajectory-level Bellman residual minimization with model logits as Q-values, outperforming policy-based methods like PPO with lower computational overhead.

DetailsMotivation: Policy-based methods dominate RL for LLM reasoning, leaving value-based approaches largely unexplored despite their potential benefits.

Method: Trajectory Bellman Residual Minimization (TBRM) - an off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model’s own logits as Q-values, eliminating need for critics, importance sampling, or clipping.

Result: TBRM consistently outperforms policy-based baselines (PPO, GRPO) on mathematical reasoning benchmarks with comparable or lower computational and memory overhead.

Conclusion: Value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

Abstract: Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model’s own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.
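The flavour of the objective can be sketched with a per-step soft-Bellman residual over one trajectory, reading the model's logits as Q-values. This is a simplified stand-in: the paper optimises a single trajectory-level objective with a KL-regularised target, whereas the sketch below sums squared per-step residuals.

```python
import numpy as np

def soft_value(q_row):
    # Soft state value V(s) = log sum_a exp(Q(s, a)), computed stably.
    m = q_row.max()
    return m + np.log(np.exp(q_row - m).sum())

def trajectory_bellman_residual(logits, actions, reward):
    """Mean squared soft-Bellman residual along one rollout.

    logits: (T, vocab) model logits interpreted as Q-values;
    actions: chosen token ids; reward: terminal verifiable reward
    (e.g. 1.0 if the final answer checks out)."""
    T = len(actions)
    loss = 0.0
    for t in range(T):
        q = logits[t, actions[t]]
        target = reward if t == T - 1 else soft_value(logits[t + 1])
        loss += (q - target) ** 2
    return loss / T
```

Note what is absent: no critic network, no importance-sampling ratio, no clipping; the only learning signal is the mismatch between the model's own logits and the (soft) Bellman targets along a single rollout.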

[400] Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M. -C. Höhne, Oliver Eberle

Main category: cs.LG

TL;DR: PRISM is a new framework for automated interpretability that identifies polysemantic features in LLMs, addressing limitations of current methods that assume single-concept neurons.

DetailsMotivation: Current automated neuron-level feature description methods have limited robustness and assume monosemanticity, which restricts their ability to capture the full complexity of model behaviors, especially given evidence of polysemanticity in LLMs.

Method: Introduces PRISM framework that produces nuanced feature descriptions accounting for both monosemantic and polysemantic behavior, unlike traditional single-description-per-neuron approaches.

Result: PRISM produces more accurate and faithful feature descriptions, improving both overall description quality and ability to capture distinct concepts when polysemanticity is present, as demonstrated through extensive benchmarking.

Conclusion: PRISM effectively addresses the limitations of current automated interpretability methods by better capturing the complexity of features in LLMs, particularly handling polysemantic behavior that traditional approaches miss.

Abstract: Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Within the context of large language models (LLMs) for natural language processing (NLP), current automated neuron-level feature description methods face two key challenges: limited robustness and the assumption that each neuron encodes a single concept (monosemanticity), despite increasing evidence of polysemanticity. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework specifically designed to capture the complexity of features in LLMs. Unlike approaches that assign a single description per neuron, common in many automated interpretability methods in NLP, PRISM produces more nuanced descriptions that account for both monosemantic and polysemantic behavior. We apply PRISM to LLMs and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).

[401] CSAI: Conditional Self-Attention Imputation for Healthcare Time-series

Linglong Qian, Joseph Arul Raj, Hugh Logan Ellis, Ao Zhang, Yuezhou Zhang, Tao Wang, Richard JB Dobson, Zina Ibrahim

Main category: cs.LG

TL;DR: CSAI is a novel recurrent neural network for imputing missing data in EHR time series, featuring attention-based initialization, clinical decay patterns, and non-uniform masking to handle complex missingness.

DetailsMotivation: To address complex missing data patterns in EHR time series that existing methods struggle with, by developing an approach that aligns better with clinical realities.

Method: Uses conditional self-attention with three key modifications: attention-based hidden state initialization, domain-informed temporal decay, and non-uniform masking strategy calibrated to EHR data characteristics.

Result: Demonstrated superior performance across four EHR benchmark datasets compared to state-of-the-art methods in both data restoration and downstream tasks.

Conclusion: CSAI significantly advances neural network imputation for EHRs by better modeling clinical data realities and is available as part of the open-source PyPOTS toolbox.

Abstract: We introduce the Conditional Self-Attention Imputation (CSAI) model, a novel recurrent neural network architecture designed to address the challenges of complex missing data patterns in multivariate time series derived from hospital electronic health records (EHRs). CSAI extends state-of-the-art neural network-based imputation by introducing key modifications specific to EHR data: a) attention-based hidden state initialisation to capture both long- and short-range temporal dependencies prevalent in EHRs, b) domain-informed temporal decay to mimic clinical data recording patterns, and c) a non-uniform masking strategy that models non-random missingness by calibrating weights according to both temporal and cross-sectional data characteristics. Comprehensive evaluation across four EHR benchmark datasets demonstrates CSAI’s effectiveness compared to state-of-the-art architectures in data restoration and downstream tasks. CSAI is integrated into PyPOTS, an open-source Python toolbox designed for machine learning tasks on partially observed time series. This work significantly advances the state of neural network imputation applied to EHRs by more closely aligning algorithmic imputation with clinical realities.
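The domain-informed temporal decay can be sketched for a single feature channel in the spirit of GRU-D-style decay: the longer a value has gone unobserved, the less the imputation trusts the carried-forward observation and the more it falls back to the feature's empirical mean. The parameterisation below is illustrative, not CSAI's exact formulation.

```python
import numpy as np

def decayed_impute(last_obs, mask, deltas, w, b, feature_mean):
    """Decay-based imputation for one feature channel.

    last_obs: last observed value carried forward at each step;
    mask: 1 where a real observation exists; deltas: time since the
    last observation; w, b: learned decay parameters (scalars here)."""
    gamma = np.exp(-np.maximum(0.0, w * deltas + b))   # trust in carried value
    estimate = gamma * last_obs + (1.0 - gamma) * feature_mean
    return mask * last_obs + (1.0 - mask) * estimate
```

This mirrors how clinicians read charts: a lab value from five minutes ago is taken at face value, while one from two days ago is discounted toward the population norm.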

[402] An Information Theoretic Evaluation Metric For Strong Unlearning

Dongjae Jeon, Wonje Jeung, Taeheon Kim, Albert No, Jonghyun Choi

Main category: cs.LG

TL;DR: The paper introduces IDI, a white-box metric based on information theory to evaluate machine unlearning by measuring retained information in intermediate layers, addressing limitations of existing black-box methods.

DetailsMotivation: Current black-box metrics for machine unlearning (like membership inference attacks) fail to capture residual information in intermediate layers, making it challenging to properly evaluate strong unlearning where models should be indistinguishable from retrained ones.

Method: Proposed Information Difference Index (IDI) - a white-box metric that quantifies retained information by measuring mutual information between intermediate features and labels to be forgotten.

Result: Experiments show IDI effectively measures unlearning degree across various datasets and architectures, providing reliable evaluation of strong unlearning in DNNs.

Conclusion: IDI offers a comprehensive assessment of unlearning efficacy and serves as a reliable tool for evaluating strong machine unlearning in deep neural networks.

Abstract: Machine unlearning (MU) aims to remove the influence of specific data from trained models, addressing privacy concerns and ensuring compliance with regulations such as the “right to be forgotten.” Evaluating strong unlearning, where the unlearned model is indistinguishable from one retrained without the forgetting data, remains a significant challenge in deep neural networks (DNNs). Common black-box metrics, such as variants of membership inference attacks and accuracy comparisons, primarily assess model outputs but often fail to capture residual information in intermediate layers. To bridge this gap, we introduce the Information Difference Index (IDI), a novel white-box metric inspired by information theory. IDI quantifies retained information in intermediate features by measuring mutual information between those features and the labels to be forgotten, offering a more comprehensive assessment of unlearning efficacy. Our experiments demonstrate that IDI effectively measures the degree of unlearning across various datasets and architectures, providing a reliable tool for evaluating strong unlearning in DNNs.
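The measurement underlying IDI can be illustrated with a simple histogram estimate of mutual information between a scalar intermediate feature and the labels to be forgotten; a layer that still separates the forget-set classes carries high MI, while a properly unlearned layer should not. This estimator is a simplified stand-in for illustration, and IDI itself is built from comparing such quantities between the unlearned model and a retrained reference.

```python
import numpy as np

def mutual_information(feature, labels, bins=8):
    """Histogram estimate of I(feature; label) in nats for a 1-D feature."""
    edges = np.histogram_bin_edges(feature, bins)
    f = np.digitize(feature, edges)
    joint = np.zeros((bins + 2, int(labels.max()) + 1))
    for fi, yi in zip(f, labels):
        joint[fi, yi] += 1.0
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)   # marginal over feature bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over labels
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(joint / (pf * py))
    return float(np.nansum(terms))          # empty cells contribute zero
```

A feature that still encodes the forgotten labels yields MI near the label entropy, whereas an independent feature yields MI near zero, which is exactly the residual-information signal black-box output metrics miss.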

[403] Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

Main category: cs.LG

TL;DR: THR is a token-level metric that quantifies each token’s influence on correct responses under GRPO. Training is dominated by tokens with high absolute THR values, where positive THR favors exploitation and negative THR enables exploration. A THR-guided reweighting algorithm can bias training toward exploitation or exploration, improving performance on math reasoning benchmarks.

DetailsMotivation: While reinforcement learning with verifiable rewards has advanced LLM reasoning, how to explicitly steer training toward exploration or exploitation remains an open problem. Current methods lack fine-grained control over these fundamental RL trade-offs.

Method: Introduces Token Hidden Reward (THR) - a token-level metric quantifying each token’s influence on correct response likelihood under GRPO. Develops a THR-guided reweighting algorithm that modulates GRPO’s learning signals to bias training toward exploitation (amplifying positive THR tokens) or exploration (weakening negative THR tokens).

Result: The algorithm improves greedy-decoding accuracy when favoring exploitation, and yields consistent gains in Pass@K accuracy when favoring exploration. It integrates seamlessly with other RL objectives like GSPO and generalizes across architectures including Llama.

Conclusion: THR provides a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, offering new tools for targeted fine-tuning in reasoning-intensive applications.

Abstract: Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token’s influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO’s learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.

[404] TeVAE: A Variational Autoencoder Approach for Discrete Online Anomaly Detection in Variable-state Multivariate Time-series Data

Lucas Correia, Jan-Christoph Goos, Philipp Klein, Thomas Bäck, Anna V. Kononova

Main category: cs.LG

TL;DR: Proposes TeVAE, a temporal variational autoencoder for automatic online anomaly detection in automotive testing data, achieving 65% anomaly detection with 6% false positives on real-world industrial data.

DetailsMotivation: Manual evaluation of automotive testing data is reaching limits, creating need for automatic online anomaly detection to handle complex real-world data and model testee behavior.

Method: Uses temporal variational autoencoder (TeVAE) trained on unlabelled data, avoids bypass phenomenon, introduces window-to-time-series remapping method, and includes metrics for detection delay and root-cause evaluation.

Result: When properly configured, TeVAE achieves 65% anomaly detection rate with only 6% false positives, and shows potential for good performance with smaller training subsets.

Conclusion: TeVAE effectively addresses automotive testing anomaly detection needs but requires sophisticated threshold estimation methods for optimal performance.

Abstract: As attention to recorded data grows in the realm of automotive testing and manual evaluation reaches its limits, there is a growing need for automatic online anomaly detection. This real-world data is complex in many ways and requires the modelling of testee behaviour. To address this, we propose a temporal variational autoencoder (TeVAE) that can detect anomalies with minimal false positives when trained on unlabelled data. Our approach also avoids the bypass phenomenon and introduces a new method to remap individual windows to a continuous time series. Furthermore, we propose metrics to evaluate the detection delay and root-cause capability of our approach and present results from experiments on a real-world industrial data set. When properly configured, TeVAE wrongly flags anomalies only 6% of the time and detects 65% of the anomalies present. It also has the potential to perform well with a smaller training and validation subset but requires a more sophisticated threshold estimation method.
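One simple realisation of the window-to-time-series remapping is to average, at every time step, the scores of all sliding windows that cover it. This is a sketch of the general idea, not necessarily TeVAE's exact remapping.

```python
import numpy as np

def remap_windows(windows, stride):
    """Remap per-window scores back to a continuous time series.

    windows: (n_windows, window_len) array of window-level values;
    consecutive windows start `stride` steps apart. Each time step
    receives the mean over every window that covers it."""
    n_win, win_len = windows.shape
    T = (n_win - 1) * stride + win_len
    total = np.zeros(T)
    count = np.zeros(T)
    for i in range(n_win):
        total[i * stride:i * stride + win_len] += windows[i]
        count[i * stride:i * stride + win_len] += 1
    return total / count
```

Averaging over all covering windows smooths out the arbitrariness of window boundaries, so an anomaly score refers to a time step rather than to a particular window placement.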

[405] Conditional Distribution Learning for Graph Classification

Jie Chen, Hua Mao, Chuanbin Liu, Zhu Wang, Xi Peng

Main category: cs.LG

TL;DR: Proposes conditional distribution learning (CDL) to resolve the conflict between GNN message passing and contrastive learning, enabling effective graph representation learning while preserving semantic information.

Motivation: Address the conflict between graph neural network message-passing mechanisms (which produce similar embeddings) and contrastive learning (which aims to increase dissimilarity), while leveraging diverse graph augmentations without losing semantic information.

Method: End-to-end graph representation learning model that aligns conditional distributions of weakly and strongly augmented features over original features, using positive pairs of node representations to measure similarity between original and weakly augmented features.

Result: Extensive experiments on benchmark graph datasets demonstrate the effectiveness of the proposed CDL method for semi-supervised graph classification.

Conclusion: The CDL method successfully resolves the conflict between message-passing and contrastive learning, enabling effective graph representation learning while preserving intrinsic semantic information from diverse augmentations.

Abstract: Leveraging the diversity and quantity of data provided by various graph-structured data augmentations while preserving intrinsic semantic information is challenging. Additionally, successive layers in graph neural network (GNN) tend to produce more similar node embeddings, while graph contrastive learning aims to increase the dissimilarity between negative pairs of node embeddings. This inevitably results in a conflict between the message-passing mechanism (MPM) of GNNs and the contrastive learning (CL) of negative pairs via intraviews. In this paper, we propose a conditional distribution learning (CDL) method that learns graph representations from graph-structured data for semisupervised graph classification. Specifically, we present an end-to-end graph representation learning model to align the conditional distributions of weakly and strongly augmented features over the original features. This alignment enables the CDL model to effectively preserve intrinsic semantic information when both weak and strong augmentations are applied to graph-structured data. To avoid the conflict between the MPM and the CL of negative pairs, positive pairs of node representations are retained for measuring the similarity between the original features and the corresponding weakly augmented features. Extensive experiments with several benchmark graph datasets demonstrate the effectiveness of the proposed CDL method.
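
One plausible form of the alignment objective described above is a KL divergence from the original features' predictive distribution to each augmented view's distribution. The sketch below renders that reading; `cdl_alignment_loss` is a hypothetical name and the paper's exact loss may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-9):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def cdl_alignment_loss(logits_orig, logits_weak, logits_strong):
    """Align the conditional distributions of the weakly and strongly
    augmented views with the distribution induced by the original features."""
    p = softmax(logits_orig)
    return float(np.mean(kl(p, softmax(logits_weak)) +
                         kl(p, softmax(logits_strong))))

z = np.array([[2.0, 0.5, -1.0]])
loss_same = cdl_alignment_loss(z, z, z)    # identical views -> zero loss
loss_diff = cdl_alignment_loss(z, -z, -z)  # mismatched views -> larger loss
```

The positive-pair similarity term retained to avoid the negative-pair conflict would be an additional component, omitted here.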

[406] Certified Training with Branch-and-Bound for Lyapunov-stable Neural Control

Zhouxing Shi, Haoyu Li, Cho-Jui Hsieh, Huan Zhang

Main category: cs.LG

TL;DR: CT-BaB is a certified training framework that optimizes certified bounds for Lyapunov-stable neural controllers, reducing the training-verification gap and enabling efficient verification with larger regions-of-attraction.

Motivation: Previous works used counterexample-guided training without considering verification computation during training, leading to discrepancies between training and test-time verification that computes certified bounds.

Method: Introduces Certified Training with Branch-and-Bound (CT-BaB) that maintains dynamic training datasets and adaptively splits hard input subregions to tighten certified bounds. Training-time BaB informs test-time verification for efficiency.

Result: CT-BaB reduces verification time by over 11X relative to previous state-of-the-art while achieving 164X larger ROA on 2D Quadrotor system. Produces verification-friendly models with stronger verifiable guarantees.

Conclusion: CT-BaB framework successfully bridges training-verification gap, enabling more efficient verification and stronger stability guarantees for neural controllers through certified bound optimization and adaptive subregion splitting.

Abstract: We study the problem of learning verifiably Lyapunov-stable neural controllers that provably satisfy the Lyapunov asymptotic stability condition within a region-of-attraction (ROA). Unlike previous works that adopted counterexample-guided training without considering the computation of verification in training, we introduce Certified Training with Branch-and-Bound (CT-BaB), a new certified training framework that optimizes certified bounds, thereby reducing the discrepancy between training and test-time verification that also computes certified bounds. To achieve a relatively global guarantee on an entire input region-of-interest, we propose a training-time BaB technique that maintains a dynamic training dataset and adaptively splits hard input subregions into smaller ones, to tighten certified bounds and ease the training. Meanwhile, subregions created by the training-time BaB also inform test-time verification, for a more efficient training-aware verification. We demonstrate that CT-BaB yields verification-friendly models that can be more efficiently verified at test time while achieving stronger verifiable guarantees with larger ROA. On the largest output-feedback 2D Quadrotor system experimented, CT-BaB reduces verification time by over 11X relative to the previous state-of-the-art baseline while achieving 164X larger ROA.
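
The training-time branch-and-bound loop can be illustrated with a toy box-splitting routine: whenever the certified bound on an input box is too loose, split the box along its widest dimension and retry on the halves. The Lyapunov-specific bound computation is abstracted into a user-supplied `bound_fn`; all names here are illustrative, not the paper's API.

```python
import numpy as np

def branch_and_bound(lo, hi, bound_fn, tol, max_splits=100):
    """Adaptively split hard input boxes until the certified bound on every
    surviving box clears `tol` (toy version of training-time BaB)."""
    boxes = [(np.asarray(lo, float), np.asarray(hi, float))]
    verified = []
    splits = 0
    while boxes and splits < max_splits:
        lo_b, hi_b = boxes.pop()
        if bound_fn(lo_b, hi_b) <= tol:
            verified.append((lo_b, hi_b))
            continue
        d = int(np.argmax(hi_b - lo_b))        # split the widest dimension
        mid = 0.5 * (lo_b[d] + hi_b[d])
        left_hi, right_lo = hi_b.copy(), lo_b.copy()
        left_hi[d], right_lo[d] = mid, mid
        boxes += [(lo_b, left_hi), (right_lo, hi_b)]
        splits += 1
    return verified

# Toy bound: looser on wider boxes, so splitting tightens it.
bound = lambda lo_b, hi_b: float(np.max(hi_b - lo_b))
regions = branch_and_bound([0.0, 0.0], [1.0, 1.0], bound, tol=0.5)
```

In the paper, the subregions produced this way also seed the test-time verifier, which is what makes verification training-aware.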

[407] Contextual Thompson Sampling via Generation of Missing Data

Kelly W. Zhang, Tiffany Tianhui Cai, Hongseok Namkoong, Daniel Russo

Main category: cs.LG

TL;DR: A novel Thompson sampling framework for contextual bandits that uses offline-learned generative models to impute missing outcomes, treating uncertainty as stemming from unobservable outcomes rather than latent parameters.

Motivation: To reframe uncertainty in contextual bandits as arising from missing outcomes rather than unobservable latent parameters, enabling the use of generative models learned offline to probabilistically impute these outcomes for better decision-making.

Method: At each decision point, the algorithm uses a generative model to probabilistically impute missing outcomes (both future and counterfactual), fits an ‘oracle’ policy on the complete imputed dataset, and uses this policy to select actions.

Result: The algorithm is formally shown to be a generative formulation of Thompson sampling and achieves a state-of-the-art regret bound that depends only on the generative model’s offline prediction loss quality.

Conclusion: This generative approach to Thompson sampling provides a flexible framework where regret depends on generative model quality rather than specific policy fitting methods, offering a new perspective on uncertainty quantification in contextual bandits.

Abstract: We introduce a framework for Thompson sampling (TS) contextual bandit algorithms, in which the algorithm’s ability to quantify uncertainty and make decisions depends on the quality of a generative model that is learned offline. Instead of viewing uncertainty in the environment as arising from unobservable latent parameters, our algorithm treats uncertainty as stemming from missing, but potentially observable outcomes (including both future and counterfactual outcomes). If these outcomes were all observed, one could simply make decisions using an “oracle” policy fit on the complete dataset. Inspired by this conceptualization, at each decision-time, our algorithm uses a generative model to probabilistically impute missing outcomes, fits a policy using the imputed complete dataset, and uses that policy to select the next action. We formally show that this algorithm is a generative formulation of TS and establish a state-of-the-art regret bound. Notably, our regret bound depends on the generative model only through the quality of its offline prediction loss, and applies to any method of fitting the “oracle” policy.
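
The decision-time loop in the abstract — impute missing outcomes with a generative model, fit an "oracle" policy on the completed dataset, act with that policy — can be sketched with a toy greedy oracle. The paper allows any policy-fitting method; the names and the deterministic toy imputer below are hypothetical, and in practice the generative model's sampling randomness is what drives Thompson-sampling-style exploration.

```python
import numpy as np

def generative_ts_step(observed, n_actions, n_future, sample_outcome):
    """One decision: impute each action's missing outcomes with a generative
    model, then act with a greedy 'oracle' fit on the completed data."""
    best_action, best_mean = 0, -np.inf
    for a in range(n_actions):
        seen = [r for (act, r) in observed if act == a]
        imputed = [sample_outcome(a) for _ in range(n_future)]  # fill the gaps
        mean = float(np.mean(seen + imputed))
        if mean > best_mean:
            best_action, best_mean = a, mean
    return best_action

# Toy generative model that believes action a yields reward a.
action = generative_ts_step([(0, 5.0)], n_actions=3, n_future=4,
                            sample_outcome=float)
```

With four imputed futures per arm, the single observed reward of 5.0 on arm 0 is averaged down toward the model's belief, so the imputation-optimistic arm 2 wins.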

[408] A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, Satya Narayan Shukla

Main category: cs.LG

TL;DR: LOOP is a novel RL method for diffusion model fine-tuning that combines REINFORCE’s variance reduction techniques with PPO’s robustness and sample efficiency, achieving better computational efficiency-performance balance.

Motivation: PPO is effective but computationally expensive and hyperparameter-sensitive, while REINFORCE is computationally simpler but suffers from high variance and sample inefficiency. There's a need for a method that balances efficiency and effectiveness.

Method: LOOP combines variance reduction techniques from REINFORCE (multiple actions per prompt, baseline correction) with PPO’s robustness mechanisms (clipping and importance sampling) to create a hybrid approach.

Result: LOOP effectively improves diffusion models on various black-box objectives and achieves better balance between computational efficiency and performance compared to PPO and REINFORCE.

Conclusion: LOOP successfully addresses the efficiency-effectiveness trade-off in RL-based diffusion model fine-tuning by integrating strengths from both REINFORCE and PPO approaches.

Abstract: Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular choice of method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some computational complexities such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high-variance and sample inefficiency. While the variance of the REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO (LOOP), a novel RL for diffusion fine-tuning method. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
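
The leave-one-out baseline at the heart of LOOP is concrete enough to sketch: with K actions sampled per prompt, each action's baseline is the mean reward of the other K−1 actions, and a PPO-style clipped surrogate is applied on top. Function names below are mine, not the paper's.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Leave-one-out baseline: each action's baseline is the mean reward of
    the other K-1 actions sampled for the same prompt."""
    rewards = np.asarray(rewards, float)       # shape: (n_prompts, K)
    k = rewards.shape[-1]
    baseline = (rewards.sum(-1, keepdims=True) - rewards) / (k - 1)
    return rewards - baseline

def clipped_objective(ratio, adv, eps=0.2):
    """PPO-style clipped surrogate on the importance ratio."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

adv = leave_one_out_advantages([[1.0, 0.0, 2.0]])  # one prompt, K = 3
```

A useful property visible in the sketch: leave-one-out advantages for each prompt sum to zero, which is what reduces variance without introducing bias.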

[409] What’s Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan

Main category: cs.LG

TL;DR: The paper introduces ‘steerability’ as a new metric for evaluating generative models, focusing on whether users can achieve specific goals rather than just model output quality. It presents a benchmark using reproduction tasks and shows current models perform poorly on steerability despite good output quality.

Motivation: Current evaluation metrics for generative models focus mainly on producibility (output quality and diversity), but fail to measure whether users can actually steer models to achieve their specific goals, which is crucial for practical value.

Method: The authors mathematically decompose steerability from producibility and create a benchmark where users must reproduce outputs sampled from generative models. They implement this in user studies for text-to-image and large language models.

Result: Despite models producing high-quality outputs, they perform poorly on steerability. Simple image-based steering mechanisms achieved more than 2x improvement on the benchmark.

Conclusion: Current generative models need significant improvement in steerability, and the proposed benchmark provides a way to measure and guide such improvements, with initial results showing steerability can be substantially enhanced.

Abstract: How should we evaluate the quality of generative models? Many existing metrics focus on a model’s producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical decomposition for quantifying steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user’s goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in user studies of text-to-image and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. These results suggest that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: simple image-based steering mechanisms achieve more than 2x improvement on this benchmark.

[410] Evolutionary Policy Optimization

Jianren Wang, Yifan Su, Abhinav Gupta, Deepak Pathak

Main category: cs.LG

TL;DR: EPO is a hybrid algorithm combining Evolutionary Algorithms’ scalability and diversity with policy gradients’ performance and stability, achieving superior results in sample efficiency, asymptotic performance, and scalability across various tasks.

Motivation: To address the limitations of on-policy RL algorithms (poor scalability with batch size due to redundant data) and Evolutionary Algorithms (sample inefficiency) by creating a hybrid approach that leverages the strengths of both methods.

Method: Maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent.

Result: Outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability across dexterous manipulation, legged locomotion, and classic control tasks.

Conclusion: EPO successfully combines the scalability and diversity of Evolutionary Algorithms with the performance and stability of policy gradients, demonstrating superior performance across multiple domains.

Abstract: On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.

[411] ReactionTeam: Teaming Experts for Divergent Thinking Beyond Typical Reaction Patterns

Taicheng Guo, Changsheng Ma, Xiuying Chen, Bozhao Nan, Kehan Guo, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

Main category: cs.LG

TL;DR: ReactionTeam uses multiple expert models to predict diverse chemical reaction outcomes, addressing limitations of single-model approaches that miss rare but important reaction patterns.

Motivation: Traditional generative models overlook the stochastic nature of chemical reactions and only predict common outcomes, missing rare but potentially innovative reaction patterns that could advance synthesis techniques.

Method: A team of specialized expert models, each trained to capture distinct electron redistribution patterns, combined with a ranking expert that evaluates and orders the generated predictions.

Result: Experimental results across two widely used datasets show significantly better performance compared to existing state-of-the-art approaches in various data settings.

Conclusion: The proposed ReactionTeam framework effectively captures diverse plausible reaction outcomes by mimicking chemists’ divergent thinking, overcoming limitations of single-model prediction systems.

Abstract: Reaction prediction, a critical task in synthetic chemistry, is to predict the outcome of a reaction based on given reactants. Generative models like Transformer have typically been employed to predict the reaction product. However, these likelihood-maximization models overlooked the inherent stochastic nature of chemical reactions, such as the multiple ways electrons can be redistributed among atoms during the reaction process. In scenarios where similar reactants could follow different electron redistribution patterns, these models typically predict the most common outcomes, neglecting less frequent but potentially crucial reaction patterns. These overlooked patterns, though rare, can lead to innovative methods for designing synthetic routes and significantly advance synthesis techniques. To address these limitations, we build a team of expert models to capture diverse plausible reaction outcomes for the same reactants, mimicking the divergent thinking of chemists. The proposed framework, ReactionTeam, is composed of specialized expert models, each trained to capture a distinct type of electron redistribution pattern in reaction, and a ranking expert that evaluates and orders the generated predictions. Experimental results across two widely used datasets and different data settings demonstrate that our proposed method achieves significantly better performance compared to existing state-of-the-art approaches.
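
The expert-team interface can be sketched generically: each specialized expert proposes candidate products, the pool is de-duplicated, and a ranking expert orders it. Everything below (the SMILES-style strings, the length-based ranker) is a placeholder, not the paper's actual models.

```python
def team_predict(reactants, experts, ranker, top_k=3):
    """Collect candidate products from each specialized expert, then let a
    ranking expert order the de-duplicated pool (hypothetical interface)."""
    candidates = []
    for expert in experts:
        candidates.extend(expert(reactants))
    candidates = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    return sorted(candidates, key=ranker, reverse=True)[:top_k]

# Two toy experts with overlapping predictions; a toy ranker scores by length.
experts = [lambda r: ["CCO", "CC"], lambda r: ["CCO", "CCCC"]]
products = team_predict("CC=O.CO", experts, ranker=len)
```

The point of the team structure is that a rare-pattern expert's candidate survives into the ranked list instead of being averaged away by a single likelihood-maximizing model.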

[412] A Causal Framework to Measure and Mitigate Non-binary Treatment Discrimination

Ayan Majumdar, Deborah D. Kanubala, Kavya Gupta, Isabel Valera

Main category: cs.LG

TL;DR: This paper proposes a causal framework for fairness analysis that incorporates non-binary treatment decisions (e.g., bail conditions, loan terms) rather than just binary classifications, and demonstrates its effectiveness in mitigating treatment discrimination in loan approval datasets.

Motivation: Current fairness studies oversimplify complex decision processes as binary classification tasks, ignoring non-binary treatment decisions that influence downstream outcomes and should be central to fairness analyses.

Method: A causal framework that distinguishes between decision-subjects’ covariates and treatment decisions, enabling measurement of treatment disparity and counterfactual reasoning to mitigate unfair treatment decisions.

Result: Empirical analysis of four loan approval datasets revealed potential disparity in non-binary treatment decisions and their discriminatory impact, while interventions using the framework effectively mitigated treatment discrimination.

Conclusion: Treatment decisions should be incorporated in fairness assessments, and the proposed framework enables fair risk score estimation and non-binary decision-making processes that benefit all stakeholders.

Abstract: Fairness studies of algorithmic decision-making systems often simplify complex decision processes, such as bail or loan approvals, into binary classification tasks. However, these approaches overlook that such decisions are not inherently binary (e.g., approve or not approve bail or loan); they also involve non-binary treatment decisions (e.g., bail conditions or loan terms) that can influence the downstream outcomes (e.g., loan repayment or reoffending). In this paper, we argue that non-binary treatment decisions are integral to the decision process and controlled by decision-makers and, therefore, should be central to fairness analyses in algorithmic decision-making. We propose a causal framework that extends fairness analyses and explicitly distinguishes between decision-subjects’ covariates and the treatment decisions. This specification allows decision-makers to use our framework to (i) measure treatment disparity and its downstream effects in historical data and, using counterfactual reasoning, (ii) mitigate the impact of past unfair treatment decisions when automating decision-making. We use our framework to empirically analyze four widely used loan approval datasets to reveal potential disparity in non-binary treatment decisions and their discriminatory impact on outcomes, highlighting the need to incorporate treatment decisions in fairness assessments. Moreover, by intervening in treatment decisions, we show that our framework effectively mitigates treatment discrimination from historical data to ensure fair risk score estimation and (non-binary) decision-making processes that benefit all stakeholders.

[413] Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

Qiwei Di, Jiafan He, Quanquan Gu

Main category: cs.LG

TL;DR: The paper proposes a robust contextual dueling bandits algorithm that handles adversarial feedback in preference learning, achieving nearly minimax optimal regret bounds and outperforming state-of-the-art methods.

Motivation: Human feedback is crucial for aligning generative models like LLMs, but adversaries can provide misleading preferences to manipulate outputs in harmful directions. Existing methods are vulnerable to such adversarial attacks.

Method: Proposes robust contextual dueling bandits algorithm based on uncertainty-weighted maximum likelihood estimation. For sigmoid link functions, develops a novel method that considers local derivatives in MLE analysis to eliminate κ dependence in the leading term.

Result: Achieves Õ(d√T/κ + dC/κ) regret bound where T is rounds, d is context dimension, κ is link function derivative lower bound, and C is adversarial feedback count. Proves this is nearly optimal. Experimental results show superiority over state-of-the-art algorithms.

Conclusion: The work provides the first nearly minimax optimal regret bound for dueling bandits with adversarial feedback, with significant improvements in handling adversarial manipulation of preference learning for model alignment.

Abstract: Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain–contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm, namely robust contextual dueling bandits, which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}/\kappa+dC/\kappa)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, $\kappa$ is the lower bound of the derivative of the link function, and $0 \le C \le T$ is the total number of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, for the sigmoid link function, we develop a novel algorithm that takes into account the effect of local derivatives in maximum likelihood estimation (MLE) analysis through a refined method for estimating the link function’s derivative. This method helps us to eliminate the $\kappa$ dependence in the leading term with respect to $T$, which reduces the exponential dependence on the parameter radius $B$ to a polynomial dependence. We conduct experiments to evaluate our proposed algorithm against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.

[414] Adaptive Data Analysis for Growing Data

Neil G. Marchant, Benjamin I. P. Rubinstein

Main category: cs.LG

TL;DR: First generalization bounds for adaptive analysis on dynamic data, allowing queries to be scheduled based on current data size and incorporating time-varying accuracy bounds.

Motivation: Address the gap in existing work that assumes static data, enabling adaptive analysis on growing datasets while maintaining statistical validity.

Method: Allow analysts to adaptively schedule queries conditioned on current data size, previous queries and responses, using time-varying empirical accuracy bounds and mechanisms.

Result: Asymptotic data requirements grow with the square root of the number of adaptive queries, matching prior improvements over data splitting in the static setting.

Conclusion: The approach empirically outperforms static baselines when instantiated with statistical queries using clipped Gaussian mechanism, providing tighter guarantees as data accumulates.

Abstract: Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting, achieving worst-case generalization guarantees with asymptotically optimal data requirements. However, such past work assumes data is static and cannot accommodate situations where data grows over time. In this paper we address this gap, presenting the first generalization bounds for adaptive analysis on dynamic data. We allow the analyst to adaptively schedule their queries conditioned on the current size of the data, in addition to previous queries and responses. We also incorporate time-varying empirical accuracy bounds and mechanisms, allowing for tighter guarantees as data accumulates. In a batched query setting, the asymptotic data requirements of our bound grows with the square-root of the number of adaptive queries, matching prior works’ improvement over data splitting for the static setting. We instantiate our bound for statistical queries with the clipped Gaussian mechanism, where it empirically outperforms baselines composed from static bounds.
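
The clipped Gaussian mechanism used to instantiate the bound has a standard shape: clip each record's contribution, average, and add Gaussian noise. A minimal sketch, with the noise scale `sigma` left as a free parameter rather than calibrated to a privacy budget as the full analysis would require:

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_gaussian_answer(data, query, clip=1.0, sigma=0.1):
    """Answer a statistical (mean) query: clip each per-record value to
    [-clip, clip], average, then add Gaussian noise."""
    vals = np.clip([query(x) for x in data], -clip, clip)
    return float(vals.mean() + rng.normal(0.0, sigma))

# With sigma = 0 the answer is just the clipped mean: 2.0 and -3.0 are
# clipped to +1.0 and -1.0 before averaging.
ans = clipped_gaussian_answer([0.5, 2.0, -3.0], lambda x: x, sigma=0.0)
```

In the growing-data setting of the paper, the clip bound and noise scale can shrink as the dataset accumulates, which is what yields the tighter time-varying guarantees.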

[415] Mixture of Scope Experts at Test: Generalizing Deeper Graph Neural Networks with Shallow Variants

Gangda Deng, Hongkuan Zhou, Rajgopal Kannan, Viktor Prasanna

Main category: cs.LG

TL;DR: Proposes Moscat, a method to improve deeper GNN performance on heterophilous graphs by mixing scope experts at test time, addressing the generalization disparity across nodes with different homophily levels.

Motivation: Heterophilous graphs challenge GNNs as dissimilar nodes connect. While deeper GNNs can find homophily in higher-order neighborhoods, they suffer from performance degradation despite better expressivity.

Method: Moscat (Mixture of scope experts at test) improves deeper GNN generalization by combining models with different receptive field scopes, addressing the disparity in generalization patterns across nodes with varying homophily levels.

Result: Experimental results show Moscat significantly improves accuracy across various GNNs and datasets while maintaining flexibility.

Conclusion: Moscat effectively enhances deeper GNN performance on heterophilous graphs by leveraging the complementary generalization patterns of models with different depths.

Abstract: Heterophilous graphs, where dissimilar nodes tend to connect, pose a challenge for graph neural networks (GNNs). Increasing the GNN depth can expand the scope (i.e., receptive field), potentially finding homophily from the higher-order neighborhoods. However, GNNs suffer from performance degradation as depth increases. Despite having better expressivity, state-of-the-art deeper GNNs achieve only marginal improvements compared to their shallow variants. Through theoretical and empirical analysis, we systematically demonstrate a shift in GNN generalization preferences across nodes with different homophily levels as depth increases. This creates a disparity in generalization patterns between GNN models with varying depth. Based on these findings, we propose to improve deeper GNN generalization while maintaining high expressivity by Mixture of scope experts at test (Moscat). Experimental results show that Moscat works flexibly with various GNNs across a wide range of datasets while significantly improving accuracy. Our code is available at (https://github.com/Hydrapse/moscat).
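
One simple way to picture a test-time mixture of scope experts: given class probabilities from experts with different receptive-field depths, gate per node — here by taking each node's most confident expert, though the paper's actual gating rule may differ.

```python
import numpy as np

def mixture_of_scopes(probs_per_scope):
    """Per-node expert selection: for each node, keep the prediction of the
    scope expert that is most confident on that node (toy gating rule)."""
    probs = np.stack(probs_per_scope)  # (n_scopes, n_nodes, n_classes)
    conf = probs.max(-1)               # confidence of each expert per node
    best = conf.argmax(0)              # chosen expert index per node
    return probs[best, np.arange(probs.shape[1])].argmax(-1)

# Shallow expert is confident on node 0, deep expert on node 1.
shallow = np.array([[0.9, 0.1], [0.4, 0.6]])
deep = np.array([[0.6, 0.4], [0.1, 0.9]])
preds = mixture_of_scopes([shallow, deep])
```

This mirrors the paper's finding that shallow and deep variants generalize well on different nodes, so routing per node beats committing to one depth.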

[416] Repetitive Contrastive Learning Enhances Mamba’s Selectivity in Time Series Prediction

Wenbo Yan, Hanzhong Cao, Ying Tan

Main category: cs.LG

TL;DR: RCL is a token-level contrastive pretraining framework that enhances Mamba’s selective capabilities for long sequence prediction by using sequence augmentation and contrastive learning to prioritize information-rich time steps while suppressing noise.

Motivation: Mamba-based models struggle with insufficient focus on critical time steps and incomplete noise suppression due to limited selective abilities, which hinders their performance in long sequence prediction tasks.

Method: Repetitive Contrastive Learning (RCL) pretrains a single Mamba block using sequence augmentation with Gaussian noise and applies inter-sequence and intra-sequence contrastive learning to enhance selective capabilities, then transfers these pretrained parameters to initialize Mamba blocks in backbone models.

Result: Extensive experiments show RCL consistently boosts backbone model performance, surpassing existing methods and achieving state-of-the-art results in temporal prediction tasks.

Conclusion: RCL effectively enhances Mamba’s selective capabilities, with proposed metrics providing theoretical, qualitative, and quantitative evidence for the improvements in long sequence prediction performance.

Abstract: Long sequence prediction is a key challenge in time series forecasting. While Mamba-based models have shown strong performance due to their sequence selection capabilities, they still struggle with insufficient focus on critical time steps and incomplete noise suppression, caused by limited selective abilities. To address this, we introduce Repetitive Contrastive Learning (RCL), a token-level contrastive pretraining framework aimed at enhancing Mamba’s selective capabilities. RCL pretrains a single Mamba block to strengthen its selective abilities and then transfers these pretrained parameters to initialize Mamba blocks in various backbone models, improving their temporal prediction performance. RCL uses sequence augmentation with Gaussian noise and applies inter-sequence and intra-sequence contrastive learning to help the Mamba module prioritize information-rich time steps while ignoring noisy ones. Extensive experiments show that RCL consistently boosts the performance of backbone models, surpassing existing methods and achieving state-of-the-art results. Additionally, we propose two metrics to quantify Mamba’s selective capabilities, providing theoretical, qualitative, and quantitative evidence for the improvements brought by RCL.
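
The token-level contrastive objective can be illustrated with a standard InfoNCE-style loss: a time step's representation is pulled toward its Gaussian-noise-augmented copy and pushed away from other time steps. The sketch below is one reading of that setup, not the paper's exact loss; names are mine.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=0.1):
    """InfoNCE-style loss: attract the noise-augmented copy of a token,
    repel representations of other time steps."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temp
    logits = logits - logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

token = np.array([1.0, 0.0])
noisy_copy = token + 0.01 * np.array([0.3, -0.2])  # Gaussian-style perturbation
other_steps = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
loss = info_nce(token, noisy_copy, other_steps)
```

A near-identical positive and dissimilar negatives drive the loss toward zero, which is the pressure that teaches the Mamba block to keep information-rich steps and discard noisy ones.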

[417] ExDBN: Learning Dynamic Bayesian Networks using Extended Mixed-Integer Programming Formulations

Pavel Rytir, Ales Wodecki, Georgios Korpas, Jakub Marecek

Main category: cs.LG

TL;DR: A novel score-based learning algorithm for dynamic causal Bayesian networks using mixed-integer quadratic programming with branch-and-cut methods to avoid exponential acyclicity constraints.

Motivation: To extend causal learning from static to dynamic settings by capturing temporal dependencies in time series data, addressing the need for accurate causal discovery in domains like bioscience and finance.

Method: Formulated a mixed-integer quadratic program for score-based learning of dynamic Bayesian networks, using branch-and-cut methods to efficiently handle acyclicity constraints without pre-generating exponentially many constraints.

Result: The proposed approach produces more accurate results than state-of-the-art methods on synthetic instances with up to 80 time series, and demonstrates practical utility in bioscience and finance applications.

Conclusion: The method provides a highly accurate, globally convergent solver for dynamic causal learning that handles modest-sized instances effectively, showing importance for real-world applications.

Abstract: Causal learning from data has received much attention recently. Bayesian networks can be used to capture causal relationships. There, one recovers a weighted directed acyclic graph in which random variables are represented by vertices, and the weights associated with each edge represent the strengths of the causal relationships between them. This concept is extended to capture dynamic effects by introducing a dependency on past data, which may be captured by the structural equation model. This formalism is utilized in the present contribution to propose a score-based learning algorithm. A mixed-integer quadratic program is formulated and an algorithmic solution proposed, in which the pre-generation of exponentially many acyclicity constraints is avoided by utilizing the so-called branch-and-cut (“lazy constraint”) method. Comparing the novel approach to the state-of-the-art, we show that the proposed approach turns out to produce more accurate results when applied to small and medium-sized synthetic instances containing up to 80 time series. Lastly, two interesting applications in bioscience and finance, to which the method is directly applied, further stress the importance of developing highly accurate, globally convergent solvers that can handle instances of modest size.
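
The lazy-constraint idea can be sketched as a cutting-plane loop: solve a relaxation that ignores acyclicity, find a violated cycle, add a cut forbidding it, and repeat. The sketch below replaces the MIQP solver with a greedy stand-in (take every positive-score edge); all names and scores are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
score = rng.normal(size=(n, n))          # hypothetical edge scores (higher = better)
np.fill_diagonal(score, -np.inf)         # no self-loops

def find_cycle(edges, n):
    """Return one directed cycle as a list of edges, or None if acyclic."""
    adj = {i: [j for (a, j) in edges if a == i] for i in range(n)}
    color, stack = {}, []

    def dfs(u):
        color[u] = 1
        stack.append(u)
        for v in adj[u]:
            if color.get(v) == 1:                    # back edge: cycle found
                cyc = stack[stack.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if v not in color:
                found = dfs(v)
                if found:
                    return found
        color[u] = 2
        stack.pop()
        return None

    for s in range(n):
        if s not in color:
            found = dfs(s)
            if found:
                return found
    return None

cuts = []                                 # each cut: a cycle whose edges may not co-occur
while True:
    # Stand-in "relaxation": take every positive-score edge, then apply the cuts
    chosen = {(i, j) for i in range(n) for j in range(n) if score[i, j] > 0}
    for cut in cuts:
        if cut <= chosen:                 # cut violated: drop the cycle's weakest edge
            chosen.discard(min(cut, key=lambda e: score[e]))
    cycle = find_cycle(chosen, n)
    if cycle is None:
        break
    cuts.append(set(cycle))               # lazy constraint: forbid this cycle

assert find_cycle(chosen, n) is None      # final graph is acyclic
```

Because each found cycle gets its own cut and can never reappear intact, the loop terminates after at most as many iterations as there are simple cycles, instead of enumerating all of them up front.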

[418] A Physics-Constrained Neural Differential Equation Framework for Data-Driven Snowpack Simulation

Andrew Charbonneau, Katherine Deck, Tapio Schneider

Main category: cs.LG

TL;DR: Physics-constrained neural differential equation framework for snow depth prediction that achieves under 9% median error and generalizes to unseen sites while satisfying physical constraints.

Motivation: To develop a snow depth parameterization model that can generalize to new sites not seen during training, which traditional calibrated snow models often fail to do, while maintaining physical consistency.

Method: Physics-constrained neural differential equation framework trained on multiple SNOTEL sites data to model seasonal snow depth evolution given hydrometeorological forcings.

Result: Predicts daily snow depth with under 9% median error and Nash Sutcliffe Efficiencies over 0.94 across various snow climates. Generalizes to new unseen sites. Adding snow water equivalent prediction increases error to ~12%.

Conclusion: The approach guarantees physical constraint satisfaction, enables constraint enforcement during training, allows modeling at different temporal resolutions without retraining, and has potential for climate modeling and other dynamical systems with physical constraints.

Abstract: This paper presents a physics-constrained neural differential equation framework for parameterization, and employs it to model the time evolution of seasonal snow depth given hydrometeorological forcings. When trained on data from multiple SNOTEL sites, the parameterization predicts daily snow depth with under 9% median error and Nash Sutcliffe Efficiencies over 0.94 across a wide variety of snow climates. The parameterization also generalizes to new sites not seen during training, which is not often true for calibrated snow models. Requiring the parameterization to predict snow water equivalent in addition to snow depth only increases error to ~12%. The structure of the approach guarantees the satisfaction of physical constraints, enables these constraints during model training, and allows modeling at different temporal resolutions without additional retraining of the parameterization. These benefits hold potential in climate modeling, and could extend to other dynamical systems with physical constraints.
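
The constraint-satisfaction guarantee can be illustrated structurally: if melt can never exceed the snow that exists, depth stays non-negative by construction, whatever the (learned) rates are. Below is a toy explicit-Euler sketch with made-up forcings and parameters, not the paper's parameterization.

```python
import numpy as np

def step_snow_depth(h, snowfall, temp, dt=1.0, melt_rate=0.05):
    """One explicit-Euler step of a toy snow-depth ODE. The structure alone
    guarantees h >= 0: snowfall only adds, and melt is capped so it can
    never remove more snow than exists."""
    accumulation = max(snowfall, 0.0)
    melt = melt_rate * max(temp, 0.0)          # melt only above 0 degC
    dh = accumulation - min(melt, h / dt)      # cannot melt below zero
    return h + dt * dh

# Synthetic season: cold, snowy first 80 days, then a warm melt-out
rng = np.random.default_rng(2)
h, depths = 0.0, []
for day in range(120):
    snowfall = max(rng.normal(0.02, 0.02), 0.0) if day < 80 else 0.0
    temp = -5.0 if day < 80 else 8.0
    h = step_snow_depth(h, snowfall, temp)
    depths.append(h)

assert min(depths) >= 0.0    # the physical constraint holds at every step
assert depths[-1] == 0.0     # the pack melts out completely
```

In the paper a neural network supplies the rates, but the same capping structure keeps the constraint satisfied during both training and inference, at any time step size.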

[419] Trustworthy Transfer Learning: A Survey

Jun Wu, Jingrui He

Main category: cs.LG

TL;DR: This paper provides a comprehensive review of trustworthy transfer learning, focusing on knowledge transferability measurement/enhancement and trustworthiness aspects like adversarial robustness, fairness, and privacy.

Motivation: To understand transfer learning from perspectives of knowledge transferability and trustworthiness, addressing how to quantitatively measure/enhance transferability and whether we can trust transferred knowledge.

Method: Comprehensive review approach covering problem definitions, theoretical analysis, empirical algorithms, and real-world applications; summarizes recent theories and algorithms for understanding knowledge transferability under IID and non-IID assumptions.

Result: The review synthesizes current advancements in trustworthy transfer learning, including theories for knowledge transferability and impacts of trustworthiness factors like adversarial robustness, algorithmic fairness, and privacy constraints.

Conclusion: Beyond current advancements, the paper highlights open questions and future directions for understanding transfer learning in a reliable and trustworthy manner.

Abstract: Transfer learning aims to transfer knowledge or information from a source domain to a relevant target domain. In this paper, we understand transfer learning from the perspectives of knowledge transferability and trustworthiness. This involves two research questions: How is knowledge transferability quantitatively measured and enhanced across domains? Can we trust the transferred knowledge in the transfer learning process? To answer these questions, this paper provides a comprehensive review of trustworthy transfer learning from various aspects, including problem definitions, theoretical analysis, empirical algorithms, and real-world applications. Specifically, we summarize recent theories and algorithms for understanding knowledge transferability under (within-domain) IID and non-IID assumptions. In addition to knowledge transferability, we review the impact of trustworthiness on transfer learning, e.g., whether the transferred knowledge is adversarially robust or algorithmically fair, how to transfer the knowledge under privacy-preserving constraints, etc. Beyond discussing the current advancements, we highlight the open questions and future directions for understanding transfer learning in a reliable and trustworthy manner.

[420] AutoG: Towards automatic graph construction from tabular data

Zhikai Chen, Han Xie, Jian Zhang, Xiang song, Jiliang Tang, Huzefa Rangwala, George Karypis

Main category: cs.LG

TL;DR: This paper addresses the understudied problem of constructing graphs from tabular data for graph machine learning, proposing AutoG, an LLM-based solution that automatically generates high-quality graph schemas without human intervention.

Motivation: Graph machine learning has focused on developing powerful models but overlooked the crucial initial step of constructing suitable graphs from common data formats like tabular data. This construction process remains understudied and lacks formalization.

Method: Proposes AutoG, an LLM-based solution that automatically generates high-quality graph schemas from tabular data without human intervention. Also introduces dedicated datasets to formalize and evaluate graph construction methods.

Result: Experimental results show that graph quality is critical to downstream task performance, and AutoG can generate high-quality graphs that rival those produced by human experts.

Conclusion: The paper successfully formalizes the graph construction problem and demonstrates that automated graph construction from tabular data is feasible and effective, with AutoG matching human expert performance.

Abstract: Recent years have witnessed significant advancements in graph machine learning (GML), with its applications spanning numerous domains. However, the focus of GML has predominantly been on developing powerful models, often overlooking a crucial initial step: constructing suitable graphs from common data formats, such as tabular data. This construction process is fundamental to applying graph-based models, yet it remains largely understudied and lacks formalization. Our research aims to address this gap by formalizing the graph construction problem and proposing an effective solution. We identify two critical challenges to achieve this goal: 1. The absence of dedicated datasets to formalize and evaluate the effectiveness of graph construction methods, and 2. Existing automatic construction methods can only be applied to some specific cases, while tedious human engineering is required to generate high-quality graphs. To tackle these challenges, we present a two-fold contribution. First, we introduce a set of datasets to formalize and evaluate graph construction methods. Second, we propose an LLM-based solution, AutoG, automatically generating high-quality graph schemas without human intervention. The experimental results demonstrate that the quality of constructed graphs is critical to downstream task performance, and AutoG can generate high-quality graphs that rival those produced by human experts. Our code is accessible at https://github.com/amazon-science/Automatic-Table-to-Graph-Generation.
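
The table-to-graph step can be sketched as applying a schema that maps foreign-key columns to typed edges. In AutoG the schema is proposed by an LLM; in this sketch it is hand-written, and all table, column, and edge-type names are illustrative.

```python
# Toy tables; all table, column, and edge-type names are illustrative.
users    = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
products = [{"product_id": 10, "title": "lamp"}]
orders   = [
    {"order_id": 100, "user_id": 1, "product_id": 10},
    {"order_id": 101, "user_id": 2, "product_id": 10},
]
tables = {"users": users, "products": products, "orders": orders}

# Schema: (source table, foreign-key column) -> (target node type, edge type).
# This is the artifact an LLM would be asked to produce.
schema = {
    ("orders", "user_id"):    ("user", "placed_by"),
    ("orders", "product_id"): ("product", "contains"),
}

def build_graph(tables, schema):
    """Materialize typed edges from foreign-key columns per the schema."""
    edges = []
    for (table, fk), (dst_type, edge_type) in schema.items():
        src_type = table[:-1]                     # naive singularization
        for row in tables[table]:
            src = (src_type, row[src_type + "_id"])
            dst = (dst_type, row[fk])
            edges.append((src, edge_type, dst))
    return edges

edges = build_graph(tables, schema)
assert (("order", 100), "placed_by", ("user", 1)) in edges
assert len(edges) == 4                            # 2 foreign keys x 2 order rows
```

Different schema choices yield different graphs over the same tables, which is exactly why the paper finds graph quality so consequential for downstream performance.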

[421] RefiDiff: Progressive Refinement Diffusion for Efficient Missing Data Imputation

Md Atik Ahamed, Qiang Ye, Qiang Cheng

Main category: cs.LG

TL;DR: RefiDiff is a novel framework for imputing missing values in high-dimensional mixed-type datasets under MNAR conditions, combining local ML predictions with a Mamba-based denoising network to capture long-range dependencies efficiently.

Motivation: Existing methods struggle with MNAR mechanisms and high-dimensional data, failing to integrate local and global characteristics effectively, which limits performance in complex missingness scenarios.

Method: Combines local machine learning predictions with a Mamba-based denoising network that efficiently captures long-range dependencies among features and samples. Uses pre-refinement for initial imputations and post-refinement to polish results, encoding mixed-type data into unified tokens.

Result: Outperforms state-of-the-art methods across various missing-value settings, shows strong performance in MNAR scenarios, and demonstrates superior out-of-sample generalization on nine real-world datasets.

Conclusion: RefiDiff provides a robust, scalable, and effective solution for handling complex missingness patterns in high-dimensional mixed-type data, bridging predictive and generative imputation paradigms without requiring architectural or hyperparameter tuning.

Abstract: Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation, particularly under Missing Not At Random (MNAR) mechanisms. Existing methods struggle to integrate local and global data characteristics, limiting performance in MNAR and high-dimensional settings. We propose an innovative framework, RefiDiff, combining local machine learning predictions with a novel Mamba-based denoising network efficiently capturing long-range dependencies among features and samples with low computational complexity. RefiDiff bridges the predictive and generative paradigms of imputation, leveraging pre-refinement for initial warm-up imputations and post-refinement to polish results, enhancing stability and accuracy. By encoding mixed-type data into unified tokens, RefiDiff enables robust imputation without architectural or hyperparameter tuning. RefiDiff outperforms state-of-the-art (SOTA) methods across missing-value settings, demonstrating strong performance in MNAR settings and superior out-of-sample generalization. Extensive evaluations on nine real-world datasets demonstrate its robustness, scalability, and effectiveness in handling complex missingness patterns.
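
The pre-/post-refinement split can be illustrated with a linear stand-in for the learned denoiser: warm-up imputations from column means, then iterative refinement by regressing each column's missing entries on the other column. This is a hedged numpy sketch, not the paper's Mamba-based network.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two strongly correlated columns with ~20% of entries missing
n = 200
x0 = rng.normal(size=n)
X = np.stack([x0, 2 * x0 + 0.1 * rng.normal(size=n)], axis=1)
mask = rng.random(X.shape) < 0.2                  # True = missing
X_obs = np.where(mask, np.nan, X)

# Pre-refinement: warm-up imputation with column means
X_hat = X_obs.copy()
col_means = np.nanmean(X_obs, axis=0)
for j in range(2):
    X_hat[mask[:, j], j] = col_means[j]

# Post-refinement: repeatedly re-predict each column's missing entries from
# the other column by least squares (the paper uses a learned denoiser here)
for _ in range(10):
    for j in range(2):
        A = np.stack([X_hat[:, 1 - j], np.ones(n)], axis=1)
        coef, *_ = np.linalg.lstsq(A[~mask[:, j]], X_obs[~mask[:, j], j], rcond=None)
        X_hat[mask[:, j], j] = A[mask[:, j]] @ coef

err_warm = np.abs(col_means[0] - X[mask[:, 0], 0]).mean()
err_refined = np.abs(X_hat[mask[:, 0], 0] - X[mask[:, 0], 0]).mean()
assert err_refined < err_warm                     # refinement beats the warm start
```

The warm start gives the refiner something stable to polish; RefiDiff's contribution is replacing the linear step here with a denoising network that captures long-range dependencies across features and samples.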

[422] Graph Contrastive Learning for Connectome Classification

Martín Schmidt, Sara Silva, Federico Larroca, Gonzalo Mateos, Pablo Musé

Main category: cs.LG

TL;DR: This paper proposes a supervised contrastive learning framework using graph neural networks to generate connectome embeddings that effectively classify subjects by gender using structural and functional brain connectivity data.

Motivation: To advance graph signal processing in brain network analysis by developing methods that better capture the relationship between brain structure and function, with potential applications in precision medicine and neurodegeneration diagnosis.

Method: Uses a graph neural network Encoder-Decoder architecture with supervised contrastive learning and data augmentation to generate subject-level vector representations from structural and functional connectomes.

Result: Achieves state-of-the-art performance in gender classification using Human Connectome Project data, demonstrating effective separation of subjects with different labels.

Conclusion: The proposed connectome-centric framework shows promise for advancing brain function discovery through graph signal processing, with potential impact on understanding neurodegeneration heterogeneity for precision medicine.

Abstract: With recent advancements in non-invasive techniques for measuring brain activity, such as magnetic resonance imaging (MRI), the study of structural and functional brain networks through graph signal processing (GSP) has gained notable prominence. GSP stands as a key tool in unraveling the interplay between the brain’s function and structure, enabling the analysis of graphs defined by the connections between regions of interest – referred to as connectomes in this context. Our work represents a further step in this direction by exploring supervised contrastive learning methods within the realm of graph representation learning. The main objective of this approach is to generate subject-level (i.e., graph-level) vector representations that bring together subjects sharing the same label while separating those with different labels. These connectome embeddings are derived from a graph neural network Encoder-Decoder architecture, which jointly considers structural and functional connectivity. By leveraging data augmentation techniques, the proposed framework achieves state-of-the-art performance in a gender classification task using Human Connectome Project data. More broadly, our connectome-centric methodological advances support the promising prospect of using GSP to discover more about brain function, with potential impact to understanding heterogeneity in the neurodegeneration for precision medicine and diagnosis.
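
The supervised contrastive objective (pull same-label subjects together, push different labels apart) can be sketched as follows. The embeddings are toy 2-D vectors rather than connectome features, and the formulation follows the standard SupCon recipe rather than the paper's exact loss.

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: for each anchor, positives are all
    other samples sharing its label (Khosla et al.-style formulation)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    total = 0.0
    for i in range(n):
        others = np.arange(n) != i
        pos = others & (labels == labels[i])
        logits = sim[i, others] - sim[i, others].max()
        log_prob = logits - np.log(np.exp(logits).sum())
        total += -log_prob[pos[others]].mean()
    return total / n

labels = np.array([0, 0, 1, 1])
z_aligned    = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
z_misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

# Embeddings clustered by label yield a much lower loss
assert supcon_loss(z_aligned, labels) < supcon_loss(z_misaligned, labels)
```

Training the Encoder-Decoder to minimize this loss is what drives subject-level embeddings with the same label to cluster, which the classifier then exploits.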

[423] Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity

Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy

Main category: cs.LG

TL;DR: Polar Sparsity identifies that attention layers become the computational bottleneck at scale, not MLP layers, and introduces Selective Head Attention with efficient GPU kernels to achieve up to 2.2× speedup for LLM inference without accuracy loss.

Motivation: Contextual sparsity shows promise for LLM inference acceleration but doesn't scale to large batch sizes because the union of active neurons approaches dense computation, limiting practical deployment.

Method: Introduces Polar Sparsity concept showing sparsity importance shifts from MLP to Attention layers at scale. Develops Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels that exploit stable head sparsity in attention layers.

Result: Achieves up to 2.2× end-to-end speedups for models like OPT, LLaMA-2 & 3, Qwen, Mistral across various batch sizes and sequence lengths without compromising accuracy.

Conclusion: First work demonstrating contextual sparsity can scale effectively to large batch sizes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems.

Abstract: Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes due to union of active neurons quickly approaching dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly more expensive at scale, while their head sparsity remains stable and batch-invariant. We develop Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels, delivering up to 2.2× end-to-end speedups for models like OPT, LLaMA-2 & 3, Qwen, Mistral across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
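
Selective head attention can be sketched as routing: score the heads, keep only the top-k, and skip the rest entirely. In the sketch below the router scores are random placeholders (the paper learns them), and the computation is plain numpy rather than the authors' sparsity-aware GPU kernels.

```python
import numpy as np

rng = np.random.default_rng(5)

def attention_head(q, k, v):
    """Plain single-head scaled dot-product attention."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return (s / s.sum(axis=-1, keepdims=True)) @ v

H, T, d = 8, 6, 4                            # heads, sequence length, head dim
Q, K, V = (rng.normal(size=(H, T, d)) for _ in range(3))

importance = rng.random(H)                   # placeholder router scores
k_active = 3
active = np.argsort(importance)[-k_active:]  # top-k heads stay on

out = np.zeros((H, T, d))
for h in active:                             # inactive heads are skipped entirely
    out[h] = attention_head(Q[h], K[h], V[h])

dense = np.stack([attention_head(Q[h], K[h], V[h]) for h in range(H)])
assert np.allclose(out[active], dense[active])               # active heads exact
assert np.count_nonzero(out[np.setdiff1d(np.arange(H), active)]) == 0
```

Because the same heads stay active across a batch (head sparsity is batch-invariant), the skipped work survives batching, unlike per-token MLP neuron sparsity.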

[424] Mixture of Message Passing Experts with Routing Entropy Regularization for Node Classification

Xuanze Chen, Jiajun Zhou, Yadong Li, Jinsong Chen, Shanqing Yu, Qi Xuan

Main category: cs.LG

TL;DR: GNNMoE is an entropy-driven mixture of experts framework that enables node-level adaptive representation learning for graph neural networks, addressing performance degradation in heterophilous graph structures.

Motivation: Standard GNNs perform poorly on heterophilous graphs where connected nodes have different features and labels, creating a need for more adaptive approaches that can handle diverse neighborhood contexts.

Method: Decomposes message passing into propagation and transformation operations, integrates them through multiple expert networks with hybrid routing mechanism, and uses routing entropy regularization to dynamically adjust soft weighting and soft top-k routing.

Result: Extensive experiments on 12 benchmark datasets show GNNMoE consistently outperforms state-of-the-art node classification methods while maintaining scalability and interpretability.

Conclusion: Provides a unified and principled approach for achieving fine-grained, personalized node representation learning that adapts to diverse graph structures.

Abstract: Graph neural networks (GNNs) have achieved significant progress in graph-based learning tasks, yet their performance often deteriorates when facing heterophilous structures where connected nodes differ substantially in features and labels. To address this limitation, we propose GNNMoE, a novel entropy-driven mixture of message-passing experts framework that enables node-level adaptive representation learning. GNNMoE decomposes message passing into propagation and transformation operations and integrates them through multiple expert networks guided by a hybrid routing mechanism. And a routing entropy regularization dynamically adjusts soft weighting and soft top-$k$ routing, allowing GNNMoE to flexibly adapt to diverse neighborhood contexts. Extensive experiments on twelve benchmark datasets demonstrate that GNNMoE consistently outperforms SOTA node classification methods, while maintaining scalability and interpretability. This work provides a unified and principled approach for achieving fine-grained, personalized node representation learning.
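
The soft top-$k$ routing and the entropy it regularizes can be sketched directly: a per-node softmax gate restricted to $k$ experts, plus the mean routing entropy that the regularizer acts on. Names and shapes are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(6)

def soft_topk_gate(logits, k):
    """Softmax gate that keeps only the top-k experts per node, renormalized."""
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    drop = np.argsort(w, axis=1)[:, :-k]          # indices of experts to drop
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w / w.sum(axis=1, keepdims=True)

def routing_entropy(w, eps=1e-12):
    """Mean per-node entropy of the routing distribution (the regularized term)."""
    return -np.mean(np.sum(w * np.log(w + eps), axis=1))

n_nodes, n_experts = 5, 4
logits = rng.normal(size=(n_nodes, n_experts))    # a router network outputs these
w = soft_topk_gate(logits, k=2)

assert np.allclose(w.sum(axis=1), 1.0)            # valid mixture weights per node
assert (np.count_nonzero(w, axis=1) <= 2).all()   # at most k active experts
assert routing_entropy(w) <= np.log(2) + 1e-9     # entropy bounded by log k
```

Penalizing or rewarding this entropy during training lets the framework move each node between sharp, single-expert routing and smoother mixtures, which is how it adapts to heterophilous neighborhoods.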

[425] Ultrametric Cluster Hierarchies: I Want ’em All!

Andrew Draganov, Pascal Weber, Rasmus Skibdahl Melanchton Jørgensen, Anna Beer, Claudia Plant, Ira Assent

Main category: cs.LG

TL;DR: This paper shows that for any reasonable hierarchical clustering structure, one can optimally solve center-based clustering objectives (like k-means) over it, and these solutions are themselves hierarchical and can be computed quickly.

Motivation: To extend hierarchical clustering beyond just exploratory analysis by enabling optimal solutions to center-based clustering objectives within any given hierarchy, providing access to multiple meaningful hierarchies from a single cluster tree.

Method: Proving mathematical guarantees that for any reasonable hierarchy, optimal solutions to center-based clustering objectives can be found quickly, and these solutions naturally form hierarchical structures themselves.

Result: The proposed techniques allow quick access to numerous meaningful hierarchies from a single cluster tree, and these methods demonstrate utility across various datasets, hierarchies, and partitioning schemes.

Conclusion: The approach generalizes hierarchical clustering by enabling optimal center-based clustering over any hierarchy, providing fast computation of multiple meaningful hierarchies from which partitions can be chosen, with verified practical utility.

Abstract: Hierarchical clustering is a powerful tool for exploratory data analysis, organizing data into a tree of clusterings from which a partition can be chosen. This paper generalizes these ideas by proving that, for any reasonable hierarchy, one can optimally solve any center-based clustering objective over it (such as $k$-means). Moreover, these solutions can be found exceedingly quickly and are themselves necessarily hierarchical. Thus, given a cluster tree, we show that one can quickly access a plethora of new, equally meaningful hierarchies. Just as in standard hierarchical clustering, one can then choose any desired partition from these new hierarchies. We conclude by verifying the utility of our proposed techniques across datasets, hierarchies, and partitioning schemes.

[426] How Well Can Differential Privacy Be Audited in One Run?

Amit Keinan, Moshe Shenfeld, Katrina Ligett

Main category: cs.LG

TL;DR: This paper analyzes the precision limitations of one-run auditing for ML privacy and proposes methods to minimize interference between data elements to improve auditing efficacy.

Motivation: To understand how precisely one-run auditing can uncover the true privacy parameter of ML algorithms and characterize the barriers to its effectiveness.

Method: Characterizes the maximum achievable efficacy of one-run auditing, identifies interference between data elements as the key barrier, and presents conceptual approaches to minimize this interference.

Result: Shows that interference between observable effects of different data elements is the main limitation for one-run auditing precision.

Conclusion: Proposes new conceptual approaches to reduce interference and improve the performance of one-run auditing for real ML algorithms.

Abstract: Recent methods for auditing the privacy of machine learning algorithms have improved computational efficiency by simultaneously intervening on multiple training examples in a single training run. Steinke et al. (2024) prove that one-run auditing indeed lower bounds the true privacy parameter of the audited algorithm, and give impressive empirical results. Their work leaves open the question of how precisely one-run auditing can uncover the true privacy parameter of an algorithm, and how that precision depends on the audited algorithm. In this work, we characterize the maximum achievable efficacy of one-run auditing and show that the key barrier to its efficacy is interference between the observable effects of different data elements. We present new conceptual approaches to minimize this barrier, towards improving the performance of one-run auditing of real machine learning algorithms.

[427] Solver-Free Decision-Focused Learning for Linear Optimization Problems

Senne Berden, Ali İrfan Mahmutoğulları, Dimos Tsouros, Tias Guns

Main category: cs.LG

TL;DR: Proposes a solver-free training method for predict-then-optimize problems that reduces computational cost while maintaining decision quality by exploiting geometric structure of linear optimization.

Motivation: Decision-focused learning (DFL) is computationally expensive because it requires solving optimization problems at each loss evaluation, creating a bottleneck for training.

Method: Exploits geometric structure of linear optimization by comparing ground-truth optimal solution quality with precomputed adjacent vertices on the feasible polytope, using this comparison as loss function.

Result: Significantly reduces computational cost while maintaining high decision quality in experiments.

Conclusion: The solver-free training method effectively addresses the computational bottleneck in DFL for linear optimization problems through geometric insights.

Abstract: Mathematical optimization is a fundamental tool for decision-making in a wide range of applications. However, in many real-world scenarios, the parameters of the optimization problem are not known a priori and must be predicted from contextual features. This gives rise to predict-then-optimize problems, where a machine learning model predicts problem parameters that are then used to make decisions via optimization. A growing body of work on decision-focused learning (DFL) addresses this setting by training models specifically to produce predictions that maximize downstream decision quality, rather than accuracy. While effective, DFL is computationally expensive, because it requires solving the optimization problem with the predicted parameters at each loss evaluation. In this work, we address this computational bottleneck for linear optimization problems, a common class of problems in both DFL literature and real-world applications. We propose a solver-free training method that exploits the geometric structure of linear optimization to enable efficient training with minimal degradation in solution quality. Our method is based on the insight that a solution is optimal if and only if it achieves an objective value that is at least as good as that of its adjacent vertices on the feasible polytope. Building on this, our method compares the estimated quality of the ground-truth optimal solution with that of its precomputed adjacent vertices, and uses this as loss function. Experiments demonstrate that our method significantly reduces computational cost while maintaining high decision quality.
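
The adjacent-vertex insight translates into a simple hinge loss: under the predicted cost, the ground-truth optimal vertex should score no worse (for minimization) than any of its precomputed neighbors. A minimal sketch on a unit-square polytope follows; the vertices and the exact loss form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def solver_free_loss(c_hat, v_star, adjacent):
    """Hinge loss: penalize predicted costs under which the true optimal vertex
    v_star looks worse (for minimization) than an adjacent vertex. No solver
    call is needed; the neighbors are precomputed once."""
    return sum(max(0.0, c_hat @ v_star - c_hat @ v) for v in adjacent)

# Toy 2-D polytope: the unit square; the true cost makes the origin optimal.
v_star = np.array([0.0, 0.0])
adjacent = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # precomputed neighbors

good_pred = np.array([1.0, 1.0])    # origin is still optimal under this cost
bad_pred = np.array([-1.0, 0.5])    # makes vertex (1, 0) look better

assert solver_free_loss(good_pred, v_star, adjacent) == 0.0
assert solver_free_loss(bad_pred, v_star, adjacent) > 0.0
```

Evaluating this loss costs one dot product per neighbor, which is why removing the per-iteration solver call makes DFL training so much cheaper.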

[428] SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Bo Zhou, Andrew Gritsevskiy, Oufan Zhang, Teresa Head-Gordon

Main category: cs.LG

TL;DR: SynLlama fine-tunes Llama3 LLMs to generate synthetic pathways using accessible building blocks and organic reaction templates, enabling efficient exploration of synthesizable chemical space with strong performance in synthesis planning.

Motivation: Many generative models produce molecules that are too difficult to synthesize, making them impractical for real-world applications. There's a need for models that can generate practical synthetic pathways.

Method: Fine-tuned Meta’s Llama3 Large Language Models to create SynLlama, which generates full synthetic pathways using commonly accessible building blocks and robust organic reaction templates.

Result: SynLlama explores large synthesizable space with less data, shows strong performance in forward and bottom-up synthesis planning, generalizes to unseen purchasable building blocks, and demonstrates practical use in pharmaceutical contexts for analog synthesis and hit expansion.

Conclusion: SynLlama provides medicinal chemists with a valuable tool for discovery by generating practical synthetic pathways and extending reconstruction capabilities beyond training data to broader synthesizable chemical space.

Abstract: Generative machine learning models for exploring chemical space have shown immense promise, but many molecules they generate are too difficult to synthesize, making them impractical for further investigation or development. In this work, we present a novel approach by fine-tuning Meta’s Llama3 Large Language Models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data, and offers strong performance in both forward and bottom-up synthesis planning compared to other state-of-the-art methods. We find that SynLlama, even without training on external building blocks, can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data. We also demonstrate the use of SynLlama in a pharmaceutical context for synthesis planning of analog molecules and hit expansion leads for proposed inhibitors of target proteins, offering medicinal chemists a valuable tool for discovery.

[429] An empirical study of task and feature correlations in the reuse of pre-trained models

Jama Hussein Mohamud, Willie Brink

Main category: cs.LG

TL;DR: This paper investigates why reusing pre-trained neural networks works well, finding that success depends on task correlation and network architecture choices rather than just feature reuse.

Motivation: To understand why Bob succeeds when reusing Alice's pre-trained neural network for different tasks, and identify the factors contributing to this empirical success.

Method: Created an experimental setup to study factors in neural network reuse, testing task correlation effects, network architecture choices, and layer reuse strategies across different scenarios.

Result: Bob’s success depends on task correlation with Alice’s task, network architecture choices, and which layers are reused. Higher task correlation leads to better performance, and even with uncorrelated tasks, Bob can achieve above-random performance due to Alice’s network design.

Conclusion: Task correlation is crucial for effective neural network reuse, and the optimal number of layers to retrain can indicate task/feature correlation. Semantic task correlations enable effective real-world pre-trained network reuse.

Abstract: Pre-trained neural networks are commonly used and reused in the machine learning community. Alice trains a model for a particular task, and a part of her neural network is reused by Bob for a different task, often to great effect. To what can we ascribe Bob’s success? This paper introduces an experimental setup through which factors contributing to Bob’s empirical success could be studied in silico. As a result, we demonstrate that Bob might just be lucky: his task accuracy increases monotonically with the correlation between his task and Alice’s. Even when Bob has provably uncorrelated tasks and input features from Alice’s pre-trained network, he can achieve significantly better than random performance due to Alice’s choice of network and optimizer. When there is little correlation between tasks, only reusing lower pre-trained layers is preferable, and we hypothesize the converse: that the optimal number of retrained layers is indicative of task and feature correlation. Finally, we show in controlled real-world scenarios that Bob can effectively reuse Alice’s pre-trained network if there are semantic correlations between his and Alice’s task.

[430] TAMIS: Tailored Membership Inference Attacks on Synthetic Data

Paul Andrey, Batiste Le Bars, Marc Tommasi

Main category: cs.LG

TL;DR: TAMIS is a novel Membership Inference Attack against differentially-private synthetic data generation methods using graphical models, improving on MAMA-MIA with lower computational cost and less attacker knowledge.

Motivation: To empirically assess the privacy of machine learning algorithms, particularly differentially-private synthetic data generation methods that rely on graphical models.

Method: Two-fold improvement: 1) Recover graphical model from synthetic dataset alone (no shadow-modeling), 2) Introduce mathematically-grounded attack score with natural threshold for binary predictions.

Result: TAMIS achieves better or similar performance as MAMA-MIA on replicas of the SNAKE challenge with lower computational cost.

Conclusion: TAMIS provides an effective and efficient membership inference attack method for assessing privacy in differentially-private synthetic data generation.

Abstract: Membership Inference Attacks (MIAs) make it possible to empirically assess the privacy of a machine learning algorithm. In this paper, we propose TAMIS, a novel MIA against differentially-private synthetic data generation methods that rely on graphical models. This attack builds upon MAMA-MIA, a recently-published state-of-the-art method. It lowers its computational cost and requires less attacker knowledge. Our attack is the product of a two-fold improvement. First, we recover the graphical model having generated a synthetic dataset by using solely that dataset, rather than shadow-modeling over an auxiliary one. This proves less costly and more performant. Second, we introduce a more mathematically-grounded attack score that provides a natural threshold for binary predictions. In our experiments, TAMIS achieves better or similar performance to MAMA-MIA on replicas of the SNAKE challenge.

[431] Integration Matters for Learning PDEs with Backwards SDEs

Sungje Park, Stephen Tu

Main category: cs.LG

TL;DR: BSDE-based deep learning methods underperform PINNs due to discretization bias from Euler-Maruyama integration. A Stratonovich-based BSDE formulation with Heun integration eliminates this bias and achieves competitive performance.

Motivation: Standard BSDE-based solvers empirically underperform PINNs for solving high-dimensional PDEs, particularly in stochastic optimal control settings where PDEs are tied to underlying dynamical systems.

Method: Proposed a Stratonovich-based BSDE formulation implemented with stochastic Heun integration to address the discretization bias introduced by standard Euler-Maruyama integration in one-step self-consistency BSDE losses.

Result: The Heun-based BSDE method completely eliminates bias issues faced by EM integration, consistently outperforms EM-based variants, and achieves competitive results with PINNs across multiple high-dimensional benchmarks.

Conclusion: Integration schemes play a critical role in BSDE-based PDE solvers, and the proposed Heun integration properly handles the discretization bias that has limited previous BSDE methods.

Abstract: Backward stochastic differential equation (BSDE)-based deep learning methods provide an alternative to Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations (PDEs), offering potential algorithmic advantages in settings such as stochastic optimal control, where the PDEs of interest are tied to an underlying dynamical system. However, standard BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature. In this paper, we identify the root cause of this performance gap as a discretization bias introduced by the standard Euler-Maruyama (EM) integration scheme applied to one-step self-consistency BSDE losses, which shifts the optimization landscape off target. We find that this bias cannot be satisfactorily addressed through finer step-sizes or multi-step self-consistency losses. To properly handle this issue, we propose a Stratonovich-based BSDE formulation, which we implement with stochastic Heun integration. We show that our proposed approach completely eliminates the bias issues faced by EM integration. Furthermore, our empirical results show that our Heun-based BSDE method consistently outperforms EM-based variants and achieves competitive results with PINNs across multiple high-dimensional benchmarks. Our findings highlight the critical role of integration schemes in BSDE-based PDE solvers, an algorithmic detail that has received little attention thus far in the literature.
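
The role of the integration scheme can be illustrated on a plain SDE with a known solution (geometric Brownian motion) rather than the paper's BSDE losses: Euler-Maruyama discretizes the Itô form, while stochastic Heun targets the equivalent Stratonovich form (with the standard drift correction). Parameters and step counts below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ito SDE: dX = a X dt + b X dW, with exact per-path solution
#   X_T = X0 * exp((a - b^2/2) T + b W_T).
a, b, X0, T, N = 0.5, 0.2, 1.0, 1.0, 2000
dt = T / N
dW = rng.normal(scale=np.sqrt(dt), size=N)

# The equivalent Stratonovich drift for this SDE is (a - b^2/2) X,
# which is what the Heun (predictor-corrector) scheme integrates.
a_str = a - 0.5 * b ** 2

x_em, x_heun = X0, X0
for dw in dW:
    # Euler-Maruyama on the Ito form.
    x_em = x_em + a * x_em * dt + b * x_em * dw
    # Stochastic Heun on the Stratonovich form.
    pred = x_heun + a_str * x_heun * dt + b * x_heun * dw
    x_heun = x_heun + 0.5 * (a_str * x_heun + a_str * pred) * dt \
                    + 0.5 * (b * x_heun + b * pred) * dw

x_exact = X0 * np.exp((a - 0.5 * b ** 2) * T + b * dW.sum())
err_em = abs(x_em - x_exact) / x_exact
err_heun = abs(x_heun - x_exact) / x_exact
print(err_em, err_heun)
```

Both schemes track the exact path here; the paper's claim is about the bias each induces inside one-step self-consistency BSDE losses, which this toy comparison does not capture.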

[432] RiemannFormer: A Framework for Attention in Curved Spaces

Zhongping Ji

Main category: cs.LG

TL;DR: This paper presents a geometric interpretation of transformer attention mechanisms using concepts from differential geometry, reduces parameters through predefined configurations, and introduces local neighborhood emphasis to address transformers’ lack of local inductive bias.

Motivation: To unlock further potential of transformer architectures by providing geometric interpretation of attention mechanisms and addressing their neglect of local inductive bias.

Method: Frames attention using metric tensors, tangent spaces, inner products, and parallel transport; reduces parameters via predefined configurations; introduces an explicit mechanism that highlights neighborhoods by attenuating remote values.

Result: Experimental results show significant performance improvements relative to baseline models.

Conclusion: The proposed geometric approach and local emphasis mechanisms effectively enhance transformer performance, with further evaluation on visual and large language models planned.

Abstract: This research endeavors to offer insights into unlocking the further potential of transformer-based architectures. One of the primary motivations is to offer a geometric interpretation for the attention mechanism in transformers. In our framework, the attention mainly involves metric tensors, tangent spaces, inner product, and how they relate to each other. These quantities and structures at discrete positions are intricately interconnected via the parallel transport of tangent vectors. To make the learning process more efficient, we reduce the number of parameters through ingenious predefined configurations. Moreover, we introduce an explicit mechanism to highlight a neighborhood by attenuating the remote values, given that transformers inherently neglect local inductive bias. Experimental results demonstrate that our modules deliver significant performance improvements relative to the baseline. More evaluation experiments on visual and large language models will be launched successively.
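
The local-emphasis idea (attenuating remote values so attention gains a local inductive bias) can be sketched as a distance penalty subtracted from attention scores before the softmax. The penalty form and the `lam` parameter are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

def local_attention(scores, lam=0.5):
    """Subtract lam * |i - j| from raw attention scores before softmax,
    attenuating remote positions (a simplified, hypothetical form of a
    locality-emphasis mechanism)."""
    n = scores.shape[0]
    idx = np.arange(n)
    penalty = lam * np.abs(idx[:, None] - idx[None, :])
    z = scores - penalty
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

# With identical content scores, attention mass now decays with distance.
n = 6
w = local_attention(np.zeros((n, n)), lam=1.0)
print(w[0])
```

Given equal content scores, the penalty alone makes each position attend most strongly to itself and progressively less to distant positions.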

[433] Edit Flows: Flow Matching with Edit Operations

Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen

Main category: cs.LG

TL;DR: Edit Flows is a non-autoregressive model that uses discrete flows over sequences via edit operations (insertions, deletions, substitutions) within a Continuous-time Markov Chain framework, enabling flexible variable-length sequence generation.

Motivation: To overcome limitations of non-autoregressive models that impose rigid token-wise structures and struggle with variable-length sequences, unlike autoregressive models which naturally handle variable lengths.

Method: Defines discrete flows over sequences through edit operations within a Continuous-time Markov Chain over sequence space, using an expanded state space with auxiliary variables for efficient training.

Result: Outperforms both autoregressive and mask models on image captioning, and significantly outperforms mask construction in text and code generation tasks.

Conclusion: Edit Flows provides a flexible non-autoregressive approach that better aligns with sequence data structure and achieves superior performance across multiple generation tasks.

Abstract: Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations – insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.
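
The three edit operations are easy to make concrete; the crude discrete-time loop below is only a stand-in for sampling from the paper's continuous-time Markov chain, and the vocabulary and operation probabilities are arbitrary.

```python
import random

# Toy versions of the three edit operations Edit Flows is built on.
def substitute(seq, i, tok):
    return seq[:i] + [tok] + seq[i + 1:]

def insert(seq, i, tok):
    return seq[:i] + [tok] + seq[i:]

def delete(seq, i):
    return seq[:i] + seq[i + 1:]

rng = random.Random(0)
vocab = list("abcde")
seq = list("hello")

# Crude discrete-time stand-in for simulating the edit process: at each
# step, apply one randomly chosen edit at a random position.
for _ in range(10):
    op = rng.choice(["sub", "ins", "del"])
    if op == "sub" and seq:
        seq = substitute(seq, rng.randrange(len(seq)), rng.choice(vocab))
    elif op == "ins":
        seq = insert(seq, rng.randrange(len(seq) + 1), rng.choice(vocab))
    elif op == "del" and seq:
        seq = delete(seq, rng.randrange(len(seq)))

print("".join(seq))
```

Because insertions and deletions change the sequence length, a process over these operations generates variable-length outputs natively, which is the property the abstract contrasts with rigid token-wise (mask) models.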

[434] STOAT: Spatial-Temporal Probabilistic Causal Inference Network

Yang Yang, Du Yin, Hao Xue, Flora Salim

Main category: cs.LG

TL;DR: STOAT is a spatial-temporal probabilistic causal inference network that improves forecasting by incorporating spatial dependencies and causal relationships, outperforming existing methods on COVID-19 data.

Motivation: Existing methods model spatial and temporal dynamics independently and overlook causality-driven probabilistic forecasting, limiting predictive power for spatial-temporal causal time series.

Method: Extends causal inference approach with spatial relation matrix encoding interregional dependencies, processes latent series with deep probabilistic models to estimate distribution parameters, and explores multiple output distributions.

Result: Outperforms state-of-the-art probabilistic forecasting models (DeepAR, DeepVAR, Deep State Space Model) on COVID-19 data across six countries, particularly in regions with strong spatial dependencies.

Conclusion: STOAT bridges causal inference and geospatial probabilistic forecasting, offering a generalizable framework for complex spatial-temporal tasks like epidemic management.

Abstract: Spatial-temporal causal time series (STC-TS) involve region-specific temporal observations driven by causally relevant covariates and interconnected across geographic or network-based spaces. Existing methods often model spatial and temporal dynamics independently and overlook causality-driven probabilistic forecasting, limiting their predictive power. To address this, we propose STOAT (Spatial-Temporal Probabilistic Causal Inference Network), a novel framework for probabilistic forecasting in STC-TS. The proposed method extends a causal inference approach by incorporating a spatial relation matrix that encodes interregional dependencies (e.g. proximity or connectivity), enabling spatially informed causal effect estimation. The resulting latent series are processed by deep probabilistic models to estimate the parameters of the distributions, enabling calibrated uncertainty modeling. We further explore multiple output distributions (e.g., Gaussian, Student’s-$t$, Laplace) to capture region-specific variability. Experiments on COVID-19 data across six countries demonstrate that STOAT outperforms state-of-the-art probabilistic forecasting models (DeepAR, DeepVAR, Deep State Space Model, etc.) in key metrics, particularly in regions with strong spatial dependencies. By bridging causal inference and geospatial probabilistic forecasting, STOAT offers a generalizable framework for complex spatial-temporal tasks, such as epidemic management.
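
The spatial relation matrix idea can be sketched with a row-normalized adjacency: each region's series becomes a convex combination of its neighbors' series before being passed to the probabilistic model. The adjacency below is a made-up 4-region example, not data from the paper.

```python
import numpy as np

# Hypothetical 4-region adjacency encoding proximity/connectivity.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A += np.eye(4)                      # keep each region's own signal
A_norm = A / A.sum(axis=1, keepdims=True)   # rows sum to 1

X = np.arange(4 * 5, dtype=float).reshape(4, 5)   # 4 regions, 5 time steps
X_spatial = A_norm @ X              # spatially informed series
print(X_spatial.shape)
```

Row normalization makes each smoothed value a weighted average of neighboring regions, which is one simple way interregional dependencies can be injected before distribution-parameter estimation.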

[435] Flat Channels to Infinity in Neural Loss Landscapes

Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea

Main category: cs.LG

TL;DR: The paper identifies special channels in neural network loss landscapes where weights diverge to infinity while neurons become gated linear units, appearing as flat minima but actually representing slow gradient descent paths.

Motivation: To understand the structure of neural network loss landscapes, particularly quasi-flat regions that appear as local minima but actually contain channels where weights diverge to infinity.

Method: Analyzed gradient dynamics and geometry of loss landscapes, identifying channels where output weights diverge while input weights become equal, leading to gated linear unit formation.

Result: Found that gradient-based optimizers frequently reach these channels in regression tasks, where they appear flat but actually represent paths to infinity with specific functional interpretations.

Conclusion: These channels reveal surprising computational capabilities of fully connected layers through the emergence of gated linear units, providing a comprehensive understanding of quasi-flat regions in neural network optimization.

Abstract: The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x})\,\sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
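
The limit in the abstract can be checked numerically. One construction consistent with it (an illustrative choice, not necessarily the paper's parametrization) is $a_i = c$, $a_j = 1 - c$, $\mathbf{w_i} = \mathbf{w} + \mathbf{v}/c$, $\mathbf{w_j} = \mathbf{w}$: as $c \to \infty$ the output weights diverge to $\pm\infty$, the input weights coincide, and a first-order Taylor expansion gives the gated-linear-unit limit. Using $\sigma = \tanh$:

```python
import numpy as np

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2   # tanh'(z)

rng = np.random.default_rng(2)
w, v, x = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

def pair_output(c):
    # Two neurons on the channel: a_i = c, a_j = 1 - c,
    # w_i = w + v / c, w_j = w.
    return c * sigma((w + v / c) @ x) + (1 - c) * sigma(w @ x)

# Claimed limit: sigma(w.x) + (v.x) * sigma'(w.x).
limit = sigma(w @ x) + (v @ x) * dsigma(w @ x)
errs = [abs(pair_output(c) - limit) for c in (1e2, 1e3, 1e4)]
print(errs)
```

The gap shrinks roughly like $1/c$ as the output weights grow, matching the "channel to infinity" picture in which finite parameter values only approximate the limiting gated linear unit.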

[436] What Do Latent Action Models Actually Learn?

Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, Jiang Bian

Main category: cs.LG

TL;DR: This paper analyzes whether latent action models (LAMs) learn action-relevant changes or irrelevant noise from unlabeled videos, using a tractable linear model to provide theoretical insights.

Motivation: LAMs learn action-relevant changes from unlabeled videos, but frame differences can be caused by both controllable actions and exogenous noise, raising concerns about what the learned latents actually capture.

Method: The authors develop a tractable linear model that encapsulates LAM learning essence, enabling analytical study of connections to PCA, data policy requirements, and strategies for learning controllable changes.

Result: The analysis reveals connections between LAM and PCA, identifies desiderata for data-generating policies, and justifies strategies like data augmentation, cleaning, and auxiliary action-prediction to encourage learning of controllable changes.

Conclusion: Numerical simulations provide insights into how observation, action, and noise structures influence LAM learning, offering guidance for developing more effective latent action models.

Abstract: Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern – do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning, while being tractable. This provides several insights, including connections between LAM and principal component analysis (PCA), desiderata of the data-generating policy, and justification of strategies to encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action-prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of observations, actions, and noise in data that influence LAM learning.
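
The LAM-PCA connection can be sketched in the linear regime: if frame differences are dominated by low-rank action effects plus small exogenous noise, the top principal directions of the differences recover the action subspace. All dimensions, scales, and the synthetic data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Frame differences = controllable action effects (low-rank, large)
# + exogenous noise (isotropic, small).
d, n, k = 20, 500, 2
U_true, _ = np.linalg.qr(rng.normal(size=(d, k)))   # true action directions
actions = rng.normal(scale=5.0, size=(n, k))
diffs = actions @ U_true.T + 0.1 * rng.normal(size=(n, d))

# A linear LAM amounts to PCA on the differences: the top right singular
# vectors are the learned latent action directions.
_, _, Vt = np.linalg.svd(diffs, full_matrices=False)
U_hat = Vt[:k].T

# Subspace overlap: smallest singular value of U_hat^T U_true
# (1.0 means the latents span the full action subspace).
overlap = np.linalg.svd(U_hat.T @ U_true, compute_uv=False).min()
print(overlap)
```

When the noise scale rivals the action scale, the recovered subspace drifts toward noise directions, which is the concern the paper analyzes.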

[437] The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel

Main category: cs.LG

TL;DR: Causal abstraction becomes trivial when using arbitrarily powerful alignment maps, as any neural network can be mapped to any algorithm, making the concept uninformative without constraints on how models encode information.

Motivation: To critically examine causal abstraction by lifting the linearity constraint typically imposed on alignment maps, revealing that unrestricted causal abstraction is vacuous.

Method: Theoretical analysis proving that any neural network can be mapped to any algorithm under reasonable assumptions, complemented by empirical experiments using randomly initialized language models on the indirect object identification task.

Result: Demonstrated 100% interchange-intervention accuracy even with incapable models, revealing the non-linear representation dilemma and showing causal abstraction becomes trivial without linearity constraints.

Conclusion: Causal abstraction alone is insufficient for mechanistic interpretability; it requires assumptions about how models encode information to be meaningful.

Abstract: The concept of causal abstraction was recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model’s representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps’ complexity and accuracy. Together, these results suggest an answer to our title’s question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

[438] Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Carson Dudley, Reiden Magdaleno, Christopher Harding, Marisa Eisenberg

Main category: cs.LG

TL;DR: Simulation-Grounded Neural Networks (SGNNs) use mechanistic simulations as training data to create interpretable neural networks that bridge the gap between mechanistic models and machine learning, achieving strong performance across scientific domains.

Motivation: To address the tradeoff between mechanistic models (scientifically grounded but limited in complexity) and machine learning models (strong predictive performance but uninterpretable and data-hungry) in scientific modeling.

Method: Pretrain neural networks on synthetic corpora from mechanistic simulations spanning diverse model structures, parameter regimes, stochasticity, and observational artifacts. Enable back-to-simulation attribution for interpretability.

Result: SGNNs nearly tripled COVID-19 forecasting skill versus CDC baselines, reduced chemical yield prediction error by one-third, maintained accuracy in ecological forecasting, accurately classified information spread sources, and estimated COVID-19 transmissibility more accurately than traditional methods.

Conclusion: Simulation-grounded learning provides a unified framework where mechanistic simulations can serve as effective training data for robust, interpretable scientific inference across domains.

Abstract: Scientific modeling faces a tradeoff: mechanistic models provide scientific grounding but struggle with real-world complexity, while machine learning models achieve strong predictive performance but require large labeled datasets and are not interpretable. We introduce Simulation-Grounded Neural Networks (SGNNs), which use mechanistic simulations as training data for neural networks. SGNNs are pretrained on synthetic corpora spanning diverse model structures, parameter regimes, stochasticity, and observational artifacts. Simulation-grounded learning has been applied in multiple domains (e.g., surrogate models in physics, forecasting in epidemiology). We provide a unified framework for simulation-grounded learning and evaluated SGNNs across scientific disciplines and modeling tasks. We found that SGNNs were successful across domains: for prediction tasks, they nearly tripled COVID-19 forecasting skill versus CDC baselines, reduced chemical yield prediction error by one-third, and maintained accuracy in ecological forecasting where task-specific models failed. For inference tasks, SGNNs also accurately classified the source of information spread in simulated social networks and enabled supervised learning for unobservable targets, such as estimating COVID-19 transmissibility more accurately than traditional methods even in early outbreaks. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability. Back-to-simulation attribution matches real-world observations to the training simulations the model considers most similar, identifying which mechanistic processes the model believes best explain the observed data. By providing a unified framework for simulation-grounded learning, we establish when and how mechanistic simulations can serve as effective training data for robust, interpretable scientific inference.
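
The "simulation as supervision" recipe can be sketched with a toy mechanistic simulator: generate many synthetic trajectories under varied parameters and observation artifacts, labelled with an otherwise-unobservable target. The SIR model, parameter ranges, and noise model below are illustrative choices, not the paper's corpus.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_sir(beta, gamma, n_steps=100, dt=0.1):
    """Discrete-time SIR epidemic simulation (a stand-in for the
    mechanistic simulators SGNNs are pretrained on)."""
    s, i, r = 0.99, 0.01, 0.0
    traj = []
    for _ in range(n_steps):
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        traj.append(i)
    return np.array(traj)

# Synthetic corpus: noisy incidence curves labelled with the
# (unobservable) basic reproduction number R0 = beta / gamma.
corpus = []
for _ in range(50):
    beta, gamma = rng.uniform(0.5, 2.0), rng.uniform(0.1, 0.5)
    series = simulate_sir(beta, gamma)
    noisy = series * rng.lognormal(sigma=0.1, size=series.shape)  # observation artifact
    corpus.append((noisy, beta / gamma))

print(len(corpus), corpus[0][0].shape)
```

A network trained on such pairs learns supervised mappings to targets that are never directly observed in real data, which is the inference use-case the abstract highlights (e.g., transmissibility estimation).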

[439] M3-Net: A Cost-Effective Graph-Free MLP-Based Model for Traffic Prediction

Guangyin Jin, Sicong Lai, Xiaoshuai Hao, Mingtao Zhang, Jinlei Zhang

Main category: cs.LG

TL;DR: M3-Net is a graph-free MLP-based model for traffic prediction that uses time series embeddings and a novel MLP-Mixer with mixture of experts, achieving superior performance with lightweight deployment.

Motivation: Existing deep learning methods for traffic prediction either require complete traffic network structures or complex model designs, making them inefficient for large-scale deployment.

Method: Proposes M3-Net using time series and spatio-temporal embeddings with MLP-Mixer architecture and mixture of experts mechanism, eliminating the need for graph structures.

Result: Extensive experiments on multiple real datasets show superior prediction performance and lightweight deployment capabilities compared to existing methods.

Conclusion: M3-Net provides a cost-effective, graph-free solution for traffic prediction that outperforms complex models while being more efficient for large-scale deployment.

Abstract: Achieving accurate traffic prediction is a fundamental and crucial task in the development of current intelligent transportation systems. Most of the mainstream methods that have made breakthroughs in traffic prediction rely on spatio-temporal graph neural networks, spatio-temporal attention mechanisms, etc. The main challenges of the existing deep learning approaches are that they either depend on a complete traffic network structure or require intricate model designs to capture complex spatio-temporal dependencies. These limitations pose significant challenges for the efficient deployment and operation of deep learning models on large-scale datasets. To address these challenges, we propose a cost-effective graph-free Multilayer Perceptron (MLP) based model M3-Net for traffic prediction. Our proposed model not only employs time series and spatio-temporal embeddings for efficient feature processing but also first introduces a novel MLP-Mixer architecture with a mixture of experts (MoE) mechanism. Extensive experiments conducted on multiple real datasets demonstrate the superiority of the proposed model in terms of prediction performance and lightweight deployment. Our code is available at https://github.com/jinguangyin/M3_NET
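
A graph-free mixer layer with a mixture of experts can be sketched in a few lines: mix over time steps, then over channels, with a top-1 gate choosing one expert MLP per time step. All shapes, the gating rule, and the lack of normalization layers are simplifying assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(5)

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2   # two-layer ReLU MLP

T, C, H, E = 12, 8, 16, 4                 # time steps, channels, hidden, experts
X = rng.normal(size=(T, C))

# Time mixing: an MLP applied across the time axis (per channel),
# with a residual connection -- no graph structure needed.
Wt1, Wt2 = rng.normal(size=(T, H)), rng.normal(size=(H, T))
X = X + mlp(X.T, Wt1, Wt2).T

# Channel mixing via a top-1 mixture of experts: a linear gate picks
# one expert MLP for each time step.
Wg = rng.normal(size=(C, E))
experts = [(rng.normal(size=(C, H)), rng.normal(size=(H, C))) for _ in range(E)]
gate = np.argmax(X @ Wg, axis=1)          # expert index per time step
Y = X + np.stack([mlp(X[t], *experts[gate[t]]) for t in range(T)])
print(Y.shape)
```

Because every operation is a dense matrix product, such a layer avoids graph construction entirely, which is the source of the claimed deployment efficiency.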

[440] P-DRUM: Post-hoc Descriptor-based Residual Uncertainty Modeling for Machine Learning Potentials

Shih-Peng Huang, Nontawat Charoenphakdee, Yuta Tsuboi, Yong-Bin Zhuang, Wenwen Li

Main category: cs.LG

TL;DR: Proposes P-DRUM, a post-hoc UQ framework for MLIPs that uses GNN descriptors to model prediction residuals as uncertainty proxies, offering computational efficiency over ensemble methods.

Motivation: Ensemble methods are computationally expensive for UQ in MLIPs, while alternative methods may affect accuracy or cannot be applied to trained models.

Method: Leverages descriptors from trained graph neural network potentials to model residual errors between predictions and ground truth as uncertainty proxies.

Result: P-DRUM provides an efficient post-hoc UQ approach that maintains accuracy while being computationally cheaper than ensemble methods.

Conclusion: P-DRUM offers a simple, efficient alternative to ensemble methods for uncertainty quantification in machine learning interatomic potentials.

Abstract: Ensemble methods are considered the gold standard for uncertainty quantification (UQ) in machine learning interatomic potentials (MLIPs). However, their high computational cost can limit their practicality. Alternative techniques, such as Monte Carlo dropout and deep kernel learning, have been proposed to improve computational efficiency; however, some of these methods cannot be applied to already trained models and may affect the prediction accuracy. In this paper, we propose a simple and efficient post-hoc framework for UQ that leverages the descriptor of a trained graph neural network potential to estimate residual errors. We refer to this method as post-hoc descriptor-based residual uncertainty modeling (P-DRUM). P-DRUM models the discrepancy between MLIP predictions and ground truth values, allowing these residuals to act as proxies for prediction uncertainty. We explore multiple variants of P-DRUM and benchmark them against established UQ methods, evaluating both their effectiveness and limitations.
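
The post-hoc residual-modeling idea can be sketched with a ridge regression from descriptors to absolute residuals on a calibration set; the prediction then serves as an uncertainty proxy at inference time. The synthetic descriptors, the residual model, and the error-generating function below are all illustrative assumptions, not the paper's variants.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical setup: descriptors from a trained GNN potential, plus the
# absolute residuals |prediction - ground truth| on a calibration set.
n, d = 300, 16
desc = rng.normal(size=(n, d))
beta = rng.normal(size=d)
true_err = np.maximum(np.log1p(np.exp(desc @ beta))       # synthetic residuals
                      + 0.05 * rng.normal(size=n), 0.0)

# Post-hoc residual model: ridge regression from descriptors to the
# absolute residual, no retraining of the potential itself.
lam = 1e-2
Phi = np.column_stack([desc, np.ones(n)])
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d + 1), Phi.T @ true_err)

uncertainty = np.clip(Phi @ w, 0.0, None)   # uncertainties are nonnegative
corr = np.corrcoef(uncertainty, true_err)[0, 1]
print(corr)
```

Because only a small linear model is fitted on top of existing descriptors, this kind of approach applies to already-trained models, which is the practical gap the paper targets.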

[441] A Certifiable Machine Learning-Based Pipeline to Predict Fatigue Life of Aircraft Structures

Ángel Ladrón, Miguel Sánchez-Domínguez, Javier Rozalén, Fernando R. Sánchez, Javier de Vicente, Lucas Lacasa, Eusebio Valero, Gonzalo Rubio

Main category: cs.LG

TL;DR: ML-based pipeline for aircraft wing fatigue life prediction using flight parameters, complementing traditional methods by reducing simulation requirements.

Motivation: Traditional fatigue life prediction methods are time-consuming, complex, and require multiple teams/tools. ML offers faster iterations and generalization to complement conventional simulations.

Method: Machine learning pipeline that estimates fatigue life of aircraft wing locations based on flight parameters from different missions throughout operational life.

Result: Accurate fatigue life predictions with thorough statistical validation and uncertainty quantification in realistic use cases.

Conclusion: The ML pipeline successfully complements traditional methodologies by reducing costly simulations and lowering computational/human resource requirements.

Abstract: Fatigue life prediction is essential in both the design and operational phases of any aircraft, and in this sense safety in the aerospace industry requires early detection of fatigue cracks to prevent in-flight failures. Robust and precise fatigue life predictors are thus essential to ensure safety. Traditional engineering methods, while reliable, are time consuming and involve complex workflows, including steps such as conducting several Finite Element Method (FEM) simulations, deriving the expected loading spectrum, and applying cycle counting techniques like peak-valley or rainflow counting. These steps often require collaboration between multiple teams and tools, added to the computational time and effort required to achieve fatigue life predictions. Machine learning (ML) offers a promising complement to traditional fatigue life estimation methods, enabling faster iterations and generalization, providing quick estimates that guide decisions alongside conventional simulations. In this paper, we present a ML-based pipeline that aims to estimate the fatigue life of different aircraft wing locations given the flight parameters of the different missions that the aircraft will be operating throughout its operational life. We validate the pipeline in a realistic use case of fatigue life estimation, yielding accurate predictions alongside a thorough statistical validation and uncertainty quantification. Our pipeline constitutes a complement to traditional methodologies by reducing the amount of costly simulations and, thereby, lowering the required computational and human resources.
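
Of the traditional steps the abstract lists, peak-valley extraction is the simplest to make concrete: reduce a load history to its turning points before cycle counting. This minimal sketch ignores plateaus (equal consecutive values) and is not the paper's pipeline.

```python
def peak_valley(signal):
    """Keep only the turning points (peaks and valleys) of a load
    history, the usual first step before rainflow cycle counting.
    Plateaus are ignored for simplicity."""
    if len(signal) < 3:
        return list(signal)
    out = [signal[0]]
    for prev, cur, nxt in zip(signal, signal[1:], signal[2:]):
        if (cur - prev) * (nxt - cur) < 0:   # slope changes sign
            out.append(cur)
    out.append(signal[-1])
    return out

print(peak_valley([0, 1, 2, 1, 0, 2, 3, 1]))
```

The reduced sequence of reversals is what rainflow counting consumes; an ML pipeline like the one described replaces parts of this chain with learned estimates from flight parameters.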

[442] Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

Qi Wang, Hanyang Peng, Yue Yu

Main category: cs.LG

TL;DR: Symphony-MoE enables effective Mixture-of-Experts construction by harmonizing experts from multiple pre-trained models through layer-aware fusion and activation-based functional alignment, overcoming parameter space disparities.

Motivation: Existing MoE upcycling methods limit expert diversity by using only one pre-trained dense model. This paper addresses the need to leverage multiple disparate pre-trained models to create more diverse and powerful MoE models.

Method: Two-stage framework: 1) Training-free harmonization using layer-aware fusion for shared backbone and activation-based functional alignment for parameter misalignment, 2) Post-training to coordinate the entire architecture.

Result: Successfully integrates experts from heterogeneous sources, achieving MoE models that significantly outperform baselines in multi-domain tasks and out-of-distribution generalization.

Conclusion: Symphony-MoE provides an effective solution for constructing diverse MoE models from multiple pre-trained sources, overcoming parameter space disparities and enabling superior performance across various domains.

Abstract: Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To mitigate the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Qwen2.5-Coder and Qwen2). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a stage of post-training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.
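
Activation-based functional alignment can be sketched as neuron matching: run probe inputs through two experts and pair up hidden units whose activations are most correlated. The toy below constructs one expert as a permuted copy of the other, so the matching should recover the permutation exactly; real experts would only align approximately, and the matching rule here (greedy argmax) is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hidden-unit activations of two FFN "experts" on n probe inputs; expert
# B computes the same functions as A but with its units reordered.
h, n = 8, 200
acts_a = rng.normal(size=(n, h))
perm_true = rng.permutation(h)
acts_b = acts_a[:, perm_true]

# Cross-correlation between A's units (rows) and B's units (columns).
corr = np.corrcoef(acts_a.T, acts_b.T)[:h, h:]
perm_hat = np.argmax(corr, axis=0)   # for each B unit, its best A match
print(perm_hat)
```

Once such a correspondence is known, expert B's weights can be permuted into A's coordinate system before fusion, mitigating the parameter misalignment that makes naive multi-source upcycling degrade.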

[443] RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers

Yifan Lu, Rixin Liu, Jiayi Yuan, Xingqi Cui, Shenrun Zhang, Hongyi Liu, Jiarong Xing

Main category: cs.LG

TL;DR: RouterArena is the first open platform for comprehensive comparison of LLM routers, featuring a standardized dataset, difficulty levels, evaluation metrics, and automated leaderboard updates.

DetailsMotivation: With the rapid emergence of various LLM routers, choosing the right one has become increasingly challenging, necessitating a comprehensive comparison platform similar to model leaderboards.

Method: Created RouterArena platform with: (1) principally constructed dataset with broad knowledge domain coverage, (2) distinguishable difficulty levels per domain, (3) extensive evaluation metrics, and (4) automated framework for leaderboard updates.

Result: Produced initial leaderboard with detailed metrics comparison and established an open framework for evaluating new routers available on GitHub.

Conclusion: RouterArena addresses the need for standardized router comparison and provides the first comprehensive platform for evaluating LLM router performance across different scenarios.

Abstract: Today’s LLM ecosystem comprises a wide spectrum of models that differ in size, capability, and cost. No single model is optimal for all scenarios; hence, LLM routers have become essential for selecting the most appropriate model under varying circumstances. However, the rapid emergence of various routers makes choosing the right one increasingly challenging. To address this problem, we need a comprehensive router comparison and a standardized leaderboard, similar to those available for models. In this work, we introduce RouterArena, the first open platform enabling comprehensive comparison of LLM routers. RouterArena has (1) a principally constructed dataset with broad knowledge domain coverage, (2) distinguishable difficulty levels for each domain, (3) an extensive list of evaluation metrics, and (4) an automated framework for leaderboard updates. Leveraging our framework, we have produced the initial leaderboard with detailed metrics comparison as shown in Figure 1. Our framework for evaluating new routers is on https://github.com/RouteWorks/RouterArena

[444] Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation

Jing Wang, Wonho Bae, Jiahong Chen, Wenxu Wang, Junhyug Noh

Main category: cs.LG

TL;DR: DVD is a latent diffusion model framework for source-free domain adaptation that transfers decision boundaries without accessing source data by using a pre-trained diffusion module to generate source-like cues from target features.

DetailsMotivation: To address the challenge of source-free domain adaptation where source data cannot be accessed, leveraging latent diffusion models as privacy-preserving bridges for explicit knowledge transfer.

Method: Encodes source feature label information into latent vicinity using Gaussian priors, trains diffusion network to drift noisy samples to label-consistent representations, then uses frozen diffusion module and InfoNCE loss to align target encoder to generated source-like cues.

Result: Outperforms state-of-the-art methods on standard SFDA benchmarks and enhances source classifier accuracy on in-domain data, also boosting performance in supervised classification and domain generalization.

Conclusion: DVD reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, solving a core challenge in source-free domain adaptation that prior methods couldn’t address.

Abstract: Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature’s label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature’s latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier’s accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve.
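The InfoNCE alignment step can be sketched in numpy: each target feature is pulled toward its own generated source-like cue (positive) and pushed away from the other cues in the batch (negatives). The dimensions, temperature, and synthetic data below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 16, 32, 0.1  # batch size, feature dim, temperature

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-ins: "source-like" cues from the frozen diffusion module, and
# target features that should be aligned to them.
cues = l2norm(rng.normal(size=(n, d)))
target = l2norm(cues + 0.1 * rng.normal(size=(n, d)))

def info_nce(z, c, tau):
    """InfoNCE: each row of z should match its own row of c (positive)
    against all other rows of c in the batch (negatives)."""
    logits = z @ c.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # cross-entropy, identity labels

loss_aligned = info_nce(target, cues, tau)
loss_shuffled = info_nce(target[rng.permutation(n)], cues, tau)
```

The loss is small when targets sit near their own cues and large under a random pairing, which is exactly the gradient signal used to drag the target encoder toward the source decision boundaries.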

[445] Heterogeneous Point Set Transformers for Segmentation of Multiple View Particle Detectors

Edgar E. Robles, Dikshant Sagar, Alejandro Yankelevich, Jianming Bian, Pierre Baldi, NOvA Collaboration

Main category: cs.LG

TL;DR: A point set neural network for particle identification in NOvA neutrino detector data that processes sparse 2D views with cross-view mixing, achieving 96.8% AUC with 90% memory reduction.

DetailsMotivation: Traditional methods for particle identification in NOvA experiment use clustering and CNNs, but detector data is sparse 2D views rather than 3D, requiring efficient processing.

Method: Proposed point set neural network that operates on sparse matrices with cross-view mixing operation to combine information from both XZ and YZ detector views.

Result: Achieved 96.8% AUC score (vs 85.4% when views processed independently) using less than 10% of memory required by previous methods.

Conclusion: Point set neural network with cross-view mixing effectively handles sparse detector data, achieving high accuracy with significant memory efficiency improvements.

Abstract: NOvA is a long-baseline neutrino oscillation experiment that detects neutrino particles from the NuMI beam at Fermilab. Before data from this experiment can be used in analyses, raw hits in the detector must be matched to their source particles, and the type of each particle must be identified. This task has commonly been done using a mix of traditional clustering approaches and convolutional neural networks (CNNs). Due to the construction of the detector, the data is presented as two sparse 2D images: an XZ and a YZ view of the detector, rather than a 3D representation. We propose a point set neural network that operates on the sparse matrices with an operation that mixes information from both views. Our model uses less than 10% of the memory required using previous methods while achieving a 96.8% AUC score, a higher score than obtained when both views are processed independently (85.4%).
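One simple way to realize such a cross-view mixing operation, sketched here as an assumption rather than the paper's exact design, is to augment each hit's features in one view with a permutation-invariant pooled summary of the other view:

```python
import numpy as np

rng = np.random.default_rng(0)
n_xz, n_yz, f = 40, 35, 16  # hits per view, feature width (illustrative)

feat_xz = rng.normal(size=(n_xz, f))  # per-hit features, XZ view
feat_yz = rng.normal(size=(n_yz, f))  # per-hit features, YZ view

def mix_views(a, b):
    """Cross-view mixing: concatenate each point in view `a` with a
    permutation-invariant (max-pooled) summary of view `b`."""
    summary = b.max(axis=0)  # (f,) global context from the other view
    return np.concatenate([a, np.broadcast_to(summary, a.shape)], axis=1)

mixed_xz = mix_views(feat_xz, feat_yz)  # (n_xz, 2f)
mixed_yz = mix_views(feat_yz, feat_xz)  # (n_yz, 2f)
```

Max pooling makes the summary independent of hit ordering, so the operation respects the point-set structure while letting each view condition on the other.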

[446] Jet Functors and Weil Algebras in Automatic Differentiation: A Geometric Analysis

Amandip Sangha

Main category: cs.LG

TL;DR: A geometric framework for automatic differentiation using jet functors and Weil algebras, unifying forward/reverse modes and enabling efficient higher-order differentiation in JAX.

DetailsMotivation: To provide a unified, coordinate-free geometric foundation for automatic differentiation that clarifies the algebraic structure and enables more efficient implementations.

Method: Differential-geometric formulation using jet functors and Weil algebras, with forward/reverse modes as pushforward/cotangent pullback, implemented in JAX with Weil-mode for computing all mixed derivatives in a single forward pass.

Result: Achieves algebraically exact and numerically stable differentiation with linear cost in algebra dimension, demonstrating that geometric abstraction leads to more efficient and transparent computational differentiation systems.

Conclusion: Geometric abstraction through jet functors and Weil algebras provides a unified framework for automatic differentiation that yields efficient, exact, and transparent differentiation systems with predictable scaling.

Abstract: We present a differential-geometric formulation of automatic differentiation (AD) based on jet functors and Weil algebras. In this framework, forward- and reverse-mode differentiation arise naturally as pushforward and cotangent pullback, while higher-order differentiation corresponds to evaluation in a Weil algebra. This construction provides a unified, coordinate-free view of derivative propagation and clarifies the algebraic structure underlying AD. All results are realized in modern JAX code, where the Weil-mode formulation computes all mixed derivatives in a single forward pass with cost linear in the algebra dimension. The resulting implementation achieves algebraically exact and numerically stable differentiation with predictable scaling, demonstrating that geometric abstraction can yield more efficient and transparent computational differentiation systems. Code is available at https://git.nilu.no/geometric-ad/jet-weil-ad
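The first-order case of "evaluation in a Weil algebra" is exactly dual-number arithmetic in R[ε]/(ε² = 0). A minimal self-contained sketch (plain Python, not the paper's JAX implementation):

```python
class Dual:
    """Element a + b*eps of the Weil algebra R[eps]/(eps^2 = 0).
    Evaluating f at x + 1*eps propagates the first-order jet:
    f(x + eps) = f(x) + f'(x)*eps."""

    def __init__(self, a, b=0.0):
        self.a, self.b = a, b  # value, tangent (derivative) coefficient

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2

jet = f(Dual(2.0, 1.0))  # seed tangent 1: jet.a = f(2), jet.b = f'(2)
```

Higher-order mixed derivatives correspond to richer Weil algebras (more nilpotent generators), which is what lets the paper's Weil-mode compute all mixed derivatives in one forward pass.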

[447] Generalization Bounds for Rank-sparse Neural Networks

Antoine Ledent, Rodrigo Alves, Yunwen Lei

Main category: cs.LG

TL;DR: This paper investigates generalization bounds for neural networks by exploiting the low-rank structure of weight matrices observed in practice, using Schatten p-quasi norms to derive bounds with improved sample complexity.

DetailsMotivation: Recent observations show neural networks exhibit bottleneck rank property where activations and weights become approximately low-rank during training. This aligns with findings that weight decay in linear networks minimizes Schatten p-quasi norms.

Method: The authors prove generalization bounds that leverage the approximate low-rank structure of weight matrices. The analysis uses Schatten p-quasi norms of weight matrices to characterize the complexity.

Result: For small p, the bounds show a sample complexity of Õ(WrL²), where W is the width, L the depth, and r the rank of the weight matrices. As p increases, the bounds behave more like traditional norm-based bounds.

Conclusion: The paper provides theoretical generalization guarantees that exploit the low-rank structure commonly observed in trained neural networks, offering improved sample complexity bounds compared to standard norm-based approaches.

Abstract: It has been recently observed in much of the literature that neural networks exhibit a bottleneck rank property: for larger depths, the activation and weights of neural networks trained with gradient-based methods tend to be of approximately low rank. In fact, the rank of the activations of each layer converges to a fixed value referred to as the ``bottleneck rank’’, which is the minimum rank required to represent the training data. This perspective is in line with the observation that regularizing linear networks (without activations) with weight decay is equivalent to minimizing the Schatten $p$ quasi norm of the neural network. In this paper we investigate the implications of this phenomenon for generalization. More specifically, we prove generalization bounds for neural networks which exploit the approximate low rank structure of the weight matrices if present. The final results rely on the Schatten $p$ quasi norms of the weight matrices: for small $p$, the bounds exhibit a sample complexity $ \widetilde{O}(WrL^2)$ where $W$ and $L$ are the width and depth of the neural network respectively and where $r$ is the rank of the weight matrices. As $p$ increases, the bound behaves more like a norm-based bound instead.
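The Schatten p-quasi-norm that drives the bounds is just the ℓ_p norm of the singular values. A small numpy sketch (dimensions illustrative) shows why small p strongly favors low-rank matrices:

```python
import numpy as np

def schatten(W, p):
    """Schatten p-(quasi-)norm: the l_p norm of the singular values.
    For p < 1 this is a quasi-norm that heavily rewards low rank."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 50))  # rank 2
full_rank = rng.normal(size=(50, 50))

# Normalize both to unit Frobenius norm (p = 2), then compare at small p:
low_rank /= schatten(low_rank, 2)
full_rank /= schatten(full_rank, 2)
ratio = schatten(low_rank, 0.5) / schatten(full_rank, 0.5)
```

At equal Frobenius norm the rank-2 matrix has a far smaller Schatten-1/2 quasi-norm, which is exactly why bounds in this norm tighten for the approximately low-rank weights observed in trained networks.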

[448] Aligning Diffusion Language Models via Unpaired Preference Optimization

Vaibhav Jindal, Hejian Sang, Chun-Mao Lai, Yanning Chen, Zhipeng Wang

Main category: cs.LG

TL;DR: ELBO-KTO combines ELBO surrogate for diffusion log-likelihoods with unpaired preference optimization (KTO) to align diffusion language models without costly pairwise preference data.

DetailsMotivation: Aligning diffusion language models to human preferences is challenging due to intractable sequence log-likelihoods and costly pairwise preference data collection.

Method: Uses ELBO surrogate for diffusion log-likelihoods combined with prospect-theoretic unpaired preference objective (KTO), with variance-reduction techniques to stabilize training gradients.

Result: Achieves 65.9% and 62.3% adjusted win rates on preference benchmarks, and performs on par with or better than the base model across reasoning/knowledge tasks like GSM8K and MMLU under identical decoding.

Conclusion: Establishes unpaired preference optimization as a viable alternative to pairwise alignment for diffusion LLMs.

Abstract: Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
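A schematic of a KTO-style unpaired objective, written from the standard KTO formulation rather than the paper's exact loss; the reference point z_ref and the λ weights are simplified assumptions, and in ELBO-KTO an ELBO surrogate stands in for the intractable diffusion log-likelihood.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio, desirable, beta=0.1, z_ref=0.0, lam_d=1.0, lam_u=1.0):
    """Schematic unpaired (KTO-style) loss for a single example.
    log_ratio: log pi_theta(y|x) - log pi_ref(y|x); for a diffusion LLM,
    an ELBO surrogate would replace the exact log-likelihood.
    Desirable outputs are pulled above the reference point z_ref and
    undesirable ones pushed below it -- no paired preferences needed."""
    if desirable:
        value = lam_d * sigmoid(beta * (log_ratio - z_ref))
        return lam_d - value
    value = lam_u * sigmoid(beta * (z_ref - log_ratio))
    return lam_u - value

# Raising a desirable sample's log-ratio lowers its loss; for an
# undesirable sample the gradient points the other way.
loss_hi = kto_loss(+5.0, desirable=True)
loss_lo = kto_loss(-5.0, desirable=True)
```

Because each example is scored on its own (desirable or undesirable), the objective needs only per-sample labels rather than the costly pairwise comparisons of DPO-style alignment.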

[449] Adaptive EEG-based stroke diagnosis with a GRU-TCN classifier and deep Q-learning thresholding

Shakeel Abdulkareem, Bora Yimenicioglu, Khartik Uppalapati, Aneesh Gudipati, Adan Eftekhari, Saleh Yassin

Main category: cs.LG

TL;DR: An adaptive multitask EEG classifier using GRU-TCN with DQN threshold adaptation achieves 98% accuracy for stroke type classification, outperforming baseline methods.

DetailsMotivation: Rapid triage of suspected stroke requires accurate, bedside-deployable tools, and EEG is promising but underused at first contact.

Method: Convert 32-channel EEG signals to power spectral density features using Welch method, use GRU-TCN network to predict stroke type, lateralization, and severity, and apply DQN to tune decision thresholds in real time.

Result: Baseline GRU-TCN achieved 89.3% accuracy for stroke type, 96.9% for severity, and 96.7% for lateralization. With DQN adaptation, stroke-type accuracy increased to 98.0%. Validated on independent cohort with patient-level statistics.

Conclusion: Adaptive thresholding shifts operating points to clinically preferred sensitivity-specificity trade-offs, while integrated visualizations support interpretability for stroke triage.

Abstract: Rapid triage of suspected stroke needs accurate, bedside-deployable tools; EEG is promising but underused at first contact. We present an adaptive multitask EEG classifier that converts 32-channel signals to power spectral density features (Welch), uses a recurrent-convolutional network (GRU-TCN) to predict stroke type (healthy, ischemic, hemorrhagic), hemispheric lateralization, and severity, and applies a deep Q-network (DQN) to tune decision thresholds in real time. Using a patient-wise split of the UCLH Stroke EIT/EEG data set (44 recordings; about 26 acute stroke, 10 controls), the primary outcome was stroke-type performance; secondary outcomes were severity and lateralization. The baseline GRU-TCN reached 89.3% accuracy (F1 92.8%) for stroke type, about 96.9% (F1 95.9%) for severity, and about 96.7% (F1 97.4%) for lateralization. With DQN threshold adaptation, stroke-type accuracy increased to about 98.0% (F1 97.7%). We also tested robustness on an independent, low-density EEG cohort (ZJU4H) and report paired patient-level statistics. Analyses follow STARD 2015 guidance for diagnostic accuracy studies (index test: GRU-TCN+DQN; reference standard: radiology/clinical diagnosis; patient-wise evaluation). Adaptive thresholding shifts the operating point to clinically preferred sensitivity-specificity trade-offs, while integrated scalp-map and spectral visualizations support interpretability.
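The Welch PSD front end can be sketched in pure numpy (in practice `scipy.signal.welch` would be used; the 250 Hz rate, segment length, and test signal are illustrative, not taken from the paper):

```python
import numpy as np

def welch_psd(x, fs, nperseg=256):
    """Minimal Welch estimate: average Hann-windowed periodograms over
    50%-overlapping segments (a simplified stand-in for scipy.signal.welch)."""
    step = nperseg // 2
    win = np.hanning(nperseg)
    scale = fs * (win ** 2).sum()  # density normalization
    starts = range(0, len(x) - nperseg + 1, step)
    psd = np.zeros(nperseg // 2 + 1)
    for i in starts:
        seg = x[i:i + nperseg]
        spec = np.fft.rfft(win * (seg - seg.mean()))  # detrend, window, FFT
        psd += np.abs(spec) ** 2 / scale
    psd /= len(starts)
    psd[1:-1] *= 2.0  # fold negative frequencies (one-sided spectrum)
    return np.fft.rfftfreq(nperseg, d=1.0 / fs), psd

fs = 250.0                       # a common EEG sampling rate (illustrative)
t = np.arange(0, 8.0, 1.0 / fs)
x = np.sin(2 * np.pi * 10.0 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
freqs, psd = welch_psd(x, fs)
peak_hz = freqs[np.argmax(psd)]  # should sit near the 10 Hz alpha-band tone
```

Band powers pooled from such per-channel PSDs are the kind of features the GRU-TCN would consume.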

[450] Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning

Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

Main category: cs.LG

TL;DR: The paper analyzes limitations of conventional MIL methods in medical imaging that ignore contextual relationships between instances, and shows that newer correlated MIL methods still fall short of the optimal Bayes estimator.

DetailsMotivation: Conventional MIL approaches treat instances separately, ignoring contextual relationships between nearby patches or slices that are essential in medical imaging applications.

Method: Designed a synthetic classification task where adjacent instance features are crucial, compared off-the-shelf MIL approaches to optimal Bayes estimator available in closed-form.

Result: Demonstrated limitations of conventional MIL methods and showed that newer correlated MIL methods still do not achieve the best possible performance, even with large training datasets.

Conclusion: Current MIL methods, including newer correlated approaches, fail to fully capture contextual relationships between instances, indicating need for improved methods that better account for spatial dependencies.

Abstract: Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still do not achieve the best possible performance when trained with ten thousand training samples, each containing many instances.

[451] Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models

Semih Cayci

Main category: cs.LG

TL;DR: Analyzes stochastic Gauss-Newton method’s convergence and generalization for overparameterized deep neural networks, showing how curvature and batch size affect performance.

DetailsMotivation: To understand how higher-order optimization methods like stochastic Gauss-Newton affect generalization in deep learning, particularly for overparameterized networks.

Method: Uses stochastic Gauss-Newton with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in regression.

Result: Established finite-time convergence bounds with explicit dependencies on batch size, network width and depth, and derived non-asymptotic generalization bounds using uniform stability.

Conclusion: Identified a favorable generalization regime where larger minimum eigenvalue of Gauss-Newton matrix along optimization path yields tighter stability bounds.

Abstract: An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
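The core update analyzed here, a Gauss-Newton step with Levenberg-Marquardt damping on a mini-batch, solves (JᵀJ + λI)δ = −Jᵀr. A minimal numpy sketch on a toy tanh regression; the sizes, damping constant, and model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = np.tanh(X @ theta_true) + 0.01 * rng.normal(size=n)  # noisy regression data

def residual_and_jacobian(theta, Xb, yb):
    z = Xb @ theta
    r = np.tanh(z) - yb                         # residuals of the nonlinear model
    J = (1.0 - np.tanh(z) ** 2)[:, None] * Xb   # Jacobian d r / d theta
    return r, J

theta, lam, batch = np.zeros(p), 1e-2, 32
for _ in range(200):
    idx = rng.choice(n, size=batch, replace=False)  # mini-batch sampling
    r, J = residual_and_jacobian(theta, X[idx], y[idx])
    # Gauss-Newton step with Levenberg-Marquardt damping:
    # (J^T J + lam * I) delta = -J^T r
    delta = np.linalg.solve(J.T @ J + lam * np.eye(p), -J.T @ r)
    theta = theta + delta

err = np.linalg.norm(theta - theta_true)
```

The damping term λI is what keeps the variable-metric step well-posed when the batch Gauss-Newton matrix JᵀJ is near-singular, which is the quantity whose minimum eigenvalue governs the paper's stability bounds.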

[452] Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks

Alper Kalle, Theo Rudkiewicz, Mohamed-Oumar Ouerfelli, Mohamed Tamaazousti

Main category: cs.LG

TL;DR: Proposes data-informed tensor compression for neural networks using covariance-based norms instead of weight-space norms, achieving competitive accuracy without fine-tuning and enabling cross-dataset compression.

DetailsMotivation: Neural networks demand high computing power, and traditional compression methods use isotropic norms in weight space that may not preserve functional accuracy, requiring post-compression fine-tuning.

Method: Uses data-informed norms that minimize the change in the layer's output distribution, expressed as ‖(W − Ŵ)Σ^{1/2}‖_F, where Σ^{1/2} is the square root of the input covariance matrix. Develops alternating least squares algorithms for the Tucker-2 and CPD tensor decompositions.

Result: Achieves competitive accuracy without fine-tuning, works across datasets with minor accuracy drop, validated on ResNet-18/50, GoogLeNet with ImageNet, FGVC-Aircraft, Cifar10, Cifar100.

Conclusion: Data-informed tensor compression with covariance-based norms provides effective neural network compression without requiring fine-tuning and enables cross-dataset application.

Abstract: Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory- and compute-footprint can be reduced by compression. In this work, we focus on compression through tensorization and low-rank representations. Whereas classical approaches search for a low-rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight-space, we use data-informed norms that measure the error in function space. Concretely, we minimize the change in the layer’s output distribution, which can be expressed as $\lVert (W - \widetilde{W}) Σ^{1/2}\rVert_F$ where $Σ^{1/2}$ is the square root of the covariance matrix of the layer’s input and $W$, $\widetilde{W}$ are the original and compressed weights. We propose new alternating least square algorithms for the two most common tensor decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post-compression fine-tuning, our data-informed approach often achieves competitive accuracy without any fine-tuning. We further show that the same covariance-based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50, and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100) confirm the advantages of the proposed method.
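For a single linear layer the data-informed objective has a closed-form solution: truncate the SVD of WΣ^{1/2} and undo the whitening (Eckart-Young in the weighted norm). A numpy sketch, with a synthetic anisotropic input distribution as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, r = 20, 30, 500, 5

W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n, d_in)) * np.linspace(0.1, 3.0, d_in)  # anisotropic inputs

Sigma = X.T @ X / n                                   # input covariance
L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(d_in))   # acts as Sigma^{1/2}

# Data-informed rank-r approx: truncate the SVD of W @ L, then unwhiten.
U, s, Vt = np.linalg.svd(W @ L)
W_data = (U[:, :r] * s[:r]) @ Vt[:r] @ np.linalg.inv(L)

# Plain weight-space (Frobenius) truncation for comparison.
U2, s2, Vt2 = np.linalg.svd(W)
W_frob = (U2[:, :r] * s2[:r]) @ Vt2[:r]

def out_err(W_hat):
    # ||(W - W_hat) Sigma^{1/2}||_F, since L L^T = Sigma
    return float(np.linalg.norm((W - W_hat) @ L))

gain = out_err(W_frob) - out_err(W_data)  # > 0: data-informed wins in this norm
```

The paper's contribution is extending this idea from a single matrix to Tucker-2 and CPD factorizations via alternating least squares; the sketch only shows why the covariance weighting changes which low-rank approximation is optimal.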

[453] Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models

Jiwoo Shin, Byeonghu Na, Mina Kang, Wonhyeok Choi, Il-Chul Moon

Main category: cs.LG

TL;DR: Proposes a method to improve text-to-image safety by replacing negative prompts with implicit negative embeddings from concept inversion, addressing incompatibility between fine-tuning and training-free guidance approaches.

DetailsMotivation: Current text-to-image models can generate harmful content from malicious prompts. Existing defenses (fine-tuning and training-free guidance) show incompatibility when combined, limiting their combined effectiveness.

Method: Replace negative prompts in training-free methods with implicit negative embeddings obtained through concept inversion. This requires no modification to existing approaches and can be easily integrated.

Result: Experimental validation on nudity and violence benchmarks shows consistent improvements in defense success rate while preserving input prompt semantics.

Conclusion: The proposed method effectively addresses the incompatibility between fine-tuning and training-free guidance approaches, improving safety without compromising core semantics.

Abstract: Recent advances in text-to-image generative models have raised concerns about their potential to produce harmful content when provided with malicious input text prompts. To address this issue, two main approaches have emerged: (1) fine-tuning the model to unlearn harmful concepts and (2) training-free guidance methods that leverage negative prompts. However, we observe that combining these two orthogonal approaches often leads to marginal or even degraded defense performance. This observation indicates a critical incompatibility between two paradigms, which hinders their combined effectiveness. In this work, we address this issue by proposing a conceptually simple yet experimentally robust method: replacing the negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion. Our method requires no modification to either approach and can be easily integrated into existing pipelines. We experimentally validate its effectiveness on nudity and violence benchmarks, demonstrating consistent improvements in defense success rate while preserving the core semantics of input prompts.

[454] ActiTect: A Generalizable Machine Learning Pipeline for REM Sleep Behavior Disorder Screening through Standardized Actigraphy

David Bertram, Anja Ophey, Sinah Röttgen, Konstantin Kufer, Gereon R. Fink, Elke Kalbe, Clint Hansen, Walter Maetzler, Maximilian Kapsecker, Lara M. Reimer, Stephan Jonas, Andreas T. Damgaard, Natasha B. Bertelsen, Casper Skjaerbaek, Per Borghammer, Karolien Groenewald, Pietro-Luca Ratti, Michele T. Hu, Noémie Moreau, Michael Sommerauer, Katarzyna Bozek

Main category: cs.LG

TL;DR: ActiTect is an automated, open-source machine learning tool that detects REM sleep behavior disorder (RBD) from wrist-worn actigraphy data with strong performance across multiple validation cohorts.

DetailsMotivation: iRBD is a key prodromal marker for α-synucleinopathies like Parkinson's disease, and wrist-worn actimeters offer scalable screening potential but require reliable analysis pipelines.

Method: Developed ActiTect with robust preprocessing, automated sleep-wake detection, and physiologically interpretable motion features. Used nested cross-validation on 78 individuals and validated on multiple external cohorts.

Result: Achieved AUROC of 0.95 in development, 0.86 on local test set, and 0.84-0.94 on external cohorts. Leave-one-dataset-out validation showed consistent performance (AUROC 0.84-0.89).

Conclusion: ActiTect provides a robust, generalizable RBD detection tool that is open-source and ready for broader deployment, advancing unified RBD screening using wearables.

Abstract: Isolated rapid eye movement sleep behavior disorder (iRBD) is a major prodromal marker of $α$-synucleinopathies, often preceding the clinical onset of Parkinson’s disease, dementia with Lewy bodies, or multiple system atrophy. While wrist-worn actimeters hold significant potential for detecting RBD in large-scale screening efforts by capturing abnormal nocturnal movements, they become inoperable without a reliable and efficient analysis pipeline. This study presents ActiTect, a fully automated, open-source machine learning tool to identify RBD from actigraphy recordings. To ensure generalizability across heterogeneous acquisition settings, our pipeline includes robust preprocessing and automated sleep-wake detection to harmonize multi-device data and extract physiologically interpretable motion features characterizing activity patterns. Model development was conducted on a cohort of 78 individuals, yielding strong discrimination under nested cross-validation (AUROC = 0.95). Generalization was confirmed on a blinded local test set (n = 31, AUROC = 0.86) and on two independent external cohorts (n = 113, AUROC = 0.84; n = 57, AUROC = 0.94). To assess real-world robustness, leave-one-dataset-out cross-validation across the internal and external cohorts demonstrated consistent performance (AUROC range = 0.84-0.89). A complementary stability analysis showed that key predictive features remained reproducible across datasets, supporting the final pooled multi-center model as a robust pre-trained resource for broader deployment. By being open-source and easy to use, our tool promotes widespread adoption and facilitates independent validation and collaborative improvements, thereby advancing the field toward a unified and generalizable RBD detection model using wearable devices.

[455] Distributionally Robust Self Paced Curriculum Reinforcement Learning

Anirudh Satheesh, Keenan Powell, Vaneet Aggarwal

Main category: cs.LG

TL;DR: DR-SPCRL adaptively schedules the robustness budget ε as a curriculum to overcome the performance-robustness tradeoff in distributionally robust RL, achieving better stability and 11.8% higher returns under perturbations.

DetailsMotivation: Fixed robustness budgets in DRRL create a tradeoff where small ε yields high nominal performance but weak robustness, while large ε causes instability and overly conservative policies.

Method: Treat ε as a continuous curriculum and adaptively schedule it based on the agent’s learning progress, balancing nominal and robust performance throughout training.

Result: Achieves 11.8% average increase in episodic return under perturbations compared to fixed/heuristic scheduling, and approximately 1.9× performance of nominal RL algorithms.

Conclusion: Adaptive curriculum scheduling of robustness budget enables superior robustness-performance tradeoff and stabilizes training in distributionally robust reinforcement learning.

Abstract: A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment into real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $ε$. However, fixing $ε$ results in a tradeoff between performance and robustness: small values yield high nominal performance but weak robustness, while large values can result in instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $ε$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent’s progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average 11.8% increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately 1.9$\times$ the performance of the corresponding nominal RL algorithms.
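A hypothetical sketch of self-paced budget scheduling: the thresholded update rule and the toy return model below are this sketch's assumptions, not the paper's actual schedule.

```python
def schedule_epsilon(eps, robust_return, target_return, eps_max=0.5, step=0.01):
    """Hypothetical self-paced update: grow the robustness budget only while
    the agent's robust return keeps pace with a target, and shrink it when
    robust performance collapses, keeping training stable."""
    if robust_return >= target_return:
        return min(eps + step, eps_max)  # agent is coping: demand more robustness
    return max(eps - step, 0.0)          # too hard: ease the curriculum

# A stylized training trace: the budget ratchets up as the agent improves.
eps, trace = 0.0, []
for t in range(100):
    robust_return = 100.0 - 200.0 * eps + t  # toy model: robustness hurts, learning helps
    eps = schedule_epsilon(eps, robust_return, target_return=80.0)
    trace.append(eps)
```

The point of the curriculum view is visible even in this toy: ε climbs only as fast as the agent's robust performance allows, avoiding both the weak robustness of a small fixed ε and the instability of a large one.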

[456] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero, Yann LeCun

Main category: cs.LG

TL;DR: LeJEPA is a theoretically grounded training objective for Joint-Embedding Predictive Architectures that combines predictive loss with Sketched Isotropic Gaussian Regularization to achieve optimal embedding distributions for minimal downstream prediction risk.

DetailsMotivation: To provide practical guidance and theory for Joint-Embedding Predictive Architectures (JEPAs), which currently lack systematic approaches and rely on ad-hoc R&D, by developing a comprehensive theory and lean implementation.

Method: Identifies isotropic Gaussian as optimal embedding distribution, introduces Sketched Isotropic Gaussian Regularization (SIGReg) to constrain embeddings, and combines JEPA predictive loss with SIGReg to create LeJEPA - a simple, scalable objective with linear complexity.

Result: Achieves 79% accuracy on ImageNet-1k with ViT-H/14 using linear evaluation, demonstrates stability across 10+ datasets, 60+ architectures with varying scales and domains, and requires only ~50 lines of code for implementation.

Conclusion: LeJEPA offers a simple, theory-friendly ecosystem that reestablishes self-supervised pre-training as a core pillar of AI research, providing stability, scalability, and practical benefits without complex heuristics.

Abstract: Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs’ embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only ≈50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with a frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: https://github.com/rbalestr-lab/lejepa).
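
The core regularization idea, pushing embeddings toward an isotropic Gaussian via random 1-D sketches, can be illustrated with a toy version. The statistic below (matching the first two moments of each projection to N(0,1)) is an assumption for illustration; the paper's actual SIGReg statistic may differ.

```python
import numpy as np

def sigreg_sketch(z, num_dirs=16, rng=None):
    """Toy sketched isotropy penalty: project embeddings z (n, d) onto
    random unit directions and penalize deviation of each projection's
    mean from 0 and variance from 1 (standard Gaussian moments)."""
    rng = np.random.default_rng(rng)
    dirs = rng.normal(size=(num_dirs, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = z @ dirs.T                                  # (n, num_dirs)
    mean_pen = np.mean(proj.mean(axis=0) ** 2)
    var_pen = np.mean((proj.var(axis=0) - 1.0) ** 2)
    return mean_pen + var_pen

rng = np.random.default_rng(0)
z_iso = rng.normal(size=(2048, 32))        # already ~ isotropic Gaussian
z_collapsed = np.zeros((2048, 32))         # fully collapsed embeddings
loss_iso = sigreg_sketch(z_iso, rng=1)
loss_col = sigreg_sketch(z_collapsed, rng=1)
```

A collapsed embedding incurs a large penalty while an isotropic Gaussian embedding incurs almost none, which is the behavior that prevents representation collapse without stop-gradients or teacher-student tricks.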

[457] Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

Main category: cs.LG

TL;DR: Lethe is a dynamic KV cache management framework that reduces memory and latency in LLM reasoning tasks through layerwise sparsity-aware allocation and multi-round token pruning using Recency-Aware Selective Retention.

DetailsMotivation: Existing KV compression methods focus on reducing prefill memory but fail to address the dynamic and layer-sensitive nature of long-form generation in reasoning tasks, which involves substantial memory and latency overheads from accumulating KV caches.

Method: Lethe introduces adaptivity along spatial and temporal dimensions: spatial dimension uses layerwise sparsity-aware allocation based on attention redundancy, while temporal dimension employs multi-round token pruning with Recency-Aware Selective Retention (RASR) that considers both recency and token relevance from evolving attention patterns.

Result: Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56x.

Conclusion: The proposed Lethe framework effectively addresses the limitations of existing KV compression methods by dynamically managing KV caches through spatial and temporal adaptivity, significantly improving efficiency in long-form generation tasks while maintaining generation quality.

Abstract: Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56x.
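
A retention rule that mixes recency with attention-derived relevance can be sketched as follows. The scoring formula and the mixing weight alpha are illustrative assumptions, not Lethe's actual mechanism.

```python
import numpy as np

def rasr_keep(attn_scores, current_step, positions, k, alpha=0.5):
    """Score each cached token by a blend of recency and accumulated
    attention relevance; return the indices of the top-k tokens to
    retain in the KV cache (the rest would be evicted)."""
    recency = np.exp(-(current_step - positions) / max(current_step, 1))
    relevance = attn_scores / (attn_scores.max() + 1e-9)
    score = alpha * recency + (1 - alpha) * relevance
    return np.argsort(score)[::-1][:k]

attn = np.array([0.9, 0.1, 0.05, 0.02, 0.8])   # per-token attention mass
pos = np.arange(5)
keep = rasr_keep(attn, current_step=5, positions=pos, k=3)
```

Here the oldest token survives because it still attracts attention, and the newest survives on recency alone; a purely recency-based heuristic would have evicted the former.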

[458] Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

Zhongyang Li, Ziyue Li, Tianyi Zhou

Main category: cs.LG

TL;DR: RoMA improves MoE LLM performance by aligning routing weights with task embeddings through manifold regularization, achieving better generalization with lightweight router finetuning.

DetailsMotivation: Existing MoE LLMs show suboptimal router performance causing 10-20% accuracy gaps compared to optimal routing, limiting generalization on downstream tasks.

Method: Introduces Routing Manifold Alignment (RoMA) with manifold regularization that encourages routing weights to be close to successful neighbors in task embedding space, requiring only lightweight router finetuning with other parameters frozen.

Result: Substantial improvements on diverse benchmarks for OLMoE, DeepSeekMoE, and Qwen3-MoE models, reducing the performance gap to optimal routing.

Conclusion: RoMA effectively bridges task understanding and solution generation in MoE LLMs, enabling better generalization through task-expert bindings across samples.

Abstract: Sparse Mixture-of-Experts (MoE) has been widely adopted in recent large language models since it can efficiently scale up model capability without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap and improve MoE LLMs’ generalization performance. Our method, “Routing Manifold Alignment (RoMA)”, introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
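
The regularization term described above, pulling each sample's routing weights toward those of its successful neighbors in task-embedding space, can be sketched directly. The distance metric, neighbor count, and squared-error penalty are assumptions for illustration.

```python
import numpy as np

def roma_reg(routing, embeds, success, k=2):
    """Mean squared distance from each sample's routing weights to the
    average routing weights of its k nearest *successful* neighbors in
    the task-embedding space. routing: (n, experts); embeds: (n, dim);
    success: (n,) boolean mask of samples whose routing led to a
    correct answer."""
    n = routing.shape[0]
    total = 0.0
    for i in range(n):
        d = np.linalg.norm(embeds - embeds[i], axis=1)
        d[~success] = np.inf             # neighbors must be successful
        d[i] = np.inf                    # exclude the sample itself
        nbrs = np.argsort(d)[:k]
        nbrs = nbrs[np.isfinite(d[nbrs])]
        if len(nbrs) == 0:
            continue
        target = routing[nbrs].mean(axis=0)
        total += np.sum((routing[i] - target) ** 2)
    return total / n

routing_same = np.full((4, 3), 1 / 3)              # identical expert choices
routing_diff = np.vstack([np.eye(3), [[1/3, 1/3, 1/3]]])
embeds = np.arange(8, dtype=float).reshape(4, 2)
success = np.array([True, True, True, True])
```

When all samples already share expert choices the penalty vanishes; divergent routing among near neighbors is penalized, which is the binding between tasks and experts the method relies on.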

[459] Rectified Noise: A Generative Model Using Positive-incentive Noise

Zhenyu Gu, Yanchen Xu, Sida Huang, Yubin Guo, Hongyuan Zhang

Main category: cs.LG

TL;DR: Rectified Noise (RN) improves generative performance by injecting positive-incentive noise into pre-trained Rectified Flow models, reducing FID scores with minimal additional parameters.

DetailsMotivation: While Rectified Flow models use probability flow ODEs, recent research shows that adding noise through reverse-time SDEs during sampling can enhance generative performance. This work aims to leverage positive-incentive noise to improve existing RF models.

Method: Proposed Rectified Noise pipeline that trains pi-noise generators to inject positive-incentive noise into the velocity field of pre-trained RF models, efficiently transforming them into pi-noise generators with minimal parameter overhead.

Result: Extensive experiments show significant improvements: FID reduced from 10.16 to 9.05 on ImageNet-1k, with only 0.39% additional training parameters required for pi-noise generators.

Conclusion: Rectified Noise provides an effective and efficient method to enhance pre-trained Rectified Flow models by incorporating positive-incentive noise, achieving better generative performance with minimal computational overhead.

Abstract: Rectified Flow (RF) has been widely used as an effective generative model. Although RF is primarily based on probability flow Ordinary Differential Equations (ODE), recent studies have shown that injecting noise through reverse-time Stochastic Differential Equations (SDE) for sampling can achieve superior generative performance. Inspired by Positive-incentive Noise (pi-noise), we propose an innovative generative algorithm to train pi-noise generators, namely Rectified Noise (RN), which improves the generative performance by injecting pi-noise into the velocity field of pre-trained RF models. After introducing the Rectified Noise pipeline, pre-trained RF models can be efficiently transformed into pi-noise generators. We validate Rectified Noise by conducting extensive experiments across various model architectures on different datasets. Notably, we find that: (1) RF models using Rectified Noise reduce FID from 10.16 to 9.05 on ImageNet-1k. (2) The trained pi-noise generators achieve improved performance with only 0.39% additional training parameters.
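
The mechanics of injecting noise into a velocity-field sampler can be shown with a toy Euler loop. This is a conceptual sketch only: the velocity field, the noise schedule, and the additive form of the injection are assumptions, not the paper's learned pi-noise generator.

```python
import numpy as np

def sample(x0, v, noise_scale, steps=50, rng=None):
    """Integrate dx/dt = v(x, t) from t=0 to 1 with Euler steps, adding
    a scaled Gaussian term to each step; noise_scale(t) = 0 recovers the
    deterministic ODE sampler."""
    rng = np.random.default_rng(rng)
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        drift = v(x, t)
        noise = noise_scale(t) * rng.normal(size=x.shape)
        x = x + dt * drift + np.sqrt(dt) * noise   # noisy velocity-field step
    return x

v = lambda x, t: -x                    # toy velocity pulling samples toward 0
sigma = lambda t: 0.1 * (1 - t)        # toy schedule: noise fades near t = 1
out = sample(np.ones((4, 2)), v, sigma, rng=0)
```

The point of the pipeline is that the noise term is not fixed like sigma here but produced by a small trained generator, which is why the parameter overhead stays below 1%.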

[460] Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing

Junkai Lu, Peng Chen, Chenjuan Guo, Yang Shu, Meng Wang, Bin Yang

Main category: cs.LG

TL;DR: DTAF is a dual-branch framework for long-term time series forecasting that addresses non-stationarity in both temporal and frequency domains using specialized modules for temporal stabilization and frequency wave modeling.

DetailsMotivation: Real-world time series often exhibit non-stationarity including temporal distribution shifts and spectral variability, which pose significant challenges for long-term forecasting across domains like energy, finance, transportation, and cloud computing.

Method: DTAF uses a dual-branch approach: Temporal Stabilizing Fusion (TFS) module with non-stationary mix of experts filter to handle temporal non-stationarity, and Frequency Wave Modeling (FWM) module with frequency differencing to address spectral shifts. The outputs are fused for robust forecasting.

Result: Extensive experiments on real-world benchmarks show DTAF outperforms state-of-the-art baselines with significant improvements in forecasting accuracy under non-stationary conditions.

Conclusion: DTAF effectively addresses non-stationarity in both temporal and frequency domains, providing robust time series forecasting that adapts to distribution shifts and spectral variability.

Abstract: Time series forecasting is critical for decision-making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which pose significant challenges for long-term time series forecasting. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mix of experts (MOE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions. All codes are available at https://github.com/PandaJunk/DTAF.
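
The frequency-differencing idea, highlighting components whose spectral energy shifts over time, can be demonstrated on a synthetic non-stationary signal. The window size and the difference statistic are illustrative assumptions, not DTAF's exact FWM module.

```python
import numpy as np

def freq_diff(series, win):
    """Amplitude spectra of consecutive non-overlapping windows, then
    the per-frequency magnitude of change between adjacent windows."""
    spectra = [np.abs(np.fft.rfft(series[s:s + win]))
               for s in range(0, len(series) - win + 1, win)]
    spectra = np.stack(spectra)              # (num_windows, num_freqs)
    return np.abs(np.diff(spectra, axis=0))  # large = spectral shift

t = np.arange(256)
# Non-stationary series: the dominant frequency jumps halfway through.
x = np.where(t < 128, np.sin(2 * np.pi * 8 * t / 128),
                      np.sin(2 * np.pi * 24 * t / 128))
shift = freq_diff(x, win=128)
```

The two frequency bins involved in the jump light up while stationary bins stay near zero, which is exactly the signal a forecaster needs to adapt to spectral variability.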

[461] A Unified Geometric Field Theory Framework for Transformers: From Manifold Embeddings to Kernel Modulation

Xianshuai Shi, Jianfeng Zhu, Leibo Liu

Main category: cs.LG

TL;DR: The paper provides a unified theoretical framework that interprets Transformer components through field theory and manifold embeddings.

DetailsMotivation: Transformers lack a unified physical/mathematical interpretation for positional encoding and attention mechanisms.

Method: Map discrete positions to spatial functions on continuous manifolds and interpret Transformer layers as kernel-modulated operators.

Result: A structural theoretical framework integrating positional encoding, kernel integral operators, and attention mechanisms.

Conclusion: Transformer layers can be understood as kernel-modulated operators acting over embedded manifolds.

Abstract: The Transformer architecture has achieved tremendous success in natural language processing, computer vision, and scientific computing through its self-attention mechanism. However, its core components, positional encoding and attention mechanisms, have lacked a unified physical or mathematical interpretation. This paper proposes a structural theoretical framework that integrates positional encoding, kernel integral operators, and attention mechanisms for in-depth theoretical investigation. We map discrete positions (such as text token indices and image pixel coordinates) to spatial functions on continuous manifolds, enabling a field-theoretic interpretation of Transformer layers as kernel-modulated operators acting over embedded manifolds.
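
The kernel-operator viewpoint can be made concrete: a self-attention layer computes (Tf)(x_i) = Σ_j K(x_i, x_j) f(x_j), where K is the row-normalized softmax kernel over positions. The sketch below renders standard single-head attention in that form; the dimensions and random weights are toy choices, not anything from the paper.

```python
import numpy as np

def attention_as_kernel(X, Wq, Wk, Wv):
    """Single-head attention written as a discretized kernel integral:
    build the normalized kernel K(x_i, x_j) from queries and keys, then
    apply it as a linear operator to the value function V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    kernel = np.exp(logits - logits.max(axis=1, keepdims=True))
    kernel /= kernel.sum(axis=1, keepdims=True)   # rows sum to 1
    return kernel @ V                             # sum over positions j

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                       # 6 positions, dim 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = attention_as_kernel(X, Wq, Wk, Wv)
```

Replacing the discrete sum over j with an integral over a continuous manifold of positions gives the field-theoretic reading the paper develops.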

cs.MA

[462] Introduction to Automated Negotiation

Dave de Jonge

Main category: cs.MA

TL;DR: Introductory textbook on automated negotiation for CS students with no prerequisites, includes Python framework for implementing negotiation algorithms.

DetailsMotivation: To provide accessible entry point to automated negotiation for computer science students without requiring specialized background knowledge.

Method: Uses educational textbook approach with accompanying Python-based negotiation framework that allows hands-on implementation and experimentation.

Result: Creates accessible learning resource with practical tools for understanding and implementing automated negotiation algorithms.

Conclusion: Successfully provides beginner-friendly introduction to automated negotiation with practical implementation framework that can be easily adapted to other programming languages.

Abstract: This book is an introductory textbook targeted towards computer science students who are completely new to the topic of automated negotiation. It does not require any prerequisite knowledge, except for elementary mathematics and basic programming skills. This book comes with a simple toy-world negotiation framework implemented in Python that can be used by readers to implement their own negotiation algorithms and perform experiments with them. This framework is small and simple enough that any reader who prefers not to work in Python should be able to re-implement it very quickly in any other programming language of their choice.
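
To give a flavor of the kind of algorithm such a framework hosts, here is a toy single-issue alternating-offers negotiation with time-dependent concession. The strategies and numbers are illustrative assumptions, not the book's actual framework API.

```python
def concede(own_best, own_worst, t, deadline):
    """Linear time-dependent concession: the offer drifts from the
    agent's best outcome toward its reservation value as the deadline
    approaches."""
    frac = t / deadline
    return own_best + frac * (own_worst - own_best)

def negotiate(deadline=10):
    """One issue (price): the seller concedes downward from 100 toward
    60, the buyer upward from 40 toward 80; they agree at the midpoint
    once the offers cross."""
    for t in range(deadline + 1):
        seller_offer = concede(100, 60, t, deadline)
        buyer_offer = concede(40, 80, t, deadline)
        if buyer_offer >= seller_offer:
            return (buyer_offer + seller_offer) / 2, t
    return None, deadline

price, round_reached = negotiate()
```

With these symmetric linear strategies the offers cross at round 8 and settle at 70, and changing either agent's concession curve shifts both the agreement price and the round it is reached, which is the kind of experiment the book's framework is built for.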

[463] Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives

Romain Cosentino, Sarath Shekkizhar, Adam Earle

Main category: cs.MA

TL;DR: Theoretical framework for analyzing interactions between language model agents with misaligned objectives, showing biased equilibria and predictable residual errors based on objective gaps and prompt geometry.

DetailsMotivation: To understand and predict the dynamics of multi-agent systems where language model agents interact with conflicting goals, and to link prompt design to system stability and robustness.

Method: Develop theoretical framework for iterative gradient updates between two agents using each other’s outputs as inputs, analyze generation dynamics with misaligned objectives, establish convergence conditions, and validate with transformer models and GPT-5 on in-context linear regression.

Result: Misaligned objectives lead to biased equilibrium where neither agent reaches its target, with predictable residual errors. Conditions for asymmetric convergence established, and algorithm developed for achieving one-sided adversarial success.

Conclusion: The framework enables study, prediction, and defense of multi-agent systems by explicitly linking prompt design and interaction setup to stability, bias, and robustness properties.

Abstract: We develop a theoretical framework for agent-to-agent interactions in multi-agent scenarios. We consider the setup in which two language model based agents perform iterative gradient updates toward their respective objectives in-context, using the output of the other agent as input. We characterize the generation dynamics associated with the interaction when the agents have misaligned objectives, and show that this results in a biased equilibrium where neither agent reaches its target, with the residual errors predictable from the objective gap and the geometry induced by the prompt of each agent. We establish the conditions for asymmetric convergence and provide an algorithm that provably achieves an adversarial result, producing one-sided success. Experiments with trained transformer models as well as GPT-5 for the task of in-context linear regression validate the theory. Our framework presents a setup to study, predict, and defend multi-agent systems; explicitly linking prompt design and interaction setup to stability, bias, and robustness.
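
The biased-equilibrium phenomenon can be reproduced in a scalar toy model: two agents alternately take gradient steps on a shared state, each pulling it toward its own target. The quadratic objectives and step sizes below are toy assumptions, not the paper's in-context dynamics.

```python
def interact(target_a, target_b, steps=200, lr=0.1):
    """Agents A and B alternately apply gradient steps on x for their
    own squared-error objectives (x - target)^2, each consuming the
    other's latest output as its input."""
    x = 0.0
    for _ in range(steps):
        x = x - lr * 2 * (x - target_a)   # agent A's gradient step
        x = x - lr * 2 * (x - target_b)   # agent B responds to A's output
    return x

x_star = interact(1.0, 3.0)
```

With targets 1 and 3, the iterates converge to 19/9 ≈ 2.11: strictly between the two targets (neither agent succeeds) and biased toward the last mover, with the residual errors determined by the objective gap and the step sizes, mirroring the paper's qualitative prediction.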

[464] Achieving Equilibrium under Utility Heterogeneity: An Agent-Attention Framework for Multi-Agent Multi-Objective Reinforcement Learning

Zhuhui Li, Chunbo Luo, Liming Huang, Luyu Qi, Geyong Min

Main category: cs.MA

TL;DR: AA-MAMORL is a multi-agent multi-objective RL framework that learns joint beliefs over other agents’ utility functions during centralized training, enabling decentralized execution without communication while approximating Bayesian Nash Equilibrium.

DetailsMotivation: Existing MAMOS methods struggle with heterogeneous objectives and utility functions, intensifying training non-stationarity due to private utility functions and associated policies.

Method: Propose Agent-Attention Multi-Agent Multi-Objective RL (AA-MAMORL) that implicitly learns joint beliefs over other agents’ utility functions during centralized training, mapping global states and utilities to each agent’s policy.

Result: AA-MAMORL significantly outperforms state-of-the-art methods in both custom MAMO Particle environment and MOMALand benchmark, demonstrating improved performance with access to global preferences.

Conclusion: Access to global utility functions is necessary for Bayesian Nash Equilibrium under decentralized execution constraints, and AA-MAMORL effectively addresses this while preserving decentralized execution without inter-agent communication.

Abstract: Multi-agent multi-objective systems (MAMOS) have emerged as powerful frameworks for modelling complex decision-making problems across various real-world domains, such as robotic exploration, autonomous traffic management, and sensor network optimisation. MAMOS offers enhanced scalability and robustness through decentralised control and more accurately reflects inherent trade-offs between conflicting objectives. In MAMOS, each agent uses utility functions that map return vectors to scalar values. Existing MAMOS optimisation methods face challenges in handling heterogeneous objective and utility function settings, where training non-stationarity is intensified due to private utility functions and the associated policies. In this paper, we first theoretically prove that direct access to, or structured modeling of, global utility functions is necessary for the Bayesian Nash Equilibrium under decentralised execution constraints. To access the global utility functions while preserving the decentralised execution, we propose an Agent-Attention Multi-Agent Multi-Objective Reinforcement Learning (AA-MAMORL) framework. Our approach implicitly learns a joint belief over other agents’ utility functions and their associated policies during centralised training, effectively mapping global states and utilities to each agent’s policy. In execution, each agent independently selects actions based on local observations and its private utility function to approximate a BNE, without relying on inter-agent communication. We conduct comprehensive experiments in both a custom-designed MAMO Particle environment and the standard MOMALand benchmark. The results demonstrate that access to global preferences and our proposed AA-MAMORL significantly improve performance and consistently outperform state-of-the-art methods.
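
Two ingredients of this setting can be sketched minimally: a utility function mapping a vector return to a scalar, and an attention step over agents that a centralized trainer could use to form beliefs about others' utilities. Both the linear utilities and the attention form are illustrative assumptions, not AA-MAMORL's architecture.

```python
import numpy as np

def linear_utility(returns, weights):
    """Scalarize a vector return; weights encode an agent's private
    preferences over the objectives."""
    return returns @ weights

def agent_attention(queries, keys, values):
    """Softmax attention across agents: each agent aggregates the
    others' (utility-related) features into a belief vector."""
    logits = queries @ keys.T / np.sqrt(keys.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ values

returns = np.array([3.0, 1.0])               # two objectives, e.g. speed/safety
u_fast = linear_utility(returns, np.array([0.9, 0.1]))
u_safe = linear_utility(returns, np.array([0.2, 0.8]))

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))                  # 3 agents, feature dim 4
Kmat = rng.normal(size=(3, 4))
Vmat = rng.normal(size=(3, 2))
belief = agent_attention(Q, Kmat, Vmat)      # one belief vector per agent
```

The same joint return thus yields different scalar values for agents with heterogeneous utilities, which is exactly why some structured access to others' utilities is needed to approximate a Bayesian Nash Equilibrium.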

[465] Learning Efficient Communication Protocols for Multi-Agent Reinforcement Learning

Xinren Zhang, Jiadong Yu, Zixin Zhong

Main category: cs.MA

TL;DR: A framework for learning efficient multi-round communication protocols in MARL using three novel metrics to optimize communication topology and messaging while maintaining task performance.

DetailsMotivation: Existing MARL research lacks thorough investigation of communication protocol optimization, leading to inefficient information exchange with redundant messages among agents.

Method: Proposed a generalized framework with three Communication Efficiency Metrics (CEMs): IEI and SEI for efficiency-augmented optimization, and TEI for explicit evaluation. Integrated IEI and SEI as adjusted loss functions to promote informative messaging and role specialization.

Result: The learned communication protocol significantly enhances communication efficiency and achieves better cooperation performance with improved success rates in comprehensive experiments.

Conclusion: The framework successfully addresses communication inefficiency in MARL by optimizing both communication topology and messaging through novel efficiency metrics.

Abstract: Multi-Agent Systems (MAS) have emerged as a powerful paradigm for modeling complex interactions among autonomous entities in distributed environments. In Multi-Agent Reinforcement Learning (MARL), communication enables coordination but can lead to inefficient information exchange, since agents may generate redundant or non-essential messages. While prior work has focused on boosting task performance with information exchange, the existing research lacks a thorough investigation of both the appropriate definition and the optimization of communication protocols (communication topology and message). To fill this gap, we introduce a generalized framework for learning multi-round communication protocols that are both effective and efficient. Within this framework, we propose three novel Communication Efficiency Metrics (CEMs) to guide and evaluate the learning process: the Information Entropy Efficiency Index (IEI) and Specialization Efficiency Index (SEI) for efficiency-augmented optimization, and the Topology Efficiency Index (TEI) for explicit evaluation. We integrate IEI and SEI as the adjusted loss functions to promote informative messaging and role specialization, while using TEI to quantify the trade-off between communication volume and task performance. Through comprehensive experiments, we demonstrate that our learned communication protocol can significantly enhance communication efficiency and achieves better cooperation performance with improved success rates.
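
An entropy-style efficiency measure in the spirit of the proposed IEI can be sketched as follows; the normalized-entropy form and binning below are assumptions, as the paper's exact definition may differ.

```python
import numpy as np

def message_entropy(messages, bins=8):
    """Mean normalized entropy over message dimensions, in [0, 1].
    messages: (num_messages, dim) with values in [0, 1]. Near-zero
    entropy flags redundant messaging; values near 1 indicate the
    channel carries diverse information."""
    ents = []
    for dim in messages.T:
        hist, _ = np.histogram(dim, bins=bins, range=(0.0, 1.0))
        p = hist / hist.sum()
        p = p[p > 0]
        ents.append(-(p * np.log(p)).sum() / np.log(bins))
    return float(np.mean(ents))

rng = np.random.default_rng(0)
informative = rng.uniform(size=(256, 4))   # diverse messages across agents
redundant = np.full((256, 4), 0.5)         # every agent sends the same thing
```

Used as a training signal (as IEI is in the adjusted loss), such a metric rewards messages that actually vary with the sender's situation rather than repeating a constant.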

[466] Enhancing PIBT via Multi-Action Operations

Egor Yukhnevich, Anton Andreychuk

Main category: cs.MA

TL;DR: Enhanced PIBT with multi-action operations improves performance in orientation-constrained MAPF scenarios while maintaining efficiency, achieving state-of-the-art results in online LMAPF-T.

DetailsMotivation: Standard PIBT's short-horizon design performs poorly when agents have orientation constraints and need time-consuming rotation actions, limiting its effectiveness in such scenarios.

Method: Enhanced PIBT with multi-action operations, combined with graph-guidance technique and large neighborhood search optimization.

Result: Achieves state-of-the-art performance in online LMAPF-T setting while preserving PIBT’s hallmark efficiency.

Conclusion: The enhanced PIBT successfully addresses orientation constraints in MAPF problems through multi-action operations, maintaining efficiency while significantly improving performance in challenging scenarios.

Abstract: PIBT is a rule-based Multi-Agent Path Finding (MAPF) solver, widely used as a low-level planner or action sampler in many state-of-the-art approaches. Its primary advantage lies in its exceptional speed, enabling action selection for thousands of agents within milliseconds by considering only the immediate next timestep. However, this short-horizon design leads to poor performance in scenarios where agents have orientation and must perform time-consuming rotation actions. In this work, we present an enhanced version of PIBT that addresses this limitation by incorporating multi-action operations. We detail the modifications introduced to improve PIBT’s performance while preserving its hallmark efficiency. Furthermore, we demonstrate how our method, when combined with graph-guidance technique and large neighborhood search optimization, achieves state-of-the-art performance in the online LMAPF-T setting.
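
The core of the multi-action extension is that, for an agent with orientation, reaching a neighbor cell may cost several timesteps (rotations plus the forward move), and the planner should see that full cost rather than a single step. The operation set below is an illustrative reconstruction, not the paper's exact formulation.

```python
# Headings on a 4-connected grid, in clockwise order.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # E, S, W, N

def multi_action_ops(pos, heading_idx):
    """Enumerate multi-action operations for an oriented agent: each
    entry is (target_cell, num_timesteps), where the cost bundles the
    minimum number of 90-degree rotations with the forward move."""
    ops = [(pos, 1)]                            # wait in place, 1 timestep
    for i, (dx, dy) in enumerate(HEADINGS):
        turns = min((i - heading_idx) % 4, (heading_idx - i) % 4)
        target = (pos[0] + dx, pos[1] + dy)
        ops.append((target, turns + 1))         # rotations + forward move
    return ops

ops = multi_action_ops((0, 0), heading_idx=0)   # agent at origin, facing East
```

A one-step planner like standard PIBT treats all four neighbors as equally reachable; scoring operations by their true multi-timestep cost is what lets the enhanced version avoid wasteful turn-heavy choices.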

[467] Game Theory and Multi-Agent Reinforcement Learning for Zonal Ancillary Markets

Francesco Morri, Hélène Le Cadre, Pierre Gruet, Luce Brotcorne

Main category: cs.MA

TL;DR: This paper analyzes zonal ancillary market coupling using game theory, formulating it as a bilevel problem and generalized Nash game. It compares three equilibrium computation methods: integrated optimization, Gauss-Seidel best-response, and multi-agent deep reinforcement learning.

DetailsMotivation: To characterize and analyze zonal ancillary market coupling through noncooperative game theory, addressing the need for efficient market equilibrium computation methods in electricity markets.

Method: Formulated the ancillary market as a multi-leader single follower bilevel problem, cast as a generalized Nash game. Used three approaches: integrated optimization, Gauss-Seidel best-response, and multi-agent deep reinforcement learning. Applied methods to real data from Germany and Austria.

Result: Multi-agent deep reinforcement learning achieved smallest convergence rate but required pretraining, while best-response was slowest. Deep reinforcement learning resulted in smaller market costs but higher profit allocation variability. Stronger zone coupling reduced costs for larger zones.

Conclusion: Multi-agent deep reinforcement learning shows promise for market equilibrium computation but requires careful handling of profit allocation variability. Zone coupling benefits larger zones economically.

Abstract: We characterize zonal ancillary market coupling relying on noncooperative game theory. To that purpose, we formulate the ancillary market as a multi-leader single follower bilevel problem, that we subsequently cast as a generalized Nash game with side constraints and nonconvex feasibility sets. We determine conditions for equilibrium existence and show that the game has a generalized potential game structure. To compute market equilibrium, we rely on two exact approaches: an integrated optimization approach and Gauss-Seidel best-response, that we compare against multi-agent deep reinforcement learning. On real data from Germany and Austria, simulations indicate that multi-agent deep reinforcement learning achieves the smallest convergence rate but requires pretraining, while best-response is the slowest. On the economics side, multi-agent deep reinforcement learning results in smaller market costs compared to the exact methods, but at the cost of higher variability in the profit allocation among stakeholders. Further, stronger coupling between zones tends to reduce costs for larger zones.
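
Gauss-Seidel best-response, one of the two exact schemes compared above, is easy to show on a toy two-player game: players update sequentially, each best-responding to the other's current strategy. The quadratic costs below are illustrative, not the ancillary-market model.

```python
def best_response_gs(steps=50):
    """Sequential best-response for two players with quadratic costs.
    Player 1 minimizes x^2 - (2 - y) x; player 2 minimizes
    y^2 - (1 - x) y; each closed-form argmin uses the other's latest
    strategy (Gauss-Seidel ordering)."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        x = (2.0 - y) / 2.0     # player 1's best response to y
        y = (1.0 - x) / 2.0     # player 2's best response to the new x
    return x, y

x_eq, y_eq = best_response_gs()
```

For this game the iteration contracts to the Nash equilibrium (x, y) = (1, 0); in the zonal market the same scheme runs over the generalized Nash game with side constraints, which is why it converges but, per the experiments, slowly.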

[468] Rainbow Delay Compensation: A Multi-Agent Reinforcement Learning Framework for Mitigating Delayed Observation

Songchen Fu, Siang Chen, Shaojing Zhao, Letian Bai, Ta Li, Yonghong Yan

Main category: cs.MA

TL;DR: Proposes Rainbow Delay Compensation (RDC), a MARL framework to handle stochastic individual observation delays in multi-agent systems, achieving near delay-free performance.

DetailsMotivation: Observation delays are common in real-world multi-agent systems, preventing agents from making decisions based on true environmental states, which severely degrades MARL performance.

Method: Formulates DSID-POMDP model for decentralized stochastic individual delays, then proposes RDC framework with recommended implementations for its modules, tested on MPE and SMAC benchmarks.

Result: Baseline MARL methods suffer severe performance degradation under delays, while RDC-enhanced approach mitigates this issue and achieves ideal delay-free performance in certain delay scenarios with good generalizability.

Conclusion: Provides a novel perspective on multi-agent delayed observation problems and offers an effective solution framework that maintains performance despite observation delays.

Abstract: In real-world multi-agent systems (MASs), observation delays are ubiquitous, preventing agents from making decisions based on the environment’s true state. An individual agent’s local observation typically comprises multiple components from other agents or dynamic entities within the environment. These discrete observation components with varying delay characteristics pose significant challenges for multi-agent reinforcement learning (MARL). In this paper, we first formulate the decentralized stochastic individual delay partially observable Markov decision process (DSID-POMDP) by extending the standard Dec-POMDP. We then propose the Rainbow Delay Compensation (RDC), a MARL training framework for addressing stochastic individual delays, along with recommended implementations for its constituent modules. We implement the DSID-POMDP’s observation generation pattern using standard MARL benchmarks, including MPE and SMAC. Experiments demonstrate that baseline MARL methods suffer severe performance degradation under fixed and unfixed delays. The RDC-enhanced approach mitigates this issue, remarkably achieving ideal delay-free performance in certain delay scenarios while maintaining generalizability. Our work provides a novel perspective on multi-agent delayed observation problems and offers an effective solution framework. The source code is available at https://github.com/linkjoker1006/RDC-pymarl.
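
The observation pattern formalized by the DSID-POMDP, each component of an agent's local observation arriving with its own delay, can be sketched with a small buffer. The per-component delays and fallback policy below are illustrative assumptions.

```python
import numpy as np

def delayed_view(history, t, delays):
    """Assemble the observation an agent actually receives at step t:
    component c shows the true value from step t - delays[c] (clamped
    to the start of the episode), so the agent sees a stale mixture
    rather than the environment's current state."""
    view = []
    for comp, d in enumerate(delays):
        src = max(0, t - d)
        view.append(history[src][comp])
    return np.array(view)

# True per-step states with three observation components.
history = [np.array([t, 10 * t, 100 * t]) for t in range(6)]
delays = [0, 2, 4]                      # per-component delays sampled at t=5
obs = delayed_view(history, 5, delays)
```

At t=5 the agent observes components from steps 5, 3, and 1 simultaneously; compensating for exactly this kind of mixed staleness is what the RDC modules are trained to do.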

[469] Distributionally Robust Markov Games with Average Reward

Zachary Roch, Yue Wang

Main category: cs.MA

TL;DR: The paper studies distributionally robust Markov games with average-reward criterion, establishing connections between best-response policies and optimal policies, proving existence of stationary Nash Equilibrium, and introducing Robust Nash-Iteration with convergence guarantees.

DetailsMotivation: To provide a comprehensive theoretical and algorithmic foundation for multi-agent decision-making under uncertainty over extended horizons in complex environments.

Method: Established connections between best-response policies and optimal policies, derived robust Bellman equation solutions, constructed set-valued maps, introduced Robust Nash-Iteration algorithm, and connected average-reward NE to discounted robust equilibria.

Result: Proved existence of stationary Nash Equilibrium under irreducible assumption and weakly communicating setting, provided convergence guarantees for Robust Nash-Iteration, and showed approximation of average-reward NE as discount factor approaches one.

Conclusion: The study provides a solid theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments through comprehensive analysis of distributionally robust Markov games.

Abstract: We study distributionally robust Markov games (DR-MGs) with the average-reward criterion, a crucial framework for multi-agent decision-making under uncertainty over extended horizons. We first establish a connection between the best-response policies and the optimal policies for the induced single-agent problems. Under a standard irreducibility assumption, we derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, and establish the existence of a stationary Nash Equilibrium (NE) based on these results. We also study a more general weakly communicating setting. We construct a set-valued map and show that its value is a subset of the best-response policies, convex, and upper hemi-continuous, which implies the existence of an NE. We then introduce Robust Nash-Iteration and provide convergence guarantees. Finally, we connect average-reward NE to discounted robust equilibria, showing approximation as the discount factor approaches one. Our study provides a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments.

cs.MM

[470] Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding

Jingtian Ma, Jingyuan Wang, Wayne Xin Zhao, Guoping Liu, Xiang Wen

Main category: cs.MM

TL;DR: ST-CLIP is a novel model that integrates spatio-temporal information into vision-language models for Traffic Scene Understanding, using CLIP as backbone with spatio-temporal context aware multi-aspect prompt learning.

DetailsMotivation: Current Traffic Scene Understanding research often ignores spatio-temporal information and interrelations between different traffic scene aspects, treating it as common image understanding without leveraging spatio-temporal data from navigation apps.

Method: Uses a CLIP backbone with Spatio-temporal Context Aware Multi-aspect Prompt (SCAMP) learning, including dynamic spatio-temporal context representation and bi-level ST-aware multi-aspect prompt learning that integrates ST-context vectors into word embeddings.

Result: Demonstrates superior performance on two real-world datasets in complex scene understanding scenarios with few-shot learning strategy.

Conclusion: First successful integration of spatio-temporal information into vision-language models for Traffic Scene Understanding, showing improved performance in complex scenarios.

Abstract: Nowadays, navigation and ride-sharing apps have collected numerous images with spatio-temporal data. A core technology for analyzing such images, associated with spatio-temporal information, is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, the dependence on both spatio-temporal and visual-textual data introduces distinct challenges to the TSU task. However, recent research often treats TSU as a common image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel Spatio-Temporal Enhanced Model based on CLIP (ST-CLIP) for TSU. Our model uses the classic vision-language model, CLIP, as the backbone, and designs a Spatio-temporal Context Aware Multi-aspect Prompt (SCAMP) learning method to incorporate spatio-temporal information into TSU. The prompt learning method consists of two components: a dynamic spatio-temporal context representation module that extracts representation vectors of spatio-temporal data for each traffic scene image, and a bi-level ST-aware multi-aspect prompt learning module that integrates the ST-context representation vectors into word embeddings of prompts for the CLIP model. The second module also extracts low-level visual features and image-wise high-level semantic features to exploit interactive relations among different aspects of traffic scenes. To the best of our knowledge, this is the first attempt to integrate spatio-temporal information into vision-language models to facilitate the TSU task. Experiments on two real-world datasets demonstrate superior performance in complex scene understanding scenarios with a few-shot learning strategy.

[471] MCAD: Multimodal Context-Aware Audio Description Generation For Soccer

Lipisha Chaudhary, Trisha Mittal, Subhadra Gopalakrishnan, Ifeoma Nwogu, Jaclyn Pytlarz

Main category: cs.MM

TL;DR: MCAD is an end-to-end pipeline for generating Audio Descriptions (AD) for soccer games without ground truth AD, using a fine-tuned VideoLLM and multimodal contextual cues, with a new evaluation metric ARGE-AD.

DetailsMotivation: To extend AD generation beyond movies to sports (soccer) without relying on ground truth AD, addressing the lack of domain-specific AD datasets.

Method: Fine-tune VideoLLM on movie AD datasets, incorporate multimodal cues (player identities, events, commentary), and use input prompts to generate AD text for video segments.

Result: Produced complete AD text for soccer game clips, validated with new ARGE-AD metric across domains, and contributed expert-annotated AD for 100 soccer clips.

Conclusion: MCAD successfully generates AD for soccer games without ground truth, with ARGE-AD providing effective evaluation, expanding AD automation to new domains.

Abstract: Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent works have shown a promising step towards automating AD, but they have been limited to describing high-quality movie content using human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned VideoLLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people’s names, (ii) mention of actions and events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary or subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the use of this metric to quantitatively assess the quality of generated AD across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts.

[472] RFNNS: Robust Fixed Neural Network Steganography with Universal Text-to-Image Models

Yu Cheng, Jiuan Zhou, Jiawei Chen, Zhaoxia Yin, Xinpeng Zhang

Main category: cs.MM

TL;DR: RFNNS improves fixed neural network steganography by embedding perturbations in complex texture regions and using robust perturbation generation, achieving better visual quality and robustness against attacks.

DetailsMotivation: Address limitations of Fixed Neural Network Steganography (FNNS) which exhibits noticeable distortion and limited robustness, compromising security and practical applicability.

Method: Uses texture-aware localization to embed perturbations in complex texture regions, and robust steganographic perturbation generation (RSPG) strategy combined with AI-generated cover images.

Result: Achieves 23% average SSIM increase for recovered secret images under common attacks, and reduces LPIPS value to 39% of SOTA method against unknown attacks.

Conclusion: RFNNS significantly enhances robustness and practical value for covert communication compared to existing FNNS methods.

Abstract: With the rapid development of generative AI, image steganography has garnered widespread attention due to its unique concealment. Recent studies have demonstrated the practical advantages of Fixed Neural Network Steganography (FNNS), notably its ability to achieve stable information embedding and extraction without any additional network training. However, the stego images generated by FNNS still exhibit noticeable distortion and limited robustness. These drawbacks compromise the security of the embedded information and restrict the practical applicability of the method. To address these limitations, we propose Robust Fixed Neural Network Steganography (RFNNS). Specifically, a texture-aware localization technique selectively embeds perturbations carrying secret information into regions of complex textures, effectively preserving visual quality. Additionally, a robust steganographic perturbation generation (RSPG) strategy is designed to enhance the decoding accuracy, even under common and unknown attacks. These robust perturbations are combined with AI-generated cover images to produce stego images. Experimental results demonstrate that RFNNS significantly improves robustness compared to state-of-the-art FNNS methods, achieving an average increase in SSIM of 23% for recovered secret images under common attacks. Furthermore, the LPIPS value of recovered secret images under previously unknown attacks was reduced by RFNNS to 39% of that of the SOTA method, underscoring its practical value for covert communication.

eess.AS

[473] ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, Yonghui Wu

Main category: eess.AS

TL;DR: ParaS2S is a reinforcement learning framework for speech-to-speech models that optimizes both content and speaking style at waveform level, achieving 11% improvement over supervised fine-tuning.

DetailsMotivation: Speech-to-Speech models lack ability to handle paralinguistic cues (emotion, tone, speaker attributes) and respond appropriately in both content and style, with limited high-quality demonstrations available.

Method: Introduces ParaS2S RL framework with ParaS2SBench benchmark for evaluating content/style appropriateness, and uses Group Relative Policy Optimization (GRPO) to learn from unlabeled speech with scalable scoring feedback.

Result: Existing S2S models fail on paralinguistic attributes, performing no better than pipeline baselines. ParaS2S achieves 11% relative improvement in content/style appropriateness over supervised fine-tuning, requiring fewer annotations.

Conclusion: The RL approach effectively improves paralinguistic-aware S2S capabilities, surpassing prior models while being more annotation-efficient than pure supervised fine-tuning.

Abstract: Speech-to-Speech (S2S) models have shown promising dialogue capabilities, but their ability to handle paralinguistic cues–such as emotion, tone, and speaker attributes–and to respond appropriately in both content and style remains underexplored. Progress is further hindered by the scarcity of high-quality and expressive demonstrations. To address this, we introduce a novel reinforcement learning (RL) framework for paralinguistic-aware S2S, ParaS2S, which evaluates and optimizes both content and speaking style directly at the waveform level. We first construct ParaS2SBench, a benchmark that comprehensively evaluates S2S models’ outputs for content and style appropriateness on diverse and challenging input queries. It scores the fitness of input-output pairs and aligns well with human judgments, serving as an automatic judge for model outputs. With this scalable scoring feedback, we enable the model to explore and learn from diverse unlabeled speech via Group Relative Policy Optimization (GRPO). Experiments show that existing S2S models fail to respond appropriately to paralinguistic attributes, performing no better than pipeline-based baselines. Our RL approach achieves an 11% relative improvement in the appropriateness of response content and style on ParaS2SBench over supervised fine-tuning (SFT), surpassing all prior models while requiring substantially fewer warm-up annotations than pure SFT.
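
GRPO's central step, scoring a group of responses sampled for the same query and standardizing each reward against its group, can be sketched as follows. This is the generic group-relative advantage computation, not code from the paper; in ParaS2S the scalar rewards would come from the ParaS2SBench scorer, and the exact normalization details may differ.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each reward within its group.

    rewards: list of scalar scores for a group of responses sampled from
             the same query (e.g. content/style fitness from a judge).
    Returns one advantage per response; responses scoring above the group
    mean get a positive advantage, below-mean responses a negative one,
    so no separate learned value function is needed.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

The advantages then weight the policy-gradient update exactly as in PPO-style objectives, which is what lets the model learn from unlabeled speech using only the benchmark's scoring feedback.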

[474] Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask

Tianzi Wang, Xurong Xie, Zengrui Jin, Mengzhe Geng, Jiajun Deng, Zhaoqing Li, Shoukang Hu, Shujie Hu, Guinan Li, Mingyu Cui, Helen Meng, Xunying Liu

Main category: eess.AS

TL;DR: Proposes a non-autoregressive block-based attention mask decoder (AMD) that enables parallel inference within output blocks while maintaining monotonic left-to-right prediction between blocks, achieving significant decoding speedup without accuracy loss.

DetailsMotivation: Autoregressive Transformer decoders in ASR systems limit efficient inference parallelization due to their sequential nature. Non-autoregressive approaches aim to achieve decoding speedup while maintaining comparable accuracy to AR baselines.

Method: Developed AMD that performs parallel inference within contiguous blocks of output labels with monotonic left-to-right prediction between blocks. Designed one-pass beam search to dynamically fuse CTC, AR decoder, and AMD probabilities. Tested on Conformer encoder-decoder ASR with filterbank features, WavLM integration, and LLM-based decoder.

Result: Achieved decoding speedup ratios of 1.44x, 1.55x, and 2.31x across three model configurations without statistically significant WER increases. When operating with comparable RTFs, achieved statistically significant WER reductions of 0.19%, 0.62% and 0.13% absolute (4.3%, 16.3%, and 3.8% relative) on LS960 task.

Conclusion: The proposed AMD-enabled tripartite decoder provides flexible performance-efficiency trade-off, achieving both decoding speedup and accuracy improvements across different ASR system configurations including Conformer and LLM-based systems.

Abstract: Automatic speech recognition (ASR) systems often rely on autoregressive (AR) Transformer decoder architectures, which limit efficient inference parallelization due to their sequential nature. To this end, non-autoregressive (NAR) approaches aim primarily to achieve significant decoding speedup while maintaining recognition accuracy comparable to that of AR baselines. This paper proposes a novel NAR block-based attention mask decoder (AMD) that effectively improves decoding efficiency while maintaining ASR accuracy, and also offers flexibility in balancing the performance-efficiency trade-off on both Conformer and large language model (LLM)-based ASR systems. The proposed AMD performs parallel inference within contiguous blocks of output labels while maintaining monotonic left-to-right prediction between blocks. A one-pass beam search algorithm is designed to dynamically fuse Connectionist Temporal Classification (CTC), AR decoder, and AMD probabilities. Experiments are conducted on normal speech LS960 and DBank elderly speech across: a) The Conformer encoder-decoder ASR system with filterbank input features; b) its integration with WavLM features; and c) further advancement by integrating an LLM-based decoder. On the LS960 task, the proposed AMD-empowered tripartite decoder achieves decoding speedup ratios of up to 1.44x, 1.55x, and 2.31x under the three model configurations over the CTC + AR baselines, without statistically significant WER increases. When operating with real-time factors (RTFs) comparable to the baselines, the tripartite decoder produces statistically significant WER reductions of 0.19%, 0.62% and 0.13% absolute (4.3%, 16.3%, and 3.8% relative). Similar improvements are also obtained on the DBank task.
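
The mask described above (parallel attention within a contiguous block, strict left-to-right ordering across blocks) can be sketched as a simple matrix builder. This is an illustrative reconstruction of the masking pattern, not the paper's code.

```python
def block_attention_mask(seq_len, block_size):
    """Attention mask for block-based non-autoregressive decoding.

    Position i may attend to every position j in its own block or in any
    earlier block (bidirectional within a block, monotonic left-to-right
    across blocks). Returns a seq_len x seq_len matrix of 0/1 flags
    (1 = may attend).
    """
    mask = [[0] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # Last index of the block that contains position i.
        block_end = (i // block_size + 1) * block_size - 1
        for j in range(seq_len):
            mask[i][j] = 1 if j <= min(block_end, seq_len - 1) else 0
    return mask
```

With block_size=1 this reduces to the standard causal AR mask, and with block_size equal to the sequence length it becomes fully bidirectional, which is how the block size controls the performance-efficiency trade-off.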

eess.IV

[475] SAMora: Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images

Shuhang Chen, Hangjie Yuan, Pengwei Liu, Hanxue Gu, Tao Feng, Dong Ni

Main category: eess.IV

TL;DR: SAMora is a framework that enhances SAM for medical image segmentation by using hierarchical self-supervised learning at image, patch, and pixel levels, achieving SOTA performance with 90% fewer fine-tuning epochs.

DetailsMotivation: SAM's performance is limited with small labeled datasets, while medical data contains valuable hierarchical information that is often overlooked.

Method: Proposes SAMora with complementary self-supervised learning objectives at image, patch, and pixel levels, and introduces HL-Attn for hierarchical fusion of multi-scale features while maintaining their distinct characteristics.

Result: Outperforms existing SAM variants on Synapse, LA, and PROMISE12 datasets in both few-shot and fully supervised settings, achieving state-of-the-art performance.

Conclusion: SAMora effectively captures hierarchical medical knowledge and is compatible with various SAM variants, demonstrating superior performance with significantly reduced training time.

Abstract: The Segment Anything Model (SAM) has demonstrated significant potential in medical image segmentation. Yet, its performance is limited when only a small amount of labeled data is available, while there is abundant valuable yet often overlooked hierarchical information in medical data. To address this limitation, we draw inspiration from self-supervised learning and propose SAMora, an innovative framework that captures hierarchical medical knowledge by applying complementary self-supervised learning objectives at the image, patch, and pixel levels. To fully exploit the complementarity of hierarchical knowledge within LoRAs, we introduce HL-Attn, a hierarchical fusion module that integrates multi-scale features while maintaining their distinct characteristics. SAMora is compatible with various SAM variants, including SAM2, SAMed, and H-SAM. Experimental results on the Synapse, LA, and PROMISE12 datasets demonstrate that SAMora outperforms existing SAM variants. It achieves state-of-the-art performance in both few-shot and fully supervised settings while reducing fine-tuning epochs by 90%. The code is available at https://github.com/ShChen233/SAMora.

[476] Robust Multi-modal Task-oriented Communications with Redundancy-aware Representations

Jingwen Fu, Ming Xiao, Zhonghao Lyu, Mikael Skoglund, Celimuge Wu

Main category: eess.IV

TL;DR: A robust multi-modal task-oriented communication framework using two-stage variational information bottleneck with mutual information minimization to compress inter-modal redundancy and enhance semantic reliability under channel noise.

DetailsMotivation: To address the challenge of simultaneously compressing inter-modal redundancy and improving semantic reliability under channel distortion in multi-modal semantic communications.

Method: Two-stage variational information bottleneck framework: (1) uni-modal VIB for modality-specific compression, (2) mutual information minimization with adversarial training to suppress cross-modal dependencies, and (3) multi-modal VIB for fused representation compression and robustness enhancement.

Result: Significantly outperforms existing baselines in accuracy and reliability on multi-modal emotion recognition tasks, particularly under low signal-to-noise ratio regimes.

Conclusion: Provides a principled framework that jointly optimizes modality-specific compression, inter-modal redundancy, and communication reliability for efficient multi-modal semantic communications.

Abstract: Semantic communications for multi-modal data can transmit task-relevant information efficiently over noisy and bandwidth-limited channels. However, a key challenge is to simultaneously compress inter-modal redundancy and improve semantic reliability under channel distortion. To address the challenge, we propose a robust and efficient multi-modal task-oriented communication framework that integrates a two-stage variational information bottleneck (VIB) with mutual information (MI) redundancy minimization. In the first stage, we apply uni-modal VIB to compress each modality separately, i.e., text, audio, and video, while preserving task-specific features. To enhance efficiency, an MI minimization module with adversarial training is then used to suppress cross-modal dependencies and to promote complementarity rather than redundancy. In the second stage, a multi-modal VIB is further used to compress the fused representation and to enhance robustness against channel distortion. Experimental results on multi-modal emotion recognition tasks demonstrate that the proposed framework significantly outperforms existing baselines in accuracy and reliability, particularly under low signal-to-noise ratio regimes. Our work provides a principled framework that jointly optimizes modality-specific compression, inter-modal redundancy, and communication reliability.

[477] Fluence Map Prediction with Deep Learning: A Transformer-based Approach

Ujunwa Mgboh, Rafi Sultan, Dongxiao Zhu, Joshua Kim

Main category: eess.IV

TL;DR: Deep learning framework using 3D Swin-UNETR to predict fluence maps directly from CT images and contours for prostate IMRT, achieving high accuracy and clinical quality.

DetailsMotivation: Conventional fluence map optimization in IMRT is time-consuming and dependent on planner expertise, requiring an automated solution to accelerate treatment planning while maintaining quality.

Method: End-to-end 3D Swin-UNETR transformer network trained on 99 prostate IMRT cases (79 training, 20 testing) to predict nine-beam fluence maps from CT images and anatomical contours using hierarchical self-attention.

Result: Achieved R^2 of 0.95 ± 0.02, MAE of 0.035 ± 0.008, gamma passing rate of 85 ± 10% (3%/3mm), with no significant DVH differences between predicted and clinical plans.

Conclusion: The framework enables fully automated, inverse-free fluence map prediction with enhanced spatial coherence, accuracy, and efficiency, offering a scalable solution for automated IMRT plan generation.

Abstract: Accurate fluence map prediction is essential in intensity-modulated radiation therapy (IMRT) to maximize tumor coverage while minimizing dose to healthy tissues. Conventional optimization is time-consuming and dependent on planner expertise. This study presents a deep learning framework that accelerates fluence map generation while maintaining clinical quality. An end-to-end 3D Swin-UNETR network was trained to predict nine-beam fluence maps directly from volumetric CT images and anatomical contours using 99 prostate IMRT cases (79 for training and 20 for testing). The transformer-based model employs hierarchical self-attention to capture both local anatomical structures and long-range spatial dependencies. Predicted fluence maps were imported into the Eclipse Treatment Planning System for dose recalculation, and model performance was evaluated using beam-wise fluence correlation, spatial gamma analysis, and dose-volume histogram (DVH) metrics. The proposed model achieved an average R^2 of 0.95 ± 0.02, MAE of 0.035 ± 0.008, and gamma passing rate of 85 ± 10% (3%/3 mm) on the test set, with no significant differences observed in DVH parameters between predicted and clinical plans. The Swin-UNETR framework enables fully automated, inverse-free fluence map prediction directly from anatomical inputs, enhancing spatial coherence, accuracy, and efficiency while offering a scalable and consistent solution for automated IMRT plan generation.
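
The 3%/3mm gamma analysis used to validate the predicted fluence maps can be illustrated with a simplified 1D version. Clinical gamma analysis is performed in 2D/3D and typically interpolates between sample points, so this sketch only conveys the idea of the metric.

```python
def gamma_pass_rate(reference, evaluated, spacing_mm,
                    dose_tol=0.03, dist_tol_mm=3.0):
    """Simplified 1D global gamma analysis (3%/3mm by default).

    reference, evaluated: dose profiles sampled on the same grid.
    spacing_mm: distance between neighboring samples.
    For each reference point, gamma is the minimum over evaluated points of
    sqrt((dist/dist_tol)^2 + (dose_diff/dose_tol)^2); the point passes if
    gamma <= 1. Dose differences are normalized to the reference maximum
    (global normalization).
    """
    d_max = max(reference)
    passed = 0
    for i, d_ref in enumerate(reference):
        gammas = []
        for j, d_eval in enumerate(evaluated):
            dist = abs(i - j) * spacing_mm / dist_tol_mm
            diff = (d_eval - d_ref) / (dose_tol * d_max)
            gammas.append((dist * dist + diff * diff) ** 0.5)
        if min(gammas) <= 1.0:
            passed += 1
    return passed / len(reference)
```

A passing rate of 85% under 3%/3mm, as reported above, means 85% of reference points find an evaluated point within this combined dose-distance tolerance.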

[478] 3D-TDA – Topological feature extraction from 3D images for Alzheimer’s disease classification

Faisal Ahmed, Taymaz Akan, Fatih Gelir, Owen T. Carmichael, Elizabeth A. Disbrow, Steven A. Conrad, Mohammad A. N. Bhuiyan

Main category: eess.IV

TL;DR: Proposes a novel feature extraction method using persistent homology on structural MRI to diagnose Alzheimer’s disease, achieving high accuracy with simple machine learning models.

DetailsMotivation: Urgent need for early, objective, and accurate AD diagnosis using low-cost measurement modalities, especially with recent approval of disease-modifying therapies.

Method: Uses persistent homology to extract topological features from brain MRI, converts them to feature vectors via Betti functions, and integrates with XGBoost for classification.

Result: Outperforms state-of-the-art deep learning models with 97.43% accuracy for binary classification and 95.47% for three-class classification on ADNI 3D MRI data.

Conclusion: The topological approach provides computationally efficient diagnosis without data augmentation or extensive preprocessing, offering complementary information to traditional deep learning methods.

Abstract: Now that disease-modifying therapies for Alzheimer’s disease (AD) have been approved by regulatory agencies, the early, objective, and accurate clinical diagnosis of AD based on the lowest-cost measurement modalities possible has become an increasingly urgent need. In this study, we propose a novel feature extraction method using persistent homology to analyze structural MRI of the brain. This approach converts topological features into powerful feature vectors through Betti functions. By integrating these feature vectors with a simple machine learning model like XGBoost, we achieve a computationally efficient machine learning model. Our model outperforms state-of-the-art deep learning models in both binary and three-class classification tasks for ADNI 3D MRI disease diagnosis. Using 10-fold cross-validation, our model achieved an average accuracy of 97.43% and sensitivity of 99.09% for binary classification. For three-class classification, it achieved an average accuracy of 95.47% and sensitivity of 94.98%. Unlike many deep learning models, our approach does not require data augmentation or extensive preprocessing, making it particularly suitable for smaller datasets. Topological features differ significantly from those commonly extracted using convolutional filters and other deep learning machinery. Because they provide an entirely different type of information, topological features have the potential to be combined with other models in future work.
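
The Betti-function vectorization step can be sketched in a few lines. Computing the persistence diagram itself requires a TDA library (e.g. GUDHI or Ripser), so the sketch assumes the birth/death intervals are already given; the grid choice is an illustrative assumption.

```python
def betti_function(diagram, grid):
    """Vectorize a persistence diagram via its Betti function.

    diagram: list of (birth, death) intervals from persistent homology
             (in the paper, from a filtration of the 3D MRI volume).
    grid:    filtration values at which to sample the Betti function.
    The Betti function at t counts the intervals alive at t, i.e. those
    with birth <= t < death; sampling it on a fixed grid yields a
    fixed-length feature vector suitable for a classifier such as XGBoost.
    """
    return [sum(1 for b, d in diagram if b <= t < d) for t in grid]
```

Because every scan maps to the same fixed-length vector regardless of how many topological features it contains, the output can be fed directly to standard tabular models without augmentation or heavy preprocessing.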

[479] Compositional Distributed Learning for Multi-View Perception: A Maximal Coding Rate Reduction Perspective

Zhuojun Tian, Mehdi Bennis

Main category: eess.IV

TL;DR: A compositional distributed learning framework for multi-view perception using maximal coding rate reduction and subspace basis fusion, where agents exchange truncated basis matrices to achieve fused subspaces while maintaining representation diversity.

DetailsMotivation: To develop a distributed learning approach that enables multiple agents to collaboratively learn from multi-view data while preserving representation diversity and avoiding correlated subspaces that occur in baseline methods.

Method: Each agent performs periodic SVD on learned subspaces, exchanges truncated basis matrices, and fuses subspaces. A projection matrix minimizes distance between outputs and projections to enforce representations toward fused subspaces.

Result: Theoretical guarantees show bounded trace on coding-rate change and consistency of basis fusion. Numerical simulations demonstrate high classification accuracy while maintaining representation diversity compared to baselines.

Conclusion: The proposed algorithm successfully achieves collaborative multi-view perception with guaranteed theoretical properties and superior performance over baselines that produce correlated subspaces and coupled representations.

Abstract: In this letter, we formulate a compositional distributed learning framework for multi-view perception by leveraging the maximal coding rate reduction principle combined with subspace basis fusion. In the proposed algorithm, each agent conducts a periodic singular value decomposition on its learned subspaces and exchanges truncated basis matrices, based on which the fused subspaces are obtained. By introducing a projection matrix and minimizing the distance between the outputs and their projections, the learned representations are enforced toward the fused subspaces. It is proved that the trace of the coding-rate change is bounded and that the consistency of basis fusion is theoretically guaranteed. Numerical simulations validate that the proposed algorithm achieves high classification accuracy while maintaining representations’ diversity, compared to baselines, which exhibit correlated subspaces and coupled representations.
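
The basis-fusion step (pool the agents' truncated bases, then orthonormalize to span the fused subspace) and the projection used to pull representations toward that subspace can be sketched as below. The paper obtains truncated bases from SVD of learned feature matrices; here plain Gram-Schmidt on given basis vectors stands in as an illustrative assumption.

```python
def fuse_bases(bases, tol=1e-10):
    """Fuse agents' subspace bases into one orthonormal basis (sketch).

    bases: list of per-agent bases, each a list of vectors (lists of
           floats). Concatenates all basis vectors and orthonormalizes
    them with Gram-Schmidt, dropping near-dependent vectors; the span of
    the result is the sum of the agents' subspaces.
    """
    fused = []
    for basis in bases:
        for v in basis:
            w = list(v)
            for u in fused:
                coeff = sum(a * b for a, b in zip(w, u))
                w = [a - coeff * b for a, b in zip(w, u)]
            norm = sum(a * a for a in w) ** 0.5
            if norm > tol:  # keep only directions not already spanned
                fused.append([a / norm for a in w])
    return fused

def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis.

    Minimizing ||x - project(x, basis)|| is the alignment objective that
    drives each agent's representations toward the fused subspace.
    """
    out = [0.0] * len(x)
    for u in basis:
        coeff = sum(a * b for a, b in zip(x, u))
        out = [o + coeff * b for o, b in zip(out, u)]
    return out
```

Redundant directions shared by several agents collapse into a single fused basis vector, which is what keeps the fused subspace compact while still covering every agent's view.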

[480] ROI-based Deep Image Compression with Implicit Bit Allocation

Kai Hu, Han Wang, Renhe Liu, Zhilin Li, Shenghui Song, Yu Liu

Main category: eess.IV

TL;DR: Proposes an efficient ROI-based deep image compression model with implicit bit allocation using a novel Mask-Guided Feature Enhancement module and dual decoders, outperforming explicit bit allocation methods.

DetailsMotivation: Existing ROI-based compression methods use explicit bit allocation with hard gating that impacts entropy model statistics and limits coding performance. The goal is to develop implicit bit allocation for better rate-distortion performance.

Method: Uses Mask-Guided Feature Enhancement module with Region-Adaptive Attention and Frequency-Spatial Collaborative Attention blocks, plus dual decoders for separate foreground/background reconstruction to enable data-driven balance between foreground enhancement and background quality.

Result: Experiments on COCO2017 show the implicit-based method significantly outperforms explicit bit allocation approaches in rate-distortion performance while maintaining satisfactory visual quality in background regions.

Conclusion: This is the first work using implicit bit allocation for high-quality region-adaptive coding, achieving optimal results through flexible bit allocation and frequency-spatial domain collaboration.

Abstract: Region of Interest (ROI)-based image compression has rapidly developed due to its ability to maintain high fidelity in important regions while reducing data redundancy. However, existing compression methods primarily apply masks to suppress background information before quantization. This explicit bit allocation strategy, which uses hard gating, significantly impacts the statistical distribution of the entropy model, thereby limiting the coding performance of the compression model. In response, this work proposes an efficient ROI-based deep image compression model with implicit bit allocation. To better utilize ROI masks for implicit bit allocation, this paper proposes a novel Mask-Guided Feature Enhancement (MGFE) module, comprising a Region-Adaptive Attention (RAA) block and a Frequency-Spatial Collaborative Attention (FSCA) block. This module allows for flexible bit allocation across different regions while enhancing global and local features through frequency-spatial domain collaboration. Additionally, we use dual decoders to separately reconstruct foreground and background images, enabling the coding network to optimally balance foreground enhancement and background quality preservation in a data-driven manner. To the best of our knowledge, this is the first work to utilize implicit bit allocation for high-quality region-adaptive coding. Experiments on the COCO2017 dataset show that our implicit-based image compression method significantly outperforms explicit bit allocation approaches in rate-distortion performance, achieving optimal results while maintaining satisfactory visual quality in the reconstructed background regions.

[481] Augment to Augment: Diverse Augmentations Enable Competitive Ultra-Low-Field MRI Enhancement

Felix F Zimmermann

Main category: eess.IV

TL;DR: This paper studies data augmentation strategies for enhancing ultra-low-field (ULF) MRI images to match high-field appearance using deep learning, achieving competitive results in the ULF-EnC challenge despite limited paired training data.

Motivation: ULF MRI offers broader accessibility but suffers from poor SNR, reduced resolution, and non-standard contrasts. Image-to-image translation can help but is limited by scarce paired training data (only 50 paired 3D volumes available).

Method: Used task-adapted data augmentations including strong, diverse augmentations and auxiliary tasks on high-field data with a standard deep model for ULF image enhancement.

Result: The submission ranked third by brain-masked SSIM on public validation and fourth by official score on final test leaderboard, showing substantial improvement in fidelity.

Conclusion: Strong and diverse data augmentations, including auxiliary tasks on high-field data, significantly improve the performance of deep learning models for ULF MRI enhancement when paired training data is limited.

Abstract: Ultra-low-field (ULF) MRI promises broader accessibility but suffers from low signal-to-noise ratio (SNR), reduced spatial resolution, and contrasts that deviate from high-field standards. Image-to-image translation can map ULF images to a high-field appearance, yet efficacy is limited by scarce paired training data. Working within the ULF-EnC challenge constraints (50 paired 3D volumes; no external data), we study how task-adapted data augmentations impact a standard deep model for ULF image enhancement. We show that strong, diverse augmentations, including auxiliary tasks on high-field data, substantially improve fidelity. Our submission ranked third by brain-masked SSIM on the public validation leaderboard and fourth by the official score on the final test leaderboard. Code is available at https://github.com/fzimmermann89/low-field-enhancement.
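A minimal sketch of the kind of cheap, diverse augmentations the paper advocates, applied to a 2D "slice" stored as nested lists; the specific transforms (horizontal flip, additive Gaussian noise) and their parameters are illustrative, not the challenge submission's actual pipeline:

```python
import random

def hflip(img):
    """Mirror each row of a 2D image."""
    return [row[::-1] for row in img]

def add_noise(img, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to every pixel."""
    rng = rng or random.Random(0)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in img]

def augment(img, rng=None):
    """Randomly flip, then always inject noise; shape is preserved."""
    rng = rng or random.Random(0)
    if rng.random() < 0.5:
        img = hflip(img)
    return add_noise(img, rng=rng)

slice2d = [[0.1, 0.2], [0.3, 0.4]]
aug = augment(slice2d)
print(len(aug), len(aug[0]))  # 2 2: spatial shape unchanged
```

In the low-data regime the paper targets (50 paired volumes), stacking many such transforms is what substitutes for a larger dataset.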

[482] MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation

De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zhi-Chao Lai, Zeng-Guang Hou

Main category: eess.IV

TL;DR: MOSformer is a novel 2.5D medical image segmentation model that uses dual encoders and momentum updates to effectively fuse inter-slice information, achieving state-of-the-art performance on three benchmark datasets.

Motivation: Existing 2.5D segmentation models use single encoders that fail to effectively fuse inter-slice information, leading to suboptimal segmentation performance. The goal is to better leverage inter-slice information from multi-scale feature maps.

Method: Proposes MOSformer with dual encoders to enhance feature distinguishability among slices, one of which is moving-averaged for consistent representations. Includes an inter-slice fusion transformer (IF-Trans) module to fuse multi-scale features across slices.

Result: Achieved state-of-the-art results on three datasets: Synapse (85.63% DSC), ACDC (92.19% DSC), and AMOS (85.43% DSC).

Conclusion: MOSformer demonstrates strong competitiveness in medical image segmentation by effectively leveraging inter-slice information through its dual encoder and fusion transformer architecture.

Abstract: Medical image segmentation plays an important role in various clinical applications. 2.5D-based segmentation models bridge the computational efficiency of 2D-based models with the spatial perception capabilities of 3D-based models. However, existing 2.5D-based models primarily adopt a single encoder to extract features of target and neighborhood slices, failing to effectively fuse inter-slice information, resulting in suboptimal segmentation performance. In this study, a novel momentum encoder-based inter-slice fusion transformer (MOSformer) is proposed to overcome this issue by leveraging inter-slice information from multi-scale feature maps extracted by different encoders. Specifically, dual encoders are employed to enhance feature distinguishability among different slices. One of the encoders is moving-averaged to maintain consistent slice representations. Moreover, an inter-slice fusion transformer (IF-Trans) module is developed to fuse inter-slice multi-scale features. MOSformer is evaluated on three benchmark datasets (Synapse, ACDC, and AMOS), achieving a new state-of-the-art with 85.63%, 92.19%, and 85.43% DSC, respectively. These results demonstrate MOSformer’s competitiveness in medical image segmentation.
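The "moving-averaged" encoder is the standard momentum (EMA) update used in MoCo-style training: one encoder's weights track a slow exponential moving average of the other's. A per-parameter sketch on plain floats, with an illustrative momentum of 0.99:

```python
def momentum_update(w_momentum, w_online, m=0.99):
    """w_momentum <- m * w_momentum + (1 - m) * w_online, per parameter."""
    return [m * wm + (1.0 - m) * wo for wm, wo in zip(w_momentum, w_online)]

w_mo = [0.0, 1.0]  # momentum encoder weights
w_on = [1.0, 1.0]  # online encoder weights
for _ in range(3):
    w_mo = momentum_update(w_mo, w_on)
print(w_mo)  # first weight drifts slowly toward the online value
```

Because updates are heavily smoothed, the momentum encoder changes slowly, which is what keeps slice representations consistent across training steps.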

[483] Multi-scale Cascaded Foundation Model for Whole-body Organs-at-risk Segmentation

Rui Hao, Dayu Tan, Qiankun Li, Chunhou Zheng, Weimin Zhong, Zhigang Zeng

Main category: eess.IV

TL;DR: MCFNet is a multi-scale cascaded fusion network for accurate organ-at-risk segmentation in radiotherapy, featuring sharp extraction and flexible connection backbones that improve boundary localization and preserve fine structures while maintaining computational efficiency.

Motivation: Accurate segmentation of organs-at-risk is vital for safe radiotherapy and surgery, but existing methods segment only a limited set of organs and lack a systematic treatment of OAR segmentation.

Method: Multi-scale Cascaded Fusion Network (MCFNet) with Sharp Extraction Backbone for downsampling and Flexible Connection Backbone for skip-connection fusion, plus adaptive loss-aggregation strategy for stable optimization.

Result: Outperforms existing methods with consistent robustness and strong cross-dataset generalization across 10 datasets with 36,131 image-mask pairs from 671 patients, excelling in organ segmentation and providing reliable image-guided support.

Conclusion: MCFNet improves precision and safety of radiotherapy and surgery while supporting personalized treatment, advancing modern medical technology.

Abstract: Accurate segmentation of organs-at-risk (OARs) is vital for safe and precise radiotherapy and surgery. Most existing studies segment only a limited set of organs or regions, lacking a systematic treatment of OARs segmentation. We present a Multi-scale Cascaded Fusion Network (MCFNet) that aggregates features across multiple scales and resolutions. MCFNet consists of a Sharp Extraction Backbone for the downsampling path and a Flexible Connection Backbone for skip-connection fusion, strengthening representation learning in both stages. This design improves boundary localization and preserves fine structures while maintaining computational efficiency, enabling reliable performance even on low-resolution inputs. Experiments on an NVIDIA A6000 GPU using 36,131 image-mask pairs from 671 patients across 10 datasets show consistent robustness and strong cross-dataset generalization. An adaptive loss-aggregation strategy further stabilizes optimization and yields additional gains in accuracy and training efficiency. Through extensive validation, MCFNet outperforms existing methods, excelling in organ segmentation and providing reliable image-guided support for computer-aided diagnosis. Our solution aims to improve the precision and safety of radiotherapy and surgery while supporting personalized treatment, advancing modern medical technology. The code has been made available on GitHub: https://github.com/Henry991115/MCFNet.
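The abstract mentions an adaptive loss-aggregation strategy without giving its rule; one common scheme it could resemble (purely a hedged sketch, not the paper's published method) weights each sub-loss inversely to its magnitude so that no single scale dominates the total:

```python
def aggregate(losses, eps=1e-8):
    """Combine sub-losses with weights proportional to 1/loss, normalized
    to sum to 1, so large losses do not swamp small ones."""
    weights = [1.0 / (l + eps) for l in losses]
    s = sum(weights)
    weights = [w / s for w in weights]
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights

total, w = aggregate([2.0, 0.5, 0.5])
print(round(total, 4), [round(x, 3) for x in w])
```

Any such balancing rule serves the same purpose the paper states: stabilizing optimization when losses from multiple scales are summed.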

[484] UltraSam: A Foundation Model for Ultrasound using Large Open-Access Segmentation Datasets

Adrien Meyer, Aditya Murali, Farahdiba Zarin, Didier Mutter, Nicolas Padoy

Main category: eess.IV

TL;DR: Created US-43d, the largest public ultrasound segmentation dataset, and UltraSam - a SAM-style foundation model tailored for ultrasound that outperforms existing models in segmentation and serves as effective initialization for downstream tasks.

Motivation: Automated ultrasound image analysis is challenging due to anatomical complexity and limited annotated data, requiring a data-centric approach with large-scale datasets and specialized foundation models.

Method: Compiled US-43d dataset (43 datasets, 280k+ images, 50+ anatomical structures), adapted SAM to create UltraSam supporting point/box prompts, and used UltraSam as model initialization for downstream tasks.

Result: UltraSam significantly outperforms existing SAM-style models on prompt-based segmentation across three datasets, and UltraSam-initialized models surpass ImageNet-, SAM-, and MedSAM-initialized models in various downstream segmentation and classification tasks.

Conclusion: US-43d and UltraSam provide powerful tools for ultrasound analysis, with UltraSam demonstrating strong foundational capabilities as both a segmentation model and initialization for downstream tasks.

Abstract: Purpose: Automated ultrasound image analysis is challenging due to anatomical complexity and limited annotated data. To tackle this, we take a data-centric approach, assembling the largest public ultrasound segmentation dataset and training a versatile visual foundation model tailored for ultrasound. Methods: We compile US-43d, a large-scale collection of 43 open-access ultrasound datasets with over 280,000 images and segmentation masks for more than 50 anatomical structures. We then introduce UltraSam, an adaptation of the Segment Anything Model (SAM) that is trained on US-43d and supports both point- and box-prompts. Finally, we introduce a new use case for SAM-style models by using UltraSam as a model initialization that can be fine-tuned for various downstream analysis tasks, demonstrating UltraSam’s foundational capabilities. Results: UltraSam achieves vastly improved performance over existing SAM-style models for prompt-based segmentation on three diverse public datasets. Moreover, an UltraSam-initialized Vision Transformer surpasses ImageNet-, SAM-, and MedSAM-initialized models in various downstream segmentation and classification tasks, highlighting UltraSam’s effectiveness as a foundation model. Conclusion: We compile US-43d, a large-scale unified ultrasound dataset, and introduce UltraSam, a powerful multi-purpose SAM-style model for ultrasound images. We release our code and pretrained models at https://github.com/CAMMA-public/UltraSam and invite the community to further this effort by contributing high-quality datasets.
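The "UltraSam as initialization" idea amounts to copying pretrained weights into a downstream model wherever parameter names and shapes match, leaving new heads (e.g. a classifier) at their fresh initialization. A toy sketch with state dicts as plain Python dicts; the parameter names are hypothetical:

```python
def init_from_pretrained(downstream, pretrained):
    """Copy matching-name, matching-shape weights in place; report the rest."""
    loaded, skipped = [], []
    for name, w in downstream.items():
        pw = pretrained.get(name)
        if pw is not None and len(pw) == len(w):
            downstream[name] = list(pw)
            loaded.append(name)
        else:
            skipped.append(name)
    return loaded, skipped

pretrained = {"encoder.w": [0.5, 0.5]}                      # UltraSam-style backbone
downstream = {"encoder.w": [0.0, 0.0], "cls_head.w": [0.0]}  # new task head
loaded, skipped = init_from_pretrained(downstream, pretrained)
print(loaded, skipped)  # encoder reused, new head left as-is
```

This is the same partial-loading pattern deep learning frameworks expose (e.g. non-strict state-dict loading), which is how a segmentation foundation model can seed classifiers as well as segmenters.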

[485] A Bayesian Approach to Segmentation with Noisy Labels via Spatially Correlated Distributions

Ryu Tadokoro, Tsukasa Takagi, Shin-ichi Maeda

Main category: eess.IV

TL;DR: Proposes a Bayesian probabilistic model to handle spatially correlated label errors in semantic segmentation, using a novel ECCD framework for tractable inference.

Motivation: Address annotation errors in semantic segmentation (common in medical imaging/remote sensing) where errors are spatially correlated rather than independent.

Method: Bayesian estimation with probabilistic model for label errors, using ECCD framework with continuous latent Gaussian field and KMS covariance structure for scalable variational inference.

Result: Leveraging the spatial correlation of label errors yields significant performance improvements; in lung segmentation, performance under moderate noise is comparable to training with clean labels.

Conclusion: Spatial correlation modeling of label errors is crucial for robust semantic segmentation, and the proposed ECCD framework enables efficient Bayesian inference for such problems.

Abstract: In semantic segmentation, the accuracy of models heavily depends on high-quality annotations. However, in many practical scenarios, such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground-truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data include label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. However, Bayesian inference for such spatially correlated discrete variables is notoriously intractable. To overcome this fundamental challenge, we introduce a novel class of probabilistic models, which we term the ELBO-Computable Correlated Discrete Distribution (ECCD). By representing the discrete dependencies through a continuous latent Gaussian field with a Kac-Murdock-Szegö (KMS) structured covariance, our framework enables scalable and efficient variational inference for problems previously considered computationally prohibitive. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet-research/Bayesian_SpatialCorr.
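The Kac-Murdock-Szegö covariance named in the abstract is the AR(1) correlation matrix with entries Sigma[i][j] = rho^|i - j|; its Toeplitz structure (and tridiagonal inverse) is what makes inference over the latent Gaussian field tractable. A tiny 1D construction, with rho = 0.8 as an illustrative value:

```python
def kms_covariance(n, rho):
    """Build the n x n KMS matrix: Sigma[i][j] = rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

sigma = kms_covariance(4, 0.8)
for row in sigma:
    print([round(v, 3) for v in row])
# unit diagonal, correlation decaying geometrically with pixel distance
```

Nearby pixels get strongly correlated latent values, which is exactly the "errors appear in spatially connected regions" assumption encoded as a prior.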

[486] EvRWKV: A Continuous Interactive RWKV Framework for Effective Event-Guided Low-Light Image Enhancement

Wenjie Cai, Qingguo Meng, Zhenyu Wang, Xingbo Dong, Zhe Jin

Main category: eess.IV

TL;DR: EvRWKV is a novel framework for low-light image enhancement using event cameras, featuring continuous cross-modal interaction through dual-domain processing that outperforms image-only methods by 1.79-1.85 dB PSNR and improves semantic segmentation by 35.44% mIoU.

Motivation: Existing fusion approaches for event camera-based low-light image enhancement face limitations: early fusion struggles with modality heterogeneity, while late fusion severs crucial feature correlations between event and image data.

Method: Proposes EvRWKV framework with continuous cross-modal interaction through dual-domain processing, including Cross-RWKV Module for temporal and cross-modal dependencies, and Event Image Spectral Fusion Enhancer (EISFE) for joint adaptive frequency-domain denoising and spatial-domain alignment.

Result: Significantly outperforms image-only methods by 1.79 dB and 1.85 dB in PSNR on SDE and SDSD datasets respectively. Enhances semantic segmentation performance by 35.44% mIoU when using EvRWKV-enhanced images.

Conclusion: The continuous cross-modal interaction in EvRWKV effectively maintains feature consistency from low-level textures to high-level semantics, demonstrating superior performance for low-light image enhancement and practical utility for downstream applications.

Abstract: Event cameras offer significant potential for Low-light Image Enhancement (LLIE), yet existing fusion approaches are constrained by a fundamental dilemma: early fusion struggles with modality heterogeneity, while late fusion severs crucial feature correlations. To address these limitations, we propose EvRWKV, a novel framework that enables continuous cross-modal interaction through dual-domain processing, which mainly includes a Cross-RWKV Module to capture fine-grained temporal and cross-modal dependencies, and an Event Image Spectral Fusion Enhancer (EISFE) module to perform joint adaptive frequency-domain denoising and spatial-domain alignment. This continuous interaction maintains feature consistency from low-level textures to high-level semantics. Extensive experiments on the real-world SDE and SDSD datasets demonstrate that EvRWKV significantly outperforms image-only methods by 1.79 dB and 1.85 dB in PSNR, respectively. To further validate the practical utility of our method for downstream applications, we evaluated its impact on semantic segmentation. Experiments demonstrate that images enhanced by EvRWKV lead to a significant 35.44% improvement in mIoU.
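The dB gains quoted above use the standard definition PSNR = 10 * log10(MAX^2 / MSE); a minimal version for flat pixel lists, with MAX = 1.0 for normalized images:

```python
import math

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

ref = [0.0, 0.5, 1.0, 0.5]
noisy = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(ref, noisy), 2))
```

Because the scale is logarithmic, a 1.79-1.85 dB gain corresponds to roughly a one-third reduction in mean squared error.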

[487] RL-U$^2$Net: A Dual-Branch UNet with Reinforcement Learning-Assisted Multimodal Feature Fusion for Accurate 3D Whole-Heart Segmentation

Jierui Qu, Jianchun Zhao

Main category: eess.IV

TL;DR: RL-U²Net is a dual-branch U-Net with reinforcement learning for feature alignment that achieves state-of-the-art multi-modal 3D whole-heart segmentation, with Dice coefficients of 93.1% on CT and 87.0% on MRI.

Motivation: Existing multi-modal segmentation methods suffer from spatial inconsistency between modalities, static fusion strategies, and inefficient decoupled feature alignment and segmentation processes.

Method: Dual-branch U-Net processes CT and MRI patches in parallel, with RL-XAlign module using cross-modal attention and reinforcement learning to learn optimal rotation strategies for anatomical pose and texture feature alignment.

Result: Achieved Dice coefficients of 93.1% on CT and 87.0% on MRI on MM-WHS 2017 dataset, outperforming existing state-of-the-art methods.

Conclusion: The proposed RL-U²Net effectively addresses multi-modal segmentation challenges and demonstrates superior performance for precise whole-heart segmentation.

Abstract: Accurate whole-heart segmentation is a critical component in the precise diagnosis and interventional planning of cardiovascular diseases. Integrating complementary information from modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) can significantly enhance segmentation accuracy and robustness. However, existing multi-modal segmentation methods face several limitations: severe spatial inconsistency between modalities hinders effective feature fusion; fusion strategies are often static and lack adaptability; and the processes of feature alignment and segmentation are decoupled and inefficient. To address these challenges, we propose a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment, termed RL-U$^2$Net, designed for precise and efficient multi-modal 3D whole-heart segmentation. The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel, and introduces a novel RL-XAlign module between the encoders. The module employs a cross-modal attention mechanism to capture semantic correspondences between modalities and a reinforcement-learning agent learns an optimal rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed through their respective decoders. Finally, an ensemble-learning-based decision module integrates the predictions from individual patches to produce the final segmentation result. Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U$^2$Net outperforms existing state-of-the-art methods, achieving Dice coefficients of 93.1% on CT and 87.0% on MRI, thereby validating the effectiveness and superiority of the proposed approach.
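The Dice coefficients reported here follow the standard overlap metric DSC = 2|A ∩ B| / (|A| + |B|); a minimal version on binary masks flattened to lists:

```python
def dice(a, b, eps=1e-8):
    """Dice similarity coefficient between two binary masks (0/1 lists);
    eps keeps the ratio defined when both masks are empty."""
    inter = sum(x and y for x, y in zip(a, b))
    return (2.0 * inter + eps) / (sum(a) + sum(b) + eps)

pred   = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(round(dice(pred, target), 3))  # 2 * 1 / (2 + 1)
```

A DSC of 93.1% therefore means the predicted and ground-truth heart masks overlap almost completely relative to their combined size.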

[488] PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography

Yichi Zhang, Wenbo Zhang, Zehui Ling, Gang Feng, Sisi Peng, Deshu Chen, Yuchen Liu, Hongwei Zhang, Shuqi Wang, Lanlan Li, Limei Han, Yuan Cheng, Zixin Hu, Yuan Qi, Le Xue

Main category: eess.IV

TL;DR: PET2Rep is the first comprehensive benchmark for evaluating vision-language models on PET radiology report generation, addressing the gap in molecular imaging applications and revealing current VLMs’ poor performance on this task.

Motivation: PET imaging provides unique metabolic information but manual report generation is labor-intensive. Existing VLM applications focus on structural imaging, overlooking PET's molecular characteristics.

Method: Created PET2Rep dataset with whole-body PET image-report pairs covering dozens of organs. Evaluated 30 state-of-the-art VLMs using both standard NLG metrics and novel clinical efficacy metrics for radiotracer uptake patterns.

Result: Current VLMs perform poorly on PET report generation, falling short of practical clinical needs. Identified key insufficiencies in medical VLM applications.

Conclusion: There’s a significant gap in VLM capabilities for PET report generation, highlighting the need for specialized development to address molecular imaging requirements.

Abstract: Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advancements of vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge the gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluation of general and medical VLMs for radiology report generation for PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs to fill the critical gap in existing benchmarks and mirror real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficacy metrics to evaluate the quality of radiotracer uptake pattern description in key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that the current state-of-the-art VLMs perform poorly on PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance development in medical applications.
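As a hedged illustration of the "widely recognized natural language generation metrics" the benchmark uses (the paper's exact metric suite is not reproduced here), clipped unigram precision, the core of BLEU-1, scores a candidate report against a reference:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference,
    with per-word counts clipped to the reference counts (BLEU-1 core)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(1, sum(cand.values()))

ref = "mild uptake in the left lung"       # hypothetical report snippets
cand = "uptake in the left lung"
print(round(unigram_precision(cand, ref), 2))  # all candidate tokens match
```

Such surface-overlap scores are exactly why the authors add clinical efficacy metrics: a report can score well on n-gram overlap while misdescribing radiotracer uptake.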

Last updated: 2025-11-28
Built with Hugo; theme modified from Stack