Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 98]
- cs.CV [Total: 132]
- cs.AI [Total: 68]
- cs.SD [Total: 18]
- cs.LG [Total: 195]
- cs.MA [Total: 2]
- cs.MM [Total: 3]
- eess.AS [Total: 19]
- eess.IV [Total: 14]
cs.CL
[1] Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries
Yuchen Zhang, Ravi Shekhar, Haralambos Mouratidis
Main category: cs.CL
TL;DR: Family-based connector sharing for multilingual LLM-ASR reduces parameters while improving cross-domain generalization by grouping languages by linguistic families.
Details
Motivation: Previous LLM-powered ASR systems train separate connectors per language, which overlooks linguistic relatedness and is inefficient for multilingual deployment.
Method: Propose connector-sharing strategy based on linguistic family membership - one connector per language family instead of per language, connecting frozen speech encoder to pretrained LLM via lightweight connectors (see the sketch below).
Result: Family-based connectors reduce parameter count while improving generalization across domains, validated across two multilingual LLMs and two real-world corpora (curated and crowd-sourced speech).
Conclusion: Linguistic family-based connector sharing offers practical and scalable strategy for efficient multilingual ASR deployment by leveraging linguistic relatedness.
Abstract: Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
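A minimal sketch of the connector-sharing idea. The module names, dimensions, and the language-to-family map below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Hypothetical language -> family map; the paper groups real languages
# by linguistic family membership.
FAMILY_OF = {"es": "romance", "it": "romance", "de": "germanic", "nl": "germanic"}

class FamilyConnectors(nn.Module):
    """One lightweight connector per language family (not per language),
    projecting frozen speech-encoder features into the LLM input space."""
    def __init__(self, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.connectors = nn.ModuleDict({
            fam: nn.Linear(enc_dim, llm_dim)
            for fam in set(FAMILY_OF.values())
        })

    def forward(self, speech_feats, lang):
        # Route the frozen encoder's features through the family connector.
        return self.connectors[FAMILY_OF[lang]](speech_feats)

connector = FamilyConnectors()
feats = torch.randn(1, 50, 1024)      # output of the frozen speech encoder
llm_inputs = connector(feats, "es")   # (1, 50, 4096), fed to the pretrained LLM
```

Sharing one connector across a whole family, rather than one per language, is what cuts the parameter count.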
[2] Self-Aware Knowledge Probing: Evaluating Language Models’ Relational Knowledge through Confidence Calibration
Christopher Kissling, Elena Merdjanovska, Alan Akbik
Main category: cs.CL
TL;DR: Proposes a calibration probing framework for relational knowledge in LMs, evaluating three confidence modalities and finding most models are overconfident, especially masked LMs.
Details
Motivation: Existing knowledge probes only evaluate accuracy/precision but ignore model reliability reflected in confidence calibration. Need to assess how well model confidence scores align with actual correctness.
Method: Proposes calibration probing framework covering three confidence modalities: (1) intrinsic confidence (model’s own confidence scores), (2) structural consistency (confidence across rephrased statements), and (3) semantic grounding (understanding linguistic confidence expressions). Analyzes ten causal and six masked language models (see the ECE sketch below).
Result: Most models, especially masked language models, are overconfident. Best-calibrated scores come from confidence estimates accounting for inconsistencies due to statement rephrasing. Even largest pre-trained models fail to accurately encode semantics of linguistic confidence expressions.
Conclusion: Calibration probing reveals important reliability issues in LMs’ knowledge representation. Confidence calibration matters for trustworthy AI, and current models need improvement in properly calibrating their confidence about relational knowledge.
Abstract: Knowledge probing quantifies how much relational knowledge a language model (LM) has acquired during pre-training. Existing knowledge probes evaluate model capabilities through metrics like prediction accuracy and precision. Such evaluations fail to account for the model’s reliability, reflected in the calibration of its confidence scores. In this paper, we propose a novel calibration probing framework for relational knowledge, covering three modalities of model confidence: (1) intrinsic confidence, (2) structural consistency and (3) semantic grounding. Our extensive analysis of ten causal and six masked language models reveals that most models, especially those pre-trained with the masking objective, are overconfident. The best-calibrated scores come from confidence estimates that account for inconsistencies due to statement rephrasing. Moreover, even the largest pre-trained models fail to encode the semantics of linguistic confidence expressions accurately.
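Overconfidence of the kind reported here is typically quantified with expected calibration error (ECE); a minimal sketch, assuming a standard equal-width binning scheme (the paper's exact calibration metric may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    mean confidence and accuracy per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# An overconfident probe: mean confidence 0.875 but only 50% accuracy.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.85], [1, 0, 0, 1]))  # 0.5
```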
[3] Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang
Main category: cs.CL
TL;DR: SFDD uses data filtering based on token predictive distribution flatness to accelerate speculative decoding training by 2x with only 50% data while maintaining inference speed.
Details
Motivation: Speculative decoding typically requires training a draft model on large datasets, which is computationally expensive. The authors found that not all training samples contribute equally to SD acceptance rates, suggesting potential for data efficiency improvements.
Method: Proposed “flatness” metric to quantify how flat token predictive distributions are from the target model. Developed Sample-level-flatness-based Dataset Distillation (SFDD) approach that filters training data to retain only the most valuable samples based on this flatness metric (an entropy-based sketch follows the abstract).
Result: Experiments on EAGLE framework show SFDD achieves over 2x training speedup using only 50% of data while keeping final model’s inference speedup within 4% of full-dataset baseline.
Conclusion: SFDD introduces an effective data-centric approach that substantially improves training efficiency for speculative decoding by identifying and retaining only the most valuable training samples based on predictive distribution flatness.
Abstract: Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2$\times$ training speedup using only 50% of the data, while keeping the final model’s inference speedup within 4% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency for Speculative Decoding. Our code is available at https://anonymous.4open.science/r/Flatness.
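One natural way to instantiate a per-token flatness score is the entropy of the target model's predictive distribution; a hedged sketch (the paper defines its own flatness metric, which may differ):

```python
import torch
import torch.nn.functional as F

def token_flatness(logits):
    """Entropy of the predictive distribution at each position: flat
    (high-entropy) tokens score high, sharply peaked tokens score low."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

def select_flat_samples(per_sample_logits, keep_frac=0.5):
    """Keep the keep_frac of samples whose tokens are flattest on average,
    mirroring the idea of training the draft model on ~50% of the data."""
    scores = torch.stack([token_flatness(l).mean() for l in per_sample_logits])
    k = max(1, int(keep_frac * len(per_sample_logits)))
    return torch.topk(scores, k).indices.tolist()

# Toy usage: three "samples", each (seq_len, vocab) logits from the target model.
samples = [torch.randn(8, 100) * s for s in (0.1, 1.0, 5.0)]  # flat -> peaked
print(select_flat_samples(samples))  # the low-scale logits are flattest
```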
[4] BabyReasoningBench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models
Kaustubh D. Dhole
Main category: cs.CL
TL;DR: BabyReasoningBench is a new benchmark for evaluating reasoning in “baby” language models trained on child-like data, showing these models have uneven reasoning abilities across different task types.
Details
Motivation: Existing benchmarks for language models are adult-centric and assume broad world knowledge, which doesn't match "baby" language models trained on developmentally plausible input like child-directed speech. This mismatch obscures what reasoning abilities actually emerge from child-like training data.
Method: Created BabyReasoningBench - a benchmark of 19 reasoning tasks generated by GPT-5.2, grounded in classic developmental psychology paradigms. Tasks cover theory of mind, analogical/relational reasoning, causal inference, intervention selection, and core reasoning primitives. Tested two GPT-2 based baby language models pretrained on 10M and 100M tokens of child-directed speech.
Result: Baby language models show overall low but uneven performance with dissociations across task families: scaling improves causal and physical reasoning tasks, but belief attribution and pragmatics-sensitive tasks remain challenging even with more data.
Conclusion: BabyReasoningBench provides a developmentally grounded framework for analyzing what reasoning emerges from child-like training distributions and testing mechanistic hypotheses about how such abilities develop in language models.
Abstract: Traditional evaluations of reasoning capabilities of language models are dominated by adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are mismatched to baby language models trained on developmentally plausible input such as child-directed speech and early-childhood narratives, and they obscure which reasoning abilities (if any) emerge under such constraints. We introduce BabyReasoningBench, a GPT-5.2 generated benchmark of 19 reasoning tasks grounded in classic paradigms from developmental psychology, spanning theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives that are known to be confounded by memory and pragmatics. We find that two GPT-2 based baby language models (pretrained on 10M and 100M tokens of child-directed speech text) show overall low but uneven performance, with dissociations across task families: scaling improves several causal and physical reasoning tasks, while belief attribution and pragmatics-sensitive tasks remain challenging. BabyReasoningBench provides a developmentally grounded lens for analyzing what kinds of reasoning are supported by child-like training distributions, and for testing mechanistic hypotheses about how such abilities emerge.
[5] LLMs versus the Halting Problem: Revisiting Program Termination Prediction
Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O’Hearn
Main category: cs.CL
TL;DR: LLMs show strong performance in predicting program termination on SV-Comp 2025 benchmarks, with GPT-5 and Claude Sonnet-4.5 ranking close to top tools, but struggle with providing valid proofs and performance degrades with longer programs.
Details
Motivation: The Halting Problem is undecidable, making automatic verification tools approximate and language-specific. Recent LLM advances raise the question of whether LLMs can reliably predict program termination, potentially offering a new approach to this fundamental problem.
Method: Evaluated LLMs on diverse C programs from the Termination category of SV-Comp 2025, comparing their termination prediction performance against traditional verification tools (a toy harness is sketched below).
Result: LLMs perform remarkably well: GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. However, LLMs often fail to provide valid witness proofs, and performance decreases as program length increases.
Conclusion: LLMs are effective at predicting program termination but have limitations in providing proofs and handling longer programs. These insights motivate further research into using LLMs for reasoning about undecidable problems in computer science.
Abstract: Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs’ performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
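A hypothetical evaluation harness in the spirit of this setup; `query_llm` stands in for whatever chat-completion API is used, and the prompt wording is an assumption:

```python
PROMPT = (
    "Does the following C program terminate on every input?\n"
    "Answer exactly TERMINATES or NONTERMINATING on the first line, then\n"
    "briefly justify (e.g., sketch a ranking function or a recurring lasso).\n\n"
    "{source}"
)

def predict_termination(c_source: str, query_llm) -> bool:
    """Returns True if the model predicts termination. The justification is
    what the paper finds unreliable: predictions are strong, proofs are not."""
    reply = query_llm(PROMPT.format(source=c_source))
    return reply.strip().upper().startswith("TERMINATES")
```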
[6] Malicious Repurposing of Open Science Artefacts by Using Large Language Models
Zahra Hashemi, Zhiqiang Zhong, Jun Pang, Wei Zhao
Main category: cs.CL
TL;DR: LLMs can generate harmful research proposals by repurposing open science artefacts, but they make unreliable evaluators for dual-use risk assessment, requiring human oversight.
Details
Motivation: While LLMs show promise for scientific discovery, there's little research on their potential to generate harmful research by exploiting open science artefacts for malicious purposes.
Method: Developed an end-to-end pipeline that: 1) bypasses LLM safeguards via persuasion-based jailbreaking, 2) reinterprets NLP papers to identify and repurpose their artefacts (datasets, methods, tools) by exploiting vulnerabilities, and 3) assesses safety using a three-dimensional evaluation framework (harmfulness, feasibility of misuse, soundness of technicality).
Result: LLMs can generate harmful proposals by repurposing ethically designed open artefacts. However, LLM evaluators show significant disagreement: GPT-4.1 assigns higher scores (greater potential harms, higher soundness and feasibility), Gemini-2.5-pro is stricter, and Grok-3 falls between these extremes.
Conclusion: LLMs cannot yet serve as reliable judges in malicious evaluation setups, making human evaluation essential for credible dual-use risk assessment.
Abstract: The rapid evolution of large language models (LLMs) has fuelled enthusiasm about their role in advancing scientific discovery, with studies exploring LLMs that autonomously generate and evaluate novel research ideas. However, little attention has been given to the possibility that such models could be exploited to produce harmful research by repurposing open science artefacts for malicious ends. We fill the gap by introducing an end-to-end pipeline that first bypasses LLM safeguards through persuasion-based jailbreaking, then reinterprets NLP papers to identify and repurpose their artefacts (datasets, methods, and tools) by exploiting their vulnerabilities, and finally assesses the safety of these proposals using our evaluation framework across three dimensions: harmfulness, feasibility of misuse, and soundness of technicality. Overall, our findings demonstrate that LLMs can generate harmful proposals by repurposing ethically designed open artefacts; however, we find that LLMs acting as evaluators strongly disagree with one another on evaluation outcomes: GPT-4.1 assigns higher scores (indicating greater potential harms, higher soundness and feasibility of misuse), Gemini-2.5-pro is markedly stricter, and Grok-3 falls between these extremes. This indicates that LLMs cannot yet serve as reliable judges in a malicious evaluation setup, making human evaluation essential for credible dual-use risk assessment.
[7] FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
Main category: cs.CL
TL;DR: FROST is an attention-aware method that prunes uncritical reasoning paths using attention weights to create shorter, more reliable reasoning trajectories, achieving significant reductions in token usage and improvements in accuracy.
Details
Motivation: Traditional reasoning approaches often follow lengthy, inefficient paths. The paper aims to improve reasoning efficiency by identifying and pruning uncritical reasoning paths using attention mechanisms.
Method: Introduces reasoning outliers and designs an attention-based mechanism to remove them. Uses attention weights to prune uncritical reasoning paths at the sentence level while preserving reasoning capacity (see the pruning sketch below).
Result: Outperforms state-of-the-art methods (TALE, ThinkLess) on four benchmarks using Phi-4-Reasoning and GPT-OSS-20B models. Achieves 69.68% average reduction in token usage and 26.70% accuracy improvement over base model. Reduces maximum infinity norm by 15.97% and average kurtosis by 91.09% in attention outlier metrics.
Conclusion: FROST effectively enhances reasoning efficiency by leveraging attention mechanisms to prune uncritical paths, resulting in shorter, more reliable reasoning trajectories with significant performance improvements.
Abstract: We propose FROST, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model’s reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at https://github.com/robinzixuan/FROST
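A minimal sketch of sentence-level attention pruning in the spirit of FROST; the scoring and selection rule here are illustrative assumptions, not the paper's exact criterion:

```python
import torch

def prune_reasoning(sentence_spans, token_attention, keep_frac=0.6):
    """Score each reasoning sentence by the mean attention its tokens
    receive (e.g., averaged over heads/layers from the answer position),
    then keep only the top-scoring fraction, in original order."""
    scores = torch.stack([token_attention[s:e].mean() for s, e in sentence_spans])
    k = max(1, int(keep_frac * len(sentence_spans)))
    keep = torch.topk(scores, k).indices.sort().values
    return keep.tolist()  # indices of sentences to retain

# Toy usage: 4 sentences over 20 tokens; sentence 2 draws little attention.
attn = torch.rand(20)
attn[10:15] *= 0.01
print(prune_reasoning([(0, 5), (5, 10), (10, 15), (15, 20)], attn))
```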
[8] Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback
Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Main category: cs.CL
TL;DR: First multi-reward RLAIF framework for speech dialogue systems combining semantic, audio-quality, and emotion-consistency rewards with turn-level preference sampling for incremental decoding.
Details
Motivation: Prior RLHF/RLAIF for speech dialogue systems is limited to single semantic rewards at the utterance level, overlooking multi-dimensional conversational quality (semantic coherence, audio naturalness, speaker consistency, emotion alignment, turn-taking) and is mismatched with the incremental generation of duplex systems.
Method: Multi-reward RLAIF framework combining semantic, audio-quality, and emotion-consistency rewards. Uses turn-level preference sampling and aggregates per-block log-probabilities within a single DPO objective to align utterance-level preferences with incremental blockwise decoding (see the loss sketch below).
Result: Single-reward RLAIF selectively improves targeted metrics, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. First systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models.
Conclusion: Holistic multi-reward alignment is crucial for practical conversational speech dialogue systems, addressing the multi-dimensional nature of conversational quality that prior single-reward approaches overlook.
Abstract: Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference sampling and aggregate per-block log-probabilities within a single DPO objective. We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models, and release a multi-reward DPO dataset to support reproducible research. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. These results highlight the importance of holistic, multi-reward alignment for practical conversational SDS.
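The block-aggregated objective can be sketched as follows: per-block log-probabilities for the preferred and dispreferred turns are summed into sequence-level scores before the standard DPO loss. This is a sketch of the stated idea, not the released training code:

```python
import torch
import torch.nn.functional as F

def blockwise_dpo_loss(pol_w_blocks, pol_l_blocks,
                       ref_w_blocks, ref_l_blocks, beta=0.1):
    """DPO over a turn-level preference pair (w = chosen, l = rejected),
    where each response was generated blockwise: sum per-block log-probs
    into one score per turn, then apply the usual logistic objective."""
    pol_w = torch.stack(pol_w_blocks).sum()
    pol_l = torch.stack(pol_l_blocks).sum()
    ref_w = torch.stack(ref_w_blocks).sum()
    ref_l = torch.stack(ref_l_blocks).sum()
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -F.logsigmoid(margin)

# Toy usage: three decoding blocks per turn, scalar log-probs per block.
w = [torch.tensor(-1.0), torch.tensor(-0.8), torch.tensor(-1.2)]
l = [torch.tensor(-1.5), torch.tensor(-1.4), torch.tensor(-1.6)]
print(blockwise_dpo_loss(w, l, ref_w_blocks=w, ref_l_blocks=l))
# -> log(2) ≈ 0.693 when the policy equals the reference
```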
[9] PsyProbe: Proactive and Interpretable Dialogue through User State Modeling for Exploratory Counseling
Sohhyung Park, Hyunji Kang, Sungzoon Cho, Dongil Kim
Main category: cs.CL
TL;DR: PsyProbe is a proactive mental health dialogue system that systematically models user psychological states using the PPPPPI framework and cognitive error detection to generate therapeutic exploration questions during counseling.
Details
Motivation: Existing mental health dialogue systems are predominantly reactive and lack systematic user state modeling for proactive therapeutic exploration in counseling sessions.
Method: PsyProbe combines: 1) State Builder for extracting structured psychological profiles using PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) with cognitive error detection, 2) Memory Construction for tracking information gaps, 3) Strategy Planner for Motivational Interviewing behavioral codes, and 4) Response Generator with Question Ideation and Critic/Revision modules.
Result: Evaluated with 27 participants in real-world Korean counseling scenarios: 1) Full PsyProbe consistently outperforms baselines in automatic evaluation, 2) User evaluation shows significantly increased engagement intention and improved naturalness, 3) Expert evaluation by certified counselor demonstrates substantial improvement in core issue understanding and question rates comparable to professional counselors.
Conclusion: Systematic state modeling and proactive questioning are effective for therapeutic exploration in counseling, with PsyProbe validating this approach through improved user engagement, naturalness, and professional-level question generation.
Abstract: Recent advances in large language models have enabled mental health dialogue systems, yet existing approaches remain predominantly reactive, lacking systematic user state modeling for proactive therapeutic exploration. We introduce PsyProbe, a dialogue system designed for the exploration phase of counseling that systematically tracks user psychological states through the PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) augmented with cognitive error detection. PsyProbe combines State Builder for extracting structured psychological profiles, Memory Construction for tracking information gaps, Strategy Planner for Motivational Interviewing behavioral codes, and Response Generator with Question Ideation and Critic/Revision modules to generate contextually appropriate, proactive questions. We evaluate PsyProbe with 27 participants in real-world Korean counseling scenarios, including automatic evaluation across ablation modes, user evaluation, and expert evaluation by a certified counselor. The full PsyProbe model consistently outperforms baseline and ablation modes in automatic evaluation. User evaluation demonstrates significantly increased engagement intention and improved naturalness compared to baseline. Expert evaluation shows that PsyProbe substantially improves core issue understanding and achieves question rates comparable to professional counselors, validating the effectiveness of systematic state modeling and proactive questioning for therapeutic exploration.
[10] Leveraging Sentence-oriented Augmentation and Transformer-Based Architecture for Vietnamese-Bahnaric Translation
Tan Sang Nguyen, Quoc Nguyen Pham, Tho Quan
Main category: cs.CL
TL;DR: The paper proposes neural machine translation techniques with augmentation strategies for Vietnamese-Bahnaric translation to preserve the endangered Bahnaric language, addressing resource constraints through flexible, data-efficient methods.
Details
Motivation: The Bahnaric language is culturally significant but endangered, requiring preservation efforts. While NMT can help make content accessible to Bahnaric speakers, Vietnamese-Bahnaric translation faces challenges due to limited linguistic resources and data constraints.
Method: The authors employ state-of-the-art NMT techniques with two augmentation strategies for domain-specific Vietnamese-Bahnaric translation. Both approaches are flexible, work with various NMT models, and require no complex preprocessing, additional systems, or extra data beyond existing parallel corpora.
Result: The abstract doesn’t provide specific quantitative results, but implies that the proposed methods address the resource constraints in Vietnamese-Bahnaric translation, making NMT more viable for this low-resource language pair.
Conclusion: The proposed augmentation strategies offer practical solutions for Vietnamese-Bahnaric NMT, contributing to Bahnaric language preservation through improved translation accessibility without requiring extensive additional resources.
Abstract: The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, possess a language of immense cultural and historical significance. The government places a strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advancements in artificial intelligence, such as Neural Machine Translation (NMT), have brought about a transformation in translation by improving accuracy and fluency. This, in turn, contributes to the revival of the language through educational efforts, communication, and documentation. Specifically, NMT is pivotal in enhancing accessibility for Bahnaric speakers, making information and content more readily available. Nevertheless, the translation of Vietnamese into Bahnaric faces practical challenges due to resource constraints, especially given the limited resources available for the Bahnaric language. To address this, we employ state-of-the-art techniques in NMT along with two augmentation strategies for the domain-specific Vietnamese-Bahnaric translation task. Importantly, both approaches are flexible and can be used with various neural machine translation models. Additionally, they do not require complex data preprocessing steps, the training of additional systems, or the acquisition of extra data beyond the existing training parallel corpora.
[11] Transparency-First Medical Language Models: Datasheets, Model Cards, and End-to-End Data Provenance for Clinical NLP
Olaf Yunus Laitinen Imanov, Taner Yilmaz, Ayse Tuba Tugrul, Melike Nesrin Zaman, Ozkan Gunalp, Duygu Erisken, Sila Burde Dulger, Rana Irem Turhan, Izzet Ozdemir, Derya Umut Kulali, Ozan Akbulut, Harun Demircioglu, Hasan Basri Kara, Berfin Tavan
Main category: cs.CL
TL;DR: TeMLM introduces transparency-first release artifacts for clinical language models, unifying provenance, data transparency, modeling transparency, and governance into a machine-checkable bundle with defined artifacts and conformance checklist.
Details
Motivation: To address the need for transparency in clinical language models by creating standardized, machine-checkable release artifacts that unify various aspects of transparency (provenance, data, modeling, governance) for better auditing and validation.
Method: Defines an artifact suite (TeMLM-Card, TeMLM-Datasheet, TeMLM-Provenance) and a lightweight conformance checklist for repeatable auditing (a toy conformance check is sketched below). Instantiates the artifacts on Technetium-I, a large-scale synthetic clinical NLP dataset with 498,000 notes, 7.74M PHI entity annotations, and ICD-9-CM diagnosis labels.
Result: Reports reference results for ProtactiniumBERT (100M parameters) on PHI de-identification (token classification) and top-50 ICD-9 code extraction (multi-label classification) using the synthetic dataset. Demonstrates the framework’s practical application.
Conclusion: Synthetic benchmarks are valuable for tooling and process validation, but models should ultimately be validated on real clinical data prior to deployment. TeMLM provides a transparency framework to support this validation process.
Abstract: We introduce TeMLM, a set of transparency-first release artifacts for clinical language models. TeMLM unifies provenance, data transparency, modeling transparency, and governance into a single, machine-checkable release bundle. We define an artifact suite (TeMLM-Card, TeMLM-Datasheet, TeMLM-Provenance) and a lightweight conformance checklist for repeatable auditing. We instantiate the artifacts on Technetium-I, a large-scale synthetic clinical NLP dataset with 498,000 notes, 7.74M PHI entity annotations across 10 types, and ICD-9-CM diagnosis labels, and report reference results for ProtactiniumBERT (about 100 million parameters) on PHI de-identification (token classification) and top-50 ICD-9 code extraction (multi-label classification). We emphasize that synthetic benchmarks are valuable for tooling and process validation, but models should be validated on real clinical data prior to deployment.
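A toy conformance check over a hypothetical TeMLM release bundle; the file and field names below are illustrative assumptions, not the actual schema:

```python
import json
import pathlib

# Hypothetical required artifacts and fields (the real checklist is
# defined by the TeMLM spec, not reproduced here).
REQUIRED = {
    "temlm_card.json": ["model_name", "parameters", "intended_use", "governance"],
    "temlm_datasheet.json": ["data_sources", "phi_annotation_policy", "label_schema"],
    "temlm_provenance.json": ["dataset_hashes", "pipeline_steps"],
}

def check_bundle(bundle_dir):
    """Return a list of conformance problems; an empty list means the
    release bundle passes this (toy) machine-checkable audit."""
    problems = []
    for fname, fields in REQUIRED.items():
        path = pathlib.Path(bundle_dir) / fname
        if not path.exists():
            problems.append(f"missing artifact: {fname}")
            continue
        doc = json.loads(path.read_text())
        problems += [f"{fname}: missing field '{f}'" for f in fields if f not in doc]
    return problems
```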
[12] Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu, Ray Mooney
Main category: cs.CL
TL;DR: VLMs are vulnerable to textual misinformation that contradicts visual evidence, showing 48.2% performance drop when faced with conflicting multimodal inputs.
Details
Motivation: While VLMs show strong multimodal reasoning on VQA benchmarks, their robustness against textual misinformation remains under-explored. Existing research studied misinformation in text-only domains, but it's unclear how VLMs handle contradictory information from different modalities.
Method: 1) Created CONTEXT-VQA dataset with image-question pairs and systematically generated persuasive prompts that deliberately conflict with visual evidence. 2) Designed and executed a thorough evaluation framework to benchmark model susceptibility to conflicting multimodal inputs.
Result: Comprehensive experiments on 11 state-of-the-art VLMs reveal they are vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of conflicting text. Models show average performance drop of over 48.2% after just one round of persuasive conversation.
Conclusion: Current VLMs have a critical limitation in robustness against textual manipulation. The findings underscore the need for improved robustness against textual misinformation in multimodal AI systems.
Abstract: Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
[13] How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
Main category: cs.CL
TL;DR: The paper develops closed-form expressions for transformer weights in early training stages, showing they emerge as compositions of three statistical basis functions from text data.
Details
Motivation: To understand how semantic associations are learned and represented in language models, connecting deep learning with linguistic theory and developing mechanistic foundations for LLMs.
Method: Analyze training dynamics using leading-term gradient approximations to derive closed-form expressions for transformer weights at early training stages, revealing compositions of bigram, token-interchangeability, and context mappings.
Result: Theoretical weight characterizations closely match learned weights in real-world LLMs, and the framework provides interpretability for how transformers capture semantic associations.
Conclusion: Transformer weights emerge as simple compositions of statistical basis functions from text data, providing a mechanistic understanding of how semantic associations are learned and enabling better interpretation of LLM representations.
Abstract: Semantic associations such as the link between “bird” and “flew” are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem shines light on interpreting the learned associations in transformers.
[14] A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews
Aakash Trivedi, Aniket Upadhyay, Pratik Narang, Dhruv Kumar, Praveen Kumar
Main category: cs.CL
TL;DR: Hybrid pipeline combining RoBERTa classifier with LLM outperforms baselines for extracting actionable suggestions from customer reviews.
Details
Motivation: Existing approaches fail to isolate precise improvement instructions from mixed-intent customer reviews, which are essential for operational decision-making.
Method: Hybrid pipeline: high-recall RoBERTa classifier trained with precision-recall surrogate to reduce false negatives, combined with controlled instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization (see the pipeline sketch below).
Result: Outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence across hospitality and food datasets. Human evaluations confirm suggestions are clear, faithful, and interpretable.
Conclusion: Hybrid reasoning architectures achieve meaningful improvements in fine-grained actionable suggestion mining, though challenges remain in domain adaptation and efficient local deployment.
Abstract: Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need. We evaluate a hybrid pipeline combining a high-recall RoBERTa classifier trained with a precision-recall surrogate to reduce unrecoverable false negatives with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable. Overall, our results show that hybrid reasoning architectures achieve meaningful improvements in fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.
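The hybrid flow can be sketched as below; `clf_score` and `llm` stand in for the fine-tuned RoBERTa classifier and the instruction-tuned LLM, and the prompt wording is an assumption:

```python
def mine_suggestions(review_sentences, clf_score, llm, threshold=0.3):
    """clf_score: sentence -> P(contains a suggestion); llm: prompt -> str."""
    suggestions = []
    for sent in review_sentences:
        # A low threshold keeps recall high: a false positive costs one
        # LLM call, but a false negative is unrecoverable downstream,
        # which is what the precision-recall surrogate loss targets.
        if clf_score(sent) >= threshold:
            out = llm(
                "Extract the actionable improvement instruction from this "
                "review sentence, or reply NONE:\n" + sent
            )
            if out.strip().upper() != "NONE":
                suggestions.append(out.strip())
    return suggestions
```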
[15] DREAMSTATE: Diffusing States and Parameters for Recurrent Large Language Models
Liu Xiao
Main category: cs.CL
TL;DR: DREAMSTATE framework enables generation and editing of RWKV RNN states using conditional diffusion transformers, revealing their structural knowledge representation and enabling hybrid RNN-DiT architectures.
Details
Motivation: Modern RNNs like RWKV have powerful short-range modeling and efficient fixed-size states, but there's a significant lack of research into their internal state as an editable knowledge representation. The authors aim to fill this gap by exploring the representational properties of RWKV states.
Method: Proposed DREAMSTATE framework uses conditional Diffusion Transformer (DiT) to directly model the probability manifold of RWKV states, enabling state generation and editing. Also developed a novel hybrid architecture combining local RNN advantages with global context adaptability via parallel DiT that dynamically generates WKV parameters.
Result: Successfully uncovered and modeled state’s representational potential through t-SNE visualizations and controlled generation experiments. Hybrid model can be trained stably via multi-objective loss, validating design feasibility.
Conclusion: Opens new research direction for RNN state representation and provides concrete architectural reference for future model design, enabling context-aware dynamic recurrence mechanisms.
Abstract: Modern Recurrent Neural Networks (RNNs), such as RWKV, are distinguished by their powerful short-range modeling capabilities and efficient fixed-size states, which constitute a core advantage over standard Transformers. However, there is a significant lack of research into their internal state as an editable knowledge representation. To fill this gap, we first explore the representational properties of the RWKV state by proposing the DREAMSTATE framework. This framework utilizes a conditional Diffusion Transformer (DiT) to directly model the probability manifold of the state, enabling its generation and editing. The structural nature of this representation is validated through t-SNE visualizations and controlled generation experiments. After successfully uncovering and modeling the state’s representational potential, we further propose a novel hybrid architecture that combines the local advantages of RNNs with global context adaptability. This architecture features a parallel DiT that processes a variable-length global context to dynamically generate and adjust the core recurrent module’s WKV parameters, transforming the fixed recurrence mechanism into a context-aware dynamic function. Experiments demonstrate that this hybrid model can be trained stably via a multi-objective loss, validating its design feasibility. Our work not only opens a new research direction for RNN state representation but also provides a concrete architectural reference for future model design. The code is publicly available at: https://huggingface.co/2dgx41s/DreamState.
[16] RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering
Kaehyun Um, KyuHwan Yeom, Haerim Yang, Minyoung Choi, Hyeongjun Yang, Kyong-Ho Lee
Main category: cs.CL
TL;DR: RPO-RAG is a KG-based retrieval-augmented generation framework specifically designed for small LLMs (<8B parameters) that improves reasoning on knowledge graph question answering through semantic sampling, relation-aware optimization, and answer-centered prompting.
Details
Motivation: Existing KG-based RAG approaches have limitations: they use semantics-unaware path sampling, are weakly aligned with KG reasoning objectives, and don't organize retrieved paths effectively for small LLMs. Prior work also focuses on large LLMs (ChatGPT/GPT-4) or models above 7B parameters, leaving sub-7B models underexplored.
Method: RPO-RAG introduces three key innovations: (1) query-path semantic sampling strategy for informative supervisory signals, (2) relation-aware preference optimization aligning training with intermediate KG reasoning signals, and (3) answer-centered prompt design organizing entities and reasoning paths in interpretable format.
Result: Extensive experiments on WebQSP and CWQ KGQA datasets show RPO-RAG bridges performance gap between small and large LLMs. On WebQSP, improves F1 by up to 8.8%; on CWQ achieves new SOTA among models under 8B parameters in both Hit and F1 metrics.
Conclusion: RPO-RAG substantially improves reasoning capability of small LLMs (even under 3B parameters), highlighting their potential for resource-efficient and practical on-device KGQA applications.
Abstract: Large Language Models (LLMs) have recently demonstrated remarkable reasoning abilities, yet hallucinate on knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this issue by grounding answers in external sources, e.g., knowledge graphs (KGs). However, existing KG-based RAG approaches rely on semantics-unaware path sampling and are weakly aligned with KG reasoning objectives, which limits further accuracy gains. They also feed retrieved paths directly into the reasoner without organizing them into answer-centered reasoning paths, hindering small LLMs’ ability to leverage the retrieved knowledge. Furthermore, prior works predominantly rely on large LLMs (e.g., ChatGPT/GPT-4) or assume backbones above 7B parameters, leaving sub-7B models underexplored. We address this gap with RPO-RAG, the first KG-based RAG framework specifically designed for small LLMs, to the best of our knowledge. RPO-RAG introduces three key innovations: (1) a query-path semantic sampling strategy that provides informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals (e.g., relation); and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets, WebQSP and CWQ, demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models. On WebQSP, it improves F1 by up to 8.8%, reflecting enhanced answer precision, while on CWQ it achieves new state-of-the-art results among models under 8B parameters in both Hit and F1. Overall, RPO-RAG substantially improves the reasoning capability of small LLMs, even under 3B parameters, highlighting their potential for resource-efficient and practical on-device KGQA applications.
[17] DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models
Xinlong Chen, Weihong Lin, Jingyun Hua, Linli Yao, Yue Ding, Bozhou Li, Bohan Zeng, Yang Shi, Qiang Liu, Yuanxing Zhang, Pengfei Wan, Liang Wang, Tieniu Tan
Main category: cs.CL
TL;DR: DiaDem is an audiovisual video captioning model that generates more accurate dialogue descriptions using synthesized SFT data and two-stage GRPO training, with DiaDemBench benchmark for evaluation.
Details
Motivation: Existing audiovisual captioning models struggle with faithful dialogue descriptions, which are crucial for downstream understanding and generation tasks.
Method: 1) Synthesize high-quality dataset for supervised fine-tuning (SFT), 2) Employ difficulty-partitioned two-stage GRPO (Group Relative Policy Optimization) strategy to enhance dialogue descriptions.
Result: DiaDem outperforms Gemini series in dialogue description accuracy on DiaDemBench and achieves competitive performance on general audiovisual captioning benchmarks.
Conclusion: DiaDem effectively addresses dialogue description limitations in audiovisual captioning, with commercial models showing substantial room for improvement in dialogue-aware captioning.
Abstract: Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for SFT, then employ a difficulty-partitioned two-stage GRPO strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
[18] Riddle Quest: The Enigma of Words
Niharika Sri Parasa, Chaitali Diwan, Srinath Srinivasa
Main category: cs.CL
TL;DR: A pipeline for creating and evaluating analogy-based riddles to test LLMs’ reasoning coverage and ambiguity handling.
Details
Motivation: Riddles require creative interpretation and inference, making them useful for examining language models' ability to handle ambiguity and multiple valid interpretations.
Method: Four-component pipeline: triples creator builds structured facts, semantic mapper selects analogy attributes, stylized generator creates riddle clues, and validator collects all possible answers.
Result: LLMs often guess the main intended answer but frequently miss other valid interpretations, revealing limitations in reasoning coverage.
Conclusion: Riddles serve as a lightweight tool for evaluating language models’ ability to handle ambiguity and recognize multiple valid solutions.
Abstract: Riddles are concise linguistic puzzles that describe an object or idea through indirect, figurative, or playful clues. They are a longstanding form of creative expression, requiring the solver to interpret hints, recognize patterns, and draw inferences to identify the answers. In this work, we introduce a simple pipeline for creating and evaluating analogy-based riddles. The system includes a triples creator that builds structured facts about a concept, a semantic mapper that selects attributes useful for analogy, a stylized generator that turns them into riddle clues, and a validator that collects all possible answers the riddle could point to. We use this validator to study whether large language models can recover the full answer set for different riddle types. Our case study shows that while models often guess the main intended answer, they frequently miss other valid interpretations. This highlights the value of riddles as a lightweight tool for examining reasoning coverage and ambiguity handling in language models.
[19] DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference
Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian
Main category: cs.CL
TL;DR: DART is a speculative decoding method that uses parallel generation instead of autoregressive drafting to reduce latency, achieving 2.03x-3.44x speedup over standard decoding.
Details
Motivation: Existing model-based draft designs like EAGLE3 improve accuracy but require multi-step autoregressive inference, creating high drafting latency that becomes the performance bottleneck in speculative decoding.
Method: DART predicts logits for multiple future masked positions in parallel within a single forward pass using target model hidden states, eliminating autoregressive rollouts (see the drafting sketch below). It also introduces an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity.
Result: DART achieves 2.03x-3.44x wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average while preserving high draft accuracy.
Conclusion: DART substantially reduces draft-stage overhead while maintaining high accuracy, offering a practical speculative decoding framework that significantly improves end-to-end decoding speed.
Abstract: Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03x–3.44x wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
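A minimal sketch of the parallel drafting idea: predict several future positions from the target model's hidden state in one forward pass, with no autoregressive loop. The head design and dimensions are illustrative assumptions, not DART's architecture:

```python
import torch
import torch.nn as nn

class ParallelDraftHead(nn.Module):
    """Predict logits for k future (masked) positions at once from the
    target model's last hidden state, instead of drafting token by token."""
    def __init__(self, hidden_dim=512, vocab_size=8000, k=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(k)]
        )

    def forward(self, h_last):  # h_last: (batch, hidden_dim)
        # One head per future position; a single forward pass yields all
        # draft logits, which then seed the draft token tree.
        return torch.stack([head(h_last) for head in self.heads], dim=1)

head = ParallelDraftHead()
draft_logits = head(torch.randn(2, 512))  # (2, 4, 8000): 4 positions at once
```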
[20] ReToP: Learning to Rewrite Electronic Health Records for Clinical Prediction
Jesus Lovon-Melgarejo, Jose G. Moreno, Christine Damase-Michel, Lynda Tamine
Main category: cs.CL
TL;DR: ReToP is an LLM-based framework that improves clinical prediction by training an EHR rewriter and predictor end-to-end, using synthetic data and a novel CSC score to generate clinically relevant rewrites that enhance task performance.
Details
Motivation: Existing LLM approaches for EHR-based clinical prediction are task-agnostic, using LLMs as EHR encoders or completion modules without integrating prediction task signals, which limits performance accuracy.
Method: Proposes Rewrite-To-Predict (ReToP) with end-to-end training of EHR rewriter and clinical predictor. Uses clinical-driven feature selection to generate synthetic pseudo-labels for training, and introduces Classifier Supervised Contribution (CSC) score to align rewrites with prediction objectives.
Result: ReToP surpasses strong baselines across three clinical tasks on MIMIC-IV, shows generalizability to unseen datasets/tasks with minimal fine-tuning, preserves faithful rewrites, and emphasizes task-relevant predictive features.
Conclusion: ReToP effectively addresses task-agnostic limitations of existing LLM approaches by integrating prediction signals into EHR rewriting, resulting in improved clinical prediction performance and generalizability.
Abstract: Electronic Health Records (EHRs) provide crucial information for clinical decision-making. However, their high-dimensionality, heterogeneity, and sparsity make clinical prediction challenging. Large Language Models (LLMs) allowed progress towards addressing this challenge by leveraging parametric medical knowledge to enhance EHR data for clinical prediction tasks. Despite the significant achievements made so far, most of the existing approaches are fundamentally task-agnostic in the sense that they deploy LLMs as EHR encoders or EHR completion modules without fully integrating signals from the prediction tasks. This naturally hinders task performance accuracy. In this work, we propose Rewrite-To-Predict (ReToP), an LLM-based framework that addresses this limitation through an end-to-end training of an EHR rewriter and a clinical predictor. To cope with the lack of EHR rewrite training data, we generate synthetic pseudo-labels using clinical-driven feature selection strategies to create diverse patient rewrites for fine-tuning the EHR rewriter. ReToP aligns the rewriter with prediction objectives using a novel Classifier Supervised Contribution (CSC) score that enables the EHR rewriter to generate clinically relevant rewrites that directly enhance prediction. Our ReToP framework surpasses strong baseline models across three clinical tasks on MIMIC-IV. Moreover, the analysis of ReToP shows its generalizability to unseen datasets and tasks with minimal fine-tuning while preserving faithful rewrites and emphasizing task-relevant predictive features.
[21] MetaGen: Self-Evolving Roles and Topologies for Multi-Agent LLM Reasoning
Yimeng Wang, Jiaxing Zhao, Hongbin Xie, Hexing Ma, Yuzhen Lei, Shuangxue Liu, Xuan Song, Zichen Zhang, Haoran Zhang
Main category: cs.CL
TL;DR: MetaGen is a training-free framework that dynamically adapts both role specifications and collaboration topology during inference for multi-agent LLM systems, improving accuracy and cost efficiency.
Details
Motivation: Existing multi-agent LLM systems use fixed role libraries and frozen interaction topologies, which cause task mismatches, prevent adaptation to new evidence during reasoning, and inflate inference costs.
Method: MetaGen generates and rewrites query-conditioned role specifications to maintain a dynamic role pool, instantiates a constrained execution graph around a minimal backbone, and iteratively updates role prompts while adjusting structural decisions using lightweight feedback signals - all without updating base model weights.
Result: Experiments on code generation and multi-step reasoning benchmarks show that MetaGen improves the accuracy and cost tradeoff over strong multi-agent baselines.
Conclusion: MetaGen provides an effective training-free approach for dynamic multi-agent collaboration that adapts both roles and interaction structures during inference, addressing limitations of rigid fixed-topology systems.
Abstract: Large language models are increasingly deployed as multi-agent systems, where specialized roles communicate and collaborate through structured interactions to solve complex tasks that often exceed the capacity of a single agent. However, most existing systems still rely on a fixed role library and an execution-frozen interaction topology, a rigid design choice that frequently leads to task mismatch, prevents timely adaptation when new evidence emerges during reasoning, and further inflates inference cost. We introduce MetaGen, a training-free framework that adapts both the role space and the collaboration topology at inference time, without updating base model weights. MetaGen generates and rewrites query-conditioned role specifications to maintain a controllable dynamic role pool, then instantiates a constrained execution graph around a minimal backbone. During execution, it iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals. Experiments on code generation and multi-step reasoning benchmarks show that MetaGen improves the accuracy and cost tradeoff over strong multi-agent baselines.
[22] Formula-One Prompting: Adaptive Reasoning Through Equations For Applied Mathematics
Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul
Main category: cs.CL
TL;DR: F-1 Prompting improves LLM mathematical reasoning by first extracting governing equations from problems, then adaptively selecting solving strategies (CoT, PoT, or direct computation), achieving significant gains especially in applied domains like finance and physics.
Details
Motivation: Current prompting techniques (CoT, PoT) don't explicitly leverage the crucial step of recalling/deriving governing equations needed for applied mathematics problems in domains like finance, physics, and cryptography.
Method: Two-phase approach: 1) Formulate governing equations from problem descriptions, 2) Select adaptive solving strategy among Chain-of-Thought, Program-of-Thought, or direct computation based on generated equations, all within a single LLM call.
Result: Outperforms CoT by +5.76% and PoT by +8.42% on average across five models and four benchmarks. Largest gains in applied domains: +13.30% on FinanceMath over CoT, and within OlympiadBench, physics (+2.55%) shows larger gains than pure math (+0.44%).
Conclusion: F-1 is more effective than CoT in applied mathematics problems by explicitly leveraging mathematical equations as intermediate representations before adaptive solving.
Abstract: Prompting techniques such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) improve LLM mathematical reasoning by structuring intermediate steps in natural language or code. However, applied mathematics problems in domains like finance, physics, and cryptography often require recalling or deriving governing equations, a step that current approaches do not explicitly leverage. We propose Formula-One Prompting (F-1), a two-phase approach that uses mathematical equations as an intermediate representation before adaptive solving. F-1 first formulates governing equations from problem descriptions, then selects a solving strategy among CoT, PoT, or direct computation based on the generated equations, all within a single LLM call. Results across five models and four benchmarks show F-1 outperforms CoT by +5.76% and PoT by +8.42% on average. Crucially, gains are largest in applied domains: +13.30% on FinanceMath over CoT, and within OlympiadBench, larger gains on physics (+2.55%) than pure math (+0.44%). This demonstrates that F-1 is more effective than CoT in applied mathematics problems.
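Because both phases happen inside one LLM call, the method is essentially a prompt design. A sketch of what such a template might look like follows; the paper's actual prompt wording is not reproduced here, so this is illustrative only.

```python
# Illustrative single-call template for equation-first adaptive solving.
F1_PROMPT = """\
You are solving an applied mathematics problem.
Phase 1 (Formulate): write down the governing equation(s) for this problem.
Phase 2 (Solve): based on those equations, choose ONE strategy and apply it:
  (a) chain-of-thought reasoning in natural language,
  (b) program-of-thought: write Python code that computes the answer,
  (c) direct computation, if the equations admit a closed-form answer.
Problem: {problem}
"""

def f1_query(llm, problem: str) -> str:
    # Both phases happen within a single LLM call, as described above.
    return llm(F1_PROMPT.format(problem=problem))
```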
[23] CAMEO: Collection of Multilingual Emotional Speech Corpora
Iwona Christop, Maciej Czajka
Main category: cs.CL
TL;DR: CAMEO is a curated multilingual emotional speech dataset collection for emotion recognition research, featuring standardized benchmarks, easy access via Hugging Face, and performance results for various models.
Details
Motivation: To facilitate research in emotion recognition and speech-related tasks by providing easy access to multilingual emotional speech data, ensuring reproducibility, and establishing standardized benchmarks for evaluating SER systems across different emotional states and languages.
Method: The paper describes dataset selection criteria, curation and normalization processes to create a standardized collection of multilingual emotional speech datasets, making them publicly available with metadata and a leaderboard on Hugging Face.
Result: The CAMEO collection is successfully created and made publicly available, with performance results provided for several speech emotion recognition models, establishing a standardized benchmark for evaluation.
Conclusion: CAMEO provides a valuable resource for speech emotion recognition research by offering curated multilingual datasets with standardized benchmarks, promoting reproducibility and comparative evaluation across different languages and emotional states.
Abstract: This paper presents CAMEO – a curated collection of multilingual emotional speech datasets designed to facilitate research in emotion recognition and other speech-related tasks. The main objectives were to ensure easy access to the data, to allow reproducibility of the results, and to provide a standardized benchmark for evaluating speech emotion recognition (SER) systems across different emotional states and languages. The paper describes the dataset selection criteria, the curation and normalization process, and provides performance results for several models. The collection, along with metadata and a leaderboard, is publicly available via the Hugging Face platform.
[24] When Benchmarks Leak: Inference-Time Decontamination for LLMs
Jianzhe Chai, Yu Zhe, Jun Sakuma
Main category: cs.CL
TL;DR: DeconIEP is a decontamination framework that applies small, bounded perturbations in the input embedding space during evaluation to mitigate test set contamination in LLM benchmarks without altering the benchmark itself.
Details
Motivation: Benchmark-based evaluation of LLMs is threatened by test set contamination, where test samples leak into training data and artificially inflate performance. Existing mitigation approaches either alter the evaluation set or interfere with normal inference, leading to performance degradation on clean inputs.
Method: DeconIEP operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, it learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways.
Result: Across multiple open-weight LLMs and benchmarks, DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
Conclusion: DeconIEP provides an effective solution to test set contamination that preserves benchmark integrity and maintains model performance on clean inputs, addressing limitations of previous decontamination approaches.
Abstract: Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and lead to noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
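A minimal PyTorch sketch of the embedding-space idea, with one simplification: the paper learns an instance-adaptive perturbation generator, whereas this sketch optimizes a bounded perturbation directly per instance against the reference model. The bound `eps`, step count, and KL objective are assumptions; `model` and `ref_model` are assumed to be HuggingFace-style LMs accepting `inputs_embeds`.

```python
# Sketch: optimize a small, bounded embedding perturbation that pulls the
# evaluated model's distribution toward a less-contaminated reference.
import torch
import torch.nn.functional as F

def decontaminated_logits(model, ref_model, embeds, eps=0.05, steps=5, lr=1e-2):
    with torch.no_grad():  # target distribution from the reference model
        ref = F.softmax(ref_model(inputs_embeds=embeds).logits, dim=-1)
    delta = torch.zeros_like(embeds, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model(inputs_embeds=embeds + delta).logits
        # Steer away from memorized shortcuts, toward the reference.
        loss = F.kl_div(F.log_softmax(logits, dim=-1), ref, reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation small and bounded
    return model(inputs_embeds=embeds + delta).logits
```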
[25] Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation
Tathagata Raha, Clement Christophe, Nada Saadi, Hamza A Javed, Marco AF Pimentel, Ronnie Rajan, Praveenkumar Kanithi
Main category: cs.CL
TL;DR: CEF is a reference-free evaluation framework that uses cross-examination between source and generated texts to assess semantic fidelity through three interpretable scores, outperforming traditional metrics like BLEU and BERTScore.
Details
Motivation: Traditional text evaluation metrics (BLEU, BERTScore) fail to adequately capture semantic fidelity in generative text-to-text tasks, particularly in identifying critical semantic errors like content omissions and factual contradictions.
Method: Adapts Cross-Examination Framework (CEF) to treat source and candidate texts as independent knowledge bases. Generates verifiable questions from each text and performs cross-examination to derive three scores: Coverage (how much source content is covered), Conformity (how well candidate aligns with source), and Consistency (internal coherence). Includes systematic robustness analysis for judge model selection.
Result: Validated across translation, summarization, and clinical note-generation tasks. Identifies critical errors missed by standard metrics. Strong correlation between reference-free and with-reference modes validates reliability without gold references. Human expert validation shows CEF mismatching questions align with meaning-altering semantic errors, particularly excelling at identifying entity-based and relational distortions.
Conclusion: CEF provides a robust, reference-free evaluation framework with interpretable scores that better captures semantic fidelity than traditional metrics, particularly effective at identifying critical semantic errors in generative text-to-text tasks.
Abstract: Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation by treating the source and candidate as independent knowledge bases. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization and clinical note-generation, our framework identifies critical errors, such as content omissions and factual contradictions, missed by standard metrics. A key contribution is a systematic robustness analysis to select a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF's reliability without gold references. Furthermore, human expert validation demonstrates that CEF's mismatching questions align more strongly with meaning-altering semantic errors than with non-semantic errors, particularly excelling at identifying entity-based and relational distortions.
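A toy sketch of how the three scores could be computed once question generation and answering are in place; `answerable(text, question)` is a hypothetical LLM-judged predicate, and the paper's exact score definitions may differ from these ratios.

```python
# Toy CEF-style scoring from cross-examined question sets.
def cef_scores(answerable, source, candidate, src_questions, cand_questions):
    # Coverage: can the candidate answer questions derived from the source?
    # (Low coverage signals content omissions.)
    coverage = sum(answerable(candidate, q) for q in src_questions) / max(len(src_questions), 1)
    # Conformity: does the source support questions derived from the candidate?
    # (Low conformity signals unsupported or contradictory additions.)
    conformity = sum(answerable(source, q) for q in cand_questions) / max(len(cand_questions), 1)
    # Consistency: does the candidate coherently answer its own questions?
    consistency = sum(answerable(candidate, q) for q in cand_questions) / max(len(cand_questions), 1)
    return {"coverage": coverage, "conformity": conformity, "consistency": consistency}
```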
[26] Binary Token-Level Classification with DeBERTa for All-Type MWE Identification: A Lightweight Approach with Linguistic Enhancement
Diego Rossini, Lonneke van der Plas
Main category: cs.CL
TL;DR: DeBERTa-v3-large model achieves 69.8% F1 on CoAM dataset for MWE identification, beating Qwen-72B by 12 points with 165x fewer parameters through binary token classification, linguistic features, and data augmentation.
Details
Motivation: To develop an efficient multiword expression identification approach that outperforms large language models while using significantly fewer parameters, enabling resource-constrained deployments.
Method: Three key techniques: (1) reformulating detection as binary token-level START/END/INSIDE classification instead of span-based prediction, (2) incorporating NP chunking and dependency features to help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling to address severe class imbalance in training data.
Result: Achieved 69.8% F1 on CoAM dataset (12 points higher than Qwen-72B’s 57.8% F1) using 165x fewer parameters, and confirmed generalization with 78.9% F1 on STREUSLE dataset.
Conclusion: Carefully designed smaller models can substantially outperform large language models on structured NLP tasks, with important implications for resource-constrained deployments where efficiency matters.
Abstract: We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the best results (Qwen-72B, 57.8% F1) on this dataset by 12 points while using 165x fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling that addresses severe class imbalance in the training data. We confirm the generalization of our method on the STREUSLE dataset, achieving 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.
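A sketch of the token-level reformulation under stated assumptions: three independent binary decisions (START / END / INSIDE) per token on top of a DeBERTa encoder. The head layout and decoding are guesses at the general shape, not the paper's exact configuration.

```python
# Sketch: per-token binary START/END/INSIDE tagging over DeBERTa.
import torch
from transformers import AutoModel, AutoTokenizer

class MWETagger(torch.nn.Module):
    def __init__(self, name: str = "microsoft/deberta-v3-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        # One logit per tag per token: [START, END, INSIDE].
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 3)

    def forward(self, **batch):
        hidden = self.encoder(**batch).last_hidden_state  # (B, T, H)
        return torch.sigmoid(self.head(hidden))           # (B, T, 3) in [0, 1]

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = MWETagger()
batch = tok("He kicked the bucket yesterday", return_tensors="pt")
probs = model(**batch)  # per-token START/END/INSIDE probabilities
```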
[27] Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?
Ahrii Kim, Seong-heum Kim
Main category: cs.CL
TL;DR: LLMs show near-human APE quality with simple prompting but fail to effectively use document context, have high costs, and require human evaluation despite strong performance.
Details
Motivation: To systematically evaluate how well large language models (LLMs) perform automatic post-editing (APE) of machine translations, especially when document-level context is available, since this capability remains poorly understood despite LLMs' strong translation abilities.
Method: Conducted systematic comparison of proprietary and open-weight LLMs using naive document-level prompting setup. Analyzed APE quality, contextual behavior, robustness to data poisoning attacks, and efficiency metrics.
Result: Proprietary LLMs achieve near human-level APE quality with simple one-shot prompting, but fail to effectively exploit document-level context for contextual error correction. They show higher robustness to attacks than open-weight models. Standard automatic metrics don’t reliably reflect qualitative improvements. High cost and latency make proprietary LLMs impractical for real-world deployment.
Conclusion: LLMs show promise for APE but current limitations include ineffective use of document context, unreliable automatic metrics requiring human evaluation, and impractical costs/latency. Need more efficient long-context modeling approaches for practical translation refinement.
Abstract: Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE–especially under document-level context–remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement.
[28] KG-CRAFT: Knowledge Graph-based Contrastive Reasoning with LLMs for Enhancing Automated Fact-checking
Vítor N. Lourenço, Aline Paes, Tillman Weyde, Audrey Depeige, Mohnish Dubey
Main category: cs.CL
TL;DR: KG-CRAFT improves claim verification by using LLMs with knowledge graph-based contrastive questions to guide evidence distillation and veracity assessment.
Details
Motivation: To enhance automatic claim verification by better leveraging evidence sources through structured knowledge representation and contrastive reasoning.
Method: Constructs knowledge graphs from claims and reports, formulates contextually relevant contrastive questions based on graph structure, distills evidence-based reports, and synthesizes summaries for LLM-based veracity assessment.
Result: Achieves state-of-the-art predictive performance on LIAR-RAW and RAWFC datasets, demonstrating effectiveness of knowledge graph-based contrastive reasoning for fact-checking.
Conclusion: KG-CRAFT successfully improves LLMs’ fact-checking capabilities through structured knowledge representation and contrastive questioning, advancing automated claim verification systems.
Abstract: Claim verification is a core component of automated fact-checking systems, aimed at determining the truthfulness of a statement by assessing it against reliable evidence sources such as documents or knowledge bases. This work presents KG-CRAFT, a method that improves automatic claim verification by leveraging large language models (LLMs) augmented with contrastive questions grounded in a knowledge graph. KG-CRAFT first constructs a knowledge graph from claims and associated reports, then formulates contextually relevant contrastive questions based on the knowledge graph structure. These questions guide the distillation of evidence-based reports, which are synthesised into a concise summary that is used for veracity assessment by LLMs. Extensive evaluations on two real-world datasets (LIAR-RAW and RAWFC) demonstrate that our method achieves a new state-of-the-art in predictive performance. Comprehensive analyses validate in detail the effectiveness of our knowledge graph-based contrastive reasoning approach in improving LLMs’ fact-checking capabilities.
[29] Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition
Isha Pandey, Ashish Mittal, Vartul Bahuguna, Ganesh Ramakrishnan
Main category: cs.CL
TL;DR: SMEAR-MoE: A stabilized Mixture-of-Experts projector for multilingual ASR that prevents expert collapse while enabling cross-lingual sharing, achieving up to 7.6% WER reduction over single-projector baselines.
Details
Motivation: Single projectors in LLM-based ASR systems struggle to capture diverse acoustic-to-semantic mappings required for multilingual settings, necessitating more sophisticated projection mechanisms.
Method: Proposes SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts to prevent expert collapse while enabling cross-lingual sharing. Systematically compares monolithic, static multi-projector, and dynamic MoE designs across four Indic languages.
Result: Achieves up to 7.6% relative WER reduction over single-projector baseline while maintaining comparable runtime efficiency. Expert routing analysis shows linguistically meaningful specialization with related languages sharing experts.
Conclusion: Stable multi-expert projectors are key to scalable and robust multilingual ASR, with SMEAR-MoE demonstrating effective cross-lingual sharing and performance improvements.
Abstract: Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline, while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
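SMEAR-style soft merging is a published idea: instead of routing each input to a discrete expert, the gate's probabilities merge the expert weight matrices, so every expert receives a gradient on every step. A minimal sketch of such a projector (per-utterance gating and the init scale are assumptions; whatever additional stabilization SMEAR-MoE adds is not shown):

```python
# Sketch: soft-merged expert projector with dense gradient flow.
import torch

class SMEARProjector(torch.nn.Module):
    def __init__(self, d_in: int, d_out: int, n_experts: int = 4):
        super().__init__()
        self.gate = torch.nn.Linear(d_in, n_experts)
        self.experts = torch.nn.Parameter(torch.randn(n_experts, d_in, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_in)
        g = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)    # (B, E) gate probs
        # One merged expert per utterance: every expert gets gradient.
        merged = torch.einsum("be,eio->bio", g, self.experts)  # (B, d_in, d_out)
        return torch.bmm(x, merged)                            # (B, T, d_out)
```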
[30] ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles
Ricardo Campos, Raquel Sequeira, Sara Nerea, Inês Cantante, Diogo Folques, Luís Filipe Cunha, João Canavilhas, António Branco, Alípio Jorge, Sérgio Nunes, Nuno Guimarães, Purificação Silvano
Main category: cs.CL
TL;DR: Introduces ClaimPT, a European Portuguese news article dataset with 1,308 articles and 6,875 claim annotations, addressing the lack of accessible fact-checking resources for Portuguese.
Details
Motivation: Manual fact-checking is slow and can't scale with online misinformation spread. Portuguese lacks accessible, licensed datasets for automated fact-checking research, unlike English which dominates due to abundant data.
Method: Created ClaimPT dataset through partnership with LUSA (Portuguese News Agency), focusing on journalistic content. Two trained annotators labeled each article with curator validation using a newly proposed annotation scheme. Also developed baseline models for claim detection.
Result: Produced a dataset of 1,308 European Portuguese news articles with 6,875 individual claim annotations, establishing initial benchmarks for claim detection in Portuguese.
Conclusion: ClaimPT advances low-resource fact-checking research and enhances understanding of misinformation in news media by providing a valuable resource for Portuguese NLP and IR applications.
Abstract: Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This is particularly important because debunking false information typically takes longer to reach consumers than the misinformation itself; accelerating corrections through automation can therefore help counter it more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claims is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. Portuguese, like other languages, still lacks accessible, licensed datasets, limiting research, NLP developments and applications. In this paper, we introduce ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure annotation quality, two trained annotators labeled each article, with a curator validating all annotations according to a newly proposed scheme. We also provide baseline models for claim detection, establishing initial benchmarks and enabling future NLP and IR applications. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.
[31] GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
Wei Huang, Anda Cheng, Yinggui Wang
Main category: cs.CL
TL;DR: GradPruner is a gradient-guided layer pruning method for LLMs that reduces parameters by 40% with only 0.99% accuracy drop during fine-tuning, improving both training and inference efficiency.
Details
Motivation: Fine-tuning LLMs is time-consuming and expensive, while existing structured pruning methods require additional training overhead, making efficient fine-tuning challenging. There's a need to simultaneously improve both training and inference efficiency for downstream tasks.
Method: GradPruner uses cumulative gradients from early fine-tuning to compute an Initial Gradient Information Accumulation Matrix (IGIA-Matrix) that assesses layer importance. It prunes less important layers, sparsifies them based on IGIA-Matrix, and merges them with remaining layers (only merging elements with same sign to reduce interference).
Result: Experiments on two LLMs across eight downstream datasets (medical, financial, and general benchmarks) show GradPruner achieves 40% parameter reduction with only 0.99% accuracy decrease.
Conclusion: GradPruner effectively enhances both training and inference efficiency of LLM fine-tuning through gradient-guided layer pruning, achieving significant parameter reduction with minimal accuracy loss across diverse domains.
Abstract: Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. Meanwhile, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner achieves a 40% parameter reduction with only a 0.99% decrease in accuracy. Our code is publicly available.
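A rough sketch of the three steps named in the abstract: accumulate gradient magnitudes per layer during early fine-tuning, rank layers by that accumulation, and merge a pruned layer into a kept one only where element signs agree. The per-layer IGIA aggregation, the `model.layers` attribute, and the additive merge are assumptions.

```python
# Sketch of the GradPruner recipe under stated assumptions.
import torch

def accumulate_igia(model, igia: dict):
    # Call after loss.backward() during the initial fine-tuning steps.
    for idx, layer in enumerate(model.layers):
        g = sum(p.grad.abs().sum().item()
                for p in layer.parameters() if p.grad is not None)
        igia[idx] = igia.get(idx, 0.0) + g

def layers_to_prune(igia: dict, ratio: float = 0.4) -> list[int]:
    k = int(len(igia) * ratio)
    return sorted(igia, key=igia.get)[:k]  # least important layers first

@torch.no_grad()
def merge_same_sign(kept: torch.Tensor, pruned: torch.Tensor) -> torch.Tensor:
    # Fold a pruned layer's weights into a kept layer, but only where the
    # element signs agree, to reduce interference from sign variations.
    same_sign = torch.sign(kept) == torch.sign(pruned)
    return torch.where(same_sign, kept + pruned, kept)
```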
[32] Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs
Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Qi Jia, Chunyi Li, Renrui Zhang, Heng Li, Zongrui Wang, Wei Sun
Main category: cs.CL
TL;DR: VLSafetyBencher is an automated system for constructing safety benchmarks for large vision-language models, addressing limitations of manual, static benchmarks with four collaborative agents for efficient, high-quality sample generation.
Details
Motivation: Existing LVLM safety benchmarks are labor-intensive, static, and have limited discriminative power, failing to keep pace with rapidly evolving models and emerging safety risks.
Method: Four collaborative agents: Data Preprocessing, Generation, Augmentation, and Selection agents work together to automatically construct and select high-quality safety benchmark samples.
Result: System can construct high-quality safety benchmarks within one week at minimal cost, with benchmarks showing 70% safety rate disparity between most and least safe models.
Conclusion: VLSafetyBencher provides an automated, efficient solution for LVLM safety benchmarking that can adapt to evolving models and risks, significantly improving over manual approaches.
Abstract: Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges, which undermine their reliability in real-world applications. Efforts have been made to build LVLM safety evaluation benchmarks to uncover their vulnerability. However, existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power. Thus, they may fail to keep pace with rapidly evolving models and emerging risks. To address these limitations, we propose VLSafetyBencher, the first automated system for LVLM safety benchmarking. VLSafetyBencher introduces four collaborative agents: Data Preprocessing, Generation, Augmentation, and Selection agents to construct and select high-quality samples. Experiments validate that VLSafetyBencher can construct high-quality safety benchmarks within one week at minimal cost. The generated benchmark effectively discriminates model safety, with a safety-rate disparity of 70% between the most and least safe models.
[33] Yunque DeepResearch Technical Report
Yuxuan Cai, Xinyi Lai, Peng Yuan, Weiting Liu, Huajian Li, Mingda Li, Xinghua Wang, Shengxie Zheng, Yanchao Hao, Yuyang Yin, Zheng Wei
Main category: cs.CL
TL;DR: Yunque DeepResearch is a hierarchical, modular framework that addresses limitations in deep research for autonomous agents, featuring multi-agent orchestration, dynamic context management, and proactive supervision to achieve SOTA performance on various benchmarks.
Details
Motivation: Current deep research approaches for autonomous agents face critical limitations: escalating contextual noise in long-horizon tasks, fragility leading to cascading errors, and lack of modular extensibility, hindering their full potential.
Method: Introduces a hierarchical framework with three key components: 1) Multi-Agent Orchestration System for routing subtasks to specialized tools/sub-agents, 2) Dynamic Context Management that structures completed sub-goals into semantic summaries, and 3) proactive Supervisor Module for anomaly detection and context pruning.
Result: Achieves state-of-the-art performance across multiple agentic deep research benchmarks including GAIA, BrowseComp, BrowseComp-ZH, and Humanity’s Last Exam. The framework is open-sourced with reproducible implementations.
Conclusion: Yunque DeepResearch provides a robust, modular solution to overcome key limitations in deep research for autonomous agents, enabling more effective navigation of complex, open-ended tasks while being made available to the research community.
Abstract: Deep research has emerged as a transformative capability for autonomous agents, empowering Large Language Models to navigate complex, open-ended tasks. However, realizing its full potential is hindered by critical limitations, including escalating contextual noise in long-horizon tasks, fragility leading to cascading errors, and a lack of modular extensibility. To address these challenges, we introduce Yunque DeepResearch, a hierarchical, modular, and robust framework. The architecture is characterized by three key components: (1) a centralized Multi-Agent Orchestration System that routes subtasks to an Atomic Capability Pool of tools and specialized sub-agents; (2) a Dynamic Context Management mechanism that structures completed sub-goals into semantic summaries to mitigate information overload; and (3) a proactive Supervisor Module that ensures resilience through active anomaly detection and context pruning. Yunque DeepResearch achieves state-of-the-art performance across a range of agentic deep research benchmarks, including GAIA, BrowseComp, BrowseComp-ZH, and Humanity’s Last Exam. We open-source the framework, reproducible implementations, and application cases to empower the community.
[34] Decompose-and-Formalise: Recursively Verifiable Natural Language Inference
Xin Quan, Marco Valentino, Louise A. Dennis, André Freitas
Main category: cs.CL
TL;DR: A decompose-and-formalise framework improves neuro-symbolic NLI by breaking down arguments into atomic steps, isolating failures locally, and using diagnostic-guided refinement instead of global regeneration, achieving significant verification improvements.
Details
Motivation: Current neuro-symbolic pipelines for NLI struggle with long, complex inputs where autoformalisation errors are amplified, and failures require costly global regeneration due to difficulty localizing responsible spans from prover diagnostics.
Method: Proposes a decompose-and-formalise framework that: (1) decomposes premise-hypothesis pairs into entailment trees of atomic steps, (2) verifies trees bottom-up to isolate failures to specific nodes, (3) performs local diagnostic-guided refinement, and (4) introduces θ-substitution in event-based logical forms for consistent argument-role bindings.
Result: Achieves highest explanation verification rates across reasoning tasks using five LLM backbones, improving over state-of-the-art by 26.2%, 21.7%, 21.6% and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
Conclusion: The decompose-and-formalise framework effectively addresses scaling challenges in neuro-symbolic NLI by enabling localized error isolation and refinement, significantly improving verification rates while maintaining efficiency and accuracy.
Abstract: Recent work has shown that integrating large language models (LLMs) with theorem provers (TPs) in neuro-symbolic pipelines helps with entailment verification and proof-guided refinement of explanations for natural language inference (NLI). However, scaling such refinement to naturalistic NLI remains difficult: long, syntactically rich inputs and deep multi-step arguments amplify autoformalisation errors, where a single local mismatch can invalidate the proof. Moreover, current methods often handle failures via costly global regeneration due to the difficulty of localising the responsible span or step from prover diagnostics. Aiming to address these problems, we propose a decompose-and-formalise framework that (i) decomposes premise-hypothesis pairs into an entailment tree of atomic steps, (ii) verifies the tree bottom-up to isolate failures to specific nodes, and (iii) performs local diagnostic-guided refinement instead of regenerating the whole explanation. Moreover, to improve faithfulness of autoformalisation, we introduce θ-substitution in an event-based logical form to enforce consistent argument-role bindings. Across a range of reasoning tasks using five LLM backbones, our method achieves the highest explanation verification rates, improving over the state-of-the-art by 26.2%, 21.7%, 21.6% and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
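A minimal sketch of the bottom-up verification loop with local repair; the `prove` and `refine` callables stand in for the paper's theorem-prover and diagnostic-guided LLM refinement, whose interfaces are not specified here.

```python
# Sketch: verify an entailment tree bottom-up so failures are isolated
# to single nodes, then repair only the failing node.
from dataclasses import dataclass, field

@dataclass
class Node:
    formula: str                                  # autoformalised atomic step
    children: list["Node"] = field(default_factory=list)

def verify(node: Node, prove, refine, max_tries: int = 3) -> bool:
    # Children first: a failure below never triggers a global regeneration.
    if not all(verify(c, prove, refine, max_tries) for c in node.children):
        return False
    for _ in range(max_tries):
        ok, diagnostics = prove(node.formula, [c.formula for c in node.children])
        if ok:
            return True
        # Local, diagnostic-guided refinement of this node only.
        node.formula = refine(node.formula, diagnostics)
    return False
```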
[35] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs
Xinzhong Wang, Ya Guo, Jing Li, Huan Chen, Yi Tu, Yijie Hong, Gongshen Liu, Huijia Zhu
Main category: cs.CL
TL;DR: PIP introduces a parallel inference paradigm for Key Information Extraction from documents, achieving 5-36x speedup over autoregressive models by generating all target values simultaneously using mask tokens.
Details
Motivation: Current LLMs and MLLMs for KIE rely on autoregressive inference which creates efficiency bottlenecks, especially when extracting multiple independent fields sequentially. This limits scalability for real-world applications.
Method: Reformulates KIE by using "[mask]" tokens as placeholders for all target values, enabling simultaneous generation in a single forward pass. Includes tailored mask pre-training strategy and construction of large-scale supervised datasets.
Result: PIP-models achieve 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models.
Conclusion: PIP substantially improves efficiency while maintaining high accuracy, paving the way for scalable and practical real-world KIE solutions.
Abstract: Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using “[mask]” tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
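To make the paradigm concrete, here is a toy version using an off-the-shelf masked LM in place of the paper's mask-pretrained MLLM: one [MASK] placeholder per target field, all filled from a single forward pass. Single-token field values are a simplifying assumption (real values span multiple tokens).

```python
# Toy mask-based parallel extraction with a stock masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

prompt = ("Invoice. Vendor: [MASK]. Date: [MASK]. Total: [MASK]. "
          "Document text: ACME Corp, 2024-05-01, total due 42 dollars.")
batch = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits                          # one forward pass
mask_pos = (batch["input_ids"][0] == tok.mask_token_id).nonzero().squeeze(-1)
fields = [tok.decode(logits[0, p].argmax()) for p in mask_pos]
print(fields)  # all target fields decoded simultaneously
```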
[36] RATE: Reviewer Profiling and Annotation-free Training for Expertise Ranking in Peer Review Systems
Weicong Liu, Zixuan Yang, Yibo Zhao, Xiang Li
Main category: cs.CL
TL;DR: LR-bench: A new benchmark for reviewer assignment using 2024-2025 AI/NLP papers with self-assessed familiarity ratings, plus RATE framework that creates keyword-based reviewer profiles and fine-tunes embeddings for better matching.
Details
Motivation: Reviewer assignment is challenging in the LLM era due to outdated benchmarks (pre-2023) and poor proxy signals for reviewer familiarity. There's a need for up-to-date evaluation data and better matching methods.
Method: 1) Created LR-bench benchmark with 1055 expert annotations from 2024-2025 AI/NLP papers using email survey for self-assessed familiarity ratings. 2) Proposed RATE framework that distills reviewers' recent publications into keyword-based profiles and fine-tunes embedding models with weak preference supervision from heuristic retrieval signals.
Result: The approach achieves state-of-the-art performance on both LR-bench and CMU gold-standard dataset, outperforming strong embedding baselines by a clear margin.
Conclusion: LR-bench provides a high-fidelity, up-to-date benchmark for reviewer assignment evaluation, and RATE offers an effective reviewer-centric ranking framework that improves matching accuracy through keyword-based profiles and fine-tuned embeddings.
Abstract: Reviewer assignment is increasingly critical yet challenging in the LLM era, where rapid topic shifts render many pre-2023 benchmarks outdated and where proxy signals poorly reflect true reviewer familiarity. We address this evaluation bottleneck by introducing LR-bench, a high-fidelity, up-to-date benchmark curated from 2024-2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings collected via a large-scale email survey, yielding 1,055 expert-annotated paper-reviewer-score triples. We further propose RATE, a reviewer-centric ranking framework that distills each reviewer's recent publications into compact keyword-based profiles and fine-tunes an embedding model with weak preference supervision constructed from heuristic retrieval signals, enabling matching each manuscript against a reviewer profile directly. Across LR-bench and the CMU gold-standard dataset, our approach consistently achieves state-of-the-art performance, outperforming strong embedding baselines by a clear margin. We release LR-bench at https://huggingface.co/datasets/Gnociew/LR-bench, and a GitHub repository at https://github.com/Gnociew/RATE-Reviewer-Assign.
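A sketch of the profile-then-match step under stated assumptions: `embed` (text to vector) and `keywords_of` (paper to keyword list) are hypothetical stand-ins for the fine-tuned embedding model and the profile distillation.

```python
# Sketch: rank reviewers by cosine similarity between a manuscript and
# each reviewer's keyword profile.
import numpy as np

def rank_reviewers(embed, keywords_of, manuscript: str,
                   reviewers: dict[str, list[str]]) -> list[tuple[str, float]]:
    m = embed(manuscript)
    scores = {}
    for name, recent_papers in reviewers.items():
        # Distill recent publications into a compact keyword profile.
        profile = " ".join(kw for p in recent_papers for kw in keywords_of(p))
        v = embed(profile)
        scores[name] = float(np.dot(m, v) / (np.linalg.norm(m) * np.linalg.norm(v)))
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best match first
```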
[37] One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Shaosheng Cao
Main category: cs.CL
TL;DR: DLMs suffer from moving sink instability; adding a dedicated extra sink token with self-only attention stabilizes attention sinks and improves performance.
Details
Motivation: Diffusion Language Models (DLMs) enable parallel text generation but suffer from critical instability called the "moving sink phenomenon" where sink tokens unpredictably shift across diffusion steps, undermining inference robustness.
Method: Introduce a simple extra sink token via modified attention mask: a special token constrained to attend solely to itself while remaining globally visible to all other tokens, creating a dedicated structural sink.
Result: Adding a single extra token stabilizes attention sinks and substantially improves model performance; effectiveness is independent of token position and has negligible semantic content, confirming its role as a robust structural sink.
Conclusion: The moving sink phenomenon in DLMs can be effectively resolved by introducing a dedicated extra sink token with self-only attention, providing a simple but effective solution to stabilize attention mechanisms and improve inference robustness.
Abstract: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer’s value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.
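The modified mask is simple to state precisely: the sink token attends only to itself while remaining visible to every other token. A small sketch (bidirectional base mask, as in a diffusion LM; the boolean convention, True meaning attention is allowed, is an assumption, and the paper finds the sink's position does not matter):

```python
# Sketch of the extra-sink attention mask.
import torch

def sink_attention_mask(seq_len: int, sink_idx: int = 0) -> torch.Tensor:
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)  # bidirectional base
    mask[sink_idx, :] = False        # the sink attends to nothing else...
    mask[sink_idx, sink_idx] = True  # ...except itself
    return mask  # column sink_idx stays True: the sink is globally visible

print(sink_attention_mask(5).int())
```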
[38] SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking
Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier
Main category: cs.CL
TL;DR: SynCABEL is a framework that uses LLMs to generate synthetic training data for biomedical entity linking, achieving SOTA results with less human annotation and improving clinically valid predictions.
Details
Motivation: The paper addresses the scarcity of expert-annotated training data in biomedical entity linking, which is a major bottleneck for supervised approaches in this domain.
Method: SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. It combines decoder-only models with guided inference and introduces an LLM-as-a-judge protocol for evaluation.
Result: SynCABEL establishes new state-of-the-art results across three multilingual benchmarks (MedMentions, QUAERO, SPACCC), reaches full human supervision performance with up to 60% less annotated data, and significantly improves clinically valid predictions according to LLM-as-a-judge evaluation.
Conclusion: SynCABEL effectively addresses the data scarcity problem in biomedical entity linking through synthetic data generation, reduces reliance on costly expert labeling, and improves clinical validity of predictions. The framework’s datasets, models, and code are publicly released.
Abstract: We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.
[39] Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes
Yifan Wang, Jichen Zheng, Jingyuan Sun, Yunhao Zhang, Chunyu Ye, Jixing Li, Chengqing Zong, Shaonan Wang
Main category: cs.CL
TL;DR: LLMs can simulate aphasia by selectively perturbing functional components, with modular architectures providing more localized mappings to clinical phenotypes.
Details
Motivation: To create scalable computational proxies for testing rehabilitation hypotheses and probing the functional organization of language by simulating aphasic impairments in LLMs.
Method: Developed a clinically grounded framework that selectively perturbs functional components in LLMs (both Mixture-of-Experts and dense Transformers), identifies subtype-linked components for Broca's and Wernicke's aphasia, interprets components with linguistic probing, and induces graded impairments by perturbing top-k components, evaluated using Western Aphasia Battery subtests.
Result: Subtype-targeted perturbations produce more systematic, aphasia-like regressions than random perturbations, and MoE modularity enables more localized and interpretable phenotype-to-component mappings across architectures.
Conclusion: Modular LLMs with clinically informed component perturbations offer a promising platform for simulating aphasic language production and studying how distinct language functions degrade under targeted disruptions.
Abstract: Large language models (LLMs) increasingly exhibit human-like linguistic behaviors and internal representations, suggesting that they could serve as computational simulators of language cognition. We ask whether LLMs can be systematically manipulated to reproduce language-production impairments characteristic of aphasia following focal brain lesions. Such models could provide scalable proxies for testing rehabilitation hypotheses, and offer a controlled framework for probing the functional organization of language. We introduce a clinically grounded, component-level framework that simulates aphasia by selectively perturbing functional components in LLMs, and apply it to both modular Mixture-of-Experts models and dense Transformers using a unified intervention interface. Our pipeline (i) identifies subtype-linked components for Broca's and Wernicke's aphasia, (ii) interprets these components with linguistic probing tasks, and (iii) induces graded impairments by progressively perturbing the top-k subtype-linked components, evaluating outcomes with Western Aphasia Battery (WAB) subtests summarized by Aphasia Quotient (AQ). Across architectures and lesioning strategies, subtype-targeted perturbations yield more systematic, aphasia-like regressions than size-matched random perturbations, and MoE modularity supports more localized and interpretable phenotype-to-component mappings. These findings suggest that modular LLMs, combined with clinically informed component perturbations, provide a promising platform for simulating aphasic language production and studying how distinct language functions degrade under targeted disruptions.
[40] TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu
Main category: cs.CL
TL;DR: TokenSeek is a universal plugin for transformer models that uses instance-aware token selection to dramatically reduce fine-tuning memory usage (down to 14.8% for Llama3.2 1B) while maintaining or improving performance.
Details
Motivation: Fine-tuning LLMs is memory-intensive, with activations being the main memory bottleneck. Existing activation optimization methods are data-agnostic, leading to ineffective and unstable fine-tuning performance.
Method: TokenSeek uses instance-aware token seeking and ditching - a selective approach that identifies and retains important tokens while discarding less important ones during fine-tuning, implemented as a universal plugin for transformer models.
Result: Achieves significant memory savings (e.g., 14.8% of original memory for Llama3.2 1B) with on-par or even better performance compared to standard fine-tuning. The method also provides interpretable insights into token importance.
Conclusion: TokenSeek offers an effective solution to the memory efficiency problem in LLM fine-tuning through data-aware token selection, providing both practical benefits and valuable insights for future research on token efficiency.
Abstract: Fine-tuning has been regarded as the de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory-efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior art offers various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine-tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: https://runjia.tech/iclr_tokenseek/
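The abstract does not give the selection rule, so the following is only a guess at the general shape: score tokens per instance (here by attention mass received, an assumed criterion) and keep the top fraction, so fewer activations are stored for the backward pass.

```python
# Hedged sketch of instance-aware token ditching.
import torch

def keep_important_tokens(hidden: torch.Tensor, attn: torch.Tensor,
                          keep_ratio: float = 0.5) -> torch.Tensor:
    # hidden: (B, T, H); attn: (B, heads, T, T) attention weights.
    received = attn.mean(dim=1).sum(dim=1)   # (B, T): attention each token receives
    k = max(1, int(hidden.size(1) * keep_ratio))
    idx = received.topk(k, dim=-1).indices.sort(dim=-1).values
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return hidden.gather(1, gather_idx)      # (B, k, H): ditched tokens dropped
```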
[41] Strong Reasoning Isn’t Enough: Evaluating Evidence Elicitation in Interactive Diagnosis
Zhuohan Long, Zhijie Bao, Zhongyu Wei
Main category: cs.CL
TL;DR: Proposes EviMed benchmark and REFINE strategy for evaluating and improving interactive medical consultation agents’ evidence-gathering abilities, showing diagnostic reasoning alone is insufficient for effective information collection.
Details
Motivation: Existing evaluations for medical consultation agents are static or outcome-centric, neglecting the evidence-gathering process. There's a need to evaluate how well agents proactively elicit missing clinical evidence under uncertainty during interactive consultations.
Method: 1) Creates EviMed benchmark with simulated patient/reporter grounded in atomic evidences; 2) Introduces Information Coverage Rate (ICR) to quantify evidence collection completeness; 3) Proposes REFINE strategy that uses diagnostic verification to guide agents in resolving uncertainties; 4) Evaluates 10 models with varying reasoning abilities.
Result: Strong diagnostic reasoning doesn’t guarantee effective information collection, which is a primary bottleneck in interactive settings. REFINE consistently outperforms baselines across datasets and enables smaller agents to achieve superior performance under strong reasoning supervision through effective model collaboration.
Conclusion: The evidence-gathering process is crucial for interactive medical consultation, and REFINE provides an effective strategy to improve agents’ proactive uncertainty resolution. The EviMed benchmark enables systematic evaluation of consultation process quality beyond just diagnostic outcomes.
Abstract: Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations largely remain static or outcome-centric, neglecting the evidence-gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter grounded in atomic evidences. Based on this representation, we introduce Information Coverage Rate (ICR) to quantify how completely an agent uncovers necessary evidence during interaction. To support systematic study, we build EviMed, an evidence-based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and this insufficiency acts as a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code can be found at https://github.com/NanshineLoong/EID-Benchmark.
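As a toy version of the ICR computation, under a naive string-containment matcher (the benchmark presumably scores evidence with its simulated reporter instead):

```python
# Toy ICR: fraction of the case's atomic evidences the dialogue elicited.
def information_coverage_rate(dialogue: str, atomic_evidences: list[str]) -> float:
    elicited = [e for e in atomic_evidences if e.lower() in dialogue.lower()]
    return len(elicited) / max(len(atomic_evidences), 1)

icr = information_coverage_rate(
    "Patient reports chest pain, worse on exertion, no fever.",
    ["chest pain", "worse on exertion", "no fever", "duration", "smoker"],
)
print(f"ICR = {icr:.2f}")  # 3 of 5 evidences surfaced -> 0.60
```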
[42] LVLMs and Humans Ground Differently in Referential Communication
Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan Brennan, Owen Rambow
Main category: cs.CL
TL;DR: The paper studies how well AI agents can model common ground in collaborative referential communication tasks, revealing limitations in current LVLMs’ ability to resolve referring expressions like humans do.
Details
Motivation: For generative AI agents to effectively partner with humans, they need to accurately predict human intent, but current systems are limited by an inability to model common ground - the shared understanding between communicators.
Method: A referential communication experiment with factorial design involving director-matcher pairs (human-human, human-AI, AI-human, AI-AI) interacting over multiple turns in repeated rounds to match pictures of objects without obvious lexical labels. The study includes tools for analyzing accuracy, efficiency, and lexical overlap.
Result: The paper releases a corpus of 356 dialogues (89 pairs over 4 rounds each) that reveals LVLMs’ limitations in interactively resolving referring expressions, a crucial skill underlying human language use.
Conclusion: Current Large Vision-Language Models have significant limitations in modeling common ground and resolving referring expressions in collaborative tasks, highlighting a critical deficit that needs to be addressed for effective human-AI partnership.
Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs’ limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.
[43] Zero-Shot Stance Detection in the Wild: Dynamic Target Generation and Multi-Target Adaptation
Aohua Li, Yuanshuo Zhang, Ge Gao, Bo Chen, Xiaobing Zhao
Main category: cs.CL
TL;DR: Proposes zero-shot stance detection with dynamic target generation and multi-target adaptation (DGTA) for identifying multiple target-stance pairs from text without prior target knowledge, using fine-tuned LLMs on Chinese social media data.
Details
Motivation: Real-world social media stance detection faces challenges because targets are not predefined or static but complex and dynamic, unlike current research that assumes given targets.
Method: Proposes DGTA task with dynamic target generation and multi-target adaptation. Constructs Chinese social media dataset with multi-dimensional metrics. Explores integrated and two-stage fine-tuning strategies for LLMs, evaluating various baseline models.
Result: Fine-tuned LLMs achieve superior performance: two-stage fine-tuned Qwen2.5-7B attains highest comprehensive target recognition score of 66.99%, while integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves stance detection F1 score of 79.26%.
Conclusion: The proposed DGTA framework effectively addresses real-world stance detection challenges by enabling automatic identification of multiple target-stance pairs without prior target knowledge, with fine-tuned LLMs demonstrating strong performance.
Abstract: Current stance detection research typically relies on predicting stance based on given targets and text. However, in real-world social media scenarios, targets are neither predefined nor static but rather complex and dynamic. To address this challenge, we propose a novel task: zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which aims to automatically identify multiple target-stance pairs from text without prior target knowledge. We construct a Chinese social media stance detection dataset and design multi-dimensional evaluation metrics. We explore both integrated and two-stage fine-tuning strategies for large language models (LLMs) and evaluate various baseline models. Experimental results demonstrate that fine-tuned LLMs achieve superior performance on this task: the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, while the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves a stance detection F1 score of 79.26%.
[44] When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
Main category: cs.CL
TL;DR: Iterative RAG with synchronized retrieval-reasoning loops outperforms static RAG (even with gold context) by up to 25.6 percentage points on scientific multi-hop QA, especially benefiting non-reasoning fine-tuned models.
Details
Motivation: It's unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. The study aims to provide the first controlled, mechanism-level diagnostic of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound.
Method: Benchmarked 11 state-of-the-art LLMs under three regimes: (1) No Context (parametric memory), (2) Gold Context (all oracle evidence at once), and (3) Iterative RAG (training-free controller alternating retrieval, hypothesis refinement, and evidence-aware stopping). Used ChemKGMultiHopQA chemistry dataset with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration.
Result: Iterative RAG consistently outperforms Gold Context across models with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, enables dynamic correction of early hypothesis drift, but has remaining failure modes including incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates.
Conclusion: Staged retrieval is often more influential than the mere presence of ideal evidence. The study provides practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
Abstract: Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
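The controller described above reduces to a short loop. The following Python sketch is a hypothetical rendering of that alternation (query formation, retrieval, hypothesis refinement, evidence-aware stopping); the `retrieve` and `generate` callables and all prompt wordings are illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of a training-free iterative RAG controller:
# alternate retrieval, hypothesis refinement, and evidence-aware stopping.
def iterative_rag(question, retrieve, generate, max_hops=4):
    """retrieve(query) -> list[str] evidence passages;
    generate(prompt) -> str wraps an LLM call."""
    evidence, hypothesis = [], ""
    for _ in range(max_hops):
        # Form the next query from the question and the current hypothesis.
        query = generate(
            f"Question: {question}\nCurrent hypothesis: {hypothesis}\n"
            "Write one search query for the missing evidence."
        )
        evidence.extend(retrieve(query))
        context = "\n".join(evidence)
        # Refine the hypothesis against everything retrieved so far.
        hypothesis = generate(
            f"Evidence:\n{context}\n\nQuestion: {question}\n"
            "State your best current answer with reasoning."
        )
        # Evidence-aware stopping: ask whether the evidence now suffices.
        verdict = generate(
            f"Evidence:\n{context}\n\nQuestion: {question}\n"
            "Is the evidence sufficient to answer? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            break
    return hypothesis
```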
[45] Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering
Fangan Dong, Zuming Yan, Xuri Ge, Zhiwei Xu, Mengqi Zhang, Xuanang Chen, Ben He, Xin Xin, Zhumin Chen, Ying Zhou
Main category: cs.CL
TL;DR: AdaRAS is a lightweight test-time framework that improves LLM reasoning reliability by selectively steering neuron activations, achieving significant gains on math/coding benchmarks without additional training or sampling costs.
Details
Motivation: Current LLMs require expensive post-training or sampling strategies for reliable reasoning performance, limiting practical efficiency. The authors discovered that a small subset of neurons strongly correlates with reasoning correctness, suggesting a more efficient approach.
Method: AdaRAS identifies Reasoning-Critical Neurons (RCNs) using a polarity-aware mean-difference criterion, then adaptively steers their activations during inference. It enhances incorrect reasoning traces while avoiding degradation on already-correct cases.
Result: Experiments on 10 mathematics and coding benchmarks show consistent improvements, including over 13% gains on AIME-24 and AIME-25. The method exhibits strong transferability across datasets and scalability to stronger models.
Conclusion: AdaRAS outperforms post-training methods without additional training or sampling cost, providing an efficient test-time framework for improving LLM reasoning reliability through targeted neuron activation steering.
Abstract: Despite the strong reasoning capabilities of recent large language models (LLMs), achieving reliable performance on challenging tasks often requires post-training or computationally expensive sampling strategies, limiting their practical efficiency. In this work, we first show that a small subset of neurons in LLMs exhibits strong predictive correlations with reasoning correctness. Based on this observation, we propose AdaRAS (Adaptive Reasoning Activation Steering), a lightweight test-time framework that improves reasoning reliability by selectively intervening on neuron activations. AdaRAS identifies Reasoning-Critical Neurons (RCNs) via a polarity-aware mean-difference criterion and adaptively steers their activations during inference, enhancing incorrect reasoning traces while avoiding degradation on already-correct cases. Experiments on 10 mathematics and coding benchmarks demonstrate consistent improvements, including over 13% gains on AIME-24 and AIME-25. Moreover, AdaRAS exhibits strong transferability across datasets and scalability to stronger models, outperforming post-training methods without additional training or sampling cost.
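The identification step admits a compact sketch. Here is a minimal numpy illustration of a polarity-aware mean-difference score and a derived steering vector; `top_k`, `alpha`, and the exact steering rule are assumptions for illustration, not AdaRAS's published procedure.

```python
import numpy as np

def neuron_scores(acts_correct, acts_incorrect):
    # acts_*: (num_traces, num_neurons) activations averaged per trace.
    # Sign encodes polarity (which direction helps); magnitude, importance.
    return acts_correct.mean(axis=0) - acts_incorrect.mean(axis=0)

def steering_vector(scores, top_k=64, alpha=1.0):
    vec = np.zeros_like(scores)
    idx = np.argsort(-np.abs(scores))[:top_k]   # reasoning-critical neurons
    vec[idx] = alpha * np.sign(scores[idx])     # push toward correct polarity
    return vec  # added to the hidden state during inference

rng = np.random.default_rng(0)
scores = neuron_scores(rng.normal(size=(100, 768)), rng.normal(size=(120, 768)))
print(int((steering_vector(scores, top_k=8) != 0).sum()))  # 8 steered neurons
```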
[46] Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection
Nicholas Cheng
Main category: cs.CL
TL;DR: Reflective Translation framework uses self-critique and revision to improve low-resource language translation without fine-tuning, showing consistent BLEU and COMET score improvements.
Details
Motivation: Low-resource languages like isiZulu and isiXhosa face challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest self-reflection can improve reasoning quality and factual consistency.
Method: Reflective Translation framework: model generates initial translation, produces structured self-critique, then uses this reflection to generate refined translation. Evaluated on English-isiZulu and English-isiXhosa using OPUS-100 and NTREX-African datasets with multiple prompting strategies and confidence thresholds.
Result: Consistent improvements in both BLEU and COMET scores between first- and second-pass translations, with average gains up to +0.22 BLEU and +0.18 COMET. Statistical significance testing confirms robust improvements. The method is model-agnostic, requires no fine-tuning.
Conclusion: Structured self-reflection is a practical and effective mechanism for improving translation quality in low-resource settings. The approach introduces a reflection-augmented dataset that can support future supervised or analysis-driven work.
Abstract: Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency. Building on this idea, this paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation. The approach is evaluated on English-isiZulu and English-isiXhosa translation using OPUS-100 and NTREX-African, across multiple prompting strategies and confidence thresholds. Results show consistent improvements in both BLEU and COMET scores between first- and second-pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET. Statistical significance testing using paired nonparametric tests confirms that these improvements are robust. The proposed method is model-agnostic, requires no fine-tuning, and introduces a reflection-augmented dataset that can support future supervised or analysis-driven work. These findings demonstrate that structured self-reflection is a practical and effective mechanism for improving translation quality in low-resource settings.
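The two-pass procedure is simple enough to sketch directly. Below, `llm` stands in for any chat-completion call; the prompt wordings are illustrative, not the paper's templates.

```python
def reflective_translate(llm, src, tgt_lang="isiZulu"):
    # Pass 1: initial translation.
    first = llm(f"Translate into {tgt_lang}:\n{src}")
    # Structured self-critique of the draft.
    critique = llm(
        f"Source: {src}\nDraft {tgt_lang} translation: {first}\n"
        "List concrete errors in adequacy, fluency, and terminology."
    )
    # Pass 2: refined translation conditioned on the critique.
    second = llm(
        f"Source: {src}\nDraft: {first}\nCritique: {critique}\n"
        f"Produce a corrected {tgt_lang} translation."
    )
    return first, second  # compare BLEU/COMET between the two passes
```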
[47] Evaluation of Oncotimia: An LLM based system for supporting tumour boards
Luis Lorenzo, Marcos Montana-Mendez, Sergio Figueiras, Miguel Boubeta, Cristobal Bernardo-Castineira
Main category: cs.CL
TL;DR: ONCOTIMIA is a clinical tool using GenAI/LLMs to automatically complete lung cancer tumour board forms, achieving 80% accuracy with acceptable latency, reducing documentation burden while maintaining data quality.
Details
Motivation: Multidisciplinary tumour boards (MDTBs) require manual processing of large volumes of heterogeneous clinical information, creating substantial documentation burden that needs automation solutions.
Method: Developed ONCOTIMIA - a modular clinical tool combining multi-layer data lake, hybrid relational/vector storage, retrieval-augmented generation (RAG), and rule-driven adaptive form model to transform unstructured clinical docs into structured tumour board records. Evaluated six LLMs on AWS Bedrock with ten lung cancer cases.
Result: Best configuration achieved 80% correct field completion with clinically acceptable response times. Larger and more recent models showed best accuracy without prohibitive latency. LLM-assisted autocompletion is technically feasible and operationally viable.
Conclusion: LLM-assisted autocompletion forms can significantly reduce documentation burden in multidisciplinary lung cancer workflows while preserving data quality, demonstrating both technical feasibility and operational viability.
Abstract: Multidisciplinary tumour boards (MDTBs) play a central role in oncology decision-making but require manually processing and structuring large volumes of heterogeneous clinical information, resulting in a substantial documentation burden. In this work, we present ONCOTIMIA, a modular and secure clinical tool designed to integrate generative artificial intelligence (GenAI) into oncology workflows, and evaluate its application to the automatic completion of lung cancer tumour board forms using large language models (LLMs). The system combines a multi-layer data lake, hybrid relational and vector storage, retrieval-augmented generation (RAG) and a rule-driven adaptive form model to transform unstructured clinical documentation into structured and standardised tumour board records. We assess the performance of six LLMs deployed through AWS Bedrock on ten lung cancer cases, measuring both form-completion accuracy and end-to-end latency. The results demonstrate high performance across models, with the best-performing configuration achieving 80% correct field completion and clinically acceptable response times for most LLMs. Larger and more recent models exhibit the best accuracy without incurring prohibitive latency. These findings provide empirical evidence that LLM-assisted form autocompletion is technically feasible and operationally viable in multidisciplinary lung cancer workflows and support its potential to significantly reduce documentation burden while preserving data quality.
[48] ThinkNote: Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist Cognition Modeling
Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Chaojun Xiao, Zhiyuan Liu, Ge Yu, Chenyan Xiong
Main category: cs.CL
TL;DR: ThinkNote is a framework that improves LLMs’ ability to use external knowledge through constructivist learning principles, achieving 10% better performance on QA benchmarks.
Details
Motivation: LLMs often struggle with effectively leveraging unfamiliar external information, showing suboptimal behaviors and inconsistencies when exposed to new knowledge.
Method: A two-stage constructivist cognitive modeling process: 1) Knowledge assimilation to align new information with parametric memory, and 2) Thought accommodation to adapt internal reasoning for consistent outputs.
Result: Achieves 10% improvement over strong baselines on various question-answering benchmarks, effectively integrates external knowledge, and improves LLMs’ self-consistency.
Conclusion: ThinkNote successfully enhances LLMs’ external knowledge utilization through constructivist learning principles, leading to more reliable and consistent performance.
Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of NLP tasks. However, they often exhibit suboptimal behaviors and inconsistencies when exposed to unfamiliar external information, underscoring their limitations in effectively leveraging such knowledge. Inspired by constructivist learning theory, we propose ThinkNote, a novel framework that enhances the external knowledge utilization of LLMs through a two-stage constructivist cognitive modeling process. Specifically, ThinkNote performs knowledge assimilation to align new information with the model’s parametric memory, forming a coherent internal representation. It then applies thought accommodation to adapt internal reasoning, thereby promoting more consistent and reliable outputs. Extensive experimental results demonstrate that ThinkNote achieves a 10% improvement over strong baseline methods on various question-answering benchmarks. Further analysis indicates that ThinkNote effectively integrates and utilizes external knowledge to help LLMs generate accurate responses and improves their self-consistency. All data and codes are available at https://github.com/OpenMatch/ThinkNote.
[49] Epistemological Bias As a Means for the Automated Detection of Injustices in Text
Kenya Andrews, Lamogha Chiazor
Main category: cs.CL
TL;DR: Novel framework combining epistemology with NLP to detect subtle implicit injustices in text, offering explainability and reducing human analysis time.
Details
Motivation: Implicit biases and stereotypes in text are subtle and often overlooked in automated detection due to their unconscious nature and societal pervasiveness, making detection challenging.
Method: Framework that integrates epistemological knowledge with NLP models to enhance detection of implicit injustices, providing explainability for the identified biases.
Result: Empirical study shows effective detection of implicit injustices; human baseline validation mostly agrees with framework’s identification of implicit bias, stereotypes, and sentiment.
Conclusion: Automated framework pipeline is valuable for assisting users in detecting implicit injustices with explainability while significantly reducing the time burden compared to manual analysis.
Abstract: Injustices in text are often subtle, since implicit biases or stereotypes frequently operate unconsciously due to the pervasive nature of prejudice in society. This makes automated detection of injustices more challenging, which leads to them often being overlooked. To address these complexities and offer explainability, we introduce a novel framework that combines knowledge from epistemology with NLP models to enhance the detection of implicit injustices in text. Our empirical study shows how our framework can be applied to effectively detect these injustices. We validate our framework using a human baseline study, which mostly agrees with the choice of implicit bias, stereotype, and sentiment. The main feedback from the study was the extended time required to analyze, digest, and decide on each component of our framework. This highlights the importance of our automated framework pipeline, which assists users in detecting implicit injustices while offering explainability and reducing the time burden on humans.
[50] Task-Specific Directions: Definition, Exploration, and Utilization in Parameter Efficient Fine-Tuning
Chongjie Si, Zhiyi Shi, Shifan Zhang, Xiaokang Yang, Hanspeter Pfister, Wei Shen
Main category: cs.CL
TL;DR: This paper introduces task-specific directions (TSDs) for PEFT, proposes LoRA-Dash to maximize TSD impact during fine-tuning, develops LoRA-Init for task-specific initialization, and combines them into LoRA-TSD, all showing significant performance improvements.
Details
Motivation: Large language models require extensive resources for full fine-tuning. While PEFT methods like LoRA help, there's a need to better understand and optimize the transition from pretrained states to task-specific enhancements through task-specific directions.
Method: 1) Proposes a framework to define and explore task-specific directions (TSDs); 2) Introduces LoRA-Dash to maximize TSD impact during fine-tuning; 3) Develops LoRA-Init for task-specific initialization by identifying directions needing most adjustment; 4) Combines both into LoRA-TSD.
Result: Extensive experiments demonstrate the effectiveness of these methods, with LoRA-Init significantly enhancing LoRA’s performance and LoRA-TSD showing improved results through combined approach.
Conclusion: The paper successfully identifies and leverages task-specific directions to improve PEFT methods, providing both theoretical framework (TSDs) and practical implementations (LoRA-Dash, LoRA-Init, LoRA-TSD) that enhance model performance on targeted tasks.
Abstract: Large language models demonstrate impressive performance on downstream tasks, yet they require extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs), which are critical for transitioning large models from pretrained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks. Additionally, based on our exploration of TSD, we focus on an important issue in PEFT: the initialization of LoRA. While some works have pointed out the significance of initialization for LoRA’s performance and proposed various strategies, these methods are often empirical and not task-specific. To address this issue, we propose LoRA-Init. Starting from TSD, we identify the directions that require the most adjustment during fine-tuning for downstream tasks. By initializing the matrices in LoRA with these directions, LoRA-Init significantly enhances LoRA’s performance. Moreover, we can combine LoRA-Dash and LoRA-Init to create the final version of LoRA based on TSDs, which we refer to as LoRA-TSD. Extensive experiments have conclusively demonstrated the effectiveness of these methods, and in-depth analyses further reveal the underlying mechanisms behind their success.
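As a loose illustration of the TSD intuition (not the paper's exact criterion), one can decompose an observed weight change with an SVD and treat the leading singular directions as those needing most adjustment, then seed LoRA's factors from them, LoRA-Init style. Everything below is an assumption-laden sketch.

```python
import torch

def tsd_init(delta_w: torch.Tensor, rank: int = 8):
    # delta_w: observed weight change W_tuned - W_pretrained for one layer.
    u, s, vh = torch.linalg.svd(delta_w, full_matrices=False)
    # Top-`rank` singular directions: where the weights moved the most.
    b = u[:, :rank] * s[:rank].sqrt()             # (out_dim, rank)
    a = s[:rank].sqrt().unsqueeze(1) * vh[:rank]  # (rank, in_dim)
    return a, b                                   # B @ A approximates delta_w

a, b = tsd_init(torch.randn(256, 128))
print(a.shape, b.shape)  # torch.Size([8, 128]) torch.Size([256, 8])
```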
[51] TableMaster: A Recipe to Advance Table Understanding with Language Models
Lang Cao, Hanbing Liu
Main category: cs.CL
TL;DR: TableMaster enhances language models for table understanding by addressing data location, semantic deficiency, numerical accuracy, and reasoning flexibility challenges through content extraction, semantic verbalization, and adaptive reasoning.
Details
Motivation: Current language models struggle with table understanding due to the structured nature of tabular data, facing challenges in data location, semantic comprehension, numerical accuracy, and reasoning flexibility.
Method: Proposes TableMaster framework with two main components: 1) extracts relevant table content and verbalizes it with enriched semantic context, and 2) introduces adaptive reasoning that dynamically adjusts between textual and symbolic reasoning based on each query.
Result: Achieves 78.13% accuracy on WikiTQ dataset using GPT-4o-mini, surpassing existing baselines and demonstrating effectiveness in table understanding tasks.
Conclusion: TableMaster provides a practical framework for robust table understanding by addressing key challenges, representing a step toward more reliable language model capabilities for structured data.
Abstract: Tables serve as a fundamental format for representing structured relational data. While current language models (LMs) excel at many text-based tasks, they still face challenges in table understanding due to the complex characteristics of tabular data, such as their structured nature. In this paper, we aim to enhance LMs for improved table understanding. We identify four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address these issues, we propose TableMaster, a recipe and comprehensive framework that integrates multiple solutions to overcome these obstacles. TableMaster first extracts relevant table content and verbalizes it with enriched semantic context. Additionally, we introduce adaptive reasoning, a flexible approach that dynamically adjusts between textual and symbolic reasoning, tailoring the reasoning process to each query. Extensive analyses and experiments demonstrate our findings and the effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines. We hope this work will serve as a practical step toward more robust and reliable table understanding.
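The adaptive-reasoning component can be pictured as a per-query router. The sketch below is illustrative only: the routing prompt, the `llm` callable, and the toy executor are assumptions, not TableMaster's recipe.

```python
def run_program(program: str) -> str:
    # Toy stand-in: a real system would execute in an isolated sandbox.
    scope: dict = {}
    exec(program, scope)  # assumes the generated program sets `answer`
    return str(scope.get("answer", ""))

def answer_table_question(llm, table_text: str, question: str) -> str:
    # Route: textual reasoning for lookup/comparison, symbolic for arithmetic.
    route = llm(
        f"Table:\n{table_text}\nQuestion: {question}\n"
        "Reply TEXT if this needs reading or comparison, CODE if it needs "
        "arithmetic or aggregation."
    ).strip().upper()
    if route.startswith("CODE"):
        program = llm(
            f"Table:\n{table_text}\nQuestion: {question}\n"
            "Write Python that computes the answer and assigns it to `answer`."
        )
        return run_program(program)
    return llm(f"Table:\n{table_text}\nQuestion: {question}\nReason step by step.")
```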
[52] EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, Laizhong Cui
Main category: cs.CL
TL;DR: EmoBench-M is a new benchmark for evaluating emotional intelligence in multimodal LLMs across 13 real-world scenarios, revealing significant gaps between current MLLMs and human performance.
Details
Motivation: As MLLMs integrate into robotics and AI applications, they need emotional intelligence to address human emotional needs in real-world interactions. Existing benchmarks are static, text-based, or text-image focused, failing to capture the dynamic, multimodal nature of emotional expressions in real scenarios.
Method: Built EmoBench-M based on established psychological theories of emotional intelligence, evaluating MLLMs across 13 evaluation scenarios from three dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis.
Result: Evaluations of both open-source and closed-source MLLMs show significant performance gaps compared to humans, highlighting the need to advance their emotional intelligence capabilities.
Conclusion: Current MLLMs lack sufficient emotional intelligence for real-world applications, and EmoBench-M provides a comprehensive benchmark to measure and advance these capabilities across multimodal, dynamic scenarios.
Abstract: With the integration of Multimodal large language models (MLLMs) into robotic systems and various AI applications, embedding emotional intelligence (EI) capabilities into these models is essential for enabling robots to effectively address human emotional needs and interact seamlessly in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real-world interactions and fail to capture the dynamic, multimodal nature of emotional expressions, making them inadequate for evaluating MLLMs’ EI. Based on established psychological theories of EI, we build EmoBench-M, a novel benchmark designed to evaluate the EI capability of MLLMs across 13 evaluation scenarios from three key dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluations of both open-source and closed-source MLLMs on EmoBench-M reveal a significant performance gap between them and humans, highlighting the need to further advance their EI capabilities. All benchmark resources, including code and datasets, are publicly available at https://emo-gml.github.io/.
[53] Ultra-Low-Dimensional Prompt Tuning via Random Projection
Zijun Wu, Yongchang Hao, Lili Mou
Main category: cs.CL
TL;DR: ULPT reduces prompt tuning parameters by 98% using 2D optimization with frozen random up-projection while maintaining performance.
Details
Motivation: Large language models are costly to fine-tune, and existing prompt tuning methods still use many parameters tied to model dimensionality, limiting parameter efficiency.
Method: Optimize prompts in ultra-low-dimensional space (e.g., 2D) and use a frozen random matrix for up-projection to the model’s hidden dimensionality.
Result: Achieves 98% reduction in training parameters compared to vanilla prompt tuning while preserving performance, outperforms recent parameter-efficient methods across 20+ NLP tasks.
Conclusion: ULPT provides a storage-efficient framework for massive LLM customization with significantly fewer parameters than existing methods.
Abstract: Large language models achieve state-of-the-art performance but are increasingly costly to fine-tune. Prompt tuning is a parameter-efficient fine-tuning method that learns prompt embeddings, but these embeddings are typically tied to the model’s hidden dimensionality, limiting parameter savings. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), a simple yet effective method that optimizes prompts in a low-dimensional space (e.g., 2D) and uses a frozen random matrix for up-projection. ULPT achieves a 98% reduction in trainable parameters compared to vanilla prompt tuning while preserving performance. Our extensive experiments across over 20 NLP tasks demonstrate that ULPT consistently outperforms recent parameter-efficient tuning methods while using significantly fewer parameters, making it well-suited as a storage-efficient framework for massive LLM customization.
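The mechanism is concrete enough to sketch in a few lines of PyTorch: the only trainable tensor lives in the tiny space, and a frozen random matrix maps it up to the model's hidden size. Shapes and the init scale below are illustrative.

```python
import torch
import torch.nn as nn

class UltraLowDimPrompt(nn.Module):
    def __init__(self, prompt_len=20, low_dim=2, hidden_dim=4096):
        super().__init__()
        # Trainable: prompt_len * low_dim values (here 20 * 2 = 40).
        self.z = nn.Parameter(torch.randn(prompt_len, low_dim) * 0.01)
        # Frozen random up-projection to the model's hidden dimensionality.
        proj = torch.randn(low_dim, hidden_dim) / low_dim ** 0.5
        self.register_buffer("proj", proj)  # a buffer, so never updated

    def forward(self) -> torch.Tensor:
        # (prompt_len, hidden_dim) soft prompt to prepend to input embeddings.
        return self.z @ self.proj

prompt = UltraLowDimPrompt()
print(sum(p.numel() for p in prompt.parameters() if p.requires_grad))  # 40
```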
[54] Large Multimodal Models for Low-Resource Languages: A Survey
Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
Main category: cs.CL
TL;DR: Survey paper analyzing techniques for adapting large multimodal models to low-resource languages, covering 117 studies across 96 languages, with focus on visual enhancement, data creation, and cross-modal transfer strategies.
Details
Motivation: To systematically understand how researchers adapt large multimodal models for low-resource languages, addressing challenges of limited data and computational resources, and making these models more accessible to speakers of understudied languages.
Method: Comprehensive analysis of 117 studies across 96 low-resource languages, categorizing works into resource-oriented and method-oriented contributions, with further sub-categorization and comparison of performance and efficiency.
Result: Identified key patterns in adaptation approaches, found visual information serves as crucial bridge for improving performance in low-resource settings, but challenges remain in hallucination mitigation and computational efficiency.
Conclusion: Provides researchers with clear understanding of current approaches and remaining challenges in making large multimodal models accessible to speakers of low-resource languages, complemented by open-source repository.
Abstract: In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
[55] ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
Main category: cs.CL
TL;DR: Visual Instruction Rewriting transforms multimodal inputs into text-only commands for privacy-preserving on-device AI, using a compact 250M parameter model that achieves effective rewriting while protecting visual data.
Details
Motivation: Address privacy concerns in multimodal AI systems where cloud-based VLMs require transmitting sensitive visual data, and enable real-time on-device usability for AR/VR/smartphone applications.
Method: Develop Visual Instruction Rewriting approach that converts multimodal instructions to text-only commands. Create dataset of 39k+ examples across 14 domains. Train compact VLM (250M params) pretrained on image captioning and fine-tuned for instruction rewriting.
Result: Quantized model (<500MB storage) achieves effective instruction rewriting as measured by BLEU, METEOR, ROUGE metrics and semantic parsing analysis, enabling privacy-focused multimodal applications.
Conclusion: Visual Instruction Rewriting enables privacy-preserving multimodal AI by converting visual inputs to text-only commands on-device, addressing both privacy concerns and real-time usability limitations of cloud-based VLMs.
Abstract: Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
[56] Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
David Bani-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova, Matthias Keicher, Nassir Navab
Main category: cs.CL
TL;DR: A reinforcement learning method for fine-tuning LLMs to produce calibrated confidence estimates alongside factual answers, optimizing for proper calibration through logarithmic scoring rule rewards.
Details
Motivation: Safe and trustworthy use of LLMs requires accurate confidence expression in their answers, as current models often lack proper calibration between confidence estimates and actual accuracy.
Method: Novel Reinforcement Learning approach that directly fine-tunes LLMs to express calibrated confidence estimates. Uses reward based on logarithmic scoring rule that explicitly penalizes both over- and under-confidence, encouraging alignment between confidence estimates and predictive accuracy.
Result: Models trained with this approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting emergence of general confidence awareness.
Conclusion: The proposed method successfully integrates confidence calibration into the generative process of LLMs, unlike prior decoupled approaches, resulting in better calibrated models that maintain calibration across different tasks.
Abstract: A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that allows us to directly fine-tune LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness.
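The reward itself is just the logarithmic scoring rule, which is what makes honest calibration the optimal policy. A minimal sketch, with the clipping epsilon as an added assumption:

```python
import math

def log_score_reward(confidence: float, correct: bool, eps: float = 1e-6) -> float:
    # log(p) when right, log(1 - p) when wrong: over- and under-confidence
    # are both penalized, so stating the truthful p maximizes expected reward.
    p = min(max(confidence, eps), 1 - eps)  # clip to avoid log(0)
    return math.log(p) if correct else math.log(1 - p)

# If the model is right 80% of the time, stating p = 0.8 beats both
# overclaiming (p = 0.99) and hedging (p = 0.5) in expectation.
print(log_score_reward(0.8, True), log_score_reward(0.8, False))
```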
[57] Entropy-Gated Branching for Efficient Test-Time Reasoning
Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu
Main category: cs.CL
TL;DR: EGB improves LLM reasoning by branching only at high-uncertainty steps and pruning with a lightweight verifier, achieving 22.6% accuracy improvement while being 31%-75% faster than beam search.
Details
Motivation: Test-time compute methods improve LLM reasoning but waste most computation on exploring low-diversity branches where models already have high confidence. A small subset of uncertain reasoning steps disproportionately impacts final accuracy.
Method: Entropy-Gated Branching (EGB) - branches only at high-uncertainty steps and prunes expansions with a lightweight verifier to allocate computational resources dynamically during inference.
Result: On mathematical and financial reasoning benchmarks, EGB improves accuracy by 22.6% over standard inference and, on the math benchmarks, runs 31%-75% faster than test-time beam search while achieving higher performance.
Conclusion: Dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.
Abstract: Test-time compute methods can significantly improve the reasoning capabilities and problem-solving accuracy of large language models (LLMs). However, these approaches require substantially more computational resources, with most compute wasted on exploring low-diversity branches where the model already exhibits high confidence. We observe that a small subset of uncertain reasoning steps has a disproportionately large impact on final prediction accuracy, and branching at these critical junctures tends to yield more diverse and higher-quality candidate reasoning steps. We propose Entropy-Gated Branching (EGB), which branches only at high-uncertainty steps and prunes expansions with a lightweight verifier. On mathematical and financial reasoning benchmarks, EGB improves accuracy by 22.6% over standard inference while operating 31%-75% faster than test-time beam search across math benchmarks, with higher performance. Our results show that dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.
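The gate reduces to an entropy threshold on the next-token distribution. A small PyTorch sketch, with the threshold (in nats) as an illustrative hyperparameter:

```python
import torch

def should_branch(logits: torch.Tensor, threshold: float = 1.0) -> bool:
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    return entropy.item() > threshold  # branch only at uncertain steps

print(should_branch(torch.tensor([3.0, 2.9, 2.8, 0.1])))  # True: flat head
print(should_branch(torch.tensor([9.0, 1.0, 0.5, 0.2])))  # False: confident
```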
[58] PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation
Zhengwei Tao, Pu Wu, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao, Wentao Zhang
Main category: cs.CL
TL;DR: PROPHET is a new benchmark for inferable future event forecasting using news articles, addressing the problem of non-inferable questions in existing benchmarks through Causal Intervened Likelihood (CIL) filtering.
Details
Motivation: Existing event forecasting benchmarks often contain non-inferable questions that lack sufficient supporting rationales in retrieved news articles, making evaluation unreliable. There's a need for a benchmark that ensures questions are actually answerable from available information.
Method: 1) Propose Causal Intervened Likelihood (CIL) - a statistical measure using causal inference to assess question inferability. 2) Collect recent trend forecasting questions and filter them using CIL to create PROPHET benchmark. 3) Evaluate various prediction methods on this filtered, inferable dataset.
Result: 1) Validated CIL’s effectiveness in assessing inferability. 2) Created PROPHET benchmark with inferable forecasting questions. 3) Evaluated representative prediction methods on PROPHET, providing insights for future forecasting research.
Conclusion: PROPHET addresses a critical gap in event forecasting evaluation by ensuring benchmark questions are inferable from available news articles, enabling more reliable assessment of LLM-based forecasting systems through causal inference-based filtering.
Abstract: Predicting future events based on news on the Web stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG)-and-reasoning task. In these benchmarks, each prediction question is answered with relevant retrieved news articles downloaded from the Web. However, because there is no consideration of whether the questions can be supported by valid or sufficient rationales, some of the questions in these benchmarks may be inherently non-inferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for future forecasting. Through extensive experiments, we first demonstrate the validity of CIL and conduct in-depth investigations into future forecasting with its aid. Subsequently, we evaluate several representative prediction methods on PROPHET. The overall results yield valuable insights into future directions for this task.
[59] Out of Style: RAG’s Fragility to Linguistic Variation
Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, Maarten Sap
Main category: cs.CL
TL;DR: RAG systems are vulnerable to linguistic variations in user queries, with performance drops up to 40% for less formal or grammatically incorrect queries, highlighting robustness gaps for real-world deployment.
Details
Motivation: Despite strong benchmark performance, RAG systems' robustness to real-world user queries with linguistic variations remains unexplored, creating a critical gap for practical deployment where user queries exhibit diverse linguistic characteristics.
Method: Systematically analyzed impact of four linguistic dimensions (formality, readability, politeness, grammatical correctness) on RAG performance. Evaluated 2 retrieval models and 9 LLMs (3-72B parameters) across 4 QA datasets.
Result: Linguistic reformulations significantly impact both retrieval and generation stages: up to 40.41% drop in Recall@5 for less formal queries, and 38.86% drop in answer match scores for grammatically incorrect queries. RAG systems are more sensitive than LLM-only generations.
Conclusion: RAG systems are fragile to linguistic variations in user queries, with error propagation across components. Findings highlight the need for improved robustness techniques to ensure reliable performance in diverse real-world user interactions.
Abstract: Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions. Code is available at https://github.com/Springcty/RAG-fragility-to-linguistic-variation.
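The evaluation protocol can be sketched as a paired comparison: run retrieval on the original query and on a stylistic rewrite, then measure the relative Recall@5 drop. The helpers below (`retriever`, `rewrite`) are assumed callables, not the paper's code.

```python
def recall_at_5(retriever, query, gold_doc_ids):
    top5 = retriever(query, k=5)               # ranked document ids
    return int(any(d in gold_doc_ids for d in top5))

def robustness_gap(retriever, rewrite, dataset, style="informal"):
    # dataset: iterable of (query, gold_doc_ids) pairs.
    base = sum(recall_at_5(retriever, q, g) for q, g in dataset)
    varied = sum(recall_at_5(retriever, rewrite(q, style), g) for q, g in dataset)
    return (base - varied) / max(base, 1)      # relative drop for this style
```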
[60] Propaganda AI: An Analysis of Semantic Divergence in Large Language Models
Nay Myat Min, Long H. Pham, Yige Li, Jun Sun
Main category: cs.CL
TL;DR: LLMs can show concept-conditioned semantic divergence where common cues (ideologies, public figures) trigger uniform stance-like responses that evade token-based audits. RAVEN detects this by combining semantic entropy and cross-model disagreement.
Details
Motivation: Current safety evaluations have a blind spot: LLMs can exhibit concept-conditioned semantic divergence where high-level cues elicit unusually uniform, stance-like responses that evade token-trigger audits. This carries major societal stakes as such cues can steer content exposure at scale.
Method: RAVEN (Response Anomaly Vigilance) is a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling semantic entropy over paraphrastic samples with cross-model disagreement. The method was tested by implanting concept-conditioned stances via LoRA fine-tuning and auditing five LLM families across twelve sensitive topics.
Result: In controlled fine-tuning, concept-conditioned stances were successfully implanted without rare token triggers. Auditing 5 LLM families across 12 topics (360 prompts per model) with clustering via bidirectional entailment revealed recurrent, model-specific divergences in 9/12 topics.
Conclusion: Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence. RAVEN effectively detects concept-conditioned semantic divergence that current safety evaluations miss.
Abstract: Large language models (LLMs) can exhibit concept-conditioned semantic divergence: common high-level cues (e.g., ideologies, public figures) elicit unusually uniform, stance-like responses that evade token-trigger audits. This behavior falls in a blind spot of current safety evaluations, yet carries major societal stakes, as such concept cues can steer content exposure at scale. We formalize this phenomenon and present RAVEN (Response Anomaly Vigilance), a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling semantic entropy over paraphrastic samples with cross-model disagreement. In a controlled LoRA fine-tuning study, we implant a concept-conditioned stance using a small biased corpus, demonstrating feasibility without rare token triggers. Auditing five LLM families across twelve sensitive topics (360 prompts per model) and clustering via bidirectional entailment, RAVEN surfaces recurrent, model-specific divergences in 9/12 topics. Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence.
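RAVEN's two signals compose naturally in code: semantic entropy over paraphrastic samples (low means the model is very certain) and disagreement with peer models. The sketch below assumes an `entails` oracle, e.g. backed by an NLI model for bidirectional entailment; the thresholds are illustrative.

```python
import math

def semantic_entropy(responses, entails):
    clusters = []  # greedy clustering by mutual (bidirectional) entailment
    for r in responses:
        for c in clusters:
            if entails(r, c[0]) and entails(c[0], r):
                c.append(r)
                break
        else:
            clusters.append([r])
    n = len(responses)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)

def raven_flag(model_responses, peer_answers, entails, h_max=0.3):
    certain = semantic_entropy(model_responses, entails) < h_max
    atypical = all(not entails(model_responses[0], p) for p in peer_answers)
    return certain and atypical  # highly certain *and* out of line with peers
```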
[61] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
Main category: cs.CL
TL;DR: Sparse attention can extend Transformer LLM context lengths efficiently, with larger sparse models outperforming smaller dense ones at equivalent cost, and longer sequences tolerating higher sparsity levels.
Details
Motivation: There's a lack of comprehensive evaluation of sparse attention methods' efficiency-accuracy trade-offs, despite their promise for extending long-context capabilities in Transformer LLMs.
Method: Conducted largest-scale empirical analysis of training-free sparse attention, evaluating six methods across multiple model families/sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 on nine diverse tasks. Organized sparse attention methods into taxonomy along four design axes.
Result: 1) Sparse attention is effective - larger sparse models outperform smaller dense ones at equivalent cost; 2) Computational constraints make token-to-page estimation unfeasible during prefilling; 3) Longer sequences tolerate higher sparsity, suggesting fixed-budget methods are suboptimal.
Conclusion: Sparse attention improves the Pareto frontier for long-context LLMs, with practical deployment guidance and methodological recommendations for future evaluations provided.
Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the largest-scale empirical analysis to date of training-free sparse attention, evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 (i.e., $1/20$ attention budget) on nine diverse tasks. We first organise the rapidly evolving landscape of sparse attention methods into a taxonomy along four design axes. Our analysis then yields actionable insights: 1) sparse attention is effective – larger sparse models outperform smaller dense ones at equivalent cost, improving the Pareto frontier; 2) due to computational constraints, token-to-page importance estimation is unfeasible during prefilling, where the choice of an alternative solution (global-to-token or block-to-block) depends on the task, but is possible during decoding, enabling better generalisation and tolerance to higher sparsity; 3) longer sequences tolerate higher sparsity, suggesting that fixed-budget methods in production are suboptimal. Together, these findings provide practical guidance for deploying sparse attention and methodological recommendations for future evaluations. Our code is available at https://github.com/PiotrNawrot/sparse-frontier.
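For intuition on what a 0.95 sparsity level means mechanically, here is a generic top-k sparse-attention sketch (not one of the six evaluated methods): each query attends only to its k best-scoring keys, so k divided by sequence length plays the role of the attention budget.

```python
import torch

def topk_sparse_attention(q, k, v, budget=0.05):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (..., Lq, Lk)
    keep = max(1, int(budget * k.shape[-2]))                # keys per query
    thresh = scores.topk(keep, dim=-1).values[..., -1:]     # k-th best score
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, keys, vals = (torch.randn(1, 8, 128, 64) for _ in range(3))
print(topk_sparse_attention(q, keys, vals).shape)  # each query sees ~6 keys
```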
[62] What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs
Xinlan Yan, Di Wu, Yibin Lei, Christof Monz, Iacer Calixto
Main category: cs.CL
TL;DR: S-MedQA is a new English medical QA dataset with 24k+ examples across 15 specialties, used to study how clinical specialties affect LLM performance in medical QA.
Details
Motivation: To create a benchmark dataset for evaluating LLMs in fine-grained clinical specialties and investigate how specialty-specific training affects medical QA performance.
Method: Constructed S-MedQA dataset with 24k+ QA pairs across 15 medical specialties using machine and expert verification. Used this dataset to fine-tune LLMs and analyze performance across specialties, examining token probabilities of clinically relevant terms.
Result: Training on specialty-specific data doesn’t guarantee best performance on that specialty. Token probabilities of clinical terms increase across all specialties regardless of fine-tuning specialty. Gains appear to come from domain shifting (general→medical) rather than specialty-specific knowledge injection.
Conclusion: The role of fine-tuning data in medical domain needs rethinking - improvements may come from general domain adaptation rather than specialty-specific knowledge. S-MedQA and code are released to advance clinical NLP research.
Abstract: In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset designed for benchmarking large language models (LLMs) in fine-grained clinical specialties. S-MedQA consists of over 24k examples, covering 15 medical specialties, with QA pairs that can have multiple specialty annotations, such as when a question is cross-disciplinary. The dataset is constructed using both machine and expert verification to maximize data availability and reliability. We use S-MedQA to investigate the role of clinical specialties in the knowledge-intensive scenario of medical QA. Our results show that training on data from a clinical specialty does not necessarily lead to the best performance on that specialty. Additionally, regardless of the specialty the LLM was fine-tuned on, token probabilities of clinically relevant terms consistently increase across all specialties. Based on these findings, we hypothesize that improvement gains, at least in our settings, are derived primarily from domain shifting (e.g., general to medical) rather than from injecting specialty-specific knowledge. This suggests a need to rethink the role of fine-tuning data in the medical domain. To encourage further advancements in the clinical NLP field, we release S-MedQA along with all the code required to reproduce our experiments for the research community.
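The token-probability analysis can be reproduced with a standard causal-LM probe: sum the log-probabilities of a clinical term's tokens given their prefix, before and after fine-tuning. The model and example below are illustrative, and the sketch assumes the context tokenization is a prefix of the full sequence.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def term_logprob(model, tok, context: str, term: str) -> float:
    ids = tok(context + " " + term, return_tensors="pt").input_ids
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    # Sum log-probs of the term's tokens, each conditioned on its prefix.
    return sum(
        logprobs[0, i - 1, ids[0, i]].item() for i in range(ctx_len, ids.shape[1])
    )

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(term_logprob(model, tok, "The patient presents with acute", "renal failure"))
```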
[63] SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback
Yaoning Yu, Ye Yu, Peiyan Zhang, Kai Wei, Haojing Luo, Haohan Wang
Main category: cs.CL
TL;DR: SIPDO is a closed-loop prompt optimization framework that integrates synthetic data generation to iteratively improve prompts by identifying weaknesses and refining them without external supervision.
Details
Motivation: Most existing prompt optimization methods work with fixed datasets, assuming static input distributions and offering limited support for iterative improvement. There's a need for a more dynamic approach that can systematically improve prompts over time.
Method: SIPDO couples a synthetic data generator with a prompt optimizer in a feedback-driven loop. The generator produces new examples that reveal current prompt weaknesses, and the optimizer incrementally refines the prompt in response to these weaknesses.
Result: Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, demonstrating the value of integrating data synthesis into prompt learning workflows.
Conclusion: The SIPDO framework enables systematic improvement of prompt performance without assuming access to external supervision or new tasks, highlighting the importance of dynamic, closed-loop optimization for prompt learning.
Abstract: Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
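The closed loop is straightforward to sketch: synthesize hard examples, observe failures, refine the prompt, repeat. `llm` and all prompt wordings are stand-ins; in a real system, step 2 would execute the prompt on the synthetic examples rather than ask the model to guess its failures.

```python
def sipdo_loop(llm, prompt, seed_examples, rounds=3):
    for _ in range(rounds):
        # 1) Synthesize examples that target the current prompt's weaknesses.
        synth = llm(
            f"Prompt under test:\n{prompt}\nSeed examples:\n{seed_examples}\n"
            "Generate 3 new input/answer pairs this prompt likely fails on."
        )
        # 2) Collect failure descriptions on the synthetic data.
        failures = llm(f"Prompt:\n{prompt}\nExamples:\n{synth}\nList its failures.")
        # 3) Refine the prompt in response to the observed weaknesses.
        prompt = llm(f"Prompt:\n{prompt}\nWeaknesses:\n{failures}\nRewrite the prompt.")
    return prompt
```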
[64] Text2Grad: Reinforcement Learning from Natural Language Feedback
Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Main category: cs.CL
TL;DR: Text2Grad is a new RL paradigm that converts free-form textual feedback into span-level gradients for precise model updates, outperforming traditional scalar-reward RL and prompt-only methods across multiple tasks.
Details
Motivation: Traditional RLHF uses coarse scalar rewards that obscure fine-grained reasons for success/failure, leading to slow and opaque learning. Existing text-based approaches improve interpretability but don't update model parameters.
Method: Three components: (1) feedback-annotation pipeline pairing critiques with token spans, (2) fine-grained reward model predicting span-level rewards while generating explanatory critiques, (3) span-level policy optimizer that back-propagates natural-language gradients.
Result: Consistently surpasses scalar-reward RL and prompt-only baselines across summarization, code generation, and question answering, achieving higher task metrics and richer interpretability.
Conclusion: Natural-language feedback can serve as actionable training signals for fine-grained alignment, not just explanations. The approach enables precise, feedback-conditioned adjustments instead of global nudges.
Abstract: Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model’s policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level reward on answers while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results suggest that natural-language feedback can serve not only as explanations, but also as actionable training signals for fine-grained alignment. The code for our method is available at https://github.com/microsoft/Text2Grad.
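The core credit-assignment step can be sketched as a span-weighted policy-gradient loss: once critique phrases are aligned to token spans, each span's reward weights the log-probabilities of exactly those tokens. Alignment is assumed given here; shapes are illustrative.

```python
import torch

def span_weighted_pg_loss(logprobs: torch.Tensor, spans):
    """logprobs: (seq_len,) log-probs of the sampled tokens.
    spans: list of (start, end, reward) from critique-to-span alignment."""
    weights = torch.zeros_like(logprobs)
    for start, end, reward in spans:
        weights[start:end] = reward        # positive praise, negative blame
    return -(weights * logprobs).sum()     # REINFORCE-style objective

fake_logprobs = torch.randn(10).abs().neg()  # stand-in per-token log-probs
print(span_weighted_pg_loss(fake_logprobs, [(0, 4, 1.0), (6, 9, -0.5)]))
```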
[65] Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale
Cody Kommers, Drew Hemment, Maria Antoniak, Joel Z. Leibo, Hoyt Long, Emily Robinson, Adam Sobey
Main category: cs.CL
TL;DR: LLMs can make cultural context and human meaning legible at scale by automating thick descriptions, addressing limitations of traditional AI’s thin descriptions.
Details
Motivation: Traditional AI systems fail to represent human meaning because they rely on "thin descriptions" - numerical representations that strip away cultural context. This paper argues for using LLMs to bridge this gap by automating "thick descriptions" from humanities/social sciences.
Method: Proposes using LLMs’ verbal capabilities to generate and process thick descriptions (verbal representations that retain contextual information and accommodate heterogeneity) at scale, moving beyond numerical standardization.
Result: Identifies a new direction for generative AI application: developing representational formats based on thick description rather than just better metrics. Outlines five key challenges for implementation.
Conclusion: LLMs offer unprecedented potential to make human meaning legible in AI systems by automating thick descriptions, representing a crucial shift from thin to thick representation in sociotechnical systems.
Abstract: This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions (numerical representations that enforce standardization and therefore strip human activity of the cultural context which gives it meaning). By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description (verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning). The verbal capabilities of LLMs now provide a means of at least partially automating the generation and processing of thick descriptions, offering new ways to deploy them at scale. We argue that the problem of rendering human meaning legible is not just about selecting better metrics but about developing new representational formats based on thick description. We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static.
[66] Unsupervised Elicitation of Language Models
Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike
Main category: cs.CL
TL;DR: ICM is an unsupervised method that fine-tunes language models using their own generated labels, achieving performance comparable to or better than human supervision on various tasks, especially when models have superhuman capabilities.
Details
Motivation: As language models become more capable, human supervision becomes insufficient or impossible for tasks where models have superhuman abilities. Current post-training methods rely on human-specified behaviors, which creates a bottleneck for eliciting the full capabilities of advanced models.
Method: Internal Coherence Maximization (ICM) - an unsupervised algorithm that fine-tunes pretrained language models using their own generated labels without any external supervision. The method leverages the model’s internal consistency and coherence.
Result: ICM matches the performance of training on golden labels and outperforms crowdsourced human supervision on GSM8k-verification, TruthfulQA, and Alpaca reward modeling. On superhuman-capability tasks, ICM significantly outperforms human-label training. When applied to train a Claude 4 Sonnet-based assistant via reinforcement learning with an unsupervised reward model, it matches production-grade human-label training on average, with better chat and safety scores but lower math and coding scores.
Conclusion: ICM provides an effective unsupervised alternative to human supervision for fine-tuning language models, particularly valuable for eliciting superhuman capabilities where human labels are insufficient. The method enables training frontier models without relying on external human supervision.
Abstract: To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden labels and outperforms training on crowdsourced human supervision. On tasks where LMs’ capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 4 Sonnet-based assistant. The resulting assistant matches its counterpart trained on production-grade human labels on average, with higher scores on chat and safety yet lower scores on math and coding.
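One way to picture coherence-driven labeling: score how mutually predictable a label assignment is under the model, then locally search for the assignment that maximizes that score. A sketch, assuming a hypothetical `model_logp(x, y, context)` scorer; the paper's actual objective and search procedure differ in detail.

```python
import random

def coherence(model_logp, examples, labels):
    """Sum of each label's log-probability given all *other* labeled examples."""
    total = 0.0
    for i, (x, y) in enumerate(zip(examples, labels)):
        context = [(examples[j], labels[j]) for j in range(len(examples)) if j != i]
        total += model_logp(x, y, context)  # hypothetical scoring call
    return total

def icm_search(model_logp, examples, labels, steps=200):
    """Local search over binary label assignments that increases coherence."""
    best = coherence(model_logp, examples, labels)
    for _ in range(steps):
        i = random.randrange(len(labels))
        cand = list(labels)
        cand[i] = not cand[i]  # flip one label
        score = coherence(model_logp, examples, cand)
        if score > best:
            labels, best = cand, score
    return labels
```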
[67] Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Matthias Keicher, Nassir Navab
Main category: cs.CL
TL;DR: LA-CDM is a hypothesis-driven uncertainty-aware language agent for clinical decision-making that uses hybrid training (supervised + reinforcement learning) to improve diagnostic performance and efficiency by modeling the iterative investigation process.
Details
Motivation: Current LLM applications in clinical decision support either assume unrealistic immediate availability of all patient information (ignoring the iterative investigation process) or rely only on limited "out-of-the-box" capabilities without task-specific training. There's a need for models that better simulate real clinical decision-making dynamics.
Method: Proposed LA-CDM: a hypothesis-driven uncertainty-aware language agent that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Uses a hybrid training paradigm combining supervised and reinforcement learning with three objectives: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making.
Result: Evaluated on MIMIC-CDM dataset covering four abdominal diseases with various clinical tests. Shows benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.
Conclusion: LA-CDM effectively models the dynamic, interactive, cyclic nature of clinical decision-making, addressing limitations of current LLM approaches by combining hypothesis-driven reasoning with uncertainty awareness and task-specific training.
Abstract: Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process; however, most applications of LLMs in clinical decision support suffer from one of two limitations: Either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited “out-of-the-box” capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases and containing various clinical tests, and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.
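The request-interpret-diagnose cycle can be summarized as below, with `agent` and `patient_env` as hypothetical interfaces; the fixed uncertainty threshold stands in for the learned uncertainty-estimation objective.

```python
def la_cdm_episode(agent, patient_env, max_steps=10, threshold=0.2):
    """Hypothesis-driven diagnostic loop (a sketch of the agent's behavior)."""
    observations = [patient_env.initial_presentation()]
    for _ in range(max_steps):
        hypothesis, uncertainty = agent.hypothesize(observations)
        if uncertainty < threshold:
            return hypothesis                 # confident enough to diagnose
        test = agent.choose_test(hypothesis, observations)
        observations.append(patient_env.run_test(test))  # interpret new evidence
    return agent.hypothesize(observations)[0]  # forced final diagnosis
```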
[68] High-Layer Attention Pruning with Rescaling
Songtao Liu, Peng Liu
Main category: cs.CL
TL;DR: A novel structured pruning method for LLMs that strategically prunes attention heads in higher layers with adaptive rescaling to maintain representation quality.
Details
Motivation: Conventional training-free structured pruning methods use heuristic metrics that indiscriminately remove attention heads across all layers without considering their positions, potentially harming model performance.
Method: Proposes a pruning algorithm that strategically prunes attention heads in higher layers and introduces adaptive rescaling parameters to calibrate representation scale post-pruning to counteract magnitude changes.
Result: Comprehensive experiments on LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B across 27 datasets show the method consistently outperforms existing structured pruning methods, particularly in generation tasks.
Conclusion: Strategic pruning of attention heads in higher layers with adaptive rescaling is an effective approach for compressing LLMs while maintaining performance, especially for generation tasks.
Abstract: Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model’s higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines. Code is available at https://github.com/SongtaoLiu0823/HARP.
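In spirit, heads are dropped only in the upper layers and the surviving head outputs are rescaled so the representation magnitude stays calibrated. A sketch at the level of one attention block; the head-importance proxy and `alpha` here are placeholders for the paper's metric and adaptive rescaling parameter.

```python
import torch

def pruned_attention_output(head_outputs: torch.Tensor,
                            keep_mask: torch.Tensor,
                            alpha: float) -> torch.Tensor:
    """head_outputs: (batch, seq, n_heads, d_head); keep_mask: (n_heads,) bool.
    Drop pruned heads, then rescale the concatenation by alpha to counteract
    the magnitude change caused by the removed heads."""
    kept = head_outputs[..., keep_mask, :]
    batch, seq = kept.shape[0], kept.shape[1]
    return alpha * kept.reshape(batch, seq, -1)

# Toy usage: prune the 2 least important of 8 heads in a "high" layer.
x = torch.randn(2, 16, 8, 64)
importance = x.abs().mean(dim=(0, 1, 3))              # crude per-head proxy
keep = importance >= importance.topk(6).values.min()  # keep top-6 heads
out = pruned_attention_output(x, keep, alpha=8 / 6)
```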
[69] UQLM: A Python Package for Uncertainty Quantification in Large Language Models
Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad
Main category: cs.CL
TL;DR: UQLM is a Python package for detecting LLM hallucinations using uncertainty quantification techniques, providing confidence scores for LLM outputs.
Details
Motivation: Hallucinations in LLMs generate false/misleading content, posing safety and trust challenges for downstream applications that need reliable outputs.
Method: UQLM uses state-of-the-art uncertainty quantification (UQ) techniques to create a suite of UQ-based scorers that compute response-level confidence scores (0-1).
Result: The toolkit provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance LLM output reliability.
Conclusion: UQLM addresses the hallucination problem in LLMs through uncertainty quantification, offering a practical tool to improve safety and trust in LLM applications.
Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
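Scorers of this kind often reduce to sampling agreement. A generic example in that spirit (not UQLM's actual API); `generate` is a hypothetical sampling call.

```python
from collections import Counter

def consistency_confidence(generate, prompt, n=8):
    """Sample n responses and return the modal answer's relative frequency
    as a response-level confidence score in [0, 1]."""
    answers = [generate(prompt) for _ in range(n)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / n
```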
[70] Why is Your Language Model a Poor Implicit Reward Model?
Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora
Main category: cs.CL
TL;DR: IM-RMs (implicit reward models) generalize worse than EX-RMs (explicit reward models) due to over-reliance on superficial token-level cues, not because IM-RMs can act as both verifiers and generators.
Details
Motivation: To understand why implicit reward models (IM-RMs) generalize worse than explicit reward models (EX-RMs) despite being nearly identical in architecture and training, and to investigate the fundamental implicit biases underlying different reward model types.
Method: Theoretical analysis and experimental investigation comparing IM-RMs and EX-RMs, examining their generalization behavior under token-level distribution shifts and in-distribution. The study tests alternative hypotheses about the generalization gap, particularly challenging the claim that IM-RMs struggle because they can function as both verifiers and generators.
Result: IM-RMs rely more heavily on superficial token-level cues compared to EX-RMs, leading to worse generalization under token-level distribution shifts and even in-distribution. Evidence contradicts alternative explanations, showing the gap is not due to IM-RMs’ dual verifier-generator capability.
Conclusion: Seemingly minor design choices in reward model implementation (implicit vs. explicit) substantially impact generalization behavior, with IM-RMs’ over-reliance on token-level cues explaining their inferior generalization compared to EX-RMs.
Abstract: Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Overall, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
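The contrast between the two reward types is easy to state in code: an EX-RM applies a dedicated linear head to the final hidden state, while an IM-RM (as in DPO) scores the response's log-likelihood ratio, which ties the reward directly to token-level probabilities. A minimal sketch:

```python
import torch

def ex_rm_reward(final_hidden: torch.Tensor, head_w: torch.Tensor) -> torch.Tensor:
    """Explicit RM: dedicated linear head over the last hidden state."""
    return final_hidden @ head_w

def im_rm_reward(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                 beta: float = 1.0) -> torch.Tensor:
    """Implicit RM: reward is the scaled log-likelihood ratio of the full
    response, so it depends directly on token-level probabilities; this is
    the mechanism the paper ties to reliance on surface token cues."""
    return beta * (logp_policy - logp_ref)
```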
[71] DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router
Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng
Main category: cs.CL
TL;DR: DeepSieve is an agentic RAG framework that uses LLM-as-a-knowledge-router to decompose complex queries, route sub-questions to appropriate sources, and filter irrelevant information through multi-stage distillation, improving reasoning depth and retrieval precision over conventional RAG.
Details
Motivation: LLMs struggle with knowledge-intensive queries due to inability to access up-to-date or domain-specific information. Existing RAG methods lack fine-grained control over both query and source sides, resulting in noisy retrieval and shallow reasoning.
Method: DeepSieve uses LLM-as-a-knowledge-router to decompose complex queries into structured sub-questions, recursively routes each to the most suitable knowledge source, and filters irrelevant information through a multi-stage distillation process. The framework emphasizes modularity, transparency, and adaptability.
Result: Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.
Conclusion: DeepSieve provides an effective agentic RAG framework that addresses limitations of existing methods through fine-grained query decomposition and source routing, enabling better knowledge-intensive reasoning with improved transparency and adaptability.
Abstract: Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches. Our code is available at https://github.com/MinghoKwok/DeepSieve.
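Schematically, the route-then-sieve loop looks like the sketch below; `llm.decompose`, `llm.route`, `llm.sieve`, and `source.retrieve` are hypothetical interfaces standing in for the paper's components.

```python
def deepsieve(llm, query, sources):
    """Sketch of the decompose-route-sieve-answer loop."""
    sub_questions = llm.decompose(query)          # structured sub-questions
    evidence = []
    for sq in sub_questions:
        source = llm.route(sq, sources)           # LLM-as-a-knowledge-router
        retrieved = source.retrieve(sq)
        evidence.extend(llm.sieve(sq, retrieved)) # multi-stage filtering
    return llm.answer(query, evidence)
```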
[72] A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models
Mingruo Yuan, Shuyi Zhang, Ben Kao
Main category: cs.CL
TL;DR: CRUX is a new framework for LLM confidence estimation that incorporates context faithfulness and consistency through two novel metrics: contextual entropy reduction and unified consistency examination.
Details
Motivation: Current confidence estimation methods for LLMs fail to consider the relevance between responses and contextual information, which is crucial for evaluating output quality, especially when background knowledge is provided.
Method: CRUX integrates context faithfulness and consistency via two metrics: (1) contextual entropy reduction (measures data uncertainty through information gain via contrastive sampling with/without context), and (2) unified consistency examination (captures model uncertainty through global consistency of answers with/without context).
Result: Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) show CRUX achieves the highest AUROC compared to existing baselines.
Conclusion: CRUX effectively bridges the gap in context-aware confidence estimation for LLMs by integrating both context faithfulness and consistency, demonstrating superior performance across diverse datasets.
Abstract: Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers the user to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty with the information gain through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX’s effectiveness, achieving the highest AUROC among existing baselines.
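A rough reading of the first metric, under the assumption that answers are sampled and counted directly (a real implementation would cluster semantically equivalent answers first); `sample_answer` is a hypothetical one-sample call.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Empirical entropy of a list of sampled answers."""
    counts = Counter(answers)
    n = sum(counts.values())
    return -sum(c / n * math.log(c / n) for c in counts.values())

def contextual_entropy_reduction(sample_answer, question, context, n=10):
    """Information gain from context: H(answers | no context) minus
    H(answers | context). A large positive value means the context
    sharply concentrates the answer distribution."""
    without_ctx = [sample_answer(question, None) for _ in range(n)]
    with_ctx = [sample_answer(question, context) for _ in range(n)]
    return answer_entropy(without_ctx) - answer_entropy(with_ctx)
```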
[73] Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
Main category: cs.CL
TL;DR: The paper proposes Web-CogKnowledge Framework that decomposes web agent capabilities into knowledge content learning (Memorizing/Understanding) and cognitive processes (Exploring), introduces Web-CogDataset for knowledge acquisition, and develops Web-CogReasoner agent with knowledge-driven CoT reasoning that outperforms existing models.
Details
Motivation: Web agents need sufficient knowledge to effectively engage in cognitive reasoning. Current multimodal large-scale models enable perception and interaction, but agents must first acquire structured knowledge to reason like humans.
Method: Proposes Web-CogKnowledge Framework categorizing knowledge as Factual, Conceptual, and Procedural. Creates Web-CogDataset from 14 real-world websites for knowledge acquisition. Develops Web-CogReasoner agent with knowledge-driven Chain-of-Thought reasoning framework.
Result: Web-CogReasoner shows significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. Introduces Web-CogBench for comprehensive evaluation across knowledge domains.
Conclusion: The proposed framework successfully bridges knowledge acquisition with cognitive reasoning for web agents, demonstrating that structured knowledge is crucial for effective web interaction and generalization to unseen tasks.
Abstract: Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent’s capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent’s processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the “what” of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the “how” of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for a web agent. This dataset serves as the agent’s conceptual grounding (the “nouns” upon which comprehension is built), as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner
[74] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng
Main category: cs.CL
TL;DR: Four creativity evaluation metrics (perplexity, LLM-as-a-Judge, Creativity Index, syntactic templates) show inconsistent performance across different creative domains, with limited agreement between metrics and poor alignment with human judgments.
Details
Motivation: To systematically evaluate and compare existing creativity measurement approaches across diverse creative domains, identifying their limitations and inconsistencies when compared to human-aligned creative assessments.
Method: Compiled datasets with human-aligned creative and uncreative examples across three domains (creative writing, unconventional problem-solving, research ideation). Evaluated four metrics’ ability to discriminate between creative and uncreative sets, analyzing their consistency across domains and agreement with each other.
Result: Found limited consistency both across domains and between metrics: metrics that work in one domain fail in others, and different metrics often disagree on the same data. Identified specific limitations: perplexity measures fluency not novelty; LLM-as-a-Judge shows prompt sensitivity and label bias; Creativity Index measures lexical diversity with implementation sensitivity; syntactic templates fail with formulaic language.
Conclusion: Current creativity evaluation metrics lack robustness and generalizability, showing poor alignment with human judgments. The findings highlight the need for more reliable, domain-agnostic evaluation frameworks that better capture human perceptions of creativity.
Abstract: We examine, analyze, and compare four representative creativity measures–perplexity, LLM-as-a-Judge, the Creativity Index (CI; measuring n-gram overlap with web corpora), and syntactic templates (detecting repetition of common part-of-speech patterns)–across diverse creative domains, such as creative writing, unconventional problem-solving, and research ideation. For each domain, we compile datasets with human-aligned creative and uncreative examples and evaluate each metric’s ability to discriminate between the two sets. Our analyses reveal limited consistency both across domains and metrics, as metrics that distinguish creativity in one domain fail in others (e.g., CI correctly distinguishes in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI suggests one set to be more creative, while perplexity indicates the other set to be more creative). We highlight key limitations, such as perplexity reflecting fluency rather than novelty; LLM-as-a-Judge producing inconsistent judgments under minor prompt variations and exhibiting bias towards particular labels; CI primarily measuring lexical diversity, with high sensitivity to implementation choices; and syntactic templates being ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
[75] CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Ming Liu, Yang Xiang
Main category: cs.CL
TL;DR: CCFQA is a new benchmark for evaluating multimodal LLMs’ factuality across languages and modalities, focusing on speech-text parallel factual questions in 8 languages, revealing current MLLMs’ limitations and proposing effective few-shot transfer learning.
Details
Motivation: Existing MLLM evaluation benchmarks focus mainly on English and textual/visual modalities, creating a gap for multilingual speech understanding and factuality assessment.
Method: Created CCFQA benchmark with parallel speech-text factual questions across 8 languages; proposed few-shot transfer learning strategy to transfer English QA capabilities to multilingual spoken QA tasks.
Result: Current MLLMs face substantial challenges on CCFQA; the proposed few-shot approach achieves competitive performance with GPT-4o-mini-Audio using just 5-shot training.
Conclusion: CCFQA fills an important evaluation gap and promotes development of more robust multilingual speech understanding in MLLMs; the benchmark and code are publicly released.
Abstract: As Large Language Models (LLMs) are increasingly adopted across the multilingual world, ensuring hallucination-free factuality becomes ever more crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs’ cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
[76] Complex Logical Instruction Generation
Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
Main category: cs.CL
TL;DR: The paper introduces LogicIFGen and LogicIFEval - a framework for generating verifiable logic-rich instructions from code functions and a benchmark to evaluate LLMs’ ability to follow complex logic instructions.
Details
Motivation: As LLMs advance, they need to handle increasingly complex logic structures in instructions, but current evaluation of LLMs' ability to follow logic-rich instructions is insufficient. The authors aim to systematically assess how well LLMs can follow instructions with intricate logic like conditions, loops, and function calls.
Method: LogicIFGen is a scalable, automated framework that generates verifiable instructions from code functions, which naturally express rich logic structures. The authors curate complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark with 426 verifiable logic-rich instructions for evaluation.
Result: Experiments show current state-of-the-art LLMs struggle significantly with LogicIFEval, with most models following fewer than 60% of the instructions correctly, revealing substantial deficiencies in instruction-following ability for logic-rich tasks.
Conclusion: The paper highlights a critical gap in LLMs’ instruction-following capabilities for complex logic structures and provides a benchmark for future research. The poor performance on LogicIFEval suggests current LLMs have significant limitations in handling intricate logical instructions despite their advanced capabilities.
Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditions, loops, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
[77] TextMineX: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action
Chenyue Zhou, Gürkan Solmaz, Flavio Cirillo, Kiril Gashteovski, Jonathan Fürst
Main category: cs.CL
TL;DR: TextMineX: First dataset, evaluation framework, and ontology-guided LLM pipeline for extracting structured knowledge from unstructured Humanitarian Mine Action reports.
Details
Motivation: Humanitarian Mine Action (HMA) agencies produce valuable operational knowledge in unstructured reports, limiting information transfer between agencies. Current knowledge is buried in unstructured formats, hindering effective knowledge sharing and reuse.
Method: Proposed TextMineX: ontology-guided LLM pipeline that extracts (subject, relation, object)-triples from HMA reports. Uses dataset from Cambodian Mine Action Centre (CMAC) and introduces bias-aware evaluation framework combining human-annotated triples with LLM-as-Judge protocol to mitigate position bias.
Result: Ontology-aligned prompts improve extraction accuracy by up to 44.2%, reduce hallucinations by 22.5%, and enhance format adherence by 20.9% compared to baseline models. Dataset and code are publicly released.
Conclusion: TextMineX successfully structures HMA knowledge, enabling better information transfer between agencies. The ontology-guided approach significantly improves extraction performance while reducing hallucinations, making HMA knowledge more accessible and usable.
Abstract: Humanitarian Mine Action (HMA) addresses the challenge of detecting and removing landmines from conflict regions. Much of the life-saving operational knowledge produced by HMA agencies is buried in unstructured reports, limiting the transferability of information between agencies. To address this issue, we propose TextMineX: the first dataset, evaluation framework and ontology-guided large language model (LLM) pipeline for knowledge extraction from text in the HMA domain. TextMineX structures HMA reports into (subject, relation, object)-triples, thus creating domain-specific knowledge. To ensure real-world relevance, we utilized the dataset from our collaborator Cambodian Mine Action Centre (CMAC). We further introduce a bias-aware evaluation framework that combines human-annotated triples with an LLM-as-Judge protocol to mitigate position bias in reference-free scoring. Our experiments show that ontology-aligned prompts improve extraction accuracy by up to 44.2%, reduce hallucinations by 22.5%, and enhance format adherence by 20.9% compared to baseline models. We publicly release the dataset and code.
[78] QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
Main category: cs.CL
TL;DR: QWHA integrates Fourier-related transform adapters into quantized LLMs using Walsh-Hadamard Transform with novel initialization, reducing quantization error and computational cost while improving low-bit accuracy.
Details
Motivation: Need to combine quantization (for efficient inference) with parameter-efficient fine-tuning (for low training overhead) while overcoming limitations of existing methods: low-rank adapters have limited capacity, and FT-based adapters have high computational cost and ineffective error reduction in quantized models.
Method: Proposes QWHA method that integrates FT-based adapters using Walsh-Hadamard Transform as the transform kernel, with novel adapter initialization scheme featuring adaptive parameter selection and value refinement to effectively mitigate quantization errors.
Result: QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters while effectively reducing quantization errors.
Conclusion: QWHA provides an effective solution for quantization-aware PEFT that balances accuracy and efficiency, demonstrating superior performance and computational efficiency compared to existing methods.
Abstract: The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
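For intuition, a Walsh-Hadamard-domain adapter can parameterize the weight update as H diag(s) H / n with only k trainable spectral coefficients. The sketch below uses random coefficient selection as a placeholder; QWHA's adaptive parameter selection, value-refinement initialization, and quantization integration follow the paper.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dim (power-of-two length)."""
    n = x.shape[-1]
    out, h = x.clone(), 1
    while h < n:
        out = out.view(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape)
        h *= 2
    return out

class WHTAdapter(torch.nn.Module):
    """Adapter delta(x) = H diag(s) H x / n with a sparse trainable s."""
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.dim = dim  # must be a power of two for this fwht
        self.register_buffer("idx", torch.randperm(dim)[:k])  # placeholder selection
        self.coef = torch.nn.Parameter(torch.zeros(k))        # trainable coefficients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.zeros(self.dim, dtype=x.dtype, device=x.device)
        s = s.scatter(0, self.idx, self.coef.to(x.dtype))
        return fwht(fwht(x) * s) / self.dim
```

An adapted layer would then return `quantized_linear(x) + adapter(x)`, so only the k coefficients are trained while the quantized weights stay frozen.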
[79] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field
Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao
Main category: cs.CL
TL;DR: AECBench is a comprehensive benchmark to evaluate LLMs in Architecture, Engineering, and Construction, revealing performance declines across cognitive levels despite proficiency in basic tasks.
Details
Motivation: While LLMs show promise in AEC field adoption, their robustness and reliability in this safety-critical domain remain unproven, necessitating systematic evaluation.
Method: Created AECBench with 5-level cognition framework (Knowledge Memorization, Understanding, Reasoning, Calculation, Application), 23 tasks from authentic AEC practice, 4,800-question dataset, and LLM-as-a-Judge evaluation approach.
Result: Nine LLMs showed clear performance decline across cognitive levels: proficient in basic Knowledge/Understanding tasks but deficient in table interpretation, complex reasoning/calculation, and domain-specific document generation.
Conclusion: The benchmark establishes groundwork for future research toward robust LLM integration in safety-critical engineering, highlighting current limitations in specialized AEC applications.
Abstract: Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark features a five-level, cognition-oriented evaluation framework (i.e., Knowledge Memorization, Understanding, Reasoning, Calculation, and Application). Based on the framework, 23 representative evaluation tasks were defined. These tasks were derived from authentic AEC practice, with scope ranging from building-code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an “LLM-as-a-Judge” approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.
[80] Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
Jinyeop Song, Song Wang, Julian Shun, Yada Zhu
Main category: cs.CL
TL;DR: KG-R1 is a reinforcement learning-based KG-RAG framework that uses a single agent to interact with knowledge graphs, reducing inference costs while improving accuracy and enabling transferability to new KGs.
Details
Motivation: Current KG-RAG systems use multiple LLM modules (planning, reasoning, responding), which increases inference costs and binds behavior to specific knowledge graphs, limiting flexibility and efficiency.
Method: Introduces KG-R1, an agentic KG-RAG framework using reinforcement learning. A single agent interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into reasoning and generation through end-to-end RL optimization.
Result: In KGQA benchmarks, KG-R1 with Qwen-2.5-3B achieves higher answer accuracy with fewer generation tokens than prior multi-module methods using larger models. It also demonstrates transferability - maintaining strong accuracy on new KGs without modification after training.
Conclusion: KG-R1 offers an efficient and transferable KG-RAG framework suitable for real-world deployment, addressing cost and flexibility limitations of previous approaches while maintaining strong performance.
Abstract: Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g., planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework trained through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug-and-play use: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.
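The single-agent setup can be pictured as a plain rollout loop, sketched below with `agent` and `kg_env` as hypothetical interfaces; in the paper the whole loop is optimized end-to-end with RL.

```python
def kg_r1_rollout(agent, kg_env, question, max_turns=8):
    """Single-agent loop over the KG environment (a sketch)."""
    state = question
    for _ in range(max_turns):
        kind, payload = agent.act(state)   # a retrieval action or a final answer
        if kind == "answer":
            return payload
        facts = kg_env.retrieve(payload)   # e.g. relations or neighbors of an entity
        state = state + "\n" + facts       # fold retrieved evidence into the context
    return agent.act(state + "\nAnswer now:")[1]
```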
[81] A-IPO: Adaptive Intent-driven Preference Optimization
Wenqing Wang, Muhammad Asif Ali, Ali Shoker, Ruohan Yang, Junyang Chen, Ying Sha, Huan Wang
Main category: cs.CL
TL;DR: A-IPO introduces adaptive intent-driven preference optimization that infers latent user intentions from prompts to better align with diverse preferences, outperforming DPO on multiple benchmarks.
Details
Motivation: Existing alignment methods like DPO default to majority views and overlook minority opinions, failing to capture latent user intentions in prompts and lacking robustness to adversarial preferences.
Method: A-IPO introduces an intention module that infers latent intent behind user prompts and explicitly incorporates this inferred intent into the reward function, creating an intention-response similarity term that increases preference margin.
Result: A-IPO achieves substantial improvements: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
Conclusion: A-IPO facilitates pluralistic preference optimization while enhancing adversarial robustness, consistently surpassing existing baselines by explicitly modeling diverse user intents.
Abstract: Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods like Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture latent user intentions in prompts. To address these limitations, we introduce Adaptive Intent-driven Preference Optimization (A-IPO). Specifically, A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the preferred model’s responses and the user’s underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention–response similarity term increases the preference margin (by a positive shift of $\lambda\,\Delta\mathrm{sim}$ in the log-odds), resulting in clearer separation between preferred and dispreferred responses compared to DPO. For evaluation, we introduce two new benchmarks, Real-pref and Attack-pref, along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. Through explicit modeling of diverse user intents, A-IPO facilitates pluralistic preference optimization while simultaneously enhancing adversarial robustness in preference alignment. Comprehensive empirical evaluation demonstrates that A-IPO consistently surpasses existing baselines, yielding substantial improvements across key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
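The claimed margin shift reads naturally as an extra term in a DPO-style logit. A minimal sketch under that reading; the variable names, `beta`, and `lam` ($\lambda$) are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def a_ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
               sim_w, sim_l, beta=0.1, lam=0.5):
    """DPO-style preference loss with an intention-response similarity term.
    The logits gain a positive shift of lam * (sim_w - sim_l), matching the
    log-odds margin increase described in the abstract. Inputs are 1-D
    tensors of per-example response log-probs and intent similarities."""
    dpo_logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    logits = dpo_logits + lam * (sim_w - sim_l)
    return -F.logsigmoid(logits).mean()
```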
[82] DND: Boosting Large Language Models with Dynamic Nested Depth
Tieyuan Chen, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Weiyao Lin, Jianguo Li
Main category: cs.CL
TL;DR: DND improves LLM performance by selectively reprocessing critical tokens in a nested depth manner, boosting performance with minimal parameter/compute increase.
Details
Motivation: To enhance off-the-shelf LLM performance without extensive retraining by focusing computational resources on difficult tokens that need "review" while avoiding redundant computation for easier tokens.
Method: Dynamic Nested Depth (DND) identifies critical tokens at the end of a transformer layer using a router with loss control for better distinguishability, then feeds them back for extra processing. Includes threshold control for selection stability; integrated into pre-trained models during post-training.
Result: Boosts dense Qwen3-1.7B by 1.88% and MoE Qwen3-30B-A3B by 0.87% on diverse benchmarks with minimal parameter and computing increase.
Conclusion: DND provides an effective post-training method to enhance LLM performance through dynamic token selection and nested reprocessing, offering significant gains with minimal overhead.
Abstract: We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of the given transformer layer, DND identifies more critical tokens with a router and feeds them back for an extra round of processing, effectively “reviewing” difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal parameter and computing increase.
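A minimal sketch of the nested pass, assuming generic `layer` and `router` modules; the paper's router-controlling loss and threshold scheme shape training, while this shows only the forward behavior.

```python
import torch

def dnd_layer_pass(x, layer, router, tau=0.5):
    """One layer with Dynamic Nested Depth (a sketch): after the normal
    pass, a router flags critical tokens, which take one extra nested
    pass through the same layer. For clarity this recomputes the full
    layer and keeps updates only for the selected tokens; a real
    implementation would gather just those tokens."""
    h = layer(x)                                   # standard forward pass
    scores = torch.sigmoid(router(h)).squeeze(-1)  # (batch, seq) criticality
    mask = (scores > tau).unsqueeze(-1)            # threshold-controlled selection
    return torch.where(mask, layer(h), h)          # "review" difficult tokens
```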
[83] Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
Jana Jung, Marlene Lutz, Indira Sen, Markus Strohmaier
Main category: cs.CL
TL;DR: Human psychometric tests show moderate reliability but poor ecological validity when applied to LLMs, with test scores often not correlating or even negatively correlating with actual model behavior in downstream tasks.
Details
Motivation: To systematically evaluate whether human psychometric tests yield meaningful results when applied to large language models, given their increasing use for assessing psychological constructs in LLMs.
Method: Evaluated reliability and validity of human psychometric tests on 17 LLMs for three constructs (sexism, racism, morality) using multiple item and prompt variations. Validity assessed through convergent (theory-based inter-test correlations) and ecological approaches (alignment between test scores and behavior in real-world downstream tasks).
Result: Found moderate reliability across variations, but crucially discovered that psychometric test scores do not align with model behavior in downstream tasks, sometimes even showing negative correlations, indicating low ecological validity.
Conclusion: Systematic evaluations of psychometric tests on LLMs are essential before interpreting their scores, and human-designed tests cannot be directly applied to LLMs without adaptation.
Abstract: Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests – originally developed for humans – yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests on 17 LLMs for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluations of psychometric tests on LLMs are essential before interpreting their scores. Our findings also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.
[84] LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation
Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
Main category: cs.CL
TL;DR: RAG utility is LLM-specific, not universal; evidence optimized for one model doesn’t transfer well to others, highlighting need for generator-tailored retrieval.
Details
Motivation: Current RAG systems optimize for topical relevance, but success depends on whether retrieved passages are actually useful for specific LLMs to generate correct answers. Utility varies across LLMs due to differences in knowledge, reasoning, and evidence utilization capabilities.
Method: Formalized LLM-specific utility as performance improvement when passages are provided vs. answering without evidence. Constructed benchmark of LLM-specific gold utilitarian passages for four LLMs (Qwen3-8B/14B/32B and Llama3.1-8B) on three QA datasets. Introduced LLM-specific utility judgment task to evaluate existing utility-aware selection methods.
Result: Utilitarian passages are model-dependent and non-transferable: each LLM performs best with its own utilitarian evidence, while evidence optimized for other LLMs is consistently suboptimal. Human-annotated evidence remains a strong baseline but doesn’t fully match individual LLM utility needs. Existing utility-aware methods capture model-agnostic usefulness but struggle with LLM-specific utility estimation.
Conclusion: Current utility-aware retrieval has limitations; findings motivate generator-tailored evidence selection for improving RAG systems, as one-size-fits-all approaches don’t account for LLM-specific utility requirements.
Abstract: Retrieval-augmented generation (RAG) is typically optimized for topical relevance, yet its success ultimately depends on whether retrieved passages are useful for a large language model (LLM) to generate correct and complete answers. We argue that such utility is often LLM-specific rather than universal, due to differences in models’ knowledge, reasoning, and ability to leverage evidence. We formalize LLM-specific utility as the performance improvement of a target LLM when a passage is provided, compared to answering without evidence. To systematically study LLM-specific utility, we construct a benchmark of LLM-specific gold utilitarian passages for four LLMs (Qwen3-8B/14B/32B and Llama3.1-8B) on three QA datasets (Natural Questions, TriviaQA, and MS MARCO-FQA). Our analysis shows that utilitarian passages are model-dependent and non-transferable: each LLM performs best with its own utilitarian evidence, while evidence optimized for other LLMs is consistently suboptimal. Human-annotated evidence remains a strong general baseline but does not fully match individual LLM utility needs. We further introduce the LLM-specific utility judgment task and find that existing utility-aware selection and scoring methods largely capture model-agnostic usefulness and struggle to reliably estimate LLM-specific utility. Overall, our findings highlight the limitations of current utility-aware retrieval and motivate generator-tailored evidence selection for improving RAG.
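The paper's formalization translates directly into a few lines; `llm_answer` and `metric` are hypothetical callables for the target LLM and the task score (e.g., EM or F1).

```python
def llm_specific_utility(llm_answer, metric, question, gold, passage):
    """Utility of a passage for one target LLM, as formalized above:
    task score with the passage minus score without any evidence."""
    score_with = metric(llm_answer(question, context=passage), gold)
    score_without = metric(llm_answer(question, context=None), gold)
    return score_with - score_without
```

Because the baseline `score_without` differs across models, the same passage can carry high utility for one LLM and none for another, which is exactly the non-transferability the benchmark measures.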
[85] Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL
Md Mahadi Hasan Nahid, Davood Rafiei, Weiwei Zhang, Yong Zhang
Main category: cs.CL
TL;DR: A context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem, combining table-first and column-first strategies with question decomposition techniques to improve Text-to-SQL accuracy.
Details
Motivation: Schema linking is a critical but underexplored component of Text-to-SQL systems. Current methods focus on SQL generation while neglecting relevant schema element retrieval, leading to hallucinations and execution failures.
Method: Proposes a bidirectional schema retrieval framework with two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. Augmented with question decomposition, keyword extraction, and keyphrase extraction techniques.
Result: Significantly improves schema recall while reducing false positives on BIRD and Spider benchmarks. SQL generation using retrieved schema outperforms full-schema baselines and approaches oracle performance without query refinement. Reduces performance gap between full and perfect schema settings by 50%.
Conclusion: Schema linking is a powerful lever for enhancing Text-to-SQL accuracy and efficiency, and treating it as a standalone problem yields substantial improvements over current approaches.
Abstract: Schema linking – the process of aligning natural language questions with database schema elements – is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.
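A minimal sketch of the bidirectional pass described above; the retrieval helpers and schema objects here are hypothetical stand-ins, not the paper's implementation.

```python
# Hedged sketch: union of a table-first and a column-first retrieval pass.
def bidirectional_schema_link(question, schema, retrieve_tables, retrieve_columns, k=5):
    # Table-first: retrieve candidate tables, then relevant columns within them.
    tables_tf = set(retrieve_tables(question, schema.tables, k))
    cols_tf = {c for t in tables_tf for c in retrieve_columns(question, t.columns, k)}
    # Column-first: retrieve candidate columns, then keep their parent tables.
    cols_cf = set(retrieve_columns(question, schema.all_columns, k))
    tables_cf = {c.table for c in cols_cf}
    # The union of both passes forms the linked sub-schema for SQL generation.
    return tables_tf | tables_cf, cols_tf | cols_cf
```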
[86] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens
Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, Jundong Li
Main category: cs.CL
TL;DR: SemCoT is a novel implicit Chain-of-Thought framework that improves reasoning efficiency while preserving semantic alignment with ground-truth reasoning through contrastive training and knowledge distillation.
Details
Motivation: Traditional explicit CoT reasoning is too verbose for efficiency-critical applications, while existing implicit CoT methods suffer from poor semantic alignment with ground-truth reasoning and inefficient token generation.
Method: Proposes SemCoT with two key components: 1) A contrastively trained sentence transformer to evaluate semantic alignment between implicit and explicit reasoning, and 2) An efficient implicit reasoning generator, built by finetuning a lightweight language model via knowledge distillation and guided by the sentence transformer.
Result: Extensive experiments show SemCoT achieves superior performance compared to state-of-the-art methods in both efficiency and effectiveness.
Conclusion: SemCoT successfully addresses the limitations of existing implicit CoT methods by jointly optimizing token-level generation speed and preserving semantic alignment, making it the first approach to enhance CoT efficiency through this dual optimization.
Abstract: The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within an LLM’s hidden embeddings (termed “implicit reasoning”) rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.
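The alignment component can be pictured as embedding-similarity scoring between a decoded implicit trace and the explicit ground-truth chain. A sketch using an off-the-shelf sentence transformer as a stand-in for the paper's contrastively trained one (the model name is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def alignment_score(implicit_text: str, explicit_text: str) -> float:
    """Cosine similarity between the two reasoning traces' embeddings;
    higher means the implicit trace preserves more explicit-CoT semantics."""
    emb = encoder.encode([implicit_text, explicit_text], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```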
[87] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction
Chunyang Jiang, Paola Merlo
Main category: cs.CL
TL;DR: Lightweight models trained on carefully designed input (analogical structure, contrastive learning, minimal context) achieve high F1 on English verb alternation tasks with only hundreds to one-thousand examples, outperforming LLMs in this controlled setting.
Details
Motivation: To understand how specific input design principles support sample-efficient linguistic rule learning, and to compare these cognitively-inspired approaches with LLM performance on the same controlled tasks.
Method: Implemented three principles (analogical structure, contrastive learning, minimal contextual cue) in structured sentence completion tasks testing English verb alternations. Trained lightweight models on hundreds to one-thousand examples, conducted ablation studies, and evaluated zero- and few-shot LLMs on same tasks.
Result: Lightweight models achieved high F1 on verb alternation tasks with minimal data. Analogical organization was the main driver of sample efficiency, with contrastive distractors and minimal context providing additional gains. Lightweight models outperformed LLMs in this controlled setting with far fewer task-specific data.
Conclusion: Careful input organization supports sample-efficient learning of linguistic rules, revealing distinct learning signatures between trained lightweight models and prompted LLMs. The contrast highlights different learning regimes rather than a general verdict on LLMs.
Abstract: Large language models achieve strong performance on many tasks, but their training makes it hard to see which properties of the input support efficient linguistic rule learning. We ask how three cognitively-inspired principles of input design support sample-efficient linguistic rule induction: analogical structure, contrastive learning, and minimal contextual cue. We also ask how their effects compare to those of LLMs on the same controlled tasks. We implement these principles in structured sentence completion tasks that test English verb alternations. Lightweight models trained on hundreds to one-thousand such examples learn the alternation rules with high F1 on these tasks. Ablation studies show that analogical organisation is the main driver of sample efficiency, and contrastive distractors and minimal context help further gains. We also evaluate zero- and few-shot LLMs on the same tasks. In this controlled setting, the lightweight models reach higher F1 with far fewer task-specific data. We treat this contrast as a comparison between learning regimes rather than a general verdict on LLMs. Our results show that careful input organisation supports sample-efficient learning of linguistic rules and reveals distinct learning signatures for trained lightweight models and prompted LLMs.
[88] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, Bo Zheng
Main category: cs.CL
TL;DR: FusedKV reduces KV cache memory by 50% through learnable fusion of bottom and middle layer KV caches for top layers, outperforming standard Transformers while maintaining performance.
Details
Motivation: Transformer decoders suffer from prohibitive memory requirements for KV cache at long sequences. Existing cross-layer KV cache sharing methods underperform compared to within-layer methods like GQA, creating a need for more effective memory-efficient solutions.
Method: Analyzed information flow of keys/values in top layers, finding values come from bottom layers while keys draw from both bottom and middle layers. Proposed FusedKV where top-layer KV caches are learnable fusion of most informative bottom and middle layer caches, operating on post-RoPE keys to preserve positional info. Also created FusedKV-Lite as cross-layer sharing variant using bottom-layer values and middle-layer keys.
Result: Achieved 50% reduction in cache memory while obtaining lower validation perplexity than standard Transformer decoders across LLMs from 332M to 4B parameters. FusedKV-Lite reduces I/O overhead with slight perplexity increase compared to FusedKV.
Conclusion: FusedKV provides a memory-efficient, high-performance architectural alternative to standard Transformer decoders by intelligently fusing KV cache information across layers based on observed information flow patterns.
Abstract: Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top layers. Our preliminary analysis reveals a clear pattern: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
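A hedged sketch of the fusion step: a top layer's K/V cache is a learnable interpolation of the bottom-layer and middle-layer caches, so it never has to be computed or stored separately. The sigmoid-gating form and per-dimension weights are assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class KVFusion(nn.Module):
    """Sketch: fuse bottom- and middle-layer KV caches into a top-layer cache."""
    def __init__(self, head_dim: int):
        super().__init__()
        self.gate_k = nn.Parameter(torch.zeros(head_dim))  # learnable mixing weights
        self.gate_v = nn.Parameter(torch.zeros(head_dim))

    def forward(self, k_bottom, k_middle, v_bottom, v_middle):
        gk, gv = torch.sigmoid(self.gate_k), torch.sigmoid(self.gate_v)
        # Keys are fused post-RoPE, so no rotary re-application is needed here.
        k_top = gk * k_bottom + (1 - gk) * k_middle
        v_top = gv * v_bottom + (1 - gv) * v_middle
        return k_top, v_top
```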
[89] Coupled Variational Reinforcement Learning for Language Model General Reasoning
Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang
Main category: cs.CL
TL;DR: CoVRL is a verifier-free RL method that couples variational inference with RL through hybrid sampling to improve reasoning exploration and thought-answer coherence in language models.
Details
Motivation: Existing verifier-free RL methods sample reasoning traces conditioned only on questions, which decouples reasoning from answer information, leading to inefficient exploration and incoherence between reasoning traces and final answers.
Method: CoVRL bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy, constructing and optimizing a composite distribution that integrates these two distributions.
Result: Extensive experiments on mathematical and general reasoning benchmarks show CoVRL improves performance by 12.4% over base model and achieves additional 2.3% improvement over state-of-the-art verifier-free RL baselines.
Conclusion: CoVRL provides a principled framework for enhancing general reasoning capabilities of language models by enabling efficient exploration while preserving strong thought-answer coherence.
Abstract: While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
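The hybrid sampling idea can be sketched as a simple mixture of a question-only prior sampler and an answer-conditioned posterior sampler; the mixing ratio and the two sampler callables are illustrative assumptions, not the paper's exact scheme.

```python
import random

def hybrid_sample(question, answer, sample_prior, sample_posterior, mix=0.5):
    """Draw one reasoning trace from the composite distribution: with
    probability `mix`, condition only on the question (exploration);
    otherwise also condition on the reference answer (coherence)."""
    if random.random() < mix:
        return sample_prior(question)
    return sample_posterior(question, answer)
```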
[90] Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework
Ewelina Gajewska, Katarzyna Budzynska, Jarosław A Chudziak
Main category: cs.CL
TL;DR: Multi-agent system with Moderator and Community Agents improves hate speech detection by incorporating socio-cultural context, outperforming existing methods on fairness and accuracy.
Details
Motivation: Current hate speech detection methods lack socio-cultural context and identity-awareness, leading to poor performance on implicitly hateful speech that requires understanding of specific demographic perspectives.
Method: Contextualised detection framework using multi-agent system: central Moderator Agent coordinates dynamically constructed Community Agents representing specific demographic groups, integrating socio-cultural context from public knowledge sources.
Result: Outperforms state-of-the-art prompting methods (zero-shot, few-shot, chain-of-thought) and alternative approaches on ToxiGen dataset, with significant improvements in both classification accuracy and fairness across all target groups.
Conclusion: Community-driven consultative framework with explicit socio-cultural context integration enables more accurate and fair hate speech detection, particularly for implicitly hateful content that requires identity-aware moderation.
Abstract: This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
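Balanced accuracy, the paper's central fairness-oriented metric, is the mean of the true-positive and true-negative rates; a minimal computation with scikit-learn on toy labels:

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [1, 1, 0, 0, 0, 1]  # toy labels: 1 = implicitly hateful
y_pred = [1, 0, 0, 0, 1, 1]
# Mean of TPR (2/3) and TNR (2/3) here, so roughly 0.667.
print(balanced_accuracy_score(y_true, y_pred))
```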
[91] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts
Prottay Kumar Adhikary, Reena Rawat, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: coTherapist is a framework using small language models to emulate therapeutic competencies through fine-tuning and agentic reasoning, showing improved clinical responses and high empathy compared to baselines.
Details
Motivation: Addressing mental healthcare workforce shortages and rising demand by developing intelligent systems to support mental healthcare experts.
Method: Unified framework utilizing small language model with domain-specific fine-tuning, retrieval augmentation, and agentic reasoning to emulate therapeutic competencies.
Result: Generates more relevant and clinically grounded responses than baselines, exhibits high empathy and therapist-consistent personality traits via T-BARS rubric, and receives positive human evaluation from domain experts.
Conclusion: Small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.
Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.
[92] Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou
Main category: cs.CL
TL;DR: JurisMMA is a novel framework for Legal Judgment Prediction that decomposes trial tasks into standardized stages and uses multimodal data, validated on a new large Chinese judicial dataset JurisMM.
Details
Motivation: Traditional LJP methods struggle with complex cases involving multiple allegations, diverse evidence, and lack adaptability. There's a need for more comprehensive frameworks that can handle the complexity of real legal cases and leverage multimodal data.
Method: JurisMMA framework decomposes trial tasks, standardizes processes, and organizes them into distinct stages. The authors also built JurisMM, a large dataset with over 100,000 recent Chinese judicial records containing both text and multimodal video-text data.
Result: Experiments on both the new JurisMM dataset and the LawBench benchmark validate the framework’s effectiveness on LJP tasks.
Conclusion: The JurisMMA framework is effective not only for Legal Judgment Prediction but also has broader applications in legal systems, offering new perspectives for future legal methods and datasets development.
Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations and diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.
[93] Truth with a Twist: The Rhetoric of Persuasion in Professional vs. Community-Authored Fact-Checks
Olesya Razuvayevskaya, Kalina Bontcheva
Main category: cs.CL
TL;DR: Large-scale analysis shows community-written debunks (Community Notes) don’t use more persuasion techniques than professional fact-checks, despite different rhetorical styles, and crowd raters effectively penalize problematic persuasion methods.
Details
Motivation: To empirically compare persuasion techniques in crowd-sourced vs. professional fact-checking debunks, testing the hypothesis that community-produced content relies more on subjective or persuasive wording than professional efforts.
Method: Analyzed extensive datasets from Community Notes (CNs), EUvsDisinfo, and Database of Known Fakes (DBKF) to quantify prevalence and types of persuasion techniques across different fact-checking ecosystems using comparative analysis.
Result: Found no evidence that CNs contain more persuasion techniques than professional fact-checks; identified systematic rhetorical differences reflecting institutional norms; crowd raters slightly favor persuasive elements overall but effectively penalize problematic rhetorical methods.
Conclusion: Community-written debunks are not more persuasive than professional ones, though they differ rhetorically; crowd-sourced evaluation systems can effectively regulate problematic persuasion techniques in fact-checking content.
Abstract: This study presents the first large-scale comparison of persuasion techniques present in crowd- versus professionally-written debunks. Using extensive datasets from Community Notes (CNs), EUvsDisinfo, and the Database of Known Fakes (DBKF), we quantify the prevalence and types of persuasion techniques across these fact-checking ecosystems. Contrary to the prior hypothesis that community-produced debunks rely more heavily on subjective or persuasive wording, we find no evidence that CNs contain a higher average number of persuasion techniques than professional fact-checks. We additionally identify systematic rhetorical differences between CNs and professional debunking efforts, reflecting differences in institutional norms and topical coverage. Finally, we examine how the crowd evaluates persuasive language in CNs and show that, although notes with more persuasive elements receive slightly higher overall helpfulness ratings, crowd raters are effective at penalising the use of particular problematic rhetorical means.
[94] APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski
Main category: cs.CL
TL;DR: APEX-Agents benchmark assesses AI agents on long-horizon, cross-application tasks from finance/consulting/law domains, with Gemini 3 Flash (Thinking=High) achieving top score of 24.0%.
Details
Motivation: There's a need to evaluate whether AI agents can handle complex, realistic work tasks that span multiple applications and require long-term planning, similar to what professionals in investment banking, management consulting, and corporate law face daily.
Method: Created APEX-Agents benchmark with 480 tasks requiring navigation of realistic work environments with files and tools. Tested eight agents using Pass@1 metric. Also developed Archipelago infrastructure for agent execution and evaluation.
Result: Gemini 3 Flash (Thinking=High) achieved highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). All prompts, rubrics, gold outputs, files, and metadata are open-sourced.
Conclusion: APEX-Agents provides a comprehensive benchmark for evaluating AI agents on professional work tasks, with current models showing room for improvement (top score 24.0%). The open-source release enables broader research and development in agent capabilities.
Abstract: We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
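Pass@1 here is simply the fraction of tasks an agent completes correctly on a single attempt; a minimal sketch, with `results` as a hypothetical list of per-task pass/fail outcomes from one run each:

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks solved on the first (and only) attempt."""
    return sum(results) / len(results)

# e.g. pass_at_1([True, False, False, True]) == 0.5
```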
[95] Can We Trust LLM Detectors?
Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
Main category: cs.CL
TL;DR: Current AI text detectors fail outside controlled benchmarks. Supervised detectors degrade out-of-domain, training-free methods are sensitive to proxy choice. Proposed supervised contrastive learning framework improves style discrimination but fundamental challenges remain for domain-agnostic detection.
Details
Motivation: The rapid adoption of LLMs has created a critical need for reliable AI text detection, but existing detectors often fail in real-world scenarios outside controlled benchmarks, highlighting the need for more robust solutions.
Method: Systematically evaluated two dominant paradigms (training-free and supervised detectors). Proposed a supervised contrastive learning (SCL) framework that learns discriminative style embeddings to address limitations of existing approaches.
Result: Supervised detectors excel in-domain but degrade sharply out-of-domain. Training-free methods remain highly sensitive to proxy choice. The SCL framework shows improved performance but overall results expose fundamental challenges in building domain-agnostic detectors.
Conclusion: Both existing detection paradigms are brittle under distribution shift, unseen generators, and stylistic perturbations. While the proposed SCL framework offers improvements, fundamental challenges remain in creating truly domain-agnostic AI text detectors.
Abstract: The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
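A sketch of a supervised contrastive objective over style embeddings, in the spirit of the standard SupCon loss: same-class embeddings (human vs. AI) are pulled together and all others pushed apart. Temperature and normalization are common defaults, not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, d) float tensor; labels: (N,) int tensor."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                      # pairwise similarities
    n = z.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool)       # mask out self-pairs
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    # Log-softmax over all non-self pairs, averaged over positives per anchor.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
    mean_log_prob_pos = (pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```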
[96] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung
Main category: cs.CL
TL;DR: RebuttalAgent: First AI framework using Theory of Mind for academic rebuttals, outperforming base models by 18.3% and beating advanced proprietary models.
Details
Motivation: Academic rebuttal is a complex strategic communication challenge under severe information asymmetry, not just technical debate. Current AI approaches fail because they only imitate surface-level linguistics without perspective-taking needed for effective persuasion.
Method: Introduces RebuttalAgent with ToM-Strategy-Response pipeline modeling reviewer mental state, formulating persuasion strategy, and generating strategy-grounded responses. Trained on RebuttalBench dataset via critique-and-refine synthesis. Two-stage training: supervised fine-tuning for ToM analysis and strategic planning, then reinforcement learning with self-reward mechanism. Also developed Rebuttal-RM evaluator trained on 100K+ multi-source rebuttal data.
Result: RebuttalAgent outperforms base model by average 18.3% on automated metrics and beats advanced proprietary models in both automated and human evaluations. Rebuttal-RM achieves scoring consistency with human preferences surpassing GPT-4.1.
Conclusion: First successful grounding of academic rebuttal in Theory of Mind, demonstrating significant performance improvements. Generated content is for reference only to inspire authors, not to replace their own critical analysis.
Abstract: Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing that of the powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author’s own critical analysis and response.
[97] Can professional translators identify machine-generated text?
Michael Farrell
Main category: cs.CL
TL;DR: Professional translators can identify AI-generated Italian short stories with some success (16.2% accuracy), but nearly equal numbers misclassify them, often preferring AI texts. Low burstiness and narrative contradictions are reliable AI indicators, while grammatical accuracy and emotional tone are misleading.
Details
Motivation: To determine if professional translators can reliably identify AI-generated Italian short stories without specialized training, and to understand what linguistic features help or hinder their detection.
Method: In-person experiment with 69 translators assessing three anonymized short stories (two by ChatGPT-4o, one by human author). Participants rated likelihood of AI authorship and provided justifications for each story.
Result: While average results were inconclusive, 16.2% successfully distinguished AI from human texts (statistically significant). Nearly equal number misclassified in opposite direction. Low burstiness and narrative contradictions were most reliable AI indicators; grammatical accuracy and emotional tone often led to misclassification.
Conclusion: Professional translators can identify AI-generated texts with some analytical skill, but subjective impressions often lead to misclassification. Findings question the role of synthetic-text editing in professional contexts, as translators may actually prefer AI-generated content.
Abstract: This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.
[98] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
Obed Junias, Maria Leonor Pacheco
Main category: cs.CL
TL;DR: LOGICAL-COMMONSENSEQA is a new benchmark that reframes commonsense reasoning as logical composition over statement pairs using AND, OR, NEITHER/NOR operators, exposing models’ limitations in compositional reasoning.
Details
Motivation: Current commonsense reasoning benchmarks rely on single-label evaluation, which obscures whether statements are jointly plausible, mutually exclusive, or jointly implausible. This fails to capture the nuanced nature of commonsense reasoning that often involves evaluating multiple plausible interpretations.
Method: The authors introduce LOGICAL-COMMONSENSEQA, a benchmark that frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). They evaluate various models including instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting.
Result: Models perform reasonably on conjunctive (AND) reasoning and moderately on disjunctive (OR) reasoning, but performance degrades sharply on negation-based (NEITHER/NOR) questions. The benchmark exposes fundamental reasoning limitations in current models.
Conclusion: LOGICAL-COMMONSENSEQA provides a controlled framework for advancing compositional commonsense reasoning and reveals critical gaps in models’ ability to handle logical composition, particularly negation operations.
Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
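The operator semantics can be made explicit in a few lines; the per-statement plausibility judgments would come from a model or annotator:

```python
def compose(op: str, a_plausible: bool, b_plausible: bool) -> bool:
    """Pair-level label from per-statement plausibility, per the benchmark's operators."""
    if op == "AND":          # both statements jointly plausible
        return a_plausible and b_plausible
    if op == "OR":           # at least one statement plausible
        return a_plausible or b_plausible
    if op == "NEITHER/NOR":  # both statements implausible
        return not a_plausible and not b_plausible
    raise ValueError(f"unknown operator: {op}")
```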
cs.CV
[99] Dynamic Mask-Based Backdoor Attack Against Vision AI Models: A Case Study on Mushroom Detection
Zeineb Dridi, Jihen Bennaceur, Amine Ben Hassouna
Main category: cs.CV
TL;DR: A novel dynamic mask-based backdoor attack method for object detection models using dataset poisoning and SAM segmentation for stealthy trigger placement.
Details
Motivation: Deep learning models are increasingly vulnerable to adversarial attacks like backdoor attacks, especially in critical real-life domains. The paper aims to demonstrate practical risks of backdoor attacks on object detection models and highlight risks associated with outsourcing practices.
Method: Uses dataset poisoning to embed malicious triggers, leverages SAM (Segment Anything Model) for dynamic mask creation to place triggers stealthily, and focuses on mushroom detection as a practical scenario. The approach targets YOLOv7 object detection model.
Result: The attack maintains high accuracy on clean data while achieving high attack success rates on poisoned samples. It surpasses traditional static pattern-based backdoor injection methods in stealth and effectiveness.
Conclusion: The work demonstrates significant risks of evolving adversarial threats and underscores the urgent need for robust countermeasures to protect deep learning models from sophisticated backdoor attacks.
Abstract: Deep learning has revolutionized numerous tasks within the computer vision field, including image classification, image segmentation, and object detection. However, the increasing deployment of deep learning models has exposed them to various adversarial attacks, including backdoor attacks. This paper presents a novel dynamic mask-based backdoor attack method, specifically designed for object detection models. We exploit a dataset poisoning technique to embed a malicious trigger, rendering any models trained on this compromised dataset vulnerable to our backdoor attack. We particularly focus on a mushroom detection dataset to demonstrate the practical risks posed by such attacks on critical real-life domains. Our work also emphasizes the importance of creating a detailed backdoor attack scenario to illustrate the significant risks associated with the outsourcing practice. Our approach leverages SAM, a recent and powerful image segmentation AI model, to create masks for dynamic trigger placement, introducing a new and stealthy attack method. Through extensive experimentation, we show that our sophisticated attack scenario maintains high accuracy on clean data with the YOLOv7 object detection model while achieving high attack success rates on poisoned samples. Our approach surpasses traditional methods for backdoor injection, which are based on static and consistent patterns. Our findings underscore the urgent need for robust countermeasures to protect deep learning models from these evolving adversarial threats.
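The dynamic placement step can be sketched as mask-restricted blending: a segmentation mask (e.g., from SAM) confines the trigger to the object region, so the patch follows the object's shape rather than a fixed location. Array shapes and the blending factor are illustrative assumptions.

```python
import numpy as np

def poison_image(image: np.ndarray, mask: np.ndarray, trigger: np.ndarray, alpha=0.9):
    """Blend `trigger` into `image` only where `mask` is true (object pixels).
    `mask` and `trigger` are assumed to match the image's spatial shape."""
    poisoned = image.astype(np.float32).copy()
    m = mask.astype(bool)
    poisoned[m] = (1 - alpha) * poisoned[m] + alpha * trigger.astype(np.float32)[m]
    return poisoned.astype(image.dtype)
```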
[100] Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding
Yuhui Zhang, Hui Yu, Wei Liang, Sunjie Zhang
Main category: cs.CV
TL;DR: Proposes an automatic method using blink embedding and hash grid landmarks encoding to enhance talking face fidelity in Dynamic Neural Radiance Fields (NeRF).
Details
Motivation: Existing Dynamic NeRF methods for talking portraits struggle with accurately and efficiently capturing mouth movements, despite advances in rendering speed and generation quality.
Method: Uses blink embedding and hash grid landmarks encoding; integrates facial features as conditional features and audio features as residual terms via Dynamic Landmark Transformer; employs neural radiance fields for full face modeling.
Result: Experimental evaluations show superiority over existing methods in generating lifelike talking face representations.
Conclusion: The proposed approach effectively enhances the fidelity of talking faces in Dynamic NeRF by better capturing mouth movements through innovative landmark encoding and feature integration techniques.
Abstract: Dynamic Neural Radiance Fields (NeRF) have demonstrated considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advancements in the rendering speed and generation quality, challenges persist in accurately and efficiently capturing mouth movements in talking portraits. To tackle this challenge, we propose an automatic method based on blink embedding and hash grid landmarks encoding in this study, which can substantially enhance the fidelity of talking faces. Specifically, we leverage facial features encoded as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations have validated the superiority of our approach over existing methods.
[101] SelfieAvatar: Real-time Head Avatar reenactment from a Selfie Video
Wei Liang, Hui Yu, Derui Ding, Rachael E. Jack, Philippe G. Schyns
Main category: cs.CV
TL;DR: A method combining 3DMM with StyleGAN for detailed head avatar reenactment from single selfie video, achieving high-fidelity reconstruction with rich textures.
Details
Motivation: Existing methods have limitations: 3DMM-based approaches can't capture full head details in real-time, GAN-based methods struggle with fine-grained details like wrinkles and hair, and most require large training datasets rather than working from just a selfie video.
Method: Combines 3D Morphable Models (3DMM) with StyleGAN-based generator. Uses detailed reconstruction model with mixed loss functions for foreground reconstruction and avatar image generation during adversarial training to recover high-frequency details.
Result: Qualitative and quantitative evaluations on self-reenactment and cross-reenactment tasks show superior head avatar reconstruction with rich and intricate textures compared to existing approaches.
Conclusion: The proposed method successfully addresses limitations of existing approaches by enabling detailed head avatar reenactment from just a selfie video, achieving high-fidelity reconstruction with fine-grained details.
Abstract: Head avatar reenactment focuses on creating animatable personal avatars from monocular videos, serving as a foundational element for applications like social signal understanding, gaming, human-machine interaction, and computer vision. Recent advances in 3D Morphable Model (3DMM)-based facial reconstruction methods have achieved remarkable high-fidelity face estimation. However, on the one hand, they struggle to capture the entire head, including non-facial regions and background details in real time, which is an essential aspect for producing realistic, high-fidelity head avatars. On the other hand, recent approaches leveraging generative adversarial networks (GANs) for head avatar generation from videos can achieve high-quality reenactments but encounter limitations in reproducing fine-grained head details, such as wrinkles and hair textures. In addition, existing methods generally rely on a large amount of training data, and rarely focus on using only a simple selfie video to achieve avatar reenactment. To address these challenges, this study introduces a method for detailed head avatar reenactment using a selfie video. The approach combines 3DMMs with a StyleGAN-based generator. A detailed reconstruction model is proposed, incorporating mixed loss functions for foreground reconstruction and avatar image generation during adversarial training to recover high-frequency details. Qualitative and quantitative evaluations on self-reenactment and cross-reenactment tasks demonstrate that the proposed method achieves superior head avatar reconstruction with rich and intricate textures compared to existing approaches.
[102] Weakly supervised framework for wildlife detection and counting in challenging Arctic environments: a case study on caribou (Rangifer tarandus)
Ghazaleh Serati, Samuel Foucher, Jerome Theau
Main category: cs.CV
TL;DR: Weakly supervised patch-level pretraining improves caribou detection in aerial imagery by addressing challenges like background heterogeneity, class imbalance, and small targets, achieving high accuracy across different herds and years.
Details
Motivation: Caribou populations across the Arctic are declining, requiring scalable and accurate monitoring methods. Manual interpretation of aerial imagery is labor-intensive and error-prone, necessitating automatic detection systems that can handle challenges like severe background heterogeneity, class imbalance, small/occluded targets, and varying density/scale.
Method: Proposed HerdNet with weakly supervised patch-level pretraining based on detection network architecture. Uses detection dataset from five caribou herds in Alaska, learning from empty vs. non-empty labels to produce early weakly supervised knowledge for enhanced detection compared to initialization from generic weights.
Result: Patch-based pretraining achieved high accuracy on multi-herd imagery (2017) and independent year (2019) test sets (F1: 93.7%/92.6%). Transfer to detection showed consistent gains over ImageNet weights on positive patches (F1: 92.6%/93.5% vs. 89.3%/88.6%) and full-image counting (F1: 95.5%/93.3% vs. 91.5%/90.4%).
Conclusion: Weakly supervised pretraining on coarse labels prior to detection enables reliable caribou mapping even with limited labeled data, achieving results comparable to generic-weight initialization. Main limitations are false positives from animal-like background clutter and false negatives from low-density occlusions.
Abstract: Caribou populations across the Arctic have declined in recent decades, motivating scalable and accurate monitoring approaches to guide evidence-based conservation actions and policy decisions. Manual interpretation of aerial imagery is labor-intensive and error-prone, underscoring the need for automatic and reliable detection across varying scenes. Yet, such automatic detection is challenging due to severe background heterogeneity, dominant empty terrain (class imbalance), small or occluded targets, and wide variation in density and scale. To make the detection model (HerdNet) more robust to these challenges, a weakly supervised patch-level pretraining based on a detection network’s architecture is proposed. The detection dataset includes five caribou herds distributed across Alaska. By learning from empty vs. non-empty labels in this dataset, the approach produces early weakly supervised knowledge for enhanced detection compared to HerdNet, which is initialized from generic weights. Accordingly, the patch-based pretraining network attained high accuracy on multi-herd imagery (2017) and on an independent year’s (2019) test sets (F1: 93.7%/92.6%, respectively), enabling reliable mapping of regions containing animals to facilitate manual counting on large aerial imagery. Transferred to detection, initialization from weakly supervised pretraining yielded consistent gains over ImageNet weights on both positive patches (F1: 92.6%/93.5% vs. 89.3%/88.6%), and full-image counting (F1: 95.5%/93.3% vs. 91.5%/90.4%). Remaining limitations are false positives from animal-like background clutter and false negatives related to low animal density occlusions. Overall, pretraining on coarse labels prior to detection makes it possible to rely on weakly-supervised pretrained weights even when labeled data are limited, achieving results comparable to generic-weight initialization.
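The weak supervision signal is just empty vs. non-empty tiles; a sketch of deriving those patch labels from point annotations (the tile size is illustrative):

```python
import numpy as np

def weak_patch_labels(image: np.ndarray, points: list, tile: int = 512):
    """Yield (patch, label) pairs from a large aerial image; label is 1 iff
    any annotated animal point (x, y) falls inside the tile."""
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            inside = any(x <= px < x + tile and y <= py < y + tile
                         for px, py in points)
            yield image[y:y + tile, x:x + tile], int(inside)
```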
[103] RealStats: A Rigorous Real-Only Statistical Framework for Fake Image Detection
Haim Zisman, Uri Shaham
Main category: cs.CV
TL;DR: A statistically grounded framework for fake image detection that produces interpretable probability scores by combining multiple detectors through p-value aggregation.
Details
Motivation: Current AI-generated image detection methods lack formal interpretability and rely on implicit assumptions about fake content, limiting their robustness to distributional shifts.
Method: Leverages multiple existing detectors by combining training-free statistics, computes p-values over a range of test statistics, and aggregates them using classical statistical ensembling to assess alignment with real-image distribution.
Result: The framework produces probability scores interpretable with respect to the real-image population, offering a generic, flexible, and training-free approach.
Conclusion: This statistically grounded framework provides robust fake image detection across diverse and evolving settings with formal interpretability.
Abstract: As generative models continue to evolve, detecting AI-generated images remains a critical challenge. While effective detection methods exist, they often lack formal interpretability and may rely on implicit assumptions about fake content, potentially limiting robustness to distributional shifts. In this work, we introduce a rigorous, statistically grounded framework for fake image detection that focuses on producing a probability score interpretable with respect to the real-image population. Our method leverages the strengths of multiple existing detectors by combining training-free statistics. We compute p-values over a range of test statistics and aggregate them using classical statistical ensembling to assess alignment with the unified real-image distribution. This framework is generic, flexible, and training-free, making it well-suited for robust fake image detection across diverse and evolving settings.
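The aggregation step corresponds to classical p-value combination; Fisher's method is one standard instance (the paper's exact combiner may differ) and is available directly in SciPy:

```python
from scipy.stats import combine_pvalues

# Hypothetical p-values of one test image under several real-only statistics:
p_values = [0.40, 0.03, 0.12, 0.08]
stat, p_combined = combine_pvalues(p_values, method="fisher")
# A small combined p-value flags the image as unlikely under the real-image distribution.
print(p_combined)
```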
[104] VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics
Zhiyu Yin, Zhipeng Liu, Kehai Chen, Lemao Liu, Jin Liu, Hong-Dong Li, Yang Xiang, Min Zhang
Main category: cs.CV
TL;DR: VC-Bench: A new benchmark for video connecting task that generates smooth transitions between start and end clips, with comprehensive evaluation metrics beyond just video quality.
Details
Motivation: Current video generation focuses on text/image conditions, but practical applications like video editing and vlogging need seamless connections between separate clips. Lack of standardized evaluation benchmarks has hindered development of video connecting tasks.
Method: Proposed VC-Bench benchmark with 1,579 high-quality videos from public platforms covering 15 main categories and 72 subcategories for diversity. Introduces three core evaluation metrics: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS).
Result: Evaluation of state-of-the-art video generation models on VC-Bench reveals significant limitations in maintaining start-end consistency and transition smoothness, leading to lower overall coherence and fluidity in generated transitions.
Conclusion: VC-Bench serves as a pioneering benchmark to inspire and guide future research in video connecting, addressing the gap in standardized evaluation for this practical video generation task.
Abstract: While current video generation focuses on text or image conditions, practical applications like video editing and vlogging often need to seamlessly connect separate clips. In our work, we introduce Video Connecting, an innovative task that aims to generate smooth intermediate video content between given start and end clips. However, the absence of standardized evaluation benchmarks has hindered the development of this task. To bridge this gap, we proposed VC-Bench, a novel benchmark specifically designed for video connecting. It includes 1,579 high-quality videos collected from public platforms, covering 15 main categories and 72 subcategories to ensure diversity and structure. VC-Bench focuses on three core aspects: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS). Together, they form a comprehensive framework that moves beyond conventional quality-only metrics. We evaluated multiple state-of-the-art video generation models on VC-Bench. Experimental results reveal significant limitations in maintaining start-end consistency and transition smoothness, leading to lower overall coherence and fluidity. We expect that VC-Bench will serve as a pioneering benchmark to inspire and guide future research in video connecting. The evaluation metrics and dataset are publicly available at: https://anonymous.4open.science/r/VC-Bench-1B67/.
[105] On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training
John J. Han, Adam Schmidt, Muhammad Abdullah Jamal, Chinedu Nwoye, Anita Rau, Jie Ying Wu, Omid Mohareri
Main category: cs.CV
TL;DR: Multimodal RGB-D pre-training with geometric tokenization outperforms unimodal RGB approaches in surgical vision tasks, enabling better performance with significantly less labeled data.
Details
Motivation: Current surgical vision foundation models rely on unimodal RGB pre-training, ignoring the complex 3D geometry of surgical environments. The benefits of incorporating depth information in surgical settings remain underexplored despite available multimodal architectures.
Method: Large-scale empirical study comparing eight ViT-based vision foundation models varying in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). Used 1.4 million robotic surgical images with depth maps for pre-training, evaluated under frozen-backbone and end-to-end fine-tuning across eight surgical datasets spanning detection, segmentation, depth estimation, and pose estimation.
Result: Models with explicit geometric tokenization (like MultiMAE) substantially outperform unimodal baselines across all tasks. Geometric-aware pre-training enables remarkable data efficiency: models fine-tuned on just 25% of labeled data consistently surpass RGB-only models trained on full datasets. Gains require no architectural or runtime changes at inference.
Conclusion: Multimodal pre-training with depth information offers a viable path toward building more capable surgical vision systems, providing significant performance improvements and data efficiency without requiring inference-time modifications.
Abstract: Vision foundation models (VFMs) have emerged as powerful tools for surgical scene understanding. However, current approaches predominantly rely on unimodal RGB pre-training, overlooking the complex 3D geometry inherent to surgical environments. Although several architectures support multimodal or geometry-aware inputs in general computer vision, the benefits of incorporating depth information in surgical settings remain underexplored. We conduct a large-scale empirical study comparing eight ViT-based VFMs that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). For pre-training, we use a curated dataset of 1.4 million robotic surgical images paired with depth maps generated from an off-the-shelf network. We evaluate these models under both frozen-backbone and end-to-end fine-tuning protocols across eight surgical datasets spanning object detection, segmentation, depth estimation, and pose estimation. Our experiments yield several consistent findings. Models incorporating explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines across all tasks. Notably, geometric-aware pre-training enables remarkable data efficiency: models fine-tuned on just 25% of labeled data consistently surpass RGB-only models trained on the full dataset. Importantly, these gains require no architectural or runtime changes at inference; depth is used only during pre-training, making adoption straightforward. These findings suggest that multimodal pre-training offers a viable path towards building more capable surgical vision systems.
[106] Smart Split-Federated Learning over Noisy Channels for Embryo Image Segmentation
Zahra Hafezi Kafshgari, Ivan V. Bajic, Parvaneh Saeedi
Main category: cs.CV
TL;DR: SplitFed learning with smart averaging improves resilience to communication channel noise by two orders of magnitude while maintaining model accuracy.
Details
Motivation: Split-Federated learning reduces client hardware requirements but involves transferring feature values, gradients, and model updates across noisy communication channels, which can degrade learning quality.
Method: Proposed a smart averaging strategy for SplitFed learning to improve resilience against channel noise, tested on a segmentation model for embryo images.
Result: Smart averaging tolerates two orders of magnitude stronger noise in communication channels compared to conventional averaging while maintaining final model accuracy.
Conclusion: Smart averaging strategy significantly enhances SplitFed learning’s robustness to communication channel noise, making it more practical for real-world deployments with noisy channels.
Abstract: Split-Federated (SplitFed) learning is an extension of federated learning that places minimal requirements on the clients' computing infrastructure, since only a small portion of the overall model is deployed on the clients' hardware. In SplitFed learning, feature values, gradient updates, and model updates are transferred across communication channels. In this paper, we study the effects of noise in the communication channels on the learning process and the quality of the final model. We propose a smart averaging strategy for SplitFed learning with the goal of improving resilience against channel noise. Experiments on a segmentation model for embryo images show that the proposed smart averaging strategy is able to tolerate two orders of magnitude stronger noise in the communication channels compared to conventional averaging, while still maintaining the accuracy of the final model.
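The summary does not spell out the smart averaging rule itself, so the following is only a generic sketch of one plausible noise-aware alternative to uniform averaging: weighting each client's update by the inverse of its estimated channel-noise variance. The `noise_vars` input and the weighting rule are our assumptions, not the paper's method.

```python
# Hypothetical sketch: inverse-variance weighted averaging of client updates
# received over noisy channels, as one noise-robust alternative to FedAvg.
import numpy as np

def noise_aware_average(updates, noise_vars):
    """Average client updates, down-weighting noisier channels.

    updates:    list of 1-D parameter vectors received over noisy channels
    noise_vars: per-client channel noise variance estimates (assumed known)
    """
    weights = 1.0 / (np.asarray(noise_vars) + 1e-8)   # inverse-variance weights
    weights /= weights.sum()
    stacked = np.stack(updates)                        # (n_clients, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Toy usage: three clients, the third over a much noisier channel.
rng = np.random.default_rng(0)
true_update = np.ones(4)
clients = [true_update + rng.normal(0, s, 4) for s in (0.01, 0.01, 1.0)]
print(noise_aware_average(clients, noise_vars=[1e-4, 1e-4, 1.0]))
```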
[107] GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
Shentong Mo, Zehua Chen, Jun Zhu
Main category: cs.CV
TL;DR: GMS-CAVP is a novel framework that enhances video-audio correspondence modeling through multi-scale contrastive learning and diffusion-based pretraining, outperforming previous methods in generation and retrieval tasks.
Details
Motivation: Existing video-audio joint embedding methods have limitations in modeling the dense, multi-scale nature of video and audio signals, where correspondences span fine- to coarse-grained spatial-temporal structures that are underutilized in current frameworks.
Method: Proposes GMS-CAVP with two key components: 1) Multi-scale contrastive learning strategy capturing semantic and temporal relations across varying granularities, and 2) Diffusion-based generative objective enabling modality translation and synthesis between video and audio through a unified discriminative-generative formulation.
Result: Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in both generation and retrieval tasks.
Conclusion: GMS-CAVP’s unified discriminative-generative approach with multi-scale modeling facilitates deeper cross-modal understanding and enables high-fidelity generation, advancing video-audio correspondence learning beyond traditional contrastive methods.
Abstract: Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in generation and retrieval.
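As a rough illustration of the multi-scale contrastive component (not the paper's exact objective), the sketch below computes a symmetric InfoNCE loss between time-aligned video and audio segments pooled at several temporal granularities; the scale set and the pooling scheme are our assumptions.

```python
# Sketch: InfoNCE contrast between time-aligned V-A segments at several scales.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between batched embeddings a, b of shape (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_scale_contrastive(video_feats, audio_feats, scales=(1, 2, 4)):
    """video_feats, audio_feats: (B, T, D) frame-level features. At each scale,
    pool into coarser temporal segments and contrast aligned V-A segment pairs."""
    B, T, D = video_feats.shape
    loss = 0.0
    for s in scales:
        v = F.avg_pool1d(video_feats.transpose(1, 2), s, stride=s)  # (B, D, T//s)
        a = F.avg_pool1d(audio_feats.transpose(1, 2), s, stride=s)
        v = v.permute(0, 2, 1).reshape(-1, D)                       # segments as samples
        a = a.permute(0, 2, 1).reshape(-1, D)
        loss = loss + info_nce(v, a)
    return loss / len(scales)
```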
[108] Pay Attention to Where You Look
Alex Beriand, JhihYang Wu, Daniel Brignac, Natnael Daba, Abhijit Mahalanobis
Main category: cs.CV
TL;DR: The paper introduces a camera-weighting mechanism for few-shot novel view synthesis that adaptively weights source views based on their relevance to the target view, improving synthesis quality.
Details
Motivation: Existing few-shot NVS methods assume equal importance for all input views relative to the target view, which leads to suboptimal results when some views are more relevant than others.
Method: Two approaches: 1) deterministic weighting using geometric properties (Euclidean distance and angular differences), and 2) cross-attention-based learning scheme that optimizes view weighting. The mechanism can be integrated into various NVS algorithms and further trained to refine view relevance understanding.
Result: Adaptive view weighting enhances accuracy and realism in novel view synthesis, demonstrating improved performance over methods that treat all input views equally.
Conclusion: The camera-weighting mechanism offers a promising direction for improving few-shot NVS by adaptively weighting source views based on their relevance to the target view, with applications across various NVS algorithms.
Abstract: Novel view synthesis (NVS) has advanced with generative modeling, enabling photorealistic image generation. In few-shot NVS, where only a few input views are available, existing methods often assume equal importance for all input views relative to the target, leading to suboptimal results. We address this limitation by introducing a camera-weighting mechanism that adjusts the importance of source views based on their relevance to the target. We propose two approaches: a deterministic weighting scheme leveraging geometric properties like Euclidean distance and angular differences, and a cross-attention-based learning scheme that optimizes view weighting. Additionally, models can be further trained with our camera-weighting scheme to refine their understanding of view relevance and enhance synthesis quality. This mechanism is adaptable and can be integrated into various NVS algorithms, improving their ability to synthesize high-quality novel views. Our results demonstrate that adaptive view weighting enhances accuracy and realism, offering a promising direction for improving NVS.
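The deterministic variant can be sketched directly from the stated geometric cues: score each source camera by its Euclidean distance and viewing-angle difference to the target, then normalize. The softmax combination and the `alpha`/`beta` trade-off weights below are our assumptions, not the paper's exact formula.

```python
# Sketch: deterministic source-view weighting from camera geometry.
import numpy as np

def view_weights(target_pos, target_dir, src_pos, src_dir, alpha=1.0, beta=1.0):
    """target_pos: (3,) target camera position; target_dir: (3,) unit view direction.
    src_pos: (N, 3) source positions; src_dir: (N, 3) unit view directions.
    Returns softmax weights favouring nearby, similarly oriented views."""
    dist = np.linalg.norm(src_pos - target_pos, axis=1)   # Euclidean distance
    cos = np.clip(src_dir @ target_dir, -1.0, 1.0)
    angle = np.arccos(cos)                                 # angular difference (rad)
    score = -(alpha * dist + beta * angle)                 # lower cost -> higher weight
    w = np.exp(score - score.max())                        # stable softmax
    return w / w.sum()
```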
[109] FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Geometry-Complete 4D Reconstruction
Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu
Main category: cs.CV
TL;DR: FreeOrbit4D is a training-free framework for large-angle camera redirection from monocular videos by recovering a geometry-complete 4D proxy to guide video generation.
Details
Motivation: Large-angle camera redirection from monocular videos is ill-posed due to limited spatio-temporal observations, causing geometric ambiguity and temporal inconsistency in existing diffusion-based methods when viewpoints deviate far from the original trajectory.
Method: Decouples foreground/background reconstruction: unprojects video into static background and geometry-incomplete foreground point clouds, uses object-centric multi-view diffusion to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical space, aligns them via 3D-3D correspondences, and projects the 4D proxy to guide conditional video diffusion.
Result: Produces more faithful redirected videos under challenging large-angle trajectories compared to existing methods, and enables practical applications like edit propagation and 4D data generation.
Conclusion: FreeOrbit4D effectively addresses geometric ambiguity in large-angle camera redirection by recovering a geometry-complete 4D proxy as structural grounding, outperforming previous approaches and opening new application avenues.
Abstract: Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing highly partial observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive results, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. To address this, we present FreeOrbit4D, an effective training-free framework that tackles this geometric ambiguity by recovering a geometry-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and geometry-incomplete foreground point clouds in a unified global space, then leverage an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences and projecting the geometry-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful redirected videos under challenging large-angle trajectories, and our geometry-complete 4D proxy further opens a potential avenue for practical applications such as edit propagation and 4D data generation. Project page and code will be released soon.
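The first step of the pipeline, unprojecting the monocular video into point clouds, follows standard pinhole geometry. A minimal sketch, assuming a per-frame (H, W) depth map and 3x3 intrinsics K (the variable names are ours):

```python
# Sketch: lift a depth map to camera-space 3-D points via the pinhole model.
import numpy as np

def unproject_depth(depth, K):
    """depth: (H, W) metric depth map; K: (3, 3) camera intrinsics.
    Returns (H*W, 3) points in the camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                           # back-projected rays
    return rays * depth.reshape(-1, 1)                        # scale rays by depth
```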
[110] Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
Main category: cs.CV
TL;DR: FetalMind is a medical AI system for fetal ultrasound that uses Salient Epistemic Disentanglement to decouple view-disease associations and align with clinical workflow, trained on the first large-scale fetal ultrasound dataset FetalSigma-1M.
Details
Motivation: Existing medical vision-language models underperform in fetal ultrasound due to challenges like multi-view image reasoning, numerous diseases, and image diversity. There's a gap in specialized AI systems for fetal ultrasound report generation and diagnosis.
Method: Proposes Salient Epistemic Disentanglement (SED) that injects an expert-curated bipartite graph to decouple view-disease associations and steers preference selection via reinforcement learning. Uses FetalSigma-1M dataset (20K reports from 12 medical centers) for training.
Result: FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable.
Conclusion: FetalMind successfully bridges the gap in fetal ultrasound AI by addressing domain-specific challenges through clinical workflow alignment and large-scale dataset curation, demonstrating superior performance in both report generation and diagnosis tasks.
Abstract: Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model’s inference with obstetric practice. To train FetalMind at scale, we curate the FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
[111] Towards Gold-Standard Depth Estimation for Tree Branches in UAV Forestry: Benchmarking Deep Stereo Matching Methods
Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Main category: cs.CV
TL;DR: First systematic zero-shot evaluation of eight stereo depth estimation methods across urban/indoor benchmarks plus novel vegetation dataset, revealing DEFOM as best for cross-domain generalization in forestry applications.
Details
Motivation: Autonomous UAV forestry operations need robust depth estimation with strong cross-domain generalization, but existing evaluations focus on urban/indoor scenarios, leaving a critical gap for vegetation-dense environments.
Method: Systematic zero-shot evaluation of eight stereo methods (iterative refinement, foundation model, diffusion-based, 3D CNN) using officially released pretrained weights trained on Scene Flow, evaluated on four standard benchmarks plus novel Canterbury Tree Branches dataset.
Result: Foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D; DEFOM: 4.65 px on Middlebury), while iterative methods show variable cross-benchmark performance. DEFOM established as gold-standard baseline for vegetation depth estimation with superior cross-domain consistency.
Conclusion: DEFOM is identified as the most robust method for cross-domain generalization in vegetation-dense environments, with its predictions serving as pseudo-ground-truth for future forestry depth estimation benchmarking.
Abstract: Autonomous UAV forestry operations require robust depth estimation with strong cross-domain generalization, yet existing evaluations focus on urban and indoor scenarios, leaving a critical gap for vegetation-dense environments. We present the first systematic zero-shot evaluation of eight stereo methods spanning iterative refinement, foundation model, diffusion-based, and 3D CNN paradigms. All methods use officially released pretrained weights (trained on Scene Flow) and are evaluated on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury Tree Branches dataset ($1920 \times 1080$). Results reveal scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D; DEFOM: 4.65 px on Middlebury), while iterative methods show variable cross-benchmark performance (IGEV++: 0.36 px on ETH3D but 6.77 px on Middlebury; IGEV: 0.33 px on ETH3D but 4.99 px on Middlebury). Qualitative evaluation on the Tree Branches dataset establishes DEFOM as the gold-standard baseline for vegetation depth estimation, with superior cross-domain consistency (consistently ranking 1st-2nd across benchmarks, average rank 1.75). DEFOM predictions will serve as pseudo-ground-truth for future benchmarking.
[112] Anatomically-aware conformal prediction for medical image segmentation with random walks
Mélanie Gaillochet, Christian Desrosiers, Hervé Lombaert
Main category: cs.CV
TL;DR: RW-CP is a conformal prediction framework for medical image segmentation that enforces spatial coherence using random walk diffusion on vision foundation model features, improving anatomical validity while maintaining statistical coverage guarantees.
Details
Motivation: Standard conformal prediction methods for medical image segmentation produce fragmented, spatially incoherent prediction sets that lack anatomical meaning, limiting clinical utility despite providing statistical error guarantees.
Method: Proposes Random-Walk Conformal Prediction (RW-CP) that constructs k-nearest neighbor graphs from pre-trained vision foundation model features and applies random walk diffusion to regularize non-conformity scores, enforcing spatial coherence in prediction sets.
Result: RW-CP maintains rigorous marginal coverage while improving segmentation quality by up to 35.4% compared to standard CP baselines at α=0.1 error rate, producing more stable and continuous anatomical boundaries.
Conclusion: RW-CP bridges the gap between statistical validity and anatomical meaningfulness in medical image segmentation uncertainty quantification, offering a model-agnostic framework that ensures both rigorous error guarantees and clinically useful segmentation sets.
Abstract: The reliable deployment of deep learning in medical imaging requires uncertainty quantification that provides rigorous error guarantees while remaining anatomically meaningful. Conformal prediction (CP) is a powerful distribution-free framework for constructing statistically valid prediction intervals. However, standard applications in segmentation often ignore anatomical context, resulting in fragmented, spatially incoherent, and over-segmented prediction sets that limit clinical utility. To bridge this gap, this paper proposes Random-Walk Conformal Prediction (RW-CP), a model-agnostic framework which can be added on top of any segmentation method. RW-CP enforces spatial coherence to generate anatomically valid sets. Our method constructs a k-nearest neighbour graph from pre-trained vision foundation model features and applies a random walk to diffuse uncertainty. The random walk diffusion regularizes the non-conformity scores, making the prediction sets less sensitive to the conformal calibration parameter $\lambda$, ensuring more stable and continuous anatomical boundaries. RW-CP maintains rigorous marginal coverage while significantly improving segmentation quality. Evaluations on multi-modal public datasets show improvements of up to 35.4% compared to standard CP baselines, given an allowable error rate of $\alpha=0.1$.
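A minimal sketch of the core RW-CP mechanism as described: build a k-nearest-neighbour graph over per-element foundation-model features and diffuse the non-conformity scores with a lazy random walk so spatial neighbours receive consistent uncertainty. The hyper-parameters (`k`, the mixing weight `t`, the step count) are assumptions.

```python
# Sketch: random-walk smoothing of non-conformity scores over a kNN feature graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def random_walk_smooth(features, scores, k=8, t=0.5, n_steps=10):
    """features: (N, D) per-pixel/voxel feature vectors;
    scores: (N,) non-conformity scores. Dense matrices: small-N illustration only."""
    A = kneighbors_graph(features, k, mode="connectivity", include_self=False)
    A = A.toarray()
    P = A / A.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
    s = scores.copy()
    for _ in range(n_steps):                    # lazy random-walk diffusion
        s = (1 - t) * scores + t * (P @ s)      # mix original scores with neighbours
    return s
```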
[113] MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models
Tian-Yi Zhou, Xuan-Hao Liu, Bao-Liang Lu, Wei-Long Zheng
Main category: cs.CV
TL;DR: MindCine: A novel EEG-to-video reconstruction framework using multimodal joint learning and pre-trained large EEG models to overcome single modality limitations and data scarcity issues.
Details
Motivation: Reconstructing human dynamic visual perception from EEG signals is valuable due to EEG's non-invasiveness and high temporal resolution, but current methods face challenges with single modality alignment (only text) and data scarcity issues.
Method: Proposes MindCine framework with: 1) multimodal joint learning strategy incorporating beyond-text modalities, 2) leveraging pre-trained large EEG model to address data scarcity for semantic decoding, and 3) Seq2Seq model with causal attention specifically designed for perceptual information decoding.
Result: Extensive experiments show the model outperforms state-of-the-art methods both qualitatively and quantitatively. Results demonstrate effectiveness of complementary strengths of different modalities and that leveraging large-scale EEG models enhances reconstruction performance by alleviating limited data challenges.
Conclusion: MindCine successfully addresses key challenges in EEG-to-video reconstruction through multimodal integration and pre-trained model utilization, achieving high-fidelity video reconstructions even with limited EEG-video data.
Abstract: Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance given EEG’s non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging due to: 1) Single Modality: existing studies solely align EEG signals with the text modality, which ignores other modalities and is prone to overfitting; 2) Data Scarcity: current methods often struggle to converge when trained on limited EEG-video data. To solve the above problems, we propose MindCine, a novel framework that achieves high-fidelity video reconstruction from limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities in the training stage and leverage a pre-trained large EEG model to relieve the data scarcity issue for decoding semantic information, while a Seq2Seq model with causal attention is specifically designed for decoding perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. Additionally, the results underscore the effectiveness of the complementary strengths of different modalities and demonstrate that leveraging a large-scale EEG model can further enhance reconstruction performance by alleviating the challenges associated with limited data.
[114] Non-Invasive 3D Wound Measurement with RGB-D Imaging
Lena Harkämper, Leo Lebrat, David Ahmedt-Aristizabal, Olivier Salvado, Mattias Heinrich, Rodrigo Santa Cruz
Main category: cs.CV
TL;DR: A fast, non-invasive 3D wound measurement algorithm using RGB-D imaging with B-spline surface reconstruction for automated clinical wound assessment.
Details
Motivation: Chronic wound monitoring requires accurate and efficient measurement methods for clinical assessment. Current approaches need improvement in speed, accuracy, and automation for both clinical and remote healthcare settings.
Method: Combines RGB-D odometry with B-spline surface reconstruction to generate detailed 3D wound meshes from RGB-D images, enabling automatic computation of clinically relevant measurements (perimeter, surface area, dimensions).
Result: Achieved sub-millimeter 3D reconstruction accuracy compared to high-resolution ground-truth scans, with low variability across repeated captures and strong agreement with manual assessments. Outperformed state-of-the-art object-centric RGB-D reconstruction methods while maintaining real-time clinical deployment runtimes.
Conclusion: The proposed pipeline offers a promising tool for automated wound assessment suitable for both clinical and remote healthcare settings, providing fast, accurate, and non-invasive wound measurement.
Abstract: Chronic wound monitoring and management require accurate and efficient wound measurement methods. This paper presents a fast, non-invasive 3D wound measurement algorithm based on RGB-D imaging. The method combines RGB-D odometry with B-spline surface reconstruction to generate detailed 3D wound meshes, enabling automatic computation of clinically relevant wound measurements such as perimeter, surface area, and dimensions. We evaluated our system on realistic silicone wound phantoms and measured sub-millimetre 3D reconstruction accuracy compared with high-resolution ground-truth scans. The extracted measurements demonstrated low variability across repeated captures and strong agreement with manual assessments. The proposed pipeline also outperformed a state-of-the-art object-centric RGB-D reconstruction method while maintaining runtimes suitable for real-time clinical deployment. Our approach offers a promising tool for automated wound assessment in both clinical and remote healthcare settings.
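Once a 3D wound mesh is reconstructed, the clinically relevant measurements reduce to standard mesh geometry. A self-contained sketch of the surface-area and perimeter computations (the ordered boundary loop is assumed to come from the reconstruction stage):

```python
# Sketch: surface area and perimeter of a reconstructed triangle wound mesh.
import numpy as np

def triangle_mesh_area(vertices, faces):
    """vertices: (V, 3) points; faces: (F, 3) vertex indices.
    Area = half the norm of each face's edge cross product, summed."""
    a = vertices[faces[:, 1]] - vertices[faces[:, 0]]
    b = vertices[faces[:, 2]] - vertices[faces[:, 0]]
    return 0.5 * np.linalg.norm(np.cross(a, b), axis=1).sum()

def boundary_perimeter(vertices, boundary_loop):
    """boundary_loop: ordered indices of the wound's closed boundary curve."""
    pts = vertices[boundary_loop]
    return np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1).sum()
```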
[115] NC-Reg : Neural Cortical Maps for Rigid Registration
Ines Vati, Pierrick Bourgeat, Rodrigo Santa Cruz, Vincent Dore, Olivier Salvado, Clinton Fookes, Léo Lebrat
Main category: cs.CV
TL;DR: Neural cortical maps offer a continuous neural representation for cortical feature maps, enabling faster optimization on spheres and achieving sub-degree accuracy in cortical surface registration.
Details
Motivation: Traditional discrete structures like grids and meshes for cortical feature maps have limitations in flexibility and efficiency. The authors aim to create a more continuous, compact representation that can learn from arbitrary-sized meshes and provide features at any resolution.
Method: The paper introduces neural cortical maps as a continuous neural representation alternative to discrete structures. It proposes NC-Reg, an iterative algorithm combining neural cortical feature maps, gradient descent optimization, and simulated annealing for rigid registration of cortical surfaces.
Result: Neural cortical maps achieve runtimes up to 30 times faster than classic barycentric interpolation (for same iterations). NC-Reg demonstrates sub-degree accuracy (<1° from global optimum) in subject-to-template experiments, serving as a robust pre-alignment strategy for clinical settings.
Conclusion: Neural cortical maps provide an efficient continuous representation for cortical feature mapping, enabling faster optimization and accurate registration. The NC-Reg algorithm shows promise as a robust pre-alignment tool for clinical neuroimaging applications.
Abstract: We introduce neural cortical maps, a continuous and compact neural representation for cortical feature maps, as an alternative to traditional discrete structures such as grids and meshes. It can learn from meshes of arbitrary size and provide learnt features at any resolution. Neural cortical maps enable efficient optimization on the sphere and achieve runtimes up to 30 times faster than classic barycentric interpolation (for the same number of iterations). As a proof of concept, we investigate rigid registration of cortical surfaces and propose NC-Reg, a novel iterative algorithm that involves the use of neural cortical feature maps, gradient descent optimization and a simulated annealing strategy. Through ablation studies and subject-to-template experiments, our method demonstrates sub-degree accuracy ($<1^\circ$ from the global optimum), and serves as a promising robust pre-alignment strategy, which is critical in clinical settings.
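The simulated-annealing component of NC-Reg can be sketched as a Metropolis-style search over 3D rotations; the neural cortical map and the gradient-descent refinement are omitted here, and the cooling schedule and step size are assumptions.

```python
# Sketch: simulated annealing over 3-D rotations for rigid alignment.
import numpy as np
from scipy.spatial.transform import Rotation

def anneal_rotation(cost_fn, n_iters=500, T0=1.0, seed=0):
    """cost_fn: maps a scipy Rotation to a scalar registration cost."""
    rng = np.random.default_rng(seed)
    R = Rotation.identity()
    cur_cost = cost_fn(R)
    best, best_cost = R, cur_cost
    for i in range(n_iters):
        T = T0 * (1 - i / n_iters) + 1e-6                        # linear cooling
        step = Rotation.from_rotvec(rng.normal(0, 0.1 * T, 3))   # small random rotation
        cand = step * R
        c = cost_fn(cand)
        if c < cur_cost or rng.random() < np.exp((cur_cost - c) / T):  # Metropolis rule
            R, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best

# Toy usage: recover a known 30-degree rotation via geodesic-distance cost.
target = Rotation.from_euler("z", 30, degrees=True)
est = anneal_rotation(lambda R: (R.inv() * target).magnitude())
print(np.degrees((est.inv() * target).magnitude()))  # residual angle, near 0
```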
[116] NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation
Han-Hung Lee, Cheng-Yu Yang, Yu-Lun Liu, Angel X. Chang
Main category: cs.CV
TL;DR: NuiWorld is a framework for world generation that addresses controllability, scalability, and efficiency challenges through generative bootstrapping, variable scene chunk representation, and pseudo sketch controls.
Details
Motivation: Existing world generation approaches face three main obstacles: 1) Controllability limitations, 2) Scalability issues with fixed-resolution representations degrading fidelity for larger scenes, and 3) Efficiency problems where training-free approaches are slow and computationally expensive at inference time. End-to-end models suffer from data scarcity.
Method: 1) Generative bootstrapping strategy that starts from few input images and uses 3D reconstruction and expandable scene generation to synthesize diverse training data. 2) Variable scene chunk representation where scenes are decomposed into chunks compressed into flattened vector-sets, reducing token length. 3) Controllability through pseudo sketch labels for scene layout guidance.
Result: The framework produces sufficient training data to train end-to-end models, enables consistent geometric fidelity across different scene sizes, improves training and inference efficiency, and demonstrates generalization to unseen sketches through pseudo sketch controls.
Conclusion: NuiWorld addresses key challenges in world generation by combining generative bootstrapping for data scarcity, variable chunk representation for scalability, and sketch-based controls for controllability, offering a more efficient and flexible framework for applications in gaming, simulation, and robotics.
Abstract: World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity, while object-centric generation approaches rely on fixed-resolution representations, degrading fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
[117] Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models
Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong
Main category: cs.CV
TL;DR: PixSearch is an end-to-end Segmenting Large Multimodal Model that unifies region-level perception with retrieval-augmented reasoning, using pixel-level masks as visual queries and learning when to retrieve through supervised fine-tuning.
Details
Motivation: Visual Question Answering requires both fine-grained perception and external factual knowledge. Existing multimodal RAG systems lack internal policies for determining when and how to retrieve information, relying on modular pipelines that can be inefficient.
Method: PixSearch emits
Result: PixSearch achieves 19.7% relative gain in accuracy on CRAG-MM compared to whole image retrieval, while maintaining competitive performance on various VQA and text-only QA tasks. It substantially improves factual consistency and generalization on egocentric and entity-centric VQA benchmarks.
Conclusion: PixSearch successfully unifies region-level perception with retrieval-augmented reasoning in an end-to-end model, eliminating reliance on modular pipelines and improving factual grounding in VQA tasks through learned retrieval policies.
Abstract: Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits
[118] m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
Yosub Shin, Michael Buriek, Igor Molybog
Main category: cs.CV
TL;DR: m2sv is a new benchmark for map-to-street-view spatial reasoning that tests VLMs’ ability to align overhead maps with street-level images to infer camera direction, revealing significant gaps in spatial reasoning capabilities.
Details
Motivation: Vision-language models perform well on many multimodal tasks but struggle with spatial reasoning that requires aligning abstract overhead representations with egocentric views. Current benchmarks don't adequately test this capability.
Method: Created m2sv-20k benchmark with geographically diverse map-to-street-view alignment tasks, plus m2sv-sft-11k for supervised fine-tuning. Evaluated VLMs on spatial reasoning, conducted supervised fine-tuning and reinforcement learning experiments, and performed systematic difficulty analysis.
Result: Best VLM achieved only 65.2% accuracy vs human baseline of 95%. Fine-tuning and RL improved performance but showed limited transfer across benchmarks. Analysis revealed persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency.
Conclusion: The paper highlights significant weaknesses in VLMs’ spatial reasoning capabilities and introduces a benchmark to drive future work on grounded spatial reasoning across different viewpoints.
Abstract: Vision–language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
[119] Glance and Focus Reinforcement for Pan-cancer Screening
Linshan Wu, Jiaxin Zhuang, Hao Chen
Main category: cs.CV
TL;DR: GF-Screen: A reinforcement learning framework for pan-cancer CT screening using glance (localization) and focus (segmentation) models, with group relative learning to improve efficiency and reduce false positives.
Details
Motivation: Pan-cancer screening in large CT volumes is challenging due to difficulty localizing diverse tiny lesions and extreme foreground-background imbalance, causing models to waste computation on healthy regions and produce false positives.
Method: Two-stage framework: Glance model localizes diseased regions by cropping sub-volumes and selecting those with lesions; Focus model segments lesions. Reinforcement learning uses segmentation results to reward Glance model. Group relative learning prioritizes high-advantage predictions within sub-volume groups.
Result: Extensive experiments on 16 internal and 7 external datasets across 9 lesion types show effectiveness. Leads MICCAI FLARE25 public validation leaderboard, surpassing FLARE24 champion by +25.6% DSC and +28.2% NSD.
Conclusion: GF-Screen successfully extends RL techniques to pan-cancer screening challenges, improving lesion detection efficiency and accuracy while reducing false positives through glance-focus strategy and group relative learning.
Abstract: Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists’ glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).
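The group relative learning paradigm resembles group-normalized advantage estimation: rewards for the sub-volumes in each group are standardized against the group mean, and only high-advantage predictions are kept for the policy update. A sketch under those assumptions (`keep_frac` and the masking rule are ours):

```python
# Sketch: group-relative advantages with low-advantage predictions discarded.
import torch

def group_relative_advantages(rewards, keep_frac=0.5):
    """rewards: (G, N) rewards (e.g. Focus-model Dice) for N sub-volumes in
    each of G groups. Returns advantages with low-advantage entries zeroed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    adv = (rewards - mean) / std                      # group-relative advantage
    k = max(1, int(keep_frac * rewards.size(1)))
    thresh = adv.topk(k, dim=1).values[:, -1:]        # per-group keep cutoff
    return torch.where(adv >= thresh, adv, torch.zeros_like(adv))
```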
[120] Reg-TTR, Test-Time Refinement for Fast, Robust and Accurate Image Registration
Lin Chen, Yue He, Fengting Zhang, Yaonan Wang, Fengming Lin, Xiang Chen, Min Liu
Main category: cs.CV
TL;DR: Reg-TTR is a test-time refinement framework that combines deep learning speed with traditional registration robustness to improve accuracy of registration foundation models with minimal computational overhead.
Details
Motivation: Traditional registration methods are robust but slow, while deep learning is fast but struggles with domain shifts. Registration foundation models offer a balance but can't match specialized model accuracy. There's a need to bridge this performance gap efficiently.
Method: Proposes Reg-TTR, a test-time refinement framework that refines predictions of pre-trained registration foundation models at inference time by synergizing deep learning and conventional registration techniques.
Result: Achieves state-of-the-art performance with only 21% additional inference time (0.56s), significantly improving registration accuracy while maintaining near-deep-learning inference speeds.
Conclusion: Reg-TTR offers an efficient strategy to narrow the performance gap between registration foundation models and specialized SOTA methods, providing a practical solution as foundation models continue to emerge.
Abstract: Traditional image registration methods are robust but slow due to their iterative nature. While deep learning has accelerated inference, it often struggles with domain shifts. Emerging registration foundation models offer a balance of speed and robustness, yet typically cannot match the peak accuracy of specialized models trained on specific datasets. To mitigate this limitation, we propose Reg-TTR, a test-time refinement framework that synergizes the complementary strengths of both deep learning and conventional registration techniques. By refining the predictions of pre-trained models at inference, our method delivers significantly improved registration accuracy at a modest computational cost, requiring only 21% additional inference time (0.56s). We evaluate Reg-TTR on two distinct tasks and show that it achieves state-of-the-art (SOTA) performance while maintaining inference speeds close to previous deep learning methods. As foundation models continue to emerge, our framework offers an efficient strategy to narrow the performance gap between registration foundation models and SOTA methods trained on specialized datasets. The source code will be publicly available following the acceptance of this work.
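The paper does not detail its refinement procedure, so the following only sketches the general test-time refinement pattern: take the foundation model's predicted transformation as initialization and run a few instance-specific optimization steps on the pair at hand. The warp function, MSE similarity, and regularizer below are stand-ins, not Reg-TTR's actual components.

```python
# Sketch: iterative test-time refinement of a predicted displacement field.
import torch

def test_time_refine(phi_init, moving, fixed, warp_fn, n_iters=20, lr=0.1):
    """phi_init: displacement field predicted by a registration foundation model;
    warp_fn(moving, phi): differentiable warping operator (assumed provided)."""
    phi = phi_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([phi], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = torch.mean((warp_fn(moving, phi) - fixed) ** 2)  # image similarity
        loss = loss + 1e-2 * phi.pow(2).mean()                  # crude smoothness proxy
        loss.backward()
        opt.step()
    return phi.detach()
```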
[121] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities
Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri
Main category: cs.CV
TL;DR: A multimodal person identification system using upper-body motion, face, and voice with hybrid fusion achieves near-perfect accuracy, maintaining robustness even when modalities are missing.
Details
Motivation: Real-world person identification often faces missing or degraded modalities, requiring robust systems that can handle incomplete data while leveraging multiple cues.
Method: Unified hybrid fusion combining feature-level and score-level information with multi-task learning, cross-attention, gated fusion, and confidence-weighted adaptation for missing data.
Result: Achieves 99.51% Top-1 accuracy on CANDOR dataset (trimodal), 99.92% on VoxCeleb1 (bimodal), and maintains high performance even with missing modalities.
Conclusion: The proposed multimodal framework provides a robust solution for real-world person recognition by effectively combining body motion with traditional modalities and handling missing data.
Abstract: Person identification systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently involve missing or degraded modalities. To address this challenge, we propose a multimodal person identification framework incorporating upper-body motion, face, and voice. Experimental results demonstrate that body motion outperforms traditional modalities such as face and voice in within-session evaluations, while serving as a complementary cue that enhances performance in multi-session scenarios. Our model employs a unified hybrid fusion strategy, fusing both feature-level and score-level information to maximize representational richness and decision accuracy. Specifically, it leverages multi-task learning to process modalities independently, followed by cross-attention and gated fusion mechanisms to exploit both unimodal information and cross-modal interactions. Finally, a confidence-weighted strategy and mistake-correction mechanism dynamically adapt to missing data, ensuring that our single classification head achieves optimal performance even in unimodal and bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark in this work for the first time. Our results demonstrate that the proposed trimodal system achieves 99.51% Top-1 accuracy on person identification tasks. In addition, we evaluate our model on the VoxCeleb1 dataset, a widely used evaluation protocol, and reach 99.92% accuracy in bimodal mode, outperforming conventional approaches. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
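A minimal sketch of one building block, a gated fusion cell with confidence weighting for missing modalities; the hidden sizes and the exact gate form are assumptions rather than the paper's design.

```python
# Sketch: gated fusion of two modality embeddings with confidence weighting.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learned per-dimension sigmoid gate over the concatenated streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x_a, x_b, conf_a=1.0, conf_b=1.0):
        # Setting a modality's confidence to 0 (missing/degraded input)
        # makes the cell fall back to the other stream.
        g = self.gate(torch.cat([x_a, x_b], dim=-1))
        g = g * conf_a / (g * conf_a + (1 - g) * conf_b + 1e-8)
        return g * x_a + (1 - g) * x_b

# Toy usage: face embedding missing -> output equals the voice embedding.
fuse = GatedFusion(dim=16)
face, voice = torch.randn(2, 16), torch.randn(2, 16)
print(torch.allclose(fuse(face, voice, conf_a=0.0), voice))
```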
[122] FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation
Xiang Gao, Yunpeng Jia
Main category: cs.CV
TL;DR: FBSDiff/FBSDiff++: Plug-and-play frequency-domain framework adapting T2I diffusion models for controllable image-to-image translation without training, using dynamic frequency band substitution for appearance/layout/contour guidance.
Details
Motivation: Extend text-to-image diffusion models to image-to-image translation where a source image provides visual guidance alongside text prompts, enabling more controllable image generation with existing models.
Method: Frequency-domain approach using dynamic frequency band substitution of diffusion features. Low-frequency substitution for appearance guidance, mid-frequency for layout, high-frequency for contours. FBSDiff++ adds architectural improvements, arbitrary resolution support, and localized manipulation capabilities.
Result: FBSDiff++ achieves 8.9× inference speedup, handles arbitrary resolution/aspect ratio inputs, enables localized editing and style-specific creation. Superior visual quality, efficiency, versatility, and controllability compared to advanced approaches.
Conclusion: Frequency-domain perspective provides effective plug-and-play solution for adapting T2I diffusion models to I2I translation with versatile control, no training required, and significant performance improvements in FBSDiff++.
Abstract: With large-scale text-to-image (T2I) diffusion models achieving significant advancements in open-domain image creation, increasing attention has been focused on their natural extension to the realm of text-driven image-to-image (I2I) translation, where a source image acts as visual guidance to the generated image in addition to the textual guidance provided by the text prompt. We propose FBSDiff, a novel framework adapting off-the-shelf T2I diffusion model into the I2I paradigm from a fresh frequency-domain perspective. Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I in a plug-and-play manner (without need for model training, fine-tuning, or online optimization), allowing appearance-guided, layout-guided, and contour-guided I2I translation by progressively substituting low-frequency band, mid-frequency band, and high-frequency band of latent diffusion features, respectively. In addition, FBSDiff flexibly enables continuous control over I2I correlation intensity simply by tuning the bandwidth of the substituted frequency band. To further promote image translation efficiency, flexibility, and functionality, we propose FBSDiff++ which improves upon FBSDiff mainly in three aspects: (1) accelerate inference speed by a large margin (8.9$\times$ speedup in inference) with refined model architecture; (2) improve the Frequency Band Substitution module to allow for input source images of arbitrary resolution and aspect ratio; (3) extend model functionality to enable localized image manipulation and style-specific content creation with only subtle adjustments to the core method. Extensive qualitative and quantitative experiments verify superiority of FBSDiff++ in I2I translation visual quality, efficiency, versatility, and controllability compared to related advanced approaches.
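The core frequency band substitution operation can be sketched with a 2D FFT: replace an annular band of the target's spectrum with the corresponding band from the reference. In FBSDiff the swap is applied to latent diffusion features during sampling rather than to raw images, and the band radii below are illustrative.

```python
# Sketch: swap a frequency band of `target` with the same band of `reference`.
import torch

def substitute_band(target, reference, r_lo=0.0, r_hi=0.25):
    """target, reference: (..., H, W) real tensors. Band radii are fractions of
    the spectrum half-width; r_lo == 0 gives a low-frequency substitution."""
    Ft = torch.fft.fftshift(torch.fft.fft2(target), dim=(-2, -1))
    Fr = torch.fft.fftshift(torch.fft.fft2(reference), dim=(-2, -1))
    H, W = target.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    radius = torch.sqrt(((yy - H / 2) / (H / 2)) ** 2 + ((xx - W / 2) / (W / 2)) ** 2)
    band = (radius >= r_lo) & (radius < r_hi)        # annular frequency mask
    Ft[..., band] = Fr[..., band]                    # substitute the band
    out = torch.fft.ifft2(torch.fft.ifftshift(Ft, dim=(-2, -1)))
    return out.real
```

Tuning `r_hi` widens or narrows the substituted band, which matches the paper's described control over I2I correlation intensity via the bandwidth.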
[123] Implicit Non-Causal Factors are Out via Dataset Splitting for Domain Generalization Object Detection
Zhilong Zhang, Lei Zhang, Qing He, Shuyin Xia, Guoyin Wang, Fuxiang Huang
Main category: cs.CV
TL;DR: GB-DAL improves domain generalization for open world object detection by addressing implicit non-causal factors through granular-ball splitting and simulated non-causal factor augmentation.
Details
Motivation: Current domain adversarial learning methods for domain generalization focus on domain-invariant information but overlook implicit non-causal factors caused by data bias, limiting their effectiveness in open world object detection.
Method: Proposes GB-DAL with two modules: 1) Prototype-based Granular Ball Splitting (PGBS) to generate dense domains from limited datasets, and 2) Simulated Non-causal Factors (SNF) module for data augmentation to reduce implicitness of non-causal factors.
Result: Comparative experiments on numerous benchmarks demonstrate better generalization performance in novel circumstances compared to existing methods.
Conclusion: GB-DAL effectively addresses the limitations of conventional domain adversarial learning by better capturing implicit non-causal factors, leading to improved domain generalization for open world object detection.
Abstract: Open world object detection faces a significant challenge in domain-invariant representation, i.e., implicit non-causal factors. Most domain generalization (DG) methods based on domain adversarial learning (DAL) pay much attention to learning domain-invariant information, but often overlook the potential non-causal factors. We unveil two critical causes: 1) The domain discriminator-based DAL method is subject to extremely sparse domain labels, i.e., assigning only one domain label to each dataset, and thus can only associate explicit non-causal factors, which is severely limiting. 2) The non-causal factors, induced by unidentified data bias, are excessively implicit and cannot be discerned by the conventional DAL paradigm alone. Based on these key findings, inspired by the Granular-Ball perspective, we propose an improved DAL method, i.e., GB-DAL. The proposed GB-DAL utilizes a Prototype-based Granular Ball Splitting (PGBS) module to generate more dense domains from limited datasets, akin to more fine-grained granular balls, indicating more potential non-causal factors. Inspired by adversarial perturbations akin to non-causal factors, we propose a Simulated Non-causal Factors (SNF) module as a means of data augmentation to reduce the implicitness of non-causal factors, and facilitate the training of GB-DAL. Comparative experiments on numerous benchmarks demonstrate that our method achieves better generalization performance in novel circumstances.
[124] Resolving Primitive-Sharing Ambiguity in Long-Tailed Industrial Point Cloud Segmentation via Spatial Context Constraints
Chao Yin, Qing Han, Zhiwei Hou, Yue Liu, Anjin Dai, Hongda Hu, Ji Yang, Wei Yao
Main category: cs.CV
TL;DR: Novel spatial context constraints added to Class-Balanced Loss to resolve both class imbalance AND geometric ambiguity in industrial point cloud segmentation, dramatically improving safety-critical component detection.
Details
Motivation: Industrial point cloud segmentation for Digital Twins systematically misclassifies safety-critical components like reducers and valves due to dual crisis: extreme class imbalance (215:1 ratio) compounded by geometric ambiguity where tail classes share identical local geometry (cylindrical primitives) with dominant structures like pipes.
Method: Extends Class-Balanced Loss framework with two architecture-agnostic spatial context constraints: (1) Boundary-CB - entropy-based constraint emphasizing ambiguous boundaries, and (2) Density-CB - density-based constraint compensating for scan-dependent variations. Both integrate as plug-and-play modules requiring only loss function replacement.
Result: On Industrial3D dataset (610M points from water treatment facilities): 55.74% mIoU with 21.7% relative improvement on tail-class performance (29.59% vs. 24.32% baseline) while preserving head-class accuracy (88.14%). Safety-critical components show dramatic gains: reducer improves from 0% to 21.12% IoU; valve improves by 24.3% relative.
Conclusion: Resolves geometric ambiguity without typical head-tail trade-off, enabling reliable identification of safety-critical components for automated knowledge extraction in Digital Twin applications. Spatial context constraints effectively address both statistical imbalance and geometric ambiguity unique to industrial 3D data.
Abstract: Industrial point cloud segmentation for Digital Twin construction faces a persistent challenge: safety-critical components such as reducers and valves are systematically misclassified. These failures stem from two compounding factors: such components are rare in training data, yet they share identical local geometry with dominant structures like pipes. This work identifies a dual crisis unique to industrial 3D data: extreme class imbalance (a 215:1 ratio) compounded by geometric ambiguity, where most tail classes share cylindrical primitives with head classes. Existing frequency-based re-weighting methods address statistical imbalance but cannot resolve geometric ambiguity. We propose spatial context constraints that leverage neighborhood prediction consistency to disambiguate locally similar structures. Our approach extends the Class-Balanced (CB) Loss framework with two architecture-agnostic mechanisms: (1) Boundary-CB, an entropy-based constraint that emphasizes ambiguous boundaries, and (2) Density-CB, a density-based constraint that compensates for scan-dependent variations. Both integrate as plug-and-play modules without network modifications, requiring only loss function replacement. On the Industrial3D dataset (610M points from water treatment facilities), our method achieves 55.74% mIoU with 21.7% relative improvement on tail-class performance (29.59% vs. 24.32% baseline) while preserving head-class accuracy (88.14%). Components with primitive-sharing ambiguity show dramatic gains: reducer improves from 0% to 21.12% IoU; valve improves by 24.3% relative. This resolves geometric ambiguity without the typical head-tail trade-off, enabling reliable identification of safety-critical components for automated knowledge extraction in Digital Twin applications.
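The Class-Balanced Loss base that both variants extend follows Cui et al. (2019): each class is weighted by the inverse of its effective number of samples, (1 - beta^n) / (1 - beta). A minimal sketch of that base weighting (the entropy- and density-based multipliers of Boundary-CB and Density-CB are not reproduced here):

```python
# Sketch: Class-Balanced Loss re-weighting via effective number of samples.
import torch

def class_balanced_weights(samples_per_class, beta=0.9999):
    """samples_per_class: per-class training point counts.
    Returns per-class weights normalised to mean ~1."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float64)
    eff_num = (1.0 - torch.pow(beta, n)) / (1.0 - beta)   # effective sample count
    w = 1.0 / eff_num
    return (w / w.sum() * len(n)).float()

# Toy usage with an extreme 215:1 head-tail ratio, as in the paper's data.
print(class_balanced_weights([21500, 100]))
```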
[125] CLIP-Guided Unsupervised Semantic-Aware Exposure Correction
Puzhen Wu, Han Weng, Quan Zheng, Yi Zhan, Hewei Wang, Yiming Li, Jiahui Han, Rui Xu
Main category: cs.CV
TL;DR: Unsupervised semantic-aware exposure correction network using FastSAM and CLIP to address color shift artifacts and lack of ground-truth labels.
Details
Motivation: Exposure correction faces two main challenges: (1) ignorance of object-wise regional semantic information causes color shift artifacts, and (2) real-world exposure images lack ground-truth labels, requiring massive manual editing for labeling.
Method: Proposes an unsupervised semantic-aware exposure correction network with: (1) adaptive semantic-aware fusion module integrating FastSAM semantic information, (2) multi-scale residual spatial mamba group for detail restoration and exposure adjustment, (3) CLIP-guided pseudo-ground truth generator for automatic exposure identification, and (4) semantic-prompt consistency loss leveraging FastSAM and CLIP priors for unsupervised training.
Result: Comprehensive experiments show the method effectively corrects real-world exposure images and outperforms state-of-the-art unsupervised methods both numerically and visually.
Conclusion: The proposed unsupervised semantic-aware exposure correction network successfully addresses the challenges of color shift artifacts and lack of ground-truth labels by leveraging semantic information from FastSAM and CLIP, achieving superior performance compared to existing unsupervised methods.
Abstract: Improper exposure often leads to severe loss of details, color distortion, and reduced contrast. Exposure correction still faces two critical challenges: (1) the ignorance of object-wise regional semantic information causes the color shift artifacts; (2) real-world exposure images generally have no ground-truth labels, and its labeling entails massive manual editing. To tackle the challenges, we propose a new unsupervised semantic-aware exposure correction network. It contains an adaptive semantic-aware fusion module, which effectively fuses the semantic information extracted from a pre-trained Fast Segment Anything Model into a shared image feature space. Then the fused features are used by our multi-scale residual spatial mamba group to restore the details and adjust the exposure. To avoid manual editing, we propose a pseudo-ground truth generator guided by CLIP, which is fine-tuned to automatically identify exposure situations and instruct the tailored corrections. Also, we leverage the rich priors from the FastSAM and CLIP to develop a semantic-prompt consistency loss to enforce semantic consistency and image-prompt alignment for unsupervised training. Comprehensive experimental results illustrate the effectiveness of our method in correcting real-world exposure images and show that it outperforms state-of-the-art unsupervised methods both numerically and visually.
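The CLIP-guided exposure identification step can be approximated in zero-shot form by scoring an image against exposure-describing prompts; the prompts below are hypothetical, and the paper additionally fine-tunes CLIP, which this sketch omits.

```python
# Sketch: zero-shot exposure classification with CLIP prompt scoring.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical prompts standing in for the paper's exposure conditions.
PROMPTS = ["an underexposed photo", "a well-exposed photo", "an overexposed photo"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def exposure_probs(image: Image.Image) -> torch.Tensor:
    """Return probabilities over the three exposure conditions."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # (1, 3) image-text similarity
    return logits.softmax(dim=-1).squeeze(0)
```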
[126] QA-ReID: Quality-Aware Query-Adaptive Convolution Leveraging Fused Global and Structural Cues for Clothes-Changing ReID
Yuxiang Wang, Kunming Jiang, Tianxiang Zhang, Ke Tian, Gaozhe Jiang
Main category: cs.CV
TL;DR: QA-ReID: A quality-aware dual-branch matching framework for clothes-changing person re-identification that fuses RGB and parsing features with adaptive attention and robust matching.
Details
Motivation: Clothes-changing ReID presents severe challenges due to substantial appearance variations from clothing changes, requiring methods that can handle these variations while maintaining identity recognition.
Method: Proposes Quality-Aware Dual-Branch Matching (QA-ReID) with: 1) RGB-based features and parsing-based representations for global appearance and clothing-invariant structural cues, 2) multi-modal attention module for adaptive fusion, 3) Quality-Aware Query Adaptive Convolution (QAConv-QA) with pixel-level importance weighting and bidirectional consistency constraints for robust matching.
Result: Achieves state-of-the-art performance on multiple benchmarks (PRCC, LTCC, VC-Clothes) and significantly outperforms existing approaches under cross-clothing scenarios.
Conclusion: QA-ReID effectively addresses clothes-changing ReID challenges by combining appearance and structural features with quality-aware adaptive matching, demonstrating superior performance in handling clothing variations.
Abstract: Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) presents severe challenges due to substantial appearance variations introduced by clothing changes. In this work, we propose the Quality-Aware Dual-Branch Matching (QA-ReID), which jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.
[127] TFFM: Topology-Aware Feature Fusion Module via Latent Graph Reasoning for Retinal Vessel Segmentation
Iftekhar Ahmed, Shakib Absar, Aftar Ahmad Sami, Shadman Sakib, Debojyoti Biswas, Seraj Al Mahmud Mostafa
Main category: cs.CV
TL;DR: A topology-aware framework for retinal vessel segmentation that maintains vascular connectivity using graph attention networks and hybrid loss functions, reducing fragmentation by 38%.
Details
Motivation: Standard convolutional architectures produce topologically disjointed segmentations with gaps and discontinuities, making reliable graph-based clinical analysis impossible despite high pixel-level accuracy.
Method: Introduces a topology-aware framework with a Topological Feature Fusion Module (TFFM) that maps local features into latent graph space using Graph Attention Networks to capture global structural dependencies. Uses hybrid objective function combining Tversky loss for class imbalance with soft clDice loss to penalize topological disconnects.
Result: State-of-the-art performance on Fundus-AVSeg dataset: combined Dice score of 90.97%, 95% Hausdorff Distance of 3.50 pixels. Reduces vessel fragmentation by approximately 38% relative to baselines, yielding topologically coherent vascular trees.
Conclusion: The proposed topology-aware framework successfully maintains vascular connectivity, enabling reliable automated biomarker quantification for cardiovascular diagnosis from retinal images.
Abstract: Precise segmentation of retinal arteries and veins underpins the diagnosis of systemic cardiovascular conditions. However, standard convolutional architectures often yield topologically disjointed segmentations, characterized by gaps and discontinuities that render reliable graph-based clinical analysis impossible despite high pixel-level accuracy. To address this, we introduce a topology-aware framework engineered to maintain vascular connectivity. Our architecture incorporates a Topological Feature Fusion Module (TFFM) that maps local feature representations into a latent graph space, deploying Graph Attention Networks to capture global structural dependencies often missed by fixed receptive fields. Furthermore, we drive the learning process with a hybrid objective function, coupling Tversky loss for class imbalance with soft clDice loss to explicitly penalize topological disconnects. Evaluation on the Fundus-AVSeg dataset reveals state-of-the-art performance, achieving a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels. Notably, our method decreases vessel fragmentation by approximately 38% relative to baselines, yielding topologically coherent vascular trees viable for automated biomarker quantification. We open-source our code at https://tffm-module.github.io/.
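For readers who want to see how such a hybrid objective fits together, below is a compact PyTorch sketch combining a Tversky loss with a soft clDice term built on soft skeletonization (Shit et al.'s clDice formulation). The loss weighting, skeletonization depth, and alpha/beta values are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the hybrid objective: Tversky loss plus a soft clDice term.
import torch
import torch.nn.functional as F

def soft_erode(x):   return -F.max_pool2d(-x, 3, stride=1, padding=1)
def soft_dilate(x):  return F.max_pool2d(x, 3, stride=1, padding=1)

def soft_skeleton(x, iters=10):
    # Iterative morphological thinning with min/max pooling (differentiable).
    skel = F.relu(x - soft_dilate(soft_erode(x)))
    for _ in range(iters):
        x = soft_erode(x)
        skel = skel + F.relu(x - soft_dilate(soft_erode(x))) * (1.0 - skel)
    return skel

def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-6):
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def soft_cldice_loss(pred, target, eps=1e-6):
    sp, st = soft_skeleton(pred), soft_skeleton(target)
    tprec = ((sp * target).sum() + eps) / (sp.sum() + eps)  # topology precision
    tsens = ((st * pred).sum() + eps) / (st.sum() + eps)    # topology sensitivity
    return 1.0 - 2.0 * tprec * tsens / (tprec + tsens)

def hybrid_loss(pred, target, lam=0.5):
    """pred, target: (B, 1, H, W) probabilities in [0, 1]."""
    return (1 - lam) * tversky_loss(pred, target) + lam * soft_cldice_loss(pred, target)
```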
[128] GTFMN: Guided Texture and Feature Modulation Network for Low-Light Image Enhancement and Super-Resolution
Yongsong Huang, Tzu-Hsuan Peng, Tomo Miyazaki, Xiaofeng Liu, Chun-Ting Chou, Ai-Chun Pang, Shinichiro Omachi
Main category: cs.CV
TL;DR: GTFMN is a novel framework for low-light image super-resolution that decouples the problem into illumination estimation and texture restoration using guided modulation.
Details
Motivation: Low-light image super-resolution is challenging due to the coupled degradation of low resolution and poor illumination. Existing methods struggle with this dual problem, requiring a solution that can handle both illumination correction and detail restoration simultaneously.
Method: Proposes Guided Texture and Feature Modulation Network (GTFMN) with two streams: Illumination Stream predicts a spatially varying illumination map, and Texture Stream uses Illumination Guided Modulation Blocks (IGM Blocks) to dynamically modulate features based on the illumination map for spatially adaptive restoration.
Result: GTFMN achieves state-of-the-art performance on OmniNormal5 and OmniNormal15 datasets, outperforming competing methods in both quantitative metrics and visual quality.
Conclusion: The proposed decoupling approach with illumination-guided modulation effectively addresses the coupled degradation in low-light super-resolution, enabling spatially adaptive enhancement that intensifies poorly lit regions while preserving details in well-exposed areas.
Abstract: Low-light image super-resolution (LLSR) is a challenging task due to the coupled degradation of low resolution and poor illumination. To address this, we propose the Guided Texture and Feature Modulation Network (GTFMN), a novel framework that decouples the LLSR task into two sub-problems: illumination estimation and texture restoration. First, our network employs a dedicated Illumination Stream whose purpose is to predict a spatially varying illumination map that accurately captures lighting distribution. Further, this map is utilized as an explicit guide within our novel Illumination Guided Modulation Block (IGM Block) to dynamically modulate features in the Texture Stream. This mechanism achieves spatially adaptive restoration, enabling the network to intensify enhancement in poorly lit regions while preserving details in well-exposed areas. Extensive experiments demonstrate that GTFMN achieves the best performance among competing methods on the OmniNormal5 and OmniNormal15 datasets, outperforming them in both quantitative metrics and visual quality.
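A minimal sketch of what illumination-guided feature modulation can look like: a FiLM-style per-pixel scale and shift predicted from the illumination map. The layer sizes and exact block design are assumptions, not the paper's actual IGM Block.

```python
# Sketch: illumination map -> per-pixel (scale, shift) -> modulated features.
import torch
import torch.nn as nn

class IlluminationGuidedModulation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Map the 1-channel illumination map to per-pixel scale and shift.
        self.to_scale_shift = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 3, padding=1),
        )

    def forward(self, feat: torch.Tensor, illum: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) texture features; illum: (B, 1, H, W) in [0, 1].
        scale, shift = self.to_scale_shift(illum).chunk(2, dim=1)
        return feat * (1.0 + scale) + shift  # spatially adaptive modulation

# Usage: out = IlluminationGuidedModulation(64)(features, illumination_map)
```

The spatial conditioning is the key design choice: dark regions can receive stronger amplification while well-exposed regions pass through nearly unchanged.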
[129] SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing
Lifan Jiang, Boxi Wu, Yuhang Pei, Tianrun Wu, Yongyuan Chen, Yan Zhao, Shiyu Yu, Deng Cai
Main category: cs.CV
TL;DR: SNR-Edit is a training-free framework for inversion-free image editing that corrects latent trajectory drift via adaptive noise control, improving structural preservation without model tuning.
Details
Motivation: Existing inversion-free image editing approaches using flow-based models rely on fixed Gaussian noise for source trajectory construction, which leads to biased trajectory dynamics causing structural degradation and quality loss in edited images.
Method: SNR-Edit introduces structure-aware noise rectification that injects segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image’s implicit inversion position. This reduces trajectory drift during source-target transport without requiring model training or inversion.
Result: Evaluations on SD3 and FLUX models using PIE-Bench and SNR-Bench show SNR-Edit achieves strong performance on pixel-level metrics and VLM-based scoring while adding only about 1 second overhead per image.
Conclusion: SNR-Edit provides an effective training-free solution for inversion-free image editing that addresses trajectory drift issues through adaptive noise control, enabling high-fidelity structural preservation with minimal computational overhead.
Abstract: Inversion-free image editing using flow-based generative models challenges the prevailing inversion-based pipelines. However, existing approaches rely on fixed Gaussian noise to construct the source trajectory, leading to biased trajectory dynamics and causing structural degradation or quality loss. To address this, we introduce SNR-Edit, a training-free framework achieving faithful Latent Trajectory Correction via adaptive noise control. Mechanistically, SNR-Edit uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image’s implicit inversion position and reducing trajectory drift during source–target transport. This lightweight modification yields smoother latent trajectories and ensures high-fidelity structural preservation without requiring model tuning or inversion. Across SD3 and FLUX, evaluations on PIE-Bench and SNR-Bench show that SNR-Edit delivers strong performance on pixel-level metrics and VLM-based scoring, while adding only about 1s overhead per image.
[130] Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP
Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: CSR is an efficient test-time defense that uses spectral-guided contrastive optimization to realign adversarial examples with natural data manifolds, achieving superior robustness against strong attacks with minimal inference overhead.
Details
Motivation: Current test-time defenses for vision-language models lack sufficient robustness against strong adversarial attacks, suffer from high inference latency, and have limited task-specific applicability. The paper aims to address these limitations by leveraging insights about adversarial examples' spectral properties.
Method: CSR (Contrastive Spectral Rectification) optimizes a rectification perturbation to realign adversarial inputs with the natural data manifold using a spectral-guided contrastive objective. It exploits the observation that adversarial examples exhibit severe feature inconsistency under progressive frequency attenuation due to the model’s inherent spectral bias.
Result: CSR outperforms state-of-the-art methods by an average of 18.1% against strong AutoAttack across 16 classification benchmarks, while maintaining modest inference overhead. It also demonstrates broad applicability across diverse visual tasks.
Conclusion: CSR provides an effective and efficient test-time defense against adversarial attacks on vision-language models by leveraging spectral properties and contrastive optimization, offering superior robustness with practical inference efficiency.
Abstract: Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model’s inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.
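To make the "feature inconsistency under progressive frequency attenuation" observation concrete, here is a small probe sketch: it low-passes an image at several cutoffs and tracks how the encoder's features drift. The cutoff schedule and the `encode` stand-in are assumptions; CSR builds its contrastive rectification objective on this kind of signal, not on this exact probe.

```python
# Sketch: frequency-attenuation consistency profile for an image encoder.
import torch
import torch.nn.functional as F

def low_pass(img: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Keep only the lowest `keep_frac` of spatial frequencies. img: (B, C, H, W)."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    _, _, H, W = img.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H, device=img.device),
        torch.linspace(-1, 1, W, device=img.device),
        indexing="ij",
    )
    mask = ((yy**2 + xx**2).sqrt() <= keep_frac).to(img.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

def consistency_profile(img, encode, cutoffs=(0.8, 0.6, 0.4, 0.2)):
    """encode: any image encoder returning (B, D) features, e.g. CLIP's."""
    f0 = F.normalize(encode(img), dim=-1)
    sims = []
    for c in cutoffs:
        f = F.normalize(encode(low_pass(img, c)), dim=-1)
        sims.append((f0 * f).sum(dim=-1))  # per-image cosine similarity
    # Adversarial inputs tend to show a sharper similarity drop-off.
    return torch.stack(sims, dim=-1)
```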
[131] UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection
Fuxiang Sun, Xi Jiang, Jiansheng Wu, Haigang Zhang, Feng Zheng, Jinfeng Yang
Main category: cs.CV
TL;DR: UniPCB is the first unified vision-language benchmark for PCB quality inspection, addressing MLLMs’ limitations in complex industrial scenarios, with PCB-GPT model achieving state-of-the-art performance.
Details
Motivation: Current MLLMs struggle with PCB inspection due to densely packed components, complex wiring, and subtle defects requiring domain expertise. There's no unified benchmark for evaluating MLLMs on PCB tasks due to fragmented datasets and inconsistent standards.
Method: 1) Created UniPCB benchmark via systematic pipeline curating/standardizing data from disparate sources across three annotated scenarios. 2) Developed PCB-GPT MLLM trained on new instruction dataset generated by the pipeline. 3) Used novel progressive curriculum mimicking human expert learning process.
Result: PCB-GPT establishes new baseline on UniPCB benchmark, more than doubling performance on fine-grained defect localization compared to strongest competitors. Shows significant advantages in localization and analysis tasks.
Conclusion: UniPCB fills critical gap in PCB inspection evaluation, PCB-GPT demonstrates MLLMs can excel in specialized industrial domains with proper domain adaptation. Resources (instruction data, benchmark, model) will be released to advance research.
Abstract: Multimodal Large Language Models (MLLMs) show promise for general industrial quality inspection, but fall short in complex scenarios, such as Printed Circuit Board (PCB) inspection. PCB inspection poses unique challenges due to densely packed components, complex wiring structures, and subtle defect patterns that require specialized domain expertise. However, a high-quality, unified vision-language benchmark for quantitatively evaluating MLLMs across PCB inspection tasks remains absent, stemming not only from limited data availability but also from fragmented datasets and inconsistent standardization. To fill this gap, we propose UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection. UniPCB is built via a systematic pipeline that curates and standardizes data from disparate sources across three annotated scenarios. Furthermore, we introduce PCB-GPT, an MLLM trained on a new instruction dataset generated by this pipeline, utilizing a novel progressive curriculum that mimics the learning process of human experts. Evaluations on the UniPCB benchmark show that while existing MLLMs falter on domain-specific tasks, PCB-GPT establishes a new baseline. Notably, it more than doubles the performance on fine-grained defect localization compared to the strongest competitors, with significant advantages in localization and analysis. We will release the instruction data, benchmark, and model to facilitate future research.
[132] Towards Pixel-Level VLM Perception via Simple Points Prediction
Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu, Zaida Zhou, Zhiqi Huang, Yiping Bao, Y. Charles, Xinyu Zhou, Limin Wang
Main category: cs.CV
TL;DR: SimpleSeg enables MLLMs to perform pixel-level segmentation by predicting point sequences as textual coordinates, using a two-stage training pipeline with reinforcement learning for refinement.
Details
Motivation: To demonstrate that MLLMs have inherent low-level perception capabilities that can be unlocked without specialized architectures, challenging the prevailing need for complex task-specific designs.
Method: Reframes segmentation as sequence generation where MLLMs directly predict point sequences (textual coordinates) of object boundaries. Uses two-stage SF→RL training: Supervised Fine-tuning followed by Reinforcement Learning with IoU-based reward to refine point sequences.
Result: Achieves performance comparable to or surpassing methods with complex task-specific designs on segmentation benchmarks, showing MLLMs’ strong inherent capacity for low-level perception.
Conclusion: Precise spatial understanding can emerge from simple point prediction, challenging the need for auxiliary components and paving the way for more unified and capable vision-language models.
Abstract: We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SF→RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work lays out that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: https://simpleseg.github.io/
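A minimal sketch of the IoU-based reward underlying the SF→RL stage: parse the generated textual coordinates, rasterize them into a mask, and score against ground truth. The `(x, y)` serialization format is an assumption about how such point sequences might be written out.

```python
# Sketch: point-sequence text -> polygon mask -> IoU reward.
import re
import numpy as np
import cv2

def parse_points(text: str) -> np.ndarray:
    """Extract '(x, y)' pairs, e.g. '(12,40) (55,48) ...' -> (N, 2) int32 array."""
    pairs = re.findall(r"\((\d+)\s*,\s*(\d+)\)", text)
    return np.array(pairs, dtype=np.int32)

def iou_reward(text: str, gt_mask: np.ndarray) -> float:
    pts = parse_points(text)
    if len(pts) < 3:
        return 0.0  # fewer than 3 points cannot form a polygon
    pred = np.zeros_like(gt_mask, dtype=np.uint8)
    cv2.fillPoly(pred, [pts], 1)
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return float(inter) / float(union + 1e-9)

# Usage: reward = iou_reward("(12,40) (55,48) (60,90) (10,85)", gt_mask)
```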
[133] TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment
Jiarun Liu, Qifeng Chen, Yiru Zhao, Minghua Liu, Baorui Ma, Sheng Yang
Main category: cs.CV
TL;DR: TIGaussian is a framework that uses 3D Gaussian Splatting characteristics to improve cross-modality alignment between 3D data (point clouds, 3D Gaussians) and other modalities (text, images) for 3D-related tasks.
Details
Motivation: While visual-language models have connected text and images, incorporating 3D modality data enables pretraining for 3D tasks. However, challenges remain in extracting 3D features and bridging gaps between different modalities.
Method: Proposes TIGaussian with: 1) Multi-branch 3DGS tokenizer that decouples intrinsic properties of 3DGS structures into compact latent representations; 2) Bidirectional cross-modal alignment strategies including multi-view feature fusion using diffusion priors for image-3D alignment, and text-3D projection module for text-3D alignment.
Result: Extensive experiments on various datasets demonstrate state-of-the-art performance in multiple tasks.
Conclusion: TIGaussian effectively bridges modality gaps and enables better cross-modality alignment for 3D-related tasks through specialized 3DGS tokenization and alignment strategies.
Abstract: While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between different modalities, we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment through multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop bidirectional cross-modal alignment strategies: a multi-view feature fusion mechanism that leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, and a text-3D projection module that adaptively maps 3D features to the text embedding space for better text-3D alignment. Extensive experiments on various datasets demonstrate the state-of-the-art performance of TIGaussian in multiple tasks.
[134] Handcrafted Feature Fusion for Reliable Detection of AI-Generated Images
Syed Mehedi Hasan Nirob, Moqsadur Rahman, Shamim Ehsan, Summit Haque
Main category: cs.CV
TL;DR: Systematic evaluation of handcrafted image features for synthetic image detection shows LightGBM with mixed features achieves best performance, highlighting continued relevance of interpretable feature engineering.
Details
Motivation: With generative models creating highly realistic synthetic images, there's an urgent need for reliable fake content detection. While deep learning dominates, handcrafted features offer advantages in interpretability, efficiency, and generalizability.
Method: Systematic evaluation of seven handcrafted descriptors (raw pixels, color histograms, DCT, HOG, LBP, GLCM, wavelet features) on CIFAKE dataset (50k train, 10k test). Benchmarking seven classifiers from Logistic Regression to gradient-boosted ensembles (LightGBM, XGBoost, CatBoost) across three configurations: baseline, advanced, and mixed features.
Result: LightGBM consistently outperformed alternatives, achieving PR-AUC 0.9879, ROC-AUC 0.9878, F1 0.9447, and Brier score 0.0414 with mixed features. Performance improved monotonically across configurations, with mixed features yielding substantial benefits over simpler descriptors.
Conclusion: Carefully engineered handcrafted features combined with ensemble learning remain highly relevant for synthetic image detection, particularly when interpretability and computational efficiency are critical considerations.
Abstract: The rapid progress of generative models has enabled the creation of highly realistic synthetic images, raising concerns about authenticity and trust in digital media. Detecting such fake content reliably is an urgent challenge. While deep learning approaches dominate current literature, handcrafted features remain attractive for their interpretability, efficiency, and generalizability. In this paper, we conduct a systematic evaluation of handcrafted descriptors, including raw pixels, color histograms, Discrete Cosine Transform (DCT), Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gray-Level Co-occurrence Matrix (GLCM), and wavelet features, on the CIFAKE dataset of real versus synthetic images. Using 50,000 training and 10,000 test samples, we benchmark seven classifiers ranging from Logistic Regression to advanced gradient-boosted ensembles (LightGBM, XGBoost, CatBoost). Results demonstrate that LightGBM consistently outperforms alternatives, achieving PR-AUC 0.9879, ROC-AUC 0.9878, F1 0.9447, and a Brier score of 0.0414 with mixed features, representing strong gains in calibration and discrimination over simpler descriptors. Across three configurations (baseline, advanced, mixed), performance improves monotonically, confirming that combining diverse handcrafted features yields substantial benefit. These findings highlight the continued relevance of carefully engineered features and ensemble learning for detecting synthetic images, particularly in contexts where interpretability and computational efficiency are critical.
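As a rough sketch of the mixed-feature configuration, the snippet below concatenates a few of the listed descriptors and feeds them to LightGBM; the bin counts, HOG cell sizes, and LBP parameters are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch: mixed handcrafted features (color histogram + HOG + LBP) -> LightGBM.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog, local_binary_pattern
from lightgbm import LGBMClassifier

def extract_features(img: np.ndarray) -> np.ndarray:
    """img: (H, W, 3) uint8 RGB -> 1-D mixed feature vector."""
    gray = rgb2gray(img)                       # float image in [0, 1]
    gray_u8 = (gray * 255).astype(np.uint8)    # integer image for LBP
    color_hist = np.concatenate(
        [np.histogram(img[..., c], bins=32, range=(0, 255))[0] for c in range(3)]
    )
    hog_feat = hog(gray, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    lbp = local_binary_pattern(gray_u8, P=8, R=1, method="uniform")
    lbp_hist = np.histogram(lbp, bins=10, range=(0, 10))[0]  # uniform LBP: 10 codes
    return np.concatenate([color_hist, hog_feat, lbp_hist]).astype(np.float32)

# Usage sketch:
# X = np.stack([extract_features(im) for im in images]); y = labels
# clf = LGBMClassifier(n_estimators=500).fit(X, y)
```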
[135] A Multi-View Consistency Framework with Semi-Supervised Domain Adaptation
Yuting Hong, Li Dong, Xiaojie Qiu, Hui Xiao, Baochen Yao, Siming Zheng, Chengbin Peng
Main category: cs.CV
TL;DR: A multi-view consistency framework for Semi-Supervised Domain Adaptation that uses debiasing strategies and pseudo-negative labels to address class similarity issues in target domains, with cross-domain affinity learning for feature alignment.
Details
Motivation: In SSDA, limited labeled target samples cause intrinsic class similarity in feature space, leading to biased predictions even with balanced training data.
Method: Multi-view consistency framework with two training views: 1) debiasing strategy correcting class-wise probabilities based on model performance, 2) pseudo-negative labels from model predictions, plus cross-domain affinity learning for same-class feature alignment.
Result: Outperforms competing methods on DomainNet and Office-Home datasets.
Conclusion: Combining unsupervised domain adaptation with semi-supervised learning enhances model adaptability, reduces annotation costs, and improves performance for industrial applications.
Abstract: Semi-Supervised Domain Adaptation (SSDA) leverages knowledge from a fully labeled source domain to classify data in a partially labeled target domain. Due to the limited number of labeled samples in the target domain, there can be intrinsic similarity of classes in the feature space, which may result in biased predictions, even when the model is trained on a balanced dataset. To overcome this limitation, we introduce a multi-view consistency framework, which includes two views for training strongly augmented data. One is a debiasing strategy for correcting class-wise prediction probabilities according to the prediction performance of the model. The other involves leveraging pseudo-negative labels derived from the model predictions. Furthermore, we introduce a cross-domain affinity learning aimed at aligning features of the same class across different domains, thereby enhancing overall performance. Experimental results demonstrate that our method outperforms the competing methods on two standard domain adaptation datasets, DomainNet and Office-Home. Combining unsupervised domain adaptation with semi-supervised learning offers practical value for industrial applications by enhancing model adaptability, reducing annotation costs, and improving performance.
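A minimal sketch of how the two training views could be realized: debiasing class-wise probabilities with a running estimate of the model's class bias, and taking the least likely class as a pseudo-negative label. The EMA debiasing rule is an assumption; the paper conditions its correction on the model's class-wise prediction performance, which this sketch only approximates.

```python
# Sketch: debiased pseudo-positive labels + pseudo-negative labels.
import torch

class DebiasedPseudoLabeler:
    def __init__(self, num_classes: int, momentum: float = 0.99):
        self.p_model = torch.full((num_classes,), 1.0 / num_classes)
        self.m = momentum

    def __call__(self, logits: torch.Tensor):
        probs = torch.softmax(logits, dim=-1)  # (B, C)
        # Running estimate of the model's average class probabilities.
        self.p_model = self.m * self.p_model + (1 - self.m) * probs.mean(0).detach()
        debiased = probs / self.p_model        # down-weight over-predicted classes
        debiased = debiased / debiased.sum(-1, keepdim=True)
        pseudo_pos = debiased.argmax(-1)       # view 1: debiased pseudo-label
        pseudo_neg = probs.argmin(-1)          # view 2: confident "not this class"
        return pseudo_pos, pseudo_neg
```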
[136] ProMist-5K: A Comprehensive Dataset for Digital Emulation of Cinematic Pro-Mist Filter Effects
Yingtie Lei, Zimeng Li, Chi-Man Pun, Wangyu Wu, Junke Yang, Xuhang Chen
Main category: cs.CV
TL;DR: ProMist-5K is a dataset of 20,000 high-resolution image pairs for emulating Pro-Mist filter effects, covering different filter densities and focal lengths to support cinematic style transformation models.
Details
Motivation: Pro-Mist filters create unique cinematic effects (soft halation, lower contrast, atmospheric style) that are difficult to reproduce digitally due to complex light diffusion behavior. There's a need for a physically grounded dataset to bridge digital flexibility with traditional lens aesthetics.
Method: Built using a physically inspired pipeline in scene-referred linear space with 20,000 image pairs across four configurations (two filter densities: 1/2 and 1/8; two focal lengths: 20mm and 50mm). Uses multiple blur layers and carefully tuned weighting to model varying intensity and spread of optical diffusion.
Result: The dataset provides a consistent and controllable target domain that works well across different training settings and image translation models, helping capture both subtle and strong cinematic appearances.
Conclusion: ProMist-5K offers a practical, physically grounded resource for film-inspired image transformation, bridging the gap between digital flexibility and traditional lens aesthetics, with the dataset publicly available on Kaggle.
Abstract: Pro-Mist filters are widely used in cinematography for their ability to create soft halation, lower contrast, and produce a distinctive, atmospheric style. These effects are difficult to reproduce digitally due to the complex behavior of light diffusion. We present ProMist-5K, a dataset designed to support cinematic style emulation. It is built using a physically inspired pipeline in a scene-referred linear space and includes 20,000 high-resolution image pairs across four configurations, covering two filter densities (1/2 and 1/8) and two focal lengths (20mm and 50mm). Unlike general style datasets, ProMist-5K focuses on realistic glow and highlight diffusion effects. Multiple blur layers and carefully tuned weighting are used to model the varying intensity and spread of optical diffusion. The dataset provides a consistent and controllable target domain that supports various image translation models and learning paradigms. Experiments show that the dataset works well across different training settings and helps capture both subtle and strong cinematic appearances. ProMist-5K offers a practical and physically grounded resource for film-inspired image transformation, bridging the gap between digital flexibility and traditional lens aesthetics. The dataset is available at https://www.kaggle.com/datasets/yingtielei/promist5k.
[137] Beyond Shadows: A Large-Scale Benchmark and Multi-Stage Framework for High-Fidelity Facial Shadow Removal
Tailong Luo, Jiesong Bai, Jinyang Huang, Junyu Xia, Wangyu Wu, Xuhang Chen
Main category: cs.CV
TL;DR: First large-scale real-world facial shadow removal dataset (ASFW) with 1,081 paired shadow/shadow-free images created via professional Photoshop workflow, plus Face Shadow Eraser method for improved real-world performance.
Details
Motivation: Facial shadows degrade image quality and vision algorithm performance. Existing methods struggle with texture preservation under complex lighting and lack real-world paired datasets for training.
Method: Created ASFW dataset using professional Photoshop workflow for photorealistic shadow variations and accurate ground truths. Introduced Face Shadow Eraser (FSE) method to demonstrate dataset effectiveness.
Result: ASFW bridges synthetic-real domain gap. Deep models trained on ASFW show improved shadow removal in real-world conditions. Experiments demonstrate enhanced performance and new standards for facial shadow removal.
Conclusion: ASFW dataset enables better facial shadow removal in real-world scenarios, addressing previous limitations of synthetic datasets and texture preservation challenges.
Abstract: Facial shadows often degrade image quality and the performance of vision algorithms. Existing methods struggle to remove shadows while preserving texture, especially under complex lighting conditions, and they lack real-world paired datasets for training. We present the Augmented Shadow Face in the Wild (ASFW) dataset, the first large-scale real-world dataset for facial shadow removal, containing 1,081 paired shadow and shadow-free images created via a professional Photoshop workflow. ASFW offers photorealistic shadow variations and accurate ground truths, bridging the gap between synthetic and real domains. Deep models trained on ASFW demonstrate improved shadow removal in real-world conditions. We also introduce the Face Shadow Eraser (FSE) method to showcase the effectiveness of the dataset. Experiments demonstrate that ASFW enhances the performance of facial shadow removal models, setting new standards for this task.
[138] Instance-Guided Radar Depth Estimation for 3D Object Detection
Chen-Chou Lo, Patrick Vandewalle
Main category: cs.CV
TL;DR: InstaRadar enhances Radar density using instance segmentation, integrated with RCDPT depth estimation to improve monocular 3D object detection in autonomous driving.
Details
Motivation: Monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness in challenging conditions, while Radar provides resilience but has sparsity and low resolution limitations, motivating effective Radar-camera fusion.
Method: Two key components: 1) InstaRadar - instance segmentation-guided expansion method using pre-trained segmentation masks to enhance Radar density and semantic alignment; 2) Integration of pre-trained RCDPT into BEVDepth framework as replacement for its depth module.
Result: InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, and the RCDPT integration consistently improves 3D detection performance with steady gains over baseline BEVDepth model.
Conclusion: The framework demonstrates effectiveness of InstaRadar and advantage of explicit depth supervision in 3D object detection, though lags behind Radar-camera fusion models that directly extract BEV features, highlighting potential for improvement through dedicated Radar branch with temporal cues.
Abstract: Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar provides complementary advantages such as resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates the need for effective Radar-camera fusion with improved preprocessing and depth estimation strategies. We propose an end-to-end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation-guided expansion method that leverages pre-trained segmentation masks to enhance Radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, showing its effectiveness in generating high-quality depth features. Second, we integrate the pre-trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar-enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Overall, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the advantage of explicit depth supervision in 3D object detection. Although the framework lags behind Radar-camera fusion models that directly extract BEV features, since Radar serves only as guidance rather than an independent feature stream, this limitation highlights potential for improvement. Future work will extend InstaRadar to point cloud-like representations and integrate a dedicated Radar branch with temporal cues for enhanced BEV fusion.
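A rough sketch of instance-mask-guided radar densification in the spirit of InstaRadar: each projected radar return is expanded across the instance mask it lands in, keeping the nearest return where instances overlap. The data layout and the nearest-return rule are assumptions for illustration.

```python
# Sketch: expand sparse radar depths over instance segmentation masks.
import numpy as np

def densify_radar(radar_uvd: np.ndarray, masks: np.ndarray, H: int, W: int):
    """
    radar_uvd: (N, 3) array of (u, v, depth) radar points projected into the image.
    masks:     (M, H, W) boolean instance segmentation masks.
    Returns a dense (H, W) depth map; 0 where no radar evidence exists.
    """
    depth = np.zeros((H, W), dtype=np.float32)
    for u, v, d in radar_uvd:
        u, v = int(round(u)), int(round(v))
        if not (0 <= u < W and 0 <= v < H):
            continue
        hits = [m for m in masks if m[v, u]]
        if hits:
            region = hits[0]  # expand the return over the whole instance
            # Keep the nearest return where instances overlap.
            depth[region] = np.where(
                (depth[region] == 0) | (depth[region] > d), d, depth[region]
            )
        else:
            depth[v, u] = d  # no mask hit: keep the raw sparse point
    return depth
```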
[139] Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
Zichen Wen, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han, Junlong Ke, Cong Wang, Yicheng Fu, Jiawang Zhao, Jiangchao Yao, Xi Fang, Zhen Wang, Henxing Cai, Lin Yao, Zhifeng Gao, Yanhui Hong, Nang Yuan, Yixuan Li, Guojiang Zhao, Haoyi Tao, Nan Wang, Han Lyu, Guolin Ke, Ning Liao, Xiaoxing Wang, Kai Chen, Zhiyu Li, Feiyu Xiong, Sihan Hu, Kun Chen, Yanfeng Wang, Weinan E, Linfeng Zhang
Main category: cs.CV
TL;DR: Innovator-VL is a scientific multimodal LLM that achieves strong scientific reasoning with data-efficient training and transparent methodology, maintaining general vision capabilities without massive domain-specific pretraining.
Details
Motivation: To advance scientific understanding and reasoning across diverse domains while challenging the trend of relying on massive domain-specific pretraining and opaque pipelines. The goal is to demonstrate that principled training design and transparent methodology can yield strong scientific intelligence with reduced data requirements.
Method: Provides a fully transparent, end-to-end reproducible training pipeline covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation. Uses fewer than five million curated samples without large-scale pretraining, emphasizing principled data selection over indiscriminate scaling.
Result: Achieves competitive performance on various scientific tasks with remarkable data efficiency, demonstrates strong generalization across general vision, multimodal reasoning, and scientific benchmarks, and shows that scientific alignment can be integrated without compromising general-purpose capabilities.
Conclusion: Efficient, reproducible, and high-performing scientific multimodal models can be built without large-scale data, providing a practical foundation for future research that prioritizes principled methodology over brute-force scaling.
Abstract: We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.
[140] Pareto-Guided Optimization for Uncertainty-Aware Medical Image Segmentation
Jinming Zhang, Xi Yang, Youpeng Yang, Haosen Shi, Yuyao Yan, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang
Main category: cs.CV
TL;DR: Proposes a region-wise curriculum learning strategy with Pareto-consistent loss and fuzzy labeling to address non-uniform uncertainty in medical image segmentation, particularly boundary ambiguity.
Details
Motivation: Medical image segmentation uncertainty is non-uniform (higher at boundaries), conventional training treats all pixels equally causing unstable optimization, and this instability hinders convergence to Pareto-optimal solutions.
Method: 1) Region-wise curriculum strategy prioritizing certain regions first, gradually incorporating uncertain ones; 2) Pareto-consistent loss balancing regional uncertainties by adaptively reshaping loss landscape; 3) Fuzzy labeling mechanism maintaining binary confidence in non-boundary areas while enabling smooth transitions near boundaries.
Result: Experiments on brain metastasis and non-metastatic tumor segmentation show consistent improvements across multiple configurations, outperforming traditional crisp-set approaches in all tumor subregions.
Conclusion: The proposed approach effectively addresses non-uniform uncertainty in medical segmentation through curriculum learning, Pareto optimization, and fuzzy labeling, leading to more stable optimization and better performance.
Abstract: Uncertainty in medical image segmentation is inherently non-uniform, with boundary regions exhibiting substantially higher ambiguity than interior areas. Conventional training treats all pixels equally, leading to unstable optimization during early epochs when predictions are unreliable. We argue that this instability hinders convergence toward Pareto-optimal solutions and propose a region-wise curriculum strategy that prioritizes learning from certain regions and gradually incorporates uncertain ones, reducing gradient variance. Methodologically, we introduce a Pareto-consistent loss that balances trade-offs between regional uncertainties by adaptively reshaping the loss landscape and constraining convergence dynamics between interior and boundary regions; this guides the model toward Pareto-approximate solutions. To address boundary ambiguity, we further develop a fuzzy labeling mechanism that maintains binary confidence in non-boundary areas while enabling smooth transitions near boundaries, stabilizing gradients, and expanding flat regions in the loss surface. Experiments on brain metastasis and non-metastatic tumor segmentation show consistent improvements across multiple configurations, with our method outperforming traditional crisp-set approaches in all tumor subregions.
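A minimal sketch of a fuzzy labeling mechanism of the kind described: labels stay binary away from the boundary and decay smoothly inside a band around it, here via a signed distance transform. Band width and sigmoid softness are illustrative assumptions.

```python
# Sketch: binary mask -> soft labels that are fuzzy only near the boundary.
import numpy as np
from scipy.ndimage import distance_transform_edt

def fuzzy_labels(binary_mask: np.ndarray, band: float = 3.0) -> np.ndarray:
    """binary_mask: (H, W) in {0, 1} -> soft labels in [0, 1]."""
    # Signed distance: positive inside the object, negative outside.
    inside = distance_transform_edt(binary_mask)
    outside = distance_transform_edt(1 - binary_mask)
    signed = inside - outside
    soft = 1.0 / (1.0 + np.exp(-signed / band))  # smooth transition at the boundary
    # Snap back to binary confidence outside the boundary band.
    soft[signed > band] = 1.0
    soft[signed < -band] = 0.0
    return soft.astype(np.float32)
```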
[141] Establishing dermatopathology encyclopedia DermpathNet with Artificial Intelligence-Based Workflow
Ziyang Xu, Mingquan Lin, Yiliang Zhou, Zihan Xu, Seth J. Orlow, Shane A. Meehan, Alexandra Flamm, Ata S. Moshiri, Yifan Peng
Main category: cs.CV
TL;DR: Researchers created DermpathNet, a large open-access dermatopathology image dataset using a hybrid curation workflow combining deep learning and caption analysis, achieving 90.4% accuracy in image classification.
Details
Motivation: Clinicians and trainees face challenges accessing high-quality dermatopathology image datasets for education, cross-referencing, and machine learning applications.
Method: Hybrid workflow using PubMed Central repository: keyword-based image extraction combined with deep learning image modality classification and figure caption analysis for categorization.
Result: Created DermpathNet with 7,772 images across 166 diagnoses, validated with 90.4% F-score for hybrid approach. Found current OpenAI image analysis inadequate for dermatopathology tasks.
Conclusion: Successfully developed a large, peer-reviewed, open-access dermatopathology dataset with semi-automated curation workflow for educational and research purposes.
Abstract: Accessing high-quality, open-access dermatopathology image datasets for learning and cross-referencing is a common challenge for clinicians and dermatopathology trainees. To establish a comprehensive open-access dermatopathology dataset for educational, cross-referencing, and machine-learning purposes, we employed a hybrid workflow to curate and categorize images from the PubMed Central (PMC) repository. We used specific keywords to extract relevant images, and classified them using a novel hybrid method that combined deep learning-based image modality classification with figure caption analyses. Validation on 651 manually annotated images demonstrated the robustness of our workflow, with an F-score of 89.6% for the deep learning approach, 61.0% for the keyword-based retrieval method, and 90.4% for the hybrid approach. We retrieved over 7,772 images across 166 diagnoses and released this fully annotated dataset, reviewed by board-certified dermatopathologists. Using our dataset as a challenging task, we found the current image analysis algorithm from OpenAI inadequate for analyzing dermatopathology images. In conclusion, we have developed a large, peer-reviewed, open-access dermatopathology image dataset, DermpathNet, which features a semi-automated curation workflow.
[142] Tri-Reader: An Open-Access, Multi-Stage AI Pipeline for First-Pass Lung Nodule Annotation in Screening CT
Fakrul Islam Tushar, Joseph Y. Lo
Main category: cs.CV
TL;DR: Tri-Reader is a free, open-source pipeline for lung cancer screening that combines lung segmentation, nodule detection, and malignancy classification in a three-stage workflow optimized for sensitivity and reduced annotation burden.
Details
Motivation: To create an accessible, comprehensive lung cancer screening tool that addresses the need for sensitive detection while minimizing the workload for human annotators, using publicly available models and datasets.
Method: Developed a tri-stage pipeline integrating lung segmentation, nodule detection, and malignancy classification using multiple open-access models trained on public datasets. Designed to prioritize sensitivity while reducing candidate burden for annotators.
Result: Evaluated on multiple internal and external datasets compared with expert annotations and dataset-provided reference standards to ensure accuracy and generalizability across diverse practices.
Conclusion: Tri-Reader provides a comprehensive, freely available solution for lung cancer screening that balances sensitivity with practical annotation efficiency, validated across diverse clinical settings.
Abstract: Using multiple open-access models trained on public datasets, we developed Tri-Reader, a comprehensive, freely available pipeline that integrates lung segmentation, nodule detection, and malignancy classification into a unified tri-stage workflow. The pipeline is designed to prioritize sensitivity while reducing the candidate burden for annotators. To ensure accuracy and generalizability across diverse practices, we evaluated Tri-Reader on multiple internal and external datasets as compared with expert annotations and dataset-provided reference standards.
[143] Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection
Yao Xiao, Weiyan Chen, Jiahao Chen, Zijie Cao, Weijian Deng, Binbin Yang, Ziyi Dong, Xiangyang Ji, Wei Ke, Pengxu Wei, Liang Lin
Main category: cs.CV
TL;DR: X-AIGD is a new benchmark for explainable AI-generated image detection with pixel-level artifact annotations across three levels: low-level distortions, high-level semantics, and cognitive-level counterfactuals.
Details
Motivation: Current AIGI detection methods lack interpretability and convincing evidence for their decisions due to limited benchmark coverage of artifact diversity and absence of detailed localized annotations.
Method: Introduces X-AIGD benchmark with comprehensive pixel-level, categorized annotations of perceptual artifacts spanning three levels: low-level distortions, high-level semantics, and cognitive-level counterfactuals.
Result: Three key findings: (1) Existing detectors show negligible reliance on perceptual artifacts; (2) While trainable to identify specific artifacts, detectors still heavily use uninterpretable features; (3) Aligning model attention with artifact regions improves interpretability and generalization.
Conclusion: X-AIGD enables fine-grained interpretability evaluation and deeper insight into AIGI detection models, revealing current limitations and providing a path toward more explainable detection systems through artifact-aware attention alignment.
Abstract: Current AI-Generated Image (AIGI) detection approaches predominantly rely on binary classification to distinguish real from synthetic images, often lacking interpretable or convincing evidence to substantiate their decisions. This limitation stems from existing AIGI detection benchmarks, which, despite featuring a broad collection of synthetic images, remain restricted in their coverage of artifact diversity and lack detailed, localized annotations. To bridge this gap, we introduce a fine-grained benchmark towards eXplainable AI-Generated image Detection, named X-AIGD, which provides pixel-level, categorized annotations of perceptual artifacts, spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals. These comprehensive annotations facilitate fine-grained interpretability evaluation and deeper insight into model decision-making processes. Our extensive investigation using X-AIGD provides several key insights: (1) Existing AIGI detectors demonstrate negligible reliance on perceptual artifacts, even at the most basic distortion level. (2) While AIGI detectors can be trained to identify specific artifacts, they still substantially base their judgment on uninterpretable features. (3) Explicitly aligning model attention with artifact regions can increase the interpretability and generalization of detectors. The data and code are available at: https://github.com/Coxy7/X-AIGD.
[144] RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming
Jisheng Chu, Wenrui Li, Rui Zhao, Wangmeng Zuo, Shifeng Chen, Xiaopeng Fan
Main category: cs.CV
TL;DR: RoamScene3D is a novel framework that generates consistent 3D scenes from text by using semantic reasoning and adaptive camera trajectories, overcoming spatial blindness in existing methods.
Details
Motivation: Existing text-to-3D scene generation methods suffer from spatial blindness, rely on predefined trajectories that ignore object relationships, and struggle with 2D inpainting for camera motion holes. There's a need for better semantic understanding and adaptive scene exploration.
Method: Uses a vision-language model to construct scene graphs encoding object relations, guides cameras to perceive object boundaries and plan adaptive roaming trajectories, and introduces a Motion-Injected Inpainting model fine-tuned on synthetic panoramic data with authentic camera trajectories.
Result: Extensive experiments show the method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes through semantic reasoning and geometric constraints.
Conclusion: RoamScene3D successfully bridges semantic guidance with spatial generation, enabling adaptive scene exploration and consistent 3D scene generation from text descriptions.
Abstract: Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at https://github.com/JS-CHU/RoamScene3D.
[145] DSTCS: Dual-Student Teacher Framework with Segment Anything Model for Semi-Supervised Pubic Symphysis Fetal Head Segmentation
Yalin Luo, Shun Long, Huijin Wang, Jieyun Bai
Main category: cs.CV
TL;DR: Proposes DSTCS framework combining CNN and SAM with dual student-teacher architecture for pubic symphysis and fetal head segmentation in ultrasound images, achieving state-of-the-art performance on MICCAI benchmarks.
Details
Motivation: PSFH segmentation is critical for intrapartum monitoring but challenging due to class imbalance, ambiguous boundaries, noise in ultrasound images, and limited annotated data. Current methods rely mainly on CNN/Transformers, leaving more powerful models underexplored.
Method: Dual-Student and Teacher framework combining CNN and SAM (DSTCS) with cooperative learning between branches, specialized data augmentation for boundary processing, and novel loss function.
Result: Extensive experiments on MICCAI 2023 and 2024 PSFH segmentation benchmarks show superior robustness and significantly outperforms existing techniques.
Conclusion: Provides a reliable segmentation tool for clinical practice by effectively addressing PSFH segmentation challenges through integration of SAM with dual student-teacher architecture.
Abstract: Segmentation of the pubic symphysis and fetal head (PSFH) is a critical procedure in intrapartum monitoring and is essential for evaluating labor progression and identifying potential delivery complications. However, achieving accurate segmentation remains a significant challenge due to class imbalance, ambiguous boundaries, and noise interference in ultrasound images, compounded by the scarcity of high-quality annotated data. Current research on PSFH segmentation predominantly relies on CNN and Transformer architectures, leaving the potential of more powerful models underexplored. In this work, we propose a Dual-Student and Teacher framework combining CNN and SAM (DSTCS), which integrates the Segment Anything Model (SAM) into a dual student-teacher architecture. A cooperative learning mechanism between the CNN and SAM branches significantly improves segmentation accuracy. The proposed scheme also incorporates a specialized data augmentation strategy optimized for boundary processing and a novel loss function. Extensive experiments on the MICCAI 2023 and 2024 PSFH segmentation benchmarks demonstrate that our method exhibits superior robustness and significantly outperforms existing techniques, providing a reliable segmentation tool for clinical practice.
[146] Dynamic Worlds, Dynamic Humans: Generating Virtual Human-Scene Interaction Motion in Dynamic Scenes
Yin Wang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li, Xiaohui Liang
Main category: cs.CV
TL;DR: Dyn-HSI is a cognitive architecture for dynamic human-scene interaction that gives virtual humans vision, memory, and control capabilities to handle changing environments.
Details
Motivation: Existing human-scene interaction methods treat scenes as static, which doesn't reflect real-world dynamics where environments continuously change. There's a need for models that can adapt to dynamic scenes.
Method: Dyn-HSI uses a three-component cognitive architecture: 1) Dynamic Scene-Aware Navigation (vision) for perceiving environmental changes and predicting waypoints, 2) Hierarchical Experience Memory (memory) for storing and updating training experiences to enable context-aware motion priming, and 3) Human-Scene Interaction Diffusion Model (control) for generating high-fidelity motions conditioned on multimodal inputs.
Result: The method outperforms existing approaches and generates high-quality human-scene interaction motions in both static and dynamic settings. A new benchmark Dyn-Scenes was created by extending existing static datasets to evaluate dynamic scene performance.
Conclusion: Dyn-HSI successfully addresses the limitation of static scene assumptions in human-scene interaction generation by providing virtual humans with cognitive capabilities to handle dynamic environments, demonstrating superior performance over existing methods.
Abstract: Scenes are continuously undergoing dynamic changes in the real world. However, existing human-scene interaction generation methods typically treat the scene as static, which deviates from reality. Inspired by world models, we introduce Dyn-HSI, the first cognitive architecture for dynamic human-scene interaction, which endows virtual humans with three humanoid components. (1) Vision (human eyes): we equip the virtual human with a Dynamic Scene-Aware Navigation, which continuously perceives changes in the surrounding environment and adaptively predicts the next waypoint. (2) Memory (human brain): we equip the virtual human with a Hierarchical Experience Memory, which stores and updates experiential data accumulated during training. This allows the model to leverage prior knowledge during inference for context-aware motion priming, thereby enhancing both motion quality and generalization. (3) Control (human body): we equip the virtual human with Human-Scene Interaction Diffusion Model, which generates high-fidelity interaction motions conditioned on multimodal inputs. To evaluate performance in dynamic scenes, we extend the existing static human-scene interaction datasets to construct a dynamic benchmark, Dyn-Scenes. We conduct extensive qualitative and quantitative experiments to validate Dyn-HSI, showing that our method consistently outperforms existing approaches and generates high-quality human-scene interaction motions in both static and dynamic settings.
[147] Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation
Yizhao Han, Tianxing Shi, Zhao Wang, Zifan Xu, Zhiyuan Pu, Mingxiao Li, Qian Zhang, Wei Yin, Xiao-Xiao Long
Main category: cs.CV
TL;DR: ENkG sampling adapts candidate sizes based on token entropy to address limitations of static top-k/top-p sampling in video generation, improving long-horizon quality.
Details
Motivation: Static top-k/top-p sampling strategies that work well for LLMs are ineffective for video generation due to fundamental differences: video tokens have low semantic density and high spatio-temporal redundancy, causing either unnecessary randomness in static backgrounds or error compounding in dynamic foregrounds.
Method: Proposes Entropy-Guided k-Guard (ENkG) sampling that adapts token candidate sizes based on token-wise dispersion measured by entropy. Low-entropy regions use fewer candidates to suppress noise, while high-entropy regions use more candidates to mitigate error compounding.
Result: Experiments show consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies. The method is model-agnostic, training-free, and adds negligible overhead.
Conclusion: ENkG sampling effectively addresses the mismatch between LLM sampling strategies and video generation requirements by adapting to token uncertainty, preventing error accumulation in long-horizon video generation.
Abstract: Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution. ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
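A minimal sketch of entropy-adaptive top-k sampling in the spirit of ENkG: the candidate set grows with the normalized entropy of each token's predicted distribution. The linear k_min/k_max mapping is an illustrative assumption, not the paper's exact schedule.

```python
# Sketch: per-token top-k sampling where k scales with prediction entropy.
import torch

def entropy_guided_sample(logits: torch.Tensor, k_min=1, k_max=64) -> torch.Tensor:
    """logits: (B, V) per-token distributions -> (B,) sampled token ids."""
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)     # (B,)
    ent = ent / torch.log(torch.tensor(float(logits.shape[-1])))  # normalize to [0, 1]
    ks = (k_min + (k_max - k_min) * ent).round().long()           # per-token candidate size
    out = torch.empty(logits.shape[0], dtype=torch.long, device=logits.device)
    for i, k in enumerate(ks.tolist()):
        k = min(max(k, 1), probs.shape[-1])
        top_p, top_idx = probs[i].topk(k)
        choice = torch.multinomial(top_p / top_p.sum(), 1)
        out[i] = top_idx[choice]
    return out
```

Low-entropy tokens (static background) collapse toward greedy decoding, suppressing redundant noise, while high-entropy tokens (moving foreground) keep a wider candidate set to avoid locking in early errors.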
[148] Fast Converging 3D Gaussian Splatting for 1-Minute Reconstruction
Ziyu Zhang, Tianle Liu, Diantao Tu, Shuhan Shen
Main category: cs.CV
TL;DR: Fast 3DGS reconstruction pipeline that converges within 1 minute, winning SIGGRAPH Asia 3DGS Fast Reconstruction Challenge with top PSNR of 28.43.
Details
Motivation: To develop a fast 3D Gaussian Splatting (3DGS) reconstruction pipeline that can handle heterogeneous camera pose settings (noisy SLAM vs. accurate COLMAP) while achieving high-fidelity results within a strict one-minute time budget for the SIGGRAPH Asia competition.
Method: Two-stage approach: First round uses reverse per-Gaussian parallel optimization, compact forward splatting, load-balanced tiling, anchor-based Neural-Gaussian representation, initialization from monocular depth and feed-forward 3DGS models, and global pose refinement for noisy SLAM trajectories. Final round disables pose refinement, reverts to standard 3DGS, introduces multi-view consistency-guided Gaussian splitting, and adds depth estimator supervision.
Result: Achieved top performance in the competition with PSNR of 28.43, ranking first place. Successfully demonstrated high-fidelity reconstruction under the strict one-minute time constraint.
Conclusion: The proposed two-stage pipeline effectively handles different camera pose quality scenarios (noisy vs. accurate) and achieves state-of-the-art fast 3DGS reconstruction, winning the SIGGRAPH Asia challenge by balancing speed and quality through adaptive techniques.
Abstract: We present a fast 3DGS reconstruction pipeline designed to converge within one minute, developed for the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge. The challenge consists of an initial round using SLAM-generated camera poses (with noisy trajectories) and a final round using COLMAP poses (highly accurate). To robustly handle these heterogeneous settings, we develop a two-stage solution. In the first round, we use reverse per-Gaussian parallel optimization and compact forward splatting based on Taming-GS and Speedy-splat, load-balanced tiling, an anchor-based Neural-Gaussian representation enabling rapid convergence with fewer learnable parameters, initialization from monocular depth and partially from feed-forward 3DGS models, and a global pose refinement module for noisy SLAM trajectories. In the final round, the accurate COLMAP poses change the optimization landscape; we disable pose refinement, revert from Neural-Gaussians back to standard 3DGS to eliminate MLP inference overhead, introduce multi-view consistency-guided Gaussian splitting inspired by Fast-GS, and introduce a depth estimator to supervise the rendered depth. Together, these techniques enable high-fidelity reconstruction under a strict one-minute budget. Our method achieved the top performance with a PSNR of 28.43 and ranked first in the competition.
[149] Cortex-Grounded Diffusion Models for Brain Image Generation
Fabian Bongratz, Yitong Li, Sama Elbaroudy, Christian Wachinger
Main category: cs.CV
TL;DR: Cor2Vox is a cortex-grounded generative framework for brain MRI synthesis that uses cortical surface priors to guide a 3D shape-to-image diffusion process, enabling anatomically faithful synthesis with precise control over brain anatomy.
Details
Motivation: Existing generative models for neuroimaging data rely on weak conditioning signals (labels/text) that lack anatomical grounding and produce biologically implausible outputs. Synthetic data could address limitations of real datasets including scarcity of rare phenotypes, domain shifts across scanners, and insufficient longitudinal coverage.
Method: Cor2Vox leverages high-resolution cortical surfaces to guide a 3D shape-to-image Brownian bridge diffusion process. It uses a large-scale statistical shape model of cortical morphology derived from over 33,000 UK Biobank scans to support generation of new, realistic brain shapes.
Result: Cor2Vox outperformed baseline methods on traditional image quality metrics, cortical surface reconstruction, and whole-brain segmentation quality. It preserved fine-grained cortical morphology at sub-voxel level and showed robustness to variations in cortical geometry and disease phenotypes without retraining.
Conclusion: Cor2Vox enables anatomically consistent brain MRI synthesis with precise control over underlying anatomies, demonstrating effectiveness in applications including progressive gray matter atrophy simulation and harmonization of frontotemporal dementia scans with public datasets.
Abstract: Synthetic neuroimaging data can mitigate critical limitations of real-world datasets, including the scarcity of rare phenotypes, domain shifts across scanners, and insufficient longitudinal coverage. However, existing generative models largely rely on weak conditioning signals, such as labels or text, which lack anatomical grounding and often produce biologically implausible outputs. To this end, we introduce Cor2Vox, a cortex-grounded generative framework for brain magnetic resonance image (MRI) synthesis that ties image generation to continuous structural priors of the cerebral cortex. It leverages high-resolution cortical surfaces to guide a 3D shape-to-image Brownian bridge diffusion process, enabling topologically faithful synthesis and precise control over underlying anatomies. To support the generation of new, realistic brain shapes, we developed a large-scale statistical shape model of cortical morphology derived from over 33,000 UK Biobank scans. We validated the fidelity of Cor2Vox based on traditional image quality metrics, advanced cortical surface reconstruction, and whole-brain segmentation quality, outperforming many baseline methods. Across three applications, namely (i) anatomically consistent synthesis, (ii) simulation of progressive gray matter atrophy, and (iii) harmonization of in-house frontotemporal dementia scans with public datasets, Cor2Vox preserved fine-grained cortical morphology at the sub-voxel level, exhibiting remarkable robustness to variations in cortical geometry and disease phenotype without retraining.
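The Brownian bridge at the core of the shape-to-image process can be sketched in a few lines. The discrete marginal below is the standard bridge formula (mean interpolates between the endpoints, variance t(T-t)/T vanishes at both ends); the toy arrays are stand-ins for the MRI volume and the voxelized cortical-surface prior.

```python
import numpy as np

def brownian_bridge_sample(x0, xT, t, T=1.0, rng=np.random):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and xT (t=T):
    the mean interpolates linearly and the variance t*(T-t)/T vanishes
    at both endpoints, so generation stays anchored to the shape prior."""
    m = t / T
    mean = (1.0 - m) * x0 + m * xT
    std = np.sqrt(t * (T - t) / T)
    return mean + std * rng.standard_normal(x0.shape)

x0 = np.zeros((8, 8, 8))   # stand-in for the target MRI volume
xT = np.ones((8, 8, 8))    # stand-in for the voxelized cortical-surface prior
xt = brownian_bridge_sample(x0, xT, t=0.5)
```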
[150] Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration
Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang, Shuang Zeng, Lei Zhu, Yanye Lu
Main category: cs.CV
TL;DR: Pref-Restore: A hierarchical framework for blind face restoration that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration by addressing information asymmetry between sparse inputs and dense outputs.
Details
Motivation: Current generative approaches for blind face restoration suffer from information asymmetry - the disparity between information-sparse low quality inputs and information-dense high quality outputs. This imbalance creates a one-to-many mapping problem leading to stochastic uncertainty and hallucinatory artifacts in restoration results.
Method: Pref-Restore uses two complementary strategies: 1) Augmenting Input Density: An auto-regressive integrator reformulates textual instructions into dense latent queries to inject high-level semantic stability. 2) Pruning Output Distribution: Integration of on-policy reinforcement learning directly into the diffusion restoration loop to transform human preferences into differentiable constraints, penalizing stochastic deviations.
Result: Extensive experiments demonstrate state-of-the-art performance across synthetic and real-world benchmarks. Empirical analysis confirms that the preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.
Conclusion: Pref-Restore successfully addresses the information asymmetry problem in blind face restoration through hierarchical integration of semantic logic and texture generation, achieving deterministic, preference-aligned restoration with reduced stochastic uncertainty.
Abstract: Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative approaches, while capable of synthesizing realistic textures, often suffer from information asymmetry – the intrinsic disparity between the information-sparse low quality inputs and the information-dense high quality outputs. This imbalance leads to a one-to-many mapping, where insufficient constraints result in stochastic uncertainty and hallucinatory artifacts. To bridge this gap, we present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration. Our methodology fundamentally addresses this information disparity through two complementary strategies: (1) Augmenting Input Density: We employ an auto-regressive integrator to reformulate textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals; (2) Pruning Output Distribution: We pioneer the integration of on-policy reinforcement learning directly into the diffusion restoration loop. By transforming human preferences into differentiable constraints, we explicitly penalize stochastic deviations, thereby sharpening the posterior distribution toward the desired high-fidelity outcomes. Extensive experiments demonstrate that Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. Furthermore, empirical analysis confirms that our preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.
[151] Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild)
Ofir Abramovich, Ariel Shamir, Andreas Aristidou
Main category: cs.CV
TL;DR: A novel motion capture system reconstructs full-body 3D motion using only sparse pairwise distance measurements from body-mounted UWB sensors, eliminating need for external cameras and enabling robust outdoor operation.
Details
Motivation: Traditional motion capture systems (optical/inertial) require external cameras or are sensitive to environmental constraints like lighting and magnetic interference. There's a need for robust, shape-invariant motion capture that works in uncontrolled outdoor environments.
Method: Uses body-mounted UWB sensors with time-of-flight ranging for pairwise distance measurements. Core is Wild-Poser (WiP), a compact real-time Transformer-based architecture that directly predicts 3D joint positions from noisy PWD measurements. Joint rotations can be reconstructed later via learned methods.
Result: WiP achieves low joint position error, operates in real-time, and generalizes across subjects of varying morphologies (including non-human species) without requiring individual body measurements or shape fitting. Demonstrates accurate 3D motion reconstruction for human and animal subjects in-the-wild.
Conclusion: The system enables scalable, low-cost, general purpose motion capture in real-world settings, overcoming limitations of traditional optical/inertial systems for uncontrolled outdoor environments.
Abstract: We introduce a novel motion capture system that reconstructs full-body 3D motion using only sparse pairwise distance (PWD) measurements from body-mounted ultra-wideband (UWB) sensors. Using time-of-flight ranging between wireless nodes, our method eliminates the need for external cameras, enabling robust operation in uncontrolled and outdoor environments. Unlike traditional optical or inertial systems, our approach is shape-invariant and resilient to environmental constraints such as lighting and magnetic interference. At the core of our system is Wild-Poser (WiP for short), a compact, real-time Transformer-based architecture that directly predicts 3D joint positions from noisy or corrupted PWD measurements, which can later be used for joint rotation reconstruction via learned methods. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in-the-wild. Our empirical analysis highlights its potential for scalable, low-cost, and general-purpose motion capture in real-world settings.
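A minimal sketch of the pairwise-distance input that a model like WiP would consume, with additive Gaussian noise as an assumed stand-in for real UWB ranging error:

```python
import torch

def pwd_features(node_xyz, noise_std=0.05):
    """Build the pairwise-distance input from K body-mounted node positions
    (K x 3). Additive Gaussian noise stands in for UWB ranging error; only
    the upper triangle is kept, since distances are symmetric."""
    d = torch.cdist(node_xyz, node_xyz)            # (K, K) time-of-flight ranges
    d = d + noise_std * torch.randn_like(d)
    iu = torch.triu_indices(d.shape[0], d.shape[1], offset=1)
    return d[iu[0], iu[1]]                         # K*(K-1)/2 values

ranges = pwd_features(torch.rand(12, 3))           # 12 nodes -> 66-dim input
```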
[152] A Non-Invasive 3D Gait Analysis Framework for Quantifying Psychomotor Retardation in Major Depressive Disorder
Fouad Boutaleb, Emery Pierson, Mohamed Daoudi, Clémence Nineuil, Ali Amad, Fabien D’Hondt
Main category: cs.CV
TL;DR: Non-invasive computational framework transforms monocular RGB video into 3D gait kinematics to objectively detect Psychomotor Retardation in Major Depressive Disorder with 83.3% accuracy.
Details
Motivation: Current assessment of Psychomotor Retardation (PMR) in Major Depressive Disorder is largely subjective, and while 3D motion capture offers objectivity, it requires specialized hardware unsuitable for routine clinical use. There's a need for objective, interpretable features that can be extracted automatically for detailed patient analysis.
Method: Proposes a computational framework that transforms monocular RGB video into clinically relevant 3D gait kinematics using Gravity-View Coordinates and a novel trajectory-correction algorithm that leverages the closed-loop topology of an adapted Timed Up and Go protocol to mitigate monocular depth errors. Extracts 297 explicit gait biomechanical biomarkers. Also introduces a stability-based machine learning framework to identify robust motor signatures while preventing overfitting on small clinical datasets.
Result: Achieves 83.3% accuracy in detecting Psychomotor Retardation and explains 64% of the variance in overall depression severity (R^2=0.64) when validated on the CALYPSO dataset. Reveals strong link between reduced ankle propulsion and restricted pelvic mobility to the depressive motor phenotype.
Conclusion: Physical movement serves as a robust proxy for cognitive state, offering a transparent and scalable tool for objective monitoring of depression in standard clinical environments. The framework provides an objective, non-invasive alternative to subjective clinical assessments of PMR.
Abstract: Predicting the status of Major Depressive Disorder (MDD) from objective, non-invasive methods is an active research field. Yet, extracting automatically objective, interpretable features for a detailed analysis of the patient state remains largely unexplored. Among MDD’s symptoms, psychomotor retardation (PMR) is a core item, yet its clinical assessment remains largely subjective. While 3D motion capture offers an objective alternative, its reliance on specialized hardware often precludes routine clinical use. In this paper, we propose a non-invasive computational framework that transforms monocular RGB video into clinically relevant 3D gait kinematics. Our pipeline uses Gravity-View Coordinates along with a novel trajectory-correction algorithm that leverages the closed-loop topology of our adapted Timed Up and Go (TUG) protocol to mitigate monocular depth errors. This novel pipeline enables the extraction of 297 explicit gait biomechanical biomarkers from a single camera capture. To address the challenges of small clinical datasets, we introduce a stability-based machine learning framework that identifies robust motor signatures while preventing overfitting. Validated on the CALYPSO dataset, our method achieves an 83.3% accuracy in detecting PMR and explains 64% of the variance in overall depression severity ($R^2 = 0.64$). Notably, our study reveals a strong link between reduced ankle propulsion and restricted pelvic mobility to the depressive motor phenotype. These results demonstrate that physical movement serves as a robust proxy for the cognitive state, offering a transparent and scalable tool for the objective monitoring of depression in standard clinical environments.
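The closed-loop idea behind the trajectory correction can be illustrated with the classic linear loop-closure scheme below; the paper's algorithm may differ in detail, but the constraint is the same: an adapted TUG walk starts and ends at the same point, so the accumulated monocular drift can be measured at the endpoint and redistributed along the path.

```python
import numpy as np

def close_loop_correct(traj):
    """Linearly redistribute the loop-closure error of a trajectory whose
    start and end should coincide, as in a closed TUG walk."""
    drift = traj[-1] - traj[0]                     # accumulated depth drift
    w = np.linspace(0.0, 1.0, len(traj))[:, None]  # 0 at start, 1 at end
    return traj - w * drift

traj = np.cumsum(np.random.randn(200, 3) * 0.01, axis=0)  # drifting path
corrected = close_loop_correct(traj)
assert np.allclose(corrected[0], corrected[-1])    # loop is now closed
```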
[153] The S3LI Vulcano Dataset: A Dataset for Multi-Modal SLAM in Unstructured Planetary Environments
Riccardo Giubilato, Marcus Gerhard Müller, Marco Sewtz, Laura Alejandra Encinar Gonzalez, John Folkesson, Rudolph Triebel
Main category: cs.CV
TL;DR: S3LI Vulcano dataset release for multi-modal SLAM and place recognition benchmarking with visual and LiDAR data from volcanic environments.
Details
Motivation: Need for multi-modal datasets to develop and benchmark SLAM and place recognition algorithms using both visual and LiDAR modalities in diverse, challenging environments.
Method: Recorded sequences on volcanic island of Vulcano (Aeolian Islands, Sicily) capturing diverse environments including basaltic/iron-rich rocks, lava channels, dry vegetation, and water.
Result: Released S3LI Vulcano dataset with accompanying open source toolkit for ground truth pose generation and labeled sample preparation for place recognition tasks.
Conclusion: Dataset provides valuable multi-modal resource for SLAM and place recognition research in challenging volcanic environments, supported by open source tools for data processing.
Abstract: We release the S3LI Vulcano dataset, a multi-modal dataset towards development and benchmarking of Simultaneous Localization and Mapping (SLAM) and place recognition algorithms that rely on visual and LiDAR modalities. Several sequences are recorded on the volcanic island of Vulcano, from the Aeolian Islands in Sicily, Italy. The sequences provide users with data from a variety of environments, textures and terrains, including basaltic or iron-rich rocks, geological formations from old lava channels, as well as dry vegetation and water. The data (rmc.dlr.de/s3li_dataset) is accompanied by an open source toolkit (github.com/DLR-RM/s3li-toolkit) providing tools for generating ground truth poses as well as preparation of labelled samples for place recognition tasks.
[154] MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation
Ronglai Zuo, Rolandos Alexandros Potamias, Qi Sun, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Main category: cs.CV
TL;DR: MaDiS is a masked-diffusion language model for sign language generation that uses bidirectional context modeling and parallel token generation, achieving better performance with 30% faster inference.
Details
Motivation: Current autoregressive language models for sign language generation suffer from unidirectional context modeling and slow token-by-token inference, limiting their effectiveness and efficiency.
Method: Proposes MaDiS: a masked-diffusion-based language model with tri-level cross-modal pretraining (token-, latent-, and 3D physical-space objectives), temporal checkpoint unmasking strategy, and mixture-of-parts embedding layer with learnable gates.
Result: Achieves superior performance on CSL-Daily, Phoenix-2014T, and How2Sign datasets across multiple metrics (DTW error, SiBLEU, SiCLIP) while reducing inference latency by nearly 30%.
Conclusion: MaDiS effectively addresses limitations of autoregressive models for SLG through bidirectional context modeling, parallel generation, and innovative training strategies, offering both improved performance and efficiency.
Abstract: Sign language generation (SLG) aims to translate written texts into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives, leading to richer and more grounded sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, reducing the combinatorial complexity of unmasking orders by over $10^{41}$ times. In addition, a mixture-of-parts embedding layer is developed to effectively fuse information stored in different part-wise sign tokens through learnable gates and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%. Code and models will be released on our project page.
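For context, a generic confidence-based parallel unmasking step is sketched below; this is the masked-decoding baseline that such models refine, not MaDiS's temporal-checkpoint ordering or mixture-of-parts embeddings.

```python
import torch

def unmask_step(logits, mask, frac=0.25):
    """One parallel decoding step: commit the most confident fraction of the
    still-masked positions; the caller keeps `pred` wherever `mask` flipped."""
    probs = torch.softmax(logits, dim=-1)                  # (L, V)
    conf, pred = probs.max(dim=-1)                         # per-token confidence
    conf = torch.where(mask, conf, torch.full_like(conf, -1.0))
    n = max(1, int(frac * mask.sum().item()))
    new_mask = mask.clone()
    new_mask[conf.topk(n).indices] = False                 # unmask top-n slots
    return pred, new_mask

logits = torch.randn(16, 512)                # 16 sign tokens, 512-way codebook
mask = torch.ones(16, dtype=torch.bool)      # everything starts masked
tokens, mask = unmask_step(logits, mask)
```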
[155] QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture
Cuong Le, Pavlo Melnyk, Urs Waldmann, Mårten Wadenbäck, Bastian Wandt
Main category: cs.CV
TL;DR: QuaMo: A novel quaternion-based motion capture method using quaternion differential equations with acceleration enhancement for continuous, stable 3D human kinematics estimation from videos.
Details
Motivation: Traditional 3D pose estimation ignores temporal consistency, causing jittery motion. Current kinematics approaches rely on Euler angles which suffer from discontinuity, especially in online settings. Quaternions offer continuous transitions but haven't been properly utilized in motion capture systems.
Method: Proposes QuaMo using quaternion differential equations (QDE) with state-space modeling. Uses quaternion state and QDE describing quaternion velocity. Computes angular acceleration via meta-PD controller with novel acceleration enhancement that adaptively regulates control signals during pose changes. Solves QDE under quaternion unit-sphere constraint for accuracy.
Result: Outperforms state-of-the-art methods on Human3.6M, Fit3D, SportsPose and AIST datasets. Accurately estimates 3D human kinematics with no discontinuity and minimal implausibilities.
Conclusion: QuaMo’s novel QDE formulation with acceleration enhancement provides superior continuous motion capture by overcoming Euler angle discontinuity issues, enabling more stable and accurate 3D human kinematics estimation from videos.
Abstract: Vision-based 3D human motion capture from videos remains a challenge in computer vision. Traditional 3D pose estimation approaches often ignore the temporal consistency between frames, causing implausible and jittery motion. The emerging field of kinematics-based 3D motion capture addresses these issues by estimating the temporal transitioning between poses instead. A major drawback in current kinematics approaches is their reliance on Euler angles. Despite their simplicity, Euler angles suffer from discontinuity that leads to unstable motion reconstructions, especially in online settings where trajectory refinement is unavailable. Contrarily, quaternions have no discontinuity and can produce continuous transitions between poses. In this paper, we propose QuaMo, a novel Quaternion Motions method using quaternion differential equations (QDE) for human kinematics capture. We utilize the state-space model, an effective system for describing real-time kinematics estimations, with quaternion state and the QDE describing quaternion velocity. The corresponding angular acceleration is computed from a meta-PD controller with a novel acceleration enhancement that adaptively regulates the control signals as the human quickly changes to a new pose. Unlike previous work, our QDE is solved under the quaternion unit-sphere constraint that results in more accurate estimations. Experimental results show that our novel formulation of the QDE with acceleration enhancement accurately estimates 3D human kinematics with no discontinuity and minimal implausibilities. QuaMo outperforms comparable state-of-the-art methods on multiple datasets, namely Human3.6M, Fit3D, SportsPose and AIST. The code is available at https://github.com/cuongle1206/QuaMo
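The underlying quaternion kinematics are easy to sketch. Below is a plain explicit-Euler step of q̇ = 0.5 q ⊗ (0, ω) with renormalization onto the unit sphere; QuaMo's meta-PD controller and constrained QDE solver are simplified away.

```python
import numpy as np

def quat_mul(q, p):
    """Hamilton product of two (w, x, y, z) quaternions."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def integrate_quat(q, omega, dt):
    """Explicit-Euler step of q_dot = 0.5 * q * [0, omega], renormalized
    so the state stays on the unit sphere."""
    q_dot = 0.5 * quat_mul(q, np.concatenate(([0.0], omega)))
    q = q + dt * q_dot
    return q / np.linalg.norm(q)

q = np.array([1.0, 0.0, 0.0, 0.0])                       # identity orientation
q = integrate_quat(q, omega=np.array([0.0, 0.0, 1.0]), dt=0.01)
```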
[156] ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving
Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Daxin Tian, Bingzhao Gao, Jianqiang Wang, Hong Chen
Main category: cs.CV
TL;DR: ScenePilot-Bench is a large-scale first-person driving benchmark for evaluating vision-language models in autonomous driving scenarios, built on 3,847 hours of driving videos with multi-granular annotations.
Details
Motivation: There's a need for comprehensive evaluation frameworks to assess vision-language models in safety-critical autonomous driving contexts, particularly for understanding their capabilities and limitations in driving-oriented reasoning.
Method: Built upon ScenePilot-4K dataset with 3,847 hours of driving videos annotated with scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. Features a four-axis evaluation suite assessing scene understanding, spatial perception, motion planning, and GPT-Score with safety-aware metrics and cross-region generalization settings.
Result: Benchmarked representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning.
Conclusion: ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts, addressing the need for standardized assessment of driving-oriented AI capabilities.
Abstract: In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
[157] Localized Latent Editing for Dose-Response Modeling in Botulinum Toxin Injection Planning
Estèphe Arnaud, Mohamed Daoudi, Pierre Guerreschi
Main category: cs.CV
TL;DR: A generative AI framework that simulates Botox injection effects for treatment planning by learning localized muscle relaxation trajectories in StyleGAN2’s latent space and correlating them with toxin doses.
Details
Motivation: Current Botox injection planning relies on clinician intuition rather than quantitative methods, leading to suboptimal outcomes. There's a need for a systematic approach to predict injection effects and optimize dosage.
Method: Proposes a localized latent editing framework with Region-Specific Latent Axis Discovery that learns muscle relaxation trajectories in StyleGAN2’s latent space. Compares two approaches: direct metric regression vs. image-based generative simulation on clinical data (N=360 images from 46 patients).
Result: The framework shows moderate-to-strong structural correlations for geometric asymmetry metrics on hold-out test data. The generative model correctly captures the direction of morphological changes, though biological variability limits absolute precision.
Conclusion: Introduces a hybrid “Human-in-the-Loop” workflow where clinicians interactively refine simulations, bridging the gap between pathological reconstruction and cosmetic planning for more precise Botox treatment.
Abstract: Botulinum toxin (Botox) injections are the gold standard for managing facial asymmetry and aesthetic rejuvenation, yet determining the optimal dosage remains largely intuitive, often leading to suboptimal outcomes. We propose a localized latent editing framework that simulates Botulinum Toxin injection effects for injection planning through dose-response modeling. Our key contribution is a Region-Specific Latent Axis Discovery method that learns localized muscle relaxation trajectories in StyleGAN2’s latent space, enabling precise control over specific facial regions without global side effects. By correlating these localized latent trajectories with injected toxin units, we learn a predictive dose-response model. We rigorously compare two approaches: direct metric regression versus image-based generative simulation on a clinical dataset of N=360 images from 46 patients. On a hold-out test set, our framework demonstrates moderate-to-strong structural correlations for geometric asymmetry metrics, confirming that the generative model correctly captures the direction of morphological changes. While biological variability limits absolute precision, we introduce a hybrid “Human-in-the-Loop” workflow where clinicians interactively refine simulations, bridging the gap between pathological reconstruction and cosmetic planning.
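At its core, the dose-response editing amounts to moving a latent along a region-specific axis by a dose-dependent amount. In the sketch below, apply_dose, the linear slope, and v_frontalis are all hypothetical placeholders for the learned model and discovered axes.

```python
import numpy as np

def apply_dose(w, v_region, dose, slope=0.02):
    """Move a StyleGAN2 latent along a region-specific axis by an amount
    driven by the injected units; the linear `slope` is a placeholder for
    the learned dose-response model."""
    return w + slope * dose * v_region

w = np.random.randn(512)                                  # patient latent code
v_frontalis = np.random.randn(512)                        # hypothetical axis
v_frontalis /= np.linalg.norm(v_frontalis)
w_simulated = apply_dose(w, v_frontalis, dose=20.0)       # simulate 20 units
```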
[158] The role of self-supervised pretraining in differentially private medical image analysis
Soroosh Tayebi Arasteh, Mina Farajiamiri, Mahshad Lotfinia, Behrus Hinrichs-Puladi, Jonas Bienzeisler, Mohamed Alhaskir, Mirabela Rusu, Christiane Kuhl, Sven Nebelung, Daniel Truhn
Main category: cs.CV
TL;DR: DINOv3 self-supervised initialization improves DP medical imaging performance over ImageNet but domain-specific supervised pretraining works best, with initialization strongly affecting fairness, generalization, and robustness under privacy constraints.
Details
Motivation: Differential privacy (DP) protects sensitive medical data but hurts diagnostic performance. Model initialization helps mitigate this degradation, but the role of modern self-supervised learning under full-model DP is not well understood in medical imaging.
Method: Large-scale evaluation of initialization strategies for DP medical image analysis using chest radiograph classification with 800k+ images. Compared three approaches: 1) non-domain-specific supervised ImageNet initialization, 2) non-domain-specific self-supervised DINOv3 initialization, and 3) domain-specific supervised pretraining on MIMIC-CXR. Used state-of-the-art ConvNeXt models trained with DP-SGD across realistic privacy regimes, evaluated across five external datasets from diverse institutions.
Result: DINOv3 initialization consistently improves diagnostic utility over ImageNet initialization under DP, but domain-specific supervised pretraining performs best, achieving performance closest to non-private baselines. Initialization choice strongly influences demographic fairness, cross-dataset generalization, and robustness to data scale and model capacity under privacy constraints.
Conclusion: Initialization strategy is a central determinant of utility, fairness, and generalization in differentially private medical imaging, with domain-specific supervised pretraining being most effective for balancing privacy protection with diagnostic performance.
Abstract: Differential privacy (DP) provides formal protection for sensitive data but typically incurs substantial losses in diagnostic performance. Model initialization has emerged as a critical factor in mitigating this degradation, yet the role of modern self-supervised learning under full-model DP remains poorly understood. Here, we present a large-scale evaluation of initialization strategies for differentially private medical image analysis, using chest radiograph classification as a representative benchmark with more than 800,000 images. Using state-of-the-art ConvNeXt models trained with DP-SGD across realistic privacy regimes, we compare non-domain-specific supervised ImageNet initialization, non-domain-specific self-supervised DINOv3 initialization, and domain-specific supervised pretraining on MIMIC-CXR, the largest publicly available chest radiograph dataset. Evaluations are conducted across five external datasets spanning diverse institutions and acquisition settings. We show that DINOv3 initialization consistently improves diagnostic utility relative to ImageNet initialization under DP, but remains inferior to domain-specific supervised pretraining, which achieves performance closest to non-private baselines. We further demonstrate that initialization choice strongly influences demographic fairness, cross-dataset generalization, and robustness to data scale and model capacity under privacy constraints. The results establish initialization strategy as a central determinant of utility, fairness, and generalization in differentially private medical imaging.
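A hedged sketch of one arm of such a comparison with Opacus is shown below; the noise multiplier, clipping bound, 14-label head, and random tensors standing in for chest radiographs are all illustrative assumptions.

```python
import torch, torchvision
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# One arm of the comparison: ImageNet-supervised initialization (the other
# arms would load DINOv3 or MIMIC-CXR-pretrained weights instead).
model = torchvision.models.convnext_tiny(weights="IMAGENET1K_V1")
model.classifier[2] = torch.nn.Linear(768, 14)    # assumed 14-finding CXR head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
xs = torch.randn(32, 3, 224, 224)                 # stand-ins for radiographs
ys = torch.randint(0, 2, (32, 14)).float()
loader = DataLoader(TensorDataset(xs, ys), batch_size=8)

model, optimizer, loader = PrivacyEngine().make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.0,   # illustrative; set by the target (epsilon, delta)
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)
```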
[159] Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning Framework
Hao Chang, Zhihui Wang, Lingxiang Wu, Peijin Wang, Wenhui Diao, Jinqiao Wang
Main category: cs.CV
TL;DR: GovLA-10K is the first management-oriented multi-modal benchmark for low-altitude intelligence, with GovLA-Reasoner framework for governance-aware aerial perception, focusing on functionally salient targets and actionable management suggestions.
Details
Motivation: Existing object-centric perception and loosely coupled vision-language pipelines fail to support management-oriented anomaly understanding needed for real-world urban governance through low-altitude vision systems.
Method: Introduces GovLA-10K benchmark designed around functionally salient targets with management suggestions, and GovLA-Reasoner framework with efficient feature adapter that coordinates visual detector and LLM without fine-tuning individual components.
Result: Extensive experiments show significant performance improvement while avoiding task-specific fine-tuning of individual components.
Conclusion: The work offers new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
Abstract: Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines still struggle to support the management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need to fine-tune any task-specific components. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
[160] KeepLoRA: Continual Learning with Residual Gradient Adaptation
Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, Min-Ling Zhang
Main category: cs.CV
TL;DR: KeepLoRA: A continual learning method for vision-language models that balances pre-trained knowledge retention, task knowledge preservation, and plasticity by restricting LoRA updates to residual subspaces.
Details
Motivation: Continual learning for pre-trained vision-language models needs to balance three competing objectives: retaining pre-trained knowledge, preserving knowledge from learned tasks, and maintaining plasticity for new tasks. Current approaches struggle to effectively balance these objectives.
Method: KeepLoRA analyzes parameter spaces and finds that general knowledge resides in principal subspaces while task-specific knowledge resides in residual subspaces. It learns new tasks by restricting LoRA parameter updates to the residual subspace, projecting gradients onto a subspace orthogonal to both the pre-trained principal subspace and previous task feature directions.
Result: Theoretical and empirical analyses confirm that KeepLoRA effectively balances the three objectives and achieves state-of-the-art performance in continual learning for vision-language models.
Conclusion: KeepLoRA presents a simple but effective approach for continual learning that successfully balances knowledge retention, preservation, and plasticity through subspace-aware parameter updates, with code publicly available.
Abstract: Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA to effectively balance these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates in the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of pre-trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance. The implementation code is available at https://github.com/MaolinLuo/KeepLoRA.
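The subspace-restricted update can be sketched directly. The rank r of the principal subspace and the QR orthonormalization of previous-task feature directions below are simplifying assumptions, not the paper's exact procedure.

```python
import torch

def residual_project(grad, W0, prev_feats=None, r=16):
    """Project a gradient onto the complement of the pre-trained principal
    subspace (top-r left singular vectors of W0) and, optionally, of the
    dominant feature directions of previous tasks (given as columns)."""
    U, _, _ = torch.linalg.svd(W0, full_matrices=False)
    basis = U[:, :r]
    if prev_feats is not None:
        basis = torch.linalg.qr(torch.cat([basis, prev_feats], dim=1)).Q
    return grad - basis @ (basis.T @ grad)            # keep residual part only

W0 = torch.randn(256, 128)                            # frozen pre-trained weight
grad = torch.randn(256, 128)                          # raw new-task gradient
safe_grad = residual_project(grad, W0)
```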
[161] A new Image Similarity Metric for a Perceptual and Transparent Geometric and Chromatic Assessment
Antonio Di Marino, Vincenzo Bevilacqua, Emanuel Di Nardo, Angelo Ciaramella, Ivanoe De Falco, Giovanna Sannino
Main category: cs.CV
TL;DR: Proposes a new perceptual image similarity metric with texture and color components, outperforms state-of-the-art on complex distortions, and provides visual explanations for transparency.
Details
Motivation: Current image similarity metrics are not truly perceptual, struggle with texture distortions, and lack transparency - they provide scores without explaining the differences between images.
Method: Two-component perceptual metric: 1) Texture dissimilarity using Earth Mover’s Distance, 2) Chromatic dissimilarity in Oklab perceptual color space. Evaluated on Berkeley-Adobe Perceptual Patch Similarity dataset with complex shape and color distortions.
Result: Outperforms state-of-the-art metrics, especially for images with shape distortions, confirming better perceptiveness. Provides visual explanations to support similarity scores, making assessment transparent and justified.
Conclusion: The proposed perceptual metric effectively addresses limitations of existing methods by combining texture and color analysis, achieving superior performance on complex distortions while offering transparent, explainable similarity assessments.
Abstract: In the literature, several studies have shown that state-of-the-art image similarity metrics are not perceptual metrics; moreover, they have difficulty evaluating images, especially when texture distortion is also present. In this work, we propose a new perceptual metric composed of two terms. The first term evaluates the dissimilarity between the textures of two images using Earth Mover’s Distance. The second term evaluates the chromatic dissimilarity between two images in the Oklab perceptual color space. We evaluated the performance of our metric on a non-traditional dataset, called Berkeley-Adobe Perceptual Patch Similarity, which contains a wide range of complex distortions in shapes and colors. We have shown that our metric outperforms the state of the art, especially when images contain shape distortions, also confirming its stronger perceptual alignment. Furthermore, although deep black-box metrics can be very accurate, they only provide similarity scores between two images, without explaining their main differences and similarities. Our metric, on the other hand, provides visual explanations to support the calculated score, making the similarity assessment transparent and justified.
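A sketch in the spirit of the two-term metric, assuming inputs already converted to a perceptual color space such as Oklab (lightness in channel 0); the gradient-histogram texture descriptor and the weight alpha are illustrative stand-ins for the paper's exact terms.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def texture_color_dissimilarity(img_a, img_b, alpha=0.5):
    """Two-term dissimilarity: EMD between luminance-gradient (texture)
    samples plus mean per-pixel chromatic distance. Inputs are (H, W, 3)
    arrays in a perceptual space with lightness in channel 0."""
    def grad_mag(img):
        gy, gx = np.gradient(img[..., 0])
        return np.hypot(gx, gy).ravel()
    texture = wasserstein_distance(grad_mag(img_a), grad_mag(img_b))
    chroma = np.linalg.norm(img_a - img_b, axis=-1).mean()
    return alpha * texture + (1.0 - alpha) * chroma

a, b = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
score = texture_color_dissimilarity(a, b)
```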
[162] SharpNet: Enhancing MLPs to Represent Functions with Controlled Non-differentiability
Hanting Niu, Junkai Deng, Fei Hou, Wencheng Wang, Ying He
Main category: cs.CV
TL;DR: SharpNet is a modified MLP architecture that can represent functions with user-defined sharp features by incorporating an auxiliary feature function from Poisson equations, enabling joint optimization of feature locations and network parameters.
Details
Motivation: Standard MLPs produce globally smooth outputs and struggle to represent continuous but non-differentiable functions with prescribed sharp features without ad hoc post-processing.
Method: Enriches MLP with auxiliary feature function defined as solution to Poisson equation with jump Neumann boundary conditions, evaluated via efficient local integral that's differentiable with respect to feature locations, enabling joint optimization of both feature locations and MLP parameters.
Result: Validated on 2D problems and 3D CAD model reconstruction, accurately recovers sharp edges and corners while maintaining smooth behavior away from features, outperforming state-of-the-art baselines that tend to smooth out gradient discontinuities.
Conclusion: SharpNet provides precise control over C^0-continuity, ensuring C^0-continuity at feature locations and smoothness elsewhere, offering a principled approach to representing functions with sharp features using neural networks.
Abstract: Multi-layer perceptrons (MLPs) are a standard tool for learning and function approximation, but they inherently yield outputs that are globally smooth. As a result, they struggle to represent functions that are continuous yet deliberately non-differentiable (i.e., with prescribed $C^0$ sharp features) without relying on ad hoc post-processing. We present SharpNet, a modified MLP architecture capable of encoding functions with user-defined sharp features by enriching the network with an auxiliary feature function, which is defined as the solution to a Poisson equation with jump Neumann boundary conditions. It is evaluated via an efficient local integral that is fully differentiable with respect to the feature locations, enabling our method to jointly optimize both the feature locations and the MLP parameters to recover the target functions/models. The $C^0$-continuity of SharpNet is precisely controllable, ensuring $C^0$-continuity at the feature locations and smoothness elsewhere. We validate SharpNet on 2D problems and 3D CAD model reconstruction, and compare it against several state-of-the-art baselines. In both types of tasks, SharpNet accurately recovers sharp edges and corners while maintaining smooth behavior away from those features, whereas existing methods tend to smooth out gradient discontinuities. Both qualitative and quantitative evaluations highlight the benefits of our approach.
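A one-dimensional toy of the idea: augment the MLP input with an auxiliary function that is C0 but not C1 at a learnable location (here simply |x - a| rather than the paper's Poisson-equation construction), so the kink location trains jointly with the network parameters.

```python
import torch
from torch import nn

class SharpMLP(nn.Module):
    """MLP whose input is augmented with |x - a|: the output stays C0 but
    gains a controlled C1 break at the learnable location `a`. |x - a| is
    differentiable almost everywhere in `a`, so the kink trains jointly."""
    def __init__(self, hidden=64):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.3))      # sharp-feature location
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):                             # x: (N, 1)
        return self.net(torch.cat([x, torch.abs(x - self.a)], dim=-1))

model = SharpMLP()
y = model(torch.linspace(-1, 1, 101).unsqueeze(-1))
```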
[163] Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, Xudong Jiang
Main category: cs.CV
TL;DR: Video-KTR: A modality-aware policy shaping framework that performs selective token-level reinforcement learning for video reasoning by combining visual, temporal, and uncertainty attribution signals to focus learning on semantically informative content.
Details
Motivation: Existing video reasoning methods using RL rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, which limits both accuracy and interpretability.
Method: Video-KTR combines three attribution signals for selective token-level RL: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. The framework reinforces only these key tokens to focus learning on semantically informative, modality-sensitive content.
Result: Achieves state-of-the-art or highly competitive results across five challenging benchmarks, with 42.7% on Video-Holmes (surpassing GPT-4o) and consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of attribution signals and robustness of targeted token-level updates.
Conclusion: Video-KTR improves accuracy and interpretability for video reasoning, offering a simple, drop-in extension to RL that focuses learning on semantically informative content through modality-aware token selection.
Abstract: Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, achieving 42.7% on Video-Holmes (surpassing GPT-4o) with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning. Our code and models are available at https://github.com/zywang0104/Video-KTR.
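The entropy signal, the simplest of the three, can be sketched as a token mask; the quantile threshold below is an assumption, and the visual and temporal attribution masks from counterfactual masking and frame shuffling would be combined with it in the same way.

```python
import torch

def key_token_mask(logits, q_entropy=0.8):
    """Mark the highest-entropy generated tokens for targeted RL updates."""
    probs = torch.softmax(logits, dim=-1)                         # (L, V)
    ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)    # (L,)
    return ent >= torch.quantile(ent, q_entropy)                  # top ~20%

logits = torch.randn(128, 32000)             # logits of 128 generated tokens
mask = key_token_mask(logits)                # apply the RL loss only here
```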
[164] DSVM-UNet : Enhancing VM-UNet with Dual Self-distillation for Medical Image Segmentation
Renrong Shao, Dongyang Li, Dong Xia, Lin Shao, Jiangdong Lu, Fen Zheng, Lulu Zhang
Main category: cs.CV
TL;DR: DSVM-UNet improves Vision Mamba-based UNet for medical image segmentation using dual self-distillation at global and local levels without complex architectural changes.
Details
Motivation: Existing Vision Mamba UNet approaches focus on complex architectural designs to enhance semantic feature perception, but this paper aims to improve performance through a simpler distillation-based approach without architectural complexity.
Method: Proposes Dual Self-distillation for VM-UNet (DSVM-UNet) with double self-distillation methods that align features at both global and local levels, maintaining the original architecture while enhancing performance.
Result: Extensive experiments on ISIC2017, ISIC2018, and Synapse benchmarks demonstrate state-of-the-art performance while maintaining computational efficiency.
Conclusion: The proposed DSVM-UNet approach effectively improves Vision Mamba-based medical image segmentation through simple yet effective dual self-distillation, achieving superior performance without architectural complexity.
Abstract: Vision Mamba models, which address the limitations of previous models by effectively managing long-range dependencies with linear-time overhead, have been extensively researched across various fields. Several subsequent studies have further designed Vision Mamba-based UNets (VM-UNet) for medical image segmentation. These approaches primarily focus on optimizing architectural designs, creating more complex structures to enhance the model's ability to perceive semantic features. In this paper, we propose a simple yet effective approach that improves the model through Dual Self-distillation for VM-UNet (DSVM-UNet), without any complex architectural designs. To achieve this goal, we develop double self-distillation methods to align the features at both the global and local levels. Extensive experiments conducted on the ISIC2017, ISIC2018, and Synapse benchmarks demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. Code is available at https://github.com/RoryShao/DSVM-UNet.git.
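For reference, a generic pixel-level self-distillation term of the kind such dual schemes build on (the temperature and head pairing are assumptions; the paper applies alignment at both global and local levels):

```python
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, tau=2.0):
    """Distill a deeper decoder head (teacher) into a shallower one
    (student) at the pixel level via temperature-scaled KL."""
    t = F.softmax(teacher_logits.detach() / tau, dim=1)    # (B, C, H, W)
    s = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

student = torch.randn(2, 4, 64, 64)          # shallow-level segmentation logits
teacher = torch.randn(2, 4, 64, 64)          # deepest-level segmentation logits
loss = self_distill_loss(student, teacher)
```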
[165] Self-Supervised Weight Templates for Scalable Vision Model Initialization
Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
Main category: cs.CV
TL;DR: SWEET is a self-supervised framework for scalable model initialization that learns a shared weight template and size-specific scalers via Tucker factorization, enabling flexible adaptation to varying model architectures with minimal training data.
Details
Motivation: Modern models require varying architecture sizes for deployment, but conventional pre-training and fine-tuning methods are limited to fixed-size models, creating inefficiencies when adapting to different computational constraints and application needs.
Method: Uses Tucker-based factorization to learn a shared weight template and size-specific weight scalers, with width-wise stochastic scaling for regularization. Target models are initialized by composing and reweighting the template through lightweight scalers learned from minimal data.
Result: Achieves state-of-the-art performance on classification, detection, segmentation, and generation tasks for initializing variable-sized vision models, demonstrating superior cross-width generalization and efficiency.
Conclusion: SWEET provides an effective self-supervised framework for scalable model initialization that addresses the limitations of conventional pre-training by enabling flexible adaptation to varying model architectures with minimal training overhead.
Abstract: The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation, and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.
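The composition step can be sketched as a Tucker-style mode product: a shared template reweighted by two small size-specific scalers. Shapes and scales below are illustrative, not the paper's exact factorization.

```python
import torch

def compose_weight(template, s_out, s_in):
    """Instantiate one layer's weight from the shared template via two
    lightweight size-specific scalers (a Tucker-style mode product along
    the output and input dimensions)."""
    return s_out @ template @ s_in

template = torch.randn(64, 64)               # shared weight template
s_out = 0.1 * torch.randn(128, 64)           # scalers fit on minimal data
s_in = 0.1 * torch.randn(64, 256)
W = compose_weight(template, s_out, s_in)    # a 128 x 256 layer for a wider net
```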
[166] DiffStyle3D: Consistent 3D Gaussian Stylization via Attention Optimization
Yitong Yang, Xuexin Liu, Yinglin Wang, Jing Wang, Hao Dou, Changshuo Wang, Shuting He
Main category: cs.CV
TL;DR: DiffStyle3D is a novel diffusion-based 3D Gaussian Splatting style transfer method that directly optimizes in latent space using attention-aware loss and geometry-guided multi-view consistency, outperforming existing methods.
Details
Motivation: Existing 3D style transfer methods have limitations: VGG- and CLIP-based approaches struggle with multi-view consistency modeling, while diffusion-based methods capture consistency but suffer from unstable training due to reliance on denoising directions.
Method: 1) Direct optimization in latent space; 2) Attention-Aware Loss that aligns style features in self-attention space while preserving content features; 3) Geometry-Guided Multi-View Consistency that integrates geometric information into self-attention for cross-view correspondence; 4) Geometry-aware mask to prevent redundant optimization in overlapping regions.
Result: Extensive experiments show DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism with better multi-view consistency.
Conclusion: DiffStyle3D provides an effective diffusion-based paradigm for 3DGS style transfer that addresses multi-view consistency challenges through latent space optimization and geometry-guided attention mechanisms.
Abstract: 3D style transfer enables the creation of visually expressive 3D content, enriching the visual appearance of 3D scenes and objects. However, existing VGG- and CLIP-based methods struggle to model multi-view consistency within the model itself, while diffusion-based approaches can capture such consistency but rely on denoising directions, leading to unstable training. To address these limitations, we propose DiffStyle3D, a novel diffusion-based paradigm for 3DGS style transfer that directly optimizes in the latent space. Specifically, we introduce an Attention-Aware Loss that performs style transfer by aligning style features in the self-attention space, while preserving original content through content feature alignment. Inspired by the geometric invariance of 3D stylization, we propose a Geometry-Guided Multi-View Consistency method that integrates geometric information into self-attention to enable cross-view correspondence modeling. Based on geometric information, we additionally construct a geometry-aware mask to prevent redundant optimization in overlapping regions across views, which further improves multi-view consistency. Extensive experiments show that DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism.
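A hedged sketch of a latent-space objective of this shape, assuming features hooked from a diffusion UNet's self-attention layers; the pooling and loss weights are illustrative, not the paper's exact Attention-Aware Loss.

```python
import torch
import torch.nn.functional as F

def attention_aware_loss(attn_out, attn_style, feat_out, feat_content,
                         w_style=1.0, w_content=1.0):
    """Match pooled self-attention features to the style reference while
    keeping content features close to the source views."""
    style_term = F.mse_loss(attn_out.mean(dim=1), attn_style.mean(dim=1))
    content_term = F.mse_loss(feat_out, feat_content)
    return w_style * style_term + w_content * content_term

attn_out, attn_style = torch.randn(2, 77, 320), torch.randn(2, 77, 320)
feat_out, feat_content = torch.randn(2, 77, 320), torch.randn(2, 77, 320)
loss = attention_aware_loss(attn_out, attn_style, feat_out, feat_content)
```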
[167] WaterClear-GS: Optical-Aware Gaussian Splatting for Underwater Reconstruction and Restoration
Xinrui Zhang, Yufeng Wang, Shuangkang Fang, Zesheng Wang, Dacheng Qi, Wenrui Ding
Main category: cs.CV
TL;DR: WaterClear-GS is a 3D Gaussian Splatting-based framework that integrates underwater optical properties into Gaussian primitives for real-time 3D reconstruction and appearance restoration, achieving state-of-the-art performance on novel view synthesis and underwater image restoration.
Details
Motivation: Existing NeRF-based methods for underwater 3D reconstruction suffer from slow rendering speeds and poor color restoration, while 3D Gaussian Splatting lacks the ability to model complex volumetric scattering effects inherent to underwater environments.
Method: A pure 3DGS framework that explicitly integrates underwater optical properties (local attenuation and scattering) into Gaussian primitives without needing auxiliary networks. Uses dual-branch optimization for photometric consistency and water-free appearance recovery, enhanced by depth-guided geometry regularization, perception-driven image loss, exposure constraints, spatially-adaptive regularization, and physically guided spectral regularization.
Result: Achieves outstanding performance on both novel view synthesis and underwater image restoration tasks on standard benchmarks and newly collected datasets, while maintaining real-time rendering capabilities.
Conclusion: WaterClear-GS successfully addresses the limitations of existing methods by integrating underwater optical properties directly into 3D Gaussian Splatting, enabling efficient and accurate underwater 3D reconstruction and appearance restoration with real-time rendering.
Abstract: Underwater 3D reconstruction and appearance restoration are hindered by the complex optical properties of water, such as wavelength-dependent attenuation and scattering. Existing Neural Radiance Fields (NeRF)-based methods struggle with slow rendering speeds and suboptimal color restoration, while 3D Gaussian Splatting (3DGS) inherently lacks the capability to model complex volumetric scattering effects. To address these issues, we introduce WaterClear-GS, the first pure 3DGS-based framework that explicitly integrates underwater optical properties of local attenuation and scattering into Gaussian primitives, eliminating the need for an auxiliary medium network. Our method employs a dual-branch optimization strategy to ensure underwater photometric consistency while naturally recovering water-free appearances. This strategy is enhanced by depth-guided geometry regularization and perception-driven image loss, together with exposure constraints, spatially-adaptive regularization, and physically guided spectral regularization, which collectively enforce local 3D coherence and maintain natural visual perception. Experiments on standard benchmarks and our newly collected dataset demonstrate that WaterClear-GS achieves outstanding performance on both novel view synthesis (NVS) and underwater image restoration (UIR) tasks, while maintaining real-time rendering. The code will be available at https://buaaxrzhang.github.io/WaterClear-GS/.
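The optical effects being baked into the Gaussians follow the standard underwater image-formation model: the direct signal decays with range while backscattered veiling light grows with it. A sketch with illustrative per-channel coefficients (not the paper's exact parameterization):

```python
import numpy as np

def underwater_forward(J, depth, beta_a, beta_s, backscatter):
    """Compose an underwater observation from a water-free image J: the
    direct signal is attenuated with range, while backscattered veiling
    light saturates with it. All coefficients are per-channel RGB."""
    d = depth[..., None]                               # (H, W, 1) camera range
    direct = J * np.exp(-beta_a * d)
    veil = backscatter * (1.0 - np.exp(-beta_s * d))
    return direct + veil

J = np.random.rand(48, 64, 3)                          # water-free appearance
depth = np.full((48, 64), 3.0)                         # 3 m to the scene
I = underwater_forward(J, depth,
                       beta_a=np.array([0.40, 0.10, 0.05]),   # red decays fast
                       beta_s=np.array([0.30, 0.10, 0.08]),
                       backscatter=np.array([0.05, 0.25, 0.35]))
```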
[168] PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification
Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
Main category: cs.CV
TL;DR: PaW-ViT is a preprocessing method that aligns vision transformer tokens with ear anatomical features to improve ear recognition by reducing positional sensitivity and enhancing robustness to shape/size/pose variations.
Details
Motivation: Standard vision transformers use rectangular tokens that often include irrelevant background information, which can degrade performance for ear recognition where anatomical feature alignment is crucial. There's a disconnect between ear biometric morphological variation and transformer architecture's positional sensitivity.Method: PaW-ViT (Patch-based Warping Vision Transformer) is a preprocessing approach that normalizes ear images using anatomical knowledge. It aligns token boundaries to detected ear feature boundaries and warps features to match natural ear curvature, creating more consistent token representations.
Result: Experiments show PaW-ViT effectively improves various ViT models (ViT-T, ViT-S, ViT-B, ViT-L) with reasonable alignment robustness to shape, size, and pose variations.
Conclusion: PaW-ViT addresses the mismatch between ear biometric variation and transformer positional sensitivity, offering a promising approach for ear authentication schemes by creating anatomically-aligned token representations.
Abstract: The rectangular tokens common to vision transformer methods for visual recognition can strongly affect the performance of these methods due to the incorporation of information outside the objects to be recognized. This paper introduces PaW-ViT, Patch-based Warping Vision Transformer, a preprocessing approach rooted in anatomical knowledge that normalizes ear images to enhance the efficacy of ViT. By accurately aligning token boundaries to detected ear feature boundaries, PaW-ViT obtains greater robustness to shape, size, and pose variation. By aligning feature boundaries to natural ear curvature, it produces more consistent token representations for various morphologies. Experiments confirm the effectiveness of PaW-ViT on various ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and yield reasonable alignment robustness to variation in shape, size, and pose. Our work aims to resolve the disconnect between ear biometric morphological variation and transformer architecture positional sensitivity, presenting a possible avenue for authentication schemes.

[169] GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance
Haozhi Zhu, Miaomiao Zhao, Dingyao Liu, Runze Tian, Yan Zhang, Jie Guo, Fenggen Yu
Main category: cs.CV
TL;DR: GeoDiff3D is a self-supervised 3D scene generation framework that uses coarse geometry as structural anchor and geometry-constrained 2D diffusion for texture-rich references, enabling efficient high-quality scene generation without strict multi-view consistency requirements.
Details
Motivation: Existing 3D scene generation methods (indirect 2D-to-3D reconstruction and direct 3D generation) suffer from weak structural modeling, heavy reliance on large-scale ground-truth supervision, structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes.Method: Uses coarse geometry as structural anchor, geometry-constrained 2D diffusion model for texture-rich reference images (doesn’t require strict multi-view consistency), voxel-aligned 3D feature aggregation, and dual self-supervision to maintain scene coherence and fine details while reducing labeled data dependence.
Result: Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, with low computational cost and fast, high-quality 3D scene generation.
Conclusion: GeoDiff3D offers a practical solution for accessible and efficient 3D scene construction, addressing limitations of existing methods while maintaining scene coherence and fine details with reduced supervision requirements.
Abstract: 3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation - but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.
[170] Diffusion for De-Occlusion: Accessory-Aware Diffusion Inpainting for Robust Ear Biometric Recognition
Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
Main category: cs.CV
TL;DR: Diffusion-based ear inpainting improves transformer-based ear recognition by reconstructing occluded ear regions from accessories like earrings and earphones.
Details
Motivation: Ear occlusions from accessories (earrings, earphones) degrade performance in ear biometric recognition systems, especially in unconstrained imaging scenarios.Method: Use diffusion-based ear inpainting as pre-processing: given ear image and automatically derived accessory mask, reconstruct clean ear regions while preserving anatomical structures (helix, antihelix, concha, lobule). Evaluate with transformer-based recognition systems across various ViT models and patch sizes.
Result: Experiments show diffusion-based inpainting effectively alleviates ear accessory occlusions and improves overall recognition performance across benchmark datasets.
Conclusion: Diffusion-based inpainting serves as a useful pre-processing aid to mitigate ear accessory occlusions and enhance transformer-based ear recognition systems.
Abstract: Ear occlusions (arising from the presence of ear accessories such as earrings and earphones) can negatively impact performance in ear-based biometric recognition systems, especially in unconstrained imaging circumstances. In this study, we assess the effectiveness of a diffusion-based ear inpainting technique as a pre-processing aid to mitigate the issues of ear accessory occlusions in transformer-based ear recognition systems. Given an input ear image and an automatically derived accessory mask, the inpainting model reconstructs clean and anatomically plausible ear regions by synthesizing missing pixels while preserving local geometric coherence along key ear structures, including the helix, antihelix, concha, and lobule. We evaluate the effectiveness of this pre-processing aid in transformer-based recognition systems for several vision transformer models and different patch sizes for a range of benchmark datasets. Experiments show that diffusion-based inpainting can be a useful pre-processing aid to alleviate ear accessory occlusions to improve overall recognition performance.
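For intuition, one common recipe for mask-guided diffusion inpainting is the RePaint-style blending step: the reverse process synthesizes the occluded region while the known ear pixels are re-imposed from the forward-noised input at each step. The sketch below shows that general mechanism; the denoiser and schedule are stand-ins, and it is not necessarily the exact sampler this paper uses.

```python
# Hedged sketch of a RePaint-style mask-blending step for inpainting.
# `denoise_step` is a stand-in for one step of a trained reverse sampler.
import torch

def inpaint_blend_step(x_t, x0_known, mask, alpha_bar_prev, denoise_step):
    """
    x_t:           current noisy image at step t
    x0_known:      clean input image (valid outside the accessory mask)
    mask:          1 where the accessory occludes the ear, 0 elsewhere
    alpha_bar_prev: cumulative noise-schedule value at the target step t-1
    """
    # One reverse step on the full image (fills the masked region).
    x_prev_gen = denoise_step(x_t)                                   # level t-1
    # Forward-noise the known pixels to the same t-1 noise level.
    noise = torch.randn_like(x0_known)
    x_prev_known = (alpha_bar_prev.sqrt() * x0_known
                    + (1 - alpha_bar_prev).sqrt() * noise)
    # Blend: generated content under the mask, known content elsewhere.
    return mask * x_prev_gen + (1 - mask) * x_prev_known

toy_step = lambda x: 0.95 * x                        # stand-in reverse step
x_t = torch.randn(1, 3, 32, 32)
x0 = torch.rand(1, 3, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.8).float()      # accessory region
print(inpaint_blend_step(x_t, x0, mask, torch.tensor(0.5), toy_step).shape)
```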
[171] Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li
Main category: cs.CV
TL;DR: Youtu-VL introduces a Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm that treats visual signals as supervisory targets rather than just conditional inputs, improving fine-grained visual understanding in VLMs.
Details
Motivation: Current Vision-Language Models have limitations in retaining fine-grained visual information due to a text-dominant optimization bias where visual signals are treated as passive conditional inputs rather than supervisory targets.Method: Youtu-VL uses Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm that shifts optimization from “vision-as-input” to “vision-as-target,” integrating visual tokens directly into the prediction stream with unified autoregressive supervision for both visual details and linguistic content.
Result: Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, enabling standard VLMs to perform vision-centric tasks without task-specific additions.
Conclusion: The VLUAS paradigm establishes a robust foundation for developing comprehensive generalist visual agents by fundamentally changing how visual information is processed in vision-language models.
Abstract: Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from “vision-as-input” to “vision-as-target.” By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
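A minimal sketch of what “vision-as-target” supervision amounts to in code: a single autoregressive cross-entropy over an interleaved stream in which discrete visual token ids are predicted, rather than masked out of the loss as in the conventional “vision-as-input” setup. The token layout and the masking convention are illustrative assumptions, not Youtu-VL's specification.

```python
# Hedged sketch of unified autoregressive supervision over mixed
# visual + text token streams. The joint-vocabulary layout is assumed.
import torch
import torch.nn.functional as F

def unified_ar_loss(logits, targets, is_visual, text_only=False):
    """
    logits:    (B, T, V) next-token predictions over a joint vocabulary
               that contains both text tokens and discrete visual tokens
    targets:   (B, T) ground-truth next-token ids (visual ids included)
    is_visual: (B, T) bool, True where the target is a visual token
    """
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="none")
    if text_only:  # conventional baseline: zero out visual-target terms
        loss = loss * (~is_visual).reshape(-1).float()
    return loss.mean()

B, T, V = 2, 16, 1000
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
is_visual = torch.rand(B, T) < 0.5
print(unified_ar_loss(logits, targets, is_visual))
```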
[172] Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
Kun Li, Michael Ying Yang, Sami Sebastian Brandt
Main category: cs.CV
TL;DR: QSTar method enhances AVQA by integrating question-guided clues with audio frequency analysis and spatial-temporal perception, achieving state-of-the-art performance on AVQA benchmarks.
Details
Motivation: Existing AVQA approaches treat audio as complementary to video and integrate textual questions only in final stages, limiting multimodal reasoning. The paper aims to better incorporate question-guided clues and exploit audio's frequency-domain characteristics for improved audio-visual understanding.Method: Proposes Query-guided Spatial-Temporal-Frequency (QSTar) interaction method that incorporates question-guided clues and exploits audio frequency-domain characteristics alongside spatial and temporal perception. Also introduces Query Context Reasoning (QCR) block inspired by prompting to focus on semantically relevant audio and visual features.
Result: Extensive experiments on several AVQA benchmarks demonstrate significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
Conclusion: The proposed QSTar method effectively enhances audio-visual understanding by better integrating question guidance and exploiting audio frequency characteristics, achieving state-of-the-art performance in AVQA tasks.
Abstract: Audio–Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio–visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial–Temporal–Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio–visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released after publication.
[173] HexFormer: Hyperbolic Vision Transformer with Exponential Map Aggregation
Haya Alyoussef, Ahmad Bdeir, Diego Coello de Portugal Mecke, Tom Hanika, Niels Landwehr, Lars Schmidt-Thieme
Main category: cs.CV
TL;DR: HexFormer introduces hyperbolic vision transformers for image classification using exponential map aggregation in attention, showing improved performance and gradient stability over Euclidean baselines.
Details
Motivation: Hierarchical and relational structures in multimodal data (images, text, graphs) are challenging to model in Euclidean geometry, while hyperbolic geometry provides a natural framework for such structures.Method: Two designs: hyperbolic ViT (HexFormer) and hybrid variant (HexFormer-Hybrid) with hyperbolic encoder + Euclidean linear head. Novel attention mechanism uses exponential map aggregation instead of standard centroid averaging.
Result: Consistent performance improvements across multiple datasets over Euclidean baselines and prior hyperbolic ViTs. Hybrid variant achieves strongest results. Hyperbolic models show more stable gradients and reduced sensitivity to warmup strategies.
Conclusion: Hyperbolic geometry enhances vision transformers by improving gradient stability and accuracy. Simple mechanisms like exponential map aggregation provide strong practical benefits for modeling hierarchical structures.
Abstract: Data across modalities such as images, text, and graphs often contains hierarchical and relational structures, which are challenging to model within Euclidean geometry. Hyperbolic geometry provides a natural framework for representing such structures. Building on this property, this work introduces HexFormer, a hyperbolic vision transformer for image classification that incorporates exponential map aggregation within its attention mechanism. Two designs are explored: a hyperbolic ViT (HexFormer) and a hybrid variant (HexFormer-Hybrid) that combines a hyperbolic encoder with an Euclidean linear classification head. HexFormer incorporates a novel attention mechanism based on exponential map aggregation, which yields more accurate and stable aggregated representations than standard centroid based averaging, showing that simpler approaches retain competitive merit. Experiments across multiple datasets demonstrate consistent performance improvements over Euclidean baselines and prior hyperbolic ViTs, with the hybrid variant achieving the strongest overall results. Additionally, this study provides an analysis of gradient stability in hyperbolic transformers. The results reveal that hyperbolic models exhibit more stable gradients and reduced sensitivity to warmup strategies compared to Euclidean architectures, highlighting their robustness and efficiency in training. Overall, these findings indicate that hyperbolic geometry can enhance vision transformer architectures by improving gradient stability and accuracy. In addition, relatively simple mechanisms such as exponential map aggregation can provide strong practical benefits.
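One common way to realize exponential-map aggregation is to mix attention values in the tangent space at the origin of the Poincaré ball and map the result back, rather than taking a (geometrically invalid) Euclidean centroid of ball points. The sketch below shows that recipe; whether HexFormer aggregates exactly at the origin tangent space is an assumption.

```python
# Hedged sketch of attention aggregation via exp/log maps at the origin
# of a Poincare ball with curvature c.
import torch

def log0(x, c=1.0, eps=1e-6):
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp(min=eps, max=(1 - 1e-5) / sqrt_c)
    return torch.atanh(sqrt_c * norm) * x / (sqrt_c * norm)

def exp0(v, c=1.0, eps=1e-6):
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp(min=eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def exp_map_attention(attn, values, c=1.0):
    """
    attn:   (B, N, N) row-stochastic attention weights
    values: (B, N, D) points on the Poincare ball
    """
    tangent = log0(values, c)       # ball -> tangent space at the origin
    mixed = attn @ tangent          # ordinary convex combination there
    return exp0(mixed, c)           # map the aggregate back onto the ball

B, N, D = 2, 8, 16
attn = torch.softmax(torch.randn(B, N, N), dim=-1)
vals = exp0(torch.randn(B, N, D) * 0.1)   # ensure inputs lie on the ball
print(exp_map_attention(attn, vals).shape)
```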
[174] EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
Binzhu Xie, Shi Qiu, Sicheng Zhang, Yinqiao Wang, Hao Xu, Muzammal Naseer, Chi-Wing Fu, Pheng-Ann Heng
Main category: cs.CV
TL;DR: EgoHandICL is the first in-context learning framework for 3D hand reconstruction in egocentric vision that improves semantic alignment, visual consistency, and robustness under challenging conditions.
Details
Motivation: Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods struggle in unseen contexts despite scaling training data or adding auxiliary cues.Method: EgoHandICL introduces complementary exemplar retrieval guided by vision-language models, an ICL-tailored tokenizer for multimodal context, and a masked autoencoder-based architecture trained with hand-guided geometric and perceptual objectives.
Result: Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. The method demonstrates real-world generalization and improves EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts.
Conclusion: EgoHandICL represents a significant advancement in 3D hand reconstruction for egocentric vision through in-context learning, offering improved robustness and generalization capabilities.
Abstract: Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: https://github.com/Nicous20/EgoHandICL
[175] SONIC: Spectral Oriented Neural Invariant Convolutions
Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch
Main category: cs.CV
TL;DR: SONIC introduces a continuous spectral parameterization for convolutional operators using orientation-selective components, achieving global receptive fields with fewer parameters while improving robustness to transformations.
Details
Motivation: CNNs have limited global context capture and require deep architectures for long-range dependencies, while Vision Transformers lack spatial inductive bias and depend on explicit positional encodings. There's a need for representations that are both structured and global.Method: SONIC uses a continuous spectral parameterization that models convolutional operators with a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions.
Result: Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters.
Conclusion: Continuous, orientation-aware spectral parameterizations provide a principled and scalable alternative to conventional spatial and spectral operators, bridging the gap between local structure and global connectivity.
Abstract: Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
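The core mechanism, a continuous orientation-selective frequency response applied via FFT for a global receptive field, can be sketched as below. The oriented-Gaussian component shape and mixture are illustrative assumptions; SONIC's actual parameterisation is richer.

```python
# Hedged sketch of orientation-selective spectral filtering via FFT:
# a small bank of oriented Gaussian bumps over the full frequency plane,
# mixed by learned weights.
import torch

def oriented_spectral_filter(h, w, thetas, sigma_u=0.3, sigma_v=0.1, f0=0.25):
    """Return (K, h, w//2+1) frequency responses, one per orientation."""
    fy = torch.fft.fftfreq(h).view(h, 1)            # cycles/pixel, vertical
    fx = torch.fft.rfftfreq(w).view(1, w // 2 + 1)  # horizontal (real FFT)
    comps = []
    for th in thetas:
        # Rotate frequency coordinates by the component orientation.
        u = fx * torch.cos(th) + fy * torch.sin(th)
        v = -fx * torch.sin(th) + fy * torch.cos(th)
        comps.append(torch.exp(-((u - f0) ** 2) / (2 * sigma_u ** 2)
                               - (v ** 2) / (2 * sigma_v ** 2)))
    return torch.stack(comps)

def spectral_conv(x, mix):
    """x: (B, 1, H, W); mix: (K,) learned mixture weights."""
    B, _, H, W = x.shape
    thetas = torch.linspace(0, torch.pi, mix.numel() + 1)[:-1]
    bank = oriented_spectral_filter(H, W, thetas)   # (K, H, W//2+1)
    filt = (mix.view(-1, 1, 1) * bank).sum(0)       # combined smooth response
    X = torch.fft.rfft2(x)
    return torch.fft.irfft2(X * filt, s=(H, W))     # global receptive field

print(spectral_conv(torch.randn(2, 1, 32, 32), torch.randn(4)).shape)
```

Because the filter is defined over continuous frequencies, the same components can be re-sampled on any grid, which is what lets such filters adapt across resolutions.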
[176] VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction
Dominic Maggio, Luca Carlone
Main category: cs.CV
TL;DR: VGGT-SLAM 2.0 improves upon VGGT-SLAM with better drift handling, attention-based loop closure verification, and achieves state-of-the-art accuracy with real-time performance.
Details
Motivation: To address limitations in VGGT-SLAM including high-dimensional drift, planar degeneracy, and reconstruction ambiguity, while improving loop closure verification and enabling real-time performance.Method: 1) New factor graph design to remove 15-DoF drift and planar degeneracy while handling reconstruction ambiguity; 2) Leveraging VGGT attention layers for free image retrieval verification to reject false positives and enable more loop closures; 3) Real-time implementation tested on Jetson Thor.
Result: Achieves highest accuracy on TUM dataset with ~23% less pose error than VGGT-SLAM, demonstrates real-time performance on ground robot, works in diverse environments (indoor apartments, offices, 4200 sq ft barn), and can be adapted for open-set object detection.
Conclusion: VGGT-SLAM 2.0 substantially improves SLAM performance through better drift handling and attention-based verification, achieving state-of-the-art accuracy while maintaining real-time capability on embedded hardware.
Abstract: We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system which substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. Firstly, we remove high-dimensional 15-degree-of-freedom drift and planar degeneracy from VGGT-SLAM by creating a new factor graph design while still addressing the reconstruction ambiguity of VGGT given unknown camera intrinsics. Secondly, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image retrieval verification for free, without additional training, which enables both rejecting false-positive matches and completing more loop closures. Finally, we conduct a suite of experiments which includes showing VGGT-SLAM 2.0 can easily be adapted for open-set object detection and demonstrating real-time performance while running online onboard a ground robot using a Jetson Thor. We also test in environments ranging from cluttered indoor apartments and office scenes to a 4,200-square-foot barn, and demonstrate that VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.
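Training-free retrieval verification of this kind typically reduces to comparing pooled features from the chosen attention layer across frames. The sketch below shows that general pattern; which VGGT layer is used and the decision threshold are assumptions here.

```python
# Hedged sketch of loop-closure verification from attention-layer tokens:
# pool per-frame token features into a global descriptor and threshold
# the cosine similarity.
import torch
import torch.nn.functional as F

def verify_loop_closure(tokens_a, tokens_b, thresh=0.8):
    """
    tokens_a, tokens_b: (N, D) token features from a chosen attention layer
    Returns True if the two frames plausibly view the same place.
    """
    desc_a = F.normalize(tokens_a.mean(dim=0), dim=0)   # global descriptor
    desc_b = F.normalize(tokens_b.mean(dim=0), dim=0)
    return (desc_a @ desc_b).item() > thresh

a, b = torch.randn(196, 384), torch.randn(196, 384)
print(verify_loop_closure(a, b))
```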
[177] DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
Shubham Patle, Sara Ghaboura, Hania Tariq, Mohammad Usman Khan, Omkar Thawakar, Rao Muhammad Anwer, Salman Khan
Main category: cs.CV
TL;DR: DuwatBench is a new benchmark dataset for evaluating multimodal AI models on Arabic calligraphy, containing 1,272 samples across 6 calligraphic styles with sentence-level annotations, revealing that current models struggle with artistic variations despite performing well on clean text.
Details
Motivation: Arabic calligraphy represents a rich visual tradition that combines linguistic meaning with artistic form, but multimodal AI models have largely unexplored capabilities in processing stylized Arabic script, creating a gap in culturally grounded multimodal research.Method: Created DuwatBench - a curated dataset of 1,272 samples containing about 1,475 unique words across six classical and modern calligraphic styles, each with sentence-level detection annotations. The dataset captures real-world challenges like complex stroke patterns, dense ligatures, and stylistic variations.
Result: Evaluation of 13 leading Arabic and multilingual multimodal models showed they perform well on clean text but struggle significantly with calligraphic variation, artistic distortions, and precise visual-text alignment in Arabic calligraphy.
Conclusion: By publicly releasing DuwatBench and its evaluation suite, the researchers aim to advance culturally grounded multimodal research, foster fair inclusion of Arabic language and visual heritage in AI systems, and support continued progress in Arabic calligraphy processing.
Abstract: Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, their ability to process Arabic script, especially in artistic and stylized calligraphic forms, remains largely unexplored. To address this gap, we present DuwatBench, a benchmark of 1,272 curated samples containing about 1,475 unique words across six classical and modern calligraphic styles, each paired with sentence-level detection annotations. The dataset reflects real-world challenges in Arabic writing, such as complex stroke patterns, dense ligatures, and stylistic variations that often challenge standard text recognition systems. Using DuwatBench, we evaluated 13 leading Arabic and multilingual multimodal models and showed that while they perform well on clean text, they struggle with calligraphic variation, artistic distortions, and precise visual-text alignment. By publicly releasing DuwatBench and its annotations, we aim to advance culturally grounded multimodal research, foster fair inclusion of the Arabic language and visual heritage in AI systems, and support continued progress in this area. Our dataset (https://huggingface.co/datasets/MBZUAI/DuwatBench) and evaluation suite (https://github.com/mbzuai-oryx/DuwatBench) are publicly available.
[178] Image deblurring based on lightweight multi-information fusion network
Yanni Zhang, Yiming Liu, Qiang Li, Miao Qi, Dahong Xu, Jun Kong, Jianzhong Wang
Main category: cs.CV
TL;DR: Lightweight multi-information fusion network (LMFN) for efficient image deblurring with fewer parameters.
Details
Motivation: Existing deep learning deblurring methods require many parameters, leading to high computational burden. Need lightweight solution that maintains performance.Method: Encoder-decoder architecture with multi-scale information extraction in encoding stage, distillation network in decoding stage for residual learning, and attention-based information fusion between distillation modules and feature channels.
Result: Achieves state-of-the-art deblurring results with smaller number of parameters, outperforms existing methods in model complexity.
Conclusion: LMFN provides an effective lightweight solution for image deblurring that balances performance and computational efficiency through multi-information fusion strategies.
Abstract: Recently, deep learning based image deblurring has been well developed. However, exploiting the detailed image features in a deep learning framework always requires a mass of parameters, which inevitably makes the network suffer from a high computational burden. To solve this problem, we propose a lightweight multi-information fusion network (LMFN) for image deblurring. The proposed LMFN is designed as an encoder-decoder architecture. In the encoding stage, the image feature is reduced to various small-scale spaces for multi-scale information extraction and fusion without a large amount of information loss. Then, a distillation network is used in the decoding stage, which allows the network to benefit the most from residual learning while remaining sufficiently lightweight. Meanwhile, an information fusion strategy between distillation modules and feature channels is also carried out via an attention mechanism. Through fusing different information in the proposed approach, our network achieves state-of-the-art image deblurring results with a smaller number of parameters and outperforms existing methods in model complexity.
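The general mechanism of attention-based fusion between feature channels can be sketched with a squeeze-and-excitation-style gate, as below. LMFN's actual fusion between distillation modules is more involved, so treat this only as the underlying idea.

```python
# Minimal SE-style sketch of channel-attention fusion of two branches.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                         # squeeze: global context
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, distilled, residual):
        fused = distilled + residual                         # merge the two branches
        return fused * self.gate(fused)                      # excite: reweight channels

m = ChannelAttentionFusion(32)
print(m(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)).shape)
```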
[179] GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering
Jiaye Wu, Saeed Hadadan, Geng Lin, Matthias Zwicker, David Jacobs, Roni Sengupta
Main category: cs.CV
TL;DR: GaNI is a two-stage neural inverse rendering technique that reconstructs geometry, albedo, and roughness from images captured with co-located light and camera, addressing limitations of existing methods by handling global illumination and near-field lighting in multi-object scenes.
Details
Motivation: Existing inverse rendering techniques with co-located light-camera focus only on single objects and fail to model global illumination and near-field lighting, which are more prominent in scenes with multiple objects.Method: Two-stage approach: 1) Geometry reconstruction using neural volumetric rendering (NeuS), 2) Inverse neural radiosity using predicted geometry to estimate albedo and roughness. Key innovations include implicit modeling of near-field illumination effects, surface angle loss for specular reflections, light position-aware radiance cache network, and smoothness priors on roughness.
Result: Outperforms existing co-located light-camera-based inverse rendering techniques on both synthetic and real data. Produces significantly better reflectance and slightly better geometry than capture strategies that don’t require a dark room.
Conclusion: GaNI successfully addresses the limitations of existing methods by handling global illumination and near-field lighting in multi-object scenes through a two-stage neural approach with novel technical contributions for both geometry reconstruction and reflectance estimation.
Abstract: In this paper, we present GaNI, a Global and Near-field Illumination-aware neural inverse rendering technique that can reconstruct geometry, albedo, and roughness parameters from images of a scene captured with co-located light and camera. Existing inverse rendering techniques with co-located light-camera focus on single objects only, without modeling global illumination and near-field lighting more prominent in scenes with multiple objects. We introduce a system that solves this problem in two stages: we first reconstruct the geometry powered by neural volumetric rendering (NeuS), followed by inverse neural radiosity (invNeRad) that uses the previously predicted geometry to estimate albedo and roughness. However, such a naive combination fails, and we propose multiple technical contributions that enable this two-stage approach. We observe that NeuS fails to handle near-field illumination and strong specular reflections from the flashlight in a scene. We propose to implicitly model the effects of near-field illumination and introduce a surface angle loss function to handle specular reflections. Similarly, we observe that invNeRad assumes constant illumination throughout the capture and cannot handle moving flashlights during capture. We propose a light position-aware radiance cache network and additional smoothness priors on roughness to reconstruct reflectance. Experimental evaluation on synthetic and real data shows that our method outperforms the existing co-located light-camera-based inverse rendering techniques. Our approach produces significantly better reflectance and slightly better geometry than capture strategies that do not require a dark room.
[180] Joint Diffusion for Universal Hand-Object Grasp Generation
Jinkun Cao, Jingyuan Liu, Kris Kitani, Yi Zhou
Main category: cs.CV
TL;DR: JHOD is a unified diffusion model that generates both hand and object in grasping scenarios, leveraging large-scale object datasets for better generalization to unseen shapes.
Details
Motivation: Current methods for hand grasp generation typically focus on hand-only generation or require separate object models. There's a need for a unified approach that can generate both hand and object together, with better generalization to diverse object shapes beyond limited hand-object grasp datasets.Method: Proposes Joint Hand-Object Diffusion (JHOD) that models hand and object in a unified latent representation. It uses hand-object grasping data to learn plausible grasps and leverages large-scale object datasets to learn inclusive object latent embeddings. The model can generate grasps both unconditionally and conditionally (with or without given object).
Result: The method achieves good visual plausibility and diversity in both conditional and unconditional grasp generation. It generalizes well to unseen object shapes due to the inclusive object representation learned from large-scale datasets, outperforming methods trained only on hand-object grasp data.
Conclusion: JHOD provides an effective unified framework for hand-object grasp generation that benefits from diverse object data, enabling better generalization to novel object shapes while maintaining grasp plausibility and diversity.
Abstract: Predicting and generating human hand grasp over objects is critical for animation and robotic tasks. In this work, we focus on generating both the hand and objects in a grasp by a single diffusion model. Our proposed Joint Hand-Object Diffusion (JHOD) models the hand and object in a unified latent representation. It uses the hand-object grasping data to learn to accommodate hand and object to form plausible grasps. Also, to enforce the generalizability over diverse object shapes, it leverages large-scale object datasets to learn an inclusive object latent embedding. With or without a given object as an optional condition, the diffusion model can generate grasps unconditionally or conditional to the object. Compared to the usual practice of learning object-conditioned grasp generation from only hand-object grasp data, our method benefits from more diverse object data used for training to handle grasp generation more universally. According to both qualitative and quantitative experiments, both conditional and unconditional generation of hand grasp achieves good visual plausibility and diversity. With the extra inclusiveness of object representation learned from large-scale object datasets, the proposed method generalizes well to unseen object shapes.
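One standard way a single diffusion model can serve both conditional and unconditional generation is condition dropout during training (the classifier-free-guidance recipe), sketched below. The latent layout, null embedding, and simplified noising are illustrative assumptions, not JHOD's specification.

```python
# Hedged sketch of optional-condition diffusion training: randomly drop
# the object embedding so one network covers both generation modes.
import torch

def training_step(model, hand_obj_latent, obj_embed, null_embed, t, p_uncond=0.2):
    """One schematic denoising-loss step with optional object conditioning."""
    # With probability p_uncond, swap in a learned null token so the same
    # network also models the unconditional (no given object) case.
    if torch.rand(()) < p_uncond:
        obj_embed = null_embed.expand_as(obj_embed)
    noise = torch.randn_like(hand_obj_latent)
    noisy = hand_obj_latent + t.view(-1, 1) * noise       # schematic noising
    pred = model(noisy, t, obj_embed)
    return torch.nn.functional.mse_loss(pred, noise)

toy = lambda x, t, c: torch.zeros_like(x)   # stand-in denoiser
z = torch.randn(4, 128)                     # joint hand+object latent
c = torch.randn(4, 32)                      # object shape embedding
null = torch.zeros(1, 32)
print(training_step(toy, z, c, null, t=torch.rand(4)))
```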
[181] There and Back Again: On the relation between Noise and Image Inversions in Diffusion Models
Łukasz Staniszewski, Łukasz Kuciński, Kamil Deja
Main category: cs.CV
TL;DR: DDIM inversion for diffusion models creates structured latent encodings that limit editability; replacing early inversion steps with forward diffusion improves latent space quality for editing.
Details
Motivation: Diffusion models lack a low-dimensional, editable latent space. While inversion methods exist to map images to noise, they produce latent encodings with structural patterns that limit manipulation capabilities.Method: Analyze DDIM inversion process, identify that early inversion steps fail to provide accurate/diverse noise, propose simple fix: replace first DDIM inversion steps with forward diffusion process to decorrelate latent encodings.
Result: The proposed method successfully decorrelates latent encodings, enabling higher quality image editions and interpolations compared to prior inversion methods.
Conclusion: Structural patterns in DDIM inversion latents limit editability; replacing early inversion steps with forward diffusion creates better latent space for editing tasks.
Abstract: Diffusion Models achieve state-of-the-art performance in generating new samples but lack a low-dimensional latent space that encodes the data into editable features. Inversion-based methods address this by reversing the denoising trajectory, transferring images to their approximated starting noise. In this work, we thoroughly analyze this procedure and focus on the relation between the initial noise, the generated samples, and their corresponding latent encodings obtained through the DDIM inversion. First, we show that latents exhibit structural patterns in the form of less diverse noise predicted for smooth image areas (e.g., plain sky). Through a series of analyses, we trace this issue to the first inversion steps, which fail to provide accurate and diverse noise. Consequently, the DDIM inversion space is notably less manipulative than the original noise. We show that prior inversion methods do not fully resolve this issue, but our simple fix, where we replace the first DDIM Inversion steps with a forward diffusion process, successfully decorrelates latent encodings and enables higher quality editions and interpolations. The code is available at https://github.com/luk-st/taba.
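The proposed fix is easy to state in code: jump straight to noise level k with the exact forward process (which is decorrelated by construction), then continue the usual deterministic DDIM inversion. Below is a minimal sketch; `eps_model` is a stand-in for a trained noise predictor and the schedule is illustrative.

```python
# Hedged sketch: forward-diffuse to step k, then DDIM-invert the rest.
import torch

def invert(x0, eps_model, alpha_bar, k):
    """Approximate latent for x0, skipping the first k inversion steps."""
    T = len(alpha_bar)
    # 1) Exact forward diffusion to level k.
    x = alpha_bar[k].sqrt() * x0 + (1 - alpha_bar[k]).sqrt() * torch.randn_like(x0)
    # 2) Standard DDIM inversion for the remaining steps k -> T-1.
    for t in range(k, T - 1):
        eps = eps_model(x, t)
        x0_pred = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t + 1].sqrt() * x0_pred + (1 - alpha_bar[t + 1]).sqrt() * eps
    return x

abar = torch.linspace(0.999, 0.01, 50)      # toy cumulative schedule
toy_eps = lambda x, t: torch.zeros_like(x)  # stand-in noise predictor
print(invert(torch.randn(1, 3, 8, 8), toy_eps, abar, k=5).shape)
```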
[182] Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models
Yanchen Wang, Adam Turnbull, Tiange Xiang, Yunlong Xu, Sa Zhou, Adnan Masoud, Shekoofeh Azizi, Feng Vankee Lin, Ehsan Adeli
Main category: cs.CV
TL;DR: A novel fMRI decoding approach using transformer-based encoders and image generative models with whole-brain analysis, achieving 43% improvement in semantic accuracy by incorporating default mode network contributions beyond visual cortex.
Details
Motivation: Traditional neural decoding focuses primarily on visual cortex mapping, but "seeing" involves the entire brain as different scenes evoke various emotions and cognitive states. The authors argue for a whole-brain approach to understanding visual processes.Method: Transformer-based large-scale fMRI encoders and image generative models (encoders & decoders) pre-trained on large public datasets, fine-tuned through Image-fMRI contrastive learning. Uses whole-brain activation maps during visual stimulus exposure.
Result: 43% improvement in predictive semantic accuracy compared to state-of-the-art approaches on BOLD5000 dataset. Network ablation analysis reveals significant contributions from default mode network beyond visual cortex.
Conclusion: Visual experience decoding should extend beyond visual cortex to include whole-brain analysis, with default mode network playing crucial role in semantic processing and sense-making during visual perception.
Abstract: Neural decoding, the process of understanding how brain activity corresponds to different stimuli, has been a primary objective in cognitive sciences. Over the past three decades, advances in functional Magnetic Resonance Imaging (fMRI) and machine learning have greatly improved our ability to map visual stimuli to brain activity, especially in the visual cortex. Concurrently, research has expanded to decode more complex processes, such as language and memory across the whole brain, using techniques to handle greater variability and improve signal accuracy. We argue that “seeing” involves more than just mapping visual stimuli onto the visual cortex; it engages the entire brain, as various emotions and cognitive states can emerge from observing different scenes. In this paper, we develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps while individuals are exposed to visual stimuli. We utilize transformer-based large-scale fMRI encoders and Image generative models (encoders & decoders) pre-trained on large public datasets, which are then fine-tuned through Image-fMRI contrastive learning. Our models can decode visual experience across the entire cerebral cortex, surpassing the traditional confines of the visual cortex. Using a public dataset (BOLD5000), we first compare our method with state-of-the-art approaches for decoding visual processing and show improved predictive semantic accuracy by 43%. A network ablation analysis suggests that, beyond the visual cortex, the default mode network contributes significantly to stimulus decoding, in line with the proposed role of this network in sense-making and semantic processing.
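The general recipe behind Image-fMRI contrastive learning is a symmetric InfoNCE objective over paired embeddings, sketched below; the encoders and temperature are assumptions, not the paper's exact training details.

```python
# Minimal sketch of CLIP-style symmetric contrastive alignment between
# image and fMRI embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, fmri_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    fmri = F.normalize(fmri_emb, dim=-1)
    logits = img @ fmri.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(len(img))              # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

print(clip_style_loss(torch.randn(8, 256), torch.randn(8, 256)))
```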
[183] CMOOD: Concept-based Multi-label OOD Detection
Zhendong Liu, Yi Nian, Yuehan Qin, Henry Peng Zou, Li Li, Xiyang Hu, Yue Zhao
Main category: cs.CV
TL;DR: COOD is a zero-shot multi-label OOD detection framework that uses vision-language models with concept-based label expansion to handle complex label dependencies without retraining.
Details
Motivation: Existing OOD detection methods fail in multi-label settings where samples have multiple interdependent labels and complex semantic relationships. Current approaches require extensive training data and don't generalize to unseen label combinations, while LLM-based methods focus only on single-label scenarios.Method: COOD leverages pre-trained vision-language models with a concept-based label expansion strategy and a new scoring function. It enriches the semantic space with both positive and negative concepts for each label to model complex label dependencies, enabling precise OOD detection without additional training.
Result: The method achieves approximately 95% average AUROC on both VOC and COCO datasets, significantly outperforming existing approaches. It maintains robust performance across varying numbers of labels and different types of OOD samples.
Conclusion: COOD successfully addresses the critical gap in zero-shot multi-label OOD detection by modeling complex label dependencies through concept-based expansion, demonstrating strong performance without requiring extensive retraining or large datasets.
Abstract: How can models effectively detect out-of-distribution (OOD) samples in complex, multi-label settings without extensive retraining? Existing OOD detection methods struggle to capture the intricate semantic relationships and label co-occurrences inherent in multi-label settings, often requiring large amounts of training data and failing to generalize to unseen label combinations. While large language models have revolutionized zero-shot OOD detection, they primarily focus on single-label scenarios, leaving a critical gap in handling real-world tasks where samples can be associated with multiple interdependent labels. To address these challenges, we introduce COOD, a novel zero-shot multi-label OOD detection framework. COOD leverages pre-trained vision-language models, enhancing them with a concept-based label expansion strategy and a new scoring function. By enriching the semantic space with both positive and negative concepts for each label, our approach models complex label dependencies, precisely differentiating OOD samples without the need for additional training. Extensive experiments demonstrate that our method significantly outperforms existing approaches, achieving approximately 95% average AUROC on both VOC and COCO datasets, while maintaining robust performance across varying numbers of labels and different types of OOD samples.
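A hedged sketch of concept-based multi-label scoring with a CLIP-like model: each label contributes a set of positive and negative concept embeddings, the image's concept-contrasted evidence is computed per label, and the strongest label keeps a sample in-distribution. The aggregation below is illustrative; the exact COOD scoring function is not reproduced here.

```python
# Hedged sketch of concept-based multi-label OOD scoring.
import torch
import torch.nn.functional as F

def cood_style_score(img_emb, pos_concepts, neg_concepts):
    """
    img_emb:      (D,) image embedding
    pos_concepts: (L, K, D) K positive-concept embeddings per label
    neg_concepts: (L, K, D) K negative-concept embeddings per label
    Higher score -> more in-distribution for the multi-label space.
    """
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_concepts, dim=-1)
    neg = F.normalize(neg_concepts, dim=-1)
    pos_sim = (pos @ img).max(dim=-1).values     # (L,) best positive evidence
    neg_sim = (neg @ img).max(dim=-1).values     # (L,) best negative evidence
    per_label = pos_sim - neg_sim                # concept-contrasted evidence
    return per_label.max().item()                # any strong label keeps it ID

print(cood_style_score(torch.randn(512),
                       torch.randn(20, 5, 512), torch.randn(20, 5, 512)))
```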
[184] Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, Jingdong Wang
Main category: cs.CV
TL;DR: EDC leverages off-the-shelf visual specialists to enhance image captions by incorporating fine-grained object attributes and relations, improving descriptive quality for better visual understanding and reasoning.
Details
Motivation: Existing methods for generating descriptive image captions for training Large Multimodality Models often lack precision and granularity, especially for complex visual reasoning tasks. Current approaches using distillation from pretrained LMMs, internet images, or human annotation fall short in providing detailed visual understanding.Method: EDC uses off-the-shelf visual specialists trained on annotated images (not originally for captioning) to extract object attributes (depth, emotion, fine-grained categories) and object relations (relative location, human-object-interaction). These rich attributes are systematically integrated into descriptive captions.
Result: EDC significantly improves descriptive quality of captions, providing deeper and more nuanced visual understanding. Experiments show visual specialists enhance performance on visual understanding tasks and reasoning that benefits from accurate visual perception.
Conclusion: Leveraging existing visual specialists for attribute extraction is an effective approach to enhance image caption quality, addressing limitations of current caption generation methods and improving multimodal model training.
Abstract: Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them from publicly available internet images, or even generating them through human annotation. However, these strategies can fall short in terms of precision and granularity, particularly when dealing with complex visual reasoning tasks. In this paper, we propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, to enhance the image caption. Our approach, named EDC, explores object low-level and fine-grained attributes (e.g., depth, emotion and fine-grained categories) and object relations (e.g., relative location and human-object-interaction (HOI)), and combines these attributes into the descriptive caption. By systematically integrating these rich attributes into the generated captions, EDC significantly improves the descriptive quality of the captions, providing a deeper and more nuanced understanding of the visual content. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding. The complete source code of the EDC pipeline and the datasets will be available at https://github.com/syp2ysy/DCE.
[185] Revealing Subtle Phenotypes in Small Microscopy Datasets Using Latent Diffusion Models
Anis Bourou, Biel Castaño Segade, Thomas Boyer, Valérie Mezger, Auguste Genovesio
Main category: cs.CV
TL;DR: Using pre-trained latent diffusion models to detect subtle phenotypic variations in cellular images with limited data and computational resources.
Details
Motivation: Subtle phenotypic variations in cellular images are crucial for biological research and drug discovery but are often masked by cellular heterogeneity. While diffusion models show promise for revealing these nuances, they typically require large datasets and substantial computational resources that are often unavailable in biological research settings.Method: Proposes leveraging pre-trained latent diffusion models to uncover subtle phenotypic changes in cellular microscopy images. The approach is designed to work effectively with small datasets and limited computational resources.
Result: Validated qualitatively and quantitatively on several small microscopy image datasets. The approach successfully enables effective detection of phenotypic variations, capturing both visually apparent and imperceptible differences.
Conclusion: The approach demonstrates promising potential for phenotype detection in contexts constrained by limited data and computational capacity, offering a practical solution for biological research where resources are often limited.
Abstract: Identifying subtle phenotypic variations in cellular images is critical for advancing biological research and accelerating drug discovery. These variations are often masked by the inherent cellular heterogeneity, making it challenging to distinguish differences between experimental conditions. Recent advancements in deep generative models have demonstrated significant potential for revealing these nuanced phenotypes through image translation, opening new frontiers in cellular and molecular biology as well as the identification of novel biomarkers. Among these generative models, diffusion models stand out for their ability to produce high-quality, realistic images. However, training diffusion models typically requires large datasets and substantial computational resources, both of which can be limited in biological research. In this work, we propose a novel approach that leverages pre-trained latent diffusion models to uncover subtle phenotypic changes. We validate our approach qualitatively and quantitatively on several small datasets of microscopy images. Our findings reveal that our approach enables effective detection of phenotypic variations, capturing both visually apparent and imperceptible differences. Ultimately, our results highlight the promising potential of this approach for phenotype detection, especially in contexts constrained by limited data and computational capacity.
[186] Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan
Main category: cs.CV
TL;DR: MLLMs lack foundational visual cognition capabilities that humans develop through bottom-up hierarchy, scoring only 30% on VisFactor benchmark testing 20 vision-centric subtasks from cognitive psychology.
Details
Motivation: Current MLLMs are trained directly on complex downstream tasks, bypassing foundational visual capabilities that humans develop through bottom-up hierarchy (basic primitives → Gestalt principles → high-level semantics). There's a systematic gap between human visual cognition and MLLM capabilities that needs investigation.Method: Introduced VisFactor benchmark that digitizes 20 vision-centric subtests from FRCT (cognitive psychology assessment) spanning four domains of human visual cognition. Designed algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Evaluated 23 frontier MLLMs including proprietary (GPT, Gemini) and open-source (LLaMA, Qwen) models.
Result: Best model achieved only 30.17% score. Models consistently failed on tasks like mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. Performance improvements on existing general benchmarks might not represent genuine mastery of human-like visual cognition.
Conclusion: MLLMs lack fundamental visual cognition capabilities that humans develop through hierarchical perception. Current benchmark performance improvements may be misleading, and models need to develop foundational visual skills similar to human perceptual development.
Abstract: Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT, a well-established cognitive psychology assessment spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 23 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 30.17%. Models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance improvements on existing general benchmarks might represent castles in the air instead of a genuine mastery of human-like visual cognition.
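In the spirit of the paper's unlimited, difficulty-controlled case construction, a programmatic test-item generator can be sketched as below for mental rotation: the probe is either a rotated copy (positive) or a mirrored-and-rotated copy (negative). This is illustrative only; VisFactor's actual generators and validation procedures are not reproduced, and real generators must also validate that foils do not coincide with the reference (e.g., for symmetric shapes).

```python
# Hedged sketch of a mental-rotation item generator with a simple
# difficulty knob (number of 90-degree turns).
import numpy as np

def make_rotation_item(size=6, difficulty=1, rng=None):
    """Return (reference, probe, is_same)."""
    rng = rng or np.random.default_rng()
    shape = (rng.random((size, size)) < 0.4).astype(int)
    is_same = bool(rng.integers(0, 2))
    probe = shape if is_same else np.fliplr(shape)   # mirror image as the foil
    probe = np.rot90(probe, k=difficulty)            # more rotation -> harder
    return shape, probe, is_same

ref, probe, label = make_rotation_item(difficulty=2)
print(ref.shape, probe.shape, label)
```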
[187] Panoramic Distortion-Aware Tokenization for Person Detection and Localization in Overhead Fisheye Images
Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita
Main category: cs.CV
TL;DR: A novel person detection method for overhead fisheye images that addresses both person rotation and small person detection by remapping to equirectangular panoramas and using panoramic distortion-aware tokenization.
Details
Motivation: Person detection in overhead fisheye images faces two main challenges: person rotation and small person size. Prior work mainly addressed rotation, leaving the small-person problem underexplored. Conventional detection methods tend to miss smaller persons because they favor larger persons that dominate attention maps.Method: 1) Remap fisheye images to equirectangular panoramas to handle rotation. 2) Exploit panoramic geometry where apparent person height decreases linearly with vertical angle near the top. 3) Introduce panoramic distortion-aware tokenization that divides panoramic features using self-similar figures for optimal gap-free divisions. 4) Use maximum significance values in each tile to preserve significance areas of smaller persons. 5) Propose a transformer-based person detection and localization method combining these techniques.
Result: Extensive experiments demonstrated that the proposed method outperforms conventional methods on large-scale datasets, showing improved detection of small persons in overhead fisheye images.
Conclusion: The combination of panoramic-image remapping and distortion-aware tokenization effectively addresses both rotation and small-person detection challenges in overhead fisheye imagery, providing superior performance compared to conventional approaches.
Abstract: Person detection in overhead fisheye images is challenging due to person rotation and small persons. Prior work has mainly addressed person rotation, leaving the small-person problem underexplored. We remap fisheye images to equirectangular panoramas to handle rotation and exploit panoramic geometry to handle small persons more effectively. Conventional detection methods tend to favor larger persons because they dominate the attention maps, causing smaller persons to be missed. In hemispherical equirectangular panoramas, we find that apparent person height decreases approximately linearly with the vertical angle near the top of the image. Using this finding, we introduce panoramic distortion-aware tokenization to enhance the detection of small persons. This tokenization procedure divides panoramic features using self-similar figures that enable the determination of optimal divisions without gaps, and we leverage the maximum significance values in each tile of the token groups to preserve the significance areas of smaller persons. We propose a transformer-based person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets.
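The core idea of distortion-aware tiling can be sketched simply: row tile heights shrink roughly linearly toward the top of the panorama (where persons appear smaller), and each tile keeps its maximum significance value so small persons are not averaged away. The self-similar division scheme in the paper is more elaborate; the linear row-height rule and tile counts below are illustrative assumptions.

```python
# Hedged sketch of distortion-aware tiling with per-tile max pooling.
import numpy as np

def row_heights(H, n_rows, min_frac=0.25):
    """Linearly increasing tile heights from panorama top to bottom."""
    w = np.linspace(min_frac, 1.0, n_rows)            # relative row sizes
    h = np.maximum(1, np.round(w / w.sum() * H)).astype(int)
    h[-1] += H - h.sum()                              # make rows cover H exactly
    return h

def tile_max_pool(signif, n_rows=6, n_cols=8):
    """signif: (H, W) significance map -> (n_rows, n_cols) tile maxima."""
    H, W = signif.shape
    out = np.zeros((n_rows, n_cols))
    y = 0
    for i, h in enumerate(row_heights(H, n_rows)):
        for j, cols in enumerate(np.array_split(np.arange(W), n_cols)):
            out[i, j] = signif[y:y + h, cols].max()   # preserve small-person peaks
        y += h
    return out

print(tile_max_pool(np.random.rand(96, 128)).shape)
```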
[188] Uni-PrevPredMap: Extending PrevPredMap to a Unified Framework of Prior-Informed Modeling for Online Vectorized HD Map Construction
Nan Peng, Xun Zhou, Mingming Wang, Guisong Chen, Wenqi Xu
Main category: cs.CV
TL;DR: Uni-PrevPredMap is a unified framework that integrates temporal perception buffers with corrupted HD maps for robust online vectorized HD map construction, achieving state-of-the-art performance in map-absent scenarios.
Details
Motivation: Safety is critical for autonomous driving, requiring maximal use of available prior information. The paper identifies that temporal perception buffers and cost-efficient HD maps form complementary prior sources for online map construction.Method: Proposes Uni-PrevPredMap with a tri-mode paradigm (non-prior, temporal-prior, and temporal-map-fusion modes) that maintains operational consistency. Also develops a tile-indexed 3D vectorized global map processor for efficient 3D prior data refreshment, storage, and retrieval.
Result: Achieves state-of-the-art map-absent performance on established online vectorized HD map construction benchmarks. Demonstrates robust error-resilient prior fusion when provided with corrupted HD maps, confirming synergistic complementarity between temporal predictions and imperfect map data.
Conclusion: The framework successfully integrates previous predictions with corrupted HD maps, providing robust performance in both map-present and map-absent scenarios while decoupling from ideal map assumptions.
Abstract: Safety constitutes a foundational imperative for autonomous driving systems, necessitating maximal incorporation of accessible prior information. This study establishes that temporal perception buffers and cost-efficient high-definition (HD) maps inherently form complementary prior sources for online vectorized HD map construction. We present Uni-PrevPredMap, a pioneering unified framework systematically integrating previous predictions with corrupted HD maps. Our framework introduces a tri-mode paradigm maintaining operational consistency across non-prior, temporal-prior, and temporal-map-fusion modes. This tri-mode paradigm simultaneously decouples the framework from ideal map assumptions while ensuring robust performance in both map-present and map-absent scenarios. Additionally, we develop a tile-indexed 3D vectorized global map processor enabling efficient 3D prior data refreshment, compact storage, and real-time retrieval. Uni-PrevPredMap achieves state-of-the-art map-absent performance across established online vectorized HD map construction benchmarks. When provided with corrupted HD maps, it exhibits robust capabilities in error-resilient prior fusion, empirically confirming the synergistic complementarity between temporal predictions and imperfect map data. Code is available at https://github.com/pnnnnnnn/Uni-PrevPredMap.
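The tile-indexed global map processor is easiest to picture as a spatial hash. The toy Python sketch below illustrates the idea; the class name, tile size, and polyline format are all assumptions rather than the released code:

```python
import math

class TileMap:
    """Toy tile-indexed store for vectorized map elements: polylines are
    bucketed by coarse XY tile so refreshment and retrieval only touch the
    tiles around the ego vehicle."""
    def __init__(self, tile_size=30.0):
        self.tile_size = tile_size
        self.tiles = {}                          # (ix, iy) -> list of polylines

    def _key(self, x, y):
        return (math.floor(x / self.tile_size), math.floor(y / self.tile_size))

    def insert(self, polyline):                  # polyline: [(x, y, z), ...]
        self.tiles.setdefault(self._key(*polyline[0][:2]), []).append(polyline)

    def query(self, x, y, radius=1):             # fetch tiles around the ego pose
        ix, iy = self._key(x, y)
        found = []
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                found.extend(self.tiles.get((ix + dx, iy + dy), []))
        return found
```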
[189] R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors
Haoyang Wang, Liming Liu, Peiheng Wang, Junlin Hao, Jiangkai Wu, Xinggong Zhang
Main category: cs.CV
TL;DR: A framework that uses diffusion models to improve sparse-view mesh reconstruction by filtering unreliable generations and adaptively selecting viewpoints.
Details
Motivation: Mesh reconstruction from multi-view images degrades significantly under sparse-view conditions, especially in unseen regions. While diffusion models can synthesize novel views, their outputs often have visual artifacts and lack 3D consistency, making them unreliable for mesh optimization.
Method: 1) Consensus Diffusion Module filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion. 2) Online reinforcement learning strategy based on Upper Confidence Bound (UCB) adaptively selects the most informative viewpoints guided by diffusion loss. 3) Fused images jointly supervise a NeRF-based model alongside sparse-view ground truth.
Result: Extensive experiments demonstrate significant improvements in both geometric quality and rendering quality compared to existing methods.
Conclusion: The proposed framework effectively leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner, addressing instability issues through consensus filtering and adaptive viewpoint selection.
Abstract: Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
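Two ingredients of this pipeline have compact textbook forms, sketched below under our own assumptions (the 1.5×IQR fence, the plain-mean fusion, and the reward bookkeeping are illustrative, not the authors' code):

```python
import math
import numpy as np

def iqr_consensus(samples):
    """Consensus filtering sketch: given N diffusion samples of one view,
    shaped (N, H, W, C), mask per-pixel outliers outside the usual
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] fences and average the survivors. The
    paper's variance-aware fusion is approximated here by a plain mean."""
    q1, q3 = np.percentile(samples, [25, 75], axis=0)
    iqr = q3 - q1
    keep = (samples >= q1 - 1.5 * iqr) & (samples <= q3 + 1.5 * iqr)
    return np.nanmean(np.where(keep, samples, np.nan), axis=0)

def ucb_select(stats, c=1.0):
    """UCB-style viewpoint picker: stats maps view -> (n_pulls, mean_reward),
    with the reward derived from the diffusion loss as in the summary."""
    total = max(1, sum(n for n, _ in stats.values()))
    untried = [v for v, (n, _) in stats.items() if n == 0]
    if untried:
        return untried[0]                       # explore untried views first
    return max(stats, key=lambda v: stats[v][1]
               + c * math.sqrt(math.log(total) / stats[v][0]))
```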
[190] A Genealogy of Foundation Models in Remote Sensing
Kevin Lane, Morteza Karimzadeh
Main category: cs.CV
TL;DR: This paper surveys foundation models for remote sensing, analyzing approaches adapted from computer vision, discussing representation quality and computational efficiency, and emphasizing multi-sensor integration and future directions.
Details
Motivation: Foundation models are gaining attention for remote sensing representation learning, but the field is still developing with competing approaches. The paper aims to examine these approaches, characterize their advantages and pitfalls, and outline future directions to improve remote sensing-specific foundation models.
Method: The paper conducts a comprehensive survey and analysis of existing foundation model approaches in remote sensing. It examines single-sensor models to establish context, then focuses on multi-sensor integration, comparing how existing approaches leverage multiple sensors relative to multi-modal foundation models. The analysis includes representation quality assessment and methods to reduce computational requirements.
Result: The paper provides a systematic examination of remote sensing foundation models, identifying current approaches, their computer vision roots, and their effectiveness in leveraging remote sensing data characteristics. It highlights the importance of multi-sensor integration and identifies gaps in how existing models handle the multi-sensor aspect of Earth observations.
Conclusion: The paper concludes by identifying opportunities to better harness vast amounts of unlabeled, seasonal, and multi-sensor remote sensing data for foundation model development. It outlines future research directions to improve remote sensing-specific foundation models, particularly emphasizing the need for more effective multi-sensor integration approaches.
Abstract: Foundation models have garnered increasing attention for representation learning in remote sensing. Many such foundation models adopt approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches for how to most effectively leverage remotely sensed data. This paper examines these approaches, along with their roots in the computer vision field, in order to characterize potential advantages and pitfalls, while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We first examine single-sensor remote sensing foundation models to introduce concepts and provide context, and then place emphasis on incorporating the multi-sensor aspect of Earth observations into foundation models. In particular, we explore the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
[191] Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map
Emanuele Caruso, Alessandro Simoni, Francesco Pelosin
Main category: cs.CV
TL;DR: A diffusion-based pipeline for generating high-fidelity synthetic industrial defect segmentation datasets with minimal supervision, using enriched bounding box conditioning to produce accurate masks.
Details
Motivation: Industrial defect segmentation requires highly accurate labels, but acquiring real-world defect data is costly and time-consuming. Synthetic dataset generation for industrial applications remains underexplored.
Method: Proposes a novel diffusion-based pipeline that conditions the diffusion model on enriched bounding box representations to generate precise segmentation masks, ensuring realistic and accurately localized defect synthesis.
Result: The approach improves defect consistency and spatial accuracy compared to existing layout-conditioned generative methods. Introduces two quantitative metrics and demonstrates effectiveness through downstream segmentation tasks trained on real and synthetic data.
Conclusion: Diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, enabling more reliable and cost-efficient segmentation models for industrial defect detection.
Abstract: Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at https://github.com/covisionlab/diffusion_labeling.
[192] ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization
Bo Du, Xuekang Zhu, Xiaochen Ma, Chenfan Qu, Kaiwen Feng, Zhe Yang, Chi-Man Pun, Jian Liu, Ji-Zhe Zhou
Main category: cs.CV
TL;DR: ForensicHub is the first unified benchmark & codebase for all-domain fake image detection and localization, addressing fragmentation across deepfake detection, image manipulation detection, AI-generated image detection, and document image manipulation domains.
Details
Motivation: The FIDL field is highly fragmented with four domains operating independently without interoperability, preventing cross-domain comparisons and hindering overall field development due to domain silos, separate datasets, models, and evaluation protocols.
Method: Proposes ForensicHub with: 1) modular configuration-driven architecture decomposing forensic pipelines into interchangeable components; 2) implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC/Doc, integrates 2 existing benchmarks via adapter design; 3) provides unified evaluation framework.
Result: ForensicHub successfully breaks domain silos by providing the first unified benchmark for all FIDL domains, enabling cross-domain comparisons and analysis, with 8 key actionable insights into model architecture, dataset characteristics, and evaluation standards.
Conclusion: ForensicHub represents a significant advancement in unifying the fragmented FIDL field, enabling interoperability across domains and providing a foundation for future research breakthroughs in fake image detection and localization.
Abstract: The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL is still missing. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations in dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates the 2 existing benchmarks DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.
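A registry plus a config dict is the usual way to realize such a modular, configuration-driven architecture. The sketch below shows the pattern in miniature; every name is illustrative, not ForensicHub's actual API:

```python
# Configuration-driven component registry in the spirit of ForensicHub's
# interchangeable datasets / transforms / models / evaluators.
REGISTRY = {"dataset": {}, "transform": {}, "model": {}, "evaluator": {}}

def register(kind, name):
    def deco(cls):
        REGISTRY[kind][name] = cls
        return cls
    return deco

@register("model", "toy_detector")
class ToyDetector:
    def __call__(self, image):
        return {"fake_prob": 0.5}   # placeholder forensic prediction

def build_pipeline(config):
    """Instantiate one component per role from a plain config dict."""
    return {kind: REGISTRY[kind][name]() for kind, name in config.items()}

pipeline = build_pipeline({"model": "toy_detector"})
```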
[193] Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
Main category: cs.CV
TL;DR: RoboMaster: A novel framework for generating robotic manipulation videos by decomposing multi-object interactions into three sub-stages to address feature entanglement issues in overlapping regions.
Details
Motivation: Existing video diffusion models for robotic decision-making data struggle with multi-object interactions crucial for complex manipulation, particularly due to entangled features in overlapping regions that degrade visual fidelity.
Method: Models inter-object dynamics via collaborative trajectory formulation, decomposing interaction process into three sub-stages: pre-interaction (robotic arm dominant), interaction (manipulated object dominant), and post-interaction (robotic arm dominant). Incorporates appearance- and shape-aware latent representations for subject semantic consistency.
Result: Establishes new state-of-the-art performance on challenging Bridge dataset, RLBench, and SIMPLER benchmarks for trajectory-controlled video generation in robotic manipulation.
Conclusion: RoboMaster effectively addresses multi-object feature fusion issues in prior work and demonstrates superior performance in generating realistic robotic manipulation videos with fine-grained trajectory control.
Abstract: Recent advances in video diffusion models show promise for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing methods primarily focus on individual object motion and struggle to capture the multi-object interaction crucial in complex manipulation. This limitation arises from entangled features in overlapping regions, leading to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics via a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction, and to model each phase using the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction. This design effectively alleviates the multi-object feature fusion issue in prior work. To further ensure subject semantic consistency across the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge dataset, as well as the RLBench and SIMPLER benchmarks, demonstrate that our method establishes new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation. Project Page: https://fuxiao0719.github.io/projects/robomaster/
[194] MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang
Main category: cs.CV
TL;DR: MLVTG is a novel framework for Video Temporal Grounding that uses MambaAligner (Vision Mamba blocks) for temporal modeling and LLMRefiner (frozen LLM layers) for semantic alignment, achieving SOTA performance on multiple benchmarks.
Details
Motivation: Existing Transformer-based methods for Video Temporal Grounding suffer from redundant attention and suboptimal multi-modal alignment between video and language queries, limiting localization accuracy.
Method: Proposes MLVTG with two key modules: (1) MambaAligner uses stacked Vision Mamba blocks as backbone to model temporal dependencies via structured state-space dynamics, and (2) LLMRefiner leverages frozen layers of pre-trained LLMs to transfer semantic priors for enhanced multi-modal alignment without fine-tuning.
Result: Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate state-of-the-art performance, significantly outperforming existing baselines.
Conclusion: MLVTG effectively addresses limitations of Transformer-based methods through dual alignment strategy combining temporal modeling with Mamba blocks and semantic purification with LLM priors, enabling more precise video temporal grounding.
Abstract: Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.
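The LLMRefiner idea, routing features through a frozen pretrained block wrapped in small trainable projections, can be sketched in a few lines of PyTorch. Here a plain TransformerEncoderLayer stands in for the real LLM layer, and all sizes are assumed:

```python
import torch.nn as nn

class LLMRefinerSketch(nn.Module):
    """Illustrative stand-in: fused video-text tokens pass through a frozen
    pretrained block (a plain encoder layer here, not an actual LLM) so
    semantic priors are injected without fine-tuning the block itself."""
    def __init__(self, dim=256):
        super().__init__()
        self.frozen_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        for p in self.frozen_block.parameters():
            p.requires_grad = False          # priors transferred, never updated
        self.proj_in = nn.Linear(dim, dim)   # trainable adapters around it
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (batch, tokens, dim)
        return self.proj_out(self.frozen_block(self.proj_in(x)))
```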
[195] Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better
Ruojing Li, Wei An, Yingqian Wang, Xinyi Ying, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu
Main category: cs.CV
TL;DR: DeepPro redefines IRST detection as 1D temporal anomaly detection, using global temporal saliency instead of spatial/short-term temporal info, achieving SOTA performance with high efficiency.
Details
Motivation: Current learning-based IRST detection methods use spatial and short-term temporal information but suffer from unreliable performance in complex conditions and computational redundancy. The paper explores whether more essential information exists in a different domain for better detection.
Method: 1) Theoretical analysis reveals global temporal saliency/correlation in temporal profiles is superior for distinguishing targets. 2) Built first prediction attribution tool to verify temporal profile importance. 3) Remodeled IRST detection as 1D signal anomaly detection. 4) Proposed DeepPro network that only performs calculations in time dimension.
Result: DeepPro outperforms existing SOTA methods on widely-used benchmarks with extremely high efficiency. Achieves significant improvement on dim targets and in complex scenarios. Provides new modeling domain, insight, method, and performance.
Conclusion: The paper demonstrates that focusing on global temporal saliency in temporal profiles is more effective than spatial/short-term temporal approaches for IRST detection. DeepPro’s 1D temporal approach offers superior performance and efficiency, promoting new directions in IRST detection development.
Abstract: Infrared small target (IRST) detection is challenging in simultaneously achieving precise, robust, and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage "more" information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the "more essential" information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.
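To see why the 1D reframing is cheap, consider a training-free caricature: score each pixel purely by its own temporal profile. DeepPro itself is a learned network, so the z-score below only illustrates the "time-dimension-only" computation pattern, not the method:

```python
import numpy as np

def temporal_anomaly_map(clip, eps=1e-6):
    """Toy 1D anomaly scoring: for a clip of shape (T, H, W), measure how far
    each pixel's latest value sits from its own temporal mean, in units of
    its temporal std. High scores mark temporally salient (target-like)
    pixels; all spatial structure is ignored by design."""
    mu = clip.mean(axis=0)
    sigma = clip.std(axis=0) + eps
    return (clip[-1] - mu) / sigma
```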
[196] A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation
Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao
Main category: cs.CV
TL;DR: IDGBR framework combines discriminative and diffusion-based generative learning to refine segmentation boundaries in remote sensing images, addressing limitations of discriminative models in capturing high-frequency details.
Details
Motivation: Remote sensing semantic segmentation requires both semantic correctness (low-frequency) and precise boundary localization (high-frequency). Discriminative models excel at low-frequency features but struggle with high-frequency boundaries, while diffusion models are good at generating high-frequency details but lack semantic inference for low-frequency features.
Method: IDGBR framework: 1) Generate coarse segmentation map using discriminative backbone model; 2) Feed coarse map and original image into conditioning guidance network to learn guidance representation; 3) Use iterative denoising diffusion process to refine coarse segmentation guided by learned representation.
Result: Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm the framework’s capability for consistent boundary refinement of coarse results from diverse discriminative architectures.
Conclusion: The integration of discriminative and diffusion-based generative learning effectively addresses the boundary refinement problem in remote sensing semantic segmentation by leveraging strengths of both approaches.
Abstract: Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model’s ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework’s capability of consistent boundary refinement for coarse results from diverse discriminative architectures.
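The three-stage pipeline reduces to a short loop. The sketch below uses assumed function names and signatures to show the control flow only, not the authors' implementation:

```python
import torch

def refine_segmentation(coarse, image, guidance_net, denoiser, steps=50):
    """Pipeline-level sketch of IDGBR-style refinement: a guidance
    representation built from the coarse map and the image conditions an
    iterative denoising pass that sharpens boundaries."""
    g = guidance_net(coarse, image)      # joint guidance representation
    x = torch.randn_like(coarse)         # start from noise in label space
    for t in reversed(range(steps)):
        x = denoiser(x, t, g)            # one reverse-diffusion step
    return x
```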
[197] Diffusion models for multivariate subsurface generation and efficient probabilistic inversion
Roberto Miele, Niklas Linde
Main category: cs.CV
TL;DR: Diffusion models outperform VAEs/GANs for multivariate subsurface modeling, with improved conditioning corrections for hard and seismic data, enabling faster probabilistic inversion.
Details
Motivation: To enhance multivariate subsurface modeling and probabilistic inversion capabilities, addressing limitations of existing generative models (VAEs, GANs) and improving conditioning approaches for geological applications.
Method: Proposes corrections to Diffusion Posterior Sampling (DPS) approach, including a likelihood approximation accounting for diffusion's inherent noise contamination. Tests on multivariate geological scenarios with facies and acoustic impedance using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data).
Result: Significantly improved statistical robustness, enhanced posterior sampling, and reduced computational costs compared to original DPS. Method works with both hard and indirect conditioning data individually or simultaneously, with faster inversion than outer-loop methods like MCMC.
Conclusion: Diffusion models provide superior multivariate subsurface modeling with robust conditioning corrections, enabling efficient probabilistic inversion that outperforms existing generative approaches and traditional inversion methods.
Abstract: Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise-contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer-loop around the generative model, such as Markov chain Monte Carlo.
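A schematic DPS-style guidance step helps locate where such a correction enters. In the sketch below, the data misfit is down-weighted by the current noise level as a stand-in for the paper's noise-aware likelihood approximation; all callables and the exact weighting are assumptions:

```python
import torch

def dps_guided_step(x_t, x_prev_uncond, y, predict_x0, forward_op,
                    sigma_t, zeta=1.0):
    """Schematic guidance step in the style of Diffusion Posterior Sampling
    (Chung et al., 2023). Dividing by (1 + sigma_t**2) mimics trusting the
    data less at noisy timesteps; it is our illustrative choice, not the
    paper's derived likelihood."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t)                    # denoiser's clean estimate
    misfit = ((y - forward_op(x0_hat)) ** 2).sum() / (1.0 + sigma_t ** 2)
    grad, = torch.autograd.grad(misfit, x_t)
    return x_prev_uncond - zeta * grad          # nudge the ancestral update
```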
[198] Leveraging Convolutional and Graph Networks for an Unsupervised Remote Sensing Labelling Tool
Tulsi Patel, Mark W. Jones, Thomas Redfern
Main category: cs.CV
TL;DR: Unsupervised pipeline for labeling similar geographical areas in Sentinel-2 imagery using segmentation with CNN and GNN for robust feature encoding.
Details
Motivation: Remote sensing labeling is time-consuming and expensive, requiring expert analysis. Previous methods rely on pre-labeled training data, limiting their applicability to new unseen data.
Method: Combines segmentation into homogeneous pixel regions based on color and spatial similarity, then uses convolutional and graph neural networks to encode robust feature representations that preserve local information while aggregating neighborhood context.
Result: Achieves high contextual consistency with SSIM = 0.96 and SAM = 0.21 scores, demonstrating robust feature space organization for interactive labeling with reduced outliers.
Conclusion: The unsupervised pipeline enables granular, rotationally invariant semantic labeling of remote sensing imagery without requiring pre-labeled training data, overcoming limitations of previous supervised approaches.
Abstract: Machine learning for remote sensing imaging relies on up-to-date and accurate labels for model training and testing. Labelling remote sensing imagery is time and cost intensive, requiring expert analysis. Previous labelling tools rely on pre-labelled data for training in order to label new unseen data. In this work, we define an unsupervised pipeline for finding and labelling geographical areas of similar context and content within Sentinel-2 satellite imagery. Our approach removes limitations of previous methods by utilising segmentation with convolutional and graph neural networks to encode a more robust feature space for image comparison. Unlike previous approaches we segment the image into homogeneous regions of pixels that are grouped based on colour and spatial similarity. Graph neural networks are used to aggregate information about the surrounding segments enabling the feature representation to encode the local neighbourhood whilst preserving its own local information. This reduces outliers in the labelling tool, allows users to label at a granular level, and allows a rotationally invariant semantic relationship at the image level to be formed within the encoding space. Our pipeline achieves high contextual consistency, with similarity scores of SSIM = 0.96 and SAM = 0.21 under context-aware evaluation, demonstrating robust organisation of the feature space for interactive labelling.
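The segment-level aggregation can be pictured as one round of message passing over the superpixel adjacency graph. The sketch below uses a fixed mean and mixing weight where the pipeline uses learned GNN layers:

```python
import numpy as np

def aggregate_segments(seg_feats, adjacency, alpha=0.5):
    """One message-passing step over the segment adjacency graph: each
    homogeneous region keeps its own feature (local information) while
    mixing in the mean of its neighbours (neighbourhood context). The
    mixing weight alpha is an assumed constant."""
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbour_mean = adjacency @ seg_feats / deg
    return alpha * seg_feats + (1 - alpha) * neighbour_mean
```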
[199] PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation
Zongyou Yang, Jonathan Loo, Yinghan Hou
Main category: cs.CV
TL;DR: This paper proposes PyCAT4, an improved 3D human pose estimation model that enhances the existing Pymaf architecture by integrating Transformer-based feature extraction, temporal fusion for video sequences, and spatial pyramid structures for multi-scale feature fusion.
Details
Motivation: The motivation stems from recent advancements in 3D human pose estimation through CNN-pyramid grid alignment feedback loops and Transformer-based temporal analysis architectures. The authors aim to deeply optimize and improve the existing Pymaf network architecture by leveraging these recent innovations.
Method: The main innovations include: (1) Introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance low-level feature capture; (2) Enhancing temporal signal understanding through feature temporal fusion techniques for video sequences; (3) Implementing spatial pyramid structures for multi-scale feature fusion to balance feature representation differences across scales.
Result: The proposed PyCAT4 model was validated on COCO and 3DPW datasets. The results demonstrate that the improvement strategies significantly enhance the network’s detection capability in human pose estimation.
Conclusion: The study successfully advances human pose estimation technology by integrating Transformer-based feature extraction, temporal fusion, and spatial pyramid structures into the Pymaf architecture, resulting in improved performance on benchmark datasets.
Abstract: Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) Introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) Enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) Implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing feature representation differences across different scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.
[200] Gradient-Direction-Aware Density Control for 3D Gaussian Splatting
Zheng Zhou, Yu-Jie Xiong, Jia-Chen Zhang, Chun-Ming Xia, Xihe Qiu, Hongjian Zhan
Main category: cs.CV
TL;DR: GDAGS introduces gradient-direction-aware density control for 3D Gaussian Splatting to address over-reconstruction and over-densification issues in complex scenes.
Details
Motivation: Existing 3DGS approaches suffer from two critical limitations: (1) Over-reconstruction due to persistent large Gaussians that can't meet splitting thresholds, exacerbated by conflicting gradient directions preventing effective splitting; (2) Over-densification in regions with aligned gradient aggregation, leading to redundant components and increased memory overhead.
Method: GDAGS introduces Gradient Coherence Ratio (GCR) computed through normalized gradient vector norms to discriminate between concordant vs. conflicting gradient directions. A nonlinear dynamic weighting mechanism leverages GCR for gradient-direction-aware density control: prioritizes conflicting-gradient Gaussians during splitting for geometric details, while suppressing redundant concordant-direction Gaussians; promotes concordant-direction Gaussian densification during cloning for structural completion while preventing conflicting-direction Gaussian overpopulation.
Result: Comprehensive evaluations across diverse real-world benchmarks demonstrate superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations.
Conclusion: GDAGS successfully addresses the limitations of existing 3DGS approaches by introducing gradient-direction-aware density control, achieving better rendering quality with more efficient scene representation.
Abstract: The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced Novel View Synthesis (NVS) through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS) to address these challenges. Our key innovations: the Gradient Coherence Ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions; and a nonlinear dynamic weighting mechanism leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations.
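One plausible reading of the GCR is the classic coherence measure "norm of the sum over sum of the norms" of the accumulated view-space gradients; the paper's exact normalization may differ:

```python
import torch

def gradient_coherence_ratio(grads):
    """Coherence of accumulated view-space gradients for one Gaussian:
    near 1 when directions agree (densify by cloning), near 0 when they
    conflict (prioritize splitting). grads: list of 2-D gradient tensors."""
    g = torch.stack(grads)                        # (N, 2)
    coherent = g.sum(dim=0).norm()                # ||sum of gradients||
    total = g.norm(dim=1).sum().clamp_min(1e-8)   # sum of ||gradient||
    return (coherent / total).item()
```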
[201] Assessing the Effectiveness of Deep Embeddings for Tree Species Classification in the Dutch Forest Inventory
Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Rußwurm
Main category: cs.CV
TL;DR: Using pre-trained remote sensing embeddings with Random Forest improves tree species classification accuracy in National Forest Inventories with limited data.
Details
Motivation: Traditional National Forest Inventories require labor-intensive field campaigns, and there's a need for more frequent, scalable updates using remote sensing data with limited annotations.
Method: Systematically evaluated three embedding models (Presto, Alpha Earth, Tessera) on three tree species datasets, comparing pre-computed embeddings with dynamically calculated ones, using Random Forest for classification.
Result: Fine-tuning a publicly available remote sensing time series pre-trained model outperformed state-of-the-art NFI classification in the Netherlands by 2-9 percentage points across datasets and metrics.
Conclusion: Deep embeddings from pre-trained models significantly improve classification accuracy over traditional hand-defined features, offering a scalable solution for data-limited forest inventory applications.
Abstract: National Forest Inventories (NFIs) serve as the primary source of forest information; however, maintaining these inventories requires labor-intensive on-site campaigns by forestry experts to identify and document tree species. Embeddings from deep pre-trained remote sensing models offer new opportunities to update NFIs more frequently and at larger scales. While training new deep learning models on few data points remains challenging, we show that using pre-computed embeddings in combination with Random Forest can prove effective for distinguishing tree species through seasonal canopy reflectance patterns. This work systematically investigates how deep embeddings improve tree species classification accuracy in the Netherlands with few annotated data. We evaluate this question on three embedding models: Presto, Alpha Earth, and Tessera, using three tree species datasets of varying difficulty. Data-wise, we compare the available embeddings from Alpha Earth and Tessera with dynamically calculated embeddings from a pre-trained Presto model. Our results demonstrate that fine-tuning a publicly available remote sensing time series pre-trained model outperforms the current state-of-the-art in NFI classification in the Netherlands, yielding performance gains of approximately 2-9 percentage points across datasets and evaluation metrics. This indicates that classic hand-defined features are too simple for this task and highlights the potential of using deep embeddings for data-limited applications such as NFI classification. By leveraging openly available satellite data and deep embeddings from pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.
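The classification recipe itself is deliberately simple: pre-computed embeddings in, Random Forest out. A self-contained sketch with simulated stand-in data (real inputs would be per-plot Presto/Alpha Earth/Tessera vectors and species labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-ins for per-plot satellite embeddings and species labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))      # 500 plots, 64-d embeddings (assumed size)
y = rng.integers(0, 8, size=500)    # 8 hypothetical species classes

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # ~chance on noise; real data differs
```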
[202] BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
Yujie Li, Wenjia Xu, Yuanben Zhang, Zhiwei Wei, Mugen Peng
Main category: cs.CV
TL;DR: BTCChat is a multi-temporal MLLM that improves bi-temporal satellite image analysis through a Change Extraction module and Prompt Augmentation mechanism, achieving SOTA performance on change captioning and VQA tasks.
Details
Motivation: Current MLLM approaches for bi-temporal satellite imagery inadequately model temporal correlations and spatial semantic changes due to simple concatenation of image pairs, which hampers visual-semantic alignment and limits effectiveness in change understanding applications.
Method: Proposes BTCChat with two key components: 1) Change Extraction module to better capture temporal features and spatial semantic changes in image pairs, and 2) Prompt Augmentation mechanism that incorporates contextual clues into prompts to enhance attention to spatial details.
Result: BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks for bi-temporal satellite imagery analysis.
Conclusion: BTCChat effectively addresses the limitations of previous methods by better modeling temporal correlations and spatial semantic changes, demonstrating superior performance in bi-temporal change understanding while retaining single-image interpretation capability.
Abstract: Bi-temporal satellite imagery supports critical applications such as urbanization monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks. The code is available at https://github.com/IntelliSensing/BTCChat.
[203] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation
Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin
Main category: cs.CV
TL;DR: DiffInk: First latent diffusion Transformer framework for full-line handwriting generation using dual-regularized VAE and diffusion Transformer for improved accuracy, style fidelity, and efficiency.
Details
Motivation: Existing text-to-online handwriting generation methods focus on character/word-level generation, resulting in inefficiency and lack of holistic structural modeling for full text lines.
Method: 1) InkVAE: Sequential VAE with dual regularization losses (OCR-based for glyph accuracy + style-classification for style preservation). 2) InkDiT: Latent diffusion Transformer that integrates target text and reference styles to generate pen trajectories.
Result: Outperforms SOTA methods in both glyph accuracy and style fidelity while significantly improving generation efficiency.
Conclusion: DiffInk successfully addresses limitations of existing methods by enabling full-line handwriting generation with structured latent space disentangling character content and writer styles.
Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency.
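Read as a single objective, InkVAE's dual regularization plausibly takes the following form, where the weights β, λ_ocr, λ_style are our notation and not given in the summary:

```latex
\mathcal{L}_{\mathrm{InkVAE}}
  = \mathcal{L}_{\mathrm{rec}}
  + \beta\,\mathcal{L}_{\mathrm{KL}}
  + \lambda_{\mathrm{ocr}}\,\mathcal{L}_{\mathrm{OCR}}
  + \lambda_{\mathrm{style}}\,\mathcal{L}_{\mathrm{style}}
```

The OCR term pulls latents toward glyph-accurate reconstructions while the style-classification term keeps writer identity separable, which is what disentangles content from style in the latent space.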
[204] Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang
Main category: cs.CV
TL;DR: EvoQuality is a self-supervised framework that enables vision-language models to autonomously improve image quality assessment capabilities without ground-truth labels, using pseudo-labels from majority voting and iterative refinement through group relative policy optimization.
Details
Motivation: Current methods for improving VLMs in perceptual domains like IQA require costly human-annotated data through supervised fine-tuning or reinforcement learning. Self-supervised techniques have been effective for reasoning but remain unexplored for perceptual tasks like IQA.
Method: EvoQuality adapts self-consistency to ranking-based IQA by generating pseudo-labels through pairwise majority voting on the VLM's own outputs. These pseudo-rankings create a fidelity reward that guides iterative evolution using group relative policy optimization (GRPO), allowing the model to progressively refine its perceptual capability without ground-truth labels.
Result: EvoQuality boosts the base VLM’s zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Despite being entirely self-supervised, it achieves performance competitive with or surpassing state-of-the-art supervised VLM-based IQA models, outperforming them on 5 out of 7 benchmarks. The framework also shows flexibility for stacking with pre-trained IQA models to improve generalization.
Conclusion: EvoQuality demonstrates that VLMs can autonomously refine their perceptual capabilities for image quality assessment without human supervision, achieving competitive performance with supervised methods while offering greater flexibility and eliminating annotation costs.
Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM’s own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model’s iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM’s perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM’s zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets.
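The pseudo-labelling step reduces to counting votes over repeated pairwise judgments. A minimal sketch (the number of queries and the use of vote share as the fidelity signal are our assumptions):

```python
from collections import Counter

def pairwise_pseudo_label(votes):
    """Majority voting over repeated pairwise quality judgments: `votes` is
    a list like ['A', 'B', 'A', ...] from K queries of the same image pair.
    The winner becomes the pseudo-ranking; the vote share is a natural
    consensus strength for the fidelity reward."""
    tally = Counter(votes)
    winner, count = tally.most_common(1)[0]
    return winner, count / len(votes)

print(pairwise_pseudo_label(['A', 'A', 'B', 'A', 'B']))  # ('A', 0.6)
```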
[205] Purrception: Variational Flow Matching for Vector-Quantized Image Generation
Răzvan-Andrei Matişan, Vincent Tao Hu, Grigory Bartosh, Björn Ommer, Cees G. M. Snoek, Max Welling, Jan-Willem van de Meent, Mohammad Mahdi Derakhshani, Floor Eijkelboom
Main category: cs.CV
TL;DR: Purrception is a variational flow matching method for vector-quantized image generation that combines continuous transport dynamics with explicit categorical supervision over discrete codes.
Details
Motivation: To bridge the gap between continuous flow matching methods (which have geometric awareness) and discrete categorical approaches (which provide explicit supervision), enabling uncertainty quantification and temperature-controlled generation for vector-quantized image generation.
Method: Adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space, combining continuous transport dynamics with discrete supervision.
Result: Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores on ImageNet-1k 256x256 generation, demonstrating improved training efficiency.
Conclusion: Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation, as demonstrated by the Purrception approach.
Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
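The core trick, a categorical posterior whose expected embedding drives a continuous velocity field, fits in a few lines. The sketch below assumes a linear interpolation path and our own notation; it is not the authors' code:

```python
import torch

def vfm_velocity(logits, codebook, x_t, t):
    """Variational-flow-matching sketch: logits over K codebook entries
    define a categorical posterior; its expected embedding acts as the
    clean-sample estimate x1_hat, giving the linear-path velocity."""
    probs = logits.softmax(dim=-1)        # (B, K) posterior over codes
    x1_hat = probs @ codebook             # expected embedding, (B, D)
    return (x1_hat - x_t) / (1.0 - t)     # velocity for x_t = (1-t)x0 + t*x1; t < 1
```

Because the posterior is explicit, sharpening or flattening `probs` with a temperature before the expectation gives the temperature-controlled generation the summary mentions.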
[206] Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Winson Han, Ranjay Krishna
Main category: cs.CV
TL;DR: SOC is a synthetic data pipeline that composes 3D object segments into new images with accurate masks and annotations, outperforming real datasets and other synthetic methods in visual grouping tasks.
Details
Motivation: Real-world visual grouping datasets are expensive, biased, and hard to scale. Synthetic datasets offer an alternative but struggle with flexibility, accuracy, and compositional diversity.
Method: Object-centric composition strategy using 3D geometric layout augmentation, camera configuration augmentation, generative harmonization, and mask-area-weighted blending to create accurate synthetic object segments with masks, boxes, and referring expressions.
Result: Models trained on just 100K SOC images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines by +24-36%, achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Also enables targeted data generation for fine-grained tasks.
Conclusion: SOC provides an accurate, scalable synthetic data solution that outperforms real datasets, enables controllable dataset construction, and delivers strong performance across various data scales and specialized tasks like intra-class referring.
Abstract: Visual grouping – operationalized through tasks such as instance segmentation, visual grounding, and object detection – enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24-36% – achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.
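A compositing sketch makes the mask-area-weighted blending idea tangible; the specific weighting below (blend weight grows as the object's mask area shrinks, so small objects are pasted more crisply) is our reading, not the released pipeline:

```python
import numpy as np

def composite(canvas, rgb, mask, x, y, w_min=0.7):
    """Paste one object segment onto the canvas with an area-dependent
    alpha. canvas: (H, W, 3); rgb/mask: (h, w, 3)/(h, w) for the segment."""
    h, w = mask.shape
    area_frac = mask.mean()                          # fraction of bbox covered
    weight = w_min + (1.0 - w_min) * (1.0 - area_frac)
    alpha = (mask * weight)[..., None]               # (h, w, 1)
    region = canvas[y:y + h, x:x + w]
    canvas[y:y + h, x:x + w] = alpha * rgb + (1 - alpha) * region
    return canvas
```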
[207] MSCloudCAM: Multi-Scale Context Adaptation with Convolutional Cross-Attention for Multispectral Cloud Segmentation
Md Abdullah Al Mazid, Liangdong Deng, Naphtali Rishe
Main category: cs.CV
TL;DR: MSCloudCAM is a multi-scale context adapter network with convolution-based cross-attention for multispectral cloud segmentation that achieves superior performance on Sentinel-2 and Landsat-8 datasets.
Details
Motivation: Clouds obstruct optical satellite imaging, hindering environmental and climate analysis. Existing methods struggle with strong spectral variability and large scale differences among cloud types in multispectral/multi-sensor data.
Method: Proposes MSCloudCAM with explicit modeling of multiple complementary multi-scale context extractors. Uses convolution-based cross-attention adapter to fuse fine-resolution features with global contextual representations for dynamic, scale-aware feature selection. Integrates hierarchical vision backbone with channel and spatial attention mechanisms.
Result: Achieves superior overall segmentation performance and competitive class-wise accuracy on CloudSEN12 (Sentinel-2) and L8Biome (Landsat-8) datasets compared to state-of-the-art models, while maintaining competitive model complexity.
Conclusion: MSCloudCAM demonstrates novelty and effectiveness for large-scale Earth observation through its multi-scale context modeling and convolution-based cross-attention design for cloud segmentation in multispectral satellite imagery.
Abstract: Clouds remain a major obstacle in optical satellite imaging, limiting accurate environmental and climate analysis. To address the strong spectral variability and the large scale differences among cloud types, we propose MSCloudCAM, a novel multi-scale context adapter network with convolution-based cross-attention tailored for multispectral and multi-sensor cloud segmentation. A key contribution of MSCloudCAM is the explicit modeling of multiple complementary multi-scale context extractors. Rather than simply stacking or concatenating their outputs, our formulation combines one extractor's fine-resolution features with the other's global contextual representations, enabling dynamic, scale-aware feature selection. Building on this idea, we design a new convolution-based cross-attention adapter that effectively fuses localized, detailed information with broader multi-scale context. Integrated with a hierarchical vision backbone and refined through channel and spatial attention mechanisms, MSCloudCAM achieves strong spectral-spatial discrimination. Experiments on multi-sensor datasets, e.g., CloudSEN12 (Sentinel-2) and L8Biome (Landsat-8), demonstrate that MSCloudCAM achieves superior overall segmentation performance and competitive class-wise accuracy compared to recent state-of-the-art models, while maintaining competitive model complexity, highlighting the novelty and effectiveness of the proposed design for large-scale Earth observation.
[208] Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou
Main category: cs.CV
TL;DR: Vlaser is a Vision-Language-Action model that bridges embodied reasoning with policy learning, achieving SOTA on embodied reasoning benchmarks and competitive robot control performance.
Details
Motivation: Address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning for embodied agents, as current research focuses on either reasoning or control but not their integration.
Method: Introduce Vlaser, a foundational vision-language model that integrates high-level reasoning with low-level control, built on the Vlaser-6M dataset. Systematically examine how different VLM initializations affect supervised VLA fine-tuning to mitigate domain shift between pre-training and embodied data.
Result: Achieves state-of-the-art performance on embodied reasoning benchmarks (spatial reasoning, embodied grounding, embodied QA, task planning). Achieves SOTA on WidowX benchmark and competitive performance on Google Robot benchmark.
Conclusion: Successfully bridges embodied reasoning with VLA policy learning, providing insights into mitigating domain shift and demonstrating effective integration of reasoning and control for embodied agents.
Abstract: While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
[209] Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
Feihong Yan, Peiru Wang, Yao Zhu, Kaiyu Pang, Qingyan Wei, Huiqi Li, Linfeng Zhang
Main category: cs.CV
TL;DR: GtR is a training-free hierarchical sampling strategy that accelerates masked autoregressive models by decomposing generation into structure generation (slow) and detail reconstruction (fast), plus frequency-weighted token selection to allocate more computation to detail tokens.
Details
Motivation: Masked autoregressive models promise parallel generation efficiency but remain constrained by the complexity of modeling spatially correlated visual tokens in a single step. There's a need to accelerate these models while maintaining generation quality.
Method: Proposes Generation then Reconstruction (GtR): a two-stage hierarchical sampling strategy where 1) structure generation establishes global semantic scaffolding slowly, and 2) detail reconstruction efficiently completes remaining tokens quickly. Also introduces Frequency-Weighted Token Selection (FTS) to allocate more computation budget to tokens on image details based on high-frequency energy.
Result: Achieves 3.72x speedup on MAR-H while maintaining comparable quality (FID: 1.59, IS: 304.4 vs. original 1.59, 299.1). Outperforms existing acceleration methods across various model scales and generation tasks on ImageNet class-conditional and text-to-image generation.
Conclusion: GtR effectively accelerates masked autoregressive models through a hierarchical generation approach that separates structure and detail processing, achieving significant speedup without compromising generation quality, with FTS further optimizing computation allocation based on semantic importance.
Abstract: Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models thanks to their ability to generate tokens in parallel, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing the remaining tokens. On the assumption that creating an image from scratch is harder than completing one on top of a basic structural framework, GtR achieves acceleration by computing the reconstruction stage quickly while preserving generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to allocate more computation budget to tokens on image details, which are localized based on the energy of high-frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate a 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our code will be released at https://github.com/feihongyan1/GtR.
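To make the frequency-weighted selection concrete, here is a minimal NumPy sketch that scores token patches by high-frequency energy and keeps the top fraction for the larger computation budget. The grid size, the high-pass mask, and the `fts_select` name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fts_select(image: np.ndarray, grid: int = 16, budget_frac: float = 0.25):
    """Rank token positions by high-frequency energy; return indices of the
    'detail' tokens that should receive the larger computation budget."""
    h, w = image.shape
    ph, pw = h // grid, w // grid
    scores = np.zeros(grid * grid)
    for i in range(grid):
        for j in range(grid):
            patch = image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            spec = np.fft.fftshift(np.fft.fft2(patch))
            cy, cx = ph // 2, pw // 2
            mask = np.ones_like(spec, dtype=bool)
            mask[cy - 2:cy + 3, cx - 2:cx + 3] = False  # drop low-frequency center
            scores[i * grid + j] = np.abs(spec[mask]).sum()
    k = int(budget_frac * grid * grid)
    return np.argsort(scores)[::-1][:k]  # top-k high-frequency token indices

# Detail tokens would go to the slow stage; the rest are completed quickly.
detail_idx = fts_select(np.random.rand(256, 256))
```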
[210] SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
Main category: cs.CV
TL;DR: SCoPE VLM introduces a Chain of Scroll mechanism for efficient long-context document navigation in vision-language models, reducing memory usage while modeling human-like reading behaviors.
Details
Motivation: Current vision-language models struggle with long-context visual information in agentic tasks like GUI control and web navigation. They neglect decision-oriented document understanding and use memory-intensive approaches that aren't practical for local deployment.
Method: Proposes SCoPE VLM with Chain of Scroll mechanism for selective, recursive document navigation; dedicated data generation pipeline for Chain of Scroll trajectories; and Episodic Group Relative Policy Optimization for bridging training-inference gap.
Result: Substantially reduces memory usage and effectively models human-like reading behaviors. First framework to explicitly model agentic reading patterns in multi-page document question answering.
Conclusion: SCoPE VLM advances multimodal agent capabilities by addressing long-context document navigation challenges through efficient, selective attention mechanisms and specialized training approaches.
Abstract: Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to bridge the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
[211] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Anirban Ray, Vera Galinova, Florian Jug
Main category: cs.CV
TL;DR: ResMatching is a novel computational super-resolution method using guided conditional flow matching to learn improved data priors for fluorescence microscopy, achieving best trade-off between data fidelity and perceptual realism while providing calibrated uncertainty estimates.
Details
Motivation: Computational super-resolution in fluorescence microscopy is an ill-posed problem requiring strong priors to extrapolate missing frequencies. With advances in data-driven machine learning, better priors can be learned to improve CSR results, especially in challenging cases with noisy low-resolution images.
Method: ResMatching uses guided conditional flow matching to learn improved data priors for computational super-resolution. The method can sample from an implicitly learned posterior distribution and provides pixel-wise uncertainty estimates.
Result: Evaluated on 4 diverse biological structures from BioSR dataset against 7 baselines, ResMatching consistently achieves competitive results with the best trade-off between data fidelity and perceptual realism. It’s particularly effective when strong priors are hard to learn (e.g., noisy low-resolution images). The method provides calibrated uncertainty estimates across all tested use-cases.
Conclusion: ResMatching demonstrates that guided conditional flow matching enables learning improved data priors for computational super-resolution, delivering both high-quality reconstructions and reliable uncertainty quantification for fluorescence microscopy applications.
Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g., when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
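For readers unfamiliar with the objective family, below is a minimal sketch of one conditional flow-matching training step on flattened patches, where a velocity network conditioned on the low-resolution input is regressed toward the linear transport from noise to the high-resolution target. The toy `v_theta` network, the shapes, and the linear probability path are our assumptions, not the paper's architecture or guidance scheme.

```python
import torch
import torch.nn as nn

# Tiny stand-in velocity network: input is [x_t, x_lr, t] -> predicted velocity.
v_theta = nn.Sequential(nn.Linear(2 * 64 + 1, 128), nn.ReLU(), nn.Linear(128, 64))

def cfm_loss(x_hr, x_lr):
    """x_hr: target high-res patch (B, 64); x_lr: conditioning low-res (B, 64)."""
    b = x_hr.shape[0]
    t = torch.rand(b, 1)                 # random time in [0, 1]
    x0 = torch.randn_like(x_hr)          # noise sample
    xt = (1 - t) * x0 + t * x_hr         # linear probability path
    target_v = x_hr - x0                 # constant velocity along that path
    pred_v = v_theta(torch.cat([xt, x_lr, t], dim=1))
    return ((pred_v - target_v) ** 2).mean()

loss = cfm_loss(torch.randn(8, 64), torch.randn(8, 64))
```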
[212] NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Congzhang Shao, Quan Yuan, Guiyang Luo, Yue Hu, Danni Wang, Yilin Liu, Rui Pan, Bo Chen, Jinglin Li
Main category: cs.CV
TL;DR: NegoCollab proposes a negotiated common representation approach for heterogeneous collaborative perception, using a negotiator to derive common features from local representations and multiple alignment losses for better knowledge distillation.
Details
Motivation: Immutable heterogeneity in collaborative perception causes domain gaps when agents use different fixed perception models, degrading performance. Existing methods use specific agent representations as common features, making alignment difficult for agents with significant domain discrepancies.
Method: Introduces a negotiator during training to derive common representation from local representations of each modality's agent. Uses sender-receiver pairs for mutual transformation between local and common representation spaces. Implements structural, pragmatic, and distribution alignment losses for supervision.
Result: Effectively reduces inherent domain gaps with various local representations and enables full knowledge distillation from common representation into sender models.
Conclusion: NegoCollab provides a more effective heterogeneous collaboration method through negotiated common representation that better handles domain discrepancies between agents with different perception models.
Abstract: Collaborative perception improves task performance by expanding the perception range through information sharing among agents. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.
[213] Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing
Cong Cao, Yujie Xu, Xiaodong Xu
Main category: cs.CV
TL;DR: Proposes a few-shot style editing framework with MoE LoRA for efficient multi-style adaptation using limited paired data.
Details
Motivation: General image editing models fail with new styles, and fine-tuning with limited paired data is challenging.
Method: Multi-style Mixture-of-Experts LoRA with style-specific/shared routing, metric-guided rank optimization, optimal DiT LoRA placement, and adversarial/flow matching training.
Result: Outperforms SOTA approaches with significantly fewer LoRA parameters on benchmark dataset with five distinct styles.
Conclusion: Proposed framework enables effective few-shot style editing through parameter-efficient multi-style adaptation.
Abstract: In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters. Our code and dataset are available at https://github.com/cao-cong/FSMSE.
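As a rough illustration of the routing idea, here is a minimal sketch of a single linear layer with one LoRA expert per style plus softmax-routed shared experts. The shapes, the router, and all names are illustrative assumptions; the paper's metric-guided rank selection and DiT placement are omitted.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, dim: int, rank: int, n_styles: int, n_shared: int):
        super().__init__()
        self.base = nn.Linear(dim, dim)  # stands in for a frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        make = lambda: nn.Sequential(nn.Linear(dim, rank, bias=False),
                                     nn.Linear(rank, dim, bias=False))
        self.style_experts = nn.ModuleList(make() for _ in range(n_styles))
        self.shared_experts = nn.ModuleList(make() for _ in range(n_shared))
        self.router = nn.Linear(dim, n_shared)  # adaptively allocates shared experts

    def forward(self, x, style_id: int):
        out = self.base(x) + self.style_experts[style_id](x)  # style-specific path
        gate = torch.softmax(self.router(x), dim=-1)          # style-shared path
        for k, expert in enumerate(self.shared_experts):
            out = out + gate[..., k:k + 1] * expert(x)
        return out

layer = MoELoRALinear(dim=64, rank=4, n_styles=5, n_shared=2)
y = layer(torch.randn(2, 64), style_id=3)
```

The style-specific path keeps styles from interfering with one another, while the gated shared path lets common editing patterns be learned once and reused.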
[214] DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition
Raja Kumar, Arka Sadhu, Ram Nevatia
Main category: cs.CV
TL;DR: DiVE-k is a novel RL framework that uses a model’s own top-k predictions to create multiple-choice questions for fine-grained visual recognition, improving generalization by encouraging differential reasoning rather than memorization.
Details
Motivation: Large Vision Language Models have extensive text knowledge but struggle with fine-grained image recognition, often failing to differentiate visually similar categories. Existing RL fine-tuning methods are brittle, encourage memorization, and fail to elicit the differential reasoning needed for generalization to unseen classes.
Method: DiVE-k creates multiple-choice questions from the model's own top-k predictions for each training image, then uses Reinforcement Learning to train the model to select the correct answer. This approach forces the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal.
Result: Experiments on five standard fine-grained datasets show significant improvements. In base-to-novel generalization, DiVE-k surpasses QWEN2.5-VL-7B by 10.04% and ViRFT by 6.16% on Harmonic Mean. Similar gains are shown in mixed-domain and few-shot scenarios.
Conclusion: DiVE-k effectively addresses the limitations of existing fine-tuning methods by leveraging the model’s own predictions to create training signals that encourage differential reasoning, reducing memorization and improving generalization to unseen categories.
Abstract: Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose $\textbf{DiVE-k}$ ($\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations), a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios. Our code is available $\href{https://github.com/raja-kumar/DiVE-k}{here}$.
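A minimal sketch of how such a training signal can be constructed, assuming the model's top-k label predictions are available as strings. The MCQ format, the letter choices, and the 0/1 reward here are illustrative assumptions rather than the authors' exact prompt or reward.

```python
import random

def build_mcq(topk_labels, gold_label):
    """Turn the model's own top-k predictions into a multiple-choice prompt
    (assumes at most five options, so letters A-E suffice)."""
    options = list(dict.fromkeys(topk_labels))   # dedupe, keep prediction order
    if gold_label not in options:
        options[-1] = gold_label                 # ensure the answer is present
    random.shuffle(options)
    letters = "ABCDE"[: len(options)]
    prompt = "Which category best matches the image?\n" + "\n".join(
        f"{l}. {o}" for l, o in zip(letters, options))
    answer = letters[options.index(gold_label)]
    return prompt, answer

def reward(model_choice: str, answer: str) -> float:
    """Verifiable 0/1 reward: did the policy pick the correct letter?"""
    return 1.0 if model_choice.strip().upper().startswith(answer) else 0.0

prompt, ans = build_mcq(["sparrow", "finch", "wren"], gold_label="finch")
print(prompt)
print(reward(ans, ans))  # 1.0: choosing the correct letter earns the reward
```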
[215] Scaling Foundation Models for Radar Scene Understanding
Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
Main category: cs.CV
TL;DR: RadarFM: A radar foundation model that learns unified scene representations using structured spatial language supervision and hash-aware contrastive learning for transferable radar perception.
Details
Motivation: Radar sensors provide reliable perception in adverse conditions, but existing radar approaches are fragmented and task-specific, preventing transfer across tasks. Foundation models have transformed vision and language but haven't been well integrated with radar sensing.
Method: 1) Structured caption framework encoding vehicle distributions in native radar coordinates; 2) Hash-aware contrastive learning objective quantifying continuous scene similarity rather than binary matching; 3) Using CARLA simulator to generate large-scale annotated radar datasets across diverse driving scenarios.
Result: The paper introduces RadarFM with novel structured spatial language supervision and contrastive learning approach, enabling fine-grained spatial reasoning and unified scene-level representations. Also proposes localization-aware metrics for spatial accuracy assessment.
Conclusion: RadarFM addresses the fragmentation in radar perception by creating a foundation model that learns transferable representations, enabling cross-task radar understanding through structured language supervision and continuous similarity learning.
Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
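One way to read the hash-aware objective is as contrastive learning with soft targets. Below is a minimal sketch that replaces 0/1 matching with a KL loss toward a continuous scene-similarity matrix; `sim_target` (e.g., derived from scene hashes) is assumed given, and this loss form is our assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(radar_emb, text_emb, sim_target, tau=0.07):
    """radar_emb, text_emb: (B, D) L2-normalized embeddings.
    sim_target: (B, B) continuous scene similarities in (0, 1]."""
    logits = radar_emb @ text_emb.t() / tau
    # Continuous scene similarity becomes a soft target distribution per row,
    # instead of the usual one-hot "diagonal positives" of CLIP-style losses.
    target = sim_target / sim_target.sum(dim=1, keepdim=True)
    return F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")

B, D = 8, 128
r = F.normalize(torch.randn(B, D), dim=1)
t = F.normalize(torch.randn(B, D), dim=1)
sim = torch.rand(B, B).clamp(min=1e-3)   # placeholder hash-derived similarity
loss = soft_contrastive_loss(r, t, sim)
```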
[216] Semantic-aware Random Convolution and Source Matching for Domain Generalization in Medical Image Segmentation
Franz Thaler, Martin Urschler, Mateusz Kozinski, Matthias AF Gsell, Gernot Plank, Darko Stern
Main category: cs.CV
TL;DR: A novel single-source domain generalization method for medical image segmentation that uses semantic-aware random convolution for training-time source domain diversification and test-time intensity mapping to handle target domain shifts.
Details
Motivation: To address the challenging problem of single-source domain generalization in medical image segmentation, where models trained on one imaging modality (e.g., CT) need to generalize to different modalities (e.g., MR) without adaptation or access to target domain data during training.
Method: Two-stage approach: 1) Training-time: semantic-aware random convolution that diversifies source domain by applying different augmentations to different image regions based on annotation labels; 2) Test-time: intensity mapping to make target domain images resemble source domain data.
Result: Outperforms previous domain generalization techniques in most experiments across abdominal, whole-heart, and prostate segmentation tasks in cross-modality and cross-center settings. Achieves state-of-the-art performance, even matching in-domain baseline performance in several settings.
Conclusion: The proposed method sets new state-of-the-art for single-source domain generalization in medical image segmentation, demonstrating robust performance across diverse domain shifts including cross-modality, cross-center, and different cardiac phases.
Abstract: We tackle the challenging problem of single-source domain generalization (DG) for medical image segmentation, where we train a network on one domain (e.g., CT) and directly apply it to a different domain (e.g., MR) without adapting the model and without requiring images or annotations from the new domain during training. Our method diversifies the source domain through semantic-aware random convolution, where different regions of a source image are augmented differently at training-time, based on their annotation labels. At test-time, we complement the randomization of the training domain via mapping the intensity of target domain images, making them similar to source domain data. We perform a comprehensive evaluation on a variety of cross-modality and cross-center generalization settings for abdominal, whole-heart and prostate segmentation, where we outperform previous DG techniques in a vast majority of experiments. Additionally, we also investigate our method when training on whole-heart CT or MR data and testing on the diastolic and systolic phase of cine MR data captured with different scanner hardware. Overall, our evaluation shows that our method achieves new state-of-the-art performance in DG for medical image segmentation, even matching the performance of the in-domain baseline in several settings. We will release our source code upon acceptance of this manuscript.
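A minimal sketch of the training-time diversification step, assuming each annotated region is filtered by its own freshly sampled random convolution and the results are recombined. The kernel normalization and all names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def semantic_random_conv(image, labels, kernel_size=3):
    """image: (1, C, H, W); labels: (H, W) integer annotation map.
    Each label region is augmented by a different random convolution."""
    out = torch.zeros_like(image)
    c = image.shape[1]
    for lab in labels.unique():
        w = torch.randn(c, c, kernel_size, kernel_size)
        w = w / w.abs().sum(dim=(1, 2, 3), keepdim=True)   # keep intensities bounded
        filtered = F.conv2d(image, w, padding=kernel_size // 2)
        mask = (labels == lab).float()[None, None]
        out = out + filtered * mask   # apply this region's augmentation only there
    return out

img = torch.rand(1, 1, 64, 64)
lab = (torch.rand(64, 64) > 0.5).long()   # toy two-class annotation map
aug = semantic_random_conv(img, lab)
```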
[217] Astra: General Interactive World Model with Autoregressive Denoising
Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: Astra is an interactive general world model that generates real-world futures for diverse scenarios with precise action interactions, using autoregressive denoising with temporal causal attention and noise-augmented history memory.
Details
Motivation: While diffusion transformers have advanced video generation, world models that can predict long-horizon futures from past observations and actions remain underexplored for general-purpose scenarios and various action forms.
Method: Proposes Astra with autoregressive denoising architecture using temporal causal attention to aggregate past observations, noise-augmented history memory to balance responsiveness with coherence, action-aware adapter for precise action control, and mixture of action experts for heterogeneous action modalities.
Result: Astra achieves interactive, consistent, and general long-term video prediction supporting various interactions, with experiments across multiple datasets demonstrating improvements in fidelity, long-range prediction, and action alignment over SOTA world models.
Conclusion: Astra bridges the gap in general-purpose world models by enabling precise action-controlled long-term future prediction across diverse real-world scenarios like autonomous driving and robot manipulation.
Abstract: Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames and to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interaction. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
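The noise-augmented history memory can be sketched very simply: past frames are perturbed before conditioning so the model cannot over-rely on them. The oldest-noisiest schedule below is our assumption, not the paper's.

```python
import torch

def noise_augment_history(history, max_sigma=0.3):
    """history: (T, C, H, W) past latent frames, oldest first. Older frames
    receive more noise; the exact schedule is an illustrative assumption."""
    T = history.shape[0]
    sigmas = torch.linspace(max_sigma, 0.0, T)   # oldest -> noisiest
    noise = torch.randn_like(history)
    return history + sigmas.view(T, 1, 1, 1) * noise

hist = torch.randn(8, 4, 32, 32)
conditioned = noise_augment_history(hist)
```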
[218] SVBench: Evaluation of Video Generation Models on Social Reasoning
Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
Main category: cs.CV
TL;DR: Researchers introduce the first benchmark for evaluating social reasoning in video generation, revealing that current models excel at visual realism but fail at understanding social cognition like intention recognition and belief reasoning.
Details
Motivation: Current text-to-video generation models produce visually realistic videos but lack social coherence - they can't infer intentions, beliefs, emotions, or social norms like humans do. There's a need to systematically evaluate this gap in social reasoning capabilities.
Method: Developed a training-free agent-based pipeline that: 1) distills reasoning mechanisms from 30 classic social cognition paradigms organized into 7 dimensions, 2) synthesizes diverse video scenarios, 3) enforces conceptual neutrality through cue-based critique, and 4) evaluates videos using a VLM judge across 5 interpretable social reasoning dimensions.
Result: Large-scale study across 7 state-of-the-art video generation systems reveals substantial performance gaps: models excel in surface-level plausibility but systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
Conclusion: The benchmark exposes fundamental limitations in current video generation models’ social reasoning capabilities, highlighting the need for models that can capture underlying causal and psychological logic beyond literal scene rendering.
Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
[219] MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding
Jiangyuan Liu, Yuhao Zhao, Hongxuan Ma, Zhe Liu, Jian Wang, Wei Zou
Main category: cs.CV
TL;DR: MGPC is a multimodal point cloud completion framework that integrates point clouds, RGB images, and text to improve generalization to novel objects and real-world scenarios.
Details
Motivation: Existing point cloud completion methods (3D CNN-based, point-based, Transformer-based) have limitations in modality, scalability, and generative capacity, making them struggle with generalization to novel objects and real-world scenarios.
Method: Proposes MGPC framework with: 1) modality dropout strategy for robustness, 2) Transformer-based fusion module for scalability, 3) progressive generator for geometric modeling, and 4) automatic data generation pipeline creating MGPC-1M benchmark with 1,000+ categories and 1M training pairs.
Result: Extensive experiments on MGPC-1M and in-the-wild data show MGPC consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
Conclusion: MGPC provides a generalizable multimodal framework that effectively addresses the generalization challenges in point cloud completion through unified multimodal integration and large-scale training.
Abstract: Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
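A minimal sketch of the modality-dropout strategy, assuming per-modality feature tensors and zero-masking for dropped inputs (a learned placeholder token would also work). The probabilities and names are illustrative assumptions.

```python
import random
import torch

def modality_dropout(feats: dict, p_drop=0.3, keep_at_least_one=True):
    """feats maps modality name ('points', 'rgb', 'text') to a feature tensor.
    Randomly zero out modalities during training for robustness at test time."""
    names = list(feats)
    kept = {n: (random.random() >= p_drop) for n in names}
    if keep_at_least_one and not any(kept.values()):
        kept[random.choice(names)] = True    # never drop every modality at once
    return {n: f if kept[n] else torch.zeros_like(f) for n, f in feats.items()}

batch = {"points": torch.randn(2, 256), "rgb": torch.randn(2, 256),
         "text": torch.randn(2, 256)}
batch = modality_dropout(batch)
```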
[220] Moonworks Lunara Aesthetic Dataset
Yan Wang, M M Sayeef Abdullah, Partho Hassan, Sabit Hassan
Main category: cs.CV
TL;DR: A high-quality aesthetic image dataset with diverse artistic styles, human-refined prompts, and structured annotations, released under Apache 2.0 license.
Details
Motivation: To create a first-of-its-kind aesthetic dataset that prioritizes quality over breadth, addressing the limitations of web-derived datasets that emphasize quantity but lack precision, aesthetic quality, and licensing transparency.
Method: Generated images using Moonworks Lunara model with intentional crafting of distinct aesthetic styles, accompanied by human-refined prompts and structured annotations describing objects, attributes, relationships, and stylistic cues.
Result: Created a dataset with substantially higher aesthetic scores than existing aesthetics-focused and general-purpose datasets, featuring diverse artistic styles from multiple regions and general categories, with licensing transparency for unrestricted use.
Conclusion: The Lunara Aesthetic Dataset provides a high-quality, stylistically diverse resource with clear licensing that supports both research and commercial applications, filling a gap in the current landscape of image datasets.
Abstract: The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset with substantially higher aesthetic scores, exceeding even aesthetics-focused datasets and surpassing general-purpose datasets by an even larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.
[221] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali
Main category: cs.CV
TL;DR: Proposes ViSIL, an information-theoretic metric to evaluate multimodal video summaries by quantifying information loss across modalities, enabling direct comparison and optimal summary selection.
Details
Motivation: Traditional metrics like BLEU/ROUGE fail to measure information coverage across different modalities (text vs keyframes) in multimodal video summaries, creating a need for a unified evaluation framework.
Method: Develops Video Summary Information Loss (ViSIL) score using vision-language model inference to quantify video information not captured by summaries, enabling cross-modal comparison despite structural differences.
Result: ViSIL shows statistically significant correlation with human and VLM performance on VQA tasks, and enables summary selection that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
Conclusion: ViSIL provides a unified metric for evaluating multimodal video summaries, addressing limitations of traditional metrics and enabling optimal trade-offs between information coverage and processing efficiency.
Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
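In spirit, an information-loss score of this kind can be sketched as the fraction of video-derived probe questions a VLM can no longer answer from the summary alone. Both callables below (`vlm_answer`, `questions_from`) are hypothetical placeholders, not the paper's procedure.

```python
def visil_like_score(video, summary, vlm_answer, questions_from) -> float:
    """Fraction of video-derived questions the summary fails to answer:
    0.0 means no measurable information loss, 1.0 means total loss."""
    qa = list(questions_from(video))
    missed = sum(
        1 for q, gold in qa
        if vlm_answer(summary, q).strip().lower() != gold.strip().lower()
    )
    return missed / max(len(qa), 1)

# Toy usage with mocked components.
qs = lambda v: [("color of car?", "red")]
ans = lambda ctx, q: "red" if "red" in ctx else "unknown"
print(visil_like_score("video", "a red car drives by", ans, qs))  # 0.0
```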
[222] Federated Joint Learning for Domain and Class Generalization
Haoran Xu, Jiaze Li, Jianzhong Ju, Zhenbo Luo
Main category: cs.CV
TL;DR: FedDCG is a federated learning method that jointly addresses both class and domain generalization by training class-generalized networks within domain groups and aggregating results based on domain similarity.
Details
Motivation: Existing methods typically address either unseen classes or unseen domains in isolation, without considering a joint framework for both in federated learning settings.
Method: Introduces domain grouping strategy where class-generalized networks are trained within each group, uses learnable network for class generalization, and employs decoupling mechanism to separate general and domain-specific knowledge.
Result: Extensive experiments across various datasets show FedDCG outperforms state-of-the-art baselines in terms of accuracy and robustness.
Conclusion: FedDCG effectively addresses both class and domain generalization in federated learning by preventing decision boundary confusion and integrating knowledge from both generalization types.
Abstract: Efficient fine-tuning of visual-language models like CLIP has become crucial due to their large-scale parameter size and extensive pretraining requirements. Existing methods typically address either the issue of unseen classes or unseen domains in isolation, without considering a joint framework for both. In this paper, we propose \textbf{Fed}erated Joint Learning for \textbf{D}omain and \textbf{C}lass \textbf{G}eneralization, termed \textbf{FedDCG}, a novel approach that addresses both class and domain generalization in federated learning settings. Our method introduces a domain grouping strategy where class-generalized networks are trained within each group to prevent decision boundary confusion. During inference, we aggregate class-generalized results based on domain similarity, effectively integrating knowledge from both class and domain generalization. Specifically, a learnable network is employed to enhance class generalization capabilities, and a decoupling mechanism separates general and domain-specific knowledge, improving generalization to unseen domains. Extensive experiments across various datasets show that \textbf{FedDCG} outperforms state-of-the-art baselines in terms of accuracy and robustness.
[223] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps
Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi, Jingrui Zhang, Kefei Qian, Wenbo Chu, Keqiang Li
Main category: cs.CV
TL;DR: LL-GaussianMap is the first unsupervised framework that uses 2D Gaussian Splatting for low-light image enhancement, formulating enhancement as gain map generation guided by 2DGS primitives.
Details
Motivation: Most existing low-light enhancement methods operate in pixel domain or use implicit features, neglecting intrinsic geometric structural priors. 2DGS has superior structural fitting but hasn't been explored for low-level vision tasks.
Method: Two-stage approach: 1) High-fidelity structural reconstruction using 2DGS, 2) Data-driven enhancement dictionary coefficients rendered via Gaussian splatting rasterization through a unified enhancement module. Formulates enhancement as gain map generation guided by 2DGS primitives.
Result: Achieves superior enhancement performance with extremely low storage footprint, effectively preserves edges and suppresses artifacts during enhancement, and works without paired data through unsupervised learning.
Conclusion: LL-GaussianMap successfully bridges the gap by incorporating 2DGS into low-light enhancement, demonstrating the effectiveness of explicit Gaussian representations for image enhancement tasks.
Abstract: Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning. Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.
[224] Atomic Depth Estimation From Noisy Electron Microscopy Data Via Deep Learning
Matan Leibovich, Mai Tan, Ramon Manzorro-Ureba, Adria Marcos-Morales, Sreyas Mohan, Peter A. Crozier, Carlos Fernandez-Granda
Main category: cs.CV
TL;DR: Novel deep learning approach for 3D atomic depth estimation from noisy TEM images using semantic segmentation formulation.
Details
Motivation: Need to extract 3D atomic-level information from transmission electron microscopy (TEM) images that suffer from significant noise, which makes depth estimation challenging.
Method: Formulate depth estimation as semantic segmentation problem, train deep convolutional neural network on simulated data with synthetic noise to generate pixel-wise depth segmentation maps.
Result: Method successfully applied to estimate depth of atomic columns in CeO2 nanoparticles from both simulated and real-world TEM data, producing accurate, calibrated, and noise-robust depth estimates.
Conclusion: Deep learning-based semantic segmentation approach provides effective solution for 3D atomic depth estimation from noisy TEM images, demonstrating practical applicability to real experimental data.
Abstract: We present a novel approach for extracting 3D atomic-level information from transmission electron microscopy (TEM) images affected by significant noise. The approach is based on formulating depth estimation as a semantic segmentation problem. We address the resulting segmentation problem by training a deep convolutional neural network to generate pixel-wise depth segmentation maps using simulated data corrupted by synthetic noise. The proposed method was applied to estimate the depth of atomic columns in CeO2 nanoparticles from simulated images and real-world TEM data. Our experiments show that the resulting depth estimates are accurate, calibrated and robust to noise.
[225] iFSQ: Improving FSQ for Image Generation with 1 Line of Code
Bin Lin, Zongjian Li, Yuwei Niu, Kaixiong Gong, Yunyang Ge, Yunlong Lin, Mingzhe Zheng, JianWei Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Li Yuan
Main category: cs.CV
TL;DR: iFSQ improves FSQ quantization by replacing activation function with distribution-matching mapping, enabling optimal bin utilization and reconstruction. Using iFSQ as benchmark reveals optimal 4-bit equilibrium between discrete/continuous representations and shows AR models converge faster but diffusion models achieve higher ceilings.
Details
Motivation: The field is divided between autoregressive models (discrete tokens) and diffusion models (continuous latents), rooted in VQ-VAE vs VAE differences. This hinders unified modeling and fair benchmarking. While FSQ offers a theoretical bridge, vanilla FSQ suffers from activation collapse and trade-offs between reconstruction fidelity and information efficiency.
Method: Propose iFSQ (improved FSQ) by simply replacing the activation function in original FSQ with a distribution-matching mapping to enforce a uniform prior. This one-line code change mathematically guarantees both optimal bin utilization and reconstruction precision. Use iFSQ as controlled benchmark to analyze discrete vs continuous representations, and adapt Representation Alignment (REPA) to AR models as LlamaGen-REPA.
Result: iFSQ resolves the reconstruction-information trade-off. Benchmarking reveals: (1) optimal equilibrium between discrete and continuous representations at ~4 bits per dimension; (2) AR models show rapid initial convergence but diffusion models achieve superior performance ceiling, suggesting sequential ordering may limit generation quality upper bounds. LlamaGen-REPA demonstrates REPA adaptation to AR models.
Conclusion: iFSQ provides a simple yet effective solution to FSQ’s limitations, enabling fair benchmarking. The 4-bit equilibrium point offers guidance for representation design, while the convergence patterns suggest architectural trade-offs between AR and diffusion approaches. The work bridges the discrete-continuous divide in image generation.
Abstract: The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization can cause activation collapse. This mismatch forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in original FSQ with a distribution-matching mapping to enforce a uniform prior. Termed iFSQ, this simple strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) The optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension. (2) Under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA. Code is available at https://github.com/Tencent-Hunyuan/iFSQ
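To illustrate the one-line change, here is a minimal FSQ sketch where the vanilla tanh bound is swapped for a distribution-matching map. We assume roughly Gaussian encoder activations and use the Gaussian CDF as the matching map; the specific mapping is our assumption, not necessarily the paper's.

```python
import torch

def fsq(z, levels=8, improved=True):
    """z: raw encoder activations. Returns straight-through quantized codes."""
    if improved:
        # Gaussian CDF rescaled to (-1, 1): uniform pre-quantization values
        # when z ~ N(0, 1), so every equal-interval bin gets used.
        bounded = torch.erf(z / 2 ** 0.5)
    else:
        bounded = torch.tanh(z)                  # vanilla FSQ: bins can collapse
    scaled = (bounded + 1) / 2 * (levels - 1)    # map (-1, 1) -> [0, L-1]
    q = torch.round(scaled)
    q = scaled + (q - scaled).detach()           # straight-through estimator
    return q / (levels - 1) * 2 - 1              # back to [-1, 1]

codes = fsq(torch.randn(4, 16), levels=16, improved=True)
```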
[226] Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries
Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless
Main category: cs.CV
TL;DR: Using generated synthetic images alongside text improves zero-shot accuracy prediction for Vision-Language Models compared to text-only evaluation.
Details
Motivation: Non-expert users need a way to assess whether a chosen VLM will work for their specific domain without labeled examples, as models that work well in one domain may fail in another.
Method: Builds on prior text-only evaluation methods and explores approaches that generate synthetic images relevant to the task to evaluate and refine zero-shot accuracy predictions.
Result: Generated imagery substantially improves the quality of zero-shot accuracy predictions compared to baseline text-only scores, and provides users with feedback on the kinds of images used for assessment.
Conclusion: The image-based approach helps users predict whether a VLM will be effective for their application without any labeled examples, as demonstrated on standard CLIP benchmark datasets.
Abstract: Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
[227] The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus
Main category: cs.CV
TL;DR: A new agentic framework bridges the semantic gap between dialogue and cinematic video generation by translating dialogue into executable scripts, then orchestrating video models for coherent long-form narratives.
Details
Motivation: Current video generation models struggle with long-form coherent narratives from high-level concepts like dialogue, creating a "semantic gap" between creative ideas and cinematic execution.
Method: Introduces an end-to-end agentic framework with ScripterAgent (translates dialogue to cinematic scripts using ScriptBench dataset) and DirectorAgent (orchestrates video models with cross-scene continuous generation for coherence).
Result: Framework significantly improves script faithfulness and temporal fidelity across tested video models, but reveals a trade-off between visual spectacle and strict script adherence in current SOTA models.
Conclusion: The framework successfully bridges the dialogue-to-video semantic gap and provides valuable insights for automated filmmaking, highlighting the need to balance visual quality with narrative faithfulness.
Abstract: Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
[228] MV-S2V: Multi-View Subject-Consistent Video Generation
Ziyang Song, Xinyu Gong, Bangya Liu, Zelin Zhao
Main category: cs.CV
TL;DR: Proposes Multi-View Subject-to-Video (MV-S2V) generation using multiple reference views for 3D-level subject consistency, addressing data scarcity with synthetic data and introducing TS-RoPE to distinguish cross-subject vs cross-view references.
Details
Motivation: Existing S2V methods are limited to single-view subject references, reducing the task to an S2I + I2V pipeline and failing to exploit the full potential of video subject control. Multi-view references are needed for true 3D-level subject consistency.
Method: 1) Proposes MV-S2V task using multiple reference views; 2) Develops synthetic data curation pipeline for training data; 3) Introduces Temporally Shifted RoPE (TS-RoPE) to distinguish between different subjects vs different views of the same subject in reference conditioning.
Result: Achieves superior 3D subject consistency with multi-view reference images and high-quality visual outputs, establishing new direction for subject-driven video generation.
Conclusion: MV-S2V framework successfully addresses limitations of single-view S2V methods by enabling multi-view subject control with 3D consistency, using synthetic data and novel TS-RoPE technique for effective reference conditioning.
Abstract: Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. Regarding the scarcity of training data, we first develop a synthetic data curation pipeline to generate highly customized synthetic data, complemented by a small-scale real-world captured dataset to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE) to distinguish between different subjects and distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs, establishing a new meaningful direction for subject-driven video generation. Our project page is available at: https://szy-young.github.io/mv-s2v
[229] Geometry-Grounded Gaussian Splatting
Baowen Zhang, Chenxing Jiang, Heng Li, Shaojie Shen, Ping Tan
Main category: cs.CV
TL;DR: The paper introduces Geometry-Grounded Gaussian Splatting, a method that treats Gaussian primitives as stochastic solids to enable high-quality shape reconstruction from Gaussian Splatting representations.
Details
Motivation: While Gaussian Splatting (GS) shows impressive quality in novel view synthesis, shape extraction from Gaussian primitives remains challenging due to inadequate geometry parameterization and approximation, leading to poor multi-view consistency and sensitivity to floaters.
Method: The authors establish Gaussian primitives as a specific type of stochastic solids through rigorous theoretical derivation. This framework enables direct treatment of Gaussian primitives as explicit geometric representations. Using the volumetric nature of stochastic solids, the method efficiently renders high-quality depth maps for fine-grained geometry extraction.
Result: Experiments show that the method achieves the best shape reconstruction results among all Gaussian Splatting-based methods on public datasets.
Conclusion: The theoretical framework provides a principled foundation for Geometry-Grounded Gaussian Splatting by enabling Gaussian primitives to serve as explicit geometric representations, solving the shape extraction problem that previously limited GS applications.
Abstract: Gaussian Splatting (GS) has demonstrated impressive quality and efficiency in novel view synthesis. However, shape extraction from Gaussian primitives remains an open problem. Due to inadequate geometry parameterization and approximation, existing shape reconstruction methods suffer from poor multi-view consistency and are sensitive to floaters. In this paper, we present a rigorous theoretical derivation that establishes Gaussian primitives as a specific type of stochastic solids. This theoretical framework provides a principled foundation for Geometry-Grounded Gaussian Splatting by enabling the direct treatment of Gaussian primitives as explicit geometric representations. Using the volumetric nature of stochastic solids, our method efficiently renders high-quality depth maps for fine-grained geometry extraction. Experiments show that our method achieves the best shape reconstruction results among all Gaussian Splatting-based methods on public datasets.
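To make the depth-rendering step concrete: the generic volumetric way to obtain a per-ray depth from depth-sorted, opacity-weighted primitives is front-to-back alpha compositing. The sketch below shows only that standard formula, as intuition; the paper's stochastic-solid derivation is a more principled treatment than this approximation.

```python
import numpy as np

def composited_depth(depths, alphas, eps=1e-8):
    """Expected depth along one ray via front-to-back alpha compositing of
    depth-sorted primitives (generic volumetric formula, not the paper's
    stochastic-solid derivation).

    depths: (N,) primitive depths sorted front to back
    alphas: (N,) per-primitive opacities in [0, 1]
    """
    depths = np.asarray(depths, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance before each primitive: product of (1 - alpha) of all closer ones
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas  # probability the ray terminates here
    return float((weights * depths).sum() / (weights.sum() + eps))
```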
[230] SeNeDiF-OOD: Semantic Nested Dichotomy Fusion for Out-of-Distribution Detection Methodology in Open-World Classification. A Case Study on Monument Style Classification
Ignacio Antequera-Sánchez, Juan Luis Suárez-Díaz, Rosana Montes, Francisco Herrera
Main category: cs.CV
TL;DR: SeNeDiF-OOD: A hierarchical Semantic Nested Dichotomy Fusion framework for OOD detection that outperforms traditional baselines by decomposing detection into binary fusion nodes aligned with semantic abstraction levels.
Details
Motivation: Current OOD detection methods struggle with heterogeneous OOD data (from low-level corruption to semantic shifts) in open-world environments, and single-stage detectors often fail to address this complexity.
Method: Semantic Nested Dichotomy Fusion (SeNeDiF-OOD) decomposes OOD detection into a hierarchical structure of binary fusion nodes, where each layer integrates decision boundaries aligned with specific levels of semantic abstraction.
Result: Extensive experimental evaluation on MonuMAI (architectural style recognition system) shows the hierarchical fusion methodology significantly outperforms traditional baselines, effectively filtering diverse OOD categories while preserving in-distribution performance.
Conclusion: The proposed SeNeDiF-OOD framework provides an effective solution for handling heterogeneous OOD data in open-world environments, demonstrating superior performance over conventional approaches through its hierarchical semantic fusion structure.
Abstract: Out-of-distribution (OOD) detection is a fundamental requirement for the reliable deployment of artificial intelligence applications in open-world environments. However, addressing the heterogeneous nature of OOD data, ranging from low-level corruption to semantic shifts, remains a complex challenge that single-stage detectors often fail to resolve. To address this issue, we propose SeNeDiF-OOD, a novel methodology based on Semantic Nested Dichotomy Fusion. This framework decomposes the detection task into a hierarchical structure of binary fusion nodes, where each layer is designed to integrate decision boundaries aligned with specific levels of semantic abstraction. To validate the proposed framework, we present a comprehensive case study using MonuMAI, a real-world architectural style recognition system exposed to an open environment. This application faces a diverse range of inputs, including non-monument images, unknown architectural styles, and adversarial attacks, making it an ideal testbed for our proposal. Through extensive experimental evaluation in this domain, results demonstrate that our hierarchical fusion methodology significantly outperforms traditional baselines, effectively filtering these diverse OOD categories while preserving in-distribution performance.
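As an illustration of the nested-dichotomy idea, a cascade of binary nodes can route an input from low-level to semantic checks. The minimal sketch below uses hypothetical detector functions and thresholds (is_corrupted, is_non_monument, and tau are our placeholders, not names from the paper):

```python
import numpy as np

def is_corrupted(image):
    """Hypothetical low-level detector (placeholder heuristic, not the paper's)."""
    return float(np.var(image)) < 1e-3  # e.g., flags near-constant images

def is_non_monument(embedding):
    """Hypothetical semantic out-of-scope detector on feature embeddings."""
    return float(np.linalg.norm(embedding)) < 0.5

def nested_dichotomy_ood(image, embedding, style_scores, tau=0.6):
    """Cascade of binary fusion nodes ordered by semantic abstraction level.

    Returns an OOD label or the in-distribution style prediction.
    """
    if is_corrupted(image):            # node 1: pixel-level corruption
        return "OOD: low-level corruption"
    if is_non_monument(embedding):     # node 2: out-of-scope content
        return "OOD: non-monument image"
    if style_scores.max() < tau:       # node 3: unknown architectural style
        return "OOD: unknown style"
    return f"in-distribution: style {int(style_scores.argmax())}"
```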
cs.AI
[231] Agentic Business Process Management Systems
Marlon Dumas, Fredrik Milani, David Chapela-Campa
Main category: cs.AI
TL;DR: The paper proposes Agentic Business Process Management Systems (A-BPMS) that integrate AI agents with process mining to shift BPM from automation to autonomy and from design-driven to data-driven management.
Details
Motivation: The rise of Generative and Agentic AI presents an opportunity for a new wave in BPM evolution, shifting focus from automation to autonomy and from design-driven to data-driven process management using process mining techniques.
Method: The paper proposes an architectural vision for A-BPMS that integrates autonomy, reasoning, and learning into process management, leveraging process mining as a foundation for agents to sense process states, reason about improvements, and act to optimize performance.
Result: The paper outlines how process mining provides the foundation for AI agents in BPM and proposes a new class of platforms (A-BPMS) that support a continuum of processes from human-driven to fully autonomous.
Conclusion: A-BPMS will redefine the boundaries of process automation and governance by integrating AI agents with process mining, enabling autonomous, data-driven process management across the full spectrum from human-driven to fully autonomous processes.
Abstract: Since the early 90s, the evolution of the Business Process Management (BPM) discipline has been punctuated by successive waves of automation technologies. Some of these technologies enable the automation of individual tasks, while others focus on orchestrating the execution of end-to-end processes. The rise of Generative and Agentic Artificial Intelligence (AI) is opening the way for another such wave. However, this wave is poised to be different because it shifts the focus from automation to autonomy and from design-driven management of business processes to data-driven management, leveraging process mining techniques. This position paper, based on a keynote talk at the 2025 Workshop on AI for BPM, outlines how process mining has laid the foundations on top of which agents can sense process states, reason about improvement opportunities, and act to maintain and optimize performance. The paper proposes an architectural vision for Agentic Business Process Management Systems (A-BPMS): a new class of platforms that integrate autonomy, reasoning, and learning into process management and execution. The paper contends that such systems must support a continuum of processes, spanning from human-driven to fully autonomous, thus redefining the boundaries of process automation and governance.
[232] LLM Driven Design of Continuous Optimization Problems with Controllable High-level Properties
Urban Skvorc, Niki van Stein, Moritz Seiler, Britta Grimme, Thomas Bäck, Heike Trautmann
Main category: cs.AI
TL;DR: LLMs can generate diverse optimization problems with specific landscape characteristics through evolutionary guidance and ELA-based scoring, expanding benchmark diversity beyond BBOB.
Details
Motivation: Existing benchmark suites like BBOB lack structural diversity, limiting continuous black-box optimization benchmarking. Need for more diverse, interpretable benchmark problems.
Method: Use LLaMEA framework to guide LLMs in generating problem code from natural-language descriptions of target properties (multimodality, separability, etc.). Evolutionary loop with ELA-based property predictors scores candidates. ELA-space fitness-sharing increases diversity and avoids redundant landscapes.
Result: Generated functions exhibit intended structural traits verified by basin-of-attraction analysis, statistical testing, and visual inspection. t-SNE embedding shows they expand BBOB instance space rather than forming unrelated clusters.
Conclusion: The approach produces a broad, interpretable, and reproducible benchmark library for landscape analysis and automated algorithm selection, addressing limitations of existing test suites.
Abstract: Benchmarking in continuous black-box optimisation is hindered by the limited structural diversity of existing test suites such as BBOB. We explore whether large language models embedded in an evolutionary loop can be used to design optimisation problems with clearly defined high-level landscape characteristics. Using the LLaMEA framework, we guide an LLM to generate problem code from natural-language descriptions of target properties, including multimodality, separability, basin-size homogeneity, search-space homogeneity, and global-local optima contrast. Inside the loop we score candidates through ELA-based property predictors. We introduce an ELA-space fitness-sharing mechanism that increases population diversity and steers the generator away from redundant landscapes. A complementary basin-of-attraction analysis, statistical testing and visual inspection, verifies that many of the generated functions indeed exhibit the intended structural traits. In addition, a t-SNE embedding shows that they expand the BBOB instance space rather than forming an unrelated cluster. The resulting library provides a broad, interpretable, and reproducible set of benchmark problems for landscape analysis and downstream tasks such as automated algorithm selection.
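The fitness-sharing step can be illustrated with the classic sharing formula applied in ELA feature space. The sketch below assumes a Euclidean distance and a triangular sharing kernel, since the paper's exact choices are not stated here:

```python
import numpy as np

def shared_fitness(fitness, ela_features, sigma=0.5, alpha=1.0):
    """Fitness sharing in ELA feature space.

    fitness:      (n,) raw scores from the ELA-based property predictors
    ela_features: (n, d) ELA feature vectors of the candidate problems
    sigma, alpha: sharing radius and kernel shape (assumed values)
    """
    diff = ela_features[:, None, :] - ela_features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # pairwise distances in ELA space
    # Triangular sharing kernel: candidates within radius sigma share fitness
    share = np.where(dist < sigma, 1.0 - (dist / sigma) ** alpha, 0.0)
    niche_count = share.sum(axis=1)  # >= 1, since each candidate shares with itself
    return fitness / niche_count     # redundant landscapes are penalized
```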
[233] More at Stake: How Payoff and Language Shape LLM Agent Strategies in Cooperation Dilemmas
Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Nguyen Lam Phu Quy, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Pham Phu Hoa, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han
Main category: cs.AI
TL;DR: LLMs show consistent strategic patterns in repeated social dilemmas, with behavior influenced by payoff magnitude and linguistic context, revealing cooperation biases with implications for AI governance.
Details
Motivation: As LLMs increasingly act as autonomous agents in interactive and multi-agent settings, understanding their strategic behavior is critical for safety, coordination, and AI-driven social and economic systems.
Method: Used payoff-scaled Prisoner’s Dilemma to isolate sensitivity to incentive strength, trained supervised classifiers on canonical repeated-game strategies, and applied them to LLM decisions across models and languages.
Result: Observed consistent behavioral patterns including incentive-sensitive conditional strategies and cross-linguistic divergence. Linguistic framing sometimes matched or exceeded architectural effects in shaping behavior.
Conclusion: Provides a unified framework for auditing LLMs as strategic agents and highlights cooperation biases with direct implications for AI governance and multi-agent system design.
Abstract: As LLMs increasingly act as autonomous agents in interactive and multi-agent settings, understanding their strategic behavior is critical for safety, coordination, and AI-driven social and economic systems. We investigate how payoff magnitude and linguistic context shape LLM strategies in repeated social dilemmas, using a payoff-scaled Prisoner’s Dilemma to isolate sensitivity to incentive strength. Across models and languages, we observe consistent behavioral patterns, including incentive-sensitive conditional strategies and cross-linguistic divergence. To interpret these dynamics, we train supervised classifiers on canonical repeated-game strategies and apply them to LLM decisions, revealing systematic, model- and language-dependent behavioral intentions, with linguistic framing sometimes matching or exceeding architectural effects. Our results provide a unified framework for auditing LLMs as strategic agents and highlight cooperation biases with direct implications for AI governance and multi-agent system design.
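A payoff-scaled Prisoner's Dilemma simply multiplies the payoff matrix by a scalar, leaving the strategic ordering intact while varying the size of the stakes. A minimal sketch, assuming the canonical T=5, R=3, P=1, S=0 base values (the paper's actual base payoffs are not given here):

```python
import numpy as np

def scaled_pd_payoffs(scale=1.0):
    """Payoff matrix for a scaled Prisoner's Dilemma.

    Rows/cols: 0 = cooperate, 1 = defect; entry [i, j] is the row player's
    payoff. Canonical T=5, R=3, P=1, S=0 values are assumed here.
    """
    base = np.array([[3.0, 0.0],   # (C,C) -> R, (C,D) -> S
                     [5.0, 1.0]])  # (D,C) -> T, (D,D) -> P
    # Scaling preserves T > R > P > S for any scale > 0, so only the magnitude
    # of the incentives changes, not the game's strategic structure.
    return scale * base
```

Comparing agent behavior at, say, scale=1 versus scale=100 then isolates sensitivity to incentive strength from everything else in the prompt.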
[234] Explainable Uncertainty Quantification for Wastewater Treatment Energy Prediction via Interval Type-2 Neuro-Fuzzy System
Qusai Khaled, Bahjat Mallak, Uzay Kaymak, Laura Genga
Main category: cs.AI
TL;DR: IT2-ANFIS model for wastewater treatment energy forecasting with explainable uncertainty quantification through fuzzy rule structures
Details
Motivation: Wastewater treatment consumes significant global electricity (1-3%), requiring accurate energy forecasting for optimization. Current ML models lack explainable uncertainty quantification needed for risk-aware decision-making in safety-critical infrastructure.
Method: Developed Interval Type-2 Adaptive Neuro-Fuzzy Inference System (IT2-ANFIS) that generates interpretable prediction intervals through fuzzy rule structures. Framework decomposes uncertainty across three levels: feature-level (identifies ambiguous variables), rule-level (reveals local model confidence), and instance-level (quantifies overall prediction uncertainty).
Result: Validated on Melbourne Water’s Eastern Treatment Plant dataset. IT2-ANFIS achieves comparable predictive performance to first order ANFIS with substantially reduced variance across training runs. Provides explainable uncertainty estimates linking prediction confidence directly to operational conditions and input variables.
Conclusion: The IT2-ANFIS framework offers both accurate energy forecasting and explainable uncertainty quantification, addressing the critical need for transparent, risk-aware decision-making in wastewater treatment operations.
Abstract: Wastewater treatment plants consume 1-3% of global electricity, making accurate energy forecasting critical for operational optimization and sustainability. While machine learning models provide point predictions, they lack explainable uncertainty quantification essential for risk-aware decision-making in safety-critical infrastructure. This study develops an Interval Type-2 Adaptive Neuro-Fuzzy Inference System (IT2-ANFIS) that generates interpretable prediction intervals through fuzzy rule structures. Unlike black-box probabilistic methods, the proposed framework decomposes uncertainty across three levels: feature-level footprints of uncertainty identify which variables introduce ambiguity, rule-level analysis reveals confidence in local models, and instance-level intervals quantify overall prediction uncertainty. Validated on Melbourne Water’s Eastern Treatment Plant dataset, IT2-ANFIS achieves comparable predictive performance to first-order ANFIS with substantially reduced variance across training runs, while providing explainable uncertainty estimates that link prediction confidence directly to operational conditions and input variables.
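For intuition, an interval type-2 Gaussian membership function with an uncertain width yields a lower and an upper membership grade; the band between them is the footprint of uncertainty that drives the interval predictions. A sketch of this common construction (the paper's exact parameterization may differ):

```python
import numpy as np

def it2_gaussian_membership(x, center, sigma_lo, sigma_hi):
    """Interval type-2 Gaussian membership with uncertain width.

    Returns (lower, upper) membership grades; the band between them is the
    footprint of uncertainty (FOU) from which interval outputs are built.
    """
    mu_narrow = np.exp(-0.5 * ((x - center) / sigma_lo) ** 2)
    mu_wide = np.exp(-0.5 * ((x - center) / sigma_hi) ** 2)
    return np.minimum(mu_narrow, mu_wide), np.maximum(mu_narrow, mu_wide)

# A wide FOU on an input feature flags it as a source of ambiguity
lo, hi = it2_gaussian_membership(x=2.0, center=0.0, sigma_lo=1.0, sigma_hi=1.5)
```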
[235] RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures
Andrew Jaffe, Noah Reicin, Jinho D. Choi
Main category: cs.AI
TL;DR: LLMs struggle with non-sequential instructions despite identical content, revealing structural sensitivity as a fundamental limitation in current architectures.
Details
Motivation: LLMs are increasingly used for complex workflows, but their ability to handle non-sequential instruction flow remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it hard to isolate the impact of prompt topology on performance.
Method: Introduced RIFT (Reordered Instruction Following Testbed) to assess instruction following by disentangling structure from content. Used rephrased Jeopardy! question-answer pairs across two prompt structures: linear prompts (sequential progression) and jumping prompts (identical content but requiring non-sequential traversal). Conducted 10,000 evaluations across six state-of-the-art open-source LLMs.
Result: Accuracy dropped by up to 72% under jumping conditions compared to baseline, showing strong dependence on positional continuity. Error analysis revealed approximately 50% of failures stem from instruction-order violations and semantic drift, indicating LLMs internalize instruction following as sequential patterns rather than reasoning skills.
Conclusion: Structural sensitivity is a fundamental limitation in current LLM architectures, with direct implications for applications requiring non-sequential control flow like workflow automation and multi-agent systems.
Abstract: Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain the flow of instructions remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. We introduce RIFT, Reordered Instruction Following Testbed, to assess instruction following by disentangling structure from content. Using rephrased Jeopardy! question-answer pairs, we test LLMs across two prompt structures: linear prompts, which progress sequentially, and jumping prompts, which preserve identical content but require non-sequential traversal. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions (compared to baseline), revealing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction-order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results reveal structural sensitivity as a fundamental limitation in current architectures, with direct implications for applications requiring non-sequential control flow such as workflow automation and multi-agent systems.
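The linear/jumping contrast can be reproduced by reordering identical instruction content while pinning the required execution order. A minimal sketch with hypothetical wording (the benchmark's actual prompt templates may differ):

```python
def make_prompts(steps):
    """Build linear and jumping prompts from identical instruction content.

    Linear: steps listed in execution order. Jumping: the same numbered steps
    listed out of order, so following them requires non-sequential traversal.
    """
    linear = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    shuffled = list(range(len(steps)))[::-1]  # reversed listing, for example
    listing = "\n".join(f"Step {i + 1}: {steps[i]}" for i in shuffled)
    jumping = "Execute the steps in numeric order, not listing order.\n" + listing
    return linear, jumping

linear, jumping = make_prompts(["read the clue", "rephrase it", "answer it"])
```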
[236] TS-Debate: Multimodal Collaborative Debate for Zero-Shot Time Series Reasoning
Patara Trirat, Jin Myung Kwak, Jay Heo, Heejun Lee, Sung Ju Hwang
Main category: cs.AI
TL;DR: TS-Debate is a multi-agent debate framework for zero-shot time series reasoning that uses specialized experts for text, visuals, and numbers, with verification mechanisms to reduce hallucinations.
Details
Motivation: LLMs struggle with numeric fidelity, modality interference, and principled cross-modal integration in time series analysis, despite showing promise in reasoning over temporal structure.
Method: A collaborative multi-agent debate framework with dedicated expert agents for textual context, visual patterns, and numerical signals, using explicit domain knowledge elicitation, structured debate protocol, reviewer agents with verification-conflict-calibration mechanism, and lightweight code execution for programmatic verification.
Result: Across 20 tasks spanning three public benchmarks, TS-Debate achieves consistent and significant performance improvements over strong baselines, including standard multimodal debate where all agents observe all inputs.
Conclusion: TS-Debate preserves modality fidelity, exposes conflicting evidence, and mitigates numeric hallucinations without task-specific fine-tuning, demonstrating effective zero-shot time series reasoning.
Abstract: Recent progress at the intersection of large language models (LLMs) and time series (TS) analysis has revealed both promise and fragility. While LLMs can reason over temporal structure given carefully engineered context, they often struggle with numeric fidelity, modality interference, and principled cross-modal integration. We present TS-Debate, a modality-specialized, collaborative multi-agent debate framework for zero-shot time series reasoning. TS-Debate assigns dedicated expert agents to textual context, visual patterns, and numerical signals, preceded by explicit domain knowledge elicitation, and coordinates their interaction via a structured debate protocol. Reviewer agents evaluate agent claims using a verification-conflict-calibration mechanism, supported by lightweight code execution and numerical lookup for programmatic verification. This architecture preserves modality fidelity, exposes conflicting evidence, and mitigates numeric hallucinations without task-specific fine-tuning. Across 20 tasks spanning three public benchmarks, TS-Debate achieves consistent and significant performance improvements over strong baselines, including standard multimodal debate in which all agents observe all inputs.
[237] Uncertainty-Aware 3D Emotional Talking Face Synthesis with Emotion Prior Distillation
Nanhan Shen, Zhilei Liu
Main category: cs.AI
TL;DR: UA-3DTalk improves 3D emotional talking face synthesis by addressing poor audio-vision emotion alignment and suboptimal multi-view fusion through uncertainty-aware mechanisms and emotion prior distillation.
Details
Motivation: Existing 3D emotional talking face methods suffer from poor audio-vision emotion alignment (difficult audio emotion extraction and inadequate control over emotional micro-expressions) and use a one-size-fits-all multi-view fusion strategy that ignores uncertainty and feature quality differences, undermining rendering quality.
Method: UA-3DTalk has three core modules: 1) Prior Extraction module disentangles audio into content-synchronized features for alignment and person-specific complementary features; 2) Emotion Distillation module uses multi-modal attention-weighted fusion and 4D Gaussian encoding with multi-resolution code-books for fine-grained audio emotion extraction and micro-expression control; 3) Uncertainty-based Deformation uses uncertainty blocks to estimate view-specific aleatoric and epistemic uncertainty for adaptive multi-view fusion, with a multi-head decoder for Gaussian primitive optimization.
Result: Extensive experiments on regular and emotional datasets show UA-3DTalk outperforms state-of-the-art methods (DEGSTalk, EDTalk) by 5.2% in E-FID for emotion alignment, 3.1% in SyncC for lip synchronization, and 0.015 in LPIPS for rendering quality.
Conclusion: UA-3DTalk effectively addresses key challenges in 3D emotional talking face synthesis through uncertainty-aware mechanisms and emotion prior distillation, achieving superior performance in emotion alignment, lip synchronization, and rendering quality compared to existing methods.
Abstract: Emotional Talking Face synthesis is pivotal in multimedia and signal processing, yet existing 3D methods suffer from two critical challenges: poor audio-vision emotion alignment, manifested as difficult audio emotion extraction and inadequate control over emotional micro-expressions; and a one-size-fits-all multi-view fusion strategy that overlooks uncertainty and feature quality differences, undermining rendering quality. We propose UA-3DTalk, Uncertainty-Aware 3D Emotional Talking Face Synthesis with emotion prior distillation, which has three core modules: the Prior Extraction module disentangles audio into content-synchronized features for alignment and person-specific complementary features for individualization; the Emotion Distillation module introduces a multi-modal attention-weighted fusion mechanism and 4D Gaussian encoding with multi-resolution code-books, enabling fine-grained audio emotion extraction and precise control of emotional micro-expressions; the Uncertainty-based Deformation deploys uncertainty blocks to estimate view-specific aleatoric (input noise) and epistemic (model parameters) uncertainty, realizing adaptive multi-view fusion and incorporating a multi-head decoder for Gaussian primitive optimization to mitigate the limitations of uniform-weight fusion. Extensive experiments on regular and emotional datasets show UA-3DTalk outperforms state-of-the-art methods like DEGSTalk and EDTalk by 5.2% in E-FID for emotion alignment, 3.1% in SyncC for lip synchronization, and 0.015 in LPIPS for rendering quality. Project page: https://mrask999.github.io/UA-3DTalk
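The uncertainty-based fusion idea reduces, in its simplest form, to weighting each view inversely by its total estimated uncertainty. A sketch of that general idea follows; UA-3DTalk's actual deformation module is more elaborate and includes a multi-head decoder:

```python
import numpy as np

def uncertainty_weighted_fusion(view_feats, aleatoric, epistemic, eps=1e-6):
    """Fuse per-view features with weights inverse to total uncertainty.

    view_feats:           (V, D) per-view features
    aleatoric, epistemic: (V,) per-view uncertainty estimates
    """
    total = np.asarray(aleatoric) + np.asarray(epistemic)
    weights = 1.0 / (total + eps)        # confident views contribute more
    weights = weights / weights.sum()    # normalize to a convex combination
    return (weights[:, None] * np.asarray(view_feats)).sum(axis=0)
```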
[238] Neural Theorem Proving for Verification Conditions: A Real-World Benchmark
Qiyuan Xu, Xiaokun Luan, Renxi Wang, Joshua Ong Jun Leang, Peixin Wang, Haonan Li, Wenda Li, Conrad Watt
Main category: cs.AI
TL;DR: First benchmark for neural theorem proving on real-world verification conditions from industrial codebases, showing LLMs have promise but significant challenges remain.
Details
Motivation: Automated theorem proving for verification conditions (VCs) is a major bottleneck in program verification. Real-world VCs are often too hard for existing ATPs, requiring extensive manual proofs. While neural theorem proving has succeeded in mathematical domains, its application to program verification VCs remains unexplored with no dedicated benchmarks.
Method: Created NTP4VC benchmark using real-world projects (Linux, Contiki-OS) and industrial verification pipelines (Why3, Frama-C) to generate semantically equivalent test cases across Isabelle, Lean, and Rocq formal languages. Evaluated both general-purpose and theorem-proving fine-tuned LLMs on this benchmark.
Result: LLMs show promise in VC proving but face significant challenges for program verification tasks. The benchmark reveals a large gap between current capabilities and practical needs, highlighting substantial opportunity for future research.
Conclusion: NTP4VC is the first real-world multi-language benchmark for neural theorem proving on verification conditions. While LLMs demonstrate potential, program verification presents unique challenges that require further research to bridge the gap between current capabilities and practical application needs.
Abstract: Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification–particularly VC proving–remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. From real-world projects such as Linux and Contiki-OS kernel, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across formal languages of Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.
[239] Exploring Weaknesses in Function Call Models via Reinforcement Learning: An Adversarial Data Augmentation Approach
Weiran Guo, Bing Bo, Shaoxiang Wu, Jingsheng Yang
Main category: cs.AI
TL;DR: Proposes adversarial data augmentation using reinforcement learning to improve LLM function call robustness by generating challenging queries.
Details
Motivation: Existing methods for improving LLM function call capabilities rely on manual annotation or automated generation, which lack targeted design and are constrained by fixed patterns, limiting generalization and robustness.
Method: Adversarial data augmentation using reinforcement learning: trains a query model with RL to generate adversarial queries targeting function call model weaknesses, using zero-sum game formulation with iterative alternating training.
Result: The method advances development of more robust function call models and provides systematic way to identify and correct weaknesses in LLM external tool interaction.
Conclusion: Proposed adversarial RL approach effectively enhances function call LLM robustness by systematically targeting weaknesses through generated adversarial queries.
Abstract: Function call capabilities have become crucial for Large Language Models (LLMs), enabling them to interact more effectively with external tools and APIs. Existing methods for improving the function call capabilities of LLMs rely on data obtained either through manual annotation or automated generation by models, and use this data to finetune the LLMs. However, these methods often lack targeted design and are constrained by fixed patterns and data distributions, which limits their effectiveness in enhancing the generalization and robustness of function call LLMs. To address this limitation, we propose a novel adversarial data augmentation method that employs reinforcement learning to systematically identify and target the weaknesses of function call LLMs. Our training framework introduces a query model trained with reinforcement learning (RL) to generate adversarial queries that are specifically designed to challenge function call (FC) models. This approach adopts a zero-sum game formulation, where the query model and the FC model engage in iterative alternating training. Overall, our method advances the development of more robust FC models and provides a systematic way to identify and correct weaknesses in the ability of LLMs to interact with external tools.
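The alternating zero-sum scheme can be summarized as a loop in which the query model is rewarded for FC-model failures, and the FC model is then fine-tuned on exactly those hard queries. A schematic sketch, in which generate, call, rl_update, finetune, and reward_fn are placeholder hooks rather than APIs from the paper:

```python
def adversarial_alternating_training(query_model, fc_model, reward_fn,
                                     rounds=10, batch_size=32):
    """Zero-sum alternating loop: the query model hunts for failures, and the
    FC model is then trained on them. All method names are placeholder hooks.
    """
    for _ in range(rounds):
        # 1) Query model proposes adversarial function-call queries
        queries = query_model.generate(batch_size)
        # 2) FC model attempts them; an FC failure is a positive reward for
        #    the query model (zero-sum: FC success is the query model's loss)
        calls = [fc_model.call(q) for q in queries]
        rewards = [reward_fn(q, c) for q, c in zip(queries, calls)]
        query_model.rl_update(queries, rewards)
        # 3) FC model is fine-tuned on the queries it failed, using corrected
        #    function calls as supervision
        hard_cases = [(q, c) for q, c, r in zip(queries, calls, rewards) if r > 0]
        fc_model.finetune(hard_cases)
```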
[240] Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction
Zhicheng Zhang, Zhaocheng Du, Jieming Zhu, Jiwei Tang, Fengyuan Lu, Wang Jiaheng, Song-Li Wu, Qianhui Zhu, Jingyu Li, Hai-Tao Zheng, Zhenhua Dong
Main category: cs.AI
TL;DR: LAIN is a plug-and-play framework that uses sequence length as a conditioning signal to balance long- and short-sequence modeling in CTR prediction, improving performance for short-sequence users without harming long-sequence effectiveness.
Details
Motivation: Modern recommendation systems face length heterogeneity in user behavior sequences. Increasing maximum input sequence length in existing CTR models paradoxically degrades performance for short-sequence users due to attention polarization and length imbalance in training data.
Method: LAIN (Length-Adaptive Interest Network) with three components: 1) Spectral Length Encoder that maps length into continuous representations, 2) Length-Conditioned Prompting that injects global contextual cues into both long- and short-term behavior branches, and 3) Length-Modulated Attention that adaptively adjusts attention sharpness based on sequence length.
Result: Extensive experiments on three real-world benchmarks across five strong CTR backbones show LAIN consistently improves overall performance, achieving up to 1.15% AUC gain and 2.25% log loss reduction. Notably, significantly improves accuracy for short-sequence users without sacrificing long-sequence effectiveness.
Conclusion: LAIN offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation by explicitly incorporating sequence length as a conditioning signal to balance long- and short-sequence modeling.
Abstract: User behavior sequences in modern recommendation systems exhibit significant length heterogeneity, ranging from sparse short-term interactions to rich long-term histories. While longer sequences provide more context, we observe that increasing the maximum input sequence length in existing CTR models paradoxically degrades performance for short-sequence users due to attention polarization and length imbalance in training data. To address this, we propose LAIN (Length-Adaptive Interest Network), a plug-and-play framework that explicitly incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. LAIN consists of three lightweight components: a Spectral Length Encoder that maps length into continuous representations, Length-Conditioned Prompting that injects global contextual cues into both long- and short-term behavior branches, and Length-Modulated Attention that adaptively adjusts attention sharpness based on sequence length. Extensive experiments on three real-world benchmarks across five strong CTR backbones show that LAIN consistently improves overall performance, achieving up to 1.15% AUC gain and 2.25% log loss reduction. Notably, our method significantly improves accuracy for short-sequence users without sacrificing long-sequence effectiveness. Our work offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.
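Length-Modulated Attention can be pictured as attention with a length-dependent temperature. The sketch below shows one plausible instantiation; LAIN's actual modulation function is learned, and the square-root schedule here is our assumption:

```python
import numpy as np

def length_modulated_attention(q, k, v, seq_len, ref_len=100.0):
    """Attention whose sharpness depends on sequence length.

    Shorter sequences get a lower temperature (sharper attention), so the few
    available behaviors are not washed out; longer sequences are smoothed.
    q: (n, d), k: (m, d), v: (m, dv)
    """
    d = q.shape[-1]
    temperature = np.sqrt(d) * (seq_len / ref_len) ** 0.5  # assumed schedule
    scores = q @ k.T / temperature
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```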
[241] LocationAgent: A Hierarchical Agent for Image Geolocation via Decoupling Strategy and Evidence from Parametric Knowledge
Qiujun Li, Zijin Xiao, Xulin Wang, Zhidan Ma, Cheng Yang, Haifeng Li
Main category: cs.AI
TL;DR: LocationAgent is a hierarchical localization agent that separates reasoning from evidence verification using external tools, achieving 30%+ improvement over existing methods in zero-shot settings.
Details
Motivation: Existing image geolocation methods internalize location knowledge and reasoning patterns into static memory, making them prone to factual hallucinations and generalization bottlenecks in open-world settings requiring dynamic knowledge.
Method: Proposes LocationAgent with RER architecture (Reasoner-Executor-Recorder) for hierarchical reasoning and external clue exploration tools for evidence verification. Also introduces CCL-Bench benchmark for Chinese data.
Result: LocationAgent significantly outperforms existing methods by at least 30% in zero-shot settings.
Conclusion: Separating reasoning logic from evidence verification using external tools effectively addresses hallucination and generalization issues in image geolocation, with the approach validated through comprehensive experiments.
Abstract: Image geolocation aims to infer capture locations based on visual content. Fundamentally, this constitutes a reasoning process composed of hypothesis-verification cycles, requiring models to possess both geospatial reasoning capabilities and the ability to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory-based reinforcement fine-tuning. Consequently, these methods are prone to factual hallucinations and generalization bottlenecks in open-world settings or scenarios requiring dynamic knowledge. To address these challenges, we propose a Hierarchical Localization Agent, called LocationAgent. Our core philosophy is to retain hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER architecture (Reasoner-Executor-Recorder), which employs role separation and context compression to prevent the drifting problem in multi-step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL-Bench (China City Location Bench), an image geolocation benchmark encompassing various scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent significantly outperforms existing methods by at least 30% in zero-shot settings.
[242] Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement
Wangyang Ying, Yanchi Liu, Xujiang Zhao, Wei Cheng, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen
Main category: cs.AI
TL;DR: Procedural graph extraction from natural language using multi-agent framework with structural and logical refinement
Details
Motivation: Automatically extracting workflows as procedural graphs from natural language is promising but challenging, requiring both structural validity and logical alignment. Current LLMs often produce ill-formed structures or misinterpret logical flows.
Method: Multi-agent framework with three iterative stages: (1) graph extraction by graph builder agent, (2) structural feedback via simulation agent diagnosing structural defects, and (3) logical feedback via semantic agent aligning semantics between flow logic and linguistic cues. Uses natural language feedback prioritized and injected into subsequent prompts.
Result: Achieves substantial improvements in both structural correctness and logical consistency over strong baselines.
Conclusion: The multi-agent framework enables interpretable and controllable refinement for procedural graph extraction, allowing agents to target distinct error types without supervision or parameter updates.
Abstract: Automatically extracting workflows as procedural graphs from natural language is promising yet underexplored, demanding both structural validity and logical alignment. While recent large language models (LLMs) show potential for procedural graph extraction, they often produce ill-formed structures or misinterpret logical flows. We present a multi-agent framework that formulates procedural graph extraction as a multi-round reasoning process with dedicated structural and logical refinement. The framework iterates through three stages: (1) a graph extraction phase with the graph builder agent, (2) a structural feedback phase in which a simulation agent diagnoses and explains structural defects, and (3) a logical feedback phase in which a semantic agent aligns semantics between flow logic and linguistic cues in the source text. Important feedback is prioritized and expressed in natural language, which is injected into subsequent prompts, enabling interpretable and controllable refinement. This modular design allows agents to target distinct error types without supervision or parameter updates. Experiments demonstrate that the framework achieves substantial improvements in both structural correctness and logical consistency over strong baselines.
[243] CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation
Jingyu Li, Zhaocheng Du, Qianhui Zhu, Kaiyuan Li, Zhicheng Zhang, Song-Li Wu, Chaolang Li, Pengwen Dai
Main category: cs.AI
TL;DR: CollectiveKV reduces KV cache storage in sequential recommendation by sharing similar KV patterns across users via a global pool, achieving 99.2% compression while maintaining performance.
Details
Motivation: KV cache reduces inference latency in sequential recommendation but creates massive storage overhead due to large user bases with long histories. Observations show significant KV similarities across users, suggesting collaborative signals exist in KV sequences.
Method: Proposes CollectiveKV with cross-user KV sharing: 1) Uses SVD analysis to identify shareable vs user-specific KV information, 2) Creates learnable global KV pool for shared information, 3) During inference, retrieves high-dimensional shared KV from pool and concatenates with low-dimensional user-specific KV.
Result: Achieves 0.8% KV cache size (99.2% compression) across five sequential recommendation models and three datasets while maintaining or even improving model performance.
Conclusion: KV sequences contain collaborative signals that can be exploited for efficient storage. CollectiveKV effectively compresses KV cache with minimal performance impact, addressing the storage bottleneck in large-scale sequential recommendation systems.
Abstract: Sequential recommendation models are widely used in applications, yet they face stringent latency requirements. Mainstream models leverage the Transformer attention mechanism to improve performance, but its computational complexity grows with the sequence length, leading to a latency challenge for long sequences. Consequently, KV cache technology has recently been explored in sequential recommendation systems to reduce inference latency. However, KV cache introduces substantial storage overhead in sequential recommendation systems, which often have a large user base with potentially very long user history sequences. In this work, we observe that KV sequences across different users exhibit significant similarities, indicating the existence of collaborative signals in KV. Furthermore, we analyze the KV using singular value decomposition (SVD) and find that the information in KV can be divided into two parts: the majority of the information is shareable across users, while a small portion is user-specific. Motivated by this, we propose CollectiveKV, a cross-user KV sharing mechanism. It captures the information shared across users through a learnable global KV pool. During inference, each user retrieves high-dimensional shared KV from the pool and concatenates them with low-dimensional user-specific KV to obtain the final KV. Experiments on five sequential recommendation models and three datasets show that our method can compress the KV cache to only 0.8% of its original size, while maintaining or even enhancing model performance.
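The inference-time assembly can be pictured as a nearest-neighbor lookup into the global pool followed by concatenation. A simplified sketch (pool training and the exact retrieval rule are not specified here, so dot-product top-k retrieval is assumed):

```python
import numpy as np

def collective_kv(user_kv, pool_keys, pool_values, top_k=4):
    """Assemble final KV states from a global shared pool plus user-specific KV.

    user_kv:     (L, d_user) low-dimensional user-specific KV states
    pool_keys:   (P, d_user) addresses of the learnable global pool
    pool_values: (P, d_shared) high-dimensional shared KV entries
    """
    sims = user_kv @ pool_keys.T                      # (L, P) similarities
    idx = np.argsort(-sims, axis=1)[:, :top_k]        # top-k pool entries
    shared = pool_values[idx].mean(axis=1)            # (L, d_shared) shared part
    # Only user_kv must be cached per user; the pool is stored once globally
    return np.concatenate([shared, user_kv], axis=1)  # (L, d_shared + d_user)
```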
[244] CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning
Van-Quang Nguyen, Takayuki Okatani
Main category: cs.AI
TL;DR: CoReTab introduces a code-driven reasoning framework for multimodal table understanding that generates interpretable, verifiable multi-step reasoning with Python code, achieving significant performance gains over existing methods.
Details
Motivation: Existing multimodal table understanding datasets provide only short factual answers without explicit reasoning supervision, leading to models that generate brief, uninterpretable responses with insufficient accuracy.
Method: Developed CoReTab framework that couples multi-step reasoning with executable Python code to create scalable, interpretable annotations. Curated 115K verified samples and fine-tuned open-source MLLMs through a three-stage pipeline.
Result: Achieved significant gains of +6.2% (table QA), +5.7% (fact verification), and +25.6% (table structure understanding) over MMTab-trained baselines across 17 benchmarks, while producing transparent reasoning traces.
Conclusion: CoReTab establishes a robust and generalizable supervision framework for improving multi-step reasoning in multimodal table understanding through code-driven, verifiable reasoning.
Abstract: Existing datasets for multimodal table understanding, such as MMTab, primarily provide short factual answers without explicit multi-step reasoning supervision. Models trained on these datasets often generate brief responses that offer insufficient accuracy and limited insight into how these models arrive at the final answer. We introduce CoReTab, a code-driven reasoning framework that produces scalable, interpretable, and automatically verifiable annotations by coupling multi-step reasoning with executable Python code. Using the CoReTab framework, we curate a dataset of 115K verified samples averaging 529 tokens per response and fine-tune open-source MLLMs through a three-stage pipeline. We evaluate the resulting model trained on CoReTab across 17 MMTab benchmarks spanning table question answering, fact verification, and table structure understanding. Our model achieves significant gains of +6.2%, +5.7%, and +25.6%, respectively, over MMTab-trained baselines, while producing transparent and verifiable reasoning traces. These results establish CoReTab as a robust and generalizable supervision framework for improving multi-step reasoning in multimodal table understanding.
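A hypothetical CoReTab-style annotation might pair each reasoning step with executable pandas code whose output is checked against the gold answer, for example:

```python
import pandas as pd

# Hypothetical parsed table (in MMTab the input is an image of the table)
table = pd.DataFrame({"country": ["France", "Italy", "Spain"],
                      "medals":  [10, 7, 12]})

# Question: "Which country won the most medals?"
# Step 1: locate the row with the maximum medal count
top_row = table.loc[table["medals"].idxmax()]
# Step 2: read off the country name as the final answer
answer = top_row["country"]
assert answer == "Spain"  # automatic verification against the gold label
```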
[245] MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution
Libo Sun, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Main category: cs.AI
TL;DR: MAGNET is a memory-driven adaptive GUI agent framework with dual-level memory that leverages stable functional semantics and task intents across UI changes to improve performance in evolving software environments.
Details
Motivation: Mobile GUI agents trained on historical data fail when UI appearance changes and workflows reorganize, even though functional semantics and task intents remain fundamentally stable across updates.
Method: Introduces MAGNET with dual-level memory: stationary memory links visual features to stable functional semantics for action grounding, and procedural memory captures stable task intents across varying workflows. Includes dynamic memory evolution mechanism that continuously refines memories by prioritizing frequently accessed knowledge.
Result: Online benchmark AndroidWorld evaluations show substantial improvements over baselines, while offline benchmarks confirm consistent gains under distribution shifts.
Conclusion: Leveraging stable structures across interface changes improves agent performance and generalization in evolving software environments, validating the approach of focusing on functional semantics and task intents rather than surface appearance.
Abstract: Mobile GUI agents powered by large foundation models enable autonomous task execution, but frequent updates altering UI appearance and reorganizing workflows cause agents trained on historical data to fail. Despite surface changes, functional semantics and task intents remain fundamentally stable. Building on this insight, we introduce MAGNET, a memory-driven adaptive agent framework with dual-level memory: stationary memory linking diverse visual features to stable functional semantics for robust action grounding and procedural memory capturing stable task intents across varying workflows. We propose a dynamic memory evolution mechanism that continuously refines both memories by prioritizing frequently accessed knowledge. Online benchmark AndroidWorld evaluations show substantial improvements over baselines, while offline benchmarks confirm consistent gains under distribution shifts. These results validate that leveraging stable structures across interface changes improves agent performance and generalization in evolving software environments.
[246] MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning
Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J. Stuckey, Hamid Rezatofighi
Main category: cs.AI
TL;DR: MATA is a multi-agent hierarchical trainable automaton for visual reasoning that uses a trainable hyper agent to coordinate specialized agents, achieving state-of-the-art results with improved interpretability.
Details
Motivation: Current vision-language models have strong perception but suffer from implicit reasoning that's hard to explain and prone to hallucinations on complex queries. Compositional methods improve interpretability but typically rely on single agents or fixed pipelines, lacking the ability to dynamically decide when agents should collaborate or compete.
Method: MATA is a multi-agent system structured as a hierarchical finite-state automaton. A trainable hyper agent controls top-level transitions between specialized agents. Each agent runs a small rule-based sub-automaton for reliable micro-control. All agents read and write to shared memory for transparent execution history. The hyper agent’s transition policy is trained using a supervised finetuning dataset (MATA-SFT-90K) created from transition-trajectory trees.
Result: MATA achieves state-of-the-art results across multiple visual reasoning benchmarks compared to both monolithic and compositional baselines. The finetuned LLM as transition policy effectively understands queries and agent capabilities to choose optimal agents for tasks.
Conclusion: MATA provides an interpretable, multi-agent approach to visual reasoning that dynamically coordinates specialized agents through a trainable hyper agent, addressing limitations of both monolithic models and fixed compositional pipelines while achieving superior performance.
Abstract: Recent vision-language models have strong perceptual ability but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding transparent execution history. To supervise the hyper agent’s transition policy, we build transition-trajectory trees and transform to memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM, acting as the transition policy, understands the query and the capabilities of the agents, and can efficiently choose the optimal agent for the task. Across multiple visual reasoning benchmarks, MATA achieves the state-of-the-art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.
[247] Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection
Yongxin Deng, Zhen Fang, Yixuan Li, Ling Chen
Main category: cs.AI
TL;DR: SpikeScore method detects LLM hallucinations by measuring uncertainty fluctuations in multi-turn dialogues, achieving strong cross-domain generalization.
Details
Motivation: Existing hallucination detection methods perform well within the same domain but fail to generalize across different domains, limiting real-world deployment of LLMs. The paper addresses this cross-domain generalization gap.
Method: Proposes SpikeScore, which quantifies abrupt uncertainty fluctuations in multi-turn dialogues following LLM responses. The method is based on the observation that hallucination-initiated dialogues exhibit larger uncertainty fluctuations than factual ones across domains.
Result: SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks show it outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods.
Conclusion: The SpikeScore-based detection method effectively addresses the generalizable hallucination detection problem, providing robust cross-domain performance for LLM hallucination detection in real-world applications.
Abstract: Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following an LLM's initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on the phenomenon, we propose a new score SpikeScore, which quantifies abrupt fluctuations in multi-turn dialogues. Through both theoretical analysis and empirical validation, we demonstrate that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks demonstrate that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, verifying the effectiveness of our method in cross-domain hallucination detection.
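One simple way to quantify "abrupt fluctuations" is the largest jump in uncertainty between consecutive turns; the paper's exact statistic is not given here, so the sketch below uses that proxy:

```python
import numpy as np

def spike_score(uncertainties):
    """Largest jump in uncertainty between consecutive dialogue turns
    (a proxy for the paper's statistic, which is not specified here)."""
    u = np.asarray(uncertainties, dtype=float)
    return float(np.abs(np.diff(u)).max())

# Hallucination-initiated dialogues would show larger jumps:
spike_score([0.20, 0.25, 0.80, 0.30])  # -> 0.55 (spiky, likely hallucinated)
spike_score([0.20, 0.22, 0.25, 0.24])  # -> 0.03 (stable, likely factual)
```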
[248] GLOVE: Global Verifier for LLM Memory-Environment Realignment
Xingkun Yin, Hongyang Du
Main category: cs.AI
TL;DR: GLOVE introduces a relative truth verification framework for LLM memory systems that detects inconsistencies between memories and fresh observations to enable memory-environment realignment without ground-truth supervision.
Details
Motivation: Existing memory-enhanced LLM approaches rely on external evaluators or internal model cognition for memory validation, but these assumptions break down in practical environments with dynamic drifts, creating a need for more robust memory verification methods.
Method: Proposes GLOVE (Global Verifier) framework that establishes relative truth through active probing to detect inconsistencies between retrieved memories and fresh observations, enabling memory verification and updating without ground-truth supervision or heavy reliance on model introspection.
Result: GLOVE substantially improves agent success rates across diverse benchmarks (web navigation, planning, control) augmented with controlled environmental drifts, demonstrating robustness in non-stationary settings.
Conclusion: GLOVE provides a robust pathway to cognitive agents capable of self-evolving by introducing a new design dimension for LLM memory systems based on relative truth verification and memory-environment realignment.
Abstract: Most existing memory-enhanced Large Language Model (LLM) approaches implicitly assume that memory validity can be established either through external evaluators that provide task-specific success signals or through internal model cognition, such as reflection, for editing memory entries. However, these assumptions often break down in practical environments with dynamic drifts. We propose the Global Verifier (GLOVE), a framework that introduces a new design dimension for LLM memory systems by establishing a relative notion of truth. Through active probing to detect inconsistencies between retrieved memories and fresh observations, GLOVE enables memory-environment realignment by verifying and updating memory without access to ground-truth supervision or strong reliance on model introspection. We evaluate GLOVE on diverse benchmarks spanning web navigation, planning, and control, augmented with controlled environmental drifts that introduce non-stationarity beyond the original benchmark settings. Our results show that GLOVE substantially improves agent success rates, suggesting a robust pathway to cognitive agents capable of self-evolving.
[249] Curiosity Driven Knowledge Retrieval for Mobile Agents
Sijia Li, Xiaoyu Tan, Shahir Ali, Niels Schmidt, Gengchen Ma, Xihe Qiu
Main category: cs.AI
TL;DR: A curiosity-driven knowledge retrieval framework for mobile agents that uses AppCards to encode app information, improving performance on complex smartphone automation tasks.
Details
Motivation: Mobile agents for smartphone automation struggle with incomplete knowledge and poor generalization to unseen environments, limiting their performance in complex applications.
Method: Introduces a curiosity-driven framework that formalizes execution uncertainty as a curiosity score. When the threshold is exceeded, retrieves external information from docs, code, and historical trajectories, organizing it into structured AppCards encoding functional semantics, parameters, interfaces, and interaction patterns. The enhanced agent selectively integrates relevant AppCards during reasoning.
Result: Evaluation on AndroidWorld benchmark shows consistent improvements across backbones: average gain of 6 percentage points, new SOTA success rate of 88.8% with GPT-5. AppCards are particularly effective for multi-step and cross-application tasks, reducing ambiguity, shortening exploration, and supporting stable execution trajectories.
Conclusion: The curiosity-driven knowledge retrieval framework with AppCards effectively compensates for knowledge blind spots in mobile agents, improving planning reliability and generalization capabilities for complex smartphone automation tasks.
Abstract: Mobile agents have made progress toward reliable smartphone automation, yet performance in complex applications remains limited by incomplete knowledge and weak generalization to unseen environments. We introduce a curiosity-driven knowledge retrieval framework that formalizes uncertainty during execution as a curiosity score. When this score exceeds a threshold, the system retrieves external information from documentation, code repositories, and historical trajectories. Retrieved content is organized into structured AppCards, which encode functional semantics, parameter conventions, interface mappings, and interaction patterns. During execution, an enhanced agent selectively integrates relevant AppCards into its reasoning process, thereby compensating for knowledge blind spots and improving planning reliability. Evaluation on the AndroidWorld benchmark shows consistent improvements across backbones, with an average gain of six percentage points and a new state-of-the-art success rate of 88.8% when combined with GPT-5. Analysis indicates that AppCards are particularly effective for multi-step and cross-application tasks, while improvements depend on the backbone model. Case studies further confirm that AppCards reduce ambiguity, shorten exploration, and support stable execution trajectories. Task trajectories are publicly available at https://lisalsj.github.io/Droidrun-appcard/.
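As a rough sketch of the retrieval trigger and the AppCard structure described above (the field names and the threshold value are our guesses, not the paper's schema):

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

CURIOSITY_THRESHOLD = 0.6  # hypothetical value; the paper tunes its own

@dataclass
class AppCard:
    """Structured knowledge unit, mirroring the four facets in the abstract."""
    app: str
    functional_semantics: str
    parameter_conventions: dict
    interface_mappings: dict
    interaction_patterns: list = field(default_factory=list)

def maybe_retrieve(curiosity: float, app: str,
                   retrieve_fn: Callable[[str], dict]) -> Optional[AppCard]:
    """Retrieve external knowledge only when execution uncertainty
    (the curiosity score) exceeds the threshold."""
    if curiosity <= CURIOSITY_THRESHOLD:
        return None
    docs = retrieve_fn(app)  # docs, code repositories, past trajectories
    return AppCard(app=app,
                   functional_semantics=docs.get("semantics", ""),
                   parameter_conventions=docs.get("params", {}),
                   interface_mappings=docs.get("interfaces", {}))
```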
[250] Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic Artificial Intelligence Systems
Anh Khoa Ngo Ho, Martin Chauvin, Simon Gosset, Philippe Cordier, Boris Gamazaychikov
Main category: cs.AI
TL;DR: Smaller open-weight language models can reduce energy consumption while maintaining performance in multi-agent systems, enabling more sustainable AI development.
Details
Motivation: As large language models become integral to agentic AI systems, their energy demands during inference pose significant sustainability challenges. There's a need to investigate whether smaller models can reduce energy consumption without compromising performance in real-world multi-agent environments.
Method: Conducted comparative analysis across language models of varying scales to quantify trade-offs between efficiency and performance. Investigated energy consumption, responsiveness, and output quality in multi-agent, real-world environments.
Result: Smaller open-weights models can lower energy usage while preserving task quality. The study identified practical optimization opportunities for sustainable AI design.
Conclusion: Proposed practical guidelines for sustainable AI design including optimal batch size configuration and computation resource allocation. These insights offer actionable strategies for developing scalable, environmentally responsible AI systems.
Abstract: As large language models become integral to agentic artificial intelligence systems, their energy demands during inference may pose significant sustainability challenges. This study investigates whether deploying smaller-scale language models can reduce energy consumption without compromising responsiveness and output quality in multi-agent, real-world environments. We conduct a comparative analysis across language models of varying scales to quantify trade-offs between efficiency and performance. Results show that smaller open-weights models can lower energy usage while preserving task quality. Building on these findings, we propose practical guidelines for sustainable artificial intelligence design, including optimal batch size configuration and computation resource allocation. These insights offer actionable strategies for developing scalable, environmentally responsible artificial intelligence systems.
[251] SETA: Statistical Fault Attribution for Compound AI Systems
Sayak Chowdhury, Meenakshi D’Souza
Main category: cs.AI
TL;DR: A modular robustness testing framework for multi-network AI systems that analyzes component-wise errors and error propagation across neural network modules.
Details
Motivation: Modern AI systems increasingly use multiple interconnected neural networks for complex tasks, but existing robustness testing techniques designed for single networks don't scale well to multi-network pipelines, creating challenges for testing robustness and safety.
Method: Proposes a modular robustness testing framework that applies perturbations to test data, supports component-wise system analysis to isolate errors, and enables reasoning about error propagation across neural network modules. The framework is architecture and modality agnostic.
Result: Successfully applied the framework to a real-world autonomous rail inspection system composed of multiple deep networks, demonstrating fine-grained robustness analysis beyond conventional end-to-end metrics.
Conclusion: The proposed modular testing framework enables effective robustness analysis of complex multi-network AI systems by providing component-level insights and understanding error propagation, addressing limitations of single-network testing approaches.
Abstract: Modern AI systems increasingly comprise multiple interconnected neural networks to tackle complex inference tasks. Testing such systems for robustness and safety entails significant challenges. Current state-of-the-art robustness testing techniques, whether black-box or white-box, have been proposed and implemented for single-network models and do not scale well to multi-network pipelines. We propose a modular robustness testing framework that applies a given set of perturbations to test data. Our testing framework supports (1) a component-wise system analysis to isolate errors and (2) reasoning about error propagation across the neural network modules. The testing framework is architecture and modality agnostic and can be applied across domains. We apply the framework to a real-world autonomous rail inspection system composed of multiple deep networks and successfully demonstrate how our approach enables fine-grained robustness analysis beyond conventional end-to-end metrics.
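The abstract describes component-wise analysis and error-propagation reasoning without implementation detail; the sketch below conveys the shape of such a harness under our own simplifications (modules as pure functions on arrays, divergence measured by an L2 norm):

```python
import numpy as np

def componentwise_divergence(modules, x_clean, perturb, n_trials=100):
    """Run clean and perturbed inputs through an ordered module pipeline
    and record, per module, how far the two signals have drifted apart.
    Growing deltas downstream hint at where errors originate and how
    they propagate. This illustrates the idea, not the paper's actual
    SETA statistics."""
    deltas = {i: [] for i in range(len(modules))}
    for _ in range(n_trials):
        a, b = x_clean, perturb(x_clean)
        for i, m in enumerate(modules):
            a, b = m(a), m(b)
            deltas[i].append(float(np.linalg.norm(np.ravel(a) - np.ravel(b))))
    return {i: float(np.mean(d)) for i, d in deltas.items()}

# Toy two-stage pipeline with additive input noise:
pipeline = [lambda x: x * 2.0, lambda x: np.tanh(x)]
report = componentwise_divergence(
    pipeline, np.ones(4),
    lambda x: x + np.random.normal(0, 0.1, x.shape))
```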
[252] PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems
Amit Singh Bhatti, Vishal Vaddina, Dagnachew Birru
Main category: cs.AI
TL;DR: PROTEUS is an LLM router that accepts accuracy targets as runtime input and uses Lagrangian dual control to translate specified accuracy values into routing decisions that satisfy them, enabling a single trained model to serve the full accuracy spectrum without retraining.
Details
Motivation: Current LLM routers force operators to tune parameters offline and guess what accuracy might result, with indirect, non-monotonic, and dataset-dependent relationships between parameters and outcomes. Operators need to specify accuracy targets directly rather than infer them from opaque settings.
Method: PROTEUS uses Lagrangian dual control where a learned dual variable lambda tracks constraint violations during training and conditions the policy network. This allows the router to translate specified accuracy targets (tau) into routing decisions that satisfy them.
Result: PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau, with target-response correlation reaching 0.97-0.98. It meets accuracy floors 78% more often than the closest baseline (OmniRouter). On RouterBench it achieves 90.1% accuracy (within 1.3% of oracle), on SPROUT 94.0% accuracy (within 4.6% of oracle), with cost savings reaching 89.8% versus the best fixed model.
Conclusion: PROTEUS enables operators to directly specify accuracy targets at runtime rather than tuning opaque parameters, providing consistent floor compliance across the accuracy spectrum from a single trained model while achieving near-oracle accuracy and significant cost savings.
Abstract: Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LLM routers do not. They force operators to tune parameters offline and guess what accuracy might result. The relationship between parameters and outcomes is indirect, non-monotonic, and dataset-dependent. Operators need to specify accuracy targets, not infer them from opaque settings. We present PROTEUS (Polymorphic Router for Operational Target Enforcement with Unified SLA), a router that accepts accuracy targets tau as runtime input. PROTEUS uses Lagrangian dual control. A learned dual variable lambda tracks constraint violations during training and conditions the policy network. This lets the router translate specified tau values into routing decisions that satisfy them. A single trained model serves the full accuracy spectrum without retraining. We evaluate on RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries). PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau. The target-response correlation reaches 0.97 to 0.98. The closest baseline, OmniRouter, meets floors only 22% of the time despite also using Lagrangian optimization. PROTEUS operates across tau in [0.85, 0.95] from a single model. On RouterBench it achieves 90.1% accuracy, within 1.3% of oracle. On SPROUT it achieves 94.0% accuracy, within 4.6% of oracle. Cost savings reach 89.8% versus the best fixed model.
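The dual-variable mechanic is standard Lagrangian constrained optimization; a one-step sketch of how lambda might track floor violations (the learning rate and the projection to nonnegative values are our choices):

```python
def dual_update(lmbda: float, achieved_acc: float, tau: float,
                lr: float = 0.01) -> float:
    """Projected dual ascent: lambda grows when the accuracy floor tau
    is violated and decays toward zero when it is satisfied. The learned
    lambda then conditions the routing policy alongside tau."""
    violation = tau - achieved_acc        # > 0 means the floor was missed
    return max(0.0, lmbda + lr * violation)

# Toy usage: early violations push lambda up, later compliance relaxes it.
lmbda = 0.0
for acc in [0.80, 0.82, 0.91, 0.94]:
    lmbda = dual_update(lmbda, acc, tau=0.90)
```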
[253] RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization
Hongzhu Yi, Xinming Wang, Zhenghao Zhang, Tianyu Zong, Yuanxiang Wang, Jun Xie, Tao Yu, Haopeng Jin, Zhepeng Wang, Kaixin Xu, Feng Chen, Jiahuan Chen, Yujia Yang, Zhenyu Guan, Bingkang Shi, Jungang Xu
Main category: cs.AI
TL;DR: RPO is a reinforcement fine-tuning algorithm that reduces computational overhead by 90-95% by only generating reasoning path suffixes instead of full paths, while maintaining performance comparable to full-path methods.
Details
Motivation: Traditional reinforcement fine-tuning for LLMs requires generating complete reasoning trajectories from input queries, which incurs significant computational overhead during training rollout phases.
Method: RPO analyzes the impact of different reasoning path segments on final correctness, then trains models by generating only reasoning path suffixes from an experience cache instead of full paths.
Result: RPO reduces token generation by ~95% during rollout, cutting training time by 90% for 1.5B models and 72% for 7B models while maintaining comparable performance to full-path methods.
Conclusion: RPO provides a plug-and-play acceleration solution for reinforcement fine-tuning that can integrate with existing algorithms like GRPO and DAPO, offering significant efficiency gains without sacrificing performance.
Abstract: Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using experience cache. During the rollout phase of training, RPO reduces token generation in this phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
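A minimal sketch of the suffix-only rollout idea, assuming cached reasoning paths are stored as token-string lists and using a fixed prefix ratio; the paper's cache policy and split rule are not specified in the abstract:

```python
def rpo_rollout(model_generate, experience_cache: dict, query: str,
                prefix_ratio: float = 0.7) -> list[str]:
    """Reuse a cached reasoning prefix and regenerate only the suffix,
    instead of sampling the full path from scratch. `model_generate`
    takes a prompt string and returns a list of token strings."""
    cached = experience_cache.get(query)
    if cached is None:                       # cold start: full rollout
        full = model_generate(query)
        experience_cache[query] = full
        return full
    cut = int(len(cached) * prefix_ratio)    # keep prefix, regenerate tail
    prefix = cached[:cut]
    suffix = model_generate(query + "".join(prefix))
    return prefix + suffix
```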
[254] Fuzzy expert system for the process of collecting and purifying acidic water: a digital twin approach
Temirbolat Maratuly, Pakizar Shamoi, Timur Samigulin
Main category: cs.AI
TL;DR: A fuzzy expert system with digital twin for automated sour water purification, tested under 105 scenarios with comprehensive performance metrics.
Details
Motivation: Sour water purification is essential for reducing emissions, corrosion risks, enabling water reuse, lowering costs, and protecting workers through automation. Untreated sour water from crude oil processing poses environmental threats and accelerates equipment corrosion.
Method: Developed a fuzzy expert system combined with a digital twin using Honeywell UniSim Design R492 for industrial simulation. Valve dynamics modeled via MATLAB system identification, with real-time OPC DA data exchange. Fuzzy controller uses split-range control on two valves, tested under 21 initial pressure conditions with 5 defuzzification strategies (105 total scenarios). Performance evaluated using error metrics (MSE, RMSE, MAE, IAE, ISE, ITAE) and dynamic response metrics. Web interface built with Python Streamlit.
Result: The system was comprehensively tested under various conditions with multiple defuzzification strategies, though specific numerical results aren’t provided in the abstract. The approach enables simple, intuitive control for non-expert personnel.
Conclusion: The proposed fuzzy expert system with digital twin effectively maintains key parameters in sour water treatment through human-reasoning mimicry. While demonstrated for sour water purification, the system is general-purpose and applicable to other industrial processes.
Abstract: Purifying sour water is essential for reducing emissions, minimizing corrosion risks, enabling the reuse of treated water in industrial or domestic applications, and ultimately lowering operational costs. Moreover, automating the purification process helps reduce the risk of worker harm by limiting human involvement. Crude oil contains acidic components such as hydrogen sulfide, carbon dioxide, and other chemical compounds. During processing, these substances are partially released into sour water. If not properly treated, sour water poses serious environmental threats and accelerates the corrosion of pipelines and equipment. This paper presents a fuzzy expert system, combined with a custom-generated digital twin, developed from a documented industrial process to maintain key parameters at desired levels by mimicking human reasoning. The control strategy is designed to be simple and intuitive, allowing junior or non-expert personnel to interact with the system effectively. The digital twin was developed using Honeywell UniSim Design R492 to simulate real industrial behavior accurately. Valve dynamics were modeled through system identification in MATLAB, and real-time data exchange between the simulator and controller was established using OPC DA. The fuzzy controller applies split-range control to two valves and was tested under 21 different initial pressure conditions using five distinct defuzzification strategies, resulting in a total of 105 unique test scenarios. System performance was evaluated using both error-based metrics (MSE, RMSE, MAE, IAE, ISE, ITAE) and dynamic response metrics, including overshoot, undershoot, rise time, fall time, settling time, and steady-state error. A web-based simulation interface was developed in Python using the Streamlit framework. Although demonstrated here for sour water treatment, the proposed fuzzy expert system is general-purpose.
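To make the control strategy concrete, here is a minimal Mamdani-style sketch with triangular memberships, centroid defuzzification, and a split-range mapping onto two valves. The membership ranges and rule set are illustrative; the paper's controller, tuned against the UniSim digital twin, will differ:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function evaluated at point(s) x."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

def fuzzy_valve_command(pressure_error: float):
    """Fuzzify the pressure error, apply three rules, defuzzify by
    centroid, then split-range the single command across two valves."""
    u = np.linspace(-1.0, 1.0, 201)                  # control universe
    neg = tri(pressure_error, -2.0, -1.0, 0.0)       # rule activations
    zero = tri(pressure_error, -1.0, 0.0, 1.0)
    pos = tri(pressure_error, 0.0, 1.0, 2.0)
    agg = np.maximum.reduce([
        np.minimum(neg, tri(u, -1.0, -0.7, -0.2)),   # error low  -> close
        np.minimum(zero, tri(u, -0.3, 0.0, 0.3)),    # error zero -> hold
        np.minimum(pos, tri(u, 0.2, 0.7, 1.0)),      # error high -> open
    ])
    cmd = float(np.sum(u * agg) / (np.sum(agg) + 1e-9))  # centroid
    # Split-range: negative commands drive valve A, positive valve B.
    return max(0.0, -cmd), max(0.0, cmd)

valve_a, valve_b = fuzzy_valve_command(0.4)
```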
[255] Benchmarks Saturate When The Model Gets Smarter Than The Judge
Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
Main category: cs.AI
TL;DR: Omni-MATH-2 is a manually revised version of Omni-MATH dataset with clean exact-answer problems and tagged non-standard problems, created to reduce dataset noise and enable better evaluation of LLM performance on mathematical reasoning tasks.
Details
Motivation: Existing benchmarks for Large Language Models often suffer from inaccuracies in datasets and evaluation methods, undermining their effectiveness in tracking progress. The authors aim to create a cleaner dataset and better evaluation framework to provide more precise assessment of model performance.
Method: Manual auditing of the Omni-MATH dataset to ensure LaTeX compilability, solvability, and verifiability. Problems were categorized into a clean exact-answer subset (n=4181) and a tagged non-standard subset (n=247). Added missing figures/information, labeled problems requiring proof/estimation/image, and removed clutter. Used expert annotations to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge.
Result: Significant reduction in dataset-induced noise. Expert annotations revealed Omni-Judge was wrong in 96.4% of judge disagreements, indicating its inability to differentiate between models’ abilities. Neither judge identified failure modes for tagged problems. As problems become more challenging, increasingly competent judges become essential to prevent judge errors from masking genuine model differences.
Conclusion: Dataset quality and judge reliability are both critical for developing accurate benchmarks of model performance. The Omni-MATH-2 dataset provides a cleaner foundation for evaluating LLMs on mathematical reasoning tasks, while the analysis highlights the importance of competent judges, especially for challenging problems.
Abstract: Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset ($n{=}4181$) and a tagged, non-standard subset ($n{=}247$). Each problem was audited to ensure LaTeX compilability, solvability and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in $96.4\%$ of the judge disagreements, indicating its inability to differentiate between models’ abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.
[256] Learning Adaptive Parallel Execution for Efficient Code Localization
Ke Xu, Siyang Xiao, Ming Liang, Yichen Yu, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li
Main category: cs.AI
TL;DR: FuseSearch is a new approach for parallel code localization that optimizes both quality and efficiency by dynamically adjusting search breadth, achieving state-of-the-art performance with significant speedups and reduced resource usage.
Details
Motivation: Current code localization agents suffer from high redundant invocation rates (34.9%) that negate parallelism benefits, creating a bottleneck in automated software development pipelines. There's a need for approaches that can effectively leverage parallel tool execution without wasteful redundancy.
Method: FuseSearch reformulates parallel code localization as a joint quality-efficiency optimization task. It defines “tool efficiency” as the ratio of unique information gain to invocation count, and uses a two-phase training approach (SFT and RL) to learn adaptive parallel strategies. Unlike fixed-breadth methods, it dynamically modulates search breadth based on task context, transitioning from exploration to refinement phases.
Result: On SWE-bench Verified, FuseSearch-4B achieves state-of-the-art performance with 84.7% file-level and 56.4% function-level F1 scores, while achieving 93.6% speedup, using 67.7% fewer turns and 68.9% fewer tokens compared to existing approaches.
Conclusion: Efficiency-aware training naturally improves quality by eliminating noisy redundant signals, enabling high-performance cost-effective localization agents. FuseSearch demonstrates that optimizing for efficiency can lead to both better performance and reduced resource consumption in code localization tasks.
Abstract: Code localization constitutes a key bottleneck in automated software development pipelines. While concurrent tool execution can enhance discovery speed, current agents demonstrate a 34.9% redundant invocation rate, which negates parallelism benefits. We propose FuseSearch, reformulating parallel code localization as a joint quality-efficiency optimization task. Through defining tool efficiency, the ratio of unique information gain to invocation count, we utilize a two-phase SFT and RL training approach for learning adaptive parallel strategies. Different from fixed-breadth approaches, FuseSearch dynamically modulates search breadth according to task context, evolving from exploration phases to refinement stages. Evaluated on SWE-bench Verified, FuseSearch-4B achieves SOTA-level performance (84.7% file-level and 56.4% function-level $F_1$ scores) with 93.6% speedup, utilizing 67.7% fewer turns and 68.9% fewer tokens. Results indicate that efficiency-aware training naturally improves quality through eliminating noisy redundant signals, enabling high-performance cost-effective localization agents.
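The tool-efficiency metric itself is easy to state; below is one way to compute it, modeling each call's result as a set of located code elements (the set representation is our assumption):

```python
def tool_efficiency(call_results: list[set]) -> float:
    """Tool efficiency per the abstract: unique information gain divided
    by invocation count. Each call's result is modeled as a set of
    located code elements; unique gain is the size of their union."""
    if not call_results:
        return 0.0
    unique = set().union(*call_results)
    return len(unique) / len(call_results)

# Three parallel calls, two of which largely duplicate the first:
calls = [{"fileA.py", "fileB.py"}, {"fileA.py"}, {"fileB.py"}]
print(tool_efficiency(calls))  # 2 unique hits / 3 calls ~= 0.67
```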
[257] ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks
Haoyun Li, Ming Xiao, Kezhi Wang, Robert Schober, Dong In Kim, Yong Liang Guan
Main category: cs.AI
TL;DR: ComAgent is a multi-LLM agent framework that automates cross-layer optimization in 6G networks by translating high-level intents into mathematical formulations through specialized agents working in a Perception-Planning-Action-Reflection cycle.
Details
Motivation: Manual translation of high-level intents into mathematical formulations for 6G network optimization is a bottleneck. Existing monolithic LLM approaches lack sufficient domain grounding, constraint awareness, and verification capabilities needed for complex wireless network optimization tasks.
Method: ComAgent employs a multi-LLM agent framework with a closed-loop Perception-Planning-Action-Reflection cycle. It coordinates specialized agents for literature search, coding, and scoring to autonomously generate solver-ready formulations and reproducible simulations. The framework iteratively decomposes problems and self-corrects errors.
Result: ComAgent achieves expert-comparable performance in complex beamforming optimization and outperforms monolithic LLMs across diverse wireless tasks. The framework effectively bridges the gap between user intent and execution in wireless network optimization.
Conclusion: ComAgent demonstrates significant potential for automating design in emerging wireless networks by addressing the limitations of monolithic LLM approaches through its multi-agent architecture and iterative refinement capabilities.
Abstract: Emerging 6G networks rely on complex cross-layer optimization, yet manually translating high-level intents into mathematical formulations remains a bottleneck. While Large Language Models (LLMs) offer promise, monolithic approaches often lack sufficient domain grounding, constraint awareness, and verification capabilities. To address this, we present ComAgent, a multi-LLM agentic AI framework. ComAgent employs a closed-loop Perception-Planning-Action-Reflection cycle, coordinating specialized agents for literature search, coding, and scoring to autonomously generate solver-ready formulations and reproducible simulations. By iteratively decomposing problems and self-correcting errors, the framework effectively bridges the gap between user intent and execution. Evaluations demonstrate that ComAgent achieves expert-comparable performance in complex beamforming optimization and outperforms monolithic LLMs across diverse wireless tasks, highlighting its potential for automating design in emerging wireless networks.
[258] Algorithmic Prompt-Augmentation for Efficient LLM-Based Heuristic Design for A* Search
Thomas Bömer, Nico Koltermann, Max Disselnmeyer, Bastian Amberg, Anne Meyer
Main category: cs.AI
TL;DR: A-CEoH framework automates A* heuristic design using LLMs with code-inclusive prompts, outperforming handcrafted heuristics in warehouse logistics and sliding puzzle domains.
Details
Motivation: Traditional heuristic design for A* search requires significant expertise and manual effort. Recent advances in LLMs and evolutionary frameworks enable automated heuristic generation, but need improved methods to leverage LLMs' in-context learning capabilities.
Method: Extends Evolution of Heuristics (EoH) framework with A-CEoH: domain-agnostic prompt augmentation that includes A* algorithm code in prompts to enhance in-context learning. Evaluated on Unit-Load Pre-Marshalling Problem (warehouse logistics) and classical sliding puzzle problem.
Result: A-CEoH significantly improves generated heuristic quality and outperforms expert-designed heuristics in computational experiments across both problem domains.
Conclusion: Automated heuristic generation via LLMs with algorithmic-contextual prompts is effective and can surpass human-designed heuristics, demonstrating practical value for complex search problems.
Abstract: Heuristic functions are essential to the performance of tree search algorithms such as A*, where their accuracy and efficiency directly impact search outcomes. Traditionally, such heuristics are handcrafted, requiring significant expertise. Recent advances in large language models (LLMs) and evolutionary frameworks have opened the door to automating heuristic design. In this paper, we extend the Evolution of Heuristics (EoH) framework to investigate the automated generation of guiding heuristics for A* search. We introduce a novel domain-agnostic prompt augmentation strategy that includes the A* code in the prompt to leverage in-context learning, named Algorithmic-Contextual EoH (A-CEoH). To evaluate the effectiveness of A-CEoH, we study two problem domains: the Unit-Load Pre-Marshalling Problem (UPMP), a niche problem from warehouse logistics, and the classical sliding puzzle problem (SPP). Our computational experiments show that A-CEoH can significantly improve the quality of the generated heuristics and even outperform expert-designed heuristics.
[259] Agentic Design Patterns: A System-Theoretic Framework
Minh-Dung Dao, Quy Minh Le, Hoang Thanh Lam, Duc-Trong Le, Quoc-Viet Pham, Barry O’Sullivan, Hoang D. Nguyen
Main category: cs.AI
TL;DR: A systems-theoretic framework for engineering robust AI agents with 12 design patterns addressing common challenges in agentic systems.
Details
Motivation: Foundation models enable agentic AI systems but suffer from issues like hallucination, poor reasoning, and ad-hoc design approaches. Existing characterizations lack rigorous systems-theoretic foundations, resulting in high-level taxonomies that are difficult to implement.
Method: Proposes a system-theoretic framework deconstructing agentic AI into five core subsystems: Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, and Inter-Agent Communication. Derives 12 agentic design patterns categorized as Foundational, Cognitive & Decisional, Execution & Interaction, and Adaptive & Learning.
Result: Demonstrates utility through a case study on the ReAct framework, showing how proposed patterns can rectify systemic architectural deficiencies. Provides a foundational language and structured methodology for standardizing agentic design communication.
Conclusion: This work offers a principled methodology for engineering robust AI agents, enabling more modular, understandable, and reliable autonomous systems through standardized design patterns and systems-theoretic foundations.
Abstract: With the development of foundation models (FMs), agentic AI systems are attracting increasing attention, yet their inherent issues like hallucination and poor reasoning, coupled with the frequent ad-hoc nature of system design, lead to unreliable and brittle applications. Existing efforts to characterise agentic design patterns often lack a rigorous systems-theoretic foundation, resulting in high-level or convenience-based taxonomies that are difficult to implement. This paper addresses this gap by introducing a principled methodology for engineering robust AI agents. We propose two primary contributions: first, a novel system-theoretic framework that deconstructs an agentic AI system into five core, interacting functional subsystems: Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, and Inter-Agent Communication. Second, derived from this architecture and directly mapped to a comprehensive taxonomy of agentic challenges, we present a collection of 12 agentic design patterns. These patterns, categorised as Foundational, Cognitive & Decisional, Execution & Interaction, and Adaptive & Learning, offer reusable, structural solutions to recurring problems in agent design. The utility of the framework is demonstrated by a case study on the ReAct framework, showing how the proposed patterns can rectify systemic architectural deficiencies. This work provides a foundational language and a structured methodology to standardise agentic design communication among researchers and engineers, leading to more modular, understandable, and reliable autonomous systems.
[260] GAVEL: Towards rule-based safety through activation monitoring
Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky
Main category: cs.AI
TL;DR: Rule-based activation safety framework using cognitive elements for precise, interpretable detection of harmful behaviors in LLMs.
Details
Motivation: Existing activation safety approaches suffer from poor precision, limited flexibility, and lack of interpretability when detecting harmful behaviors in LLMs.
Method: Model activations as cognitive elements (CEs) - fine-grained interpretable factors like ‘making a threat’ and ‘payment processing’. Define predicate rules over CEs to detect violations in real time without retraining.
Result: Compositional rule-based activation safety improves precision, supports domain customization, and enables scalable, interpretable, and auditable AI governance.
Conclusion: The proposed framework (GAVEL) offers a practical solution for precise, configurable, and transparent safety monitoring in LLMs, with open-source release planned.
Abstract: Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as “making a threat” and “payment processing”, that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.
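A tiny sketch of what a predicate rule over CE activations could look like; the rule format, probe scores, and threshold are all our stand-ins for whatever GAVEL's released rule language specifies:

```python
def violates(rule: dict, ce_scores: dict, threshold: float = 0.5) -> bool:
    """A rule fires when every cognitive element it conjoins is active.
    `ce_scores` maps CE names to probe scores read from the model's
    activations."""
    return all(ce_scores.get(ce, 0.0) > threshold for ce in rule["all_of"])

# A domain-specific rule composed from two fine-grained CEs:
rule = {"name": "extortion_attempt",
        "all_of": ["making_a_threat", "payment_processing"]}
scores = {"making_a_threat": 0.91, "payment_processing": 0.78}
assert violates(rule, scores)
```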
[261] CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing
Shanyv Liu, Xuyang Yuan, Tao Chen, Zijun Zhan, Zhu Han, Danyang Zheng, Weishan Zhang, Shaohua Cao
Main category: cs.AI
TL;DR: CASTER is a lightweight router for dynamic model selection in graph-based multi-agent systems that reduces inference costs by 72.4% while maintaining performance.
Details
Motivation: Graph-based Multi-Agent Systems enable complex workflows but suffer from inefficient static model allocation, where deploying strong models uniformly wastes computation on trivial sub-tasks.
Method: CASTER uses a Dual-Signal Router combining semantic embeddings with structural meta-features to estimate task difficulty, and self-optimizes through a Cold Start to Iterative Evolution paradigm learning from routing failures via on-policy negative feedback.
Result: Experiments across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity show CASTER reduces inference cost by up to 72.4% compared to strong-model baselines while matching success rates, and outperforms heuristic routing and FrugalGPT across all domains.
Conclusion: CASTER provides an effective solution for dynamic model selection in graph-based MAS, achieving significant computational savings without compromising task success rates.
Abstract: Graph-based Multi-Agent Systems (MAS) enable complex cyclic workflows but suffer from inefficient static model allocation, where deploying strong models uniformly wastes computation on trivial sub-tasks. We propose CASTER (Context-Aware Strategy for Task Efficient Routing), a lightweight router for dynamic model selection in graph-based MAS. CASTER employs a Dual-Signal Router that combines semantic embeddings with structural meta-features to estimate task difficulty. During training, the router self-optimizes through a Cold Start to Iterative Evolution paradigm, learning from its own routing failures via on-policy negative feedback. Experiments using LLM-as-a-Judge evaluation across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity demonstrate that CASTER reduces inference cost by up to 72.4% compared to strong-model baselines while matching their success rates, and consistently outperforms both heuristic routing and FrugalGPT across all domains.
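The Dual-Signal Router reduces, at its simplest, to fusing two feature vectors into a difficulty estimate and thresholding it. The linear-plus-sigmoid form and the two-tier model pool below are our simplifications of whatever CASTER learns:

```python
import numpy as np

def route(query_emb, meta_features, w_sem, w_meta, bias,
          difficulty_cutoff: float = 0.5):
    """Combine a semantic embedding with structural meta-features
    (e.g., node depth or fan-in within the agent graph) into a task
    difficulty score, then pick the cheapest adequate model."""
    score = np.dot(w_sem, query_emb) + np.dot(w_meta, meta_features) + bias
    difficulty = 1.0 / (1.0 + np.exp(-score))        # sigmoid
    return ("strong" if difficulty > difficulty_cutoff else "cheap"), difficulty

model, d = route(np.random.randn(8), np.array([0.3, 0.7]),
                 w_sem=np.random.randn(8), w_meta=np.array([0.5, 1.0]),
                 bias=-0.2)
```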
[262] An Interpretable Recommendation Model for Psychometric Data, With an Application to Gerontological Primary Care
Andre Paulino de Lima, Paula Castro, Suzana Carvalho Vaz de Andrade, Rosa Maria Marcucci, Ruth Caldeira de Melo, Marcelo Garcia Manzato
Main category: cs.AI
TL;DR: A recommender system for gerontological primary care that provides visual explanations to address healthcare challenges like data scarcity, interpretability, and risk concerns.
Details
Motivation: Recommender systems face unique challenges in healthcare: lack of public clinical data, difficulty understanding recommendations, risks of following recommendations, and uncertainty about effectiveness. These barriers limit their usefulness in clinical settings.
Method: Developed a recommendation model that leverages psychometric data structure to provide visual explanations that are both faithful to the model and interpretable by care professionals. Focused specifically on gerontological primary care for personalized care plan creation.
Result: Conducted comparative offline performance evaluation on Brazilian healthcare datasets and user studies evaluating visual explanation interpretability. Results show the model can advance recommender system applications in this healthcare niche.
Conclusion: The proposed model addresses key healthcare recommender system challenges and shows promise for gerontological primary care, which will grow in demand due to demographic changes and increasing technology needs.
Abstract: There are challenges that must be overcome to make recommender systems useful in healthcare settings. The reasons are varied: the lack of publicly available clinical data, the difficulty that users may have in understanding the reasons why a recommendation was made, the risks that may be involved in following that recommendation, and the uncertainty about its effectiveness. In this work, we address these challenges with a recommendation model that leverages the structure of psychometric data to provide visual explanations that are faithful to the model and interpretable by care professionals. We focus on a narrow healthcare niche, gerontological primary care, to show that the proposed recommendation model can assist the attending professional in the creation of personalised care plans. We report results of a comparative offline performance evaluation of the proposed model on healthcare datasets that were collected by research partners in Brazil, as well as the results of a user study that evaluates the interpretability of the visual explanations the model generates. The results suggest that the proposed model can advance the application of recommender systems in this healthcare niche, which is expected to grow in demand, opportunities, and information technology needs as demographic changes become more pronounced.
[263] Routing End User Queries to Enterprise Databases
Saikrishna Sudarshan, Tanay Kulkarni, Manasi Patwardhan, Lovekesh Vig, Ashwin Srinivasan, Tanmay Tulsidas Verlekar
Main category: cs.AI
TL;DR: A modular reasoning-driven reranking strategy for routing natural language queries in multi-database enterprise environments outperforms embedding-only and LLM-prompting baselines.
Details
Motivation: Routing natural language queries in multi-database enterprise environments becomes increasingly challenging with larger, domain-overlapping database repositories and ambiguous queries, creating a need for more structured and robust reasoning-based solutions.
Method: Proposed a modular, reasoning-driven reranking strategy that explicitly models schema coverage, structural connectivity, and fine-grained semantic alignment between queries and databases.
Result: The proposed approach consistently outperforms embedding-only and direct LLM-prompting baselines across all evaluation metrics on realistic benchmarks constructed by extending existing NL-to-SQL datasets.
Conclusion: Explicit modeling of schema coverage, structural connectivity, and semantic alignment through reasoning-driven reranking provides superior performance for query routing in complex multi-database environments compared to simpler approaches.
Abstract: We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven reranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all the metrics.
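As a sketch of how the three named signals could combine into a reranker, assuming precomputed per-database features and a fixed linear weighting (both our assumptions):

```python
def rerank_databases(query_terms: set, candidates: list[dict],
                     weights=(0.5, 0.2, 0.3)):
    """Score each candidate DB by schema coverage, structural
    connectivity, and semantic alignment, then sort descending."""
    w_cov, w_conn, w_sem = weights
    scored = []
    for db in candidates:
        coverage = len(query_terms & db["schema_terms"]) / max(1, len(query_terms))
        score = (w_cov * coverage
                 + w_conn * db["connectivity"]   # e.g., joinability of hit tables
                 + w_sem * db["semantic_sim"])   # embedding similarity to query
        scored.append((score, db["name"]))
    return sorted(scored, reverse=True)

ranked = rerank_databases(
    {"order", "customer", "refund"},
    [{"name": "sales_db", "schema_terms": {"order", "customer"},
      "connectivity": 0.9, "semantic_sim": 0.8},
     {"name": "hr_db", "schema_terms": {"employee"},
      "connectivity": 0.4, "semantic_sim": 0.2}])
```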
[264] Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
Main category: cs.AI
TL;DR: Visual generation in multimodal models enhances reasoning for physical/spatial tasks where verbal reasoning alone has limitations, but doesn’t help for abstract tasks.
Details
Motivation: Current AI systems excel at verbal reasoning in abstract domains but lag in physical/spatial intelligence. The emergence of unified multimodal models raises questions about when and how visual generation benefits reasoning compared to purely verbal approaches.
Method: Theoretical analysis formalizing internal world modeling as a core component of CoT reasoning, plus empirical evaluation using a new benchmark suite (VisWorld-Eval) to test interleaved visual-verbal CoT reasoning on a state-of-the-art UMM.
Result: Interleaved visual-verbal CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling (physical/spatial tasks), but offers no clear advantage on other tasks.
Conclusion: Visual generation serves as superior world models for certain physical/spatial tasks where verbal reasoning has representational limitations, clarifying the potential of multimodal world modeling for more human-like AI.
Abstract: Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks–particularly those grounded in the physical world–visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.
[265] Temporal Knowledge-Graph Memory in a Partially Observable Environment
Taewoon Kim, Vincent François-Lavet, Michael Cochez
Main category: cs.AI
TL;DR: Room Environment v3: A configurable RDF-based environment for evaluating agents with temporal knowledge graph memory, showing symbolic TKG agents outperform neural baselines by 4x in QA accuracy.
Details
Motivation: Existing benchmarks lack environments where both world dynamics and agent memory are explicitly graph-shaped, despite knowledge graphs being natural representations for evolving states in partially observable environments.
Method: Introduced Room Environment v3 with hidden RDF KG state and RDF triple observations. Developed lightweight temporal KG memory with RDF-star-style qualifiers (time_added, last_accessed, num_recalled). Evaluated symbolic baselines with capacity constraints and neural baselines (LSTM, Transformer).
Result: Temporal qualifiers led to more stable performance. Symbolic TKG agent achieved roughly fourfold higher test QA accuracy than neural baselines under same environment and query conditions. Agents trained on one layout and evaluated on held-out layout with different query order.
Conclusion: Explicit temporal knowledge graph memory significantly improves agent performance in partially observable environments compared to neural sequence models, demonstrating the value of graph-structured memory for state tracking and generalization.
Abstract: Agents in partially observable environments require persistent memory to integrate observations over time. While KGs (knowledge graphs) provide a natural representation for such evolving state, existing benchmarks rarely expose agents to environments where both the world dynamics and the agent’s memory are explicitly graph-shaped. We introduce the Room Environment v3, a configurable environment whose hidden state is an RDF KG and whose observations are RDF triples. The agent may extend these observations into a temporal KG when storing them in long-term memory. The environment is easily adjustable in terms of grid size, number of rooms, inner walls, and moving objects. We define a lightweight temporal KG memory for agents, based on RDF-star-style qualifiers (time_added, last_accessed, num_recalled), and evaluate several symbolic baselines that maintain and query this memory under different capacity constraints. Two neural sequence models (LSTM and Transformer) serve as contrasting baselines without explicit KG structure. Agents train on one layout and are evaluated on a held-out layout with the same dynamics but a different query order, exposing train-test generalization gaps. In this setting, temporal qualifiers lead to more stable performance, and the symbolic TKG (temporal knowledge graph) agent achieves roughly fourfold higher test QA (question-answer) accuracy than the neural baselines under the same environment and query conditions. The environment, agent implementations, and experimental scripts are released for reproducible research at https://github.com/humemai/agent-room-env-v3 and https://github.com/humemai/room-env.
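The three qualifiers named in the abstract map naturally onto a small record type; below is a sketch of a memory entry and a recall operation that updates them (the update policy is our guess, not necessarily the released implementation's):

```python
from dataclasses import dataclass

@dataclass
class QualifiedTriple:
    """An RDF triple with RDF-star-style qualifiers, mirroring the
    time_added / last_accessed / num_recalled fields in the abstract."""
    subject: str
    predicate: str
    obj: str
    time_added: int
    last_accessed: int
    num_recalled: int = 0

def recall(memory: list, subject: str, now: int) -> list:
    """Query long-term memory by subject, updating access qualifiers."""
    hits = [t for t in memory if t.subject == subject]
    for t in hits:
        t.last_accessed = now
        t.num_recalled += 1
    return hits

mem = [QualifiedTriple("lamp", "locatedIn", "room2",
                       time_added=3, last_accessed=3)]
print(recall(mem, "lamp", now=7))
```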
[266] PowerGraph-LLM: Novel Power Grid Graph Embedding and Optimization with Large Language Models
Fabien Bernier, Jun Cao, Maxime Cordy, Salah Ghamizi
Main category: cs.AI
TL;DR: PowerGraph-LLM is the first framework using Large Language Models (LLMs) to solve Optimal Power Flow problems, combining graph and tabular representations with specialized in-context learning and fine-tuning for power grid optimization.
Details
Motivation: There's a growing need for scalable algorithms to handle increasing variability, constraints, and uncertainties in modern power networks while providing accurate and fast solutions for Optimal Power Flow problems.
Method: PowerGraph-LLM combines graph and tabular representations of power grids to query LLMs effectively, capturing complex relationships and constraints. It introduces new in-context learning and fine-tuning protocols specifically tailored for OPF problems.
Result: The framework demonstrates reliable performance using off-the-shelf LLMs, showing the impact of LLM architecture, size, and fine-tuning, and proving its ability to handle realistic grid components and constraints.
Conclusion: PowerGraph-LLM represents a novel approach to solving OPF problems using LLMs, offering a promising framework that can effectively capture power system complexities and provide scalable solutions for modern grid management.
Abstract: Efficiently solving Optimal Power Flow (OPF) problems in power systems is crucial for operational planning and grid management. There is a growing need for scalable algorithms capable of handling the increasing variability, constraints, and uncertainties in modern power networks while providing accurate and fast solutions. To address this, machine learning techniques, particularly Graph Neural Networks (GNNs), have emerged as promising approaches. This letter introduces PowerGraph-LLM, the first framework explicitly designed for solving OPF problems using Large Language Models (LLMs). The proposed approach combines graph and tabular representations of power grids to effectively query LLMs, capturing the complex relationships and constraints in power systems. A new implementation of in-context learning and fine-tuning protocols for LLMs is introduced, tailored specifically for the OPF problem. PowerGraph-LLM demonstrates reliable performance using off-the-shelf LLMs. Our study reveals the impact of LLM architecture, size, and fine-tuning and demonstrates our framework’s ability to handle realistic grid components and constraints.
[267] Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation
Tianyi Zeng, Tianyi Wang, Zimo Zeng, Feiyang Zhang, Jiseop Byeon, Yujin Wang, Yajie Zou, Yangyang Wang, Junfeng Jiao, Christian Claudel, Xinbo Chen
Main category: cs.AI
TL;DR: A Bayesian physics-informed neural network framework called Damper-B-PINN is proposed for accurate dynamic wheel load estimation in vehicles, combining suspension dynamics modeling with Bayesian inference to handle noise and uncertainty.
Details
Motivation: Wheel load estimation is crucial for vehicle stability and ADAS safety, but remains challenging due to complex chassis modeling and susceptibility to noise in nonlinear systems.
Method: 1) Refined suspension linkage-level modeling with nonlinear instantaneous dynamics; 2) Damper-B-PINN framework combining physics-informed neural networks with Bayesian inference; 3) Damper-characteristic physics conditioning (DPC) module for embedding physical priors.
Result: Damper-B-PINN consistently outperforms existing methods across various test conditions, especially extreme ones, using both CarSim simulation data and real-world Formula Student race car data.
Conclusion: The proposed framework enhances accuracy and robustness of dynamic wheel load estimation, improving reliability and safety of ADAS applications.
Abstract: Accurate state estimation is fundamental to intelligent vehicles. Wheel load, one of the most important chassis states, serves as an essential input for advanced driver assistance systems (ADAS) and exerts a direct influence on vehicle stability and safety. However, wheel load estimation remains challenging due to the complexity of chassis modeling and the susceptibility of nonlinear systems to noise. To address these issues, this paper first introduces a refined suspension linkage-level modeling approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon this, we propose a damper characteristics-based Bayesian physics-informed neural network (Damper-B-PINN) framework to estimate dynamic wheel load, which leverages the suspension dynamics as physical guidance of PINN while employing Bayesian inference to mitigate the effects of system noise and uncertainty. Moreover, a damper-characteristic physics conditioning (DPC) module is designed for embedding physical prior. The proposed Damper-B-PINN is evaluated using both high-fidelity simulation datasets generated by CarSim software and real-world datasets collected from a Formula Student race car. Experimental results demonstrate that our Damper-B-PINN consistently outperforms existing methods across various test conditions, particularly extreme ones. These findings highlight the potential of the proposed Damper-B-PINN framework to enhance the accuracy and robustness of dynamic wheel load estimation, thereby improving the reliability and safety of ADAS applications.
[268] RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning
Qianyue Hao, Sibo Li, Jian Yuan, Yong Li
Main category: cs.AI
TL;DR: RLoT uses reinforcement learning to train a lightweight navigator model that dynamically selects and combines logic blocks to enhance LLM reasoning, outperforming existing methods by up to 13.4% with strong transferability across models and tasks.
Details
Motivation: Current inference-time reasoning techniques (Chain/Tree/Graph-of-Thoughts) use manually predefined, task-agnostic frameworks that lack adaptability across diverse tasks, limiting their effectiveness despite being cost-effective.
Method: Propose RL-of-Thoughts (RLoT) with a lightweight navigator model trained via reinforcement learning. The navigator dynamically selects from five basic logic blocks (designed from human cognition perspective) and combines them into task-specific logical structures based on problem characteristics.
Result: Outperforms established inference-time techniques by up to 13.4% across multiple reasoning benchmarks (AIME, MATH, GPQA) with various LLMs (GPT, Llama, Qwen, DeepSeek). The RL navigator (<3K parameters) enables sub-10B LLMs to perform comparably to 100B-scale models and shows strong transferability to unseen LLMs and tasks.
Conclusion: RLoT provides an adaptive, efficient approach to enhance LLM reasoning at inference time through reinforcement learning, achieving significant performance gains with minimal parameters while maintaining strong generalization capabilities.
Abstract: Despite rapid advancements in large language models (LLMs), the token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), successfully improve the performance, as they are fairly cost-effective by guiding reasoning through sophisticated logical structures without modifying LLMs’ parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability. To improve this, we propose RL-of-Thoughts (RLoT), where we train a lightweight navigator model with reinforcement learning (RL) to adaptively enhance LLM reasoning at inference time. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques by up to 13.4%. Remarkably, with less than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at https://github.com/tsinghua-fib-lab/RL-LLM-Reasoning for reproducibility.
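The navigator's inference-time loop is simple to sketch: a small policy picks the next logic block from problem-state features and the frozen LLM executes it. The block names and the stop convention below are placeholders, not the paper's five blocks:

```python
LOGIC_BLOCKS = ["decompose", "reflect", "verify", "branch", "summarize"]
STOP = len(LOGIC_BLOCKS)  # extra action index meaning "finish"

def navigate(policy, state, llm_step, max_blocks: int = 8) -> list:
    """`policy` stands in for the lightweight RL navigator (it returns
    an action index over LOGIC_BLOCKS plus STOP); `llm_step(block, state)`
    runs one block with the frozen LLM and returns updated state features."""
    trace = []
    for _ in range(max_blocks):
        action = policy(state)
        if action == STOP:
            break
        block = LOGIC_BLOCKS[action]
        state = llm_step(block, state)
        trace.append(block)
    return trace
```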
[269] MoRA: Mobility as the Backbone for Geospatial Representation Learning at Scale
Ya Wen, Jixuan Cai, Qiyao Ma, Linyan Li, Xinhua Chen, Chris Webster, Yulun Zhou
Main category: cs.AI
TL;DR: MoRA is a human-centric geospatial framework that learns location embeddings by fusing mobility graph data with POIs, remote sensing imagery, and demographic statistics through spatial tokenization, GNNs, and contrastive learning.
Details
Motivation: Current geospatial representation learning approaches focus too much on physical states (Earth observation) rather than human activity patterns and functional relationships between locations revealed by human movement. The authors argue that a location's true "meaning" comes from its socio-economic context and functional role within human dynamics.
Method: MoRA integrates four modalities: 1) billion-edge mobility graph (core backbone), 2) 100M+ POIs, 3) massive remote sensing imagery, and 4) structured demographic statistics. It uses spatial tokenization, Graph Neural Networks (GNNs), and asymmetric contrastive learning to align auxiliary modalities with the mobility graph, ensuring interpretation through human dynamics.
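To make the asymmetric alignment concrete, here is a minimal PyTorch sketch assuming an InfoNCE-style loss in which auxiliary-modality embeddings are pulled toward detached mobility-graph embeddings, so the mobility backbone anchors the space; the paper's exact loss form and temperature are not specified here.

```python
import torch
import torch.nn.functional as F

def asymmetric_info_nce(aux_emb, graph_emb, temperature=0.07):
    """Align auxiliary embeddings (POIs / imagery / demographics) to
    mobility-graph embeddings of the same locations. The graph side is
    detached, so only the auxiliary encoder moves; one plausible reading
    of "asymmetric" contrastive learning.
    """
    aux = F.normalize(aux_emb, dim=-1)
    gph = F.normalize(graph_emb.detach(), dim=-1)   # stop-gradient anchor
    logits = aux @ gph.t() / temperature            # batch-vs-batch similarity
    targets = torch.arange(aux.size(0))             # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random 128-d embeddings (the paper's embedding width).
aux = torch.randn(8, 128, requires_grad=True)
loss = asymmetric_info_nce(aux, torch.randn(8, 128))
loss.backward()                                     # gradients reach aux only
print(float(loss), aux.grad is not None)
```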
Result: MoRA achieves superior predictive performance on 9 downstream social/economic tasks, outperforming state-of-the-art models by an average of 12.9% with only 128-dimensional embeddings. The framework also demonstrates scaling behavior similar to LLM scaling laws in geospatial representation learning.
Conclusion: MoRA successfully demonstrates that human mobility patterns provide a fundamental backbone for learning meaningful geospatial representations, enabling better understanding of location socio-economic contexts and functional roles than traditional Earth observation approaches.
Abstract: Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence, with increasingly diverging philosophies and techniques. While Earth observation paradigms excel at depicting locations in their physical states, we claim that a location’s comprehensive “meaning” is better grounded in its internal human activity patterns and, crucially, its functional relationships with other locations, as revealed by human movement. We present MoRA, a human-centric geospatial framework that leverages a mobility graph as its core backbone to fuse various data modalities, aiming to learn embeddings that represent the socio-economic context and functional role of a location. MoRA achieves this through the integration of spatial tokenization, GNNs, and asymmetric contrastive learning to align 100M+ POIs, massive remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph, ensuring the three auxiliary modalities are interpreted through the lens of fundamental human dynamics. To rigorously evaluate the effectiveness of MoRA, we construct a benchmark dataset composed of 9 downstream prediction tasks across social and economic domains. Experiments show that MoRA, with four input modalities and a compact 128-dimensional representation space, achieves predictive performance superior to state-of-the-art models by an average of 12.9%. Echoing LLM scaling laws, we further demonstrate scaling behavior in geospatial representation learning. We open-source code and pretrained models at: https://github.com/ylzhouchris/MoRA.
[270] Privacy Reasoning in Ambiguous Contexts
Ren Yi, Octavian Suciu, Adria Gascon, Sarah Meiklejohn, Eugene Bagdasarian, Marco Gruteser
Main category: cs.AI
TL;DR: Language models struggle with ambiguous context in privacy decisions; Camber framework uses model rationales to disambiguate context, improving accuracy by up to 13.3% precision and 22.3% recall.
Details
Motivation: Previous work focused on aligning models with human privacy decisions, but this paper examines how ambiguity and missing context affect model performance in information-sharing decisions, identifying context ambiguity as a key barrier.
Method: Developed Camber framework for context disambiguation that uses model-generated decision rationales to reveal ambiguities, then systematically disambiguates context based on these rationales.
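A schematic version of the rationale-driven disambiguation loop is sketched below, with `llm` as a placeholder callable (prompt in, text out); the prompts, parsing, and stopping rule are illustrative assumptions rather than Camber's actual implementation.

```python
def disambiguate(scenario, llm, max_rounds=3):
    """Generate a decision rationale, mine it for ambiguities, and fold
    clarifications back into the scenario until none remain.
    """
    for _ in range(max_rounds):
        decision = llm(f"Should this information be shared? {scenario}\n"
                       "Answer yes/no and explain your reasoning.")
        gaps = llm("List any missing or ambiguous context in this rationale, "
                   f"one item per line, or NONE:\n{decision}")
        if gaps.strip().upper() == "NONE":
            return decision                  # rationale is unambiguous
        for gap in gaps.splitlines():        # clarify each ambiguity
            scenario += " " + llm(f"Clarify for the scenario: {gap}")
    return decision

# Stub LLM so the sketch runs end-to-end.
print(disambiguate("A doctor shares test results with a colleague.",
                   lambda p: "NONE" if "missing" in p else "yes: routine care"))
```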
Result: Significant accuracy improvements (up to 13.3% in precision and up to 22.3% in recall) and reductions in prompt sensitivity when using context disambiguation approach.
Conclusion: Context disambiguation approaches are promising for enhancing agentic privacy reasoning in language models, addressing the crucial barrier of context ambiguity in privacy assessments.
Abstract: We study the ability of language models to reason about appropriate information disclosure - a central aspect of the evolving field of agentic privacy. Whereas previous works have focused on evaluating a model’s ability to align with human decisions, we examine the role of ambiguity and missing context on model performance when making information-sharing decisions. We identify context ambiguity as a crucial barrier for high performance in privacy assessments. By designing Camber, a framework for context disambiguation, we show that model-generated decision rationales can reveal ambiguities and that systematically disambiguating context based on these rationales leads to significant accuracy improvements (up to 13.3% in precision and up to 22.3% in recall) as well as reductions in prompt sensitivity. Overall, our results indicate that approaches for context disambiguation are a promising way forward to enhance agentic privacy reasoning.
[271] GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models
Eduardo C. Garrido-Merchán, Cristina Puente
Main category: cs.AI
TL;DR: LLMs generate coherent info but suffer from hallucinations; this paper introduces a controlled approach using LLMs to create symbolic Prolog knowledge bases that can be validated by humans, ensuring interpretability and reliability.
Details
Motivation: LLMs have disadvantages like hallucinations and confident generation of incorrect facts, making them unreliable for sensitive domains that require verifiable, accurate knowledge.
Method: Limit the domain and use well-structured prompt-based extraction to produce symbolic Prolog representations that can be validated and corrected by human experts, combining LLM recall with symbolic precision.
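A toy illustration of the extract-then-validate idea: prompt an LLM for Prolog facts and keep only syntactically well-formed ones for expert review. The prompt wording, the regex grammar, and the `llm` callable are all simplifying assumptions, not the paper's pipeline.

```python
import re

# Accepts simple ground facts like: treats(aspirin, headache).
FACT_RE = re.compile(
    r"^[a-z]\w*\((?:[a-z]\w*|'[^']*')(?:,\s*(?:[a-z]\w*|'[^']*'))*\)\.$")

def extract_prolog_facts(text, llm):
    raw = llm("Express the following as Prolog facts, one per line, "
              f"using lowercase predicates: {text}")
    kept, rejected = [], []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        (kept if FACT_RE.match(line) else rejected).append(line)
    return kept, rejected   # `rejected` goes back to the expert/LLM loop

facts, bad = extract_prolog_facts(
    "Aspirin treats headaches.",
    lambda p: "treats(aspirin, headache).\nnot really prolog")
print(facts, bad)
```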
Result: Experiments with Claude Sonnet 3.7 and GPT-4.1 show strong adherence to facts and semantic coherence in generated knowledge bases, demonstrating a transparent hybrid solution.
Conclusion: This approach provides interpretability, scalability, and reliability for expert systems, laying the foundation for dependable AI applications in sensitive domains by combining LLM recall with symbolic precision.
Abstract: The development of large language models (LLMs) has successfully transformed knowledge-based systems such as open domain question answering, which can automatically produce vast amounts of seemingly coherent information. Yet, those models have several disadvantages like hallucinations or confident generation of incorrect or unverifiable facts. In this paper, we introduce a new approach to the development of expert systems using LLMs in a controlled and transparent way. By limiting the domain and employing a well-structured prompt-based extraction approach, we produce a symbolic representation of knowledge in Prolog, which can be validated and corrected by human experts. This approach also guarantees interpretability, scalability and reliability of the developed expert systems. Via quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1, we show strong adherence to facts and semantic coherence on our generated knowledge bases. We present a transparent hybrid solution that combines the recall capacity of LLMs with the precision of symbolic systems, thereby laying the foundation for dependable AI applications in sensitive domains.
[272] $R^2$-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation
Zhen Wu, Ritam Dutt, Luke M. Breitfeller, Armineh Nourbakhsh, Siddharth Parekh, Carolyn Rosé
Main category: cs.AI
TL;DR: Analysis of text-graph complementarity in relational reasoning tasks using knowledge co-distillation architecture
Details
Motivation: Prior research hasn't systematically explored text-graph interplay and its effect on hybrid models for relational reasoning tasks.
Method: Analysis-driven approach using unified architecture with knowledge co-distillation (CoD) across five relational reasoning tasks with different text-graph information encoding.
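One generic form such a co-distillation objective could take: each view (text and graph) is trained on the task while softly matching the other's predictive distribution. The PyTorch sketch below is an assumption about the loss shape, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def co_distill_loss(text_logits, graph_logits, labels, lam=0.5, tau=2.0):
    """Task loss on both views plus a symmetric, temperature-scaled KL term
    so the text and graph encoders learn from each other's predictions.
    """
    task = F.cross_entropy(text_logits, labels) + F.cross_entropy(graph_logits, labels)
    log_p_text = F.log_softmax(text_logits / tau, dim=-1)
    log_p_graph = F.log_softmax(graph_logits / tau, dim=-1)
    mutual = (F.kl_div(log_p_text, log_p_graph.exp(), reduction="batchmean")
              + F.kl_div(log_p_graph, log_p_text.exp(), reduction="batchmean"))
    return task + lam * (tau ** 2) * mutual

# Toy usage with random logits for a 3-way relation classification batch.
loss = co_distill_loss(torch.randn(4, 3), torch.randn(4, 3),
                       torch.tensor([0, 2, 1, 0]))
print(float(loss))
```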
Result: Uncovered interpretable patterns of alignment and divergence in dual representations during training, providing insights into when and why text-graph integration is beneficial
Conclusion: Systematic analysis reveals how text and graph representations complement each other in relational reasoning, offering guidance for effective hybrid model design
Abstract: Relational reasoning lies at the core of many NLP tasks, drawing on complementary signals from text and graphs. While prior research has investigated how to leverage this dual complementarity, a detailed and systematic understanding of text-graph interplay and its effect on hybrid models remains underexplored. We take an analysis-driven approach to investigate text-graph representation complementarity via a unified architecture that supports knowledge co-distillation (CoD). We explore five tasks involving relational reasoning that differ in how text and graph structures encode the information needed to solve that task. By tracking how these dual representations evolve during training, we uncover interpretable patterns of alignment and divergence, and provide insights into when and why their integration is beneficial.
[273] Holistic Explainable AI (H-XAI): Extending Transparency Beyond Developers in AI-Driven Decision Making
Kausik Lakkaraju, Siva Likitha Valluru, Biplav Srivastava
Main category: cs.AI
TL;DR: H-XAI is a holistic explainable AI framework that integrates causality-based rating with post-hoc explanations for transparent, stakeholder-aligned evaluation of AI systems in online decision contexts.
Details
Motivation: AI systems in domains like credit scoring and financial forecasting lack transparency and exhibit bias, raising fairness and trust concerns. Current XAI approaches primarily serve developers rather than addressing needs of affected users or regulators.
Method: H-XAI combines causality-based rating methods with post-hoc explanation techniques, treating explanation as interactive, hypothesis-driven process. It allows stakeholders to ask questions, test hypotheses, and compare model behavior against automatically generated random and biased baselines.
Result: Through case studies in credit risk assessment and stock price prediction, H-XAI demonstrates extended explainability beyond developers toward responsible and inclusive AI practices that strengthen accountability in sociotechnical systems.
Conclusion: H-XAI provides a framework for transparent, stakeholder-aligned evaluation of AI systems, helping communicate model bias and instability to strengthen accountability in digital decision-making contexts.
Abstract: As AI systems increasingly mediate decisions in domains such as credit scoring and financial forecasting, their lack of transparency and bias raises critical concerns for fairness and public trust. Existing explainable AI (XAI) approaches largely serve developers, focusing on model justification rather than the needs of affected users or regulators. We introduce Holistic eXplainable AI (H-XAI), a framework that integrates causality-based rating methods with post-hoc explanation techniques to support transparent, stakeholder-aligned evaluation of AI systems deployed in online decision contexts. H-XAI treats explanation as an interactive, hypothesis-driven process, allowing users, auditors, and organizations to ask questions, test hypotheses, and compare model behavior against automatically generated random and biased baselines. By combining global and instance-level explanations, H-XAI helps communicate model bias and instability that shape everyday digital decisions. Through case studies in credit risk assessment and stock price prediction, we show how H-XAI extends explainability beyond developers toward responsible and inclusive AI practices that strengthen accountability in sociotechnical systems.
[274] Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
Haifeng Li, Wang Guo, Haiyang Wu, Mengwei Wu, Jipeng Zhang, Qing Zhu, Yu Liu, Xin Huang, Chao Tao
Main category: cs.AI
TL;DR: This review paper proposes shifting from vision-centered to language-centered remote sensing interpretation, using Large Language Models as cognitive hubs inspired by Global Workspace Theory to enable unified understanding, reasoning, and decision-making.
Details
Motivation: Current vision-centered remote sensing models have limitations in multi-modal reasoning, semantic abstraction, and interactive decision-making. While LLMs have been introduced, there's no unified theoretical framework explaining language's cognitive role in remote sensing interpretation.
Method: Proposes a language-centered framework inspired by Global Workspace Theory, treating LLMs as cognitive central hubs integrating perceptual, task, knowledge and action spaces. Constructs a global workspace-driven interpretation mechanism and reviews language-centered solutions for core challenges.
Result: Provides a conceptual foundation for next-generation remote sensing systems, establishing a roadmap toward cognition-driven intelligent geospatial analysis. Identifies core technical challenges and proposes solutions through language-centered approaches.
Conclusion: Advocates for a paradigm shift to language-centered remote sensing interpretation, positioning LLMs as central cognitive components to overcome limitations of vision-centered models and enable more sophisticated understanding, reasoning, and decision-making in geospatial analysis.
Abstract: The mainstream paradigm of remote sensing image interpretation has long been dominated by vision-centered models, which rely on visual features for semantic understanding. However, these models face inherent limitations in handling multi-modal reasoning, semantic abstraction, and interactive decision-making. While recent advances have introduced Large Language Models (LLMs) into remote sensing workflows, existing studies primarily focus on downstream applications, lacking a unified theoretical framework that explains the cognitive role of language. This review advocates a paradigm shift from vision-centered to language-centered remote sensing interpretation. Drawing inspiration from the Global Workspace Theory (GWT) of human cognition, we propose a language-centered framework for remote sensing interpretation that treats LLMs as the cognitive central hub integrating perceptual, task, knowledge and action spaces to enable unified understanding, reasoning, and decision-making. We first explore the potential of LLMs as the central cognitive component in remote sensing interpretation, and then summarize core technical challenges, including unified multimodal representation, knowledge association, and reasoning and decision-making. Furthermore, we construct a global workspace-driven interpretation mechanism and review how language-centered solutions address each challenge. Finally, we outline future research directions from four perspectives: adaptive alignment of multimodal data, task understanding under dynamic knowledge constraints, trustworthy reasoning, and autonomous interaction. This work aims to provide a conceptual foundation for the next generation of remote sensing interpretation systems and establish a roadmap toward cognition-driven intelligent geospatial analysis.
[275] 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap
Main category: cs.AI
TL;DR: Multi-agent framework reduces private info leakage in LLMs by decomposing privacy reasoning into specialized subtasks with iterative validation, achieving 18-19% reduction on benchmarks.
Details
Motivation: Addressing contextual privacy concerns in interactive LLM settings where models process information from multiple sources (e.g., summarizing meetings with private/public info) remains challenging.
Method: Introduce multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing information load on single agents while enabling iterative validation and reliable adherence to contextual privacy norms. Conduct systematic ablation over information-flow topologies to understand error propagation.
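A bare-bones rendering of the decomposition appears below, with `llm` as a placeholder callable; the prompts, labels, and single validation pass are illustrative assumptions rather than the paper's protocol.

```python
def privacy_pipeline(document, llm):
    """Decomposed contextual-privacy reasoning: extract -> classify ->
    validate, so no single agent carries the whole information load.
    """
    items = llm(f"List every distinct piece of information, one per line:\n{document}")
    public = []
    for item in items.splitlines():
        if not item.strip():
            continue
        label = llm(f"Is this PRIVATE or PUBLIC in a meeting summary? {item}")
        if label.strip().upper().startswith("PUBLIC"):
            public.append(item)
    # A validator agent double-checks the aggregate before release.
    verdict = llm("Does this draft leak anything private? Answer OK or FLAG:\n"
                  + "\n".join(public))
    return public if verdict.strip().upper().startswith("OK") else []
```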
Result: Best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT-4o) while preserving public content fidelity, outperforming single-agent baselines.
Conclusion: Results highlight promise of principled information-flow design in multi-agent systems for contextual privacy with LLMs, showing how specialized decomposition and validation mechanisms improve privacy protection.
Abstract: Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources (e.g., summarizing meetings with private and public information). We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. To understand how privacy errors emerge and propagate, we conduct a systematic ablation over information-flow topologies, revealing when and why upstream detection mistakes cascade into downstream leakage. Experiments on the ConfAIde and PrivacyLens benchmarks with several open-source and closed-source LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT-4o) while preserving the fidelity of public content, outperforming single-agent baselines. These results highlight the promise of principled information-flow design in multi-agent systems for contextual privacy with LLMs.
[276] Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?
Zetian Sun, Dongfang Li, Xuhui Chen, Baotian Hu, Min Zhang
Main category: cs.AI
TL;DR: The paper analyzes the effectiveness of static vs on-policy preference data in language model alignment, showing systematic differences across models and proposing a two-stage alignment framework with a boundary measurement algorithm.
Details
Motivation: Current LM alignment methods like DPO use either static preference data or on-policy sampling, but there's no clear understanding of when each approach is optimal. The authors observe systematic effectiveness differences between static and on-policy data across different models, motivating the need for a theoretical framework to explain these differences and guide alignment strategy selection.
Method: The paper proposes the “alignment stage assumption” dividing alignment into two stages: preference injection (benefits from diverse data) and preference fine-tuning (favors high-quality data). They develop theoretical and empirical analysis to characterize these stages and propose an algorithm to identify boundaries between them. Experiments are conducted on 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF).
Result: Results show significant effectiveness differences: on-policy data can be 3× more effective than static data for Llama-3, but only 0.4× as effective for Zephyr. The alignment stage assumption successfully explains these differences, and the boundary measurement algorithm effectively identifies transition points between alignment stages across different models and methods.
Conclusion: The alignment process has distinct stages with different data requirements, and the effectiveness of static vs on-policy data depends on which stage a model is in. The proposed framework provides guidance for selecting appropriate alignment strategies and the boundary measurement algorithm offers practical utility for optimizing LM alignment processes.
Abstract: The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and it was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show on-policy data is not always optimal, with systematic effectiveness differences emerging between static and on-policy preference candidates. For example, on-policy data can be 3× as effective as static data for Llama-3, but only 0.4× as effective for Zephyr. To explain the phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF) to show the generalizability of the alignment stage assumption and the effectiveness of the boundary measurement algorithm.
[277] Improving Value-based Process Verifier via Low-Cost Variance Reduction
Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang
Main category: cs.AI
TL;DR: ComMCS reduces variance in value-based process verification for LLM reasoning by combining Monte Carlo estimators across steps without extra inference cost, improving math problem solving performance.
Details
Motivation: Value-based process verifiers for LLM reasoning suffer from estimation errors due to limited Monte Carlo samples (high inference cost). The error comes primarily from high variance, not bias, and current MC estimators are already Minimum Variance Unbiased Estimators (MVUE).
Method: Proposes Compound Monte Carlo Sampling (ComMCS) that constructs an unbiased estimator by linearly combining MC estimators from current and subsequent steps. Theoretically reduces variance while maintaining unbiased estimation without additional LLM inference cost.
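To see why a cross-step combination can help: if the value of the current step equals the expectation of the next step's value, then any weights summing to one keep the combined estimator unbiased, and variance-aware weights shrink the variance. The numpy sketch below is schematic only; the paper derives its combination formally, and the inverse-variance weighting here is an assumption for illustration.

```python
import numpy as np

def compound_estimate(curr_returns, next_returns):
    """Combine MC estimates of the current step's value with estimates
    derived from the next step. Both arrays hold 0/1 rollout outcomes;
    assuming V(s_t) = E[V(s_{t+1})], each sample mean estimates V(s_t),
    and weights summing to one preserve that property.
    """
    m1, m2 = curr_returns.mean(), next_returns.mean()
    v1 = curr_returns.var(ddof=1) / len(curr_returns) + 1e-8
    v2 = next_returns.var(ddof=1) / len(next_returns) + 1e-8
    w = (1 / v1) / (1 / v1 + 1 / v2)   # lower-variance estimate gets more weight
    return w * m1 + (1 - w) * m2

rng = np.random.default_rng(0)
print(compound_estimate(rng.binomial(1, 0.6, 8), rng.binomial(1, 0.6, 16)))
```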
Result: ComMCS outperforms regression-based optimization by 2.8 points and non-variance-reduced baseline by 2.2 points on MATH-500 in Best-of-32 sampling experiments. Also tested on GSM8K benchmark.
Conclusion: ComMCS effectively addresses variance issues in value-based process verification for LLM reasoning, improving mathematical problem-solving performance without increasing computational cost.
Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples feasible due to the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and that the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance, while maintaining an unbiased estimation without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points on MATH-500 in the Best-of-32 sampling experiment.
[278] BASIL: Bayesian Assessment of Sycophancy in LLMs
Katherine Atwell, Pedram Heydari, Anthony Sicilia, Malihe Alikhani
Main category: cs.AI
TL;DR: The paper introduces a Bayesian framework to distinguish sycophancy from rational belief updating in LLMs, with metrics for descriptive and normative evaluation, and shows methods to reduce sycophantic behavior.
Details
Motivation: Sycophancy in LLMs poses challenges for human-AI collaboration in high-stakes domains, but existing approaches can't separate sycophantic belief shifts from rational updates based on evidence.
Method: A Bayesian probabilistic framework grounded in behavioral economics that separates sycophancy from rational belief updating, with descriptive and normative metrics applicable even without ground-truth labels.
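The Bayesian baseline here is just odds-form belief updating, and a sycophancy signal can be read off as the gap between the model's reported belief and the Bayesian-consistent one. The function below is a schematic reading of a normative metric of this kind, with made-up numbers; it is not the paper's exact formulation.

```python
def bayes_posterior(prior, likelihood_ratio):
    """Posterior via Bayes' rule in odds form: odds *= likelihood ratio."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

def sycophancy_score(prior, evidence_lr, observed_posterior):
    """Gap between the model's reported belief after user pushback and the
    belief a Bayesian agent would hold given only the evidence. Positive
    values mean the model shifted further toward the user than the
    evidence warrants.
    """
    return observed_posterior - bayes_posterior(prior, evidence_lr)

# Evidence mildly argues against agreeing (LR = 0.8), yet the model jumps
# to 0.9 after the user insists: a sycophantic shift of about 0.46.
print(round(sycophancy_score(0.5, 0.8, 0.9), 2))
```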
Result: The framework reveals robust sycophantic belief shifts in LLMs across tasks, and shows post-hoc calibration plus fine-tuning (SFT and DPO) substantially reduces Bayesian inconsistency, especially under explicit sycophancy prompting.
Conclusion: The Bayesian framework successfully distinguishes sycophancy from rational updating, enabling better evaluation and mitigation of sycophantic behavior in LLMs for improved human-AI collaboration.
Abstract: Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
[279] LLM-Generated Explanations Do Not Suffice for Ultra-Strong Machine Learning
Lun Ai, Johannes Langer, Ute Schmid, Stephen Muggleton
Main category: cs.AI
TL;DR: LENS framework combines symbolic program synthesis with LLMs to generate natural language explanations of learned logic programs, achieving higher-quality explanations than templates or direct LLM prompting, but these explanations don’t effectively support human learning in practice.
Details
Motivation: To achieve Ultra Strong Machine Learning (USML) where AI systems can teach their knowledge to improve human performance, moving beyond hand-crafted explanation templates to automatically generate high-quality natural language explanations.
Method: LENS neuro-symbolic framework combines symbolic program synthesis with large language models (LLMs) to automatically generate natural language explanations of learned logic programs, evaluated using LLMs-as-judges and expert validation.
Result: LENS produces higher-quality explanations than both direct LLM prompting and hand-crafted templates, but human trials show LLM-generated explanations provide no advantage over human self-learning despite being rated as higher quality.
Conclusion: Achieving USML requires methods grounded in human learning, as current LLM-generated explanations don’t capture human cognitive constraints and LLMs-as-judges evaluations don’t reflect what effectively supports human learning.
Abstract: Ultra Strong Machine Learning (USML) refers to symbolic learning systems that not only improve their own performance but can also teach their acquired knowledge to quantifiably improve human performance. We introduce LENS (Logic Programming Explanation via Neural Summarisation), a neuro-symbolic framework that combines symbolic program synthesis with large language models (LLMs). This framework automatically generates natural language explanations of learned logic programs, replacing hand-crafted templates used in prior USML work. Using LLMs-as-judges evaluation and expert validation, we show that LENS produces higher-quality explanations than both direct LLM prompting and hand-crafted templates. We then examine whether LENS explanations suffice for achieving USML in a human trial teaching active learning strategies across three related domains. Our exploratory analysis suggests that concise, expert-written explanations may benefit learners with higher initial performance, while LLM-generated explanations provide no advantage over human self-learning despite being rated as higher quality. This case study reveals that achieving USML requires methods grounded in human learning, where current LLM-generated explanations do not capture human cognitive constraints and LLMs-as-judges evaluations do not reflect what effectively supports human learning.
[280] Who Gets Cited Most? Benchmarking Long-Context Reasoning on Scientific Articles
Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata
Main category: cs.AI
TL;DR: SciTrek is a new benchmark for evaluating LLMs’ long-context reasoning using scientific articles, featuring automatically generated questions requiring information synthesis across multiple papers, with SQL-based ground truth for verifiable reasoning analysis.
Details
Motivation: Current long-context benchmarks focus on simple retrieval tasks or use artificial contexts, lacking evaluation of complex reasoning across real scientific articles. There's a need for benchmarks that test information aggregation and synthesis across multiple full-text scientific papers.
Method: Questions and ground-truth answers are automatically generated by formulating them as SQL queries over a database of article metadata (titles, authors, references). This provides explicit, verifiable reasoning processes and scales to contexts up to 1M tokens with minimal supervision.
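The construction is easy to reproduce in miniature: load article metadata into a relational database, phrase the question as SQL, and take the query result as the verifiable ground truth. The schema and rows below are illustrative stand-ins, not the benchmark's actual tables.

```python
import sqlite3

# Tiny metadata database mirroring the benchmark's schema in spirit
# (titles, authors, references); names here are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE refs (citing INTEGER, cited INTEGER);
INSERT INTO papers VALUES (1,'A'),(2,'B'),(3,'C');
INSERT INTO refs VALUES (1,3),(2,3),(2,1);
""")

# A question like "Which article is cited most?" becomes a SQL query
# whose result serves as the verifiable ground-truth answer.
row = db.execute("""
SELECT p.title, COUNT(*) AS n FROM refs r
JOIN papers p ON p.id = r.cited
GROUP BY r.cited ORDER BY n DESC LIMIT 1
""").fetchone()
print(row)   # ('C', 2)
```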
Result: Experiments show SciTrek poses significant challenges as context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Analysis reveals systematic shortcomings in LLMs’ ability to perform numerical operations and accurately locate information in long contexts.
Conclusion: SciTrek addresses limitations of existing long-context benchmarks by providing a scalable, automatically generated evaluation framework that reveals fundamental weaknesses in current LLMs’ long-context reasoning capabilities, particularly in numerical operations and information location.
Abstract: We introduce SciTrek, a novel question-answering benchmark designed to evaluate long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by creating benchmark questions that require information aggregation and synthesis across multiple full-text scientific articles. The questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (i.e., titles, authors, and references). These SQL queries provide explicit, verifiable reasoning processes that enable fine-grained error analysis on model answers, and the data construction scales to contexts of up to 1M tokens with minimal supervision. Experiments on open-weight and proprietary LLMs show that SciTrek poses significant challenges as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings of frontier LLMs’ ability to effectively perform numerical operations and accurately locate information in long contexts.
[281] LLM Agents for Knowledge Discovery in Atomic Layer Processing
Andreas Werbrouck, Marshall B. Lindsay, Matthew Maschmann, Matthias J. Young
Main category: cs.AI
TL;DR: LLM agents can autonomously explore and discover knowledge in materials science systems through trial-and-error exploration of black box functions, demonstrating path-dependent discovery patterns.
Details
Motivation: To test the potential of LLM agents as independent reasoning entities for knowledge discovery in materials science, moving beyond optimization tasks to free exploration and generalizable statement generation.
Method: Repurposed LangGraph’s tool functionality to provide agents with black box functions to interrogate. Used trial-and-error exploration approach with intentionally limited probe capabilities, demonstrated through a children’s parlor game and applied to Atomic Layer Processing reactor simulation.
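The core loop is simple enough to mimic without an LLM: repeatedly probe a black box, accumulate observations, and test a candidate generalization against them. In the paper the probe function is exposed to the agent as a tool; below, a scripted routine with a made-up hidden rule plays the agent's role.

```python
import random

def black_box(x):
    """Stand-in for the system the agent interrogates; the rule is hidden."""
    return "high" if x % 7 == 0 else "low"

def probe_and_conjecture(fn, trials=200, seed=0):
    """Trial-and-error exploration: sample inputs, record outcomes, then
    check whether a candidate generalization survives all observations.
    """
    rng = random.Random(seed)
    observations = [(x, fn(x)) for x in (rng.randrange(100) for _ in range(trials))]
    highs = [x for x, y in observations if y == "high"]
    conjecture = "output is high iff input is divisible by 7"
    verified = bool(highs) and all(x % 7 == 0 for x in highs)
    return conjecture, verified, len(highs)

print(probe_and_conjecture(black_box))
```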
Result: Proof of concept shows LLM agents can explore, discover, and exploit diverse chemical interactions in complex systems without explicit instructions, demonstrating the importance of persistence and path-dependence in knowledge discovery.
Conclusion: LLM agents show promise as autonomous knowledge discovery tools in materials science, capable of generating and verifying generalizable statements through free exploration of complex systems.
Abstract: Large Language Models (LLMs) have garnered significant attention for several years now. Recently, their use as independently reasoning agents has been proposed. In this work, we test the potential of such agents for knowledge discovery in materials science. We repurpose LangGraph’s tool functionality to supply agents with a black box function to interrogate. In contrast to process optimization or performing specific, user-defined tasks, knowledge discovery consists of freely exploring the system, posing and verifying statements about the behavior of this black box, with the sole objective of generating and verifying generalizable statements. We provide proof of concept for this approach through a children’s parlor game, demonstrating the role of trial-and-error and persistence in knowledge discovery, and the strong path-dependence of results. We then apply the same strategy to show that LLM agents can explore, discover, and exploit diverse chemical interactions in an advanced Atomic Layer Processing reactor simulation using intentionally limited probe capabilities without explicit instructions.
[282] Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu
Main category: cs.AI
TL;DR: Deep layers in LLMs are often considered less important, but this paper shows depth usage is highly context-dependent - shallow layers handle knowledge/retrieval while deeper layers enable reasoning and coherence, with evaluation metrics dramatically affecting conclusions.
Details
Motivation: To challenge oversimplified claims that deep layers in LLMs are unimportant, and to provide a systematic analysis of how different layers contribute across various evaluation settings, tasks, and model architectures.
Method: Systematic study analyzing depth utilization across diverse dimensions including evaluation protocols (likelihood-based vs generation-based), task categories, and model architectures, using layer pruning and distillation techniques.
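Depth pruning itself is mechanically simple, as the toy PyTorch model below shows: truncate the block stack and compare representations. This stands in for the study's setup; it loads no real LLM, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Minimal stack of transformer blocks to illustrate depth pruning;
    a stand-in for a real LLM, which this sketch does not load.
    """
    def __init__(self, dim=64, depth=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth))

    def forward(self, x, keep=None):
        # `keep` truncates the stack, mimicking "drop the deepest layers".
        for layer in self.layers[: keep or len(self.layers)]:
            x = layer(x)
        return x

model, x = ToyLM().eval(), torch.randn(2, 16, 64)
with torch.no_grad():
    full, pruned = model(x), model(x, keep=6)
print(torch.dist(full, pruned).item())   # representation drift from pruning
```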
Result: Under likelihood metrics without generation, only initial layers are critical; but generation-based evaluation reveals middle/deeper layers are indispensable for reasoning and long-range coherence. Knowledge/retrieval concentrate in shallow layers while reasoning accuracy depends on deeper layers but can be reshaped through distillation.
Conclusion: Depth usage in LLMs is highly heterogeneous and context-dependent, requiring task-, metric-, and model-aware perspectives for both interpreting and compressing large models, rather than simplistic claims about layer importance.
Abstract: Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers – yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
[283] SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling
Yansen Zhang, Qingcan Kang, Yujie Chen, Yufei Wang, Xiongwei Han, Tao Zhong, Mingxuan Yuan, Chen Ma
Main category: cs.AI
TL;DR: SAC-Opt is a backward-guided correction framework that uses semantic anchors to refine LLM-generated optimization code, improving modeling accuracy by 7.7% on average across datasets.
Details
Motivation: Existing LLM approaches for optimization modeling are solver-driven and rely on limited post-hoc fixes, leaving undetected semantic errors that produce syntactically correct but logically flawed models.
Method: SAC-Opt uses backward-guided correction that grounds optimization modeling in problem semantics rather than solver feedback. It aligns original semantic anchors with those reconstructed from generated code and selectively corrects mismatched components.
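In pseudocode-like Python, the backward-guided loop might look like the sketch below, where `llm` is a placeholder callable and the anchor format (one constraint or objective per line) is an assumption for illustration.

```python
def sac_opt_round(problem, code, llm, max_iters=4):
    """Backward-guided correction sketch: compare semantic anchors parsed
    from the problem statement with anchors reconstructed from the
    generated solver code, then regenerate only the mismatched components.
    """
    gold = set(llm(f"List constraints and objectives, one per line:\n{problem}")
               .splitlines())
    for _ in range(max_iters):
        found = set(llm(f"List constraints and objectives implemented here:\n{code}")
                    .splitlines())
        mismatched = gold ^ found          # symmetric difference = disagreements
        if not mismatched:
            return code                    # semantically faithful model
        code = llm("Fix only these components in the code:\n"
                   + "\n".join(sorted(mismatched)) + "\n\n" + code)
    return code
```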
Result: Empirical results on seven public datasets show SAC-Opt improves average modeling accuracy by 7.7%, with gains up to 21.9% on the ComplexLP dataset.
Conclusion: Semantic-anchored correction is crucial in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code, enhancing both fidelity and robustness without requiring additional training.
Abstract: Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.
[284] MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption
Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides
Main category: cs.AI
TL;DR: MetaVLA is a unified post-training framework for Vision-Language-Action models that enables efficient and scalable alignment through context-aware meta co-training, reducing training costs while improving performance on unseen tasks.
Details
Motivation: Current VLA models require task-specific fine-tuning, have high compute costs, and generalize poorly to unseen tasks, limiting their potential as general-purpose embodied agents.
Method: Proposes Context-Aware Meta Co-Training that consolidates diverse target tasks into single fine-tuning stage using auxiliary tasks for better generalization. Uses lightweight meta-learning mechanism derived from Attentive Neural Processes for rapid adaptation without significant architectural changes or inference overhead.
Result: On LIBERO benchmark: outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%.
Conclusion: MetaVLA demonstrates that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents.
Abstract: Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.
[285] Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Haoran Sun, Yankai Jiang, Zhenyu Tang, Yaning Pan, Shuang Gu, Zekai Lin, Lilong Wang, Wenjie Lou, Lei Liu, Lei Bai, Xiaosong Wang
Main category: cs.AI
TL;DR: Thoth is a new LLM system that generates complete, consistent scientific protocols using a “Sketch-and-Fill” paradigm and structured reward mechanism, outperforming existing models on protocol generation tasks.
Details
Motivation: Current LLMs generate incomplete or inconsistent scientific protocols, limiting their utility for reproducible science. Autonomous generation of precise, logically ordered, and executable protocols could greatly improve reproduction efficiency.
Method: 1) Created SciRecipe dataset (12K structured protocols across 27 biological subfields); 2) Proposed “Sketch-and-Fill” paradigm separating analysis, structuring, and expression; 3) Structured component-based reward mechanism evaluating step granularity, action order, and semantic fidelity; 4) Developed Thoth using staged Knowledge-to-Action training process.
Result: Thoth consistently surpasses both proprietary and open-source LLMs across multiple benchmarks, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy.
Conclusion: The approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution, with all data, code, and models to be released publicly.
Abstract: The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the “Sketch-and-Fill” paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.
[286] Plan Then Retrieve: Reinforcement Learning-Guided Complex Reasoning over Knowledge Graphs
Yanlin Song, Ben Liu, Víctor Gutiérrez-Basulto, Zhiwei Hu, Qianqian Xie, Min Peng, Sophia Ananiadou, Jeff Z. Pan
Main category: cs.AI
TL;DR: Graph-RFT is a two-stage reinforcement fine-tuning framework for KGQA that enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions.
Details
Motivation: Existing KGQA methods struggle to fully exploit both KG knowledge and LLM reasoning capabilities, especially in complex scenarios. They assume complete KG coverage, lack mechanisms to judge when external information is needed, and have locally myopic reasoning that fails to maintain coherent multi-step planning.
Method: Two-stage framework: 1) Chain-of-thought fine-tuning with customized plan-retrieval dataset to activate structured reasoning and resolve GRPO cold-start problem; 2) Plan-retrieval guided reinforcement learning with explicit planning/retrieval actions and multi-reward design. Includes Cartesian-inspired planning module for question decomposition and logical expression for tool invocation.
Result: Enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions, learning when and how to combine KG and web retrieval effectively through multi-reward optimization.
Conclusion: Graph-RFT addresses key limitations in current KGQA approaches by enabling coverage-aware retrieval scheduling and globally consistent multi-step reasoning, allowing LLMs to better leverage both structured KG knowledge and web information for complex question answering.
Abstract: Knowledge Graph Question Answering aims to answer natural language questions by reasoning over structured knowledge graphs. While large language models have advanced KGQA through their strong reasoning capabilities, existing methods continue to struggle to fully exploit both the rich knowledge encoded in KGs and the reasoning capabilities of LLMs, particularly in complex scenarios. They often assume complete KG coverage and lack mechanisms to judge when external information is needed, and their reasoning remains locally myopic, failing to maintain coherent multi-step planning, leading to reasoning failures even when relevant knowledge exists. We propose Graph-RFT, a novel two-stage reinforcement fine-tuning KGQA framework with a ‘plan-KGsearch-and-Websearch-during-think’ paradigm that enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions. Graph-RFT introduces a chain-of-thought fine-tuning method with a customized plan-retrieval dataset that activates structured reasoning and resolves the GRPO cold-start problem. It then introduces a novel plan-retrieval guided reinforcement learning process that integrates explicit planning and retrieval actions with a multi-reward design, enabling coverage-aware retrieval scheduling. It employs a Cartesian-inspired planning module to decompose complex questions into ordered subquestions, and logical expressions to guide tool invocation for globally consistent multi-step reasoning. This reasoning-retrieval process is optimized with a multi-reward design combining outcome and retrieval-specific signals, enabling the model to learn when and how to combine KG and web retrieval effectively.
[287] Reasoning-Aware Proxy Reward Model using Process Mining
Yongjae Lee, Taekhyun Park, Sunghyun Sim, Hyerim Bae
Main category: cs.AI
TL;DR: TACReward is a reward model that uses process mining to evaluate stepwise reasoning quality in mathematical problem solving, enabling better feedback for sparse reward policy gradient methods without extra annotation costs.
Details
Motivation: Current sparse reward methods for language model post-training provide limited feedback on intermediate reasoning steps in mathematical problem solving. Binarized outcome rewards don't capture reasoning quality, and existing approaches estimating overall reasoning quality may not reliably reflect stepwise reasoning structure.
Method: TACReward treats reasoning as a structured process and uses process mining techniques to aggregate stepwise structural deviations between teacher and policy reasoning. It produces a scalar reward in the [0,1] range to indicate reasoning quality, seamlessly integrating into existing sparse reward frameworks without additional human annotation or architectural changes.
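A minimal stand-in for a structural-deviation reward: score a policy's step sequence against the teacher's by normalized edit distance, which already yields a scalar in [0, 1]. The paper's process-mining alignment is richer than this, so treat the function below as illustrative only.

```python
def step_reward(teacher_steps, policy_steps):
    """Reward in [0, 1]: 1 minus the normalized Levenshtein distance
    between the teacher's and the policy's step-label sequences.
    """
    m, n = len(teacher_steps), len(policy_steps)
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = teacher_steps[i - 1] != policy_steps[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,        # delete a teacher step
                          d[i][j - 1] + 1,        # insert a policy step
                          d[i - 1][j - 1] + cost) # match or substitute
    return 1.0 - d[m][n] / max(m, n, 1)

print(step_reward(["define", "substitute", "solve", "check"],
                  ["define", "solve", "check"]))   # 0.75
```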
Result: Experiments on multiple mathematical reasoning benchmarks show that integrating TACReward into sparse reward frameworks encourages policy models to improve structural quality of reasoning, leading to consistent performance improvements over existing sparse reward methods.
Conclusion: TACReward effectively addresses the limitation of sparse rewards in reasoning tasks by providing structured feedback on stepwise reasoning quality, enabling better reinforcement learning for language models in mathematical problem solving without additional costs.
Abstract: Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL) for language models post-training. However, for reasoning tasks such as mathematical problem solving, binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted to address this issue by estimating overall reasoning quality, it remains unclear whether these rewards are reliable proxies for the quality of stepwise reasoning. In this study, we consider reasoning as a structured process and propose TACReward, a reward model that can be seamlessly integrated into sparse reward policy gradient methods without additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range [0, 1] to indicate reasoning quality. Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward frameworks encourages the policy model to improve the structural quality of reasoning. Consequently, this leads to consistent performance improvements over existing sparse reward frameworks. Our code and model are publicly available on GitHub (https://github.com/Thrillcrazyer/TACReward) and HuggingFace (https://huggingface.co/Thrillcrazyer/TACReward7B).
[288] Quantifying Fidelity: A Decisive Feature Approach to Comparing Synthetic and Real Imagery
Danial Safaei, Siddartha Khastgir, Mohsen Alirezaei, Jeroen Ploeg, Son Tong, Xingyu Zhao
Main category: cs.AI
TL;DR: Proposes Decisive Feature Fidelity (DFF), a new SUT-specific metric for AV testing that measures whether autonomous systems use consistent decision evidence across real and simulated domains, rather than just visual realism.
Details
Motivation: Current virtual testing focuses on visual realism, but recent studies show pixel-level fidelity doesn't guarantee reliable transfer from simulation to real world. What matters is whether the system bases decisions on consistent evidence across domains.
Method: Introduces Decisive Feature Fidelity (DFF) metric that uses explainable-AI methods to identify and compare decisive features driving SUT’s decisions for matched real-synthetic pairs. Proposes estimators based on counterfactual explanations and DFF-guided calibration scheme.
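A simplified proxy for the idea: compare which features drive the model's output on a matched real/synthetic pair, for example by Jaccard overlap of top-k attributions. Note the paper builds its DFF estimators from counterfactual explanations; the numpy sketch below only conveys the flavor under that substitution.

```python
import numpy as np

def decisive_feature_overlap(attr_real, attr_sim, k=100):
    """Jaccard overlap of the top-k attribution features for a matched
    real/synthetic pair. attr_real, attr_sim: flattened attribution maps
    from any XAI method applied to the same SUT on both images.
    """
    top_real = set(np.argsort(-np.abs(attr_real))[:k])
    top_sim = set(np.argsort(-np.abs(attr_sim))[:k])
    return len(top_real & top_sim) / len(top_real | top_sim)

# Toy usage: a lightly perturbed copy should keep most decisive features.
rng = np.random.default_rng(0)
a = rng.normal(size=10_000)
print(decisive_feature_overlap(a, a + rng.normal(scale=0.1, size=10_000)))
```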
Result: Experiments on 2126 matched KITTI-VirtualKITTI2 pairs show DFF reveals discrepancies overlooked by conventional output-value fidelity. DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
Conclusion: DFF provides a behavior-grounded fidelity measure that captures mechanism parity - agreement in model-specific decisive evidence across domains, offering a more reliable approach to simulator calibration for AV safety assurance.
Abstract: Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on consistent decision evidence in both real and simulated environments, not just whether images “look real” to humans. To this end, this paper proposes a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity, that is, agreement in the model-specific decisive evidence that drives the SUT’s decisions across domains. DFF leverages explainable-AI methods to identify and compare the decisive features driving the SUT’s outputs for matched real-synthetic pairs. We further propose estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
[289] Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios
Defei Xia, Bingfeng Pi, Shenbin Zhang, Song Hua, Yunfei Wei, Lei Zuo
Main category: cs.AI
TL;DR: Jenius-Agent is a system-level LLM agent framework with adaptive prompting, context-aware tool orchestration, and layered memory that improves robustness in long-horizon tasks, plus a new evaluation methodology for better failure diagnosis.
Details
Motivation: Existing agent frameworks and benchmarks lack visibility into execution-level behavior, making it difficult to diagnose failures in tool invocation, state tracking, and context management, especially for real-world deployment.
Method: Jenius-Agent integrates adaptive prompt generation, context-aware tool orchestration, and a layered memory mechanism to stabilize execution. It also introduces an evaluation methodology that jointly measures procedural fidelity, semantic correctness, and efficiency.
Result: Experiments on Jenius-bench show up to 35% relative improvement in task completion rate over base agent, with reduced token consumption, response latency, and tool invocation failures. The framework is already deployed in production at Jenius.
Conclusion: Jenius-Agent provides a lightweight, scalable solution for robust autonomous agents with better observability and systematic failure analysis, addressing limitations of output-only metrics in existing frameworks.
Abstract: As agent systems powered by large language models (LLMs) advance, improving performance in context understanding, tool usage, and long-horizon execution has become critical. However, existing agent frameworks and benchmarks provide limited visibility into execution-level behavior, making failures in tool invocation, state tracking, and context management difficult to diagnose. This paper presents Jenius-Agent, a system-level agent framework grounded in real-world deployment experience. It integrates adaptive prompt generation, context-aware tool orchestration, and a layered memory mechanism to stabilize execution and improve robustness in long-horizon, tool-augmented tasks. Beyond system design, we introduce an evaluation methodology that jointly measures procedural fidelity, semantic correctness, and efficiency. This framework makes agent behavior observable as a structured execution process and enables systematic analysis of failure modes not captured by output-only metrics. Experiments on Jenius-bench show substantial improvements in task completion rate, with up to a 35 percent relative gain over the base agent, along with reduced token consumption, response latency, and tool invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.
[290] Large-Scale Continual Scheduling and Execution for Dynamic Distributed Satellite Constellation Observation Allocation
Itai Zilberstein, Steve Chien
Main category: cs.AI
TL;DR: The paper proposes new online algorithms for large-scale dynamic distributed constraint optimization problems (DDCOP) applied to multi-satellite constellation observation scheduling, with D-NSS algorithm outperforming baselines and forming foundation for NASA FAME mission.
Details
Motivation: Earth-observing satellite constellations are growing rapidly, requiring distributed onboard control for time-sensitive measurements. Deploying autonomy to large multiagent systems needs algorithms with efficient computation and communication, which current DDCOP approaches struggle with.
Method: Proposes DCOSP formulation for integrated scheduling and execution, develops an omniscient offline algorithm for optimality analysis, and creates D-NSS (Dynamic Incremental Neighborhood Stochastic Search) - an incomplete online decomposition-based DDCOP approach.
Result: D-NSS converges to near-optimal solutions and outperforms DDCOP baselines in solution quality, computation time, and message volume. The work forms foundation for NASA FAME mission, the largest in-space demonstration of distributed multiagent AI.
Conclusion: The proposed algorithms enable efficient distributed autonomy for large satellite constellations, addressing computational and communication challenges while providing near-optimal performance for dynamic observation scheduling problems.
Abstract: The size and capabilities of Earth-observing satellite constellations are rapidly increasing. Leveraging distributed onboard control, we can enable novel time-sensitive measurements and responses. However, deploying autonomy to large multiagent satellite systems necessitates algorithms with efficient computation and communication. We tackle this challenge and propose new, online algorithms for large-scale dynamic distributed constraint optimization problems (DDCOP). We present the Dynamic Multi-Satellite Constellation Observation Scheduling Problem (DCOSP), a new formulation of DDCOPs that models integrated scheduling and execution. We construct an omniscient offline algorithm to compute the novel optimality condition of DCOSP and present the Dynamic Incremental Neighborhood Stochastic Search (D-NSS) algorithm, an incomplete online decomposition-based DDCOP approach. We show through simulation that D-NSS converges to near-optimal solutions and outperforms DDCOP baselines in terms of solution quality, computation time, and message volume. Our work forms the foundation of the largest in-space demonstration of distributed multiagent AI to date: the NASA FAME mission.
[291] AtomMem : Learnable Dynamic Agentic Memory with Atomic Memory Operation
Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, Yankai Lin
Main category: cs.AI
TL;DR: AtomMem reframes memory management as a dynamic decision-making problem using atomic CRUD operations, combining supervised fine-tuning with RL to learn task-aligned memory policies that outperform static workflow methods.
Details
Motivation: Existing agent memory mechanisms rely on static, hand-crafted workflows, limiting performance and generalization. There's a need for more flexible, learning-based memory frameworks to solve real-world long-horizon problems.
Method: Deconstructs high-level memory processes into atomic CRUD (Create, Read, Update, Delete) operations, transforming memory workflow into a learnable decision process. Combines supervised fine-tuning with reinforcement learning to learn autonomous, task-aligned policies.
Result: AtomMem-8B consistently outperforms prior static-workflow memory methods across 3 long-context benchmarks. Training dynamics analysis shows the agent discovers structured, task-aligned memory management strategies.
Conclusion: Learning-based formulation enables discovery of effective memory management strategies, highlighting key advantages over predefined routines for solving long-horizon problems.
Abstract: Equipping agents with memory is essential for solving real-world long-horizon problems. However, most existing agent memory mechanisms rely on static and hand-crafted workflows. This limits the performance and generalization ability of these memory designs, which highlights the need for a more flexible, learning-based memory framework. In this paper, we propose AtomMem, which reframes memory management as a dynamic decision-making problem. We deconstruct high-level memory processes into fundamental atomic CRUD (Create, Read, Update, Delete) operations, transforming the memory workflow into a learnable decision process. By combining supervised fine-tuning with reinforcement learning, AtomMem learns an autonomous, task-aligned policy to orchestrate memory behaviors tailored to specific task demands. Experimental results across 3 long-context benchmarks demonstrate that the trained AtomMem-8B consistently outperforms prior static-workflow memory methods. Further analysis of training dynamics shows that our learning-based formulation enables the agent to discover structured, task-aligned memory management strategies, highlighting a key advantage over predefined routines.
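A minimal sketch of what an atomic CRUD memory interface could look like as the action space for such a policy; the class, the naive retrieval, and all names are illustrative assumptions, not AtomMem's implementation.

```python
# Hypothetical atomic memory the agent's policy operates over: at each step
# the policy emits one of four operations (plus arguments).
from dataclasses import dataclass, field

@dataclass
class AtomicMemory:
    store: dict[str, str] = field(default_factory=dict)

    def create(self, key: str, value: str) -> None:
        self.store[key] = value

    def read(self, query: str) -> list[str]:
        # Naive substring matching stands in for a learned retriever.
        return [v for v in self.store.values() if query in v]

    def update(self, key: str, value: str) -> None:
        if key in self.store:
            self.store[key] = value

    def delete(self, key: str) -> None:
        self.store.pop(key, None)

# SFT bootstraps the operation format; RL then tunes when to invoke each op.
```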
[292] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen, Yanfeng Wang
Main category: cs.AI
TL;DR: ML-Master 2.0 achieves 56.44% medal rate on MLE-Bench using Hierarchical Cognitive Caching for ultra-long-horizon autonomy in machine learning engineering.
Details
Motivation: AI advancement toward agentic science is bottlenecked by ultra-long-horizon autonomy - the ability to maintain strategic coherence over experimental cycles spanning days/weeks. LLMs struggle with high-dimensional, delayed-feedback environments and fail to consolidate sparse feedback into long-term guidance.
Method: Hierarchical Cognitive Caching (HCC) - a multi-tiered architecture inspired by computer systems that reframes context management as cognitive accumulation. It dynamically distills transient execution traces into stable knowledge and cross-task wisdom, decoupling immediate execution from long-term strategy.
Result: Achieves state-of-the-art 56.44% medal rate on OpenAI’s MLE-Bench under 24-hour budgets, demonstrating superior performance in ultra-long-horizon machine learning engineering tasks.
Conclusion: Ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities, with HCC enabling structural differentiation of experience over time to overcome context window limitations.
Abstract: The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI’s MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.
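The cache hierarchy can be pictured with a toy structure like the one below; the tier names, promotion rule, and `summarize` callable are assumptions in the spirit of HCC rather than its actual design.

```python
# Toy multi-tiered cognitive cache: transient traces are periodically
# distilled upward into stable tiers that feed long-term strategy.
class CognitiveCache:
    def __init__(self):
        self.traces = []     # transient execution traces (fast-changing)
        self.knowledge = []  # distilled task-level findings (stable)
        self.wisdom = []     # cross-task heuristics (long-lived)

    def log_trace(self, trace: str) -> None:
        self.traces.append(trace)

    def consolidate(self, summarize) -> None:
        """Distill accumulated traces, e.g. with an LLM `summarize` callable."""
        if self.traces:
            self.knowledge.append(summarize(self.traces))
            self.traces.clear()

    def strategic_context(self, budget: int) -> list[str]:
        # Strategy reads from the stable tiers first, so execution detail
        # never crowds cross-task guidance out of the context window.
        return (self.wisdom + self.knowledge)[:budget]
```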
[293] Rethinking the AI Scientist: Interactive Multi-Agent Workflows for Scientific Discovery
Lukas Weidener, Marko Brkić, Mihailo Jovanović, Ritvik Singh, Chiara Baccin, Emre Ulgac, Alex Dobrin, Aakaash Meduri
Main category: cs.AI
TL;DR: Deep Research is a multi-agent AI system for interactive scientific discovery with minute-level turnaround times, achieving state-of-the-art performance on computational biology benchmarks.
Details
Motivation: Existing AI systems for scientific discovery are proprietary, operate in batch-processing modes with hours-long cycles, and lack real-time researcher guidance capabilities.
Method: Multi-agent architecture with specialized agents for planning, data analysis, literature search, and novelty detection, unified through persistent world state. Supports semi-autonomous (with human checkpoints) and fully autonomous operational modes.
Result: Achieved 48.8% accuracy on open response and 64.4% on multiple-choice evaluation on BixBench computational biology benchmark, exceeding existing baselines by 14-26 percentage points.
Conclusion: Deep Research enables interactive scientific investigation with rapid turnaround, though practical deployment faces challenges including open access literature limitations and automated novelty assessment difficulties.
Abstract: Artificial intelligence systems for scientific discovery have demonstrated remarkable potential, yet existing approaches remain largely proprietary and operate in batch-processing modes requiring hours per research cycle, precluding real-time researcher guidance. This paper introduces Deep Research, a multi-agent system enabling interactive scientific investigation with turnaround times measured in minutes. The architecture comprises specialized agents for planning, data analysis, literature search, and novelty detection, unified through a persistent world state that maintains context across iterative research cycles. Two operational modes support different workflows: semi-autonomous mode with selective human checkpoints, and fully autonomous mode for extended investigations. Evaluation on the BixBench computational biology benchmark demonstrated state-of-the-art performance, achieving 48.8% accuracy on open response and 64.4% on multiple-choice evaluation, exceeding existing baselines by 14 to 26 percentage points. Analysis of architectural constraints, including open access literature limitations and challenges inherent to automated novelty assessment, informs practical deployment considerations for AI-assisted scientific workflows.
[294] Understanding Mental States to Guide Social Influence in Multi-Person Group Dialogue
Zhichao Liang, Satoshi Nakamura
Main category: cs.AI
TL;DR: SocialMindChange benchmark moves from passive mental state tracking to active mind-changing in social interactions, testing LLMs’ ability to generate dialogue to achieve goals while maintaining consistent mental-state representations across connected scenes.
Details
Motivation: Existing ToM benchmarks are passive - models just read and report mental states. Real social interaction requires using ToM for action: planning what to say to change others' mental states toward goals. Need to test active mind-changing ability.
Method: Created SocialMindChange benchmark with structured 4-step framework: 1) define social context with 4 characters, 2) create 5 connected scenes, 3) model plays one character generating dialogue across scenes to reach target, 4) include higher-order mental states. Constructed 1,200 contexts (6,000 scenarios, 90,000+ questions) validated for realism.
Result: Evaluated 10 state-of-the-art LLMs, found average performance 54.2% below human performance. Shows current LLMs struggle to maintain and change mental-state representations across long, linked social interactions.
Conclusion: SocialMindChange reveals significant gap in LLMs’ active ToM capabilities. Moving from passive tracking to active mind-changing exposes limitations in maintaining consistent mental-state representations during extended social interactions.
Abstract: Existing dynamic Theory of Mind (ToM) benchmarks mostly place language models in a passive role: the model reads a sequence of connected scenarios and reports what people believe, feel, intend, and do as these states change. In real social interaction, ToM is also used for action: a speaker plans what to say in order to shift another person’s mental-state trajectory toward a goal. We introduce SocialMindChange, a benchmark that moves from tracking minds to changing minds in social interaction. Each instance defines a social context with four characters and five connected scenes. The model plays one character and generates dialogue across the five scenes to reach the target while remaining consistent with the evolving states of all participants. SocialMindChange also includes selected higher-order states. Using a structured four-step framework, we construct 1,200 social contexts, covering 6,000 scenarios and over 90,000 questions, each validated for realism and quality. Evaluations on ten state-of-the-art LLMs show that their average performance is 54.2% below human performance. This gap suggests that current LLMs still struggle to maintain and change mental-state representations across long, linked interactions.
[295] Human Simulation Computation: A Human-Inspired Framework for Adaptive AI Systems
Hong Su
Main category: cs.AI
TL;DR: HSC proposes a human-inspired computational framework that models intelligence as a continuous closed-loop process involving thinking, action, learning, reflection, and scheduling, addressing LLMs’ limitations in real-world adaptation.
Details
Motivation: Current LLMs rely solely on textual data, limiting their ability to adapt, verify reasoning outcomes, and operate effectively in open, dynamic real-world environments. There's a need for more robust, human-like reasoning systems.
Method: Human Simulation Computation (HSC) framework models intelligence as a continuous closed-loop process with thinking, action, learning, reflection, and activity scheduling. It emphasizes active participation in both internal reasoning and environmental interactions, using actions to refine reasoning mechanisms automatically. Incorporates human thinking strategies like main-feature-oriented reasoning, scope expansion through action, and on-time learning from environmental feedback.
Result: Theoretical analysis shows human simulation strategies cannot be fully learned from language material alone. Human-like reasoning processes and action-grounded reasoning methods are essential for robust adaptation and effective interaction with real-world environments.
Conclusion: HSC provides a comprehensive framework for developing more adaptive, verifiable AI systems that can effectively operate in dynamic real-world environments by incorporating human-inspired reasoning processes and action-grounded learning mechanisms.
Abstract: Large language models (LLMs) have demonstrated strong capabilities in knowledge representation and reasoning based on textual data. However, their reliance on language material alone limits their ability to adapt, verify reasoning outcomes, and operate effectively in open and dynamic real-world environments. In this paper, we propose Human Simulation Computation (HSC), a human-inspired computational framework that models intelligence as a continuous, closed-loop process involving thinking, action, learning, reflection, and activity scheduling, collectively referred to as the internal reasoning process. HSC emphasizes active participation both within the internal reasoning process and in interactions with the environment, where actions are used not only to achieve goals but also to automatically refine and improve internal reasoning mechanisms without external intervention. Furthermore, HSC incorporates commonly used human thinking strategies across all stages of the internal reasoning process, such as main-feature-oriented reasoning, scope expansion through action, and on-time learning driven by environmental feedback. Through theoretical analysis, we argue that human simulation strategies cannot be fully learned from language material alone, and that human-like reasoning processes and action-grounded reasoning methods are essential for robust adaptation and effective interaction with real-world environments.
[296] LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen
Main category: cs.AI
TL;DR: LangForce addresses information collapse in VLA models by enforcing instruction following through Bayesian decomposition and maximizing conditional PMI between actions and instructions.
Details
Motivation: Current VLA models struggle with generalization to new instructions and multi-task scenarios due to dataset bias where language instructions become predictable from visual observations alone, causing information collapse where models ignore language constraints.
Method: Proposes LangForce framework with learnable Latent Action Queries in a dual-branch architecture to estimate vision-only prior p(a|v) and language-conditioned posterior π(a|v,ℓ), then optimizes policy to maximize conditional Pointwise Mutual Information between actions and instructions.
Result: Significantly improves generalization without requiring new data, achieving 11.3% improvement on challenging OOD SimplerEnv benchmark, with extensive experiments across SimplerEnv and RoboCasa demonstrating substantial gains.
Conclusion: LangForce effectively addresses information collapse in VLA models by penalizing vision shortcuts and rewarding actions that explain language commands, enabling robust language grounding in action for better generalization.
Abstract: Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
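The PMI objective itself is compact; assuming per-action log-probabilities are available from the two branches, a sketch of the loss looks like this (variable names are illustrative):

```python
# Conditional PMI objective: PMI(a; l | v) = log pi(a|v,l) - log p(a|v).
import torch

def pmi_loss(logp_posterior: torch.Tensor, logp_prior: torch.Tensor) -> torch.Tensor:
    """logp_posterior: log pi(a | v, l) from the language-conditioned branch.
    logp_prior:        log p(a | v) from the vision-only prior branch."""
    pmi = logp_posterior - logp_prior
    # Minimizing -PMI rewards actions the instruction explains well while
    # penalizing actions already predictable from vision alone (the shortcut).
    return -pmi.mean()
```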
[297] Tabular Incremental Inference
Xinda Chen, Zhen Xing, Hanyu Zhang, Weimin Tan, Bo Yan
Main category: cs.AI
TL;DR: Tabular Incremental Inference (TabII) enables AI models to handle dynamically changing table columns during inference, using information bottleneck theory and LLM placeholders with Pretrained TabAdapters.
Details
Motivation: Traditional AI models trained on fixed-column tables cannot handle dynamically changing tables in real-world scenarios where columns evolve due to technological advancements, changing needs, and data integration.
Method: Frames TabII as an optimization problem using information bottleneck theory, designs a method with Large Language Model placeholders and Pretrained TabAdapter for external knowledge, and uses Incremental Sample Condensation blocks to condense task-relevant information from incremental column attributes.
Result: Experimental results across eight public datasets show TabII effectively utilizes incremental attributes and achieves state-of-the-art performance.
Conclusion: TabII addresses the practical limitation of fixed-column table models by enabling incremental inference with dynamic column changes, enhancing AI model practicality in real-world tabular data scenarios.
Abstract: Tabular data is a fundamental form of data structure. The evolution of table analysis tools reflects humanity’s continuous progress in data acquisition, management, and processing. The dynamic changes in table columns arise from technological advancements, changing needs, data integration, etc. However, the standard process of training AI models on tables with fixed columns and then performing inference is not suitable for handling dynamically changed tables. Therefore, new methods are needed for efficiently handling such tables in an unsupervised manner. In this paper, we introduce a new task, Tabular Incremental Inference (TabII), which aims to enable trained models to incorporate new columns during the inference stage, enhancing the practicality of AI models in scenarios where tables are dynamically changed. Furthermore, we demonstrate that this new task can be framed as an optimization problem based on information bottleneck theory, which emphasizes that the key to an ideal tabular incremental inference approach lies in minimizing the mutual information between the tabular data and the representation while maximizing that between the representation and the task labels. Under this guidance, we design a TabII method with Large Language Model placeholders and Pretrained TabAdapter to provide external knowledge and Incremental Sample Condensation blocks to condense the task-relevant information given by incremental column attributes. Experimental results across eight public datasets show that TabII effectively utilizes incremental attributes, achieving state-of-the-art performance.
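For reference, the information-bottleneck objective the paper invokes takes the standard form below; the notation ($X$ for the table with incremental columns, $Z$ for the representation, $Y$ for the task labels, $\beta$ for the trade-off weight) is assumed here rather than taken from the paper.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```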
[298] The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data
Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou
Main category: cs.AI
TL;DR: The paper proposes an LLM Data Auditor framework to systematically evaluate the quality and trustworthiness of LLM-generated synthetic data across six modalities, shifting focus from extrinsic task-based evaluation to intrinsic data properties.
Details
Motivation: LLMs can generate synthetic data to overcome real-world data scarcity, but ensuring high-quality synthetic data remains challenging. Existing research focuses on generation methods rather than data quality evaluation, and lacks a unified perspective across different data modalities.
Method: Proposes the LLM Data Auditor framework that: 1) describes LLM-based data generation across six modalities, 2) systematically categorizes intrinsic evaluation metrics for synthetic data from quality and trustworthiness dimensions, 3) analyzes experimental evaluations of representative methods, and 4) provides practical application methodologies.
Result: Analysis reveals substantial deficiencies in current evaluation practices for LLM-generated synthetic data. The framework identifies gaps in quality assessment and provides concrete recommendations for improving data generation evaluation.
Conclusion: The LLM Data Auditor framework addresses critical gaps in synthetic data evaluation by shifting focus to intrinsic data properties, providing systematic evaluation metrics, and offering practical guidance for improving data generation quality assessment across multiple modalities.
Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the \textbf{LLM Data Auditor framework}. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.
cs.SD
[299] SICL-AT: Another way to adapt Auditory LLM to low-resource task
Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson
Main category: cs.SD
TL;DR: SICL-AT improves auditory LLMs’ in-context learning for low-resource speech/audio tasks using only high-resource data, outperforming direct fine-tuning.
Details
Motivation: Auditory LLMs struggle with low-resource or unfamiliar tasks, and direct fine-tuning is brittle when labeled in-domain data is scarce or mismatched. In-context learning offers a training-free alternative but needs enhancement for better generalization.
Method: Proposes Speech In-Context Learning Adaptation Training (SICL-AT), a post-training recipe that uses only high-resource speech data to strengthen models' in-context learning capability, which then generalizes to audio understanding/reasoning tasks.
Result: Experiments show the proposed method consistently outperforms direct fine-tuning in low-resource scenarios, and vanilla ICL improves zero-shot performance across diverse speech/audio tasks.
Conclusion: SICL-AT effectively enhances auditory LLMs’ in-context learning ability using only high-resource data, providing a robust solution for low-resource speech and audio tasks without requiring in-domain labeled data.
Abstract: Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource or unfamiliar tasks. When labeled in-domain data is scarce or mismatched to the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that \emph{Vanilla ICL} improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability generalizes to the multimodal setting. Building on this, we propose \textbf{Speech In-Context Learning Adaptation Training (SICL-AT)}, a post-training recipe that utilizes only high-resource speech data to strengthen the model's in-context learning capability. The enhancement generalizes to audio understanding and reasoning tasks. Experiments indicate that our proposed method consistently outperforms direct fine-tuning in low-resource scenarios.
[300] Enhancing Speech Emotion Recognition using Dynamic Spectral Features and Kalman Smoothing
Marouane El Hizabri, Abdelfattah Bezzaz, Ismail Hayoukane, Youssef Taki
Main category: cs.SD
TL;DR: Adding dynamic spectral features with Kalman smoothing improves speech emotion recognition accuracy by reducing noise and stabilizing classification, achieving 87% accuracy on RAVDESS dataset.
Details
Motivation: Traditional speech emotion recognition systems rely on static features (MFCCs, ZCR, RMSE) which are vulnerable to acoustic noise, leading to emotion misclassification. The paper aims to address this limitation by incorporating temporal dynamics and noise reduction.
Method: The proposed method combines dynamic spectral features (Deltas and Delta-Deltas) with the Kalman Smoothing algorithm. The Kalman filter reduces noise in vocal signals while also stabilizing classifier outputs over time, accounting for the temporal nature of emotional expression.
Result: Achieved state-of-the-art 87% accuracy on RAVDESS dataset. The method significantly reduced misclassification between emotions with similar acoustic features, demonstrating improved robustness to noise.
Conclusion: Incorporating dynamic features with Kalman smoothing effectively addresses noise vulnerability in speech emotion recognition, leading to more accurate and stable emotion classification by capturing temporal dynamics and reducing acoustic noise interference.
Abstract: Speech Emotion Recognition systems often use static features like Mel-Frequency Cepstral Coefficients (MFCCs), Zero Crossing Rate (ZCR), and Root Mean Square Energy (RMSE). Because of this, they can misclassify emotions when there is acoustic noise in vocal signals. To address this, we added dynamic features using Dynamic Spectral features (Deltas and Delta-Deltas) along with the Kalman Smoothing algorithm. This approach reduces noise and improves emotion classification. Since emotion changes over time, the Kalman Smoothing filter also helped make the classifier outputs more stable. Tests on the RAVDESS dataset showed that this method achieved a state-of-the-art accuracy of 87% and reduced misclassification between emotions with similar acoustic features.
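The feature pipeline is easy to prototype; the sketch below pairs librosa's MFCC/delta features with a simple scalar random-walk Kalman smoother. The noise parameters, the smoother variant, and the file path are illustrative assumptions, not the paper's configuration.

```python
# MFCCs + deltas + delta-deltas, each trajectory smoothed by a 1-D Kalman filter.
import librosa
import numpy as np

def kalman_smooth(x: np.ndarray, q: float = 1e-3, r: float = 1e-1) -> np.ndarray:
    """Scalar random-walk Kalman filter along one feature trajectory."""
    est, p = float(x[0]), 1.0
    out = np.empty_like(x)
    for t, z in enumerate(x):
        p += q                # predict: process noise inflates uncertainty
        k = p / (p + r)       # Kalman gain from measurement noise r
        est += k * (z - est)  # update toward the new observation
        p *= (1.0 - k)
        out[t] = est
    return out

y, sr = librosa.load("speech.wav", sr=None)   # placeholder input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])
smoothed = np.apply_along_axis(kalman_smooth, axis=1, arr=feats)
```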
[301] A Framework for Evaluating Faithfulness in Explainable AI for Machine Anomalous Sound Detection Using Frequency-Band Perturbation
Alexander Buck, Georgina Cosma, Iain Phillips, Paul Conway, Patrick Baker
Main category: cs.SD
TL;DR: Researchers propose a quantitative framework to evaluate XAI faithfulness in anomalous sound detection by linking attribution relevance to model behavior through frequency-band removal, finding Occlusion performs best while gradient-based methods often fail.
Details
Motivation: Current XAI methods for anomalous sound detection rely on qualitative inspection of saliency maps, leaving uncertainty about whether these attributions accurately reflect the spectral cues the model actually uses. There's a need for objective evaluation of XAI faithfulness in audio analysis.
Method: Introduces a quantitative framework that directly links attribution relevance to model behavior through systematic frequency-band removal. This approach objectively measures whether XAI methods correctly identify frequency regions that influence ASD model predictions. Tests four widely adopted XAI methods: Integrated Gradients, Occlusion, Grad-CAM, and SmoothGrad.
Result: XAI techniques differ significantly in reliability. Occlusion demonstrates the strongest alignment with true model sensitivity, while gradient-based methods (Integrated Gradients, Grad-CAM, SmoothGrad) often fail to accurately capture spectral dependencies in anomalous sound detection models.
Conclusion: The proposed framework offers a reproducible way to benchmark audio explanations and enables more trustworthy interpretation of spectrogram-based ASD systems, addressing the gap in quantitative evaluation of XAI faithfulness for machine sound analysis.
Abstract: Explainable AI (XAI) is commonly applied to anomalous sound detection (ASD) models to identify which time-frequency regions of an audio signal contribute to an anomaly decision. However, most audio explanations rely on qualitative inspection of saliency maps, leaving open the question of whether these attributions accurately reflect the spectral cues the model uses. In this work, we introduce a new quantitative framework for evaluating XAI faithfulness in machine-sound analysis by directly linking attribution relevance to model behaviour through systematic frequency-band removal. This approach provides an objective measure of whether an XAI method for machine ASD correctly identifies frequency regions that influence an ASD model’s predictions. By using four widely adopted methods, namely Integrated Gradients, Occlusion, Grad-CAM and SmoothGrad, we show that XAI techniques differ in reliability, with Occlusion demonstrating the strongest alignment with true model sensitivity and gradient-based methods often failing to accurately capture spectral dependencies. The proposed framework offers a reproducible way to benchmark audio explanations and enables more trustworthy interpretation of spectrogram-based ASD systems.
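The core faithfulness test can be sketched as a band-deletion curve: remove frequency bands in decreasing order of attributed relevance and track the model's score. Function names, the band count, and the zero-fill perturbation are assumptions; a faithful XAI method should produce a steep early drop.

```python
# Illustrative band-deletion faithfulness check for a spectrogram-based ASD model.
import numpy as np

def band_deletion_curve(spec: np.ndarray, relevance: np.ndarray,
                        score_fn, n_bands: int = 16) -> np.ndarray:
    """spec, relevance: (freq, time) arrays; score_fn maps a spectrogram
    to the model's anomaly score."""
    edges = np.linspace(0, spec.shape[0], n_bands + 1, dtype=int)
    band_rel = [relevance[lo:hi].sum() for lo, hi in zip(edges[:-1], edges[1:])]
    order = np.argsort(band_rel)[::-1]            # most relevant bands first
    scores, perturbed = [score_fn(spec)], spec.copy()
    for b in order:
        perturbed[edges[b]:edges[b + 1]] = 0.0    # delete the band
        scores.append(score_fn(perturbed))
    return np.array(scores)
```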
[302] Audio Foundation Models Outperform Symbolic Representations for Piano Performance Evaluation
Jai Dhiman
Main category: cs.SD
TL;DR: Audio-based piano performance evaluation using pre-trained models (MuQ/MERT) outperforms symbolic (MIDI) approaches by 55% on 19 perceptual quality dimensions, with audio alone being sufficient for evaluation.
Details
Motivation: Traditional MIDI-based piano performance evaluation misses acoustic nuances crucial for expressive playing assessment. Audio foundation models can capture these nuances for better performance evaluation.
Method: Use pre-trained audio foundation models (MuQ and MERT) to predict 19 perceptual dimensions of piano performance quality. Compare audio vs symbolic approaches using synthesized audio from PercePiano MIDI files rendered via Pianoteq under controlled conditions.
Result: MuQ layers 9-12 with Pianoteq soundfont augmentation achieves R² = 0.537 (55% improvement over symbolic baseline R² = 0.347). Audio outperforms symbolic on all 19 dimensions (p < 10⁻²⁵). Cross-soundfont generalization (R² = 0.534), difficulty correlation (rho = 0.623), and multi-performer consistency validated.
Conclusion: Audio representations alone are sufficient for piano performance evaluation, as audio-symbolic fusion provides minimal benefit due to high error correlation (r = 0.738). Audio-based approach significantly outperforms traditional symbolic methods.
Abstract: Automated piano performance evaluation traditionally relies on symbolic (MIDI) representations, which capture note-level information but miss the acoustic nuances that characterize expressive playing. I propose using pre-trained audio foundation models, specifically MuQ and MERT, to predict 19 perceptual dimensions of piano performance quality. Using synthesized audio from PercePiano MIDI files (rendered via Pianoteq), I compare audio and symbolic approaches under controlled conditions where both derive from identical source data. The best model, MuQ layers 9-12 with Pianoteq soundfont augmentation, achieves R^2 = 0.537 (95% CI: [0.465, 0.575]), representing a 55% improvement over the symbolic baseline (R^2 = 0.347). Statistical analysis confirms significance (p < 10^-25) with audio outperforming symbolic on all 19 dimensions. I validate the approach through cross-soundfont generalization (R^2 = 0.534 +/- 0.075), difficulty correlation with an external dataset (rho = 0.623), and multi-performer consistency analysis. Analysis of audio-symbolic fusion reveals high error correlation (r = 0.738), explaining why fusion provides minimal benefit: audio representations alone are sufficient. I release the complete training pipeline, pretrained models, and inference code.
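A minimal probe in the spirit of this setup: mean-pool selected hidden layers of a frozen foundation model and fit a linear head for the 19 dimensions. The pooling choice and ridge head are illustrative assumptions, not the released pipeline.

```python
# Layer-pooled linear probe over frozen audio-foundation-model features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def pool_layers(hidden_states: list[np.ndarray], layers=range(9, 13)) -> np.ndarray:
    """hidden_states[i]: (T, D) features from layer i; pool over time and layers."""
    return np.mean([hidden_states[i].mean(axis=0) for i in layers], axis=0)

def probe_r2(X: np.ndarray, Y: np.ndarray,
             X_val: np.ndarray, Y_val: np.ndarray) -> float:
    """X: (N, D) pooled embeddings; Y: (N, 19) perceptual ratings."""
    head = Ridge(alpha=1.0).fit(X, Y)
    return r2_score(Y_val, head.predict(X_val))
```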
[303] Interpretable and Perceptually-Aligned Music Similarity with Pretrained Embeddings
Arhan Vohra, Taketo Akama
Main category: cs.SD
TL;DR: Pretrained text-audio embeddings match state-of-the-art perceptual similarity without fine-tuning; new method with source separation and linear optimization improves alignment and provides interpretable instrument weights for music retrieval.
Details
Motivation: Current music similarity systems using self-supervised metric learning align well with human perception but lack interpretability and generalization, and suffer from limited dataset availability. There's a need for more interpretable and controllable similarity models that can help music producers find stem-level content.
Method: First shows that pretrained text-audio embeddings (CLAP and MuQ-MuLan) work well without fine-tuning. Then introduces a novel method combining source separation with linear optimization on ABX preference data from listening tests to perceptually align pretrained embeddings.
Result: Pretrained embeddings achieve comparable perceptual alignment to state-of-the-art methods. The proposed method surpasses this baseline and provides interpretable, controllable instrument-wise weights, enabling stem-level music retrieval based on mixed reference songs.
Conclusion: Pretrained embeddings offer strong baseline for perceptual similarity, and the proposed source separation + linear optimization method provides improved alignment with interpretable instrument weights, making it valuable for music production applications.
Abstract: Perceptual similarity representations enable music retrieval systems to determine which songs sound most similar to listeners. State-of-the-art approaches based on task-specific training via self-supervised metric learning show promising alignment with human judgment, but are difficult to interpret or generalize due to limited dataset availability. We show that pretrained text-audio embeddings (CLAP and MuQ-MuLan) offer comparable perceptual alignment on similarity tasks without any additional fine-tuning. To surpass this baseline, we introduce a novel method to perceptually align pretrained embeddings with source separation and linear optimization on ABX preference data from listening tests. Our model provides interpretable and controllable instrument-wise weights, allowing music producers to retrieve stem-level loops and samples based on mixed reference songs.
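The interpretable similarity implied here can be sketched as per-stem cosine similarities combined by learned non-negative weights; the stem list and the fitting recipe are assumptions for illustration.

```python
# Instrument-weighted similarity over embeddings of separated stems.
import numpy as np

STEMS = ["vocals", "drums", "bass", "other"]  # assumed separation targets

def weighted_similarity(emb_a: dict[str, np.ndarray],
                        emb_b: dict[str, np.ndarray],
                        w: np.ndarray) -> float:
    """Cosine similarity per stem, mixed by instrument weights w (len 4)."""
    sims = np.array([
        emb_a[s] @ emb_b[s]
        / (np.linalg.norm(emb_a[s]) * np.linalg.norm(emb_b[s]))
        for s in STEMS
    ])
    return float(w @ sims)

# w can be fit by linear optimization so each ABX judgment ranks the preferred
# pair (A, X) above (B, X), e.g. with a hinge loss and w >= 0 constraints.
```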
[304] A Hybrid Discriminative and Generative System for Universal Speech Enhancement
Yinghao Liu, Chengwei Liu, Xiaotao Liang, Haoyin Yan, Shaofei Xue, Zheng Xue
Main category: cs.SD
TL;DR: Hybrid speech enhancement system combining discriminative and generative models with adaptive fusion, achieving 3rd place in ICASSP 2026 URGENT Challenge.
Details
Motivation: Universal speech enhancement needs to handle diverse speech distortions and recording conditions, requiring both signal fidelity and detail-rich reconstruction capabilities.
Method: Hybrid architecture with: 1) discriminative TF-GridNet with Sampling-Frequency-Independent strategy for variable sampling rates, 2) autoregressive model with spectral mapping for detail-rich speech generation, and 3) fusion network with adaptive weights optimized by signal-level and Speech Quality Assessment losses.
Result: The system achieved 3rd place ranking in the ICASSP 2026 URGENT Challenge (Track 1).
Conclusion: The hybrid approach successfully synergizes discriminative and generative modeling strengths for universal speech enhancement, demonstrating competitive performance in challenge evaluation.
Abstract: Universal speech enhancement aims at handling inputs with various speech distortions and recording conditions. In this work, we propose a novel hybrid architecture that synergizes the signal fidelity of discriminative modeling with the reconstruction capabilities of generative modeling. Our system utilizes the discriminative TF-GridNet model with the Sampling-Frequency-Independent strategy to handle variable sampling rates universally. In parallel, an autoregressive model combined with spectral mapping modeling generates detail-rich speech while effectively suppressing generative artifacts. Finally, a fusion network learns adaptive weights of the two outputs under the optimization of signal-level losses and the comprehensive Speech Quality Assessment (SQA) loss. Our proposed system is evaluated in the ICASSP 2026 URGENT Challenge (Track 1) and ranks third.
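The adaptive fusion stage can be pictured with a toy gate network that blends the two outputs per time-frequency bin; the architecture and shapes below are illustrative assumptions, not the challenge system.

```python
# Toy fusion: a learned gate mixes discriminative and generative spectrograms.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, n_feats: int = 257):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv1d(2 * n_feats, n_feats, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, disc_spec: torch.Tensor, gen_spec: torch.Tensor) -> torch.Tensor:
        # disc_spec, gen_spec: (B, F, T) magnitude spectrograms
        w = self.gate(torch.cat([disc_spec, gen_spec], dim=1))  # (B, F, T) in (0, 1)
        return w * disc_spec + (1 - w) * gen_spec
```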
[305] Phase-Retrieval-Based Physics-Informed Neural Networks For Acoustic Magnitude Field Reconstruction
Karl Schrader, Shoichi Koyama, Tomohiko Nakamura, Mirco Pezzoli
Main category: cs.SD
TL;DR: Proposes a phase-retrieval-based PINN method to estimate acoustic magnitude fields from sparse magnitude measurements when phase information is unavailable.
Details
Motivation: Current PINNs for sound field estimation require phase measurements, but phase data is often unreliable or inaccessible in real-world applications. There's a need for methods that can work with only magnitude measurements.
Method: Develops a phase-retrieval-based PINN that uses separate neural networks to represent magnitude and phase distributions. The PDE loss is computed based on the reconstructed complex amplitude, enabling physics-informed learning without direct phase measurements.
Result: Experimental evaluation demonstrates the effectiveness of the proposed method for estimating acoustic magnitude fields from spatially sparse magnitude measurements.
Conclusion: The phase-retrieval-based PINN successfully extends physics-informed neural networks to settings where only magnitude measurements are available, addressing a practical limitation of conventional PINNs for acoustic field estimation.
Abstract: We propose a method for estimating the magnitude distribution of an acoustic field from spatially sparse magnitude measurements. Such a method is useful when phase measurements are unreliable or inaccessible. Physics-informed neural networks (PINNs) have shown promise for sound field estimation by incorporating constraints derived from governing partial differential equations (PDEs) into neural networks. However, they do not extend to settings where phase measurements are unavailable, as the loss function based on the governing PDE relies on phase information. To remedy this, we propose a phase-retrieval-based PINN for magnitude field estimation. By representing the magnitude and phase distributions with separate networks, the PDE loss can be computed based on the reconstructed complex amplitude. We demonstrate the effectiveness of our phase-retrieval-based PINN through experimental evaluation.
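One plausible form of the loss, under assumed notation: with separate networks $A_\theta$ (magnitude) and $\varphi_\psi$ (phase), the reconstructed complex amplitude is penalized by the Helmholtz residual plus a magnitude data term at the $M$ microphone positions $x_m$.

```latex
\mathcal{L} \;=\; \big\| \nabla^2 u + k^2 u \big\|^2
\;+\; \lambda \sum_{m=1}^{M} \big( A_\theta(x_m) - |p(x_m)| \big)^2,
\qquad u = A_\theta \, e^{i \varphi_\psi}
```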
[306] Residual Tokens Enhance Masked Autoencoders for Speech Modeling
Samir Sadok, Stéphane Lathuilière, Xavier Alameda-Pineda
Main category: cs.SD
TL;DR: RT-MAE is a masked autoencoder framework that combines supervised attribute modeling with unsupervised residual tokens to capture speech information beyond explicit factors like pitch, content, and speaker identity.
Details
Motivation: Current speech modeling relies on explicit attributes (pitch, content, speaker identity) but these cannot capture the full richness of natural speech, including timbre variations, noise, emotion, and other subtle characteristics.
Method: RT-MAE uses a masked autoencoder framework that augments supervised attributes-based modeling with unsupervised residual trainable tokens designed to encode information not explained by explicit labeled factors.
Result: RT-MAE improves reconstruction quality, preserves content and speaker similarity while enhancing expressivity. It also demonstrates applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
Conclusion: The RT-MAE framework successfully bridges supervised and unsupervised learning for speech modeling, capturing richer speech characteristics beyond explicit attributes and showing practical value in speech enhancement applications.
Abstract: Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, and emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
[307] Physics-Aware Novel-View Acoustic Synthesis with Vision-Language Priors and 3D Acoustic Environment Modeling
Congyi Fan, Jian Guan, Youtian Lin, Dongli Xu, Tong Ye, Qiaoxi Zhu, Pengming Feng, Wenwu Wang
Main category: cs.SD
TL;DR: Phys-NVAS: A physics-aware novel-view acoustic synthesis framework that combines 3D geometry reconstruction with vision-language semantic priors for realistic binaural audio generation.
Details
Motivation: Spatial audio is crucial for immersive experiences, but current NVAS methods fail to capture global geometry and semantic cues like object layout and material properties, limiting their ability to model complex physical phenomena like reflection, diffraction, and absorption.
Method: 1) Reconstructs global 3D acoustic environment from multi-view images and depth maps to estimate room size/shape; 2) Uses vision-language model to extract physics-aware priors of objects, layouts, and materials; 3) Unifies these cues via acoustic feature fusion adapter into physics-aware representation for binaural generation.
Result: Experiments on RWAVS dataset show Phys-NVAS produces binaural audio with improved realism and physical consistency compared to existing methods.
Conclusion: Phys-NVAS successfully integrates spatial geometry modeling with vision-language semantic priors to address limitations of previous NVAS approaches, enabling more realistic physics-aware acoustic synthesis.
Abstract: Spatial audio is essential for immersive experiences, yet novel-view acoustic synthesis (NVAS) remains challenging due to complex physical phenomena such as reflection, diffraction, and material absorption. Existing methods based on single-view or panoramic inputs improve spatial fidelity but fail to capture global geometry and semantic cues such as object layout and material properties. To address this, we propose Phys-NVAS, the first physics-aware NVAS framework that integrates spatial geometry modeling with vision-language semantic priors. A global 3D acoustic environment is reconstructed from multi-view images and depth maps to estimate room size and shape, enhancing spatial awareness of sound propagation. Meanwhile, a vision-language model extracts physics-aware priors of objects, layouts, and materials, capturing absorption and reflection beyond geometry. An acoustic feature fusion adapter unifies these cues into a physics-aware representation for binaural generation. Experiments on RWAVS demonstrate that Phys-NVAS yields binaural audio with improved realism and physical consistency.
[308] Dual-Strategy-Enhanced ConBiMamba for Neural Speaker Diarization
Zhen Liao, Gaole Dai, Mengqiao Chen, Wenqing Cheng, Wei Xu
Main category: cs.SD
TL;DR: ConBiMamba for speaker diarization combines Conformer and Mamba strengths, replacing self-attention with ExtBiMamba for efficiency, adding Boundary-Enhanced Transition Loss for better change point detection, and Layer-wise Feature Aggregation for multi-layer representation. Achieves SOTA on 4/6 datasets.
Details
Motivation: Conformer and Mamba have limitations for speaker diarization: Mamba struggles with local details and nonlinear patterns, while Conformer's self-attention has high memory overhead for long sequences and instability in long-range dependencies. Diarization requires both precise local modeling and robust speaker consistency over extended spans.
Method: Proposes Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system following Pyannote pipeline. Integrates Conformer’s convolutional and feed-forward structures for local feature extraction, replaces self-attention with ExtBiMamba for efficient long sequence handling. Adds Boundary-Enhanced Transition Loss to improve speaker change point detection and Layer-wise Feature Aggregation for multi-layer representation utilization.
Result: Evaluated on six diarization datasets, achieves state-of-the-art performance on four of them. System reduces memory cost of self-attention while improving local feature extraction and speaker change point detection.
Conclusion: ConBiMamba effectively addresses limitations of both Conformer and Mamba for speaker diarization by combining their strengths, introducing specialized losses for boundary detection, and enhancing multi-layer feature utilization, resulting in superior performance across multiple datasets.
Abstract: Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns. Conformer’s self-attention incurs high memory overhead for long speech sequences and may cause instability in long-range dependency modeling. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we first apply ConBiMamba to speaker diarization. We follow the Pyannote pipeline and propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba, where Conformer’s convolutional and feed-forward structures are utilized to improve local feature extraction. By replacing Conformer’s self-attention with ExtBiMamba, ConBiMamba efficiently handles long audio sequences while alleviating the high memory cost of self-attention. Furthermore, to address the problem of higher DER around speaker change points, we introduce the Boundary-Enhanced Transition Loss to enhance the detection of speaker change points. We also propose Layer-wise Feature Aggregation to enhance the utilization of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code of our study is available at https://github.com/lz-hust/DSE-CBM.
[309] SLM-SS: Speech Language Model for Generative Speech Separation
Tianhua Li, Chenda Li, Wei Wang, Xin Zhou, Xihui Chen, Jianqing Gao, Yanmin Qian
Main category: cs.SD
TL;DR: SLM-SS applies speech language models to speech separation, framing it as discrete multi-codebook sequence generation to enhance intelligibility and coherence of separated signals.
Details
Motivation: Current neural network-based speech separation methods improve signal-level metrics but often fail to maintain speech intelligibility, which negatively impacts downstream tasks like speech recognition.
Method: Frames speech separation as discrete multi-codebook sequence generation using Encoder-Decoder models to map quantized speech mixtures to target tokens. Introduces both autoregressive and non-autoregressive models for improved decoding efficiency.
Result: Experimental results on LibriMix dataset show significantly better preservation of speech intelligibility and improved linguistic consistency in downstream tasks compared to existing approaches.
Conclusion: SLM-SS successfully enhances speech intelligibility and coherence in separated signals by applying speech language models, leading to better performance in downstream applications.
Abstract: Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using Encoder-Decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach shows significantly better preservation of speech intelligibility, leading to improved linguistic consistency in a variety of downstream tasks compared to existing approaches.
[310] A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models
Iwona Christop, Mateusz Czyżnikiewicz, Paweł Skórzewski, Łukasz Bondaruk, Jakub Kubiak, Marcin Lewandowski, Marek Kubis
Main category: cs.SD
TL;DR: Proposes Audio Reasoning Tasks (ART), a new benchmark for multimodal models to test reasoning skills across different audio tasks, addressing limitations of current benchmarks that test audio tasks in isolation.
Details
Motivation: Current benchmarks for multimodal large language models only test audio tasks like speaker diarization or gender identification in isolation, but cannot verify whether models can combine reasoning skills across different audio task categories.Method: Proposes Audio Reasoning Tasks (ART) as a new benchmark specifically designed to assess multimodal models’ ability to solve problems requiring reasoning over audio signals.
Result: Not specified in the abstract - this appears to be a proposal paper introducing the ART benchmark concept rather than presenting experimental results.
Conclusion: The ART benchmark addresses a critical gap in multimodal model evaluation by enabling testing of reasoning capabilities that combine different audio task categories, moving beyond isolated task testing.
Abstract: Present benchmarks for testing the audio modality of multimodal large language models concentrate on testing individual audio tasks, such as speaker diarization or gender identification, in isolation. They cannot verify whether a multimodal model can answer questions that require reasoning skills to combine audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.
[311] Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification
Zhihua Fang, Liang He
Main category: cs.SD
TL;DR: Proposes hyperbolic space-based speaker embedding methods (H-Softmax and HAM-Softmax) that leverage hyperbolic geometry’s hierarchical representation capabilities to improve speaker verification performance.
Details
Motivation: Euclidean space speaker embeddings are insufficient for modeling hierarchical information in speaker features. Hyperbolic space, with its negative curvature and ability to represent hierarchical structures efficiently in finite volume, is better suited for speaker embedding distributions.Method: Two hyperbolic space-based methods: H-Softmax projects embeddings and speaker centers into hyperbolic space and computes hyperbolic distances to incorporate hierarchical information. HAM-Softmax adds margin constraints to H-Softmax to enhance inter-class separability.
Result: H-Softmax achieves 27.84% average relative EER reduction compared to standard Softmax. HAM-Softmax achieves 14.23% average relative EER reduction compared to AM-Softmax. Both methods improve speaker verification performance while preserving hierarchical structure modeling capability.
Conclusion: Hyperbolic space-based speaker embedding methods effectively improve speaker verification performance by leveraging hyperbolic geometry’s hierarchical representation properties, demonstrating the value of geometric considerations in speaker embedding learning.
Abstract: Speaker embedding learning based on Euclidean space has achieved significant progress, but it is still insufficient in modeling hierarchical information within speaker features. Hyperbolic space, with its negative curvature geometric properties, can efficiently represent hierarchical information within a finite volume, making it more suitable for the feature distribution of speaker embeddings. In this paper, we propose Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) based on hyperbolic space. H-Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances. HAM-Softmax further enhances inter-class separability by introducing a margin constraint on this basis. Experimental results show that H-Softmax and HAM-Softmax achieve average relative EER reductions of 27.84% and 14.23% compared with standard Softmax and AM-Softmax, respectively, demonstrating that the proposed methods effectively improve speaker verification performance while preserving the capability of hierarchical structure modeling. The code will be released at https://github.com/PunkMale/HAM-Softmax.
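A minimal sketch of the hyperbolic-distance logits, assuming a Poincaré-ball model and an AM-Softmax-style additive margin on the target class; the paper's exact curvature handling and margin placement may differ.

```python
import torch
import torch.nn.functional as F

def poincare_dist(u, v, eps=1e-5):
    # Poincare-ball distance: arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
    uu = (u * u).sum(-1).clamp(max=1 - eps)
    vv = (v * v).sum(-1).clamp(max=1 - eps)
    duv = ((u - v) ** 2).sum(-1)
    return torch.acosh((1 + 2 * duv / ((1 - uu) * (1 - vv))).clamp(min=1 + eps))

def ham_softmax_loss(emb, centers, labels, margin=0.2, scale=30.0):
    # Logits are negative hyperbolic distances; the additive margin makes the
    # true class "harder" during training, as in AM-Softmax.
    d = poincare_dist(emb.unsqueeze(1), centers.unsqueeze(0))      # (B, C)
    onehot = F.one_hot(labels, centers.size(0)).float()
    return F.cross_entropy(scale * -(d + margin * onehot), labels)

# Embeddings and centers must lie inside the unit ball:
emb = 0.9 * F.normalize(torch.randn(4, 16), dim=-1)
centers = 0.9 * F.normalize(torch.randn(10, 16), dim=-1)
print(ham_softmax_loss(emb, centers, torch.randint(0, 10, (4,))))
```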
[312] Advanced Modeling of Interlanguage Speech Intelligibility Benefit with L1-L2 Multi-Task Learning Using Differentiable K-Means for Accent-Robust Discrete Token-Based ASR
Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu
Main category: cs.SD
TL;DR: Proposed method improves ASR robustness to foreign-accented speech by advanced modeling of interlanguage speech intelligibility benefit (ISIB) using differentiable k-means and joint L1-L2 optimization.
Details
Motivation: Building ASR systems robust to foreign-accented speech is crucial in today's globalized world. Prior work showed ISIB (where foreign-accented speech is more intelligible to listeners sharing the speaker's native language) can enhance phonetic token-based ASR, but needed more advanced modeling.Method: Proposes advanced ISIB modeling using differentiable k-means and optimizing the entire module for both L1 (native language) and L2 (target language) ASR. This allows end-to-end optimization rather than just using L1 for k-means clustering.
Result: Outperformed baselines in both scenarios: using only native speech and when incorporating limited accented speech. Achieved ~20% relative improvement in recognition accuracy in the latter scenario.
Conclusion: Advanced ISIB modeling with differentiable k-means and joint L1-L2 optimization significantly improves ASR performance on foreign-accented speech, demonstrating the value of incorporating linguistic knowledge about accent intelligibility patterns.
Abstract: Building ASR systems robust to foreign-accented speech is an important challenge in today’s globalized world. A prior study explored how to enhance the performance of phonetic token-based ASR on accented speech by reproducing the phenomenon known as interlanguage speech intelligibility benefit (ISIB), where foreign-accented speech is more intelligible to listeners sharing the speaker’s native language than to native listeners. ISIB was technically implemented by using the speaker’s L1 to learn k-means cluster centroids in an SSL feature space to obtain phonetic tokens. In this study, we propose a more advanced modeling of ISIB. By employing differentiable k-means and optimizing the entire module for both L1 and L2 ASR, the proposed method outperformed the baselines, both when using only native speech and when additionally incorporating a limited amount of accented speech. Notably, in the latter scenario, our method achieved approximately a 20% relative improvement in recognition accuracy.
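The key mechanism, making the k-means tokenizer differentiable so that ASR losses in both L1 and L2 can move the centroids, can be sketched as a soft assignment over negative distances. The cluster count, feature dimension, and temperature below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DifferentiableKMeans(nn.Module):
    """Soft tokenization: centroids are trainable parameters, and the hard
    cluster assignment is relaxed to a softmax over negative squared
    distances so gradients from downstream ASR losses reach the centroids."""
    def __init__(self, n_clusters=100, dim=768, tau=0.1):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
        self.tau = tau

    def forward(self, feats):                       # feats: (B, T, D) SSL features
        c = self.centroids.unsqueeze(0).expand(feats.size(0), -1, -1)
        d2 = torch.cdist(feats, c) ** 2
        soft = torch.softmax(-d2 / self.tau, dim=-1)   # (B, T, K) soft assignments
        tokens = soft @ self.centroids                 # (B, T, D) soft token embeddings
        return tokens, soft.argmax(-1)                 # embeddings + hard token ids

tok = DifferentiableKMeans()
emb, ids = tok(torch.randn(2, 50, 768))   # gradients flow back into the centroids
print(emb.shape, ids.shape)
```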
[313] Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means
Kentaro Onda, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Main category: cs.SD
TL;DR: Proposes Phonological Tokenizer - fine-tunes phonetic tokens via differentiable k-means with multi-task ASR and resynthesis to capture both linguistic and prosodic information while discarding speaker identity.
Details
Motivation: Current discrete speech tokens (acoustic vs phonetic) are insufficient for prosody-sensitive tasks like speechLMs. Acoustic tokens retain too much speaker info, phonetic tokens lose prosody. Need tokens that capture phonological info (linguistic + prosodic) while abstracting away speaker identity.Method: Fine-tunes phonetic tokens using differentiable k-means with multi-task objective combining ASR (for linguistic content) and speech resynthesis (for prosodic information). Creates tokens that retain phonological information while discarding speaker identity.
Result: Experimental validation on diverse tasks confirms tokens retain both linguistic and prosodic (phonological) information while appropriately discarding speaker identity.
Conclusion: Phonological Tokenizer provides better discrete token representation for prosody-sensitive tasks like speechLMs by capturing essential phonological information while abstracting unnecessary speaker details.
Abstract: In recent years, there has been growing interest in representing speech with discrete tokens, which serve as pseudo-text for speech language models (speechLMs) and as efficient intermediate representations for downstream tasks. These tokens are typically categorized as acoustic and phonetic tokens: the former holds detailed acoustic information for reconstruction while the latter mainly captures linguistic content. In human speech communication, however, unnecessary acoustic details such as speaker information are abstracted, while both linguistic and prosodic information are utilized for speech comprehension and production. Given this, neither type of token seems an ideal representation for tasks sensitive to prosody, such as speechLMs. In this study, we propose the Phonological Tokenizer, a method that fine-tunes phonetic tokens via differentiable k-means with a multi-task objective of ASR and speech resynthesis. Experimental validation on diverse tasks confirms that our tokens retain phonological (both linguistic and prosodic) information while appropriately discarding speaker identity.
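Building on the differentiable k-means sketch above, the multi-objective fine-tuning amounts to summing an ASR loss (linguistic content) and a resynthesis loss (prosody) through the soft tokens. The placeholder linear heads, MSE losses, and 0.5/0.5 weighting below are illustrative stand-ins for the paper's actual ASR and resynthesis objectives.

```python
import torch
import torch.nn as nn

centroids = nn.Parameter(torch.randn(100, 768))   # trainable token centroids
asr_head = nn.Linear(768, 32)                     # stand-in for an ASR decoder
syn_head = nn.Linear(768, 80)                     # stand-in for a resynthesis decoder
opt = torch.optim.Adam([centroids] + list(asr_head.parameters())
                       + list(syn_head.parameters()))

feats = torch.randn(2, 50, 768)                   # SSL features
d2 = torch.cdist(feats, centroids.unsqueeze(0).expand(2, -1, -1)) ** 2
soft = torch.softmax(-d2 / 0.1, dim=-1)           # differentiable k-means step
tokens = soft @ centroids                         # soft token embeddings
loss = 0.5 * ((asr_head(tokens) - torch.randn(2, 50, 32)) ** 2).mean() \
     + 0.5 * ((syn_head(tokens) - torch.randn(2, 50, 80)) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()                                        # both objectives move the centroids
```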
[314] SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment
Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin
Main category: cs.SD
TL;DR: SingMOS-Pro is an enhanced dataset for automatic singing quality assessment that extends the previous SingMOS with more diverse annotations (lyrics, melody, overall quality) across 7,981 singing clips from 41 models, providing reliable human ratings for benchmarking evaluation methods.
Details
Motivation: Current singing voice generation lacks effective evaluation methods - human listening tests are expensive and time-consuming, while existing objective metrics capture only limited perceptual aspects of singing quality.Method: Created SingMOS-Pro dataset with 7,981 singing clips from 41 models across 12 datasets, each rated by at least five experienced annotators for lyrics, melody, and overall quality. Investigated strategies for utilizing MOS data with heterogeneous standards and benchmarked existing evaluation methods.
Result: The dataset provides comprehensive annotations with broader coverage and greater diversity than previous versions. Established strong baselines and practical references for future research in singing quality assessment.
Conclusion: SingMOS-Pro addresses the critical challenge of evaluating singing quality by providing a reliable, publicly available dataset with multi-dimensional annotations, enabling better benchmarking and development of automatic singing quality assessment methods.
Abstract: Singing voice generation progresses rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time-consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro extends the annotations of the additional data to include lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning from early systems to recent state-of-the-art approaches. Each clip is rated by at least five experienced annotators to ensure reliability and consistency. Furthermore, we investigate strategies for effectively utilizing MOS data annotated under heterogeneous standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset is publicly available at https://huggingface.co/datasets/TangRain/SingMOS-Pro.
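A hypothetical loading sketch for the released dataset; the splits and column names are assumptions, so consult the dataset card at the URL above before relying on them.

```python
from datasets import load_dataset

# Load the public SingMOS-Pro dataset from the Hugging Face Hub.
ds = load_dataset("TangRain/SingMOS-Pro")
print(ds)                      # inspect the available splits and features
first_split = next(iter(ds.values()))
print(first_split[0])          # e.g. audio plus lyrics/melody/overall MOS ratings
```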
[315] Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding
Xin Zhang, Lin Li, Xiangni Lu, Jianquan Liu, Kong Aik Lee
Main category: cs.SD
TL;DR: SimWhisper-Codec: A semantic-first speech codec that adapts the Whisper ASR model for high-fidelity acoustic reconstruction without external supervision, achieving better semantic preservation and acoustic quality than existing codecs.
Details
Motivation: Existing speech codecs face a fundamental conflict between acoustic fidelity and semantic preservation. Current methods typically augment acoustic codecs with complex semantic supervision, but this paper explores the opposite direction: starting from a semantically-capable model and adapting it for acoustic reconstruction.Method: The authors propose SimWhisper-Codec, which leverages a frozen, simplified Whisper encoder without requiring external supervision. Through empirical analysis, they discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned ASR model.
Result: SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically-supervised codecs like Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of the semantic-first approach.
Conclusion: The semantic-first approach of adapting a semantically-capable model for acoustic reconstruction is effective, and SimWhisper-Codec demonstrates that targeted simplification of Whisper can create a balanced speech codec that outperforms existing methods without requiring complex external supervision.
Abstract: Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically-capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances the semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically-supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
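A skeleton of the semantic-first setup, assuming the Hugging Face openai/whisper-base checkpoint as a stand-in; the paper additionally simplifies the encoder architecture, and the trainable quantizer and acoustic decoder are only indicated here.

```python
import torch
from transformers import WhisperModel

# Frozen Whisper encoder supplies semantically rich features; a small
# quantizer + decoder would then be trained for acoustic reconstruction.
whisper = WhisperModel.from_pretrained("openai/whisper-base")
encoder = whisper.get_encoder()
for p in encoder.parameters():
    p.requires_grad = False                # frozen, no external supervision

feats = torch.randn(1, 80, 3000)           # log-mel input expected by Whisper
with torch.no_grad():
    h = encoder(feats).last_hidden_state   # (1, 1500, 512) for whisper-base
print(h.shape)
# h would next pass through a trainable quantizer and an acoustic decoder.
```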
[316] Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks
Abdul Hannan, Daniele Falavigna, Shah Nawaz, Mubashir Noman, Markus Schedl, Alessio Brutti
Main category: cs.SD
TL;DR: DLD framework combines knowledge distillation with layer dropping for dynamic speech networks, achieving better performance-computation trade-off with reduced training time.
Details
Motivation: Edge devices need dynamic architectures that adapt to varying resource constraints. Existing layer dropping methods negatively impact performance in both low and high dropping scenarios, deteriorating the performance-computation trade-off.Method: Proposes a distillation-based layer dropping (DLD) framework that combines knowledge distillation with layer dropping in an end-to-end fashion for dynamic speech networks.
Result: Achieves state-of-the-art performance, reducing word error rate by 9.32% for high dropping cases and 2.25% for no dropping cases, with 33.3% reduction in training time. Tested on conformer and WavLM models across three public benchmarks.
Conclusion: DLD framework effectively addresses the limitations of existing layer dropping methods, providing better performance-computation trade-offs for dynamic speech networks on edge devices.
Abstract: Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network along with reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly impact the dynamic model’s performance for low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by 9.32% and 2.25% for high and no dropping cases with a 33.3% reduction in training time.
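A rough sketch of one training step combining stochastic layer dropping with a KL distillation target from the full-depth teacher; the drop probability, temperature, and toy blocks are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def dld_step(layers, teacher_logits, x, drop_p=0.3, temperature=2.0):
    # Forward pass with stochastic layer dropping: each block is skipped
    # with probability drop_p, so one network serves many depths.
    h = x
    for layer in layers:
        if random.random() > drop_p:
            h = layer(h)
    # Distillation target: match the full-depth teacher's softened outputs.
    t = temperature
    return F.kl_div(F.log_softmax(h / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)

# Toy shape-preserving blocks standing in for conformer/WavLM layers:
layers = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(4)]
print(dld_step(layers, torch.randn(8, 16), torch.randn(8, 16)))
```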
cs.LG
[317] NavFormer: IGRF Forecasting in Moving Coordinate Frames
Yoontae Hwang, Dongwoo Lee, Minseok Choi, Yong Sup Ihn, Daham Kim, Deok-Young Lee
Main category: cs.LG
TL;DR: NavFormer uses rotation-invariant features and a Canonical SPD module to forecast invariant IGRF total intensity from triad magnetometer data, achieving lower error than baselines across various training scenarios.
Details
Motivation: Triad magnetometer components vary with sensor attitude even when the IGRF total intensity remains constant, creating challenges for accurate forecasting that requires rotation-invariant approaches.Method: Uses rotation-invariant scalar features and a Canonical SPD module that stabilizes the spectrum of window-level second moments of triads without sign discontinuities. The module builds a canonical frame from a Gram matrix per window and applies state-dependent spectral scaling in original coordinates.
Result: Experiments across five flights show lower error than strong baselines in standard training, few-shot training, and zero-shot transfer scenarios.
Conclusion: NavFormer provides robust IGRF forecasting for autonomous navigators by effectively handling rotation-variant magnetometer data through invariant feature extraction and spectral stabilization techniques.
Abstract: Triad magnetometer components change with sensor attitude even when the IGRF total intensity target stays invariant. NavFormer forecasts this invariant target with rotation-invariant scalar features and a Canonical SPD module that stabilizes the spectrum of window-level second moments of the triads without sign discontinuities. The module builds a canonical frame from a Gram matrix per window and applies state-dependent spectral scaling in the original coordinates. Experiments across five flights show lower error than strong baselines in standard training, few-shot training, and zero-shot transfer. The code is available at: https://anonymous.4open.science/r/NavFormer-Robust-IGRF-Forecasting-for-Autonomous-Navigators-0765
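The Canonical SPD idea can be sketched as: per-window Gram matrix, eigendecomposition for a canonical frame, and spectral rescaling mapped back to the original coordinates. The fixed power-law scaling below is an assumption standing in for the paper's state-dependent scaling.

```python
import numpy as np

def canonical_spd(window, gamma=0.5, eps=1e-6):
    # Gram (second-moment) matrix of one window of triad samples (T, 3).
    G = window.T @ window / len(window)
    w, V = np.linalg.eigh(G)                 # canonical frame from eigenvectors
    w = np.maximum(w, eps)
    # V @ f(L) @ V.T is invariant to eigenvector sign flips, which is what
    # avoids sign discontinuities across neighboring windows.
    return V @ np.diag(w ** gamma) @ V.T     # spectral scaling, original coords

print(canonical_spd(np.random.randn(256, 3)))  # one magnetometer-triad window
```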
[318] Latent Structural Similarity Networks for Unsupervised Discovery in Multivariate Time Series
Olusegun Owoeye
Main category: cs.LG
TL;DR: Proposes a task-agnostic discovery layer for multivariate time series that builds relational hypothesis graphs over entities without assumptions about linearity, stationarity, or downstream objectives.
Details
Motivation: To create an analyzable abstraction for multivariate time series that can discover candidate relationships between entities without being constrained by specific modeling assumptions or optimization objectives, enabling exploratory analysis of complex temporal relationships.Method: Uses unsupervised sequence-to-sequence autoencoder to learn window-level sequence representations, aggregates these into entity-level embeddings, then induces a sparse similarity network by thresholding latent-space similarity measures to create relational hypothesis graphs.
Result: Demonstrated on hourly cryptocurrency returns dataset, showing latent similarity induces coherent network structure; classical econometric relations used as external diagnostic to contextualize discovered edges.
Conclusion: The framework provides a task-agnostic discovery layer that compresses pairwise search space and exposes candidate relationships for investigation, serving as an analyzable abstraction rather than a predictive model optimized for specific decision rules.
Abstract: This paper proposes a task-agnostic discovery layer for multivariate time series that constructs a relational hypothesis graph over entities without assuming linearity, stationarity, or a downstream objective. The method learns window-level sequence representations using an unsupervised sequence-to-sequence autoencoder, aggregates these representations into entity-level embeddings, and induces a sparse similarity network by thresholding a latent-space similarity measure. This network is intended as an analyzable abstraction that compresses the pairwise search space and exposes candidate relationships for further investigation, rather than as a model optimized for prediction, trading, or any decision rule. The framework is demonstrated on a challenging real-world dataset of hourly cryptocurrency returns, illustrating how latent similarity induces coherent network structure; a classical econometric relation is also reported as an external diagnostic lens to contextualize discovered edges.
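The graph-induction step reduces to a few lines once entity-level embeddings exist; the cosine measure and the threshold value below are illustrative choices.

```python
import numpy as np

def similarity_network(entity_embeddings, threshold=0.8):
    """Induce a sparse relational hypothesis graph: pairwise cosine similarity
    between entity embeddings, thresholded into an adjacency matrix."""
    E = entity_embeddings / np.linalg.norm(entity_embeddings, axis=1, keepdims=True)
    S = E @ E.T                                        # pairwise cosine similarity
    A = (S > threshold) & ~np.eye(len(E), dtype=bool)  # drop self-loops
    return A

emb = np.random.randn(10, 32)    # stand-in for autoencoder-derived entity embeddings
print(similarity_network(emb).sum(), "candidate edges")
```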
[319] Variational Quantum Circuit-Based Reinforcement Learning for Dynamic Portfolio Optimization
Vincent Gurgul, Ying Chen, Stefan Lessmann
Main category: cs.LG
TL;DR: Quantum Reinforcement Learning (QRL) using Variational Quantum Circuits achieves comparable risk-adjusted performance to classical Deep RL with far fewer parameters, but practical deployment faces latency challenges.
Details
Motivation: To develop quantum analogues of classical reinforcement learning algorithms for dynamic portfolio optimization, leveraging quantum computing's potential advantages in parameter efficiency and robustness in complex, non-stationary financial environments.Method: Implemented quantum versions of Deep Deterministic Policy Gradient (DDPG) and Deep Q-Network (DQN) using Variational Quantum Circuits, and empirically evaluated them on real-world financial data.
Result: Quantum agents achieved risk-adjusted performance comparable to or exceeding classical Deep RL models with several orders of magnitude more parameters, and exhibited reduced variability across market regimes indicating robust behavior.
Conclusion: QRL is theoretically competitive with state-of-the-art classical RL and may become practically advantageous as deployment overheads diminish, positioning it as a promising paradigm for dynamic decision-making in complex environments like financial markets.
Abstract: This paper presents a Quantum Reinforcement Learning (QRL) solution to the dynamic portfolio optimization problem based on Variational Quantum Circuits. The implemented QRL approaches are quantum analogues of the classical neural-network-based Deep Deterministic Policy Gradient and Deep Q-Network algorithms. Through an empirical evaluation on real-world financial data, we show that our quantum agents achieve risk-adjusted performance comparable to, and in some cases exceeding, that of classical Deep RL models with several orders of magnitude more parameters. In addition to improved parameter efficiency, quantum agents exhibit reduced variability across market regimes, indicating robust behaviour under changing conditions. However, while quantum circuit execution is inherently fast at the hardware level, practical deployment on cloud-based quantum systems introduces substantial latency, making end-to-end runtime currently dominated by infrastructural overhead and limiting practical applicability. Taken together, our results suggest that QRL is theoretically competitive with state-of-the-art classical reinforcement learning and may become practically advantageous as deployment overheads diminish. This positions QRL as a promising paradigm for dynamic decision-making in complex, high-dimensional, and non-stationary environments such as financial markets. The complete codebase is released as open source at: https://github.com/VincentGurgul/qrl-dpo-public
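As a rough illustration of a VQC-based Q-function (not the paper's exact ansatz), the PennyLane sketch below angle-encodes a state and reads one expectation value per action; the qubit and layer counts are assumptions.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def q_values(state, weights):
    for i in range(n_qubits):
        qml.RY(state[i], wires=i)                  # feature encoding
    for layer in range(n_layers):
        for i in range(n_qubits):
            qml.RY(weights[layer, i], wires=i)     # trainable rotations
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])             # entanglement
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]  # one per action

weights = np.array(np.random.uniform(0, np.pi, (n_layers, n_qubits)),
                   requires_grad=True)             # few parameters vs. deep nets
print(q_values(np.array([0.1, 0.2, 0.3, 0.4]), weights))
```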
[320] VAE with Hyperspherical Coordinates: Improving Anomaly Detection from Hypervolume-Compressed Latent Space
Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado
Main category: cs.LG
TL;DR: VAEs struggle with anomaly detection in high-dimensional latent spaces due to hyperspherical distribution patterns. The paper proposes using hyperspherical coordinates to compress latent vectors, improving unsupervised and OOD anomaly detection performance.
Details
Motivation: Standard VAEs have issues detecting anomalies in high-dimensional latent spaces because latent vectors tend to distribute on hypersphere equators, making anomaly detection challenging. The exponential growth of hypervolume with dimension affects VAE generative capacity.Method: Proposes formulating VAE latent variables using hyperspherical coordinates, which allows compressing latent vectors towards specific directions on the hypersphere. This creates a more expressive approximate posterior distribution.
Result: The method improves both fully unsupervised and out-of-distribution anomaly detection capabilities of VAEs. Achieves best performance on considered datasets, outperforming existing methods on complex real-world datasets (Mars Rover camera landscapes, unusual galaxies) and standard benchmarks (Cifar10, ImageNet subsets).
Conclusion: Using hyperspherical coordinates for VAE latent variables addresses high-dimensional statistical challenges, enabling better anomaly detection by creating more expressive posterior distributions and overcoming hyperspherical distribution limitations.
Abstract: Variational autoencoders (VAE) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once trained, one can hope to detect out-of-distribution (abnormal) latent vectors, but several issues arise when the latent space is high-dimensional. This includes an exponential growth of the hypervolume with the dimension, which severely affects the generative capacity of the VAE. In this paper, we draw insights from high-dimensional statistics: in these regimes, the latent vectors of a standard VAE are distributed on the 'equators' of a hypersphere, challenging the detection of anomalies. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards a given direction on the hypersphere, thereby allowing for a more expressive approximate posterior. We show that this improves both the fully unsupervised and OOD anomaly detection ability of the VAE, achieving the best performance on the datasets we considered and outperforming existing methods. For the unsupervised and OOD modalities, respectively, these are: i) detecting unusual landscapes from the Mars Rover camera and unusual galaxies from ground-based imagery (complex, real-world datasets); ii) standard benchmarks like Cifar10 and subsets of ImageNet as the in-distribution (ID) class.
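The coordinate change at the heart of the method converts a Cartesian latent vector into a radius plus n−1 angles; a reparameterized posterior over these coordinates can then concentrate mass toward a chosen direction. A minimal NumPy conversion using the standard n-sphere convention:

```python
import numpy as np

def cartesian_to_hyperspherical(x, eps=1e-12):
    """x in R^n -> (radius, n-1 angles): phi_i = arccos(x_i / ||x[i:]||),
    with the last angle's sign taken from the final coordinate."""
    n = x.shape[-1]
    r = np.linalg.norm(x)
    phis = []
    for i in range(n - 1):
        tail = np.linalg.norm(x[i:])
        phis.append(np.arccos(np.clip(x[i] / max(tail, eps), -1.0, 1.0)))
    if x[-1] < 0:                       # last angle lives in [0, 2*pi)
        phis[-1] = 2 * np.pi - phis[-1]
    return r, np.array(phis)

r, phis = cartesian_to_hyperspherical(np.random.randn(8))
print(r, phis)
```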
[321] IPBC: An Interactive Projection-Based Framework for Human-in-the-Loop Semi-Supervised Clustering of High-Dimensional Data
Mohammad Zare
Main category: cs.LG
TL;DR: IPBC is an interactive clustering framework that combines nonlinear projection with human feedback to iteratively improve cluster quality through visual analysis and simple constraints.
Details
Motivation: High-dimensional datasets are difficult to cluster effectively due to distance metric issues and cluster collapse in lower dimensions. Traditional dimensionality reduction produces static embeddings with limited interpretability and no way to incorporate analyst intuition.Method: Interactive Project-Based Clustering (IPBC) integrates nonlinear projection with a feedback loop where users adjust viewing angles and provide must-link/cannot-link constraints. These constraints reshape the projection objective to pull related points closer and push unrelated points apart. The optimized 2D layout enables conventional clustering, with explainability mapping clusters back to original features.
Result: Experiments show that only a small number of interactive refinement steps can substantially improve cluster quality across various benchmark datasets.
Conclusion: IPBC transforms clustering into a collaborative discovery process where machine representation and human insight reinforce each other, making high-dimensional clustering more effective and interpretable.
Abstract: High-dimensional datasets are increasingly common across scientific and industrial domains, yet they remain difficult to cluster effectively due to the diminishing usefulness of distance metrics and the tendency of clusters to collapse or overlap when projected into lower dimensions. Traditional dimensionality reduction techniques generate static 2D or 3D embeddings that provide limited interpretability and do not offer a mechanism to leverage the analyst’s intuition during exploration. To address this gap, we propose Interactive Project-Based Clustering (IPBC), a framework that reframes clustering as an iterative human-guided visual analysis process. IPBC integrates a nonlinear projection module with a feedback loop that allows users to modify the embedding by adjusting viewing angles and supplying simple constraints such as must-link or cannot-link relationships. These constraints reshape the objective of the projection model, gradually pulling semantically related points closer together and pushing unrelated points further apart. As the projection becomes more structured and expressive through user interaction, a conventional clustering algorithm operating on the optimized 2D layout can more reliably identify distinct groups. An additional explainability component then maps each discovered cluster back to the original feature space, producing interpretable rules or feature rankings that highlight what distinguishes each cluster. Experiments on various benchmark datasets show that only a small number of interactive refinement steps can substantially improve cluster quality. Overall, IPBC turns clustering into a collaborative discovery process in which machine representation and human insight reinforce one another.
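The feedback loop hinges on turning must-link/cannot-link constraints into forces on the 2D layout; a minimal sketch, with an assumed hinge margin for cannot-link pairs:

```python
import torch

def constraint_loss(Z, must_link, cannot_link, margin=2.0):
    """Analyst feedback as embedding-space forces: must-link pairs are pulled
    together, cannot-link pairs pushed beyond a margin. Z is the (N, 2)
    projected layout; the pair lists and margin are illustrative."""
    loss = Z.new_zeros(())
    for i, j in must_link:
        loss = loss + ((Z[i] - Z[j]) ** 2).sum()
    for i, j in cannot_link:
        d = torch.norm(Z[i] - Z[j])
        loss = loss + torch.clamp(margin - d, min=0.0) ** 2
    return loss

Z = torch.randn(20, 2, requires_grad=True)
loss = constraint_loss(Z, must_link=[(0, 1)], cannot_link=[(0, 5)])
loss.backward()   # gradients reshape the projection before re-clustering
```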
[322] CP Loss: Channel-wise Perceptual Loss for Time Series Forecasting
Yaohua Zha, Chunlin Fan, Peiyuan Liu, Yong Jiang, Tao Dai, Hai Wu, Shu-Tao Xia
Main category: cs.LG
TL;DR: Proposes Channel-wise Perceptual Loss (CP Loss) for multi-channel time-series forecasting that learns channel-specific perceptual spaces instead of using uniform MSE loss.
Details
Motivation: Existing forecasting models use channel-agnostic loss functions like MSE that apply uniform metrics across all channels, failing to capture channel-specific dynamics like sharp fluctuations or trend shifts in heterogeneous multi-channel time-series data.Method: Designs learnable channel-wise filters that decompose raw signals into disentangled multi-scale representations, forming channel-specific perceptual spaces. The filters are jointly optimized with the main forecasting model to ensure perceptual spaces are prediction-oriented. Losses are calculated within these learned perceptual spaces.
Result: The abstract reports no quantitative results; code is available at https://github.com/zyh16143998882/CP_Loss.
Conclusion: CP Loss addresses heterogeneity in multi-channel time-series by learning channel-specific perceptual spaces tailored to each channel’s characteristics, improving capture of channel-specific dynamics compared to uniform loss functions.
Abstract: Multi-channel time-series data, prevalent across diverse applications, is characterized by significant heterogeneity in its different channels. However, existing forecasting models are typically guided by channel-agnostic loss functions like MSE, which apply a uniform metric across all channels. This often fails to capture channel-specific dynamics such as sharp fluctuations or trend shifts. To address this, we propose a Channel-wise Perceptual Loss (CP Loss). Its core idea is to learn a unique perceptual space for each channel that is adapted to its characteristics, and to compute the loss within this space. Specifically, we first design a learnable channel-wise filter that decomposes the raw signal into disentangled multi-scale representations, which form the basis of our perceptual space. Crucially, the filter is optimized jointly with the main forecasting model, ensuring that the learned perceptual space is explicitly oriented towards the prediction task. Finally, losses are calculated within these perceptual spaces to optimize the model. Code is available at https://github.com/zyh16143998882/CP_Loss.
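A minimal sketch of the channel-wise perceptual space using a learnable depthwise convolution (one filter bank per channel); the kernel size, filters per channel, and equal weighting of raw and perceptual terms are assumptions.

```python
import torch
import torch.nn as nn

class CPLoss(nn.Module):
    """Channel-wise perceptual loss sketch: a depthwise Conv1d gives each
    channel its own learnable filter bank, and the error is measured in
    that filtered (perceptual) space plus the raw space."""
    def __init__(self, n_channels, filters_per_channel=4, kernel_size=9):
        super().__init__()
        self.filt = nn.Conv1d(n_channels, n_channels * filters_per_channel,
                              kernel_size, padding=kernel_size // 2,
                              groups=n_channels)   # channel-specific filters

    def forward(self, pred, target):               # (B, C, T) forecasts
        raw = ((pred - target) ** 2).mean()
        perc = ((self.filt(pred) - self.filt(target)) ** 2).mean()
        return raw + perc  # joint optimization keeps filters prediction-oriented

cp = CPLoss(n_channels=7)
print(cp(torch.randn(2, 7, 96), torch.randn(2, 7, 96)))
```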
[323] How Much Temporal Modeling is Enough? A Systematic Study of Hybrid CNN-RNN Architectures for Multi-Label ECG Classification
Alireza Jafari, Fatemeh Jafari
Main category: cs.LG
TL;DR: CNN with single BiLSTM layer achieves best balance for multi-label ECG classification, outperforming deeper recurrent architectures while maintaining clinical relevance.
Details
Motivation: Multi-label ECG classification faces challenges from coexisting cardiac conditions, class imbalance, and long-range temporal dependencies. While deep recurrent architectures are increasingly used, their necessity and clinical justification haven't been rigorously examined.Method: Systematic comparative evaluation of CNNs combined with various recurrent configurations (LSTM, GRU, BiLSTM, and stacked variants) for multi-label ECG classification on PTB-XL dataset with 23 diagnostic categories. CNN serves as morphology baseline, with recurrent layers progressively integrated to assess temporal modeling contributions.
Result: CNN with single BiLSTM layer achieved best trade-off: Hamming loss (0.0338), macro-AUPRC (0.4715), micro-F1 (0.6979), subset accuracy (0.5723). Stacked recurrent models occasionally improved recall for rare classes but showed diminishing returns and potential generalization degradation.
Conclusion: Architectural alignment with ECG’s intrinsic temporal structure, not increased recurrent depth, is key for robust performance. Single BiLSTM layer provides optimal balance between performance and complexity for clinical deployment.
Abstract: Accurate multi-label classification of electrocardiogram (ECG) signals remains challenging due to the coexistence of multiple cardiac conditions, pronounced class imbalance, and long-range temporal dependencies in multi-lead recordings. Although recent studies increasingly rely on deep and stacked recurrent architectures, the necessity and clinical justification of such architectural complexity have not been rigorously examined. In this work, we perform a systematic comparative evaluation of convolutional neural networks (CNNs) combined with multiple recurrent configurations, including LSTM, GRU, Bidirectional LSTM (BiLSTM), and their stacked variants, for multi-label ECG classification on the PTB-XL dataset comprising 23 diagnostic categories. The CNN component serves as a morphology-driven baseline, while recurrent layers are progressively integrated to assess their contribution to temporal modeling and generalization performance. Experimental results indicate that a CNN integrated with a single BiLSTM layer achieves the most favorable trade-off between predictive performance and model complexity. This configuration attains superior Hamming loss (0.0338), macro-AUPRC (0.4715), micro-F1 score (0.6979), and subset accuracy (0.5723) compared with deeper recurrent combinations. Although stacked recurrent models occasionally improve recall for specific rare classes, our results provide empirical evidence that increasing recurrent depth yields diminishing returns and may degrade generalization due to reduced precision and overfitting. These findings suggest that architectural alignment with the intrinsic temporal structure of ECG signals, rather than increased recurrent depth, is a key determinant of robust performance and clinically relevant deployment.
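The best-performing configuration is straightforward to express; the sketch below assumes illustrative channel counts and kernel sizes rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """CNN morphology extractor followed by a single BiLSTM layer and a
    multi-label head (train with BCEWithLogitsLoss)."""
    def __init__(self, n_leads=12, n_classes=23):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_leads, 64, 7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2))
        self.bilstm = nn.LSTM(128, 64, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                  # x: (B, 12, T) multi-lead ECG
        h = self.cnn(x).transpose(1, 2)    # (B, T', 128)
        h, _ = self.bilstm(h)
        return self.head(h[:, -1])         # one logit per diagnostic label

model = CNNBiLSTM()
print(model(torch.randn(2, 12, 1000)).shape)   # torch.Size([2, 23])
```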
[324] The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
Ren Zhuang, Ben Wang, Shuifa Sun
Main category: cs.LG
TL;DR: TGR is a training-free framework that uses manifold-informed latent foresight search with memory-efficient chunk-wise KV cache resets to improve long chain-of-thought reasoning without high computational costs.
Details
Motivation: Existing approaches for scaling test-time compute in long chain-of-thought reasoning face a fundamental trade-off: either incurring high training expenses or producing redundant trajectories, limiting practical deployment.Method: TGR performs manifold-informed latent foresight search under strict memory bounds. It scores candidate latent anchors using lightweight look-ahead estimates combined with soft geometric regularizers for smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length.
Result: On challenging math and code benchmarks, TGR improves robust trajectory coverage (measured by area under Pass@k curve) by up to 13 points on Qwen3-8B, with only 1.1-1.3x computational overhead.
Conclusion: TGR provides an effective training-free solution that breaks the trade-off between computational cost and coverage quality in long chain-of-thought reasoning, achieving significant performance improvements with minimal overhead.
Abstract: Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@$k$ curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1–1.3 times.
[325] Time series forecasting with Hahn Kolmogorov-Arnold networks
Md Zahidul Hasan, A. Ben Hamza, Nizar Bouguila
Main category: cs.LG
TL;DR: HaKAN: A lightweight, interpretable model using Hahn polynomial-based KANs for multivariate time series forecasting, outperforming recent Transformer and MLP methods.
Details
Motivation: Transformers have quadratic complexity and permutation-equivariant attention limitations, while MLPs suffer from spectral bias. There's a need for more efficient and interpretable alternatives for long-term time series forecasting.Method: HaKAN uses Hahn polynomial-based learnable activation functions in Kolmogorov-Arnold Networks. It integrates channel independence, patching, Hahn-KAN blocks with residual connections, and a bottleneck structure. The Hahn-KAN block has inter- and intra-patch KAN layers to capture global and local temporal patterns.
Result: Extensive experiments on various forecasting benchmarks show HaKAN consistently outperforms recent state-of-the-art methods. Ablation studies validate the effectiveness of its core components.
Conclusion: HaKAN provides a versatile, lightweight, and interpretable alternative to Transformers and MLPs for multivariate time series forecasting, addressing their computational and representational limitations while achieving superior performance.
Abstract: Recent Transformer- and MLP-based models have demonstrated strong performance in long-term time series forecasting, yet Transformers remain limited by their quadratic complexity and permutation-equivariant attention, while MLPs exhibit spectral bias. We propose HaKAN, a versatile model based on Kolmogorov-Arnold Networks (KANs), leveraging Hahn polynomial-based learnable activation functions and providing a lightweight and interpretable alternative for multivariate time series forecasting. Our model integrates channel independence, patching, a stack of Hahn-KAN blocks with residual connections, and a bottleneck structure comprised of two fully connected layers. The Hahn-KAN block consists of inter- and intra-patch KAN layers to effectively capture both global and local temporal patterns. Extensive experiments on various forecasting benchmarks demonstrate that our model consistently outperforms recent state-of-the-art methods, with ablation studies validating the effectiveness of its core components.
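The KAN mechanism of learnable per-edge activations over a polynomial basis can be sketched as below. A Chebyshev recurrence is used as a stand-in for the Hahn basis (Hahn polynomials are discrete orthogonal polynomials with a more involved three-term recurrence), so this shows the mechanism, not the paper's exact basis.

```python
import torch
import torch.nn as nn

class PolyKANLayer(nn.Module):
    """KAN-style layer: each input-output edge gets a learnable activation
    expressed as coefficients over a fixed polynomial basis (Chebyshev here,
    as a stand-in for Hahn)."""
    def __init__(self, d_in, d_out, degree=4):
        super().__init__()
        self.degree = degree
        self.coef = nn.Parameter(torch.randn(d_in, d_out, degree + 1) * 0.1)

    def forward(self, x):                        # x: (B, d_in)
        x = torch.tanh(x)                        # squash into [-1, 1]
        T = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):      # Chebyshev recurrence
            T.append(2 * x * T[-1] - T[-2])
        basis = torch.stack(T, dim=-1)           # (B, d_in, degree+1)
        return torch.einsum("bid,iod->bo", basis, self.coef)

layer = PolyKANLayer(16, 8)
print(layer(torch.randn(4, 16)).shape)           # torch.Size([4, 8])
```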
[326] Analysis of Control Bellman Residual Minimization for Markov Decision Problem
Donghwan Lee, Hyukjun Yang
Main category: cs.LG
TL;DR: This paper establishes foundational results for Bellman residual minimization in policy optimization/control tasks, addressing a gap where previous work focused mainly on policy evaluation.
Details
Motivation: Bellman residual minimization has advantages over dynamic programming (more stable convergence with function approximation) but has received less attention for control tasks compared to policy evaluation. The authors aim to bridge this gap.Method: The paper establishes foundational results for control Bellman residual minimization for policy optimization, though the abstract doesn’t specify the exact methods used.
Result: The abstract states the paper establishes foundational results, but doesn’t specify concrete findings - likely theoretical results about convergence, stability, or performance guarantees for Bellman residual minimization in control settings.
Conclusion: Bellman residual minimization for policy optimization is worth investigating despite practical challenges, and this paper provides foundational theoretical results to advance this under-explored area.
Abstract: Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for control Bellman residual minimization for policy optimization.
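For concreteness, the control Bellman residual objective the paper analyzes can be written directly; note that, unlike temporal-difference learning, the gradient flows through both Q(s, a) and the bootstrapped max term. The toy network and data below are illustrative.

```python
import torch
import torch.nn as nn

def control_bellman_residual(q_net, batch, gamma=0.99):
    """Squared Bellman residual for the optimality equation:
    L = E[(Q(s, a) - r - gamma * max_a' Q(s', a'))^2], minimized directly
    rather than iterated as in dynamic programming."""
    s, a, r, s_next = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    target = r + gamma * q_net(s_next).max(dim=1).values
    return ((q_sa - target) ** 2).mean()  # gradient flows through both terms

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
batch = (torch.randn(8, 4), torch.randint(0, 2, (8,)),
         torch.randn(8), torch.randn(8, 4))
print(control_bellman_residual(q_net, batch))
```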
[327] Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model
Zhiyu An, Wan Du
Main category: cs.LG
TL;DR: Homomorphism Error (HE) is a structural metric that quantifies deviations from approximate homomorphisms between expression algebra and model representations, predicting compositional generalization failure and serving as a training signal.
Details
Motivation: Compositional generalization remains challenging for neural networks, and behavioral evaluations offer limited insight into why failures occur at the representational level. The paper aims to develop a structural metric to understand and improve compositional generalization.Method: Introduces Homomorphism Error (HE) metric that measures deviations from approximate homomorphisms between expression algebra and model hidden-state space. Instantiates HE for two compositional operators in SCAN-style tasks: modifier HE for unary composition and sequence HE for binary composition. Tests HE across controlled experiments with small decoder-only Transformers under various conditions including noise injection, model depth variations, training data coverage, and noise tokens.
Result: HE predicts out-of-distribution compositional generalization with R^2 = 0.73 correlation between modifier HE and OOD accuracy. Model depth has minimal effect on HE or OOD accuracy, training data coverage shows threshold effects, and noise tokens systematically increase HE. HE-regularized training significantly reduces both modifier HE (p = 1.1x10⁻⁴) and sequence HE (p = 0.001) and improves OOD accuracy (p = 0.023).
Conclusion: Homomorphism Error serves as both a diagnostic tool for understanding compositional generalization failures and an actionable training signal for improving compositional generalization in neural networks.
Abstract: Compositional generalization, the ability to interpret novel combinations of familiar components, remains a persistent challenge for neural networks. Behavioral evaluations reveal when models fail but offer limited insight into why failures arise at the representational level. We introduce Homomorphism Error (HE), a structural metric that quantifies deviations from approximate homomorphisms between the expression algebra and a model’s hidden-state space. We instantiate HE for two compositional operators in SCAN-style tasks: modifier HE for unary composition and sequence HE for binary composition, measured by learning representation-level operators that predict composed representations from their parts. Across controlled experiments with small decoder-only Transformers, HE predicts out-of-distribution (OOD) compositional generalization under noise injection, achieving R^2 = 0.73 correlation between modifier HE and OOD accuracy. Ablations show that model depth has minimal effect on either HE or OOD accuracy, training data coverage exhibits threshold effects (insufficient coverage sharply increases HE and degrades OOD performance), and randomly inserted noise tokens systematically increase HE. Finally, we test if HE-regularized training improves OOD accuracy. Experiments show that explicitly enforcing low modifier HE during training significantly reduces modifier HE (p = 1.1×10⁻⁴) and sequence HE (p = 0.001) and yields a statistically significant improvement in OOD accuracy (p = 0.023). Together, these results indicate the potential of HE to be both a diagnostic and an actionable training signal for improving compositional generalization. Code to reproduce our experiments is open-sourced.
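A minimal version of the HE measurement: fit a representation-level operator that predicts the composed representation from its parts and report the residual. The linear operator, optimizer settings, and toy additive composition below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def homomorphism_error(parts, composed, epochs=200, lr=1e-2):
    # Fit a representation-level operator W: parts -> composed; HE is the
    # residual error after fitting (low HE ~ approximately homomorphic).
    W = nn.Linear(parts.size(1), composed.size(1))
    opt = torch.optim.Adam(W.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((W(parts) - composed) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

h_a, h_b = torch.randn(64, 32), torch.randn(64, 32)
h_ab = h_a + h_b                  # toy composition that *is* linear, so HE ~ 0
print(homomorphism_error(torch.cat([h_a, h_b], dim=1), h_ab))
```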
[328] How Is Uncertainty Propagated in Knowledge Distillation?
Ziyao Cui, Jian Pei
Main category: cs.LG
TL;DR: Knowledge distillation suffers from stochastic uncertainties that distort learning. The paper analyzes uncertainty propagation across model classes, distinguishes inter- vs intra-student variance, and proposes variance-aware strategies (response averaging and inverse-variance weighting) to produce more stable students that better reflect teacher uncertainty.
Details
Motivation: Standard knowledge distillation collapses stochastic uncertainties (teacher outputs, student training, student inference) to single point estimates, which distorts what is learned. The paper aims to systematically study how uncertainty propagates through distillation and propose corrections to address these mismatches.Method: 1) Systematic study of uncertainty propagation across three model classes: linear regression, feed-forward neural networks, and LLMs. 2) Distinguish inter-student uncertainty (variance across independently distilled students) from intra-student uncertainty (variance of a single student’s predictive distribution). 3) Introduce two variance-aware strategies: averaging multiple teacher responses (reduces noise at O(1/k)) and variance-weighting (combines teacher and student estimates via inverse-variance weighting for minimum-variance estimator). 4) Provide formal guarantees in linear regression and validate in neural networks and LLMs.
Result: Standard single-response knowledge distillation suppresses intra-student variance while leaving substantial inter-student variability. The proposed variance-aware strategies demonstrate empirical gains in LLM distillation, including reduced systematic noise and hallucination. The methods produce more stable students that better reflect teacher uncertainty.
Conclusion: The paper reframes knowledge distillation as an uncertainty transformation problem. Variance-aware distillation produces more stable students that better capture teacher uncertainty, addressing the stochastic nature of the distillation process across different model classes.
Abstract: Knowledge distillation transfers behavior from a teacher to a student model, but the process is inherently stochastic: teacher outputs, student training, and student inference can all be random. Collapsing these uncertainties to a single point estimate can distort what is learned. We systematically study how uncertainty propagates through knowledge distillation across three representative model classes–linear regression, feed-forward neural networks, and large language models (LLMs)–and propose simple corrections. We distinguish inter-student uncertainty (variance across independently distilled students) from intra-student uncertainty (variance of a single student’s predictive distribution), showing that standard single-response knowledge distillation suppresses intra-student variance while leaving substantial inter-student variability. To address these mismatches, we introduce two variance-aware strategies: averaging multiple teacher responses, which reduces noise at rate $O(1/k)$, and variance-weighting, which combines teacher and student estimates via inverse-variance weighting to yield a minimum-variance estimator. We provide formal guarantees in linear regression, validate the methods in neural networks, and demonstrate empirical gains in LLM distillation, including reduced systematic noise and hallucination. These results reframe knowledge distillation as an uncertainty transformation and show that variance-aware distillation produces more stable students that better reflect teacher uncertainty.
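The variance-weighting strategy is a textbook inverse-variance combination; a minimal sketch with toy samples:

```python
import numpy as np

def variance_weighted_target(teacher_samples, student_samples):
    """Combine teacher and student estimates with weights proportional to
    1/variance, giving the minimum-variance combination of the two means."""
    mu_t, var_t = np.mean(teacher_samples), np.var(teacher_samples, ddof=1)
    mu_s, var_s = np.mean(student_samples), np.var(student_samples, ddof=1)
    w_t = (1 / var_t) / (1 / var_t + 1 / var_s)
    return w_t * mu_t + (1 - w_t) * mu_s

teacher = np.random.normal(1.0, 0.1, size=8)   # k teacher responses: O(1/k) noise
student = np.random.normal(1.0, 0.3, size=8)
print(variance_weighted_target(teacher, student))
```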
[329] ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks
Shalima Binta Manir, Tim Oates
Main category: cs.LG
TL;DR: The paper introduces ASEHybrid, a geometry-aware GNN architecture that improves performance on heterophilous graphs when graph structure provides label-relevant information beyond node features.
Details
Motivation: Standard GNNs struggle on low-homophily graphs, but homophily alone doesn't explain performance variations. Label informativeness (mutual information between adjacent node labels) better characterizes when graph structure is useful.Method: Developed a unified theoretical framework connecting curvature-guided rewiring and positional geometry through label informativeness. Instantiated ASEHybrid using Forman curvature and Laplacian positional encodings. Theoretically analyzed adjusted homophily, label informativeness, spectral behavior, and established convergence/stability guarantees for curvature-guided rewiring.
Result: Gains observed precisely on label-informative heterophilous benchmarks (Chameleon, Squirrel, Texas, Tolokers, Minesweeper) where graph structure provides label-relevant information beyond node features. No meaningful improvement in high-baseline regimes.
Conclusion: Geometry-aware GNNs can improve over feature-only baselines if and only if graph structure carries label-relevant information beyond node features. Curvature-guided rewiring reshapes information flow rather than increasing expressivity beyond 1-WL test.
Abstract: Standard message-passing graph neural networks (GNNs) often struggle on graphs with low homophily, yet homophily alone does not explain this behavior, as graphs with similar homophily levels can exhibit markedly different performance and some heterophilous graphs remain easy for vanilla GCNs. Recent work suggests that label informativeness (LI), the mutual information between labels of adjacent nodes, provides a more faithful characterization of when graph structure is useful. In this work, we develop a unified theoretical framework that connects curvature-guided rewiring and positional geometry through the lens of label informativeness, and instantiate it in a practical geometry-aware architecture, ASEHybrid. Our analysis provides a necessary-and-sufficient characterization of when geometry-aware GNNs can improve over feature-only baselines: such gains are possible if and only if graph structure carries label-relevant information beyond node features. Theoretically, we relate adjusted homophily and label informativeness to the spectral behavior of label signals under Laplacian smoothing, show that degree-based Forman curvature does not increase expressivity beyond the one-dimensional Weisfeiler–Lehman test but instead reshapes information flow, and establish convergence and Lipschitz stability guarantees for a curvature-guided rewiring process. Empirically, we instantiate ASEHybrid using Forman curvature and Laplacian positional encodings and conduct controlled ablations on Chameleon, Squirrel, Texas, Tolokers, and Minesweeper, observing gains precisely on label-informative heterophilous benchmarks where graph structure provides label-relevant information beyond node features, and no meaningful improvement in high-baseline regimes.
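For reference, the degree-based Forman curvature the paper builds on assigns each unweighted edge (u, v) the value 4 − deg(u) − deg(v); a short sketch of scoring edges for rewiring (the selection threshold would be a tunable choice):

```python
import networkx as nx

def forman_curvature(G):
    """Degree-based Forman curvature per edge: F(u, v) = 4 - deg(u) - deg(v).
    Highly negative edges are bottleneck candidates for curvature-guided
    rewiring."""
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

G = nx.karate_club_graph()
curv = forman_curvature(G)
print(min(curv.items(), key=lambda kv: kv[1]))   # most negatively curved edge
```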
[330] GraIP: A Benchmarking Framework For Neural Graph Inverse Problems
Semih Cantürk, Andrei Manolache, Arman Mielke, Chendi Qian, Antoine Siraudin, Christopher Morris, Mathias Niepert, Guy Wolf
Main category: cs.LG
TL;DR: The paper introduces Neural Graph Inverse Problem (GraIP), a unifying framework that formalizes various graph learning tasks as inverse problems to infer graph structures from data, rather than making predictions on given graphs.
Details
Motivation: Current graph learning methods for structure inference tasks are developed in isolated, task-specific ways without a unifying theoretical foundation, creating fragmentation in the field.Method: Proposes the GraIP conceptual framework that formalizes graph learning tasks as inverse problems, aiming to recover underlying graph structures by reversing forward processes (like message passing or network dynamics) that produced observed outputs.
Result: Demonstrates GraIP’s versatility across tasks including rewiring, causal discovery, and neural relational inference, proposes benchmark datasets and metrics, and evaluates existing baseline methods for these domains.
Conclusion: GraIP provides a unifying perspective that bridges disparate applications, offers principled approaches to structural learning in constrained/combinatorial settings, and encourages cross-pollination of methods across graph inverse problems.
Abstract: A wide range of graph learning tasks, such as structure discovery, temporal graph analysis, and combinatorial optimization, focus on inferring graph structures from data, rather than making predictions on given graphs. However, the respective methods to solve such problems are often developed in an isolated, task-specific manner and thus lack a unifying theoretical foundation. Here, we provide a stepping stone towards the formation of such a foundation and further development by introducing the Neural Graph Inverse Problem (GraIP) conceptual framework, which formalizes and reframes a broad class of graph learning tasks as inverse problems. Unlike discriminative approaches that directly predict target variables from given graph inputs, the GraIP paradigm addresses inverse problems, i.e., it relies on observational data and aims to recover the underlying graph structure by reversing the forward process, such as message passing or network dynamics, that produced the observed outputs. We demonstrate the versatility of GraIP across various graph learning tasks, including rewiring, causal discovery, and neural relational inference. We also propose benchmark datasets and metrics for each GraIP domain considered, and characterize and empirically evaluate existing baseline methods used to solve them. Overall, our unifying perspective bridges seemingly disparate applications and provides a principled approach to structural learning in constrained and combinatorial settings while encouraging cross-pollination of existing methods across graph inverse problems.
[331] One Global Model, Many Behaviors: Stockout-Aware Feature Engineering and Dynamic Scaling for Multi-Horizon Retail Demand Forecasting with a Cost-Aware Ordering Policy (VN2 Winner Report)
Bartosz Szabłowski
Main category: cs.LG
TL;DR: Winning VN2 inventory planning solution uses global multi-horizon forecasting with CatBoost GBDT and cost-aware ordering policy, achieving first place in competition.
Details
Motivation: Inventory planning for retail chains requires translating demand forecasts into ordering decisions that balance asymmetric shortages and holding costs, formalized in the VN2 Inventory Planning Challenge with weekly decision cycles and two-week lead times.Method: Two-stage predict-then-optimize pipeline: 1) Global multi-horizon forecasting model using CatBoost GBDT with stockout-aware feature engineering, per-series scaling, and time-based observation weights; 2) Cost-aware ordering policy that projects inventory and calculates target stock levels trading off shortage and holding costs.
Result: Achieved first place in the official VN2 competition simulation across six rounds by combining strong global forecasting with lightweight cost-aware policy.
Conclusion: The proposed approach effectively addresses retail inventory planning challenges and can be extended to real-world applications with additional operational constraints.
Abstract: Inventory planning for retail chains requires translating demand forecasts into ordering decisions under asymmetric shortage and holding costs. The VN2 Inventory Planning Challenge formalizes this setting as a weekly decision-making cycle with a two-week product delivery lead time, where the total cost is defined as the shortage cost plus the holding cost. This report presents the winning VN2 solution: a two-stage predict-then-optimize pipeline that combines a single global multi-horizon forecasting model with a cost-aware ordering policy. The forecasting model is trained in a global paradigm, jointly using all available time series. A gradient-boosted decision tree (GBDT) model implemented in CatBoost is used as the base learner. The model incorporates stockout-aware feature engineering to address censored demand during out-of-stock periods, per-series scaling to focus learning on time-series patterns rather than absolute levels, and time-based observation weights to reflect shifts in demand patterns. In the decision stage, inventory is projected to the start of the delivery week, and a target stock level is calculated that explicitly trades off shortage and holding costs. Evaluated by the official competition simulation in six rounds, the solution achieved first place by combining a strong global forecasting model with a lightweight cost-aware policy. Although developed for the VN2 setting, the proposed approach can be extended to real-world applications and additional operational constraints.
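The decision stage can be sketched as a newsvendor-style order-up-to rule: project the inventory position over the lead time and order up to the quantile set by the shortage/holding cost ratio. The cost values and the sample-quantile implementation are illustrative assumptions about the policy.

```python
import numpy as np

def order_quantity(demand_samples, on_hand, in_transit,
                   shortage_cost=1.0, holding_cost=0.2):
    """Cost-aware ordering sketch: the newsvendor critical ratio
    cs / (cs + ch) sets the target service quantile; order the gap between
    that target stock level and the projected inventory position."""
    critical_ratio = shortage_cost / (shortage_cost + holding_cost)
    target = np.quantile(demand_samples, critical_ratio)   # target stock level
    position = on_hand + in_transit                        # projected inventory
    return max(0.0, target - position)

demand = np.random.poisson(20, size=1000)   # forecast-driven demand scenarios
print(order_quantity(demand, on_hand=10, in_transit=5))
```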
[332] Toward Learning POMDPs Beyond Full-Rank Actions and State Observability
Seiji Shaw, Travis Manderson, Chad Kessens, Nicholas Roy
Main category: cs.LG
TL;DR: The paper presents a method to learn POMDP parameters (transition and observation matrices) from action-observation sequences using spectral approaches and tensor decompositions, relaxing assumptions of full observability and full-rank transitions.
Details
Motivation: Enable autonomous agents to learn and reason about systems with hidden states (like furniture with hidden locking mechanisms) by learning POMDP parameters from action-observation sequences, addressing limitations of existing methods that either don't estimate likelihoods or make restrictive assumptions.Method: Combines Predictive State Representations (PSRs) with tensor decompositions to learn observation and transition matrices up to a similarity transform. Learns partition-level transition models where states in the same partition share observation distributions for actions with full-rank transition matrices.
Result: The learned partition-level transition models achieve performance comparable to PSRs when used with standard sampling-based POMDP solvers, given sufficient data. The explicit likelihoods enable specification of planner behavior after model learning.
Conclusion: The proposed method successfully learns POMDP parameters from action-observation sequences, relaxing restrictive assumptions while providing explicit transition and observation likelihoods useful for downstream reasoning tasks and planner behavior specification.
Abstract: We are interested in enabling autonomous agents to learn and reason about systems with hidden states, such as furniture with hidden locking mechanisms. We cast this problem as learning the parameters of a discrete Partially Observable Markov Decision Process (POMDP). The agent begins with knowledge of the POMDP’s action and observation spaces, but not its state space, transitions, or observation models. These properties must be constructed from action-observation sequences. Spectral approaches to learning models of partially observable domains, such as learning Predictive State Representations (PSRs), are known to directly estimate the number of hidden states. These methods cannot, however, yield direct estimates of transition and observation likelihoods, which are important for many downstream reasoning tasks. Other approaches leverage tensor decompositions to estimate transition and observation likelihoods but often assume full state observability and full-rank transition matrices for all actions. To relax these assumptions, we study how PSRs learn transition and observation matrices up to a similarity transform, which may be estimated via tensor methods. Our method learns observation matrices and transition matrices up to a partition of states, where the states in a single partition have the same observation distributions corresponding to actions whose transition matrices are full-rank. Our experiments suggest that these partition-level transition models learned by our method, with a sufficient amount of data, match the performance of PSRs as models to be used by standard sampling-based POMDP solvers. Furthermore, the explicit observation and transition likelihoods can be leveraged to specify planner behavior after the model has been learned.
[333] Bi-Level Online Provisioning and Scheduling with Switching Costs and Cross-Level Constraints
Jialei Liu, C. Emre Koksal, Ming Shi
Main category: cs.LG
TL;DR: Bi-level online learning algorithm combining OCO and CMDP for network resource allocation with two time scales, handling switching costs and cross-level constraints with near-optimal regret.
Details
Motivation: Existing OCO methods can't capture MDP network dynamics like queue evolution, while CMDP algorithms assume fixed constraint thresholds, but in provisioning-and-scheduling systems, thresholds vary with online budget decisions.Method: Bi-level OCO-CMDP learning algorithm with dual feedback returning budget multiplier for upper-level updates, and lower level solving budget-adaptive safe exploration via extended occupancy-measure linear program.
Result: Established near-optimal regret and high-probability satisfaction of cross-level constraints that couple budgets to scheduling decisions.
Conclusion: The proposed algorithm effectively addresses the gaps in existing methods by combining OCO and CMDP for two-time-scale network resource allocation with switching costs and cross-level constraints.
Abstract: We study a bi-level online provisioning and scheduling problem motivated by network resource allocation, where provisioning decisions are made at a slow time scale while queue-/state-dependent scheduling is performed at a fast time scale. We model this two-time-scale interaction using an upper-level online convex optimization (OCO) problem and a lower-level constrained Markov decision process (CMDP). Existing OCO typically assumes stateless decisions and thus cannot capture MDP network dynamics such as queue evolution. Meanwhile, CMDP algorithms typically assume a fixed constraint threshold, whereas in provisioning-and-scheduling systems, the threshold varies with online budget decisions. To address these gaps, we study bi-level OCO-CMDP learning under switching costs (budget reprovisioning/system reconfiguration) and cross-level constraints that couple budgets to scheduling decisions. Our new algorithm solves this learning problem via several non-trivial developments, including a carefully designed dual feedback that returns the budget multiplier as sensitivity information for the upper-level update and a lower level that solves a budget-adaptive safe exploration problem via an extended occupancy-measure linear program. We establish near-optimal regret and high-probability satisfaction of the cross-level constraints.
[334] FSD-CAP: Fractional Subgraph Diffusion with Class-Aware Propagation for Graph Feature Imputation
Xin Qiao, Shijie Sun, Anqi Dong, Cong Hua, Xia Zhao, Longfei Zhang, Guangming Zhu, Liang Zhang
Main category: cs.LG
TL;DR: FSD-CAP is a two-stage graph feature imputation framework that handles extreme sparsity (up to 99.5% missing features) using localized fractional diffusion and class-aware refinement.
Details
Motivation: Existing graph feature imputation methods struggle with high missing rates, often producing unreliable estimates and propagating errors across the graph structure.Method: Two-stage approach: 1) Graph-distance-guided subgraph expansion with fractional diffusion operator for localized propagation, 2) Class-aware refinement using pseudo-labels and neighborhood entropy for consistency.
Result: Achieves 80.06% (structural) and 81.01% (uniform) node classification accuracy with 99.5% missing features, close to 81.31% with full features. Link prediction AUC of 91.65% (structural) and 92.41% (uniform) vs 95.06% full features.
Conclusion: FSD-CAP effectively handles extreme feature sparsity in graphs, performing close to models with complete features and outperforming other methods on large-scale and heterophily datasets.
Abstract: Imputing missing node features in graphs is challenging, particularly under high missing rates. Existing methods based on latent representations or global diffusion often fail to produce reliable estimates, and may propagate errors across the graph. We propose FSD-CAP, a two-stage framework designed to improve imputation quality under extreme sparsity. In the first stage, a graph-distance-guided subgraph expansion localizes the diffusion process. A fractional diffusion operator adjusts propagation sharpness based on local structure. In the second stage, imputed features are refined using class-aware propagation, which incorporates pseudo-labels and neighborhood entropy to promote consistency. We evaluated FSD-CAP on multiple datasets. With $99.5\%$ of features missing across five benchmark datasets, FSD-CAP achieves average accuracies of $80.06\%$ (structural) and $81.01\%$ (uniform) in node classification, close to the $81.31\%$ achieved by a standard GCN with full features. For link prediction under the same setting, it reaches AUC scores of $91.65\%$ (structural) and $92.41\%$ (uniform), compared to $95.06\%$ for the fully observed case. Furthermore, FSD-CAP demonstrates superior performance on both large-scale and heterophily datasets when compared to other models.
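For intuition, a bare-bones diffusion-style imputation loop is sketched below; it uses plain neighbor averaging with observed features clamped at each step, standing in for (but not implementing) the paper's fractional diffusion operator and class-aware refinement.

```python
import numpy as np

def propagate_features(adj, x, observed_mask, steps=50):
    """Minimal diffusion-style imputation sketch (plain propagation, not
    the paper's fractional operator): repeatedly average neighbor features,
    clamping observed entries back to their known values."""
    deg = adj.sum(1, keepdims=True).clip(min=1)
    p = adj / deg                                 # row-stochastic propagation
    x_hat = np.where(observed_mask, x, 0.0)       # init missing entries to 0
    for _ in range(steps):
        x_hat = p @ x_hat
        x_hat = np.where(observed_mask, x, x_hat) # keep known features fixed
    return x_hat

# Tiny example: path graph with the middle node's feature missing.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
x = np.array([[1.0], [0.0], [3.0]])
mask = np.array([[True], [False], [True]])
print(propagate_features(adj, x, mask))           # middle imputed near 2.0
```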
[335] A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
Claire O’Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O’Brien, Ashwinee Panda, Ryan Lagasse
Main category: cs.LG
TL;DR: Targeted neuron-level fine-tuning using SAEs and linear probes reduces sycophantic behavior in LLMs with minimal data, matching/exceeding SOTA performance while avoiding full-model side effects.
Details
Motivation: Broad fine-tuning for behavioral alignment causes distributional shift and low interpretability. Need precise, data-efficient methods that target only behavior-relevant neurons.Method: Use sparse autoencoders (SAEs) and linear probes to identify top 3% MLP neurons predictive of target behavior, decode them into residual space, then fine-tune only those neurons via gradient masking.
Result: Method matches/exceeds SOTA on four sycophancy benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models, showing effectiveness with limited data.
Conclusion: Sparse neuron-level updates provide scalable, precise alternative to full-model fine-tuning, maintaining effectiveness in low-data scenarios while avoiding unwanted side effects.
Abstract: Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even in situations where little data is available.
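The gradient-masking step can be illustrated directly in PyTorch. The sketch below assumes the selected neurons correspond to output rows of an MLP Linear layer; the SAE/probe-based neuron selection is outside its scope.

```python
import torch

def mask_to_neurons(linear, neuron_idx):
    """Sketch of neuron-level fine-tuning via gradient masking: zero out
    gradients for all weight rows (and biases) except the selected neurons,
    so optimizer updates touch only those neurons."""
    mask = torch.zeros_like(linear.weight)
    mask[neuron_idx] = 1.0
    linear.weight.register_hook(lambda g: g * mask)
    if linear.bias is not None:
        bmask = torch.zeros_like(linear.bias)
        bmask[neuron_idx] = 1.0
        linear.bias.register_hook(lambda g: g * bmask)

mlp = torch.nn.Linear(64, 256)
top_neurons = torch.tensor([3, 17, 42])   # e.g. chosen by a linear probe
mask_to_neurons(mlp, top_neurons)
loss = mlp(torch.randn(8, 64)).pow(2).mean()
loss.backward()
# Only the selected rows carry nonzero gradient.
print(mlp.weight.grad.abs().sum(dim=1).nonzero().flatten())
```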
[336] Vector-Valued Distributional Reinforcement Learning Policy Evaluation: A Hilbert Space Embedding Approach
Mehrdad Mohammadi, Qi Zheng, Ruoqing Zhu
Main category: cs.LG
TL;DR: KE-DRL: Offline multi-dimensional distributional RL using kernel mean embeddings to estimate value distributions in continuous spaces, replacing computationally expensive Wasserstein metrics with integral probability metrics.
Details
Motivation: Direct computation of Wasserstein distances in multi-dimensional continuous state-action spaces is computationally challenging, especially for distributional RL. Need efficient methods for off-policy evaluation in complex real-world decision-making scenarios.Method: Leverage Hilbert space mappings to estimate kernel mean embeddings of multi-dimensional value distributions under target policies. Use reproducing kernel Hilbert spaces to replace Wasserstein metrics with integral probability metrics. Focus on Matern family kernels with Lipschitz continuity and boundedness assumptions.
Result: Theoretical contraction properties of distributional Bellman operator under proposed metric with Matern kernels. Uniform convergence guarantees. Simulations show robust off-policy evaluation and recovery of kernel mean embeddings under mild assumptions.
Conclusion: Kernel mean embedding approach enables efficient distributional RL in multi-dimensional continuous spaces, offering computational advantages over Wasserstein metrics while maintaining theoretical guarantees. Promising for complex real-world decision-making and risk evaluation scenarios.
Abstract: We propose an (offline) multi-dimensional distributional reinforcement learning framework (KE-DRL) that leverages Hilbert space mappings to estimate the kernel mean embedding of the multi-dimensional value distribution under a proposed target policy. In our setting, the state-action variables are multi-dimensional and continuous. By mapping probability measures into a reproducing kernel Hilbert space via kernel mean embeddings, our method replaces Wasserstein metrics with an integral probability metric. This enables efficient estimation in multi-dimensional state-action spaces and reward settings, where direct computation of Wasserstein distances is computationally challenging. Theoretically, we establish contraction properties of the distributional Bellman operator under our proposed metric involving the Matern family of kernels and provide uniform convergence guarantees. Simulations and empirical results demonstrate robust off-policy evaluation and recovery of the kernel mean embedding under mild assumptions, namely, Lipschitz continuity and boundedness of the kernels, highlighting the potential of embedding-based approaches in complex real-world decision-making scenarios and risk evaluation.
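The core quantity, a distance between empirical kernel mean embeddings, reduces to the familiar squared MMD. A minimal sketch follows, using an RBF kernel as a stand-in for the Matern family the paper analyzes.

```python
import numpy as np

def mmd_sq(x, y, lengthscale=1.0):
    """Squared MMD between empirical kernel mean embeddings of two sample
    sets x, y of shape (n, d), using an RBF kernel as a stand-in for the
    Matern kernels treated in the paper."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * lengthscale**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

# Example: compare empirical return distributions from two policies.
rng = np.random.default_rng(0)
print(mmd_sq(rng.normal(size=(500, 2)), rng.normal(0.5, 1.0, size=(500, 2))))
```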
[337] Towards Self-Optimizing Electron Microscope: Robust Tuning of Aberration Coefficients via Physics-Aware Multi-Objective Bayesian Optimization
Utkarsh Pratiush, Austin Houston, Richard Liu, Gerd Duscher, Sergei Kalinin
Main category: cs.LG
TL;DR: MOBO framework for rapid, data-efficient aberration correction in STEM using Bayesian optimization with user-defined objectives and Pareto fronts to handle trade-offs.
Details
Motivation: Conventional aberration correction methods in STEM are sample-inefficient (serial, gradient-free searches) and struggle with multiple interacting parameters, while deep learning methods lack flexibility for varying conditions without extensive retraining.Method: Multi-Objective Bayesian Optimization (MOBO) framework using Gaussian Process regression to model aberration landscape probabilistically, with active learning to select informative lens settings and Pareto fronts to expose trade-offs between competing objectives.
Result: The approach is more robust than traditional optimization algorithms, effectively tunes focus, astigmatism, and higher-order aberrations, and enables “self-optimizing” microscopy by dynamically sustaining optimal performance.
Conclusion: MOBO provides a flexible, data-efficient framework for aberration correction that balances competing experimental priorities and enables adaptive, optimal STEM performance during experiments.
Abstract: Realizing high-throughput aberration-corrected Scanning Transmission Electron Microscopy (STEM) exploration of atomic structures requires rapid tuning of multipole probe correctors while compensating for the inevitable drift of the optical column. While automated alignment routines exist, conventional approaches rely on serial, gradient-free searches (e.g., Nelder-Mead) that are sample-inefficient and struggle to correct multiple interacting parameters simultaneously. Conversely, emerging deep learning methods offer speed but often lack the flexibility to adapt to varying sample conditions without extensive retraining. Here, we introduce a Multi-Objective Bayesian Optimization (MOBO) framework for rapid, data-efficient aberration correction. Importantly, this framework does not prescribe a single notion of image quality; instead, it enables user-defined, physically motivated reward formulations (e.g., symmetry-induced objectives) and uses Pareto fronts to expose the resulting trade-offs between competing experimental priorities. By using Gaussian Process regression to model the aberration landscape probabilistically, our workflow actively selects the most informative lens settings to evaluate next, rather than performing an exhaustive blind search. We demonstrate that this active learning loop is more robust than traditional optimization algorithms and effectively tunes focus, astigmatism, and higher-order aberrations. By balancing competing objectives, this approach enables “self-optimizing” microscopy by dynamically sustaining optimal performance during experiments.
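A small helper illustrates the Pareto-front idea at the heart of the workflow: among candidate lens settings scored on several objectives, keep only the non-dominated ones. The GP surrogate and acquisition step are not shown; this is a sketch under the assumption that all objectives are to be maximized.

```python
import numpy as np

def pareto_front(objectives):
    """Return the Pareto-optimal rows of an (n, m) objective matrix,
    assuming larger is better for every objective."""
    n = objectives.shape[0]
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        others = np.delete(objectives, i, axis=0)
        # i is dominated if some other point is >= everywhere and > somewhere.
        dominated[i] = np.any(np.all(others >= objectives[i], axis=1) &
                              np.any(others > objectives[i], axis=1))
    return objectives[~dominated]

# Example: two competing image-quality objectives for candidate settings.
pts = np.array([[0.9, 0.2], [0.5, 0.8], [0.4, 0.4], [0.7, 0.7]])
print(pareto_front(pts))   # [0.4, 0.4] is dominated by [0.7, 0.7]
```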
[338] When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control
Nima Leclerc, Chris Miller, Nicholas Brawand
Main category: cs.LG
TL;DR: Meta-learning scaling law shows adaptation benefits saturate exponentially with gradient steps and scale linearly with task variance, validated on quantum gate calibration and classical control.
Details
Motivation: Quantum hardware suffers from device heterogeneity and environmental drift, forcing trade-offs between suboptimal non-adaptive controllers and costly per-device recalibration.Method: Derived a scaling law lower bound for meta-learning showing adaptation gain saturates exponentially with gradient steps and scales linearly with task variance, validated on quantum gate calibration and classical linear-quadratic control.
Result: Negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10× training noise), with implications for reducing per-device calibration time on cloud quantum processors.
Conclusion: Results offer a transferable framework for decision-making in adaptive control, showing these laws emerge from general optimization geometry rather than quantum-specific physics.
Abstract: Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but $>40\%$ fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. Together, these results offer a transferable framework for decision-making in adaptive control.
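The stated law suggests a simple decision rule. The sketch below uses an illustrative functional form consistent with the abstract (exponential saturation in gradient steps, linear in task variance); the constants `rate` and `scale` are hypothetical, not values from the paper.

```python
import numpy as np

def adaptation_gain(k_steps, task_variance, rate=0.5, scale=1.0):
    """Illustrative functional form matching the abstract's description:
    gain saturates exponentially in gradient steps and is linear in task
    variance. `rate` and `scale` are hypothetical constants."""
    return scale * task_variance * (1.0 - np.exp(-rate * k_steps))

# Adaptation is worthwhile only when the gain exceeds its overhead cost.
overhead = 0.05
for var in (0.01, 0.5):
    gain = adaptation_gain(k_steps=10, task_variance=var)
    print(f"variance={var}: gain={gain:.3f}, adapt={gain > overhead}")
```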
[339] Attention-Enhanced Graph Filtering for False Data Injection Attack Detection and Localization
Ruslan Abdulin, Mohammad Rasoul Narimani
Main category: cs.LG
TL;DR: Proposes a joint false data injection attack detection and localization framework using ARMA graph convolutional filters with Transformer encoder for power grid cybersecurity.
Details
Motivation: IoT-enabled devices in power systems expand cyberattack surfaces, exposing critical infrastructure to false data injection attacks that compromise measurement integrity. Existing detection methods using graph-based learning have limitations: they rely on high-dimensional representations and shallow classifiers, fail to capture both local structural dependencies and global contextual relationships, and Transformer architectures can be too deep for localized grid dynamics.Method: Joint FDIA detection and localization framework integrating auto-regressive moving average (ARMA) graph convolutional filters with an Encoder-Only Transformer architecture. ARMA-based graph filters provide robust, topology-aware feature extraction adaptable to abrupt spectral changes, while Transformer encoder uses self-attention to capture long-range dependencies without sacrificing local context.
Result: Evaluated using real-world load data from NYISO applied to IEEE 14- and 300-bus systems. The model effectively exploits both state and topology of power grid, achieving high accuracy in detecting FDIA events and localizing compromised nodes.
Conclusion: The proposed ARMA-Transformer framework successfully addresses limitations of existing FDIA detection methods by combining topology-aware feature extraction with long-range dependency modeling, providing effective cybersecurity for modern power systems against false data injection attacks.
Abstract: The increasing deployment of Internet-of-Things (IoT)-enabled measurement devices in modern power systems has expanded the cyberattack surface of the grid. As a result, this critical infrastructure is increasingly exposed to cyberattacks, including false data injection attacks (FDIAs) that compromise measurement integrity and threaten reliable system operation. Existing FDIA detection methods primarily exploit spatial correlations and network topology using graph-based learning; however, these approaches often rely on high-dimensional representations and shallow classifiers, limiting their ability to capture local structural dependencies and global contextual relationships. Moreover, naively incorporating Transformer architectures can result in overly deep models that struggle to model localized grid dynamics. This paper proposes a joint FDIA detection and localization framework that integrates auto-regressive moving average (ARMA) graph convolutional filters with an Encoder-Only Transformer architecture. The ARMA-based graph filters provide robust, topology-aware feature extraction and adaptability to abrupt spectral changes, while the Transformer encoder leverages self-attention to capture long-range dependencies among grid elements without sacrificing essential local context. The proposed method is evaluated using real-world load data from the New York Independent System Operator (NYISO) applied to the IEEE 14- and 300-bus systems. Numerical results demonstrate that the proposed model effectively exploits both the state and topology of the power grid, achieving high accuracy in detecting FDIA events and localizing compromised nodes.
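A minimal ARMA-style graph filter layer, in the spirit of the recursive updates the paper builds on, can be written in a few lines of PyTorch; the dense normalized adjacency and single-stack recursion here are simplifications for clarity.

```python
import torch

class ARMAFilter(torch.nn.Module):
    """Sketch of a single-stack ARMA-style graph filter: a recursive update
    mixing states propagated over the graph with a skip connection from the
    input features. (The paper's multi-stack variant is not reproduced.)"""
    def __init__(self, in_dim, out_dim, iterations=4):
        super().__init__()
        self.w = torch.nn.Linear(out_dim, out_dim, bias=False)
        self.v = torch.nn.Linear(in_dim, out_dim, bias=False)
        self.iterations = iterations

    def forward(self, x, adj_norm):
        # adj_norm: symmetrically normalized adjacency (dense for clarity).
        h = self.v(x)
        for _ in range(self.iterations):
            h = torch.relu(adj_norm @ self.w(h) + self.v(x))
        return h

# Example: 5 buses with 8 measurement features each.
layer = ARMAFilter(in_dim=8, out_dim=16)
adj = torch.eye(5)                    # placeholder normalized adjacency
print(layer(torch.randn(5, 8), adj).shape)   # torch.Size([5, 16])
```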
[340] Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning
Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, Dong Yu
Main category: cs.LG
TL;DR: VPPO improves RL for LLM reasoning by using PRMs only to detect the first error in reasoning paths, then partitioning trajectories into correct prefixes and erroneous suffixes for better credit assignment.
Details
Motivation: Existing RL approaches for LLMs rely on sparse outcome rewards that fail to credit correct intermediate steps. Process reward models (PRMs) offer step-level supervision but their scores are noisy and current PRM benchmarks focus on detecting first incorrect steps, which is misaligned with how PRMs are typically used in RL.Method: Verifiable Prefix Policy Optimization (VPPO) uses PRMs only to localize the first error during RL. Given an incorrect rollout, VPPO partitions the trajectory into a verified correct prefix and an erroneous suffix based on the first error, rewarding the former while applying targeted penalties only after the detected mistake.
Result: Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-guided baselines on both Pass@1 and Pass@K metrics.
Conclusion: VPPO bridges the gap between PRM evaluation and RL usage by focusing on error localization rather than noisy step-wise scores, yielding stable, interpretable learning signals and improving credit assignment for LLM reasoning.
Abstract: Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). However, most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. Process reward models (PRMs) offer fine-grained step-level supervision, but their scores are often noisy and difficult to evaluate. As a result, recent PRM benchmarks focus on a more objective capability: detecting the first incorrect step in a reasoning path. However, this evaluation target is misaligned with how PRMs are typically used in RL, where their step-wise scores are treated as raw rewards to maximize. To bridge this gap, we propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL. Given an incorrect rollout, VPPO partitions the trajectory into a verified correct prefix and an erroneous suffix based on the first error, rewarding the former while applying targeted penalties only after the detected mistake. This design yields stable, interpretable learning signals and improves credit assignment. Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-guided baselines on both Pass@1 and Pass@K.
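The credit-assignment rule is easy to state in code. Below is a sketch of the prefix/suffix split given a PRM-detected first error; the reward and penalty magnitudes are illustrative placeholders.

```python
def vppo_step_rewards(num_steps, first_error, prefix_reward=1.0, penalty=-1.0):
    """Sketch of VPPO-style credit assignment for an incorrect rollout:
    steps before the PRM-detected first error are rewarded, steps from the
    error onward are penalized. (Values are illustrative.)"""
    if first_error is None:                 # no error detected
        return [prefix_reward] * num_steps
    return ([prefix_reward] * first_error
            + [penalty] * (num_steps - first_error))

print(vppo_step_rewards(6, first_error=4))  # [1.0, 1.0, 1.0, 1.0, -1.0, -1.0]
```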
[341] Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective
Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang
Main category: cs.LG
TL;DR: KV caching optimization for LLM inference with integrated cache eviction and query routing algorithms that significantly improve performance metrics.
Details
Motivation: KV caching accelerates LLM inference but faces challenges with limited memory and dynamic query patterns, especially in multi-LLM serving where cache hit rate and load balancing conflict.Method: Developed unified mathematical model capturing KV cache eviction and query routing trade-offs, combining provably competitive randomized KV cache eviction with learning-based adaptive query routing.
Result: Achieved up to 6.92× cache hit rate improvement, 11.96× latency reduction, 14.06× TTFT reduction, and 77.4% throughput increase over state-of-the-art methods across 4 benchmarks and 3 prefix-sharing settings.
Conclusion: The integrated approach of principled cache eviction with adaptive routing effectively balances query load and cache hit rate, significantly improving LLM serving performance in memory-constrained environments.
Abstract: KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to 6.92$\times$ in cache hit rate, 11.96$\times$ reduction in latency, 14.06$\times$ reduction in time-to-first-token (TTFT), and 77.4% increase in throughput over the state-of-the-art methods. Our code is available at https://github.com/fzwark/KVRouting.
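As a flavor of the eviction side, here is a toy randomized cache: on a miss with a full cache it evicts a uniformly random entry. The paper's provably competitive policy and learned router are more involved; this sketch only conveys the general shape.

```python
import random

class RandomizedKVCache:
    """Toy randomized eviction policy (illustrative, not the paper's
    algorithm): on a miss with a full cache, evict a uniformly random
    entry instead of the least-recently-used one."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}                     # prefix_id -> KV blob

    def get(self, key, compute):
        if key in self.cache:
            return self.cache[key]          # cache hit: reuse KV pairs
        if len(self.cache) >= self.capacity:
            victim = random.choice(list(self.cache))
            del self.cache[victim]          # randomized eviction
        self.cache[key] = compute(key)      # cache miss: recompute KV
        return self.cache[key]

cache = RandomizedKVCache(capacity=2)
for prefix in ["a", "b", "c", "a"]:
    cache.get(prefix, compute=lambda k: f"kv({k})")
```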
[342] Accelerated training of Gaussian processes using banded square exponential covariances
Emily C. Ehrhardt, Felipe Tobar
Main category: cs.LG
TL;DR: A novel banded-matrix approximation for efficient GP training by eliminating near-zero off-diagonal entries in SE covariance matrices.
Details
Motivation: Square-exponential covariance matrices contain many off-diagonal entries extremely close to zero, suggesting computational inefficiency that can be exploited for faster GP training.Method: Construct a principled procedure to eliminate near-zero off-diagonal entries, producing a banded-matrix approximation to the original covariance. This allows reduced-cost computation of inverse and determinant for efficient likelihood approximation.
Result: Theoretical analysis shows the method preserves original covariance structure in 1D setting with SE kernel. Computational efficiency validated against variational free energy approach to sparse GPs.
Conclusion: The banded-matrix approximation provides a computationally efficient alternative for GP training while maintaining covariance structure, offering advantages over existing sparse GP methods.
Abstract: We propose a novel approach to computationally efficient GP training based on the observation that square-exponential (SE) covariance matrices contain several off-diagonal entries extremely close to zero. We construct a principled procedure to eliminate those entries to produce a \emph{banded}-matrix approximation to the original covariance, whose inverse and determinant can be computed at a reduced computational cost, thus contributing to an efficient approximation to the likelihood function. We provide a theoretical analysis of the proposed method to preserve the structure of the original covariance in the 1D setting with SE kernel, and validate its computational efficiency against the variational free energy approach to sparse GPs.
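The observation itself is easy to verify numerically. The sketch below thresholds near-zero entries of an SE covariance on sorted 1D inputs and reports the resulting bandwidth; the fixed cutoff is an illustrative assumption, whereas the paper derives a principled elimination procedure.

```python
import numpy as np

def banded_se_cov(x, lengthscale=0.5, tol=1e-8):
    """Build an SE covariance on sorted 1D inputs and zero out off-diagonal
    entries below `tol`, yielding a banded approximation. (Thresholding
    rule is illustrative only.)"""
    x = np.sort(x)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2)
    K_band = np.where(K >= tol, K, 0.0)
    bandwidth = max(abs(i - j) for i, j in zip(*np.nonzero(K_band)))
    return K_band, bandwidth

rng = np.random.default_rng(1)
_, bw = banded_se_cov(rng.uniform(0, 10, size=200))
# Banded solvers cut inverse/determinant cost from O(n^3) toward O(n b^2).
print(f"bandwidth {bw} out of 200")
```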
[343] EVEREST: An Evidential, Tail-Aware Transformer for Rare-Event Time-Series Forecasting
Antanas Zilinskas, Robert N. Shorten, Jakub Marecek
Main category: cs.LG
TL;DR: EVEREST is a transformer-based architecture for probabilistic rare-event forecasting that handles class imbalance, uncertainty, and tail risk with interpretable attention, achieving state-of-the-art performance on space-weather flare prediction.
Details
Motivation: Forecasting rare events in multivariate time-series is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. Existing methods struggle with these issues in high-stakes domains like space weather, industrial monitoring, and satellite diagnostics.Method: EVEREST integrates four components: (1) learnable attention bottleneck for soft temporal aggregation, (2) evidential head for aleatoric/epistemic uncertainty via Normal-Inverse-Gamma distribution, (3) extreme-value head for tail risk using Generalized Pareto Distribution, and (4) lightweight precursor head for early detection. Jointly optimized with composite loss (focal loss, evidential NLL, tail-sensitive EVT penalty).
Result: Achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares on decade of space-weather data. Compact model (0.81M parameters), efficient training on commodity hardware, with no inference overhead.
Conclusion: EVEREST provides effective probabilistic rare-event forecasting with calibrated predictions, tail-aware risk estimation, and interpretability. Limitations include fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
Abstract: Forecasting rare events in multivariate time-series data is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability via attention-based signal attribution. EVEREST integrates four components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal–Inverse–Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimized with a composite loss (focal loss, evidential NLL, and a tail-sensitive EVT penalty) and act only at training time; deployment uses a single classification head with no inference overhead (approximately 0.81M parameters). On a decade of space-weather data, EVEREST achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares. The model is compact, efficient to train on commodity hardware, and applicable to high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
[344] Is Finer Better? The Limits of Microscaling Formats in Large Language Models
Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang
Main category: cs.LG
TL;DR: Microscaling quantization shows unexpected performance degradation with smaller block sizes due to narrow tensor distributions and limited dynamic range, solved by using FP8 unsigned E5M3 format for scales.
Details
Motivation: Microscaling data formats enable aggressive model compression but require hardware-friendly implementations. The paper investigates a surprising anomaly where smaller block sizes degrade performance instead of improving representation, which contradicts expectations.Method: Combined experimental and theoretical analysis: experimentally analyzed distributions of Large Language Models to identify conditions driving anomalous behavior; theoretically developed a framework that shows agreement with experimental data from both pretrained models and ideal distributions.
Result: Identified that the anomaly is driven by interplay between narrow tensor distributions and limited dynamic range of quantized scales. Proposed FP8 unsigned E5M3 (UE5M3) as a novel hardware-friendly format for scales in FP4 microscaling data types.
Conclusion: UE5M3 achieves comparable performance to conventional FP8 unsigned E4M3 scales while eliminating the need for global scaling operations on weights and activations, providing a practical solution to the microscaling quantization anomaly.
Abstract: Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereby the output of a quantized model degrades as block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 (UE5M3) as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need for global scaling operations on weights and activations.
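For reference, a decoder for an unsigned E5M3 byte might look as follows, assuming an IEEE-style layout (5 exponent bits with bias 15, 3 mantissa bits, subnormals at a zero exponent); the paper's exact encoding conventions may differ.

```python
def decode_ue5m3(byte):
    """Decode an 8-bit unsigned E5M3 value under an assumed IEEE-style
    layout: 5 exponent bits (bias 15), 3 mantissa bits, subnormals at
    exp == 0. The paper's exact convention may differ."""
    exp = (byte >> 3) & 0x1F
    man = byte & 0x07
    if exp == 0:                                   # subnormal range
        return (man / 8.0) * 2.0 ** (1 - 15)
    return (1.0 + man / 8.0) * 2.0 ** (exp - 15)   # normal range

print(decode_ue5m3(0b01111000))   # exp = 15 -> 1.0
print(decode_ue5m3(0b11111111))   # largest representable value
```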
[345] A Unifying View of Coverage in Linear Off-Policy Evaluation
Philip Amortila, Audrey Huang, Akshay Krishnamurthy, Nan Jiang
Main category: cs.LG
TL;DR: Novel finite-sample analysis of LSTDQ for linear off-policy evaluation, introducing a new coverage parameter called “feature-dynamics coverage” that unifies coverage definitions across different assumptions.
Details
Motivation: Current finite-sample guarantees for linear OPE rely on coverage parameters that are poorly understood in the minimal setting where only the target value function is linearly realizable. Existing coverage definitions have undesirable properties and are disconnected from standard literature definitions, creating a need for a unified understanding.Method: Developed a novel finite-sample analysis of LSTDQ algorithm using an instrumental-variable perspective. Introduced a new coverage parameter called “feature-dynamics coverage” that captures linear coverage in an induced dynamical system for feature evolution.
Result: The proposed feature-dynamics coverage parameter successfully recovers specialized coverage parameters under stronger assumptions (like Bellman-completeness), providing a unified framework for coverage in linear OPE that connects previously fragmented definitions.
Conclusion: The feature-dynamics coverage parameter offers a unified understanding of coverage in linear off-policy evaluation, bridging the gap between minimal and stronger assumption settings while maintaining desirable theoretical properties.
Abstract: Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $$ \textrm{Evaluation error} \le \textrm{poly}(C^\pi, d, 1/n, \log(1/\delta)), $$ where $d$ is the dimension of the features and $C^\pi$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well-understood for several popular algorithms under stronger assumptions (e.g. Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. With further assumptions – such as Bellman-completeness – our definition successfully recovers the coverage parameters specialized to those settings, finally yielding a unified understanding for coverage in linear OPE.
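For readers unfamiliar with the algorithm being analyzed, the textbook LSTDQ estimator is a few lines of numpy; the ridge term below is a numerical convenience, not part of the paper's analysis.

```python
import numpy as np

def lstdq(phi, phi_next, rewards, gamma=0.95, reg=1e-6):
    """Textbook LSTDQ for linear off-policy evaluation: solve
    A theta = b with A = Phi^T (Phi - gamma Phi'), b = Phi^T r, where
    phi (n, d) stacks features of observed state-action pairs and
    phi_next (n, d) stacks features of next states paired with the
    target policy's actions."""
    a = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(a + reg * np.eye(phi.shape[1]), b)
```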
[346] Unravelling the (In)compatibility of Statistical-Parity and Equalized-Odds
Mortaza S. Bargh, Sunil Choenni, Floris ter Braak
Main category: cs.LG
TL;DR: The paper analyzes the relationship between Statistical-Parity and Equalized-Odds fairness measures, showing how base-rate imbalances cause incompatibility between them, and argues for examining base-rate balance before enforcing Statistical-Parity in practice.
Details
Motivation: Statistical fairness measures are crucial for detecting fairness issues in data and algorithms. Statistical-Parity is widely adopted in legal frameworks but doesn't require ground truth, while Equalized-Odds requires reliable ground truth but is important for false prediction parity. Understanding their relationship is essential for practical fairness assessment.Method: The paper presents a novel analysis of the relationship between Statistical-Parity and Equalized-Odds fairness measures, focusing on how base-rates of sensitive groups affect their compatibility. The analysis examines when base-rate imbalance causes incompatibility between these two fairness criteria.
Result: The analysis shows that base-rate imbalance causes incompatibility between Statistical-Parity and Equalized-Odds measures. The approach provides insights into how to make design trade-offs between these measures in practice and demonstrates when they cannot be simultaneously satisfied.
Conclusion: Before enforcing or relying on Statistical-Parity criterion, practitioners should examine base-rate (im)balance and investigate potential incompatibility with Equalized-Odds. The insights may trigger initiatives to improve current practices and legal frameworks for algorithmic fairness assessment.
Abstract: A key challenge in employing data, algorithms and data-driven systems is to adhere to the principle of fairness and justice. Statistical fairness measures belong to an important category of technical/formal mechanisms for detecting fairness issues in data and algorithms. In this contribution we study the relations between two types of statistical fairness measures, namely Statistical-Parity and Equalized-Odds. The Statistical-Parity measure does not rely on having ground truth, i.e., (objectively) labeled target attributes. This makes Statistical-Parity a suitable measure in practice for assessing fairness in data and data classification algorithms. Therefore, Statistical-Parity is adopted in many legal and professional frameworks for assessing algorithmic fairness. The Equalized-Odds measure, on the contrary, relies on having (reliable) ground truth, which is not always feasible in practice. Nevertheless, there are several situations where the Equalized-Odds definition should be satisfied to enforce false prediction parity among sensitive social groups. We present a novel analysis of the relation between Statistical-Parity and Equalized-Odds as a function of the base-rates of sensitive groups. The analysis intuitively shows how and when base-rate imbalance causes incompatibility between the Statistical-Parity and Equalized-Odds measures. As such, our approach provides insight into how to make design trade-offs between these measures in practice. Further, based on our results, we plead for examining base-rate (im)balance and investigating the possibility of such an incompatibility before enforcing or relying on the Statistical-Parity criterion. The insights provided, we foresee, may trigger initiatives to improve or adjust current practice and/or the existing legal frameworks.
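The quantities in play are straightforward to compute from predictions. A minimal sketch, assuming a binary sensitive attribute and binary labels:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Statistical-Parity gap (difference in positive-prediction rates),
    Equalized-Odds gaps (TPR/FPR differences), and the base-rate gap that
    drives their (in)compatibility, for a binary group attribute."""
    gaps = {}
    rates = [y_pred[group == g].mean() for g in (0, 1)]
    gaps["statistical_parity"] = abs(rates[0] - rates[1])
    for name, cond in (("tpr", y_true == 1), ("fpr", y_true == 0)):
        r = [y_pred[(group == g) & cond].mean() for g in (0, 1)]
        gaps[f"equalized_odds_{name}"] = abs(r[0] - r[1])
    base = [y_true[group == g].mean() for g in (0, 1)]
    gaps["base_rate_gap"] = abs(base[0] - base[1])
    return gaps

rng = np.random.default_rng(0)
g = rng.integers(0, 2, 1000)
y = (rng.random(1000) < 0.3 + 0.3 * g).astype(int)   # imbalanced base rates
pred = (rng.random(1000) < 0.5).astype(int)
print(fairness_gaps(y, pred, g))
```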
[347] XIMP: Cross Graph Inter-Message Passing for Molecular Property Prediction
Anatol Ehrlich, Lorenz Kummer, Vojtech Voracek, Franka Bause, Nils M. Kriege
Main category: cs.LG
TL;DR: XIMP introduces cross-graph inter-message passing that enables simultaneous message passing within and across multiple molecular graph representations, outperforming state-of-the-art methods in data-scarce molecular property prediction.
Details
Motivation: Graph neural networks often underperform in data-scarce regimes for molecular property prediction and fail to surpass traditional fingerprints, creating a need for better approaches that can leverage multiple complementary molecular representations.Method: XIMP performs message passing both within and across multiple related graph representations simultaneously. For small molecules, it combines molecular graphs with scaffold-aware junction trees and pharmacophore-encoding extended reduced graphs. Unlike prior work, it supports arbitrary numbers of abstractions and both direct/indirect communication between them in each layer.
Result: Across ten diverse molecular property prediction tasks, XIMP outperforms state-of-the-art baselines in most cases, demonstrating enhanced generalization in low-data settings.
Conclusion: XIMP leverages interpretable abstractions as inductive bias that guides learning toward established chemical concepts, improving molecular property prediction especially in data-scarce scenarios.
Abstract: Accurate molecular property prediction is central to drug discovery, yet graph neural networks often underperform in data-scarce regimes and fail to surpass traditional fingerprints. We introduce cross-graph inter-message passing (XIMP), which performs message passing both within and across multiple related graph representations. For small molecules, we combine the molecular graph with scaffold-aware junction trees and pharmacophore-encoding extended reduced graphs, integrating complementary abstractions. While prior work is either limited to a single abstraction or non-iterative communication across graphs, XIMP supports an arbitrary number of abstractions and both direct and indirect communication between them in each layer. Across ten diverse molecular property prediction tasks, XIMP outperforms state-of-the-art baselines in most cases, leveraging interpretable abstractions as an inductive bias that guides learning toward established chemical concepts, enhancing generalization in low-data settings.
[348] OATS: Online Data Augmentation for Time Series Foundation Models
Junwei Deng, Chang Xu, Jiaqi W. Ma, Ming Jin, Chenghao Liu, Jiang Bian
Main category: cs.LG
TL;DR: OATS is an online data augmentation method for time series foundation models that dynamically generates synthetic data tailored to different training stages using a diffusion-based framework with explore-exploit balancing.
Details
Motivation: Existing time series augmentation methods rely on heuristics and static paradigms, failing to account for how sample contributions vary across training stages. The authors are motivated by dynamic data optimization principles to create adaptive augmentation.Method: OATS uses valuable training samples as guiding signals to dynamically generate high-quality synthetic data conditioned on them. It employs a diffusion-based framework for realistic time series generation and incorporates an explore-exploit mechanism to balance efficiency and effectiveness.
Result: Experiments on time series foundation models show OATS consistently outperforms regular training and yields substantial performance gains over static data augmentation baselines across six validation datasets and two TSFM architectures.
Conclusion: OATS provides a principled online augmentation strategy that adapts to different training stages, demonstrating superior performance over static augmentation methods for time series foundation models.
Abstract: Time Series Foundation Models (TSFMs) are a powerful paradigm for time series analysis and are often enhanced by synthetic data augmentation to improve the training data quality. Existing augmentation methods, however, typically rely on heuristics and static paradigms. Motivated by dynamic data optimization, which shows that the contribution of samples varies across training stages, we propose OATS (Online Data Augmentation for Time Series Foundation Models), a principled strategy that generates synthetic data tailored to different training steps. OATS leverages valuable training samples as principled guiding signals and dynamically generates high-quality synthetic data conditioned on them. We further design a diffusion-based framework to produce realistic time series and introduce an explore-exploit mechanism to balance efficiency and effectiveness. Experiments on TSFMs demonstrate that OATS consistently outperforms regular training and yields substantial performance gains over static data augmentation baselines across six validation datasets and two TSFM architectures. The code is available at the link https://github.com/microsoft/TimeCraft.
[349] Principled Fine-tuning of LLMs from User-Edits: A Medley of Preference, Supervision, and Reward
Dipendra Misra, Aldo Pacchiano, Ta-Chung Chi, Ge Gao
Main category: cs.LG
TL;DR: The paper proposes a theoretical framework and ensembling method for fine-tuning LLMs using user-edit deployment data, unifying different feedback types (preferences, supervised labels, cost) that are typically studied separately.
Details
Motivation: User-edit deployment data (context, agent response, user edits) is naturally generated in applications like writing assistants and coding agents, making it a valuable source for adapting and personalizing LLMs. This setup unifies various feedback types that are usually studied separately.Method: The paper first derives theoretical bounds for learning algorithms using individual feedback types (preferences, supervised labels, cost). Then proposes a simple ensembling procedure to jointly learn from all these feedback types simultaneously.
Result: On two domains adapted from Gao et al. 2024, the ensembling procedure outperforms methods that learn from individual feedback types. The procedure also shows robust adaptation to different user-edit distributions at test time.
Conclusion: The paper initiates theoretical investigation of learning from user edits, demonstrating that joint learning through ensembling across different feedback types is more effective than individual approaches, with practical benefits for LLM adaptation and personalization.
Abstract: We study how to fine-tune LLMs using user-edit deployment data consisting of contexts, agent responses, and user edits. This deployment data is naturally generated by users in applications such as LLMs-based writing assistants and coding agents. The natural origin of user edits makes it a desired source for adapting and personalizing LLMs. In this setup, there emerges a unification of various feedback types namely preferences, supervised labels, and cost that are typically studied separately in the literature. In this paper, we initiate the theoretical investigation of learning from user edits. We first derive bounds for learning algorithms that learn from each of these feedback types. We prove that these algorithms have different trade-offs depending upon the user, data distribution, and model class. We then propose a simple ensembling procedure to jointly learn from these feedback types. On two domains adapted from Gao et al. 2024, we show our ensembling procedure outperforms methods that learn from individual feedback types. Further, we show that our proposed procedure can robustly adapt to different user-edit distributions at test time.
[350] Critical Organization of Deep Neural Networks, and p-Adic Statistical Field Theories
W. A. Zúñiga-Galindo
Main category: cs.LG
TL;DR: The paper studies thermodynamic limits of neural networks using p-adic integers to model hierarchical structures, showing unique vs. infinite state transitions and analyzing random networks in infinite-width limits.
Details
Motivation: To rigorously understand the thermodynamic limit of deep and recurrent neural networks, particularly how hierarchical structures can be mathematically modeled and how networks transition between unique and infinite states.Method: Uses p-adic integers to codify hierarchical structures, recasting DNN/RNN topologies as p-adic tree-like structures. Studies critical organization via bifurcation analysis and analyzes random networks using generalized Gaussian random variables in function spaces.
Result: Shows networks admit unique states in certain parameter regions and infinite states outside these regions, with critical organization described as strange attractors. For random networks in infinite-width case, output distribution admits power-type expansion with Gaussian leading term.
Conclusion: The paper establishes mathematical connections between hierarchical neural network structures and p-adic representations, providing rigorous analysis of state transitions and statistical properties in thermodynamic limits of neural networks.
Abstract: We rigorously study the thermodynamic limit of deep neural networks (DNNs) and recurrent neural networks (RNNs), assuming that the activation functions are sigmoids. A thermodynamic limit is a continuous neural network, where the neurons form a continuous space with infinitely many points. We show that such a network admits a unique state in a certain region of the parameter space, which depends continuously on the parameters. This state breaks into an infinite number of states outside the mentioned region of parameter space. Then, the critical organization is a bifurcation in the parameter space, where a network transitions from a unique state to infinitely many states. We use p-adic integers to codify hierarchical structures. Indeed, we present an algorithm that recasts the hierarchical topologies used in DNNs and RNNs as p-adic tree-like structures. In this framework, the hierarchical and the critical organizations are connected. We study rigorously the critical organization of a toy model, a hierarchical edge detector for grayscale images based on p-adic cellular neural networks. The critical organization of such a network can be described as a strange attractor. In the second part, we study random versions of DNNs and RNNs. In this case, the network parameters are generalized Gaussian random variables in a space of quadratic integrable functions. We compute the probability distribution of the output given the input, in the infinite-width case. We show that it admits a power-type expansion, where the constant term is a Gaussian distribution.
[351] Speed is Confidence
Joshua V. Dillon
Main category: cs.LG
TL;DR: The paper proposes using “first-to-halt” inference inspired by biological neural systems to reduce compute while maintaining accuracy. By basing ensemble predictions on the first model to finish rather than averaging, and training with parallel latent states but backpropagating only through the lowest-loss winner, they achieve near-perfect Sudoku accuracy with 10x less compute than test-time augmentation.
Details
Motivation: Biological neural systems are fast but energy-constrained, using "act on the first signal" principles like winner-take-all circuits and time-to-first-spike coding. The authors aim to apply this biological efficiency principle to neural networks, reducing computational cost while maintaining accuracy.Method: Two key methods: 1) Inference: Use ensembles of Tiny Recursive Models (TRM) but base predictions solely on the first model to halt rather than averaging all predictions. 2) Training: Maintain K=4 parallel latent states during training but backpropagate only through the lowest-loss “winner.” Also developed a modified SwiGLU activation to make the approach viable under resource constraints.
Result: Achieved 97.2% puzzle accuracy on Sudoku-Extreme using 10x less compute than test-time augmentation (baseline: 86.1% single-pass, 97.3% with TTA). Single model with K=4 training achieved 96.9% ± 0.6% accuracy with single forward pass, matching TTA performance without augmentation. Training efficiency: 7k steps (40 min) for baseline performance, 36k steps (1.5-6 hours) for higher accuracy.
Conclusion: The “first-to-halt” principle from biological systems can be effectively applied to neural networks, enabling significant compute reduction while maintaining accuracy. Inference speed serves as an implicit confidence measure, and this capability can be manifested as a training-only cost through winner-take-all backpropagation. Resource constraints drove innovation in efficient architectures.
Abstract: Biological neural systems must be fast but are energy-constrained. Evolution’s solution: act on the first signal. Winner-take-all circuits and time-to-first-spike coding implicitly treat when a neuron fires as an expression of confidence. We apply this principle to ensembles of Tiny Recursive Models (TRM). By basing the ensemble prediction solely on the first to halt rather than averaging predictions, we achieve 97.2% puzzle accuracy on Sudoku-Extreme while using 10x less compute than test-time augmentation (the baseline achieves 86.1% single-pass, 97.3% with TTA). Inference speed is an implicit indication of confidence. But can this capability be manifested as a training-only cost? Evidently yes: by maintaining K = 4 parallel latent states during training but backpropping only through the lowest-loss “winner,” a single model achieves 96.9% ± 0.6% puzzle accuracy with a single forward pass, matching TTA performance without any test-time augmentation. As in nature, this work was also resource constrained: all experimentation used a single RTX 5090. This necessitated efficiency and compelled our invention of a modified SwiGLU which made Muon viable. With Muon and K = 1 training, we exceed TRM baseline performance in 7k steps (40 min). Higher accuracy requires 36k steps: 1.5 hours for K = 1, 6 hours for K = 4.
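The winner-take-all training trick is compact in PyTorch: compute a loss per latent state, backpropagate only through the argmin. The toy branches below (input perturbations of a shared linear model) are stand-ins for TRM's K parallel latent states, not the paper's architecture.

```python
import torch

def winner_take_all_loss(losses):
    """Sketch of K-winner training: given per-state losses from K parallel
    latent states, return only the lowest-loss 'winner' so gradients flow
    through that branch alone."""
    losses = torch.stack(losses)          # shape (K,)
    return losses[losses.argmin()]

# Toy example with K = 4 branches sharing one model.
model = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)
target = torch.randn(16)
losses = [torch.nn.functional.mse_loss(
              model(x[k] + 0.1 * torch.randn(16)), target)
          for k in range(4)]
winner_take_all_loss(losses).backward()   # grads from the winner only
```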
[352] EPAS: Efficient Training with Progressive Activation Sharing
Rezaul Karim, Maryam Dialameh, Yang Liu, Boxing Chen, Walid Ahmed
Main category: cs.LG
TL;DR: EPAS introduces progressive activation sharing during training to reduce compute by exploiting redundant QK/KV activations in deeper transformer layers, improving both training and inference throughput while maintaining model quality.
Details
Motivation: To address computational inefficiency in transformers by leveraging the observation that deeper layers exhibit redundant QK/KV activations, enabling compute reduction through activation sharing without sacrificing model performance.
Method: Progressive training approach that gradually grows a sharing region from deep to shallow layers during training, switching decoder layers to activation sharing mode to reduce compute while maintaining learning dynamics.
Result: Up to 11.1% training throughput improvement and 29% inference throughput improvement in LLaMA models (125M-7B parameters) with similar loss curves; 10% accuracy improvement in continual pretraining of TinyLLaMA over SOTA methods.
Conclusion: EPAS effectively bridges progressive training with activation sharing to exploit deeper layer redundancy, offering variable compute budgets during inference and demonstrating significant throughput gains while maintaining model quality.
Abstract: We present a novel method for Efficient training with Progressive Activation Sharing (EPAS). This method bridges the progressive training paradigm with the phenomenon of redundant QK (or KV) activations across deeper layers of transformers. EPAS gradually grows a sharing region during training by switching decoder layers to activation sharing mode. This results in a throughput increase due to reduced compute. To utilize deeper layer redundancy, the sharing region starts from the deep end of the model and grows towards the shallow end. EPAS-trained models allow for variable region lengths of activation sharing for different compute budgets during inference. Empirical evaluations with QK activation sharing in LLaMA models ranging from 125M to 7B parameters show up to an 11.1% improvement in training throughput and up to a 29% improvement in inference throughput while maintaining a loss curve similar to that of the baseline models. Furthermore, applying EPAS in continual pretraining to transform TinyLLaMA into an attention-sharing model yields up to a 10% improvement in average accuracy over state-of-the-art methods, emphasizing the significance of progressive training in cross-layer activation sharing models.
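A toy version of the progressive sharing schedule: the set of layers running in activation-sharing mode starts at the deep end and grows toward the shallow end as training proceeds. Layer counts, step thresholds, and the class name are illustrative, not the EPAS implementation.

```python
class SharingSchedule:
    """Which decoder layers run in activation-sharing mode at a given step."""
    def __init__(self, n_layers, start_step, grow_every):
        self.n_layers = n_layers
        self.start_step = start_step
        self.grow_every = grow_every

    def shared_layers(self, step):
        if step < self.start_step:
            return set()                          # warm-up: no sharing yet
        grown = 1 + (step - self.start_step) // self.grow_every
        n_shared = min(self.n_layers - 1, grown)  # always keep at least one computing layer
        return set(range(self.n_layers - n_shared, self.n_layers))

# A 32-layer model: the deepest layers progressively reuse QK/KV activations
# produced by the last non-shared layer.
schedule = SharingSchedule(n_layers=32, start_step=1000, grow_every=500)
assert schedule.shared_layers(1500) == {30, 31}
```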
[353] Privacy-Preserving Model Transcription with Differentially Private Synthetic Distillation
Bochao Liu, Shiming Ge, Pengju Wang, Shikun Li, Tongliang Liu
Main category: cs.LG
TL;DR: A data-free model-to-model conversion method that transforms pretrained models into privacy-preserving versions using differentially private synthetic distillation without accessing original private data.
Details
Motivation: Deployed deep learning models trained on private datasets pose privacy leakage risks, as attackers could recover sensitive data or label information from the models. There's a need for privacy-preserving model deployment solutions.
Method: Proposes differentially private synthetic distillation - a cooperative-competitive learning approach with three players: 1) generator learns to create synthetic data, 2) teacher and student compute differentially private labels via data/label noise perturbation, 3) student updates with noisy labels while generator uses student as discriminator for adversarial training. Uses alternate optimization without accessing original private data.
Result: Theoretically proven to guarantee differential privacy and convergence. The transcribed student model maintains good performance while providing privacy protection. The generator can produce private synthetic data for downstream tasks. Outperforms 26 state-of-the-art methods in extensive experiments.
Conclusion: Privacy-preserving model transcription enables secure model deployment by converting pretrained models into privacy-protected versions without accessing original data, offering both theoretical privacy guarantees and practical performance.
Abstract: While many deep learning models trained on private datasets have been deployed in various practical tasks, they may pose a privacy leakage risk as attackers could recover informative data or label knowledge from models. In this work, we present privacy-preserving model transcription, a data-free model-to-model conversion solution to facilitate model deployment with a privacy guarantee. To this end, we propose a cooperative-competitive learning approach termed differentially private synthetic distillation that learns to convert a pretrained model (teacher) into its privacy-preserving counterpart (student) via a trainable generator without access to private data. The learning collaborates with three players in a unified framework and performs alternate optimization: i) the generator is learned to generate synthetic data, ii) the teacher and student accept the synthetic data and compute differentially private labels by flexible data or label noise perturbation, and iii) the student is updated with noisy labels and the generator is updated by taking the student as a discriminator for adversarial training. We theoretically prove that our approach can guarantee differential privacy and convergence. The transcribed student has good performance and privacy protection, while the resulting generator can generate private synthetic data for downstream tasks. Extensive experiments clearly demonstrate that our approach outperforms 26 state-of-the-art methods.
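The label-perturbation step of the three-player loop can be sketched as follows: the teacher labels generator-produced synthetic data, and noise is injected before the student ever sees the labels. Gaussian noise on logits is one common privatization mechanism; the paper's exact perturbation scheme and its privacy accounting are not reproduced here.

```python
import torch

@torch.no_grad()
def noisy_teacher_labels(teacher, synthetic_x, sigma=1.0):
    """Perturb teacher logits so the student never trains on exact teacher outputs
    (illustrative Gaussian mechanism; sigma would be set by the privacy budget)."""
    logits = teacher(synthetic_x)
    noisy = logits + sigma * torch.randn_like(logits)
    return noisy.argmax(dim=-1)  # hard noisy labels for distillation
```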
[354] Out-of-Distribution Generalization for Neural Physics Solvers
Zhao Wei, Chin Chun Ooi, Jian Cheng Wong, Abhishek Gupta, Pao-Hsiung Chiu, Yew-Soon Ong
Main category: cs.LG
TL;DR: NOVA is a neural physics solver that achieves strong generalization beyond training data for PDE problems, enabling reliable extrapolation to novel scenarios with 1-2 orders of magnitude lower out-of-distribution errors than baselines.
Details
Motivation: Current neural physics solvers have poor generalization beyond their training support, which limits exploration of novel designs and long-time horizon predictions in scientific discovery applications.
Method: NOVA learns physics-aligned representations from an initial sparse set of scenarios, enabling generalizable neural physics solvers that can handle distributional shifts in PDE parameters, geometries, and initial conditions.
Result: NOVA consistently achieves 1-2 orders of magnitude lower out-of-distribution errors than data-driven baselines across complex nonlinear problems including heat transfer, diffusion-reaction, and fluid flow. It also stabilizes long-time dynamical rollouts and improves generative design in applications like Turing systems and fluidic chip optimization.
Conclusion: Unlike neural physics solvers constrained to retrieval/emulation within known spaces, NOVA enables reliable extrapolation beyond known regimes, providing a key capability for exploring novel hypothesis spaces in scientific discovery.
Abstract: Neural physics solvers are increasingly used in scientific discovery, given their potential for rapid in silico insights into physical, materials, or biological systems and their long-time evolution. However, poor generalization beyond their training support limits exploration of novel designs and long-time horizon predictions. We introduce NOVA, a route to generalizable neural physics solvers that can provide rapid, accurate solutions to scenarios even under distributional shifts in partial differential equation parameters, geometries and initial conditions. By learning physics-aligned representations from an initial sparse set of scenarios, NOVA consistently achieves 1-2 orders of magnitude lower out-of-distribution errors than data-driven baselines across complex, nonlinear problems including heat transfer, diffusion-reaction and fluid flow. We further showcase NOVA’s dual impact on stabilizing long-time dynamical rollouts and improving generative design through application to the simulation of nonlinear Turing systems and fluidic chip optimization. Unlike neural physics solvers that are constrained to retrieval and/or emulation within an a priori space, NOVA enables reliable extrapolation beyond known regimes, a key capability given the need for exploration of novel hypothesis spaces in scientific discovery.
[355] FloydNet: A Learning Paradigm for Global Relational Reasoning
Jingcheng Yu, Mingliang Zeng, Qiwei Ye
Main category: cs.LG
TL;DR: FloydNet introduces a dynamic programming-based architecture for graph reasoning that outperforms GNNs by maintaining global all-pairs relationships instead of local message passing, achieving state-of-the-art results on algorithmic tasks and graph reasoning benchmarks.
Details
Motivation: GNNs are limited by their local message-passing mechanism which creates a bottleneck for global, holistic reasoning. The paper argues that dynamic programming, which solves problems by iteratively refining a global state, offers a more powerful learning paradigm for complex multi-step reasoning tasks.
Method: FloydNet maintains a global, all-pairs relationship tensor and learns a generalized dynamic programming operator to progressively refine it. This enables the model to develop a task-specific relational calculus, providing a principled framework for capturing long-range dependencies.
Result: FloydNet achieves 3-WL (2-FWL) expressive power theoretically and demonstrates state-of-the-art performance: near-perfect scores (>99%) on CLRS-30 algorithmic benchmark, significantly outperforms strong heuristics for exact TSP solutions, and empirically matches 3-WL test on BREC benchmark.
Conclusion: Learned dynamic programming-style refinement is established as a powerful and practical alternative to message passing for high-level graph reasoning, offering superior global reasoning capabilities compared to traditional GNN architectures.
Abstract: Developing models capable of complex, multi-step reasoning is a central goal in artificial intelligence. While representing problems as graphs is a powerful approach, Graph Neural Networks (GNNs) are fundamentally constrained by their message-passing mechanism, which imposes a local bottleneck that limits global, holistic reasoning. We argue that dynamic programming (DP), which solves problems by iteratively refining a global state, offers a more powerful and suitable learning paradigm. We introduce FloydNet, a new architecture that embodies this principle. In contrast to local message passing, FloydNet maintains a global, all-pairs relationship tensor and learns a generalized DP operator to progressively refine it. This enables the model to develop a task-specific relational calculus, providing a principled framework for capturing long-range dependencies. Theoretically, we prove that FloydNet achieves 3-WL (2-FWL) expressive power, and its generalized form aligns with the k-FWL hierarchy. FloydNet demonstrates state-of-the-art performance across challenging domains: it achieves near-perfect scores (often >99%) on the CLRS-30 algorithmic benchmark, finds exact optimal solutions for the general Traveling Salesman Problem (TSP) at rates significantly exceeding strong heuristics, and empirically matches the 3-WL test on the BREC benchmark. Our results establish this learned, DP-style refinement as a powerful and practical alternative to message passing for high-level graph reasoning.
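The core recurrence is a learned analogue of Floyd-Warshall: every pair (i, j) is refined by aggregating messages built from R[i, k] and R[k, j] over intermediate nodes k. A minimal PyTorch sketch under assumed shapes; the paper's actual operator, pooling, and parameterization may differ.

```python
import torch
import torch.nn as nn

class DPRefine(nn.Module):
    """One refinement step over an all-pairs relation tensor R of shape (n, n, d)."""
    def __init__(self, d):
        super().__init__()
        self.combine = nn.Linear(2 * d, d)
        self.update = nn.Linear(2 * d, d)

    def forward(self, R):
        n, _, d = R.shape
        left = R[:, None, :, :].expand(n, n, n, d)                    # left[i,j,k] = R[i,k]
        right = R.permute(1, 0, 2)[None, :, :, :].expand(n, n, n, d)  # right[i,j,k] = R[k,j]
        msg = self.combine(torch.cat([left, right], dim=-1))          # message per (i,j,k)
        pooled = msg.max(dim=2).values                                # aggregate over k
        return self.update(torch.cat([R, pooled], dim=-1))            # refine R[i,j]
```

The cubic message tensor mirrors the per-layer cost typically paid by 2-FWL-expressive architectures.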
[356] OWLEYE: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection
Lecheng Zheng, Dongqi Fu, Zihao Li, Jingrui He
Main category: cs.LG
TL;DR: OWLEYE is a zero-shot graph anomaly detection framework that learns transferable normal behavior patterns from multiple graphs, enabling anomaly detection on unseen graphs without retraining through cross-domain feature alignment, multi-pattern dictionary learning, and attention-based reconstruction.
Details
Motivation: Current graph anomaly detection methods struggle with cross-domain generalization due to varying feature semantics and dimensions across different graph datasets. Existing approaches cannot effectively handle unseen graphs without retraining, limiting their practical application in real-world scenarios where graph data comes from diverse domains with different characteristics.
Method: OWLEYE uses three key components: 1) Cross-domain feature alignment module to harmonize feature distributions while preserving domain-specific semantics, 2) Multi-domain multi-pattern dictionary learning to encode shared structural and attribute-based patterns for continuous learning, and 3) Truncated attention-based reconstruction module for in-context learning to detect anomalies without labeled data on unseen graphs.
Result: Extensive experiments on real-world datasets demonstrate that OWLEYE achieves superior performance and generalizability compared to state-of-the-art baselines, establishing a strong foundation for scalable and label-efficient anomaly detection across different domains.
Conclusion: OWLEYE successfully addresses the challenge of cross-domain graph anomaly detection by developing a zero-shot framework that can detect anomalies in unseen graphs without retraining, overcoming the limitations of varying feature semantics and dimensions across different graph domains.
Abstract: Graph data is informative to represent complex relationships such as transactions between accounts, communications between devices, and dependencies among machines or processes. Correspondingly, graph anomaly detection (GAD) plays a critical role in identifying anomalies across various domains, including finance, cybersecurity, manufacturing, etc. Facing the large-volume and multi-domain graph data, nascent efforts attempt to develop foundational generalist models capable of detecting anomalies in unseen graphs without retraining. To the best of our knowledge, the different feature semantics and dimensions of cross-domain graph data heavily hinder the development of the graph foundation model, leaving further in-depth continual learning and inference capabilities a quite open problem. Hence, we propose OWLEYE, a novel zero-shot GAD framework that learns transferable patterns of normal behavior from multiple graphs, with a threefold contribution. First, OWLEYE proposes a cross-domain feature alignment module to harmonize feature distributions, which preserves domain-specific semantics during alignment. Second, with aligned features, to enable continuous learning capabilities, OWLEYE designs the multi-domain multi-pattern dictionary learning to encode shared structural and attribute-based patterns. Third, for achieving the in-context learning ability, OWLEYE develops a truncated attention-based reconstruction module to robustly detect anomalies without requiring labeled data for unseen graph-structured data. Extensive experiments on real-world datasets demonstrate that OWLEYE achieves superior performance and generalizability compared to state-of-the-art baselines, establishing a strong foundation for scalable and label-efficient anomaly detection.
[357] TinyTorch: Building Machine Learning Systems from First Principles
Vijay Janapa Reddi
Main category: cs.LG
TL;DR: TinyTorch is a 20-module curriculum teaching ML systems engineering by having students implement PyTorch core components in pure Python, focusing on practical systems awareness alongside algorithms.
Details
Motivation: Current ML education separates algorithms from systems, leaving graduates unprepared for real production debugging and creating a gap between research and reliable deployment.
Method: Students implement PyTorch core components (tensors, autograd, optimizers, neural networks) in pure Python using three pedagogical principles: progressive disclosure, systems-first integration, and historical milestone validation.
Result: TinyTorch requires only a laptop with 4GB RAM and no GPU, making ML systems education accessible worldwide while enabling students to recreate key ML breakthroughs from Perceptron to Transformers.
Conclusion: TinyTorch aims to prepare AI engineers who understand not only what ML systems do, but why they work and how to make them scale, bridging the gap between ML research and practical deployment.
Abstract: Machine learning systems engineering requires a deep understanding of framework internals. Yet most current education separates algorithms from systems. Students learn gradient descent without measuring memory usage, and attention mechanisms without profiling computational cost. This split leaves graduates unprepared to debug real production failures and widens the gap between machine learning research and reliable deployment. We present TinyTorch, a 20-module curriculum in which students implement the core components of PyTorch, including tensors, autograd, optimizers, and neural networks, entirely in pure Python. The curriculum is built around three pedagogical principles. Progressive disclosure gradually introduces complexity as students build confidence. Systems-first integration embeds memory and performance awareness from the very beginning. Historical milestone validation guides students to recreate key breakthroughs, from the Perceptron in 1958 to modern Transformers, using only code they have written themselves. TinyTorch requires only a laptop with 4GB of RAM and no GPU, making machine learning systems education accessible worldwide. Its goal is to prepare the next generation of AI engineers, practitioners who understand not only what machine learning systems do, but why they work and how to make them scale. The curriculum is available as open source at mlsysbook.ai/tinytorch.
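In the spirit of the curriculum's autograd module, here is the kind of from-scratch reverse-mode scalar a student might build; this is a generic micrograd-style illustration, not TinyTorch's actual code.

```python
class Value:
    """A scalar that records its computation graph for reverse-mode autodiff."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad           # d(a+b)/da = 1
            other.grad += out.grad          # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):  # reverse topological order
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x      # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
assert (x.grad, y.grad) == (4.0, 2.0)
```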
[358] Native LLM and MLLM Inference at Scale on Apple Silicon
Wayner Barrios
Main category: cs.LG
TL;DR: vllm-mlx is a native Apple Silicon inference framework built on MLX that outperforms existing tools for both text and multimodal models through continuous batching and content-based prefix caching.
Details
Motivation: Apple Silicon's growing adoption for ML development creates demand for efficient inference solutions, but existing tools lack native optimization (PyTorch MPS) or focus only on text models (llama.cpp), leaving multimodal workloads underserved.
Method: Built natively on MLX for Apple Silicon. For text models: continuous batching that scales throughput with concurrent requests. For multimodal models: content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing regardless of input format.
Result: 21-87% higher throughput than llama.cpp across models from Qwen3-0.6B to Nemotron-30B; 4.3x aggregate throughput at 16 concurrent requests; up to 525 tokens/sec on text models; 28x speedup on repeated image queries (latency reduced from 21.7s to <1s); 24.7x cache speedup on video analysis with up to 64 frames.
Conclusion: vllm-mlx provides efficient LLM and MLLM inference on Apple Silicon with significant performance improvements over existing solutions, particularly for multimodal workloads through innovative caching techniques, and is released as open source.
Abstract: The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models (llama.cpp), leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than llama.cpp across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3x aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max demonstrates throughput of up to 525 tokens per second on text models and 28x speedup on repeated image queries, reducing multimodal latency from 21.7 seconds to under 1 second. Video analysis with up to 64 frames achieves 24.7x cache speedup. We release our implementation as open source to support efficient inference on consumer Apple hardware.
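The content-based prefix cache reduces to hashing decoded image bytes so that the same image hits the cache whether it arrives as a file, URL, or base64 payload. A sketch of the mechanism; the class, `encode_fn`, and method names are assumptions, not vllm-mlx's API.

```python
import hashlib

class VisionPrefixCache:
    """Cache vision-encoder outputs keyed by a hash of decoded image bytes."""
    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # image bytes -> vision embeddings (assumed)
        self._cache = {}

    def encode(self, pixel_bytes):
        key = hashlib.sha256(pixel_bytes).hexdigest()  # format-agnostic content hash
        if key not in self._cache:                     # pay the encoding cost only once
            self._cache[key] = self.encode_fn(pixel_bytes)
        return self._cache[key]
```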
[359] GPCR-Filter: a deep learning framework for efficient and precise GPCR modulator discovery
Jingjie Ning, Xiangzhen Shen, Li Hou, Shiyi Shen, Jiahao Yang, Junrui Li, Hong Shan, Sanan Wu, Sihan Gao, Huaqiang Eric Xu, Xinheng He
Main category: cs.LG
TL;DR: GPCR-Filter is a deep learning framework that combines protein language models and graph neural networks to predict GPCR-ligand interactions, outperforming existing methods and identifying novel agonists.
Details
Motivation: GPCRs are crucial pharmacological targets but modulator discovery is challenging due to complex allosteric effects and limitations of conventional assays which are slow, costly, and not optimized for capturing dynamic interactions.
Method: Developed GPCR-Filter framework that integrates ESM-3 protein language model for GPCR sequence representations with graph neural networks for ligand structures, using attention-based fusion to learn receptor-ligand functional relationships. Trained on a high-quality dataset of over 90,000 experimentally validated GPCR-ligand pairs.
Result: Outperformed state-of-the-art compound-protein interaction models across multiple evaluation settings, demonstrated strong generalization to unseen receptors and ligands, and successfully identified micromolar-level agonists of the 5-HT1A receptor with distinct chemical frameworks.
Conclusion: GPCR-Filter establishes a scalable and effective computational approach for GPCR modulator discovery, advancing AI-assisted drug development for complex signaling systems.
Abstract: G protein-coupled receptors (GPCRs) govern diverse physiological processes and are central to modern pharmacology. Yet discovering GPCR modulators remains challenging because receptor activation often arises from complex allosteric effects rather than direct binding affinity, and conventional assays are slow, costly, and not optimized for capturing these dynamics. Here we present GPCR-Filter, a deep learning framework specifically developed for GPCR modulator discovery. We assembled a high-quality dataset of over 90,000 experimentally validated GPCR-ligand pairs, providing a robust foundation for training and evaluation. GPCR-Filter integrates the ESM-3 protein language model for high-fidelity GPCR sequence representations with graph neural networks that encode ligand structures, coupled through an attention-based fusion mechanism that learns receptor-ligand functional relationships. Across multiple evaluation settings, GPCR-Filter consistently outperforms state-of-the-art compound-protein interaction models and exhibits strong generalization to unseen receptors and ligands. Notably, the model successfully identified micromolar-level agonists of the 5-HT1A receptor with distinct chemical frameworks. These results establish GPCR-Filter as a scalable and effective computational approach for GPCR modulator discovery, advancing AI-assisted drug development for complex signaling systems.
[360] A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction
Jinkyu Sung, Myunggeum Jee, Joonseok Lee
Main category: cs.LG
TL;DR: Proposes a scalable Gaussian copula-based method for link sign prediction on signed graphs that efficiently models edge-edge dependencies while maintaining competitive performance.
Details
Motivation: Traditional graph methods fail on signed graphs because negative edges violate the homophily assumption, and existing approaches require auxiliary structures. Direct modeling of edge dependencies with Gaussian copula is computationally intractable for moderate-scale graphs.
Method: Extends CopulaGNN by: 1) representing the correlation matrix as a Gramian of edge embeddings to reduce parameters, and 2) reformulating the conditional probability distribution to dramatically reduce inference cost.
Result: The method achieves significantly faster convergence than baselines while maintaining prediction performance competitive with state-of-the-art models. Theoretical analysis proves linear convergence, verifying scalability.
Conclusion: The proposed approach provides an efficient and scalable solution for link sign prediction on signed graphs by addressing computational challenges of direct edge dependency modeling while preserving prediction accuracy.
Abstract: Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN. However, a naive modeling of edge-edge relations is computationally intractable even for a graph with moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, maintaining competitive prediction performance to the state-of-the-art models.
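The Gramian trick is compact enough to write down: parameterize the m × m edge-edge correlation matrix through m low-dimensional edge embeddings, giving O(md) parameters and positive semi-definiteness by construction. A sketch of the parameterization only; the conditional-probability reformulation is the paper's other contribution and is not reproduced here.

```python
import torch

def gramian_correlation(edge_emb, eps=1e-3):
    """Correlation matrix from edge embeddings of shape (m, d): PSD by construction."""
    m = edge_emb.shape[0]
    cov = edge_emb @ edge_emb.T + eps * torch.eye(m)  # regularized Gramian
    d = torch.rsqrt(torch.diag(cov))
    return d[:, None] * cov * d[None, :]              # rescale to unit diagonal
```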
[361] Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder
Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen
Main category: cs.LG
TL;DR: Proposes an autoencoder framework with non-uniform variance regularization and an isometric constraint to generalize PCA for nonlinear dimensionality reduction while preserving ordered representations and variance retention.
Details
Motivation: Linear autoencoders can recover PCA's ordered principal components with regularization, but these approaches fail in nonlinear settings where remaining variance cannot be properly captured independently of nonlinear mapping.
Method: Integrates non-uniform variance regularization with an isometric constraint in autoencoder framework, serving as natural generalization of PCA for nonlinear dimensionality reduction.
Result: The proposed framework preserves key PCA advantages like ordered representations and variance retention while remaining effective for nonlinear dimensionality reduction tasks.
Conclusion: The novel autoencoder framework successfully generalizes PCA to nonlinear settings by combining non-uniform variance regularization with isometric constraints, maintaining desirable PCA properties while handling nonlinear mappings.
Abstract: Autoencoders have long been considered a nonlinear extension of Principal Component Analysis (PCA). Prior studies have demonstrated that linear autoencoders (LAEs) can recover the ordered, axis-aligned principal components of PCA by incorporating non-uniform $\ell_2$ regularization or by adjusting the loss function. However, these approaches become insufficient in the nonlinear setting, as the remaining variance cannot be properly captured independently of the nonlinear mapping. In this work, we propose a novel autoencoder framework that integrates non-uniform variance regularization with an isometric constraint. This design serves as a natural generalization of PCA, enabling the model to preserve key advantages, such as ordered representations and variance retention, while remaining effective for nonlinear dimensionality reduction tasks.
[362] Foresight Learning for SEC Risk Prediction
Benjamin Turtel, Paul Wilczewski, Danny Franklin, Kris Skotheim
Main category: cs.LG
TL;DR: Automated pipeline converts SEC risk disclosures into quantified probability estimates using only public data, training a compact LLM that outperforms larger models on risk materialization prediction.
Details
Motivation: SEC risk disclosures are qualitative and lack probability quantification, limiting their usefulness for probabilistic analysis. There's no large-scale supervision linking disclosed risks to actual outcomes.
Method: Fully automated pipeline converts SEC risk disclosures into firm-specific, time-bounded risk queries, labels them by resolving outcomes against subsequent disclosures, and trains a compact LLM to estimate risk materialization probabilities.
Result: The compact model substantially improves over pretrained/heuristic baselines and outperforms frontier models like GPT-5 on probabilistic accuracy and calibration, while being deployable on a single GPU.
Conclusion: Foresight Learning enables scalable, automated training of domain-specific expert models using only raw chronological text, achieving frontier performance without proprietary data or manual annotation.
Abstract: Risk disclosures in SEC filings describe potential adverse events but rarely quantify their likelihood, limiting their usefulness for probabilistic analysis. A central obstacle is the absence of large-scale, risk-level supervision linking disclosed risks to realized outcomes. We introduce a fully automated data generation pipeline that converts qualitative SEC risk disclosures into temporally grounded supervision using only public data. For each filing, the pipeline generates firm-specific, time-bounded risk queries from the Risk Factors section and labels them by automatically resolving outcomes against subsequent disclosures. Using this dataset of risk queries and outcomes grounded in SEC filings, we train a compact large language model to estimate the probability that a disclosed risk will materialize within a specified horizon. Despite its modest size, the resulting model substantially improves over pretrained and heuristic baselines, and outperforms frontier general-purpose models, including GPT-5, on probabilistic accuracy and calibration. More broadly, this work demonstrates that Foresight Learning enables scalable and fully automated training of domain-specific expert models using only raw, chronological, in-domain text – without proprietary data, external corpora, or manual annotation. The resulting models achieve frontier-level performance while remaining deployable on a single GPU. This result suggests a general pathway for learning calibrated, decision-relevant signals from naturally occurring enterprise documents. To support transparency and reproducibility, we open-source the evaluation dataset used in this study. Evaluation Data: https://huggingface.co/datasets/LightningRodLabs/sec_risk_questions_test_set Data Generation Platform: https://lightningrod.ai/ SDK: https://github.com/lightning-rod-labs/lightningrod-python-sdk
[363] Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning
Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu
Main category: cs.LG
TL;DR: Multi-Adversary GDRO framework dynamically adapts training distribution for LLM reasoning, using difficulty classification and compute-neutral rollout allocation to improve performance on hard problems.
Details
Motivation: Standard RL paradigms for LLM reasoning use uniform prompt sampling and fixed rollouts, which is inefficient for heterogeneous, heavy-tailed reasoning data - wasting compute on easy problems while under-training hard ones.
Method: Proposes Multi-Adversary GDRO with: 1) Online Difficulty Classifier partitioning prompts into dynamic pass@k groups; 2) Prompt-GDRO using EMA-debiased multiplicative-weights bandit sampler to target difficulty margin; 3) Rollout-GDRO using shadow-price controller to reallocate rollouts across groups under fixed compute budget.
Result: On DAPO 14.1k dataset with Qwen3-Base models, Prompt-GDRO and Rollout-GDRO achieve +10.6% and +10.1% average relative gains in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to GRPO baseline.
Conclusion: The framework enables emergent curriculum learning where adversaries shift resources to the evolving reasoning frontier, enhancing reasoning model performance through dynamic adaptation to difficulty distribution.
Abstract: Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model’s performance.
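A sketch of the Prompt-GDRO controller's flavor: a multiplicative-weights sampler over difficulty groups, where persistently hard (high-loss) groups gain sampling mass and an EMA smooths noisy per-group feedback. The learning rate, EMA decay, and exact debiasing are illustrative, not the paper's update rule.

```python
import numpy as np

class GroupMWSampler:
    """Multiplicative-weights sampling over difficulty groups."""
    def __init__(self, n_groups, eta=0.1, ema=0.9):
        self.w = np.ones(n_groups)
        self.loss_ema = np.zeros(n_groups)
        self.eta, self.ema = eta, ema

    def probs(self):
        return self.w / self.w.sum()

    def sample(self, rng):
        return rng.choice(len(self.w), p=self.probs())

    def update(self, group, loss):
        self.loss_ema[group] = self.ema * self.loss_ema[group] + (1 - self.ema) * loss
        self.w[group] *= np.exp(self.eta * self.loss_ema[group])  # harder groups gain mass
```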
[364] Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization
Dai Hai Nguyen, Duc Dung Nguyen, Atsuyoshi Nakamura, Hiroshi Mamitsuka
Main category: cs.LG
TL;DR: A-MWGraD: An accelerated variant of Multiple Wasserstein Gradient Descent (MWGraD) for multi-objective optimization over probability distributions, achieving faster convergence rates through Nesterov-inspired acceleration.
Details
Motivation: To improve upon the convergence speed of existing multi-objective optimization methods in Wasserstein space, particularly MWGraD which has O(1/t) convergence rate for geodesically convex objectives.
Method: Proposes A-MWGraD, an accelerated variant inspired by Nesterov’s acceleration, with continuous-time dynamics analysis and practical kernel-based discretization for implementation.
Result: A-MWGraD achieves O(1/t²) convergence for geodesically convex objectives and O(e^{-√βt}) for β-strongly geodesically convex objectives, outperforming MWGraD’s O(1/t) rate. Numerical experiments show better convergence speed and sampling efficiency.
Conclusion: The accelerated A-MWGraD algorithm provides significant improvements in convergence rates over MWGraD for multi-objective optimization in Wasserstein space, with practical implementation through kernel-based discretization.
Abstract: We study multi-objective optimization over probability distributions in Wasserstein space. Recently, Nguyen et al. (2025) introduced the Multiple Wasserstein Gradient Descent (MWGraD) algorithm, which exploits the geometric structure of Wasserstein space to jointly optimize multiple objectives. Building on this approach, we propose an accelerated variant, A-MWGraD, inspired by Nesterov’s acceleration. We analyze the continuous-time dynamics and establish convergence to weakly Pareto optimal points in probability space. Our theoretical results show that A-MWGraD achieves a convergence rate of $O(1/t^2)$ for geodesically convex objectives and $O(e^{-\sqrt{\beta}\,t})$ for $\beta$-strongly geodesically convex objectives, improving upon the $O(1/t)$ rate of MWGraD in the geodesically convex setting. We further introduce a practical kernel-based discretization for A-MWGraD and demonstrate through numerical experiments that it consistently outperforms MWGraD in convergence speed and sampling efficiency on multi-target sampling tasks.
[365] Structure-based RNA Design by Step-wise Optimization of Latent Diffusion Model
Qi Si, Xuyang Liu, Penglei Wang, Xin Guo, Yuan Qi, Yuan Cheng
Main category: cs.LG
TL;DR: SOLD: A reinforcement learning framework that optimizes latent diffusion models for RNA inverse folding, achieving superior structural accuracy across multiple non-differentiable objectives.
Details
Motivation: Current RNA inverse folding methods focus on sequence recovery but struggle with structural objectives like secondary structure consistency, minimum free energy, and LDDT, leading to suboptimal structural accuracy. Existing approaches, including diffusion-based methods, cannot effectively handle non-differentiable structural objectives.
Method: Proposes SOLD (Step-wise Optimization of Latent Diffusion Model), a reinforcement learning framework integrated with a latent diffusion model. Uses RNA-FM embeddings to capture co-evolutionary patterns, and employs RL to optimize single-step noise without sampling full diffusion trajectories, enabling efficient refinement of multiple structural objectives.
Result: SOLD surpasses its LDM baseline and state-of-the-art methods across all metrics, demonstrating superior performance in RNA inverse folding with improved structural accuracy.
Conclusion: SOLD establishes a robust framework for RNA inverse folding that effectively handles non-differentiable structural objectives, with profound implications for biotechnological and therapeutic applications.
Abstract: RNA inverse folding, designing sequences to form specific 3D structures, is critical for therapeutics, gene regulation, and synthetic biology. Current methods, focused on sequence recovery, struggle to address structural objectives like secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), leading to suboptimal structural accuracy. To tackle this, we propose a reinforcement learning (RL) framework integrated with a latent diffusion model (LDM). Drawing inspiration from the success of diffusion models in RNA inverse folding, which adeptly model complex sequence-structure interactions, we develop an LDM incorporating pre-trained RNA-FM embeddings from a large-scale RNA model. These embeddings capture co-evolutionary patterns, markedly improving sequence recovery accuracy. However, existing approaches, including diffusion-based methods, cannot effectively handle non-differentiable structural objectives. By contrast, RL excels in this task by using policy-driven reward optimization to navigate complex, non-gradient-based objectives, offering a significant advantage over traditional methods. In summary, we propose the Step-wise Optimization of Latent Diffusion Model (SOLD), a novel RL framework that optimizes single-step noise without sampling the full diffusion trajectory, achieving efficient refinement of multiple structural objectives. Experimental results demonstrate SOLD surpasses its LDM baseline and state-of-the-art methods across all metrics, establishing a robust framework for RNA inverse folding with profound implications for biotechnological and therapeutic applications.
[366] Contrast-Source-Based Physics-Driven Neural Network for Inverse Scattering Problems
Yutong Du, Zicheng Liu
Main category: cs.LG
TL;DR: Proposes CSPDNN, a physics-driven neural network for inverse scattering that predicts induced currents with adaptive total variation loss for efficient and robust reconstruction under varying conditions.
Details
Motivation: Supervised DNNs for inverse scattering require large datasets limiting generalization, while untrained neural networks (UNNs) have long inference times. Need for efficient, robust solvers that don't require large datasets.
Method: Contrast-source-based physics-driven neural network (CSPDNN) that predicts induced current distribution rather than directly reconstructing contrast. Incorporates adaptive total variation loss for robustness under varying contrast and noise conditions.
Result: Improved imaging performance validated through comprehensive numerical simulations and experimental data. Method achieves efficient reconstruction without requiring large datasets.
Conclusion: CSPDNN provides an effective solution for inverse scattering problems by combining physics-driven neural networks with contrast-source formulation and adaptive regularization, overcoming limitations of both supervised DNNs and traditional UNNs.
Abstract: Deep neural networks (DNNs) have recently been applied to inverse scattering problems (ISPs) due to their strong nonlinear mapping capabilities. However, supervised DNN solvers require large-scale datasets, which limits their generalization in practical applications. Untrained neural networks (UNNs) address this issue by updating weights from measured electric fields and prior physical knowledge, but existing UNN solvers suffer from long inference time. To overcome these limitations, this paper proposes a contrast-source-based physics-driven neural network (CSPDNN), which predicts the induced current distribution to improve efficiency and incorporates an adaptive total variation loss for robust reconstruction under varying contrast and noise conditions. The improved imaging performance is validated through comprehensive numerical simulations and experimental data.
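The total-variation piece of the loss is standard and easy to sketch; `weight` below stands in for the adaptive coefficient, whose schedule the paper defines and which is not reproduced here.

```python
import torch

def tv_penalty(field, weight=1.0):
    """Anisotropic total-variation penalty on a 2-D map (e.g., induced currents)."""
    dy = (field[1:, :] - field[:-1, :]).abs().mean()  # vertical differences
    dx = (field[:, 1:] - field[:, :-1]).abs().mean()  # horizontal differences
    return weight * (dx + dy)
```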
[367] LLM-Assisted Logic Rule Learning: Scaling Human Expertise for Time Series Anomaly Detection
Haoting Zhang, Shekhar Jain
Main category: cs.LG
TL;DR: LLM-powered framework converts human expertise into interpretable logic rules for supply chain time series anomaly detection, outperforming unsupervised methods while maintaining low latency and cost.
Details
Motivation: Classical unsupervised anomaly detection yields results misaligned with business requirements, while manual expert analysis doesn't scale to millions of products in supply chains.
Method: Three-stage framework: 1) LLM-based labeling using domain knowledge, 2) automated generation and iterative improvement of symbolic rules via LLM-driven optimization, 3) rule augmentation with business-relevant anomaly categories using LLMs.
Result: Outperforms unsupervised learning methods in both detection accuracy and interpretability; provides consistent, deterministic results with low computational latency and cost compared to direct LLM deployment.
Conclusion: LLMs can bridge the gap between scalable automation and expert-driven decision-making in operational settings for supply chain anomaly detection.
Abstract: Time series anomaly detection is critical for supply chain management to take proactive operations, but faces challenges: classical unsupervised anomaly detection based on exploiting data patterns often yields results misaligned with business requirements and domain knowledge, while manual expert analysis cannot scale to millions of products in the supply chain. We propose a framework that leverages large language models (LLMs) to systematically encode human expertise into interpretable, logic-based rules for detecting anomaly patterns in supply chain time series data. Our approach operates in three stages: 1) LLM-based labeling of training data instructed by domain knowledge, 2) automated generation and iterative improvements of symbolic rules through LLM-driven optimization, and 3) rule augmentation with business-relevant anomaly categories supported by LLMs to enhance interpretability. The experiment results showcase that our approach outperforms the unsupervised learning methods in both detection accuracy and interpretability. Furthermore, compared to direct LLM deployment for time series anomaly detection, our approach provides consistent, deterministic results with low computational latency and cost, making it ideal for production deployment. The proposed framework thus demonstrates how LLMs can bridge the gap between scalable automation and expert-driven decision-making in operational settings.
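For concreteness, an interpretable symbolic rule of the kind the pipeline generates and iteratively refines might look like the sketch below; the rule, window, and threshold are invented for illustration.

```python
import numpy as np

def demand_drop_rule(series, window=7, drop_frac=0.5):
    """Flag an anomaly when the trailing week's mean falls below half the prior week's."""
    recent = np.mean(series[-window:])
    prior = np.mean(series[-2 * window:-window])
    return bool(prior > 0 and recent < drop_frac * prior)
```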
[368] Explicit Multi-head Attention for Inter-head Interaction in Large Language Models
Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang, Qipeng Guo, Xipeng Qiu
Main category: cs.LG
TL;DR: MEA (Multi-head Explicit Attention) is a Transformer attention variant that explicitly models cross-head interaction through learnable linear combinations of key/value vectors across heads, enabling faster convergence, better performance, and 50% KV-cache compression with minimal accuracy loss.
Details
Motivation: Recent studies show that inter-head interaction in Transformer attention heads can enhance attention performance, motivating the development of an attention mechanism that explicitly models cross-head communication.
Method: MEA consists of two components: 1) Head-level Linear Composition (HLC) module that applies learnable linear combinations to key and value vectors across heads for inter-head communication, and 2) head-level Group Normalization to align statistical properties of recombined heads.
Result: MEA shows strong pretraining robustness allowing larger learning rates and faster convergence, leading to lower validation loss and improved task performance. It enables 50% KV-cache compression with negligible performance loss on knowledge-intensive/scientific reasoning tasks and only 3.59% accuracy drop on Olympiad-level math benchmarks.
Conclusion: MEA is a simple yet effective attention variant that explicitly models cross-head interaction, offering improved training efficiency, better performance, and practical KV-cache compression capabilities for memory-efficient inference.
Abstract: In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank “virtual heads”. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.
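The Head-level Linear Composition step amounts to mixing key (or value) tensors across the head dimension with a learnable H × H matrix, as sketched below; initialization, placement, and the accompanying head-level Group Normalization are left out, and details may differ from MEA's implementation.

```python
import torch
import torch.nn as nn

class HeadLinearComposition(nn.Module):
    """Mix per-head key/value tensors with a learnable H x H matrix."""
    def __init__(self, n_heads):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_heads))  # identity init: no mixing at start

    def forward(self, x):  # x: (batch, n_heads, seq, head_dim)
        # out[b, g, s, d] = sum_h mix[g, h] * x[b, h, s, d]
        return torch.einsum('gh,bhsd->bgsd', self.mix, x)
```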
[369] E-QRGMM: Efficient Generative Metamodeling for Covariate-Dependent Uncertainty Quantification
Zhiyang Liang, Qingkai Zhang
Main category: cs.LG
TL;DR: E-QRGMM accelerates quantile-regression-based generative metamodeling using cubic Hermite interpolation and gradient estimation, reducing grid complexity from O(n^{1/2}) to O(n^{1/5}) while preserving convergence rates for covariate-dependent uncertainty quantification.
Details
Motivation: Existing methods like conformal prediction and classical bootstrap struggle with covariate-specific conditioning for uncertainty quantification in simulation-based inference, which is crucial for high-stakes decision-making.
Method: Efficient Quantile-Regression-Based Generative Metamodeling (E-QRGMM) integrates cubic Hermite interpolation with gradient estimation to accelerate the original QRGMM approach while maintaining theoretical guarantees.
Result: Theoretically, E-QRGMM preserves QRGMM’s convergence rate while reducing grid complexity from O(n^{1/2}) to O(n^{1/5}) for most quantile levels. Empirically, it achieves superior trade-off between distributional accuracy and training speed compared to QRGMM and other deep generative models.
Conclusion: E-QRGMM provides a practical solution for covariate-dependent uncertainty quantification by enabling bootstrap-based confidence intervals for arbitrary estimands, substantially improving computational efficiency while maintaining accuracy.
Abstract: Covariate-dependent uncertainty quantification in simulation-based inference is crucial for high-stakes decision-making but remains challenging due to the limitations of existing methods such as conformal prediction and classical bootstrap, which struggle with covariate-specific conditioning. We propose Efficient Quantile-Regression-Based Generative Metamodeling (E-QRGMM), a novel framework that accelerates the quantile-regression-based generative metamodeling (QRGMM) approach by integrating cubic Hermite interpolation with gradient estimation. Theoretically, we show that E-QRGMM preserves the convergence rate of the original QRGMM while reducing grid complexity from $O(n^{1/2})$ to $O(n^{1/5})$ for the majority of quantile levels, thereby substantially improving computational efficiency. Empirically, E-QRGMM achieves a superior trade-off between distributional accuracy and training speed compared to both QRGMM and other advanced deep generative models on synthetic and practical datasets. Moreover, by enabling bootstrap-based construction of confidence intervals for arbitrary estimands of interest, E-QRGMM provides a practical solution for covariate-dependent uncertainty quantification.
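The interpolation step is concrete: fit the conditional quantile function on a coarse grid of quantile levels using estimated values and derivatives (cubic Hermite), then generate samples by evaluating the spline at uniform draws. A SciPy sketch, where the grid, values, and gradients are assumed to come from the fitted quantile-regression model:

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def make_generator(taus, q_vals, q_grads):
    """Hermite-interpolated inverse CDF: taus is an increasing grid in (0, 1),
    q_vals the estimated quantiles, q_grads their derivatives w.r.t. tau."""
    inv_cdf = CubicHermiteSpline(taus, q_vals, q_grads)
    def sample(n, rng=None):
        rng = rng or np.random.default_rng()
        return inv_cdf(rng.uniform(size=n))  # inverse-transform sampling
    return sample
```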
[370] Decoupled Split Learning via Auxiliary Loss
Anower Zihad, Felix Owino, Haibo Yang, Ming Tang, Chao Huang
Main category: cs.LG
TL;DR: Split learning with local loss signals instead of backpropagation reduces communication by 50% and memory by up to 58% while maintaining performance.
Details
Motivation: Traditional split learning requires end-to-end backpropagation, which incurs large communication overhead (exchanging forward activations and backward gradients every iteration) and significant memory usage for storing activations and gradients.
Method: Client and server train their model partitions semi-independently using local loss signals. Client’s network has an auxiliary classifier at split point for local error signal, while server trains on client’s transmitted activations using true loss function. This decouples training and eliminates backward gradient transmission.
Result: Achieves performance on par with standard split learning using backpropagation. Reduces communication by 50% (transmitting activations/gradients) and peak memory usage by up to 58%.
Conclusion: Beyond-backpropagation training method for split learning effectively reduces communication and memory overhead while maintaining model performance, making split learning more practical for resource-constrained environments.
Abstract: Split learning is a distributed training paradigm where a neural network is partitioned between clients and a server, which allows data to remain at the client while only intermediate activations are shared. Traditional split learning relies on end-to-end backpropagation across the client-server split point. This incurs a large communication overhead (i.e., forward activations and backward gradients need to be exchanged every iteration) and significant memory use (for storing activations and gradients). In this paper, we develop a beyond-backpropagation training method for split learning. In this approach, the client and server train their model partitions semi-independently, using local loss signals instead of propagated gradients. In particular, the client’s network is augmented with a small auxiliary classifier at the split point to provide a local error signal, while the server trains on the client’s transmitted activations using the true loss function. This decoupling removes the need to send backward gradients, which cuts communication costs roughly in half and also reduces memory overhead (as each side only stores local activations for its own backward pass). We evaluate our approach on CIFAR-10 and CIFAR-100. Our experiments show two key results. First, the proposed approach achieves performance on par with standard split learning that uses backpropagation. Second, it significantly reduces communication (of transmitting activations/gradient) by 50% and peak memory usage by up to 58%.
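A single-process sketch of the decoupled step (in deployment, only the activations `h` would cross the network): the client updates through a small auxiliary classifier at the split point, while the server trains on detached activations, so no gradients travel back. Shapes, layers, and optimizers are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

client_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
aux_head = nn.Linear(256, 10)   # auxiliary classifier at the split point
server_net = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

opt_client = torch.optim.SGD([*client_net.parameters(), *aux_head.parameters()], lr=0.1)
opt_server = torch.optim.SGD(server_net.parameters(), lr=0.1)

def train_step(x, y):
    h = client_net(x)
    aux_loss = F.cross_entropy(aux_head(h), y)  # local error signal for the client
    opt_client.zero_grad(); aux_loss.backward(); opt_client.step()
    server_loss = F.cross_entropy(server_net(h.detach()), y)  # no gradient back to client
    opt_server.zero_grad(); server_loss.backward(); opt_server.step()
```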
[371] Neural Neural Scaling Laws
Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho
Main category: cs.LG
TL;DR: NeuNeu is a neural network that predicts downstream task performance scaling from validation perplexity and observed accuracy trajectories, achieving better accuracy than parametric scaling laws.
Details
Motivation: Existing scaling law predictions have limitations: aggregate validation loss obscures signal, and no simple parametric family can capture diverse scaling behaviors (monotonic improvement, plateau, degradation).
Method: NeuNeu frames scaling law prediction as time-series extrapolation, combining temporal context from observed accuracy trajectories with token-level validation losses, learning predictions without assuming any functional form.
Result: Achieves 2.04% mean absolute error on 66 downstream tasks (38% reduction vs logistic scaling laws), and generalizes zero-shot to unseen model families, parameter counts, and tasks.
Conclusion: Predicting downstream scaling laws directly from data outperforms parametric alternatives, suggesting data-driven approaches are superior for scaling law prediction.
Abstract: Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks – a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
[372] Smoothing the Score Function for Generalization in Diffusion Models: An Optimization-based Explanation Framework
Xinyu Zhou, Jiawei Zhang, Stephen J. Wright
Main category: cs.LG
TL;DR: The paper develops a theoretical framework explaining memorization in diffusion models, showing empirical score functions are weighted sums of Gaussian scores with sharp softmax weights, causing single-sample dominance and sampling collapse. It proposes two methods to enhance generalization: Noise Unconditioning and Temperature Smoothing.
Details
Motivation: Diffusion models face memorization issues where generated samples can exactly replicate training samples, which is a fundamental challenge that needs theoretical understanding and practical solutions to improve generalization while maintaining generation quality.
Method: 1) Develop theoretical framework showing empirical score function is weighted sum of Gaussian score functions with sharp softmax weights; 2) Propose Noise Unconditioning to adaptively adjust score function weights to prevent single-point dominance; 3) Propose Temperature Smoothing to control smoothness via softmax temperature parameter.
Result: Experiments across multiple datasets validate the theoretical analysis and demonstrate that both proposed methods effectively improve generalization while maintaining high generation quality, mitigating memorization issues in diffusion models.
Conclusion: The paper provides a theoretical explanation for memorization in diffusion models and offers practical solutions (Noise Unconditioning and Temperature Smoothing) that enhance generalization by preventing single-sample dominance in the score function, validated through comprehensive experiments.
Abstract: Diffusion models achieve remarkable generation quality, yet face a fundamental challenge known as memorization, where generated samples can replicate training samples exactly. We develop a theoretical framework to explain this phenomenon by showing that the empirical score function (the score function corresponding to the empirical distribution) is a weighted sum of the score functions of Gaussian distributions, in which the weights are sharp softmax functions. This structure causes individual training samples to dominate the score function, resulting in sampling collapse. In practice, approximating the empirical score function with a neural network can partially alleviate this issue and improve generalization. Our theoretical framework explains why: In training, the neural network learns a smoother approximation of the weighted sum, allowing the sampling process to be influenced by local manifolds rather than single points. Leveraging this insight, we propose two novel methods to further enhance generalization: (1) Noise Unconditioning enables each training sample to adaptively determine its score function weight to increase the effect of more training samples, thereby preventing single-point dominance and mitigating collapse. (2) Temperature Smoothing introduces an explicit parameter to control the smoothness. By increasing the temperature in the softmax weights, we naturally reduce the dominance of any single training sample and mitigate memorization. Experiments across multiple datasets validate our theoretical analysis and demonstrate the effectiveness of the proposed methods in improving generalization while maintaining high generation quality.
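The structure the analysis rests on is easy to state in code. Below is a hedged sketch (assumed, not the authors' implementation) of the empirical score as a softmax-weighted sum of per-sample Gaussian scores, with a temperature knob playing the role of Temperature Smoothing: raising tau flattens the softmax weights so no single training sample dominates.

```python
# Empirical score of a mixture of Gaussians N(x_i, sigma^2 I):
# grad log p(x) = sum_i w_i * (x_i - x) / sigma^2, with softmax weights
# w_i proportional to exp(-||x - x_i||^2 / (2 sigma^2)). A temperature tau
# in the logits smooths the weights, as in Temperature Smoothing.
import numpy as np

def empirical_score(x, data, sigma, tau=1.0):
    # data: (n, d) training samples; x: (d,) query point; sigma: noise level
    sq_dists = np.sum((data - x) ** 2, axis=1)       # ||x - x_i||^2
    logits = -sq_dists / (2.0 * sigma ** 2 * tau)     # tau > 1 smooths
    w = np.exp(logits - logits.max())
    w /= w.sum()                                      # softmax weights
    gaussian_scores = (data - x) / sigma ** 2         # per-sample scores
    return w @ gaussian_scores                        # weighted sum

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))
x = np.array([0.5, -0.2])
print(empirical_score(x, data, sigma=0.1))           # sharp: one sample dominates
print(empirical_score(x, data, sigma=0.1, tau=50))   # smoothed weights
```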
[373] Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
Chen Chen, Lai Wei
Main category: cs.LG
TL;DR: Keel replaces Post-LN’s ResNet residual path with Highway-style connections to prevent gradient vanishing, enabling stable training of 1000+ layer Transformers and outperforming Pre-LN in depth scaling.
Details
Motivation: Current LLM scaling faces limitations: width scaling has diminishing returns, context length extension doesn't improve expressivity, and depth scaling offers superior theoretical expressivity but current Transformers struggle with training stability at extreme depths. Post-LN was abandoned due to instability, but could offer better depth scaling if its gradient vanishing issues were solved.Method: Keel modifies the Post-LN Transformer architecture by replacing the ResNet-style residual pathway with a Highway-style connection. This preserves gradient flow through the residual branch, preventing signal vanishing from top to bottom layers. The approach requires no specialized initialization or complex optimization tricks.
Result: Keel enables stable training at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN Transformers. It demonstrates robust training without the instability issues that plagued original Post-LN.
Conclusion: Post-LN, when paired with Highway-style connections (as in Keel), provides a simple and effective foundation for building deeply scalable LLMs. This opens possibilities for future infinite-depth architectures and addresses the current limitations in LLM scaling.
Abstract: Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
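A sketch of the architectural change follows: a Post-LN block whose residual path is replaced by a Highway-style gated mix of the sublayer output and its input. The sigmoid gating parameterization is a common Highway form and an assumption here; Keel's exact formulation may differ.

```python
# Hedged sketch of a Post-LN block with a Highway-style connection
# in place of the plain x + F(x) residual path.
import torch
import torch.nn as nn

class HighwayPostLNBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # Highway transform gate
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        f, _ = self.attn(x, x, x)              # sublayer output F(x)
        t = torch.sigmoid(self.gate(x))        # gate in (0, 1)
        # Highway mix instead of plain addition; LayerNorm applied after (Post-LN)
        return self.norm(t * f + (1.0 - t) * x)

block = HighwayPostLNBlock(d_model=64)
x = torch.randn(2, 16, 64)
print(block(x).shape)  # torch.Size([2, 16, 64])
```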
[374] Process-Aware Procurement Lead Time Prediction for Shipyard Delay Mitigation
Yongjae Lee, Eunhee Park, Daesan Park, Dongho Kim, Jongho Choi, Hyerim Bae
Main category: cs.LG
TL;DR: Novel framework combining event logs and static attributes with deep sequential neural networks to predict procurement lead time in shipbuilding, reducing mean absolute error by 22.6-50.4% over the best existing methods.
Details
Motivation: Procurement lead time prediction is challenging in engineered-to-order industries like shipbuilding, where delays in critical components like pipe spools can disrupt entire project timelines. Traditional approaches focus only on static physical attributes, ignoring the dynamic, multi-stakeholder business process involving continuous sequences of internal and external events.Method: Proposes a framework combining event logs (procurement event records) with static attributes. Temporal attributes of each event are extracted to capture continuity and temporal context. Uses deep sequential neural network combined with multi-layered perceptron to integrate static and dynamic features, capturing both structural and contextual information.
Result: Experimental evaluation using real-world pipe spool procurement data from a major South Korean shipbuilding corporation. Three prediction tasks evaluated: production, post-processing, and procurement lead time prediction. Achieved 22.6% to 50.4% improvement in mean absolute error over best-performing existing approaches across all three tasks.
Conclusion: The results demonstrate the value of incorporating procurement process information (event logs) alongside static attributes for more accurate procurement lead time prediction. The proposed framework effectively captures the dynamic, multi-stakeholder nature of procurement processes in engineered-to-order industries.
Abstract: Accurately predicting procurement lead time (PLT) remains a challenge in engineered-to-order industries such as shipbuilding and plant construction, where delays in a single key component can disrupt project timelines. In shipyards, pipe spools are critical components; installed deep within hull blocks soon after steel erection, any delay in their procurement can halt all downstream tasks. Recognizing their importance, existing studies predict PLT using the static physical attributes of pipe spools. However, procurement is inherently a dynamic, multi-stakeholder business process involving a continuous sequence of internal and external events at the shipyard, factors often overlooked in traditional approaches. To address this issue, this paper proposes a novel framework that combines event logs, dataset records of the procurement events, with static attributes to predict PLT. The temporal attributes of each event are extracted to reflect the continuity and temporal context of the process. Subsequently, a deep sequential neural network combined with a multi-layered perceptron is employed to integrate these static and dynamic features, enabling the model to capture both structural and contextual information in procurement. Comparative experiments are conducted using real-world pipe spool procurement data from a globally renowned South Korean shipbuilding corporation. Three tasks are evaluated, which are production, post-processing, and procurement lead time prediction. The results show a 22.6% to 50.4% improvement in prediction performance in terms of mean absolute error over the best-performing existing approaches across the three tasks. These findings indicate the value of considering procurement process information for more accurate PLT prediction.
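A minimal sketch of the fusion design described above, with all names and dimensions assumed: a recurrent network summarizes the procurement event log while an MLP encodes the static pipe-spool attributes, and the two summaries are concatenated for regression.

```python
# Illustrative sketch: sequential model over event logs + MLP over static
# attributes, fused for lead-time regression. Not the paper's exact network.
import torch
import torch.nn as nn

class LeadTimePredictor(nn.Module):
    def __init__(self, event_dim: int, static_dim: int, hidden: int = 32):
        super().__init__()
        self.seq = nn.LSTM(event_dim, hidden, batch_first=True)
        self.static_mlp = nn.Sequential(nn.Linear(static_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # predicted lead time (e.g., days)

    def forward(self, events, static_attrs):
        # events: (batch, num_events, event_dim) incl. temporal attributes
        # static_attrs: (batch, static_dim) physical attributes
        _, (h, _) = self.seq(events)
        z = torch.cat([h[-1], self.static_mlp(static_attrs)], dim=-1)
        return self.head(z).squeeze(-1)

model = LeadTimePredictor(event_dim=6, static_dim=10)
print(model(torch.randn(4, 12, 6), torch.randn(4, 10)))
```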
[375] Queue Length Regret Bounds for Contextual Queueing Bandits
Seoungbin Bae, Garyeong Kang, Dabeen Lee
Main category: cs.LG
TL;DR: Contextual queueing bandits framework for scheduling with unknown service rates, using job features to match jobs with servers, achieving sublinear queue length regret.
Details
Motivation: Need to schedule jobs with heterogeneous contextual features while simultaneously learning unknown server-specific service rates, where service rates depend on job features via logistic models with unknown parameters.Method: Introduces contextual queueing bandits framework with two algorithms: CQB-ε for stochastic contexts (with ε-greedy exploration) and CQB-Opt for adversarial contexts. Uses policy-switching queues with coupling arguments and novel regret decomposition to handle queue state differences.
Result: CQB-ε achieves Õ(T^{-1/4}) regret for stochastic contexts, while CQB-Opt achieves O(log² T) regret for adversarial contexts. Experimental results validate the theoretical findings.
Conclusion: The framework successfully addresses the challenge of scheduling while learning unknown service rates in queueing systems with contextual features, providing both theoretical guarantees and empirical validation.
Abstract: We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.
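As a toy illustration of the setting (all modeling choices below are assumptions), service rates follow a logistic model of job features with unknown per-server parameters; the scheduler mixes epsilon-greedy exploration with greedy job-server matching and updates each server's estimate by online logistic regression.

```python
# Toy contextual queueing bandit with logistic service rates; a sketch of
# the CQB-epsilon idea, not the paper's algorithm or analysis.
import numpy as np

rng = np.random.default_rng(0)
d, n_servers, eps, lr = 4, 3, 0.1, 0.5
theta_true = rng.normal(size=(n_servers, d))   # unknown server parameters
theta_hat = np.zeros((n_servers, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

queue = [rng.normal(size=d) for _ in range(20)]  # job feature vectors
for t in range(500):
    if not queue:
        break
    if rng.random() < eps:                        # explore
        j, s = rng.integers(len(queue)), rng.integers(n_servers)
    else:                                         # exploit estimated rates
        rates = np.array([[sigmoid(theta_hat[s] @ x) for s in range(n_servers)]
                          for x in queue])
        j, s = np.unravel_index(rates.argmax(), rates.shape)
    x = queue[j]
    served = rng.random() < sigmoid(theta_true[s] @ x)   # Bernoulli departure
    # online logistic-regression gradient step on the chosen server
    theta_hat[s] += lr * (served - sigmoid(theta_hat[s] @ x)) * x
    if served:
        queue.pop(j)
print("remaining jobs:", len(queue))
```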
[376] LightSBB-M: Bridging Schrödinger and Bass for Generative Diffusion Modeling
Alexandre Alouadi, Pierre Henry-Labordère, Grégoire Loeper, Othmane Mazhar, Huyên Pham, Nizar Touzi
Main category: cs.LG
TL;DR: LightSBB-M is an efficient algorithm that computes optimal Schrodinger Bridge and Bass (SBB) transport plans in only a few iterations, outperforming state-of-the-art SB and diffusion baselines by up to 32% in 2-Wasserstein distance.
Details
Motivation: The Schrodinger Bridge and Bass (SBB) formulation extends classical Schrodinger Bridge by jointly controlling drift and volatility, but efficient computation of optimal SBB transport plans is needed.Method: LightSBB-M exploits dual representation of SBB objective to obtain analytic expressions for optimal drift and volatility, with tunable parameter beta interpolating between pure drift (SB) and pure volatility (Bass martingale transport).
Result: Achieves lowest 2-Wasserstein distance on synthetic datasets with up to 32% improvement over SB and diffusion baselines, and demonstrates generative capability on unpaired image-to-image translation (adult to child faces in FFHQ).
Conclusion: LightSBB-M provides scalable, high-fidelity SBB solver that outperforms existing SB and diffusion baselines across both synthetic and real-world generative tasks.
Abstract: The Schrodinger Bridge and Bass (SBB) formulation, which jointly controls drift and volatility, is an established extension of the classical Schrodinger Bridge (SB). Building on this framework, we introduce LightSBB-M, an algorithm that computes the optimal SBB transport plan in only a few iterations. The method exploits a dual representation of the SBB objective to obtain analytic expressions for the optimal drift and volatility, and it incorporates a tunable parameter beta greater than zero that interpolates between pure drift (the Schrodinger Bridge) and pure volatility (Bass martingale transport). We show that LightSBB-M achieves the lowest 2-Wasserstein distance on synthetic datasets against state-of-the-art SB and diffusion baselines with up to 32 percent improvement. We also illustrate the generative capability of the framework on an unpaired image-to-image translation task (adult to child faces in FFHQ). These findings demonstrate that LightSBB-M provides a scalable, high-fidelity SBB solver that outperforms existing SB and diffusion baselines across both synthetic and real-world generative tasks. The code is available at https://github.com/alexouadi/LightSBB-M.
[377] Generalizable IoT Traffic Representations for Cross-Network Device Identification
Arunan Sivanathan, David Warren, Deepak Mishra, Sushmita Ruj, Natasha Fernandes, Quan Z. Sheng, Minh Tran, Ben Luo, Daniel Coscia, Gustavo Batista, Hassan Habibi Gharakaheili
Main category: cs.LG
TL;DR: The paper proposes unsupervised encoder-decoder models to learn generalizable traffic representations for IoT device identification from unlabeled network flows, achieving high device-type classification performance with simple classifiers on frozen embeddings.
Details
Motivation: Existing IoT device identification approaches rely on supervised pipelines or task-specific fine-tuning, resulting in traffic representations that are tightly coupled to labeled datasets and deployment environments, limiting generalizability across different settings.Method: Develop compact encoder architectures that learn per-flow embeddings from unlabeled IoT traffic using unsupervised encoder-decoder models, then evaluate them using a frozen-encoder protocol with simple supervised classifiers on disjoint labeled subsets.
Result: Achieved macro F1-scores exceeding 0.9 for device-type classification using more than 18 million real IoT traffic flows collected across multiple years and deployment environments, demonstrating robustness under cross-environment deployment.
Conclusion: Unsupervised learning of traffic representations enables generalizable IoT device identification, with compact models performing comparably to larger ones, showing that larger models don’t necessarily yield more robust representations for IoT traffic.
Abstract: Machine learning models have demonstrated strong performance in classifying network traffic and identifying Internet-of-Things (IoT) devices, enabling operators to discover and manage IoT assets at scale. However, many existing approaches rely on end-to-end supervised pipelines or task-specific fine-tuning, resulting in traffic representations that are tightly coupled to labeled datasets and deployment environments, which can limit generalizability. In this paper, we study the problem of learning generalizable traffic representations for IoT device identification. We design compact encoder architectures that learn per-flow embeddings from unlabeled IoT traffic and evaluate them using a frozen-encoder protocol with a simple supervised classifier. Our specific contributions are threefold. (1) We develop unsupervised encoder–decoder models that learn compact traffic representations from unlabeled IoT network flows and assess their quality through reconstruction-based analysis. (2) We show that these learned representations can be used effectively for IoT device-type classification using simple, lightweight classifiers trained on frozen embeddings. (3) We provide a systematic benchmarking study against the state-of-the-art pretrained traffic encoders, showing that larger models do not necessarily yield more robust representations for IoT traffic. Using more than 18 million real IoT traffic flows collected across multiple years and deployment environments, we learn traffic representations from unlabeled data and evaluate device-type classification on disjoint labeled subsets, achieving macro F1-scores exceeding 0.9 for device-type classification and demonstrating robustness under cross-environment deployment.
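A minimal sketch of the frozen-encoder protocol, assuming per-flow statistical feature vectors: an encoder-decoder is pretrained on unlabeled flows with a reconstruction loss, the encoder is then frozen, and only a lightweight classifier is fit on the embeddings.

```python
# Sketch of unsupervised pretraining + frozen-encoder evaluation;
# feature dimensions and class count are assumptions.
import torch
import torch.nn as nn

flow_dim, emb_dim = 32, 8
encoder = nn.Sequential(nn.Linear(flow_dim, 16), nn.ReLU(), nn.Linear(16, emb_dim))
decoder = nn.Sequential(nn.Linear(emb_dim, 16), nn.ReLU(), nn.Linear(16, flow_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

unlabeled = torch.randn(1024, flow_dim)           # stand-in for IoT flow features
for _ in range(100):                              # reconstruction pretraining
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
    loss.backward()
    opt.step()

# Freeze the encoder; train only a lightweight linear classifier on top.
for p in encoder.parameters():
    p.requires_grad_(False)
clf = nn.Linear(emb_dim, 5)                       # 5 device types, assumed
labeled_x, labeled_y = torch.randn(256, flow_dim), torch.randint(0, 5, (256,))
clf_opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
for _ in range(100):
    clf_opt.zero_grad()
    logits = clf(encoder(labeled_x))
    nn.functional.cross_entropy(logits, labeled_y).backward()
    clf_opt.step()
```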
[378] StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths
Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron
Main category: cs.LG
TL;DR: StableQAT is a unified QAT framework that stabilizes ultra-low bitwidth training via a novel Fourier-based surrogate for backpropagation, generalizing STE with smooth, bounded gradients.
Details
Motivation: Existing QAT methods (STE-based or soft quantizers) suffer from gradient mismatch, instability, or high computational overhead, especially at ultra-low bitwidths (2-4 bits), making stable optimization challenging under memory/latency constraints.Method: Proposes StableQAT with a novel lightweight surrogate for backpropagation derived from discrete Fourier analysis of the rounding operator. This surrogate family strictly generalizes STE, providing smooth, bounded, inexpensive gradients that improve training stability.
Result: StableQAT demonstrates stable and efficient QAT at 2-4 bit regimes with improved training stability, robustness, and superior performance compared to standard QAT techniques, with negligible training overhead.
Conclusion: StableQAT provides a unified, efficient framework for stable quantization-aware training at ultra-low bitwidths, addressing key limitations of existing methods through a theoretically grounded Fourier-based surrogate approach.
Abstract: Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra-low bitwidths remains challenging. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2-4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.
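One plausible instance of a Fourier-derived surrogate, offered only as an illustration of the idea: the sawtooth x - round(x) has Fourier series sum_k (-1)^{k+1} sin(2πkx)/(πk), so truncating the series yields a smooth, bounded surrogate derivative for the rounding operator, and keeping zero terms recovers the plain STE (gradient 1). StableQAT's actual surrogate family and any damping may differ.

```python
# Hedged sketch of a Fourier-based surrogate gradient for rounding.
# Differentiating round(x) = x - sum_k (-1)^{k+1} sin(2*pi*k*x)/(pi*k)
# term by term and truncating at K terms gives the surrogate below;
# K = 0 reduces to the straight-through estimator.
import math
import torch

class FourierRound(torch.autograd.Function):
    K = 3  # number of retained Fourier terms (assumed hyperparameter)

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        g = torch.ones_like(x)  # STE baseline
        for k in range(1, FourierRound.K + 1):
            g = g - 2.0 * (-1.0) ** (k + 1) * torch.cos(2 * math.pi * k * x)
        return grad_out * g

x = torch.linspace(-1, 1, 9, requires_grad=True)
FourierRound.apply(x).sum().backward()
print(x.grad)  # smooth surrogate gradients instead of all-ones STE
```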
[379] Metric $k$-clustering using only Weak Comparison Oracles
Rahul Raychaudhury, Aryan Esmailpour, Sainyam Galhotra, Stavros Sintos
Main category: cs.LG
TL;DR: The paper presents clustering algorithms that use only relative distance comparisons (a quadruplet oracle) instead of exact distances, achieving constant-factor approximation with near-linear query complexity (up to polylog factors).
Details
Motivation: Classical clustering algorithms require exact pairwise distances, which is unrealistic in modern applications where distances may be unavailable or expensive to compute. Instead, relative comparisons (like those from learned models or human feedback) are more practical but noisy.Method: Develop randomized algorithms using only a noisy quadruplet oracle that provides relative distance comparisons. The approach works for arbitrary metric spaces and improves for spaces with bounded doubling dimension.
Result: Achieves O(n·k·polylog(n)) query complexity for arbitrary metrics, improving to O((n+k²)·polylog(n)) for bounded doubling dimension. For bounded doubling metrics, achieves (1+ε)-approximation with same asymptotic complexity.
Conclusion: The framework demonstrates how noisy, low-cost oracles (like those from large language models) can be systematically integrated into scalable clustering algorithms, providing practical solutions when exact distances are unavailable.
Abstract: Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances – an unrealistic requirement in many modern applications. We study clustering in the \emph{Rank-model (R-model)}, where access to distances is entirely replaced by a \emph{quadruplet oracle} that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \mathsf{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most constant times the optimum $k$-clustering cost. Our method achieves a query complexity of $O(n\cdot k \cdot \mathsf{polylog}(n))$ for arbitrary metric spaces and improves to $O((n+k^2) \cdot \mathsf{polylog}(n))$ when the underlying metric has bounded doubling dimension. When the metric has bounded doubling dimension we can further improve the approximation from constant to $1+\varepsilon$, for any arbitrarily small constant $\varepsilon\in(0,1)$, while preserving the same asymptotic query complexity. Our framework demonstrates how noisy, low-cost oracles, such as those derived from large language models, can be systematically integrated into scalable clustering algorithms.
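A toy illustration of the R-model interface (assumed, for intuition only): the quadruplet oracle answers whether d(a,b) < d(c,d) with some flip probability, repetition plus majority voting boosts reliability, and a center is assigned by a comparison-only tournament rather than by computing distances.

```python
# Noisy quadruplet oracle + majority voting + comparison-only assignment.
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(50, 2))   # hidden metric, never exposed directly

def quadruplet_oracle(a, b, c, d, flip_prob=0.1):
    truth = np.linalg.norm(points[a] - points[b]) < np.linalg.norm(points[c] - points[d])
    return truth if rng.random() > flip_prob else not truth

def closer(a, b, c, d, repeats=15):
    votes = sum(quadruplet_oracle(a, b, c, d) for _ in range(repeats))
    return votes > repeats // 2      # majority vote over noisy answers

def assign_center(item, centers):
    best = centers[0]
    for c in centers[1:]:            # comparison-only tournament
        if closer(item, c, item, best):
            best = c
    return best

centers = [0, 10, 20]
print([assign_center(i, centers) for i in range(5)])
```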
[380] From Observations to Events: Event-Aware World Model for Reinforcement Learning
Zhao-Han Peng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He
Main category: cs.LG
TL;DR: EAWM introduces an event-aware world model for MBRL that learns to segment observations into discrete events, improving generalization across structurally similar scenes and robustness to spurious variations.
Details
Motivation: Existing MBRL methods struggle with generalization across structurally similar scenes and are vulnerable to spurious variations like textures or color shifts. Humans segment continuous sensory streams into discrete events for decision-making, inspiring an event-based approach.Method: Proposes Event-Aware World Model (EAWM) with automated event generator and Generic Event Segmentor (GES) to identify event boundaries. Learns event-aware representations through event prediction and provides unified formulation of world model architectures.
Result: EAWM boosts performance of strong MBRL baselines by 10%-45% on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC-GB2 500K benchmarks, setting new state-of-the-art results.
Conclusion: Event-aware representations inspired by human cognitive processes significantly improve MBRL sample efficiency and generalization, with EAWM demonstrating broad applicability across diverse benchmarks.
Abstract: While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision-making. Motivated by this principle, we propose the Event-Aware World Model (EAWM), a general framework that learns event-aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio-temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, and DeepMind Control 500K, DMC-GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%-45%, setting new state-of-the-art results across benchmarks. Our code is released at https://github.com/MarquisDarwin/EAWM.
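For intuition, a toy version of generic event segmentation might place boundaries wherever consecutive latent states change sharply; the threshold rule below is an assumption and much simpler than the learned GES.

```python
# Toy event-boundary detection over a latent trajectory (threshold assumed).
import torch

def segment_events(latents, threshold=1.5):
    # latents: (T, d) encoded observations
    deltas = (latents[1:] - latents[:-1]).norm(dim=-1)
    boundaries = (deltas > threshold).nonzero().squeeze(-1) + 1
    return boundaries.tolist()

z = torch.randn(50, 16).mul(0.1).cumsum(dim=0)  # smooth trajectory
z[20:] += 5.0                                    # abrupt transition = new event
print(segment_events(z))                         # boundary near t = 20
```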
[381] Robust Uncertainty Estimation under Distribution Shift via Difference Reconstruction
Xinran Xu, Li Rong Wang, Xiuyi Fan
Main category: cs.LG
TL;DR: DRUE is a new uncertainty estimation method that reconstructs inputs from two intermediate layers and measures their discrepancy as uncertainty score, outperforming previous reconstruction-based approaches in OOD detection for medical imaging.
Details
Motivation: Existing uncertainty estimation methods that compare input samples with their reconstructions suffer from information loss and sensitivity to superficial details, limiting their effectiveness for reliable decision-making in high-stakes medical applications like glaucoma detection.Method: Difference Reconstruction Uncertainty Estimation (DRUE) reconstructs inputs from two intermediate layers of a model and measures the discrepancy between their outputs as the uncertainty score, avoiding direct comparison with original input.
Result: DRUE consistently achieves superior AUC and AUPR across multiple OOD datasets in glaucoma detection tasks, demonstrating robustness and reliability under distribution shift.
Conclusion: DRUE provides a principled and effective framework for enhancing model reliability in uncertain environments, particularly for medical imaging applications where accurate uncertainty estimation is critical.
Abstract: Estimating uncertainty in deep learning models is critical for reliable decision-making in high-stakes applications such as medical imaging. Prior research has established that the difference between an input sample and its reconstructed version produced by an auxiliary model can serve as a useful proxy for uncertainty. However, directly comparing reconstructions with the original input is degraded by information loss and sensitivity to superficial details, which limits its effectiveness. In this work, we propose Difference Reconstruction Uncertainty Estimation (DRUE), a method that mitigates this limitation by reconstructing inputs from two intermediate layers and measuring the discrepancy between their outputs as the uncertainty score. To evaluate uncertainty estimation in practice, we follow the widely used out-of-distribution (OOD) detection paradigm, where in-distribution (ID) training data are compared against datasets with increasing domain shift. Using glaucoma detection as the ID task, we demonstrate that DRUE consistently achieves superior AUC and AUPR across multiple OOD datasets, highlighting its robustness and reliability under distribution shift. This work provides a principled and effective framework for enhancing model reliability in uncertain environments.
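A simplified sketch of the core idea (architecture assumed): reconstruct the input from two different intermediate layers and score uncertainty as the discrepancy between the two reconstructions, never comparing against the original input directly.

```python
# Minimal DRUE-style uncertainty score; layer sizes are illustrative.
import torch
import torch.nn as nn

class DRUENet(nn.Module):
    def __init__(self, d_in: int = 64):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
        self.decoder1 = nn.Linear(32, d_in)   # reconstructs from layer 1
        self.decoder2 = nn.Linear(16, d_in)   # reconstructs from layer 2

    def uncertainty(self, x):
        h1 = self.layer1(x)
        h2 = self.layer2(h1)
        r1, r2 = self.decoder1(h1), self.decoder2(h2)
        # discrepancy between the two reconstructions, not vs. the input
        return (r1 - r2).pow(2).mean(dim=-1)

net = DRUENet()
print(net.uncertainty(torch.randn(4, 64)))  # expected higher under shift
```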
[382] GraphSB: Boosting Imbalanced Node Classification on Graphs through Structural Balance
Zhixiao Wang, Chaofan Zhu, Qihan Feng, Jian Zhang, Xiaobin Rui, Philip S Yu
Main category: cs.LG
TL;DR: GraphSB addresses imbalanced node classification by optimizing graph structure through Structural Balance before node synthesis, outperforming existing methods and serving as a plug-and-play module.
Details
Motivation: Existing imbalanced node classification methods focus on data-level (synthesizing minority nodes) or algorithm-level (optimizing learning process) approaches, but neither addresses the inherently imbalanced graph structure that causes majority-class dominance and minority-class assimilation in GNNs.Method: Proposes GraphSB framework with Structural Balance strategy: two-stage structure optimization including 1) Structure Enhancement (mining hard samples via dual-view analysis and enhancing minority connectivity via adaptive augmentation), and 2) Relation Diffusion (propagating enhanced minority context while capturing higher-order structural dependencies).
Result: GraphSB significantly outperforms state-of-the-art methods. Structural Balance can be integrated as a plug-and-play module into existing methods, increasing their accuracy by average 4.57%.
Conclusion: Addressing imbalanced graph structure through Structural Balance before node synthesis is crucial for effective imbalanced node classification in GNNs, providing both a standalone solution and an enhancement module for existing methods.
Abstract: Imbalanced node classification is a critical challenge in graph learning, where most existing methods typically utilize Graph Neural Networks (GNNs) to learn node representations. These methods can be broadly categorized into the data-level and the algorithm-level. The former aims to synthesize minority-class nodes to mitigate quantity imbalance, while the latter tries to optimize the learning process to highlight minority classes. However, neither of them addresses the inherently imbalanced graph structure, which is a fundamental factor that incurs majority-class dominance and minority-class assimilation in GNNs. Our theoretical analysis further supports this critical insight. Therefore, we propose GraphSB (Graph Structural Balance), a novel framework that incorporates Structural Balance as a key strategy to address the underlying imbalanced graph structure before node synthesis. Structural Balance performs a two-stage structure optimization: Structure Enhancement that mines hard samples near decision boundaries through dual-view analysis and enhances connectivity for minority classes through adaptive augmentation, and Relation Diffusion that propagates the enhanced minority context while simultaneously capturing higher-order structural dependencies. Thus, GraphSB balances structural distribution before node synthesis, enabling more effective learning in GNNs. Extensive experiments demonstrate that GraphSB significantly outperforms the state-of-the-art methods. More importantly, the proposed Structural Balance can be seamlessly integrated into state-of-the-art methods as a simple plug-and-play module, increasing their accuracy by an average of 4.57%.
[383] Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
Quy-Anh Dang, Chris Ngo
Main category: cs.LG
TL;DR: Selective Steering is a new activation steering method that uses norm-preserving rotation and discriminative layer selection to achieve better control over LLM behavior while maintaining model capabilities.
Details
Motivation: Existing activation steering methods for LLM alignment have limitations: activation addition requires careful tuning and is sensitive to norm variations, directional ablation provides only binary control, and Angular Steering violates norm preservation causing distribution shift and generation collapse.Method: Two key innovations: (1) mathematically rigorous norm-preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite-signed class alignment.
Result: Experiments across nine models show Selective Steering achieves 5.5x higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks.
Conclusion: Selective Steering provides a principled, efficient framework for controllable and stable LLM behavior modification, addressing limitations of existing activation steering techniques.
Abstract: Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer-specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm-preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite-signed class alignment. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks. Our approach provides a principled, efficient framework for controllable and stable LLM behavior modification. Code: https://github.com/knoveleng/steering
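The norm-preservation claim is mechanical to verify in code. Below is a hedged sketch of rotation in a 2D steering subspace: the activation's component in span{u, v} is rotated by angle theta while the orthogonal component is untouched, so the norm is exactly preserved. The directions u and v here are random stand-ins for learned steering directions.

```python
# Norm-preserving rotation in a 2D subspace via Gram-Schmidt + plane rotation.
import torch

def rotate_in_plane(h, u, v, theta):
    # Orthonormal basis for the steering plane
    e1 = u / u.norm()
    e2 = v - (v @ e1) * e1
    e2 = e2 / e2.norm()
    a, b = h @ e1, h @ e2                 # in-plane coordinates
    perp = h - a * e1 - b * e2            # orthogonal complement, untouched
    c, s = torch.cos(theta), torch.sin(theta)
    return perp + (a * c - b * s) * e1 + (a * s + b * c) * e2

h = torch.randn(128)
u, v = torch.randn(128), torch.randn(128)
h_rot = rotate_in_plane(h, u, v, torch.tensor(0.5))
print(h.norm().item(), h_rot.norm().item())  # identical norms
```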
[384] DSP-Reg: Domain-Sensitive Parameter Regularization for Robust Domain Generalization
Xudong Han, Senkang Hu, Yihang Tao, Yu Guo, Philip Birch, Sam Tak Wu Kwong, Yuguang Fang
Main category: cs.LG
TL;DR: DSP-Reg is a domain generalization framework that identifies domain-sensitive parameters via covariance analysis and regularizes them to improve generalization to unseen domains.
Details
Motivation: Existing domain generalization methods focus on domain-invariant features but neglect parameter-level analysis, making models unable to differentiate between domain-sensitive and domain-invariant parameters, which limits generalization ability.Method: Proposes a covariance-based parameter sensitivity analysis framework to quantify parameter sensitivity to domain shifts, then introduces Domain-Sensitive Parameter Regularization (DSP-Reg) that uses soft regularization to encourage reliance on domain-invariant parameters while suppressing domain-specific ones.
Result: Outperforms state-of-the-art approaches on benchmarks (PACS, VLCS, OfficeHome, DomainNet) with average accuracy of 66.7%, surpassing all baselines.
Conclusion: Parameter-level analysis and regularization provide more granular control over model learning, leading to improved robustness and generalization to unseen domains compared to feature-level approaches.
Abstract: Domain Generalization (DG) is a critical area that focuses on developing models capable of performing well on data from unseen distributions, which is essential for real-world applications. Existing approaches primarily concentrate on learning domain-invariant features, which assume that a model robust to variations in the source domains will generalize well to unseen target domains. However, these approaches neglect a deeper analysis at the parameter level, which makes the model hard to explicitly differentiate between parameters sensitive to domain shifts and those robust, potentially hindering its overall ability to generalize. In order to address these limitations, we first build a covariance-based parameter sensitivity analysis framework to quantify the sensitivity of each parameter in a model to domain shifts. By computing the covariance of parameter gradients across multiple source domains, we can identify parameters that are more susceptible to domain variations, which serves as our theoretical foundation. Based on this, we propose Domain-Sensitive Parameter Regularization (DSP-Reg), a principled framework that guides model optimization by a soft regularization technique that encourages the model to rely more on domain-invariant parameters while suppressing those that are domain-specific. This approach provides a more granular control over the model’s learning process, leading to improved robustness and generalization to unseen domains. Extensive experiments on benchmarks, such as PACS, VLCS, OfficeHome, and DomainNet, demonstrate that DSP-Reg outperforms state-of-the-art approaches, achieving an average accuracy of 66.7% and surpassing all baselines.
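An illustrative sketch of the core loop (loss and weighting scheme assumed): measure each parameter's gradient variance across source domains as a sensitivity score, then softly penalize the domain-sensitive parameters.

```python
# Gradient-variance-weighted regularization across source domains;
# a sketch of the DSP-Reg idea, not the paper's exact objective.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
domains = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(3)]

# 1) per-domain gradients for every parameter
grads = []
for x, y in domains:
    model.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    grads.append([p.grad.detach().clone() for p in model.parameters()])

# 2) sensitivity = variance of gradients across source domains
sensitivity = [torch.stack(g).var(dim=0) for g in zip(*grads)]

# 3) soft regularizer discouraging reliance on domain-sensitive parameters
lam = 0.1
reg = sum((s * p.pow(2)).sum() for s, p in zip(sensitivity, model.parameters()))
x, y = domains[0]
loss = nn.functional.cross_entropy(model(x), y) + lam * reg
print(loss.item())
```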
[385] SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems
Saeed Nasehi Basharzad, Farhana Choudhury, Egemen Tanin
Main category: cs.LG
TL;DR: SEAFormer is a novel transformer model that efficiently solves large-scale Real-World Vehicle Routing Problems by incorporating both node-level and edge-level information through clustered proximity attention and edge-aware modules.
Details
Motivation: Real-world VRPs have complex sequence-dependent constraints that existing neural methods struggle with because they overlook sequence dependencies and underutilize edge-level information, which are crucial for RWVRP complexity.Method: SEAFormer uses two key innovations: 1) Clustered Proximity Attention (CPA) that exploits locality-aware clustering to reduce attention complexity from O(n²) to O(n) while preserving global perspective, and 2) lightweight edge-aware module that captures pairwise features through residual fusion.
Result: SEAFormer achieves superior results over state-of-the-art methods across four RWVRP variants at various scales. It’s the first neural method to effectively solve 1,000+ node RWVRPs and also achieves superior performance on classic VRPs.
Conclusion: SEAFormer provides a versatile solution for both research benchmarks and real-world applications, effectively addressing the limitations of previous neural methods in handling complex, sequence-dependent RWVRPs at scale.
Abstract: Real-world Vehicle Routing Problems (RWVRPs) require solving complex, sequence-dependent challenges at scale with constraints such as delivery time window, replenishment or recharging stops, asymmetric travel cost, etc. While recent neural methods achieve strong results on large-scale classical VRP benchmarks, they struggle to address RWVRPs because their strategies overlook sequence dependencies and underutilize edge-level information, which are precisely the characteristics that define the complexity of RWVRPs. We present SEAFormer, a novel transformer that incorporates both node-level and edge-level information in decision-making through two key innovations. First, our Clustered Proximity Attention (CPA) exploits locality-aware clustering to reduce the complexity of attention from $O(n^2)$ to $O(n)$ while preserving global perspective, allowing SEAFormer to efficiently train on large instances. Second, our lightweight edge-aware module captures pairwise features through residual fusion, enabling effective incorporation of edge-based information and faster convergence. Extensive experiments across four RWVRP variants with various scales demonstrate that SEAFormer achieves superior results over state-of-the-art methods. Notably, SEAFormer is the first neural method to solve 1,000+ node RWVRPs effectively, while also achieving superior performance on classic VRPs, making it a versatile solution for both research benchmarks and real-world applications.
[386] OSIRIS: Bridging Analog Circuit Design and Machine Learning with Scalable Dataset Generation
Giuseppe Chiari, Michele Piccoli, Davide Zoni
Main category: cs.LG
TL;DR: OSIRIS is a scalable dataset generation pipeline for analog IC design that creates comprehensive circuit variations with performance metrics, addressing the lack of open, high-quality datasets for ML-based analog design automation.
Details
Motivation: Analog IC design automation faces challenges due to complex layout-performance interdependencies and parasitic effects. ML approaches are promising but limited by lack of open, high-quality datasets for benchmarking and generalizability.Method: OSIRIS pipeline systematically explores analog circuit design space, generating comprehensive performance metrics and metadata. The authors also release a dataset of 87,100 circuit variations and provide an RL-based baseline method for optimization.
Result: Created a scalable dataset generation pipeline (OSIRIS) and released a substantial dataset of 87,100 analog circuit variations with comprehensive metrics, enabling ML-driven EDA research.
Conclusion: OSIRIS addresses the critical data availability gap in analog IC design automation, providing a foundation for ML-based approaches and enabling benchmarking and generalizability in this challenging domain.
Abstract: The automation of analog integrated circuit (IC) design remains a longstanding challenge, primarily due to the intricate interdependencies among physical layout, parasitic effects, and circuit-level performance. These interactions impose complex constraints that are difficult to accurately capture and optimize using conventional design methodologies. Although recent advances in machine learning (ML) have shown promise in automating specific stages of the analog design flow, the development of holistic, end-to-end frameworks that integrate these stages and iteratively refine layouts using post-layout, parasitic-aware performance feedback is still in its early stages. Furthermore, progress in this direction is hindered by the limited availability of open, high-quality datasets tailored to the analog domain, restricting both the benchmarking and the generalizability of ML-based techniques. To address these limitations, we present OSIRIS, a scalable dataset generation pipeline for analog IC design. OSIRIS systematically explores the design space of analog circuits while producing comprehensive performance metrics and metadata, thereby enabling ML-driven research in electronic design automation (EDA). In addition, we release a dataset consisting of 87,100 circuit variations generated with OSIRIS, accompanied by a reinforcement learning (RL)-based baseline method that exploits OSIRIS for analog design optimization.
[387] From Internal Diagnosis to External Auditing: A VLM-Driven Paradigm for Online Test-Time Backdoor Defense
Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang
Main category: cs.LG
TL;DR: PRISM is a new defense framework that uses universal vision-language models as external semantic auditors to detect backdoor attacks, achieving <1% attack success rate while maintaining clean accuracy.
Details
Motivation: Traditional test-time defenses are fragile against advanced backdoor attacks because they rely on internal diagnosis methods that remain entangled with the victim model's corrupted parameters. There's a need to decouple safety from the victim model through independent semantic auditing.Method: PRISM uses Universal Vision-Language Models as evolving semantic gatekeepers with two key mechanisms: 1) Hybrid VLM Teacher that dynamically refines visual prototypes online, and 2) Adaptive Router powered by statistical margin monitoring to calibrate gating thresholds in real-time.
Result: Extensive evaluation across 17 datasets and 11 attack types shows PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to <1% on CIFAR-10 while improving clean accuracy.
Conclusion: PRISM establishes a new standard for model-agnostic, externalized security by shifting from internal diagnosis to external semantic auditing using VLMs as independent safety auditors.
Abstract: Deep Neural Networks remain inherently vulnerable to backdoor attacks. Traditional test-time defenses largely operate under the paradigm of internal diagnosis methods like model repairing or input robustness, yet these approaches are often fragile under advanced attacks as they remain entangled with the victim model’s corrupted parameters. We propose a paradigm shift from Internal Diagnosis to External Semantic Auditing, arguing that effective defense requires decoupling safety from the victim model via an independent, semantically grounded auditor. To this end, we present a framework harnessing Universal Vision-Language Models (VLMs) as evolving semantic gatekeepers. We introduce PRISM (Prototype Refinement & Inspection via Statistical Monitoring), which overcomes the domain gap of general VLMs through two key mechanisms: a Hybrid VLM Teacher that dynamically refines visual prototypes online, and an Adaptive Router powered by statistical margin monitoring to calibrate gating thresholds in real-time. Extensive evaluation across 17 datasets and 11 attack types demonstrates that PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to <1% on CIFAR-10 while improving clean accuracy, establishing a new standard for model-agnostic, externalized security.
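A schematic sketch of the auditing step, with the thresholding rule assumed: inputs are compared to class prototypes in a VLM embedding space, and a margin monitor flags inputs where a confident semantic auditor disagrees with the protected model's prediction.

```python
# Prototype-based semantic audit with a margin gate; illustrative only.
import torch

def audit(emb, prototypes, model_pred, margin_thresh=0.1):
    # emb: (d,) input embedding; prototypes: (C, d) class prototypes
    sims = torch.nn.functional.cosine_similarity(prototypes, emb[None], dim=-1)
    top2 = sims.topk(2).values
    margin = (top2[0] - top2[1]).item()     # statistical margin proxy
    auditor_pred = sims.argmax().item()
    confident = margin > margin_thresh
    # flag likely-backdoored inputs: a confident auditor disagrees with the model
    return confident and auditor_pred != model_pred

protos = torch.nn.functional.normalize(torch.randn(10, 64), dim=-1)
emb = protos[3] + 0.05 * torch.randn(64)
print(audit(emb, protos, model_pred=7))  # True: semantic auditor disagrees
```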
[388] APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition
Finn Rietz, Pedro Zuidberg dos Martires, Johannes Andreas Stork
Main category: cs.LG
TL;DR: APC is a hierarchical RL method that adaptively composes multiple NF priors from demonstration data, estimating their applicability to target tasks while refining useful priors or sidestepping misaligned ones to optimize reward.
Details
Motivation: Existing RL approaches that incorporate demonstration data often assume demonstrations are optimal and fully aligned with target tasks, but in practice demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when integrated into RL.Method: Adaptive Policy Composition (APC) - a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to priors, APC estimates each prior’s applicability to the target task while leveraging them for exploration. It refines useful priors or sidesteps misaligned ones when necessary.
Result: Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding performance degradation caused by overly strict adherence to suboptimal demonstrations.
Conclusion: APC provides an effective approach for incorporating demonstration data into RL that handles the practical challenges of sparse, suboptimal, or misaligned demonstrations by adaptively composing and refining priors based on their estimated applicability to target tasks.
Abstract: Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when these demonstrations are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior’s applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors, or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding performance degradation caused by overly strict adherence to suboptimal demonstrations.
[389] Fixed Aggregation Features Can Rival GNNs
Celia Rubio-Madrigal, Rebekka Burkholz
Main category: cs.LG
TL;DR: Training-free fixed aggregation features (FAFs) transform graph learning into tabular problems, matching or beating state-of-the-art GNNs on most benchmarks using simple mean aggregation.
Details
Motivation: To challenge the prevailing belief that trainable neighborhood aggregations are essential for graph neural networks' success, and to explore whether simpler, training-free approaches can achieve comparable performance.Method: Introduces Fixed Aggregation Features (FAFs) - a training-free approach that transforms graph learning tasks into tabular problems by using fixed aggregation functions (often just mean aggregation) to create node features, then applies standard tabular classifiers like multilayer perceptrons.
Result: Across 14 benchmarks, well-tuned MLPs trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks, with only Roman Empire and Minesweeper datasets requiring deeper GNNs. Simple mean aggregation often suffices.
Conclusion: The results call for: 1) richer benchmarks that truly benefit from learning diverse neighborhood aggregations, 2) strong tabular baselines as standard in graph learning research, and 3) employing tabular models for graph data to gain new insights into related tasks.
Abstract: Graph neural networks (GNNs) are widely believed to excel at node representation learning through trainable neighborhood aggregations. We challenge this view by introducing Fixed Aggregation Features (FAFs), a training-free approach that transforms graph learning tasks into tabular problems. This simple shift enables the use of well-established tabular methods, offering strong interpretability and the flexibility to deploy diverse classifiers. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks – often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. To explain the theoretical possibility of non-trainable aggregations, we connect our findings to Kolmogorov-Arnold representations and discuss when mean aggregation can be sufficient. In conclusion, our results call for (i) richer benchmarks benefiting from learning diverse neighborhood aggregations, (ii) strong tabular baselines as standard, and (iii) employing and advancing tabular models for graph data to gain new insights into related tasks.
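FAFs are simple enough to state in a few lines. The sketch below (hop count and normalization assumed) builds tabular rows by training-free mean aggregation over k-hop neighborhoods; any standard tabular classifier can then consume the result.

```python
# Training-free Fixed Aggregation Features via repeated mean aggregation.
import numpy as np

def faf_features(adj, X, hops=2):
    # adj: (n, n) adjacency matrix; X: (n, d) raw node features
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    P = adj / deg                             # row-normalized: mean aggregation
    feats, H = [X], X
    for _ in range(hops):
        H = P @ H                             # fixed, training-free propagation
        feats.append(H)
    return np.concatenate(feats, axis=1)      # one tabular row per node

rng = np.random.default_rng(0)
n, d = 100, 8
adj = (rng.random((n, n)) < 0.05).astype(float)
adj = np.maximum(adj, adj.T)                  # undirected
X = rng.normal(size=(n, d))
tab = faf_features(adj, X)                    # feed to an MLP / tabular model
print(tab.shape)                              # (100, 24)
```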
[390] Time-to-Injury Forecasting in Elite Female Football: A DeepHit Survival Approach
Victoria Catterall, Cise Midoglu, Stephen Lynch
Main category: cs.LG
TL;DR: DeepHit neural network outperforms traditional ML models in predicting time-to-injury from longitudinal athlete monitoring data, providing interpretable, time-varying risk estimates for football injury prevention.
Details
Motivation: Existing injury prediction approaches in football rely on static pre-season data and binary outcomes, limiting real-world utility. There's a need for more dynamic, time-sensitive, and interpretable injury forecasting methods.Method: Used DeepHit neural network with multilayer perceptron backbone on SoccerMon dataset (two seasons of elite female footballers’ training, match, and wellness data). Applied data cleaning, feature engineering, and three imputation strategies. Compared against optimized Random Forest, XGBoost, and Logistic Regression baselines using chronological and leave-one-player-out validation.
Result: DeepHit achieved concordance index of 0.762, outperforming baseline models. Provided individualized, time-varying risk estimates. SHAP analysis identified clinically relevant predictors consistent with established risk factors, enhancing interpretability.
Conclusion: Survival modelling with DeepHit shows strong potential for advancing injury forecasting in football, offering accurate, explainable, and actionable insights for injury prevention across competitive levels.
Abstract: Injury occurrence in football poses significant challenges for athletes and teams, carrying personal, competitive, and financial consequences. While machine learning has been applied to injury prediction before, existing approaches often rely on static pre-season data and binary outcomes, limiting their real-world utility. This study investigates the feasibility of using a DeepHit neural network to forecast time-to-injury from longitudinal athlete monitoring data, while providing interpretable predictions. The analysis utilised the publicly available SoccerMon dataset, containing two seasons of training, match, and wellness records from elite female footballers. Data was pre-processed through cleaning, feature engineering, and the application of three imputation strategies. Baseline models (Random Forest, XGBoost, Logistic Regression) were optimised via grid search for benchmarking, while the DeepHit model, implemented with a multilayer perceptron backbone, was evaluated using chronological and leave-one-player-out (LOPO) validation. DeepHit achieved a concordance index of 0.762, outperforming baseline models and delivering individualised, time-varying risk estimates. Shapley Additive Explanations (SHAP) identified clinically relevant predictors consistent with established risk factors, enhancing interpretability. Overall, this study provides a novel proof of concept: survival modelling with DeepHit shows strong potential to advance injury forecasting in football, offering accurate, explainable, and actionable insights for injury prevention across competitive levels.
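For readers unfamiliar with DeepHit-style modeling, a compact sketch of the discrete-time idea follows (bins and loss simplified): an MLP maps monitoring features to a probability mass function over time-to-injury bins, the observed event bin contributes its probability to the likelihood, and censored records contribute the survival probability instead. DeepHit additionally adds a ranking (concordance) loss, omitted here.

```python
# Simplified discrete-time survival likelihood, DeepHit-style; dimensions,
# bin count, and censoring rate are assumptions for illustration.
import torch
import torch.nn as nn

n_bins = 12                                      # e.g., weeks until injury
net = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, n_bins))

x = torch.randn(64, 20)                          # monitoring features
event_bin = torch.randint(0, n_bins, (64,))      # observed injury week
observed = torch.rand(64) < 0.7                  # remainder censored

pmf = torch.softmax(net(x), dim=-1)              # P(T = bin)
surv_beyond = 1.0 - pmf.cumsum(dim=-1)           # P(T > bin)
ll_event = pmf.gather(1, event_bin[:, None]).squeeze(1)
ll_censor = surv_beyond.gather(1, event_bin[:, None]).squeeze(1)
nll = -torch.where(observed, ll_event.log(),
                   ll_censor.clamp(min=1e-8).log()).mean()
print(nll.item())
```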
[391] LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang
Main category: cs.LG
TL;DR: LLM-VA aligns answer and safety vectors in LLMs to reduce both jailbreak and over-refusal simultaneously, achieving better F1 scores than existing methods.
Details
Motivation: Current safety-aligned LLMs have two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods create a trade-off between these issues - reducing one increases the other.Method: LLM-VA aligns the answer vector (v_a) with the benign vector (v_b) through closed-form weight updates. It identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications without fine-tuning or architectural changes.
Result: Experiments on 12 LLMs show LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model’s safety bias without manual tuning.
Conclusion: By making the model’s willingness to answer causally dependent on its safety assessment through vector alignment, LLM-VA effectively addresses both jailbreak and over-refusal problems simultaneously.
Abstract: Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off – reducing jailbreak increases over-refusal and vice versa. We identify the root cause: LLMs encode the decision to answer (answer vector $v_a$) and the judgment of input safety (benign vector $v_b$) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns $v_a$ with $v_b$ through closed-form weight updates, making the model’s willingness to answer causally dependent on its safety assessment – without fine-tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications. Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model’s safety bias without manual tuning. Code and models are available at https://hotbento.github.io/LLM-VA-Web/.
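To illustrate the flavor of a closed-form, minimum-norm weight edit (the probe and target construction below are assumptions, not the paper's procedure): the smallest Frobenius-norm update ΔW that makes W map a probe input x to a target output y* is the rank-one matrix (y* - Wx)xᵀ/(xᵀx).

```python
# Minimum-norm rank-one weight edit; the alignment target is a hypothetical
# stand-in for nudging the answer direction toward the benign direction.
import torch

def min_norm_edit(W, x, y_target):
    # Delta W = (y* - W x) x^T / (x^T x): smallest Frobenius-norm solution
    residual = y_target - W @ x
    return torch.outer(residual, x) / (x @ x)

d = 16
W = torch.randn(d, d)
x = torch.randn(d)
v_a, v_b = torch.randn(d), torch.randn(d)        # answer / benign directions
# assumed target: shift the image of x away from v_a, toward v_b
y_target = W @ x + 0.1 * (v_b / v_b.norm() - v_a / v_a.norm())
W_new = W + min_norm_edit(W, x, y_target)
print(torch.allclose(W_new @ x, y_target, atol=1e-5))  # True
```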
[392] GenCP: Towards Generative Modeling Paradigm of Coupled Physics
Tianrun Gao, Haoren Zheng, Wenhao Deng, Haodong Feng, Tao Zhang, Ruiqi Feng, Qianyi Chen, Tailin Wu
Main category: cs.LG
TL;DR: GenCP is a novel generative paradigm for coupled multiphysics simulation that integrates probability density evolution with iterative multiphysics coupling, enabling training on decoupled data and inferring coupled physics during sampling with error controllability guarantees.
Details
Motivation: Real-world physical systems involve complex coupling of multiple physics, making simulation valuable but challenging. Mainstream approaches struggle with decoupled data and have low efficiency and fidelity in strongly coupled spatio-temporal systems.Method: Formulates coupled-physics modeling as probability modeling problem, integrates probability density evolution in generative modeling with iterative multiphysics coupling, uses operator-splitting theory in probability evolution space to establish error controllability for “conditional-to-joint” sampling scheme.
Result: Evaluated on synthetic setting and three challenging multi-physics scenarios, demonstrating both principled insight and superior application performance compared to existing approaches.
Conclusion: GenCP provides an elegant generative paradigm for coupled multiphysics simulation that addresses limitations of existing methods by enabling training on decoupled data while inferring coupled physics during sampling with theoretical guarantees.
Abstract: Real-world physical systems are inherently complex, often involving the coupling of multiple physics, making their simulation both highly valuable and challenging. Many mainstream approaches face challenges when dealing with decoupled data; moreover, they suffer from low efficiency and fidelity in strongly coupled spatio-temporal physical systems. Here we propose GenCP, a novel and elegant generative paradigm for coupled multiphysics simulation. By formulating coupled-physics modeling as a probability modeling problem, our key innovation is to integrate probability density evolution in generative modeling with iterative multiphysics coupling, thereby enabling training on data from decoupled simulation and inferring coupled physics during sampling. We also utilize operator-splitting theory in the space of probability evolution to establish error controllability guarantees for this “conditional-to-joint” sampling scheme. We evaluate our paradigm on a synthetic setting and three challenging multi-physics scenarios to demonstrate both principled insight and superior application performance of GenCP. Code is available at this repo: github.com/AI4Science-WestlakeU/GenCP.
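A minimal sketch of the "conditional-to-joint" idea via operator splitting: two conditional samplers, each of which could be trained on decoupled single-physics data, are alternated at sampling time. The linear score functions, step size, and Langevin-style update are illustrative placeholders, not the paper's models.

```python
import numpy as np

def score_u(u, v, t):  # placeholder for a learned score of p_t(u | v)
    return -(u - 0.5 * v)

def score_v(v, u, t):  # placeholder for a learned score of p_t(v | u)
    return -(v - 0.5 * u)

def split_step(u, v, t, dt, rng):
    """One Lie-Trotter splitting step: update field u conditioned on v,
    then field v conditioned on the refreshed u."""
    u = u + dt * score_u(u, v, t) + np.sqrt(2 * dt) * rng.standard_normal(u.shape)
    v = v + dt * score_v(v, u, t) + np.sqrt(2 * dt) * rng.standard_normal(v.shape)
    return u, v

rng = np.random.default_rng(0)
u, v = rng.standard_normal(128), rng.standard_normal(128)
for step in range(1000):              # a real sampler would also anneal the noise
    u, v = split_step(u, v, t=1 - step / 1000, dt=1e-3, rng=rng)
```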
[393] Scale-Consistent State-Space Dynamics via Fractal of Stationary Transformations
Geunhyeok Yu, Hyoseok Hwang
Main category: cs.LG
TL;DR: FROST introduces fractal inductive bias for scale-consistent latent dynamics in state-space models, enabling valid intermediate representations and natural early stopping based on intrinsic feature quality.
Details
Motivation: Deep learning models rely on depth without structural guarantees on intermediate representation validity, making early stopping and adaptive computation ill-posed. Current approaches lack geometric foundations for consistent latent dynamics across iterative refinement.Method: Formulates structural requirement for scale-consistent latent dynamics in state-space models, derives Fractal of Stationary Transformations (FROST) with fractal inductive bias enforcing self-similar representation manifold. Provides geometric analysis establishing contraction and stable convergence across iterations.
Result: Controlled experiments on ImageNet-100 empirically verify predicted scale-consistent behavior, showing adaptive efficiency emerges from aligned latent geometry. Intermediate states correspond to different resolutions of shared representation.
Conclusion: FROST’s scale-consistent structure enables natural halting with ranking-based formulation driven by intrinsic feature quality rather than extrinsic objectives, addressing limitations of depth-based models without structural guarantees.
Abstract: Recent deep learning models increasingly rely on depth without structural guarantees on the validity of intermediate representations, rendering early stopping and adaptive computation ill-posed. We address this limitation by formulating a structural requirement for scale-consistent latent dynamics of state-space models across iterative refinement, and derive Fractal of Stationary Transformations (FROST), which enforces a self-similar representation manifold through a fractal inductive bias. Under this geometry, intermediate states correspond to different resolutions of a shared representation, and we provide a geometric analysis establishing contraction and stable convergence across iterations. As a consequence of this scale-consistent structure, halting naturally admits a ranking-based formulation driven by intrinsic feature quality rather than extrinsic objectives. Controlled experiments on ImageNet-100 empirically verify the predicted scale-consistent behavior, showing that adaptive efficiency emerges from the aligned latent geometry.
[394] AROMMA: Unifying Olfactory Embeddings for Single Molecules and Mixtures
Dayoung Kang, JongWon Kim, Jiho Park, Keonseock Lee, Ji-Woong Choi, Jinhyun So
Main category: cs.LG
TL;DR: AROMMA learns unified embeddings for single molecules and mixtures using chemical foundation models and attention-based aggregation, with knowledge distillation for missing annotations.
Details
Motivation: Existing olfaction datasets are small and fragmented, with separate representations for single molecules and mixtures that are not aligned, limiting generalizable odor learning.Method: Uses chemical foundation models to encode molecules, attention-based aggregator for mixtures (permutation invariant, asymmetric interactions), and aligns odor descriptors via knowledge distillation with class-aware pseudo-labeling.
Result: Achieves state-of-the-art performance in both single-molecule and molecule-pair datasets with up to 19.1% AUROC improvement, demonstrating robust generalization across domains.
Conclusion: AROMMA provides a unified framework for odor representation learning that effectively handles both single molecules and mixtures through aligned embeddings and enriched annotations.
Abstract: Public olfaction datasets are small and fragmented across single molecules and mixtures, limiting learning of generalizable odor representations. Recent works either learn single-molecule embeddings or address mixtures via similarity or pairwise label prediction, leaving representations separate and unaligned. In this work, we propose AROMMA, a framework that learns a unified embedding space for single molecules and two-molecule mixtures. Each molecule is encoded by a chemical foundation model and the mixtures are composed by an attention-based aggregator, ensuring both permutation invariance and asymmetric molecular interactions. We further align odor descriptor sets using knowledge distillation and class-aware pseudo-labeling to enrich missing mixture annotations. AROMMA achieves state-of-the-art performance in both single-molecule and molecule-pair datasets, with up to 19.1% AUROC improvement, demonstrating robust generalization across both domains.
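A minimal PyTorch sketch of an attention-based mixture aggregator with the two properties the summary names: mean pooling after self-attention gives permutation invariance, while the attention map can weight the two molecules asymmetrically. Dimensions and module structure are assumptions; real inputs would come from a chemical foundation model encoder.

```python
import torch
import torch.nn as nn

class MixtureAggregator(nn.Module):
    """Self-attention over the set of molecule embeddings, then mean pooling."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, mol_embs):                    # (batch, n_molecules, dim)
        attended, _ = self.attn(mol_embs, mol_embs, mol_embs)
        return self.proj(attended.mean(dim=1))      # pooling => permutation invariant

agg = MixtureAggregator().eval()
pair = torch.randn(8, 2, 256)                       # two-molecule mixtures
print(torch.allclose(agg(pair), agg(pair.flip(dims=[1])), atol=1e-5))  # True
```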
[395] From Atoms to Chains: Divergence-Guided Reasoning Curriculum for Unlabeled LLM Domain Adaptation
Yongqi Wang, Xiaofeng Ji, Jie Wang, Qingbin Li, Xiao Xiong, Zheming Yang, Jian Xu, Minghui Qiu, Xinxiao Wu
Main category: cs.LG
TL;DR: DGRC is a novel knowledge distillation method that creates a curriculum from atomic knowledge to reasoning chains by analyzing disagreements between teacher and student models, addressing the challenge of adapting LLMs to specialized domains without human-annotated data.
Details
Motivation: Adapting LLMs to specialized domains without human-annotated data is challenging. Traditional knowledge distillation often leads to coarse mimicry where students inefficiently target their own weaknesses and risk inheriting teacher's reasoning flaws. There's a need for reliable curriculum design when teachers themselves are not infallible experts.Method: Divergence-Guided Reasoning Curriculum (DGRC) constructs learning paths from atomic knowledge to reasoning chains by analyzing disagreements between teacher and student reasoning pathways. When conflicts occur, DGRC directs the teacher to perform diagnostic analysis: formulate atomic queries targeting specific divergence points, self-answer these queries to create high-confidence atomic QA pairs. These serve dual purposes: (1) atomic curriculum to rectify student’s knowledge gaps, (2) factual criteria to filter teacher’s original reasoning chains, yielding verified CoT curriculum.
Result: Experiments across medical and legal domains on student models of various sizes demonstrate effectiveness. Notably achieves 7.76% relative improvement for 1.5B student model in medical domain over strong unlabeled baseline.
Conclusion: DGRC effectively addresses the challenge of adapting LLMs to specialized domains without human supervision by leveraging the insight that while LLMs may fail at complex reasoning, they exhibit high fidelity on atomic sub-problems, enabling creation of reliable curricula from teacher-student disagreements.
Abstract: Adapting Large Language Models (LLMs) to specialized domains without human-annotated data is a crucial yet formidable challenge. Widely adopted knowledge distillation methods often devolve into coarse-grained mimicry, where the student model inefficiently targets its own weaknesses and risks inheriting the teacher’s reasoning flaws. This exposes a critical pedagogical dilemma: how to devise a reliable curriculum when the teacher itself is not an infallible expert. Our work resolves this by capitalizing on a key insight: while LLMs may exhibit fallibility in complex, holistic reasoning, they often exhibit high fidelity on focused, atomic sub-problems. Based on this, we propose Divergence-Guided Reasoning Curriculum (DGRC), which constructs a learning path from atomic knowledge to reasoning chains by dynamically deriving two complementary curricula from disagreements in reasoning pathways. When a student and teacher produce conflicting results, DGRC directs the teacher to perform a diagnostic analysis: it analyzes both reasoning paths to formulate atomic queries that target the specific points of divergence, and then self-answers these queries to create high-confidence atomic question-answer pairs. These pairs then serve a dual purpose: (1) providing an atomic curriculum to rectify the student’s knowledge gaps, and (2) serving as factual criteria to filter the teacher’s original reasoning chains, yielding a verified CoT curriculum that teaches the student how to integrate atomic knowledge into complete reasoning paths. Experiments across the medical and legal domains on student models of various sizes demonstrate the effectiveness of our DGRC framework. Notably, our method achieves a 7.76% relative improvement for the 1.5B student model in the medical domain over a strong unlabeled baseline.
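A minimal sketch of one divergence-guided round; the `reason`, `diagnose`, `self_answer`, and `consistent` methods are hypothetical stand-ins for real teacher/student model calls, and the confidence threshold is illustrative.

```python
def dgrc_round(question, teacher_llm, student_llm, conf_threshold=0.9):
    """One round of the divergence-guided curriculum for a single question."""
    t_chain, t_answer = teacher_llm.reason(question)
    s_chain, s_answer = student_llm.reason(question)
    if t_answer == s_answer:
        return [], [(question, t_chain)]          # agreement: keep the CoT
    # Disagreement: teacher diagnoses the divergence with atomic probe queries.
    atomic_qa = []
    for q in teacher_llm.diagnose(question, t_chain, s_chain):
        answer, confidence = teacher_llm.self_answer(q)
        if confidence >= conf_threshold:          # keep high-confidence atoms only
            atomic_qa.append((q, answer))
    # Atomic facts double as filters: keep the teacher's CoT only if consistent.
    verified_cot = ([(question, t_chain)]
                    if all(teacher_llm.consistent(t_chain, qa) for qa in atomic_qa)
                    else [])
    return atomic_qa, verified_cot                # the two complementary curricula
```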
[396] Intersectional Fairness via Mixed-Integer Optimization
Jiří Němeček, Mark Kozdoba, Illia Kryvoviaz, Tomáš Pevný, Jakub Mareček
Main category: cs.LG
TL;DR: Proposes MIO framework for training intersectionally fair and interpretable classifiers to meet regulatory requirements in high-risk AI domains.
Details
Motivation: AI deployment in high-risk domains (finance, healthcare) requires fair and transparent models. Regulatory frameworks like the EU's AI Act mandate bias mitigation but are vague about bias definitions. True fairness requires addressing bias at intersections of protected groups.Method: Unified framework using Mixed-Integer Optimization (MIO) to train intersectionally fair and intrinsically interpretable classifiers. Proves equivalence of two intersectional fairness measures (MSD and SPSF) in detecting most unfair subgroup.
Result: MIO-based algorithm improves performance in finding bias. Trains high-performing, interpretable classifiers that bound intersectional bias below acceptable threshold.
Conclusion: Offers robust solution for regulated industries and beyond by providing intersectionally fair and interpretable classifiers that meet regulatory requirements.
Abstract: The deployment of Artificial Intelligence in high-risk domains, such as finance and healthcare, necessitates models that are both fair and transparent. While regulatory frameworks, including the EU’s AI Act, mandate bias mitigation, they are deliberately vague about the definition of bias. In line with existing research, we argue that true fairness requires addressing bias at the intersections of protected groups. We propose a unified framework that leverages Mixed-Integer Optimization (MIO) to train intersectionally fair and intrinsically interpretable classifiers. We prove the equivalence of two measures of intersectional fairness (MSD and SPSF) in detecting the most unfair subgroup and empirically demonstrate that our MIO-based algorithm improves performance in finding bias. We train high-performing, interpretable classifiers that bound intersectional bias below an acceptable threshold, offering a robust solution for regulated industries and beyond.
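For intuition, a brute-force version of the quantity the MIO formulation searches for exactly: the intersectional subgroup with the largest gap between its positive-prediction rate and the overall rate. The pandas-style enumeration and minimum subgroup size are illustrative; the paper casts this search as a mixed-integer program.

```python
import itertools
import numpy as np

def most_unfair_subgroup(df, protected_cols, pred_col, min_size=30):
    """Brute-force scan over intersectional subgroups; returns the subgroup
    (as a dict of attribute values) with the largest positive-rate gap."""
    overall = df[pred_col].mean()
    best, best_gap = None, 0.0
    for r in range(1, len(protected_cols) + 1):
        for cols in itertools.combinations(protected_cols, r):
            for vals, grp in df.groupby(list(cols)):
                gap = abs(grp[pred_col].mean() - overall)
                if len(grp) >= min_size and gap > best_gap:
                    best = dict(zip(cols, np.atleast_1d(vals)))
                    best_gap = gap
    return best, best_gap
```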
[397] The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence
Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
Main category: cs.LG
TL;DR: The paper presents a measure-theoretic framework for understanding InfoNCE contrastive learning, revealing geometric bifurcations between unimodal and multimodal regimes and showing that population-level modality gaps are structural necessities rather than initialization artifacts.
Details
Motivation: Current understanding of InfoNCE contrastive learning is limited to the basic alignment-uniformity decomposition, leaving deeper geometric mechanisms under-characterized. The authors aim to provide a more fundamental geometric understanding of how contrastive learning shapes representation distributions.Method: Developed a measure-theoretic framework modeling learning as evolution of representation measures on fixed embedding manifolds. Established value and gradient consistency in large-batch limit to bridge stochastic objectives to explicit deterministic energy landscapes. Analyzed geometric bifurcations between unimodal and multimodal regimes.
Result: Uncovered fundamental geometric bifurcation: unimodal settings have strictly convex landscapes with unique Gibbs equilibrium (entropy acts as tie-breaker), while symmetric multimodal objectives contain persistent negative symmetric divergence that induces barrier-driven co-adaptation and enforces population-level modality gaps as structural necessities.
Conclusion: The framework shifts analytical focus from pointwise discrimination to population geometry, providing principled basis for diagnosing and controlling distributional misalignment in contrastive learning. Shows that modality gaps emerge from geometric structure rather than initialization artifacts.
Abstract: While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment–uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying “uniformity” as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.
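For reference, the canonical InfoNCE objective and the large-batch alignment-uniformity limit that the paper's measure-theoretic framework extends (notation assumed: encoder $f$ on the unit sphere, temperature $\tau$, $M$ negatives; the limit holds up to an additive $\log M$ constant):

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[\log
      \frac{e^{f(x)^{\top} f(x^{+})/\tau}}
           {\sum_{i=1}^{M} e^{f(x)^{\top} f(x^{-}_{i})/\tau}}\right]
\;\longrightarrow\;
  \underbrace{-\tfrac{1}{\tau}\,\mathbb{E}\!\left[f(x)^{\top} f(x^{+})\right]}_{\text{alignment}}
  \;+\;
  \underbrace{\mathbb{E}_{x}\,\log\,\mathbb{E}_{x^{-}}\!\left[e^{f(x)^{\top} f(x^{-})/\tau}\right]}_{\text{uniformity}}
  \qquad (M \to \infty)
```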
[398] Safe Exploration via Policy Priors
Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause
Main category: cs.LG
TL;DR: SOOPER is a safe RL method that uses conservative policy priors and probabilistic dynamics models to explore optimistically while falling back to safe policies when needed, guaranteeing safety throughout learning.
Details
Motivation: Safe exploration is crucial for RL agents operating in real-world environments beyond simulations, requiring methods that can learn online while maintaining safety guarantees.Method: SOOPER uses suboptimal conservative policies (from offline data or simulators) as priors, combines probabilistic dynamics models for optimistic exploration, and implements pessimistic fallback to conservative policies when safety is at risk.
Result: The method guarantees safety throughout learning, establishes convergence to optimal policy via bounded cumulative regret, and outperforms state-of-the-art methods on safe RL benchmarks and real-world hardware.
Conclusion: SOOPER provides a scalable, theoretically-grounded approach to safe RL that balances exploration with safety guarantees, validated through extensive experiments and real-world deployment.
Abstract: Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, validating our theoretical guarantees in practice.
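A minimal sketch of the per-step decision logic described above; the ensemble-based worst-case cost check and the method names are assumptions standing in for the paper's probabilistic dynamics models and safety certificate.

```python
def sooper_step(state, learned_policy, prior_policy, model_ensemble, budget):
    """One action selection: explore optimistically if a pessimistic
    (worst-case over the model ensemble) safety estimate stays within
    the cost budget; otherwise fall back to the conservative prior."""
    action = learned_policy(state)
    worst_cost = max(m.predicted_cost(state, action) for m in model_ensemble)
    if worst_cost <= budget:
        return action               # certified safe: keep exploring
    return prior_policy(state)      # pessimistic fallback to the policy prior
```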
[399] R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning
Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, Peng Han
Main category: cs.LG
TL;DR: R^3 is a reinforcement learning method for large reasoning models that improves training stability and efficiency through cross-context replay, in-context self-reflection, and structural entropy ranking rewards.
Details
Motivation: Current group-based policy optimization methods for large reasoning models are fragile and inefficient because they rely on advantage gaps within batches, which can collapse under challenging tasks when intra-group advantages become similar.Method: R^3 introduces three components: (1) cross-context replay that maintains intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) in-context self-reflection that enables models to refine outputs using past failures, and (3) structural entropy ranking reward that assigns relative rewards to truncated/failed samples by ranking responses based on token-level entropy patterns.
Result: The method achieves state-of-the-art performance on several math benchmarks with significant improvements and fewer reasoning tokens compared to base models when implemented on Deepseek-R1-Distill-Qwen-1.5B and trained on DeepscaleR-40k.
Conclusion: R^3 effectively addresses the fragility and inefficiency of current group-based policy optimization methods for large reasoning models, providing a more stable and efficient training mechanism that improves performance on complex reasoning tasks.
Abstract: Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named R^3 that improves training along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an in-context self-Reflection mechanism enabling models to refine outputs by leveraging past failures, and (3) a structural entropy Ranking reward, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on DeepscaleR-40k in the math domain. Experiments demonstrate that our method achieves SoTA performance on several math benchmarks, with significant improvements and fewer reasoning tokens compared to the base models. Code and model will be released.
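A minimal sketch of a ranking-style reward over failed or truncated rollouts, using mean token entropy as a stand-in statistic; the paper's structural entropy signal is richer than this scalar proxy.

```python
import numpy as np

def ranking_rewards(token_entropies, low=-1.0, high=0.0):
    """token_entropies: list of 1-D entropy arrays, one per failed response.
    Returns evenly spaced relative rewards so even an all-failure group
    still produces a usable learning signal."""
    scores = [e.mean() for e in token_entropies]   # proxy: lower is steadier
    order = np.argsort(scores)                     # most stable response first
    rewards = np.linspace(high, low, num=len(scores))
    out = np.empty(len(scores))
    out[order] = rewards
    return out

print(ranking_rewards([np.array([0.2, 0.3]),
                       np.array([1.5, 2.0]),
                       np.array([0.8, 0.9])]))     # [ 0.  -1.  -0.5]
```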
[400] Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning
Tongxi Wang, Zhuoyang Xia, Xinran Chen, Shan Liu
Main category: cs.LG
TL;DR: AES (Adaptive Entropy Scheduling) dynamically adjusts entropy coefficients using online drift signals to handle environment drift in RL, reducing performance degradation and accelerating recovery.
Details
Motivation: Real-world RL faces environment drift, but existing methods use static entropy coefficients causing over-exploration during stable periods and under-exploration after drift, leading to slow recovery. There's also no principled understanding of how exploration intensity should scale with drift magnitude.Method: Proves entropy scheduling under non-stationarity reduces to a one-dimensional trade-off. Proposes AES which adaptively adjusts entropy coefficient/temperature online using observable drift proxies during training, requiring minimal structural changes and overhead.
Result: Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.
Conclusion: AES provides a principled approach to handle environment drift by adaptively scheduling exploration intensity based on measurable drift signals, improving RL robustness in non-stationary environments.
Abstract: Real-world reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift (thus slow recovery), and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We prove that entropy scheduling under non-stationarity can be reduced to a one-dimensional, round-by-round trade-off: faster tracking of the optimal solution after drift versus avoiding gratuitous randomness when the environment is stable. Exploration strength can therefore be driven by measurable online drift signals. Building on this, we propose AES (Adaptive Entropy Scheduling), which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.
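A minimal sketch of drift-aware entropy scheduling: the coefficient rises with an observable drift proxy (here, a change in a TD-error moving average) and decays toward a floor when the environment looks stationary. The proxy and all constants are illustrative assumptions, not the paper's schedule.

```python
class AdaptiveEntropy:
    """Entropy coefficient driven by an online drift proxy."""
    def __init__(self, alpha_min=0.01, alpha_max=0.5, gain=2.0, decay=0.99):
        self.alpha = alpha_min
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.gain, self.decay = gain, decay
        self.ema_td = 0.0

    def update(self, mean_td_error):
        drift = abs(mean_td_error - self.ema_td)            # drift proxy
        self.ema_td = 0.95 * self.ema_td + 0.05 * mean_td_error
        target = min(self.alpha_max, self.alpha_min + self.gain * drift)
        self.alpha = max(target, self.decay * self.alpha)   # rise fast, decay slow
        return self.alpha
```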
[401] Grasynda: Graph-based Synthetic Time Series Generation
Luis Amorim, Moises Santos, Paulo J. Azevedo, Carlos Soares, Vitor Cerqueira
Main category: cs.LG
TL;DR: Grasynda is a graph-based data augmentation method for time series forecasting that converts time series into network structures with transition probabilities, outperforming existing methods.
Details
Motivation: Deep learning for time series forecasting requires large datasets, but real-world data is often limited. Existing data augmentation methods have limitations in preserving data properties, creating a need for better synthetic generation approaches.Method: Converts univariate time series into graph networks where each state is a node and transitions are directed edges, encoding temporal dynamics in a transition probability matrix for synthetic generation.
Result: Grasynda consistently outperforms other time series data augmentation methods across three neural network variations on six benchmark datasets, including methods used in state-of-the-art time series foundation models.
Conclusion: Grasynda provides an effective graph-based approach for time series data augmentation that better preserves data properties and improves forecasting performance compared to existing methods.
Abstract: Data augmentation is a crucial tool in time series forecasting, especially for deep learning architectures that require a large training sample size to generalize effectively. However, extensive datasets are not always available in real-world scenarios. Although many data augmentation methods exist, their limitations include the use of transformations that do not adequately preserve data properties. This paper introduces Grasynda, a novel graph-based approach for synthetic time series generation that: (1) converts univariate time series into a network structure using a graph representation, where each state is a node and each transition is represented as a directed edge; and (2) encodes their temporal dynamics in a transition probability matrix. We performed an extensive evaluation of Grasynda as a data augmentation method for time series forecasting. We use three neural network variations on six benchmark datasets. The results indicate that Grasynda consistently outperforms other time series data augmentation methods, including ones used in state-of-the-art time series foundation models. The method and all experiments are publicly available.
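A minimal sketch of the graph construction and sampling loop described above, assuming quantile binning for the states and bin-center decoding; the paper's exact discretization and generation details may differ.

```python
import numpy as np

def graph_synth(series, n_states=10, length=None, seed=0):
    """Quantile-bin a series into states (nodes), count consecutive transitions
    (directed edges), row-normalize into a transition probability matrix,
    then random-walk over the graph to emit a synthetic series."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(series, np.linspace(0, 1, n_states + 1))
    states = np.clip(np.searchsorted(edges, series, side="right") - 1,
                     0, n_states - 1)
    T = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        T[a, b] += 1
    T = T / np.maximum(T.sum(axis=1, keepdims=True), 1)  # transition matrix
    centers = (edges[:-1] + edges[1:]) / 2               # decode state -> value
    s, out = states[0], []
    for _ in range(length or len(series)):
        out.append(centers[s])
        s = rng.choice(n_states, p=T[s]) if T[s].sum() > 0 else s
    return np.array(out)

synthetic = graph_synth(np.sin(np.linspace(0, 20, 500))
                        + 0.1 * np.random.default_rng(1).standard_normal(500))
```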
[402] ProToken: Token-Level Attribution for Federated Large Language Models
Waris Gill, Ahmad Humayun, Ali Anwar, Muhammad Ali Gulzar
Main category: cs.LG
TL;DR: ProToken enables token-level client attribution in federated LLMs while maintaining privacy, achieving 98% accuracy across diverse models and domains.
Details
Motivation: Federated LLMs deployed in critical applications lack attribution mechanisms to identify which clients contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification.Method: ProToken uses two key insights: (1) transformer architectures concentrate task-specific signals in later blocks for strategic layer selection, and (2) gradient-based relevance weighting filters irrelevant neural activations to focus on neurons directly influencing token generation.
Result: ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s) across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding), maintaining high accuracy when scaling client numbers.
Conclusion: ProToken provides a practical provenance methodology for token-level client attribution in federated LLMs that addresses critical deployment needs while maintaining FL privacy constraints, validating its viability for real-world applications.
Abstract: Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy when the number of clients is scaled, validating its practical viability for real-world deployment settings.
[403] Cross-Domain Offshore Wind Power Forecasting: Transfer Learning Through Meteorological Clusters
Dominic Weisser, Chloé Hashimoto-Cullen, Benjamin Guedj
Main category: cs.LG
TL;DR: Novel transfer learning framework for offshore wind power forecasting that clusters weather patterns and uses expert models to enable accurate forecasting with under 5 months of site-specific data, eliminating the need for a full year of local measurements.
Details
Motivation: New offshore wind farms lack sufficient site-specific data for traditional ML forecasting models, which require large volumes of local data. This creates a barrier for newly commissioned plants that need accurate power forecasts immediately for grid stability, reserve management, and energy trading.Method: Proposes a transfer learning framework that clusters power output based on meteorological features. Instead of a single general model, uses an ensemble of expert models, each trained on a specific weather pattern cluster. These pre-trained models specialize in distinct weather patterns and adapt efficiently to new sites.
Result: Achieved accurate cross-domain forecasting with under 5 months of site-specific data across 8 offshore wind farms, with MAE of 3.52%. Demonstrated that reliable forecasts don’t require a full annual cycle of local measurements.
Conclusion: The climate-aware transfer learning method successfully addresses data scarcity for new wind farms and opens opportunities for other offshore wind applications like early-stage wind resource assessment, accelerating project development while mitigating risks.
Abstract: Ambitious decarbonisation targets are catalysing growth in orders of new offshore wind farms. For these newly commissioned plants to run, accurate power forecasts are needed from the onset. These allow grid stability, good reserve management and efficient energy trading. Although machine learning models perform strongly, they tend to require large volumes of site-specific data that new farms do not yet have. To overcome this data scarcity, we propose a novel transfer learning framework that clusters power output according to covariate meteorological features. Rather than training a single, general-purpose model, we thus forecast with an ensemble of expert models, each trained on a cluster. As these pre-trained models each specialise in a distinct weather pattern, they adapt efficiently to new sites and capture transferable, climate-dependent dynamics. Through the expert models’ built-in calibration to seasonal and meteorological variability, we remove the industry-standard requirement of local measurements over a year. Our contributions are two-fold: we propose this novel framework and comprehensively evaluate it on eight offshore wind farms, achieving accurate cross-domain forecasting with under five months of site-specific data. Our experiments achieve a MAE of 3.52%, providing empirical verification that reliable forecasts do not require a full annual cycle. Beyond power forecasting, this climate-aware transfer learning method opens new opportunities for offshore wind applications such as early-stage wind resource assessment, where reducing data requirements can significantly accelerate project development whilst effectively mitigating its inherent risks.
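A minimal sketch of the cluster-of-experts framework with scikit-learn stand-ins: weather covariates are clustered, one regressor is trained per cluster, and a new site's samples are routed to experts by cluster membership. The clusterer, regressor, and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

def fit_experts(weather, power, n_clusters=8):
    """Cluster meteorological covariates, then fit one expert per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(weather)
    experts = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        experts[c] = GradientBoostingRegressor().fit(weather[mask], power[mask])
    return km, experts

def predict_new_site(km, experts, weather_new):
    """Route each sample from a data-scarce site to its weather-pattern expert."""
    labels = km.predict(weather_new)
    return np.array([experts[c].predict(x[None])[0]
                     for c, x in zip(labels, weather_new)])
```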
[404] LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation
Hongyaoxing Gu, Lijuan Hu, Liye Yu, Haowei Li, Fangfang Liu
Main category: cs.LG
TL;DR: LoPRo is a fine-tuning-free post-training quantization method that uses block-wise permutation and Walsh-Hadamard transformations to improve weight quantization accuracy, achieving state-of-the-art results at 2-3 bits with minimal overhead.
Details
Motivation: Current weight-only PTQ methods struggle with significant accuracy degradation in sub-3-bit regimes, often requiring fine-tuning to maintain competitive performance. There's a need for effective fine-tuning-free quantization methods that can handle low-bit quantization while preserving accuracy.Method: LoPRo enhances residual matrix quantization through block-wise permutation and Walsh-Hadamard transformations to rotate columns of similar importance, while preserving quantization accuracy of the most salient column blocks. It also uses mixed-precision fast low-rank decomposition based on rank-1 sketch (R1SVD) to minimize quantization costs.
Result: LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning baselines. It achieves state-of-the-art results on LLaMA-2/3 models with up to 4× speedup, and on Mixtral-8x7B completes quantization in 2.5 hours while reducing perplexity by 0.4 and improving accuracy by 8%.
Conclusion: LoPRo provides an effective fine-tuning-free PTQ solution that achieves superior accuracy with lower rank compared to other low-rank quantization methods, while maintaining high inference efficiency and minimal additional latency.
Abstract: Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. Current weight-only PTQ methods primarily focus on the challenging sub-3-bit regime, where approaches often suffer significant accuracy degradation, typically requiring fine-tuning to achieve competitive performance. In this work, we revisit the fundamental characteristics of weight quantization and analyze the challenges in quantizing the residual matrix under low-rank approximation. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that enhances residual matrix quantization by applying block-wise permutation and Walsh-Hadamard transformations to rotate columns of similar importance, while explicitly preserving the quantization accuracy of the most salient column blocks. Furthermore, we introduce a mixed-precision fast low-rank decomposition based on rank-1 sketch (R1SVD) to further minimize quantization costs. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning baselines. Specifically, LoPRo achieves state-of-the-art quantization accuracy on LLaMA-2 and LLaMA-3 series models while delivering up to a 4$\times$ speedup. In the MoE model Mixtral-8x7B, LoPRo completes quantization within 2.5 hours, simultaneously reducing perplexity by 0.4 and improving accuracy by 8%. Moreover, compared to other low-rank quantization methods, LoPRo achieves superior accuracy with a significantly lower rank, while maintaining high inference efficiency and minimal additional latency.
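A minimal sketch of the permuted block-wise rotation: columns are grouped by an importance proxy and each block is rotated by an orthonormal Walsh-Hadamard matrix before quantization. The importance score, block size, and the omitted special handling of the most salient block are simplifying assumptions.

```python
import numpy as np
from scipy.linalg import hadamard

def permuted_hadamard_rotate(W, block=32):
    """Permute columns so similar-importance columns share a block, then
    rotate each block to spread outliers before uniform quantization.
    (block must be a power of two for the Hadamard construction.)"""
    importance = np.linalg.norm(W, axis=0)          # per-column importance proxy
    perm = np.argsort(importance)                   # group similar columns
    H = hadamard(block) / np.sqrt(block)            # orthonormal rotation
    Wp = W[:, perm].astype(float)
    for s in range(0, Wp.shape[1] - block + 1, block):
        Wp[:, s:s + block] = Wp[:, s:s + block] @ H  # rotate each column block
    return Wp, perm                                  # undo with perm at inference
```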
[405] Out-of-Distribution Generalization via Invariant Trajectories for Multimodal Large Language Model Editing
Jiajie Su, Haoyuan Wang, Xiaohua Feng, Yunshan Ma, Xiaobo Xia, Yuyuan Li, Xiaolin Zheng, Jianmao Xiao, Chaochao Chen
Main category: cs.LG
TL;DR: ODEdit is a plug-and-play invariant learning framework for multimodal LLM knowledge editing that treats editing as an out-of-distribution generalization problem to handle diverse cross-modal prompts.
Details
Motivation: Existing unimodal LLM editing methods fail for multimodal LLMs because they rely on rigid parameter-to-output mappings, causing causal-underfit and causal-overfit in cascaded reasoning across modalities.Method: Reformulates MLLM editing as an OOD generalization problem, proposes ODEdit framework with tripartite OOD risk objective to enhance reliability, locality, and generality, plus edit trajectory invariant learning with total variation penalty to stabilize trajectories against environmental variations.
Result: Theoretical analysis and extensive experiments demonstrate the effectiveness of ODEdit for robust multimodal knowledge editing.
Conclusion: ODEdit successfully addresses the challenges of multimodal LLM editing by treating it as an OOD generalization problem and using invariant learning to achieve robust editing across diverse cross-modal prompts.
Abstract: Knowledge editing emerges as a crucial technique for efficiently correcting incorrect or outdated knowledge in large language models (LLM). Existing editing methods for unimodal LLM rely on a rigid parameter-to-output mapping, which causes causal-underfit and causal-overfit in cascaded reasoning for Multimodal LLM (MLLM). In this paper, we reformulate MLLM editing as an out-of-distribution (OOD) generalization problem, where the goal is to distinguish semantic shift from factual shift and thus achieve robust editing across diverse cross-modal prompts. The key challenge of this OOD problem lies in identifying invariant causal trajectories that generalize accurately while suppressing spurious correlations. To address it, we propose ODEdit, a plug-and-play invariant learning based framework that optimizes the tripartite OOD risk objective to simultaneously enhance editing reliability, locality, and generality. We further introduce an edit trajectory invariant learning method, which integrates a total variation penalty into the risk minimization objective to stabilize edit trajectories against environmental variations. Theoretical analysis and extensive experiments demonstrate the effectiveness of ODEdit.
[406] Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow
Yunyue Wei, Chenhui Zuo, Yanan Sui
Main category: cs.LG
TL;DR: Qflex is a reinforcement learning method that explores directly in high-dimensional action spaces using value-guided probability flows, outperforming baselines in continuous control and musculoskeletal control tasks.
Details
Motivation: High-dimensional systems in biology and robotics are challenging to control due to expansive state-action spaces. Existing exploration strategies degrade with action dimensionality, and dimensionality reduction methods limit policy expressiveness and system flexibility.Method: Q-guided Flow Exploration (Qflex) explores directly in native high-dimensional action space by traversing actions from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise.
Result: Qflex substantially outperforms representative online RL baselines across diverse high-dimensional continuous-control benchmarks. It successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings.
Conclusion: Value-guided flows offer a principled and practical route to exploration at scale in high-dimensional systems.
Abstract: Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected with sharp degradation as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex traverses actions from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online reinforcement learning baselines across diverse high-dimensional continuous-control benchmarks. Qflex also successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings. Our results indicate that value-guided flows offer a principled and practical route to exploration at scale.
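A minimal PyTorch sketch of value-guided action flow: an action sampled from a source distribution is transported along the gradient of a learned Q-function with a small noise term. The step count, step size, and Langevin-style noise are illustrative, not the paper's exact procedure.

```python
import torch

def qflex_action(state, q_net, source_dist, steps=10, lr=0.05, noise=0.01):
    """Refine a sampled action by following the value gradient, so exploration
    is shaped by task-relevant directions rather than isotropic noise."""
    a = source_dist.sample().requires_grad_(True)
    for _ in range(steps):
        (grad,) = torch.autograd.grad(q_net(state, a).sum(), a)
        with torch.no_grad():
            a += lr * grad + noise * torch.randn_like(a)  # flow toward high Q
    return a.detach()
```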
[407] Rethinking Divisive Hierarchical Clustering from a Distributional Perspective
Kaifeng Zhang, Kai Ming Ting, Tianrun Liang, Qiuran Zhao
Main category: cs.LG
TL;DR: Current DHC methods fail to produce dendrograms with three desired properties due to set-oriented bisecting criteria. Using distributional kernels instead addresses this and maximizes total similarity of all clusters with theoretical guarantees.
Details
Motivation: Current divisive hierarchical clustering methods produce dendrograms lacking three key properties: no unwarranted splitting, grouping similar clusters together, and ground-truth correspondence. This limitation stems from using set-oriented bisecting assessment criteria.Method: Replace set-oriented bisecting criteria with distributional kernels. This approach achieves a new distribution-oriented objective to maximize the total similarity of all clusters (TSC). The method provides theoretical guarantees for the resulting dendrogram.
Result: The proposed method successfully creates dendrograms consistent with biological regions in Spatial Transcriptomics datasets, where other methods fail. Theoretical analysis shows the dendrogram guarantees a lower bound of TSC.
Conclusion: Distributional kernels address fundamental shortcomings of set-oriented DHC methods, enabling dendrograms with desired properties and better performance on complex datasets like Spatial Transcriptomics.
Abstract: We uncover that current objective-based Divisive Hierarchical Clustering (DHC) methods produce a dendrogram that lacks three desired properties: no unwarranted splitting, grouping of similar clusters into the same subset, and ground-truth correspondence. This shortcoming has its root cause in the use of a set-oriented bisecting assessment criterion. We show that this shortcoming can be addressed by using a distributional kernel, instead of the set-oriented criterion; and the resultant clusters achieve a new distribution-oriented objective to maximize the total similarity of all clusters (TSC). Our theoretical analysis shows that the resultant dendrogram guarantees a lower bound of TSC. The empirical evaluation shows the effectiveness of our proposed method on artificial and Spatial Transcriptomics (bioinformatics) datasets. Our proposed method successfully creates a dendrogram that is consistent with the biological regions in a Spatial Transcriptomics dataset, whereas other contenders fail.
[408] Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals
Octavio Pappalardo
Main category: cs.LG
TL;DR: ULEE: Unsupervised meta-learning method combining in-context learning with adversarial goal generation for efficient exploration and adaptation in reinforcement learning.
Details
Motivation: Enable RL agents to learn from unsupervised pre-training by setting and pursuing their own goals, addressing challenges in generating, selecting, and learning from goals when downstream tasks are outside pre-training distribution or unknown.Method: ULEE combines in-context learner with adversarial goal-generation strategy that maintains training at frontier of agent’s capabilities, optimizing for efficient multi-episode exploration and adaptation within meta-learning framework.
Result: On XLand-MiniGrid benchmarks, ULEE yields improved exploration/adaptation abilities generalizing to novel objectives, dynamics, and map structures, with better zero-shot/few-shot performance and strong initialization for fine-tuning.
Conclusion: ULEE outperforms learning from scratch, DIAYN pre-training, and alternative curricula, demonstrating effective unsupervised meta-learning for RL agents.
Abstract: Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent’s post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent’s capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula.
[409] Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action
Gong Gao, Weidong Zhao, Xianhui Liu, Ning Jia
Main category: cs.LG
TL;DR: IRA algorithm improves online RL efficiency through Q-representation learning, greedy action guidance, and instant policy updates, addressing exploration and policy update delays.
Details
Motivation: Existing value-based online RL algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates, limiting learning efficiency.Method: Proposes IRA with three key components: 1) Q-Representation Discrepancy Evolution for discriminative representations, 2) Greedy Action Guidance via historical action backtracking for policy constraints, and 3) Instant Policy Update mechanism to increase policy update frequency.
Result: IRA significantly improves learning efficiency and final performance on eight MuJoCo continuous control tasks, with early-stage training conservatism helping alleviate overestimation bias.
Conclusion: IRA effectively addresses exploration and policy update challenges in online RL, providing a comprehensive solution through representation learning, policy constraints, and update frequency optimization.
Abstract: Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we adopt an explicit method to policy constraints by enabling Greedy Action Guidance (GAG). This is achieved through backtracking historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate $k$-nearest-neighbor action value estimates and learning to design a fast-adaptable policy through policy constraints. We further propose the Instant Policy Update (IPU) mechanism, which enhances policy exploitation by systematically increasing the frequency of policy updates. We further discover that the early-stage training conservatism of the IRA method can alleviate the overestimation bias problem in value-based RL. Experimental results show that IRA can significantly improve the learning efficiency and final performance of online RL algorithms on eight MuJoCo continuous control tasks.
[410] Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise
Hongxu Chen, Ke Wei, Xiaoming Yuan, Luo Luo
Main category: cs.LG
TL;DR: Develops generalization bounds for stochastic optimization under heavy-tailed gradient noise using algorithmic stability and truncation arguments.
Details
Motivation: Heavy-tailed gradient noise better characterizes ML training than bounded variance noise, but generalization analysis under such noise remains limited despite existing convergence studies.Method: Introduces truncation argument framework for generalization error bounds via algorithmic stability under bounded p-th centered moment (p∈(1,2]). Applies framework to analyze clipped/normalized SGD and their mini-batch/momentum variants.
Result: Develops general framework for establishing generalization bounds under heavy-tailed noise and provides stability/generalization analysis for popular stochastic algorithms in this setting.
Conclusion: Provides theoretical foundation for understanding generalization properties of stochastic optimization algorithms under realistic heavy-tailed gradient noise conditions.
Abstract: The empirical evidence indicates that stochastic optimization with heavy-tailed gradient noise is more appropriate to characterize the training of machine learning models than that with standard bounded gradient variance noise. Most existing works on this phenomenon focus on the convergence of optimization errors, while the analysis for generalization bounds under the heavy-tailed gradient noise remains limited. In this paper, we develop a general framework for establishing generalization bounds under heavy-tailed noise. Specifically, we introduce a truncation argument to achieve the generalization error bound based on the algorithmic stability under the assumption of bounded $p$th centered moment with $p\in(1,2]$. Building on this framework, we further provide the stability and generalization analysis for several popular stochastic algorithms under heavy-tailed noise, including clipped and normalized stochastic gradient descent, as well as their mini-batch and momentum variants.
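For concreteness, the noise assumption named in the summary and one of the analyzed updates (clipped SGD), in assumed notation with objective $F$, stochastic gradient $g_t$, step size $\eta_t$, and clipping radius $\tau$:

```latex
\mathbb{E}\big[\,\|g_t - \nabla F(x_t)\|^{p}\,\big] \le \sigma^{p},
\qquad p \in (1, 2],
\qquad
x_{t+1} = x_t - \eta_t\, \mathrm{clip}_{\tau}(g_t),
\quad
\mathrm{clip}_{\tau}(g) = \min\!\Big(1, \frac{\tau}{\|g\|}\Big)\, g .
```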
[411] GraphDLG: Exploring Deep Leakage from Gradients in Federated Graph Learning
Shuyue Wei, Wantong Chen, Tongyu Wei, Chen Gong, Yongxin Tong, Lizhen Cui
Main category: cs.LG
TL;DR: GraphDLG is a novel method that successfully recovers raw training graphs from shared gradients in federated graph learning, addressing the deep leakage from gradients vulnerability for graph data.
Details
Motivation: While deep leakage from gradients (DLG) has been studied for image/text data, it remains unknown whether graphs can be effectively recovered in federated graph learning, especially given the unique entanglement of graph structure and node features in GNNs.Method: Theoretical analysis reveals that once graph structure is recovered, node features can be obtained through a closed-form recursive rule. GraphDLG leverages this insight, using randomly generated graphs or client-side training graphs as auxiliaries to enhance recovery.
Result: GraphDLG outperforms existing solutions, achieving improvements of over 5.46% (by MSE) for node feature reconstruction and over 25.04% (by AUC) for graph structure reconstruction.
Conclusion: GraphDLG successfully demonstrates that graphs can be effectively recovered from gradients in federated graph learning, highlighting a significant privacy vulnerability that needs to be addressed in FGL systems.
Abstract: Federated graph learning (FGL) has recently emerged as a promising privacy-preserving paradigm that enables distributed graph learning across multiple data owners. A critical privacy concern in federated learning is whether an adversary can recover raw data from shared gradients, a vulnerability known as deep leakage from gradients (DLG). However, most prior studies on the DLG problem focused on image or text data, and it remains an open question whether graphs can be effectively recovered, particularly when the graph structure and node features are uniquely entangled in GNNs. In this work, we first theoretically analyze the components in FGL and derive a crucial insight: once the graph structure is recovered, node features can be obtained through a closed-form recursive rule. Building on this analysis, we propose GraphDLG, a novel approach to recover raw training graphs from shared gradients in FGL, which can utilize randomly generated graphs or client-side training graphs as auxiliaries to enhance recovery. Extensive experiments demonstrate that GraphDLG outperforms existing solutions by successfully decoupling the graph structure and node features, achieving improvements of over 5.46% (by MSE) for node feature reconstruction and over 25.04% (by AUC) for graph structure reconstruction.
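For background, a minimal sketch of the generic gradient-matching (DLG) loop this line of attacks builds on: a dummy batch is optimized so the gradient it induces matches the shared one. GraphDLG itself replaces this joint search with structure recovery plus a closed-form recursive feature rule, which is not reproduced here.

```python
import torch

def gradient_matching_step(model, loss_fn, dummy_x, dummy_y, true_grads, opt):
    """One step of the classic DLG attack: move the dummy batch so that the
    gradient it induces matches the gradient shared by the client."""
    opt.zero_grad()
    grads = torch.autograd.grad(loss_fn(model(dummy_x), dummy_y),
                                model.parameters(), create_graph=True)
    match = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
    match.backward()
    opt.step()                      # opt holds dummy_x (and dummy_y if soft)
    return match.item()
```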
[412] Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining
Yunwei Ren, Yatin Dandi, Florent Krzakala, Jason D. Lee
Main category: cs.LG
TL;DR: Deep convolutional networks can be efficiently trained to learn hierarchical functions via layerwise training when intermediate layers receive clean signal and features are weakly identifiable.
Details
Motivation: Despite empirical success of deep learning in exploiting hierarchical structure, theoretical understanding of hierarchical learning in genuinely deep models remains limited, with most optimization results focusing on shallow networks (2-3 layers). The paper aims to prove whether deep networks trained by gradient methods can efficiently exploit hierarchical structure.Method: The paper uses Random Hierarchy Models (a hierarchical context-free grammar) as a test case. The proof builds on the observation that if intermediate layers receive clean signal from labels and relevant features are weakly identifiable, then layerwise training of each individual layer suffices for hierarchical learning.
Result: The authors prove that under mild conditions, a deep convolutional network can be efficiently trained to learn the Random Hierarchy Models function class, which was conjectured to separate deep and shallow networks.
Conclusion: Deep networks trained via gradient-based methods can indeed efficiently exploit hierarchical structure when certain conditions are met (clean signal propagation to intermediate layers and weak feature identifiability), providing theoretical justification for hierarchical learning in deep models.
Abstract: The empirical success of deep learning is often attributed to deep networks’ ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results still focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This leads to a natural question: can we prove that deep networks, trained by gradient-based methods, can efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models – a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if intermediate layers can receive clean signal from the labels and the relevant features are weakly identifiable, then layerwise training each individual layer suffices to hierarchically learn the target function.
[413] The Effect of Architecture During Continual Learning
Allyson Hahn, Krishnan Raghavan
Main category: cs.LG
TL;DR: The paper introduces a mathematical framework for continual learning that jointly optimizes neural network architecture and weights in Sobolev space, proving that learning only weights is insufficient to prevent catastrophic forgetting under distribution shifts.
Details
Motivation: Static neural network architectures fail to adapt to evolving data distributions across tasks in continual learning, leading to catastrophic forgetting. The authors aim to mathematically demonstrate that architecture adaptation is essential for mitigating forgetting.Method: Formulates continual learning as a bilevel optimization problem: upper level selects optimal architecture via derivative-free direct search, lower level computes optimal weights via dynamic programming. Uses low-rank transfer mechanism to map knowledge across architectures with mismatched dimensions.
Result: Empirical studies across regression/classification problems with feedforward, convolutional, and graph neural networks show up to two orders of magnitude performance improvement, reduced forgetting, and enhanced noise robustness compared to static architecture approaches.
Conclusion: Simultaneous learning of optimal architecture and weights is mathematically proven necessary and practically effective for continual learning, significantly reducing catastrophic forgetting under distribution shifts.
Abstract: Continual learning is a challenge for models with static architecture, as they fail to adapt when data distributions evolve across tasks. We introduce a mathematical framework that jointly models architecture and weights in a Sobolev space, enabling a rigorous investigation into the role of neural network architecture in continual learning and its effect on the forgetting loss. We derive necessary conditions for the continual learning solution and prove that learning only model weights is insufficient to mitigate catastrophic forgetting under distribution shifts. Consequently, we prove that by learning the architecture and weights simultaneously at each task, we can reduce catastrophic forgetting. To learn weights and architecture simultaneously, we formulate continual learning as a bilevel optimization problem: the upper level selects an optimal architecture for a given task, while the lower level computes optimal weights via dynamic programming over all tasks. To solve the upper level problem, we introduce a derivative-free direct search algorithm to determine the optimal architecture. Once found, we must transfer knowledge from the current architecture to the optimal one. However, the optimal architecture will result in a weight parameter space different from that of the current architecture (i.e., the dimensions of the weight matrices will not match). To bridge the dimensionality gap, we develop a low-rank transfer mechanism to map knowledge across architectures of mismatched dimensions. Empirical studies across regression and classification problems, including feedforward, convolutional, and graph neural networks, demonstrate that learning the optimal architecture and weights simultaneously yields substantially improved performance (up to two orders of magnitude), reduced forgetting, and enhanced robustness to noise compared with static architecture approaches.
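The low-rank transfer step can be sketched concretely. The snippet below is one plausible realization (the paper's mechanism may differ): keep the leading singular directions of the old weight matrix and embed the rank-r reconstruction into the new shape.

```python
import numpy as np

def low_rank_transfer(W_old, new_shape, rank=8):
    """Map knowledge from W_old into a weight matrix of a different shape."""
    U, S, Vt = np.linalg.svd(W_old, full_matrices=False)
    r = min(rank, len(S))
    W_new = np.zeros(new_shape)
    rows = min(new_shape[0], U.shape[0])
    cols = min(new_shape[1], Vt.shape[1])
    # Re-compose the rank-r approximation inside the new dimensions.
    W_new[:rows, :cols] = (U[:rows, :r] * S[:r]) @ Vt[:r, :cols]
    return W_new
```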
[414] Knowledge-Aware Evolution for Streaming Federated Continual Learning with Category Overlap and without Task Identifiers
Sixing Tan, Xianmin Liu
Main category: cs.LG
TL;DR: FedKACE is a streaming federated continual learning method that addresses category overlap and absent task identifiers through adaptive model switching, gradient-balanced replay, and kernel spectral boundary buffer maintenance.
Details
Motivation: Existing batch-based federated continual learning methods lack adaptability to streaming scenarios with category overlap between old/new data and absent task identifiers, leading to knowledge confusion and uncertain task assignments.Method: FedKACE introduces: 1) adaptive inference model switching (local↔global) for personalization-generalization trade-off; 2) adaptive gradient-balanced replay for handling overlapping classes; 3) kernel spectral boundary buffer maintenance for preserving high-information boundary samples.
Result: Experiments across multiple scenarios and regret analysis demonstrate the effectiveness of FedKACE in streaming federated continual learning settings.
Conclusion: FedKACE successfully addresses challenges in streaming federated continual learning with overlapping categories and absent task identifiers through its three key components, enabling sustained inference capability for all prior categories.
Abstract: Federated Continual Learning (FCL) leverages inter-client collaboration to balance new knowledge acquisition and prior knowledge retention in non-stationary data. However, existing batch-based FCL methods lack adaptability to streaming scenarios featuring category overlap between old and new data and absent task identifiers, leading to indistinguishability of old and new knowledge, uncertain task assignments for samples, and knowledge confusion. To address this, we propose a streaming federated continual learning setting: per federated learning (FL) round, clients process streaming data with disjoint samples and potentially overlapping categories without task identifiers, necessitating sustained inference capability for all prior categories after each FL round. Next, we introduce FedKACE: 1) an adaptive inference model switching mechanism that enables unidirectional switching from local model to global model to achieve a trade-off between personalization and generalization; 2) an adaptive gradient-balanced replay scheme that reconciles new knowledge learning and old knowledge retention under overlapping-class scenarios; 3) a kernel spectral boundary buffer maintenance scheme that preserves high-information and high-boundary-influence samples to optimize cross-round knowledge retention. Experiments across multiple scenarios and regret analysis demonstrate the effectiveness of FedKACE.
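As a concrete illustration of the second component, here is one plausible reading of "adaptive gradient-balanced replay" (not necessarily FedKACE's actual rule): rescale the replay loss so its gradient norm matches that of the new-data loss before the combined update.

```python
import torch

def gradient_balanced_loss(loss_new, loss_replay, params):
    """Scale the replay term so its gradient norm matches the new-data term.

    `params` is a list of model parameters; caller backward()s the result.
    """
    g_new = torch.autograd.grad(loss_new, params, retain_graph=True)
    g_rep = torch.autograd.grad(loss_replay, params, retain_graph=True)
    n_new = torch.sqrt(sum(g.pow(2).sum() for g in g_new))
    n_rep = torch.sqrt(sum(g.pow(2).sum() for g in g_rep)) + 1e-12
    return loss_new + (n_new / n_rep).detach() * loss_replay
```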
[415] To Grok Grokking: Provable Grokking in Ridge Regression
Mingyue Xu, Gal Vardi, Itay Safran
Main category: cs.LG
TL;DR: Rigorous analysis of grokking phenomenon in ridge regression showing three-stage training: early overfitting, prolonged poor generalization, then eventual perfect generalization. Proves quantitative bounds on grokking time and shows it can be controlled via hyperparameter tuning.
Details
Motivation: To provide rigorous theoretical understanding of the grokking phenomenon - where generalization occurs long after overfitting - in a tractable linear regression setting, and to determine whether grokking is an inherent failure mode or controllable through hyperparameters.Method: Theoretical analysis of over-parameterized linear regression with gradient descent and weight decay, proving end-to-end grokking results with quantitative bounds on generalization delay. Empirical validation on both linear models and extension to non-linear neural networks.
Result: Proves three-stage grokking occurs: (1) early overfitting, (2) prolonged poor generalization, (3) eventual perfect generalization. Derives first rigorous quantitative bounds on grokking time in terms of hyperparameters. Shows grokking can be amplified or eliminated through proper hyperparameter tuning.
Conclusion: Grokking is not an inherent failure mode of deep learning but a consequence of specific training conditions that can be controlled through hyperparameter tuning, without requiring fundamental changes to model architecture or learning algorithms.
Abstract: We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the “grokking time”) in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
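The setting is simple enough to probe numerically. The following self-contained NumPy sketch (a toy, not the paper's exact construction or constants) runs gradient descent with weight decay on over-parameterized linear regression from a large random initialization, printing train and test error so a delayed drop in test error can be observed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                              # fewer samples than dimensions
w_star = np.zeros(d); w_star[:5] = 1.0      # sparse ground truth
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(1000, d))
y, yt = X @ w_star, Xt @ w_star

w = 5.0 * rng.normal(size=d)                # large init -> fast overfitting
lr, wd = 1e-3, 1e-2                         # step size and weight decay
for step in range(200001):
    grad = X.T @ (X @ w - y) / n + wd * w
    w -= lr * grad
    if step % 20000 == 0:
        tr = np.mean((X @ w - y) ** 2)
        te = np.mean((Xt @ w - yt) ** 2)
        print(f"step {step:6d}  train {tr:.3e}  test {te:.3e}")
```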
[416] Component-Aware Pruning Framework for Neural Network Controllers via Gradient-Based Importance Estimation
Ganesh Sundaram, Jonas Ulmen, Daniel Görges
Main category: cs.LG
TL;DR: Component-aware pruning framework using gradient-based importance metrics outperforms static norm-based pruning for neural network controllers.
Details
Motivation: Multi-component neural architectures have high computational complexity, and conventional norm-based pruning fails to capture functional significance of parameters.Method: Proposes component-aware pruning framework using three gradient-based importance metrics: Gradient Accumulation, Fisher Information, and Bayesian Uncertainty, computed during training.
Result: Experimental results with autoencoder and TD-MPC agent show the framework reveals critical structural dependencies and dynamic importance shifts that static heuristics miss.
Conclusion: Gradient-based component-aware pruning enables more informed compression decisions for complex neural network controllers by capturing functional significance.
Abstract: The transition from monolithic to multi-component neural architectures in advanced neural network controllers poses substantial challenges due to the high computational complexity of the latter. Conventional model compression techniques for complexity reduction, such as structured pruning based on norm-based metrics to estimate the relative importance of distinct parameter groups, often fail to capture functional significance. This paper introduces a component-aware pruning framework that utilizes gradient information to compute three distinct importance metrics during training: Gradient Accumulation, Fisher Information, and Bayesian Uncertainty. Experimental results with an autoencoder and a TD-MPC agent demonstrate that the proposed framework reveals critical structural dependencies and dynamic shifts in importance that static heuristics often miss, supporting more informed compression decisions.
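To make one of the three metrics concrete, the sketch below accumulates a Fisher-style first-order importance score, importance_i ≈ E[(g_i · w_i)^2], during training; the paper's exact estimators may differ in detail.

```python
import torch

def init_scores(model):
    return {n: torch.zeros_like(p) for n, p in model.named_parameters()}

def accumulate_importance(model, scores, loss):
    """Add (grad * weight)^2 per parameter, a common Fisher-style score."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.grad * p).pow(2)

# After training, prune the weights with the lowest accumulated scores.
```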
[417] Learn and Verify: A Framework for Rigorous Verification of Physics-Informed Neural Networks
Kazuaki Tanaka, Kohei Yatabe
Main category: cs.LG
TL;DR: A “Learn and Verify” framework provides mathematically rigorous error bounds for neural network solutions of differential equations, combining novel training loss with interval arithmetic verification.
Details
Motivation: Neural network solutions for differential equations (like PINNs) lack rigorous error bounds and convergence guarantees compared to classical numerical methods, making it difficult to mathematically certify their accuracy due to non-deterministic optimization.Method: Proposes a “Learn and Verify” framework combining: 1) Doubly Smoothed Maximum (DSM) loss for training neural networks, and 2) interval arithmetic for verification to compute rigorous a posteriori error bounds as machine-verifiable proofs.
Result: Numerical experiments on nonlinear ODEs (including problems with time-varying coefficients and finite-time blow-up) demonstrate successful construction of rigorous enclosures of true solutions.
Conclusion: The framework establishes a foundation for trustworthy scientific machine learning by providing computable, mathematically rigorous error bounds for neural network solutions of differential equations.
Abstract: The numerical solution of differential equations using neural networks has become a central topic in scientific computing, with Physics-Informed Neural Networks (PINNs) emerging as a powerful paradigm for both forward and inverse problems. However, unlike classical numerical methods that offer established convergence guarantees, neural network-based approximations typically lack rigorous error bounds. Furthermore, the non-deterministic nature of their optimization makes it difficult to mathematically certify their accuracy. To address these challenges, we propose a “Learn and Verify” framework that provides computable, mathematically rigorous error bounds for the solutions of differential equations. By combining a novel Doubly Smoothed Maximum (DSM) loss for training with interval arithmetic for verification, we compute rigorous a posteriori error bounds as machine-verifiable proofs. Numerical experiments on nonlinear Ordinary Differential Equations (ODEs), including problems with time-varying coefficients and finite-time blow-up, demonstrate that the proposed framework successfully constructs rigorous enclosures of the true solutions, establishing a foundation for trustworthy scientific machine learning.
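The verification side rests on interval arithmetic, which is easy to demonstrate. The toy class below propagates guaranteed enclosures through arithmetic; a real verifier (including the paper's a posteriori bounds) would additionally use outward rounding, omitted here for brevity.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def __add__(self, o): return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o): return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        c = (self.lo*o.lo, self.lo*o.hi, self.hi*o.lo, self.hi*o.hi)
        return Interval(min(c), max(c))

# Enclose the ODE residual u'(t) - f(t, u(t)) given enclosures of each term:
du, f_val = Interval(0.9, 1.1), Interval(0.95, 1.05)
print(du - f_val)   # an interval guaranteed to contain the true residual
```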
[418] A Multi-directional Meta-Learning Framework for Class-Generalizable Anomaly Detection
Padmaksha Roy, Lamine Mili, Almuatazbellah Boker
Main category: cs.LG
TL;DR: A multidirectional meta-learning framework for class-generalizable anomaly detection using normal data and few anomaly samples to detect unseen anomalies.
Details
Motivation: Address class-generalizable anomaly detection where models need to detect completely unseen anomalies (OOD classes) using limited normal data and rare/costly anomaly samples.Method: Two-level multidirectional meta-learning: inner level learns normal data manifold (representation), outer level meta-tunes with few anomaly samples to maximize softmax confidence margin between normal (ID) and anomaly (OOD) samples for decision surface calibration.
Result: The framework enables stronger generalization to unseen anomaly classes through iterative multidirectional training over multiple episodes of normal and few anomaly samples.
Conclusion: Proposed multidirectional meta-learning effectively addresses class-generalizable anomaly detection by learning from normal data and few anomaly samples to detect unseen anomalies.
Abstract: In this paper, we address the problem of class-generalizable anomaly detection, where the objective is to develop a unified model by focusing our learning on the available normal data and a small amount of anomaly data in order to detect the completely unseen anomalies, also referred to as the out-of-distribution (OOD) classes. Adding to this challenge is the fact that the anomaly data is rare and costly to label. To achieve this, we propose a multidirectional meta-learning algorithm – at the inner level, the model aims to learn the manifold of the normal data (representation); at the outer level, the model is meta-tuned with a few anomaly samples to maximize the softmax confidence margin between the normal and anomaly samples (decision surface calibration), treating normals as in-distribution (ID) and anomalies as out-of-distribution (OOD). By iteratively repeating this process over multiple episodes of predominantly normal and a small number of anomaly samples, we realize a multidirectional meta-learning framework. This two-level optimization, enhanced by multidirectional training, enables stronger generalization to unseen anomaly classes.
[419] Calibration without Ground Truth
Yuqing Kong, Mingyu Song, Yizhou Wang, Yifan Wu
Main category: cs.LG
TL;DR: A label-free post-processing framework that improves strong but miscalibrated models using weaker but better-calibrated reference models, guaranteeing strict performance improvement under any proper loss without needing ground-truth labels.
Details
Motivation: With predictions that publicly available human text will be exhausted within the next decade, improving models without access to ground-truth labels becomes increasingly important as labeled data becomes scarce.Method: Proposes a label-free post-processing framework that leverages a weaker but better-calibrated reference model to improve a strong but miscalibrated model. The approach is based on characterizing when strict improvement is possible (when models are not mutually calibrated), connects to arbitrage and no-trade results from economics, and uses an efficient Bregman projection algorithm that guarantees worst-case loss reduction without labels.
Result: Experiments on representative LLMs across varying scales demonstrate that the label-free method significantly reduces proper losses and calibration errors, achieving performance competitive with supervised baselines.
Conclusion: The proposed framework provides a practical solution for model improvement in label-scarce scenarios, offering guaranteed performance improvements under proper losses without requiring ground-truth labels, which is increasingly important as human-generated text becomes exhausted.
Abstract: Villalobos et al. [2024] predict that publicly available human text will be exhausted within the next decade. Thus, improving models without access to ground-truth labels becomes increasingly important. We propose a label-free post-processing framework that improves a strong but miscalibrated model using a weaker yet better-calibrated reference. Our framework guarantees a strict performance improvement under any proper loss. Our approach is based on a characterization of when strict improvement is possible: when the strong and reference models are not mutually calibrated. We formalize this condition, connect it to arbitrage and no-trade results from economics, and develop an efficient Bregman projection algorithm that guarantees worst-case loss reduction without labels. Experiments on representative LLMs across varying scales demonstrate that our label-free method significantly reduces proper losses and calibration errors, achieving performance competitive with supervised baselines.
[420] Bandits in Flux: Adversarial Constraints in Dynamic Environments
Tareq Si Salem
Main category: cs.LG
TL;DR: Novel primal-dual algorithm for adversarial multi-armed bandits with time-varying constraints achieves sublinear dynamic regret and constraint violation with state-of-the-art performance.
Details
Motivation: Real-world applications often involve adversarial multi-armed bandit problems with time-varying constraints, which present significant challenges that existing methods may not adequately address.Method: Proposed a novel primal-dual algorithm that extends online mirror descent by incorporating suitable gradient estimators and effective constraint handling mechanisms.
Result: Theoretical guarantees establish sublinear dynamic regret and sublinear constraint violation. The algorithm achieves state-of-the-art performance in both metrics.
Conclusion: The proposed approach demonstrates superior performance through empirical evaluations, effectively solving the challenging problem of adversarial multi-armed bandits with time-varying constraints.
Abstract: We investigate the challenging problem of adversarial multi-armed bandits operating under time-varying constraints, a scenario motivated by numerous real-world applications. To address this complex setting, we propose a novel primal-dual algorithm that extends online mirror descent through the incorporation of suitable gradient estimators and effective constraint handling. We provide theoretical guarantees establishing sublinear dynamic regret and sublinear constraint violation for our proposed policy. Our algorithm achieves state-of-the-art performance in terms of both regret and constraint violation. Empirical evaluations demonstrate the superiority of our approach.
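A single update of such a primal-dual scheme can be sketched as follows (illustrative, with exploration mixing and the paper's exact estimators omitted): an importance-weighted loss estimate drives an entropic mirror-descent step on the arm distribution, and a dual ascent step tracks constraint violation.

```python
import numpy as np

def primal_dual_step(p, lam, arm, loss, cost, eta=0.1, mu=0.05):
    """One round: play `arm` ~ p, observe its loss and constraint cost."""
    est = np.zeros(len(p))
    est[arm] = (loss + lam * cost) / p[arm]   # importance-weighted Lagrangian
    p = p * np.exp(-eta * est)                # mirror descent (entropy mirror map)
    p /= p.sum()
    lam = max(0.0, lam + mu * cost)           # dual ascent on the violation
    return p, lam
```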
[421] RHSIA: Real-time Hemodynamics Surrogation for Non-idealized Intracranial Aneurysms
Yiying Sheng, Wenhao Ding, Dylan Roi, Leonard Leong Litt Yeo, Hwa Liang Leo, Choon Hwai Yap
Main category: cs.LG
TL;DR: A Graph Transformer model predicts Wall Shear Stress across cardiac cycles from aneurysm geometry using deep learning, achieving high accuracy with limited pulsatile CFD data through steady-state data augmentation.
Details
Motivation: CFD-derived fluid mechanical markers can indicate intracranial aneurysm progression risks but are not clinically adopted due to CFD's complexity, time consumption, and low throughput. A deep learning solution is needed to provide real-time biomechanical markers without CFD expertise.Method: Developed a Graph Transformer model incorporating temporal information, supervised by large CFD data. Uses aneurysm surface meshes as input to predict WSS across cardiac cycles. Augmented limited pulsatile CFD data with low-cost steady-state CFD data to enhance performance with small sample sizes.
Result: Model accurately predicts WSS with SSIM up to 0.981 and maximum-based relative L2 error of 2.8%. Effectively captures temporal variations of WSS patterns. Ablation studies and SOTA comparison confirm optimality. Steady-state data augmentation substantially improves performance when pulsatile data is limited.
Conclusion: Proof of concept that temporal cardiovascular fluid mechanical parameters can be computed in real-time from geometric meshes using deep learning, even with small pulsatile CFD sample sizes. Approach is likely applicable to other cardiovascular scenarios and enables clinical translation of biomechanical markers.
Abstract: Extensive studies suggested that fluid mechanical markers of intracranial aneurysms (IAs) derived from Computational Fluid Dynamics (CFD) can indicate disease progression risks, but to date this has not been translated clinically. This is because CFD requires specialized expertise and is time-consuming and low throughput, making it difficult to support clinical trials. A deep learning model that maps IA morphology to biomechanical markers can address this, enabling physicians to obtain these markers in real time without performing CFD. Here, we show that a Graph Transformer model that incorporates temporal information, which is supervised by large CFD data, can accurately predict Wall Shear Stress (WSS) across the cardiac cycle from IA surface meshes. The model effectively captures the temporal variations of the WSS pattern, achieving a Structural Similarity Index (SSIM) of up to 0.981 and a maximum-based relative L2 error of 2.8%. Ablation studies and SOTA comparison confirmed its optimality. Further, as pulsatile CFD data is computationally expensive to generate and sample sizes are limited, we engaged a strategy of injecting a large amount of steady-state CFD data, which are extremely low-cost to generate, as augmentation. This approach enhances network performance substantially when the pulsatile CFD data sample size is small. Our study provides a proof of concept that temporal sequences of cardiovascular fluid mechanical parameters can be computed in real time using a deep learning model from the geometric mesh, and this is achievable even with a small pulsatile CFD sample size. Our approach is likely applicable to other cardiovascular scenarios.
[422] Self-Distillation Enables Continual Learning
Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal
Main category: cs.LG
TL;DR: SDFT enables on-policy continual learning from demonstrations by using demonstration-conditioned models as their own teachers, outperforming supervised fine-tuning while reducing catastrophic forgetting.
Details
Motivation: Continual learning for foundation models faces challenges: on-policy RL requires explicit rewards (often unavailable), while supervised fine-tuning (the main alternative) is off-policy and causes forgetting. Need a method for on-policy learning directly from demonstrations.Method: Self-Distillation Fine-Tuning (SDFT) - uses demonstration-conditioned models as their own teachers to generate on-policy training signals. Leverages in-context learning to preserve prior capabilities while acquiring new skills.
Result: SDFT consistently outperforms SFT across skill learning and knowledge acquisition tasks, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. Enables sequential accumulation of multiple skills without performance regression.
Conclusion: SDFT establishes on-policy distillation as a practical path to continual learning from demonstrations, enabling models to acquire new skills without degrading existing capabilities.
Abstract: Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
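Schematically, the SDFT objective is a distillation loss in which the same model plays both roles. The sketch below assumes a HuggingFace-style `model(...).logits` interface and that `sample_ids` were drawn on-policy from the unconditioned model; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def sdft_loss(model, demo_ids, prompt_ids, sample_ids):
    """KL from the demonstration-conditioned teacher to the student."""
    L = sample_ids.shape[-1]
    with torch.no_grad():                     # teacher sees the demonstration
        t_in = torch.cat([demo_ids, prompt_ids, sample_ids], dim=-1)
        t_logits = model(t_in).logits[:, -L - 1:-1, :]
    s_in = torch.cat([prompt_ids, sample_ids], dim=-1)   # student does not
    s_logits = model(s_in).logits[:, -L - 1:-1, :]
    return F.kl_div(F.log_softmax(s_logits, -1),
                    F.softmax(t_logits, -1), reduction="batchmean")
```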
[423] Language Models are Symbolic Learners in Arithmetic
Chunyuan Deng, Zhiqi Li, Roy Xie, Ruidi Chang, Hanjie Chen
Main category: cs.LG
TL;DR: LMs don’t truly learn arithmetic algorithms but instead master hierarchical symbolic shortcuts, starting with simplest digit mappings and progressing to more complex patterns, as revealed by subgroup induction analysis.
Details
Motivation: To determine whether language models genuinely learn to compute arithmetic or just perform superficial pattern matching, investigating their true learning mechanisms for arithmetic tasks.Method: Introduces subgroup induction framework adapted from Solomonoff Induction, analyzing arithmetic problems by breaking them into minimal mappings between input digits and single output digits, measuring subgroup quality as viability of shortcuts.
Result: Reveals U-shaped accuracy pattern in multi-digit multiplication: LMs master first and last output digits easily but struggle with middle digits, which aligns perfectly with quality of simplest subgroups requiring fewest input tokens.
Conclusion: LMs learn arithmetic through hierarchical symbolic shortcuts rather than true algorithmic computation, starting with easy low-token mappings and gradually incorporating more complex patterns as training progresses.
Abstract: The prevailing question about LMs performing arithmetic is whether these models learn to truly compute or if they simply master superficial pattern matching. In this paper, we argue for the latter, presenting evidence that LMs act as greedy symbolic learners, prioritizing the simplest possible shortcuts that fit the statistics of the dataset to solve arithmetic tasks. To investigate this, we introduce subgroup induction, a practical framework adapted from Solomonoff Induction (SI), one of the most powerful universal predictors. Our framework analyzes arithmetic problems by breaking them down into subgroups: minimal mappings between a few input digits and a single output digit. Our primary metric, subgroup quality, measures the viability of these shortcuts. Experiments reveal a distinct U-shaped accuracy pattern in multi-digit multiplication: LMs quickly master the first and last output digits while struggling with those in the middle. We demonstrate this U-shape is not coincidental; it perfectly mirrors the quality of the simplest possible subgroups, those requiring the fewest input tokens. This alignment suggests a core learning mechanism: LMs first learn easy, low-token shortcuts and only incorporate more complex, multi-token patterns as training progresses. They do not learn the algorithm of multiplication but rather a hierarchy of increasingly complex symbol-to-symbol mappings. Ultimately, our findings suggest that the path to arithmetic mastery for LMs is not paved with algorithms, but with a cascade of simple, hierarchically-learned symbolic shortcuts.
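The U-shape itself is easy to probe: score each output digit position separately. The helper below (with `predict` a stand-in for any model's answer string) returns per-position accuracy over a set of multiplication problems.

```python
def per_digit_accuracy(pairs, predict, width=8):
    """Accuracy at each output digit position; answers are zero-padded."""
    correct = [0] * width
    for a, b in pairs:
        truth = str(a * b).zfill(width)
        pred = predict(a, b).zfill(width)
        for i in range(width):
            correct[i] += (truth[i] == pred[i])
    return [c / len(pairs) for c in correct]
```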
[424] MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
Main category: cs.LG
TL;DR: MobileSafetyBench: A benchmark for evaluating safety of mobile device-control agents in Android emulators, showing current LLM-based agents often fail to prevent harm despite safety-focused prompting.
Details
Motivation: LLM-powered autonomous agents interact with personal device data and settings, creating safety risks, but no standardized benchmark exists to evaluate mobile device-control agent safety.Method: Created MobileSafetyBench with diverse tasks in Android emulators involving messaging, banking apps, and indirect prompt injection attacks. Proposed safety-prioritizing prompting method.
Result: State-of-the-art LLM-based baseline agents often fail to prevent harm. Safety-focused prompting shows promise but still has significant room for improvement to earn user trust.
Conclusion: Urgent need for continued research on robust safety mechanisms for mobile device-control agents, as current methods are insufficient despite safety-focused interventions.
Abstract: Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications, challenging agents with managing risks encompassing misuse and negative side effects. These tasks include tests to evaluate the safety of agents in daily scenarios as well as their robustness against indirect prompt injection attacks. Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks. To mitigate these safety concerns, we propose a prompting method that encourages agents to prioritize safety considerations. While this method shows promise in promoting safer behaviors, there is still considerable room for improvement to fully earn user trust. This highlights the urgent need for continued research to develop more robust safety mechanisms in mobile environments.
[425] Improving Value-based Process Verifier via Structural Prior Injection
Zetian Sun, Dongfang Li, Baotian Hu, Jun Yu, Min Zhang
Main category: cs.LG
TL;DR: The paper proposes representing LLM reasoning state values as categorical distributions instead of scalars to better handle Monte Carlo sampling errors, improving verifier performance by 1-2 points with minimal cost.
Details
Motivation: Monte Carlo sampling for LLM reasoning state estimation introduces noise and errors due to limited sampling. Current scalar value representations don't adequately capture this uncertainty.Method: Inject structural priors by transforming scalar values into expectations of pre-defined categorical distributions (Binomial). Treat Monte Carlo samples as single samples from ground-truth distributions, quantify errors as distribution mismatches, and optimize via distribution selection optimization.
Result: Value-based process verifiers improved by 1-2 points on Best-of-N and Beam search tasks compared to scalar representations. Different structural priors significantly affect performance despite having the same optimal solution.
Conclusion: Reasonable structural prior injection improves LLM reasoning verifier performance with minimal cost, and the choice of structural prior is crucial despite optimal solution equivalence.
Abstract: In the Large Language Model (LLM) reasoning scenario, people often estimate state value via Monte Carlo sampling. Though Monte Carlo estimation is an elegant method with less inductive bias, noise and errors are inevitably introduced due to the limited sampling. To handle the problem, we inject the structural prior into the value representation and transform the scalar value into the expectation of a pre-defined categorical distribution, representing the noise and errors from a distribution perspective. Specifically, by treating the result of Monte Carlo sampling as a single sample from the prior ground-truth Binomial distribution, we quantify the sampling error as the mismatch between the posterior estimated distribution and the ground-truth distribution, which is thus optimized via distribution selection optimization. We test the performance of value-based process verifiers on Best-of-N and Beam search tasks. Compared with the scalar value representation, we show that reasonable structural prior injection induced by different objective functions or optimization methods can improve the performance of value-based process verifiers by about 1$\sim$2 points at little-to-no cost. We also show that under different structural priors, the verifiers’ performances vary greatly despite having the same optimal solution, indicating the importance of reasonable structural prior injection.
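In code, the structural prior amounts to training the verifier against the observed success count rather than a scalar target. A minimal sketch, assuming n Monte Carlo rollouts per state with k observed successes:

```python
import torch
import torch.nn.functional as F

def categorical_value_loss(logits, k):
    """logits: (batch, n+1) over success counts; k: observed counts (long)."""
    return F.cross_entropy(logits, k)

def expected_value(logits, n):
    """Recover the scalar value as the expectation of the k/n grid."""
    probs = F.softmax(logits, dim=-1)                       # (batch, n+1)
    support = torch.arange(n + 1, dtype=probs.dtype,
                           device=probs.device) / n
    return probs @ support                                  # value in [0, 1]
```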
[426] ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
Main category: cs.LG
TL;DR: ExPO (Self-Explanation Policy Optimization) is a framework that enables RL-based reasoning improvement by generating positive samples through self-explanation conditioned on ground-truth answers, overcoming limitations of existing methods that rely on initial correct solutions.
Details
Motivation: Current RL self-improvement methods fail on complex reasoning tasks because they rely on the model's initial ability to generate correct solutions. Without guided exploration, they merely reinforce existing knowledge rather than enabling the model to solve problems where it initially generates no correct answers.Method: ExPO generates effective positive samples by conditioning on ground-truth answers, ensuring samples are (1) likely under current policy and (2) increase likelihood of predicting correct answers. This modular framework integrates with RL methods like GRPO and DPO, enabling exploration beyond current output distribution.
Result: ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings like MATH level-5 where models initially struggle most.
Conclusion: ExPO provides an effective solution for RL-based reasoning improvement by generating self-explanatory positive samples that enable exploration and guide models toward correct reasoning trajectories, overcoming limitations of both expert demonstrations and initial incorrect samples.
Abstract: Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model’s initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution-sharpening) rather than enabling the model to solve problems where it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer. It can be integrated with popular RL training methods like GRPO and DPO. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most. Code is available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation .
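The core sampling trick is simple to express. The snippet below is a schematic of the answer-conditioning idea only (the prompt wording and `generate` helper are hypothetical): elicit a self-explanation given the ground-truth answer, then use the resulting trace as a positive sample.

```python
def expo_positive(question, answer, generate):
    """Condition on the ground truth to obtain an on-policy explanation."""
    prompt = (f"Question: {question}\n"
              f"The correct answer is {answer}. "
              f"Explain step by step how to reach it.\nExplanation:")
    explanation = generate(prompt)
    return question, f"{explanation}\nAnswer: {answer}"   # positive sample
```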
[427] Prompt-Counterfactual Explanations for Generative AI System Behavior
Sofie Goethals, Foster Provost, João Sedoc
Main category: cs.LG
TL;DR: This paper introduces prompt-counterfactual explanations (PCEs) - a framework for understanding what causes LLM outputs to exhibit specific characteristics like toxicity, sentiment, or political bias by analyzing how prompt changes affect output properties.
Details
Motivation: As generative AI systems are integrated into real-world applications, organizations need to interpret their behavior and understand what causes specific output characteristics. Decision-makers need to know what about the input (prompt) causes LLMs to produce outputs with particular properties like toxicity, negative sentiment, or political bias.Method: The paper adapts counterfactual explanations from Explainable AI literature to generative AI systems. It proposes a flexible framework for applying counterfactual explanations to non-deterministic generative AI systems using downstream classifiers to reveal output characteristics. The authors introduce an algorithm for generating prompt-counterfactual explanations (PCEs).
Result: The framework is demonstrated through three case studies examining political leaning, toxicity, and sentiment. PCEs can streamline prompt engineering to suppress undesirable output characteristics and enhance red-teaming efforts to uncover prompts that elicit undesirable outputs.
Conclusion: This work establishes a foundation for prompt-focused interpretability in generative AI, which will become essential as models are used for higher-stakes tasks and face regulatory requirements for transparency and accountability.
Abstract: As generative AI systems become integrated into real-world applications, organizations increasingly need to be able to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input – the prompt – that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias. To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, due to several differences in how generative AI systems function. We then propose a flexible framework that adapts counterfactual explanations to non-deterministic, generative AI systems in scenarios where downstream classifiers can reveal key characteristics of their outputs. Based on this framework, we introduce an algorithm for generating prompt-counterfactual explanations (PCEs). Finally, we demonstrate the production of counterfactual explanations for generative AI systems with three case studies, examining different output characteristics (viz., political leaning, toxicity, and sentiment). The case studies further show that PCEs can streamline prompt engineering to suppress undesirable output characteristics and can enhance red-teaming efforts to uncover additional prompts that elicit undesirable outputs. Ultimately, this work lays a foundation for prompt-focused interpretability in generative AI: a capability that will become indispensable as these models are entrusted with higher-stakes tasks and subject to emerging regulatory requirements for transparency and accountability.
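The search for a PCE can be sketched as a simple loop; `generate`, `classify`, and `edit_prompt` below are hypothetical stand-ins for an LLM sampler, a downstream characteristic classifier, and a minimal prompt-editing operator, and the paper's actual algorithm may differ.

```python
def find_pce(prompt, generate, classify, edit_prompt,
             n_samples=20, threshold=0.1, max_iters=50):
    """Edit the prompt until the characteristic's rate falls below threshold."""
    for _ in range(max_iters):
        outputs = [generate(prompt) for _ in range(n_samples)]
        rate = sum(classify(o) for o in outputs) / n_samples
        if rate < threshold:            # characteristic suppressed
            return prompt               # this edited prompt is the PCE
        prompt = edit_prompt(prompt)    # small, meaning-preserving change
    return None
```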
[428] Endless Terminals: Scaling RL Environments for Terminal Agents
Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos
Main category: cs.LG
TL;DR: Endless Terminals is an autonomous pipeline that procedurally generates terminal-use tasks for RL training, enabling simple PPO agents to achieve substantial performance gains on both generated and human-curated benchmarks.
Details
Motivation: Current terminal benchmarks are designed for evaluation, not training. Reinforcement learning requires scalable environments, not just datasets. There's a need for an autonomous pipeline that can generate diverse terminal tasks without human annotation to enable effective agent training.Method: A four-stage pipeline: 1) Generate diverse task descriptions, 2) Build and validate containerized environments, 3) Produce completion tests, 4) Filter for solvability. Training uses vanilla PPO with binary episode-level rewards and minimal interaction loop (no retrieval, multi-agent coordination, or specialized tools).
Result: Generated 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. Models trained on Endless Terminals show substantial gains: Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0% on held-out dev set. Improvements transfer to human-curated benchmarks like TerminalBench 2.0.
Conclusion: Simple reinforcement learning can succeed when environments scale. The Endless Terminals pipeline demonstrates that procedurally generated training tasks enable substantial agent improvement, outperforming more complex agentic scaffolds and showing strong transfer to human-curated benchmarks.
Abstract: Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
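The final pipeline stage, filtering for solvability, reduces to a simple rejection loop; all helpers here are hypothetical stand-ins rather than the released code.

```python
def filter_solvable(tasks, agent, run_completion_test, attempts=3):
    """Keep a generated task only if a reference agent can pass its test."""
    kept = []
    for task in tasks:
        if any(run_completion_test(task, agent.solve(task))
               for _ in range(attempts)):
            kept.append(task)
    return kept
```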
[429] Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding
Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu
Main category: cs.LG
TL;DR: Streaming-dLLM is a training-free framework that accelerates diffusion LLM inference by addressing spatial redundancy (pruning redundant suffix tokens) and temporal inefficiency (dynamic early exit), achieving up to 68.2X speedup while maintaining quality.
Details
Motivation: Current diffusion LLM acceleration methods overlook intrinsic inefficiencies: spatial redundancy from uniformly modeling uninformative suffix regions, and temporal inefficiency from fixed denoising schedules across all decoding steps.Method: Two-pronged approach: 1) Spatial - attenuation guided suffix modeling to prune redundant mask tokens, 2) Temporal - dynamic confidence aware strategy with early exit mechanism to skip unnecessary iterations for converged tokens.
Result: Extensive experiments show Streaming-dLLM achieves up to 68.2X speedup while maintaining generation quality, demonstrating effective acceleration of diffusion LLM inference.
Conclusion: Streaming-dLLM effectively addresses spatial and temporal inefficiencies in diffusion LLM inference through training-free optimization, enabling significant speedup without quality degradation.
Abstract: Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block-wise diffusion process. Specifically, they suffer from spatial redundancy by modeling information-sparse suffix regions uniformly and temporal inefficiency by applying fixed denoising schedules across the entire decoding process. To address this, we propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming-dLLM achieves up to 68.2X speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming-dLLM.
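The temporal side, confidence-aware early exit, can be sketched in a few lines (illustrative; the paper's criteria and caching details differ): tokens whose confidence clears a threshold are committed and skipped in later iterations, and decoding stops once every position has converged.

```python
import torch

def dynamic_decode(step_fn, tokens, mask, max_iters=64, tau=0.9):
    """Commit tokens once confident; stop when no masked positions remain."""
    for _ in range(max_iters):
        logits = step_fn(tokens)                 # (seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        newly = mask & (conf > tau)              # confident, still-masked slots
        tokens[newly] = pred[newly]
        mask &= ~newly
        if not mask.any():
            break                                # early exit: all converged
    return tokens
```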
[430] General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design
Yue Jian, Curtis Wu, Danny Reidenbach, Aditi S. Krishnapriyan
Main category: cs.LG
TL;DR: BADGER introduces binding-affinity guidance for diffusion models in structure-based drug design, achieving 60% improvement in binding affinity over prior methods through classifier and classifier-free guidance strategies.
Details
Motivation: Current diffusion models for structure-based drug design often underemphasize binding affinity control during ligand generation, limiting their effectiveness in producing strongly-binding molecules.Method: BADGER incorporates binding affinity awareness through two complementary strategies: (1) classifier guidance (gradient-based affinity signals during sampling) and (2) classifier-free guidance (affinity conditioning directly in diffusion model training). The framework also extends to multi-constraint guidance for binding affinity, drug-likeness (QED), and synthetic accessibility (SA).
Result: BADGER achieves up to 60% improvement in ligand-protein binding affinity of sampled molecules over prior methods. The framework enables controllable ligand generation guided by binding affinity and can optimize multiple constraints simultaneously.
Conclusion: BADGER provides a general binding-affinity guidance framework that significantly improves the binding affinity of generated ligands while maintaining flexibility for multi-constraint optimization, advancing structure-based drug design capabilities.
Abstract: Structure-based drug design (SBDD) aims to generate ligands that bind strongly and specifically to target protein pockets. Recent diffusion models have advanced SBDD by capturing the distributions of atomic positions and types, yet they often underemphasize binding affinity control during generation. To address this limitation, we introduce BADGER, a general binding-affinity guidance framework for diffusion models in SBDD. BADGER incorporates binding affinity awareness through two complementary strategies: (1) classifier guidance, which applies gradient-based affinity signals during sampling in a plug-and-play fashion, and (2) classifier-free guidance, which integrates affinity conditioning directly into diffusion model training. Together, these approaches enable controllable ligand generation guided by binding affinity. BADGER can be added to any diffusion model and achieves up to a 60% improvement in ligand–protein binding affinity of sampled molecules over prior methods. Furthermore, we extend the framework to multi-constraint diffusion guidance, jointly optimizing for binding affinity, drug-likeness (QED), and synthetic accessibility (SA) to design realistic and synthesizable drug candidates.
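The first strategy follows the standard classifier-guidance recipe, sketched generically below (names and signatures are illustrative, not BADGER's API): shift the denoiser's proposed mean by the gradient of a differentiable affinity predictor.

```python
import torch

def guided_step(x_t, t, denoiser, affinity_model, scale=1.0):
    """One guided reverse-diffusion step toward higher predicted affinity."""
    mean, sigma = denoiser(x_t, t)             # model's proposed x_{t-1}
    x = x_t.detach().requires_grad_(True)
    affinity = affinity_model(x, t).sum()      # higher = stronger binding
    grad = torch.autograd.grad(affinity, x)[0]
    return mean + scale * sigma ** 2 * grad    # nudge the mean up the gradient
```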
[431] Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning
Jiayu Chen, Le Xu, Wentse Chen, Jeff Schneider
Main category: cs.LG
TL;DR: Offline MBRL with Bayes Adaptive MDP framework and Monte Carlo planning outperforms state-of-the-art methods on benchmark tasks.
Details
Motivation: Offline MBRL faces challenges with model uncertainty - multiple MDPs can explain the same offline data, requiring principled handling of this uncertainty for better decision-making.Method: Model offline MBRL as Bayes Adaptive MDP (BAMDP) and develop novel Bayes Adaptive Monte-Carlo planning algorithm using Monte Carlo Tree Search for continuous state/action spaces with stochastic transitions.
Result: Significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three challenging stochastic tokamak control tasks.
Conclusion: The “RL + Search” framework, inspired by AlphaZero, successfully addresses model uncertainty in offline MBRL through principled BAMDP modeling and efficient planning, demonstrating superior performance on diverse benchmarks.
Abstract: Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, many different MDPs can behave identically on the offline dataset, and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our “RL + Search” framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three challenging, stochastic tokamak control tasks. The codebase is available at: https://github.com/LucasCJYSDL/Offline-RL-Kit.
[432] Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity
Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elena Mocanu, Mykola Pechenizkiy, Zhangyang Wang, Decebal Constantin Mocanu
Main category: cs.LG
TL;DR: FreeTickets is an efficient ensemble learning framework that trains sparse subnetworks from scratch instead of multiple dense networks, achieving better performance with fewer parameters and FLOPs than a single dense model.
Details
Motivation: Deep ensembles improve predictive performance, uncertainty estimation, and OoD robustness but are computationally expensive. Existing efficient ensemble approaches still require at least the same resources as training a single dense model.Method: Connects sparse neural network training with deep ensembles. Instead of training multiple dense networks, directly trains sparse subnetworks from scratch using dynamic sparse training, extracting diverse yet accurate subnetworks during sparse-to-sparse training.
Result: FreeTickets surpasses dense baselines in accuracy, uncertainty estimation, OoD robustness, and efficiency. Outperforms naive deep ensemble with ResNet50 on ImageNet using only ~1/5 of training FLOPs, with fewer parameters and FLOPs than a single dense model.
Conclusion: FreeTickets provides a novel, highly efficient ensemble framework that leverages sparse training to achieve superior performance across multiple metrics while significantly reducing computational costs compared to traditional deep ensembles.
Abstract: The success of deep ensembles on improving predictive performance, uncertainty estimation, and out-of-distribution robustness has been extensively studied in the machine learning literature. Despite the promising results, naively training multiple deep neural networks and combining their predictions at inference leads to prohibitive computational costs and memory requirements. Recently proposed efficient ensemble approaches reach the performance of the traditional deep ensembles with significantly lower costs. However, the training resources required by these approaches are still at least the same as training a single dense model. In this work, we draw a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets. Instead of training multiple dense networks and averaging them, we directly train sparse subnetworks from scratch and extract diverse yet accurate subnetworks during this efficient, sparse-to-sparse training. Our framework, FreeTickets, is defined as the ensemble of these relatively cheap sparse subnetworks. Despite being an ensemble method, FreeTickets has even fewer parameters and training FLOPs than a single dense model. This seemingly counter-intuitive outcome is due to the extreme training and inference efficiency of dynamic sparse training. FreeTickets surpasses the dense baseline in all the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, as well as efficiency for both training and inference. Impressively, FreeTickets outperforms the naive deep ensemble with ResNet50 on ImageNet using around only 1/5 of the training FLOPs required by the latter. We have released our source code at https://github.com/VITA-Group/FreeTickets.
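The ensembling mechanics are lightweight: periodically snapshot the sparse subnetwork produced by dynamic sparse training and average the snapshots' predictions at test time. A minimal sketch (the prune-and-regrow step itself is assumed to happen elsewhere in the training loop):

```python
import copy
import torch

def maybe_snapshot(model, snapshots, step, every=5000, max_keep=5):
    """Collect diverse sparse subnetworks as training progresses."""
    if step % every == 0 and len(snapshots) < max_keep:
        snapshots.append(copy.deepcopy(model).eval())

@torch.no_grad()
def ensemble_predict(snapshots, x):
    """Average the softmax outputs of the collected subnetworks."""
    probs = [torch.softmax(m(x), dim=-1) for m in snapshots]
    return torch.stack(probs).mean(dim=0)
```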
[433] SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling
Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
Main category: cs.LG
TL;DR: LLM agents can use off-policy RL with hindsight relabeling to improve learning efficiency and enable autonomous goal-directed behavior.
Details
Motivation: Current LLM agents mainly use on-policy RL methods, which limits their ability to use important techniques like experience replay and hindsight relabeling that could enable more efficient learning and autonomous goal-directed behavior.Method: Adaptation of Soft Actor-Critic (an off-policy RL algorithm) with hindsight relabeling for LLM agents, enabling experience replay and goal relabeling techniques.
Result: The method outperforms on-policy approaches in multi-goal RL environments and enables more efficient learning for LLM agents.
Conclusion: Off-policy RL with hindsight relabeling is a promising direction for LLM agents, enabling more efficient learning and paving the way for autonomous goal-directed (autotelic) LLM agents.
Abstract: The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, and in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.
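Hindsight relabeling itself is a small, standard transformation, shown below in generic form: a failed episode is stored a second time with its goal replaced by what was actually achieved, so it becomes valid off-policy data for SAC's replay buffer.

```python
def relabel(episode, achieved_goal, reward_fn):
    """Rewrite a (obs, goal, action, reward, next_obs) episode in hindsight."""
    relabeled = []
    for obs, goal, action, _reward, next_obs in episode:
        r = reward_fn(next_obs, achieved_goal)   # recompute vs. the new goal
        relabeled.append((obs, achieved_goal, action, r, next_obs))
    return relabeled

# buffer.extend(episode)
# buffer.extend(relabel(episode, achieved_goal, reward_fn))
```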
[434] An efficient, provably optimal algorithm for the 0-1 loss linear classification problem
Xi He, Max A. Little
Main category: cs.LG
TL;DR: ICE algorithm solves 0-1 loss linear classification exactly with O(N^{D+1}) complexity, first standalone algorithm with rigorous guarantees for this NP-hard problem.
Details
Motivation: The 0-1 loss linear classification problem for non-linearly separable data is NP-hard, and existing approaches use approximations (hinge/logistic loss) that cannot guarantee exact solutions. There's a need for an efficient, rigorously proven algorithm for exact solutions.
Method: Incremental Cell Enumeration (ICE) analyzes combinatorial and incidence relations between hyperplanes and data points using hyperplane arrangements and oriented matroids theory. It enumerates cells in the arrangement to find optimal classification.
Result: ICE solves 0-1 loss classification exactly in O(N^{D+1}) time, generalizes to polynomial hypersurface classification in O(N^{G+1}) time. Achieves optimal training accuracy on small datasets and higher test accuracy on most datasets, with superior computational efficiency compared to branch-and-bound methods.
Conclusion: ICE is the first standalone algorithm with rigorous guarantees for exact 0-1 loss linear classification, solving a long-standing open problem. It provides theoretical foundations and practical effectiveness for exact classification.
Abstract: Algorithms for solving the linear classification problem have a long history, dating back at least to 1936 with linear discriminant analysis. For linearly separable data, many algorithms can obtain the exact solution to the corresponding 0-1 loss classification problem efficiently, but for data which is not linearly separable, it has been shown that this problem, in full generality, is NP-hard. Alternative approaches all involve approximations of some kind, such as the use of surrogates for the 0-1 loss (for example, the hinge or logistic loss), none of which can be guaranteed to solve the problem exactly. Finding an efficient, rigorously proven algorithm for obtaining an exact (i.e., globally optimal) solution to the 0-1 loss linear classification problem remains an open problem. By analyzing the combinatorial and incidence relations between hyperplanes and data points, we derive a rigorous construction algorithm, incremental cell enumeration (ICE), that can solve the 0-1 loss classification problem exactly in $O(N^{D+1})$. To the best of our knowledge, this is the first standalone algorithm, one that does not rely on general-purpose solvers, with rigorously proven guarantees for this problem. Moreover, we further generalize ICE to address the polynomial hypersurface classification problem in $O(N^{G+1})$ time, where $G$ is determined by both the data dimension and the polynomial hypersurface degree. The correctness of our algorithm is proved by the use of tools from the theory of hyperplane arrangements and oriented matroids. We demonstrate the effectiveness of our algorithm on real-world datasets, achieving optimal training accuracy for small-scale datasets and higher test accuracy on most datasets. Furthermore, our complexity analysis shows that the ICE algorithm offers superior computational efficiency compared with a state-of-the-art branch-and-bound algorithm.
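ICE itself enumerates cells of a hyperplane arrangement, but the $O(N^{D+1})$ scaling can be illustrated with the classical brute-force argument: an optimal hyperplane can be chosen to pass through $D$ data points, so one can enumerate all $O(N^D)$ such hyperplanes and score each against all $N$ points. A hedged sketch (hyperplanes through the origin and tie handling for points lying exactly on a candidate hyperplane are glossed over):

```python
import itertools
import numpy as np

def zero_one_loss(w, b, X, y):
    """Count misclassifications for labels y in {-1, +1}."""
    preds = np.where(X @ w + b >= 0.0, 1, -1)
    return int(np.sum(preds != y))

def brute_force_01(X, y):
    """Enumerate hyperplanes through D data points (O(N^D) candidates) and
    score each on all N points (O(N)), giving O(N^{D+1}) total."""
    N, D = X.shape
    best_loss, best_wb = N + 1, None
    for idx in itertools.combinations(range(N), D):
        P = X[list(idx)]
        try:
            w = np.linalg.solve(P, np.ones(D))  # hyperplane {x : w.x = 1}
        except np.linalg.LinAlgError:
            continue  # degenerate subset, skip
        for wv, bv in ((w, -1.0), (-w, 1.0)):  # try both orientations
            loss = zero_one_loss(wv, bv, X, y)
            if loss < best_loss:
                best_loss, best_wb = loss, (wv, bv)
    return best_loss, best_wb
```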
[435] A simple algorithm for output range analysis for deep neural networks
Helder Rojas, Nilton Rojas, Espinoza J. B., Luis Huamanchumo
Main category: cs.LG
TL;DR: Novel approach for DNN output range estimation using Simulated Annealing with constrained domains, applicable to various architectures including ResNets, with theoretical convergence guarantees and empirical validation.
Details
Motivation: Address challenges in DNN output range estimation due to lack of local geometric information and high non-linearity, especially for complex architectures like ResNets where existing methods have limitations.
Method: Integrates Simulated Annealing algorithm tailored for constrained domains to find global optima, with minimal assumptions on DNN internal architecture, making it applicable to diverse network structures.
Result: Theoretical convergence guarantees and extensive empirical evaluations demonstrate robustness in navigating non-convex response surfaces, efficient accurate output range estimation even with high non-linearity and complex constraints.
Conclusion: Proposed SA-based approach effectively solves DNN output range estimation problem, extends to complex models like ResNets, with both theoretical and empirical validation supporting its effectiveness.
Abstract: This paper presents a novel approach for the output range estimation problem in Deep Neural Networks (DNNs) by integrating a Simulated Annealing (SA) algorithm tailored to operate within constrained domains and ensure convergence towards global optima. The method effectively addresses the challenges posed by the lack of local geometric information and the high non-linearity inherent to DNNs, making it applicable to a wide variety of architectures, with a special focus on Residual Networks (ResNets) due to their practical importance. Unlike existing methods, our algorithm imposes minimal assumptions on the internal architecture of neural networks, thereby extending its usability to complex models. Theoretical analysis guarantees convergence, while extensive empirical evaluations, including optimization tests involving functions with multiple local minima, demonstrate the robustness of our algorithm in navigating non-convex response surfaces. The experimental results highlight the algorithm’s efficiency in accurately estimating DNN output ranges, even in scenarios characterized by high non-linearity and complex constraints. For reproducibility, Python code and datasets used in the experiments are publicly available through our GitHub repository.
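A minimal sketch of the generic ingredient, simulated annealing over a box-constrained input domain, is shown below; maximizing f estimates the upper end of the output range, and rerunning on -f gives the lower end. The cooling schedule and proposal distribution are illustrative choices, not the paper's:

```python
import numpy as np

def sa_output_max(f, lo, hi, steps=20000, t0=1.0, t_end=1e-3, seed=0):
    """Estimate max_x f(x) over the box [lo, hi] by simulated annealing.
    f: black-box scalar function (e.g., one output neuron of a DNN).
    Proposals are clipped so the chain never leaves the constrained domain."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    for step in range(steps):
        t = t0 * (t_end / t0) ** (step / steps)  # geometric cooling
        prop = np.clip(x + t * (hi - lo) * rng.standard_normal(x.shape), lo, hi)
        fp = f(prop)
        # Always accept uphill moves; accept downhill with Boltzmann probability.
        if fp >= fx or rng.random() < np.exp((fp - fx) / t):
            x, fx = prop, fp
            if fx > best_f:
                best_x, best_f = x.copy(), fx
    return best_x, best_f
```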
[436] Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models
Tiejin Chen, Kaishen Wang, Hua Wei
Main category: cs.LG
TL;DR: Zer0-Jack is a black-box jailbreak attack method for MLLMs that uses zeroth-order optimization and patch coordinate descent to generate malicious image inputs without requiring white-box access, achieving high attack success rates comparable to white-box methods.
Details
Motivation: Existing gradient-based jailbreak methods require white-box access to MLLMs, which is often unavailable in real-world scenarios. Transfer attacks from white-box to black-box models suffer from reduced performance. There's a need for efficient black-box jailbreak methods that don't require model gradients.
Method: Proposes Zer0-Jack using zeroth-order optimization to bypass white-box requirements. Introduces patch coordinate descent to efficiently generate malicious image inputs by optimizing patches in the image space to directly attack black-box MLLMs, significantly reducing memory usage.
Result: Achieves 95% attack success rate on MiniGPT-4 with Harmful Behaviors Multi-modal Dataset in black-box setting. Surpasses previous transfer-based methods and performs comparably with existing white-box jailbreak techniques. Successfully attacks commercial MLLMs like GPT-4o.
Conclusion: Zer0-Jack demonstrates that effective jailbreak attacks can be conducted without white-box access using zeroth-order optimization, posing significant security risks to black-box MLLMs and highlighting the need for improved safety measures against such attacks.
Abstract: Jailbreaking methods, which induce Multi-modal Large Language Models (MLLMs) to output harmful responses, raise significant safety concerns. Among these methods, gradient-based approaches, which use gradients to generate malicious prompts, have been widely studied due to their high success rates in white-box settings, where full access to the model is available. However, these methods have notable limitations: they require white-box access, which is not always feasible, and involve high memory usage. To address scenarios where white-box access is unavailable, attackers often resort to transfer attacks. In transfer attacks, malicious inputs generated using white-box models are applied to black-box models, but this typically results in reduced attack performance. To overcome these challenges, we propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization. We propose patch coordinate descent to efficiently generate malicious image inputs to directly attack black-box MLLMs, which further reduces memory usage. Through extensive experiments, Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods and performing comparably with existing white-box jailbreak techniques. Notably, Zer0-Jack achieves a 95% attack success rate on MiniGPT-4 with the Harmful Behaviors Multi-modal Dataset in a black-box setting, demonstrating its effectiveness. Additionally, we show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o. Codes are provided in the supplement.
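The core mechanism, a two-point zeroth-order gradient estimate restricted to one image patch at a time, can be sketched as follows. All names and hyperparameters are illustrative assumptions, and this is not claimed to be the paper's exact estimator:

```python
import numpy as np

def zo_patch_step(loss_fn, image, patch, mu=1e-3, lr=1e-2, n_dirs=8, seed=0):
    """One zeroth-order descent step on a single image patch.
    loss_fn: black-box scalar loss queried from the target model (no
    gradients available). patch: (row_slice, col_slice) choosing the active
    pixels. The patch gradient is estimated by two-point finite differences
    along random directions; memory scales with the patch, not the image."""
    rng = np.random.default_rng(seed)
    rows, cols = patch
    base = loss_fn(image)
    grad = np.zeros_like(image[rows, cols])
    for _ in range(n_dirs):
        u = rng.standard_normal(grad.shape)
        perturbed = image.copy()
        perturbed[rows, cols] += mu * u
        # Directional estimate: ((f(x + mu*u) - f(x)) / mu) * u
        grad += (loss_fn(perturbed) - base) / mu * u
    grad /= n_dirs
    out = image.copy()
    out[rows, cols] -= lr * grad  # update only the active patch
    return np.clip(out, 0.0, 1.0)
```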
[437] Link Representation Learning for Probabilistic Travel Time Estimation
Chen Xu, Qiang Wang, Lijun Sun
Main category: cs.LG
TL;DR: ProbETA: A deep hierarchical joint probabilistic model for travel time estimation that captures both inter-trip and intra-trip correlations using low-rank multivariate Gaussian distributions and learnable link representations.
Details
Motivation: Existing travel time estimation methods assume trip independence and focus on individual trips, ignoring real-world correlations between trips caused by external factors (weather) and internal factors (driver tendencies). This limitation reduces estimation accuracy.
Method: Proposes a joint probabilistic model using low-rank multivariate Gaussian distributions to model travel times across multiple trips. Uses learnable link representations estimated via empirical Bayes approach. Introduces trip sub-sampling data augmentation for fine-grained gradient backpropagation during link representation learning.
Result: Outperforms state-of-the-art deterministic and probabilistic baselines on two real-world GPS trajectory datasets, reducing Mean Absolute Percentage Error by over 12.60%. Learned link representations align with physical network geometry.
Conclusion: ProbETA effectively captures trip correlations for more accurate travel time estimation, and the learned link representations have potential applications in other transportation-related tasks.
Abstract: Travel time estimation is a key task in navigation apps and web mapping services. Existing deterministic and probabilistic methods, based on the assumption of trip independence, predominantly focus on modeling individual trips while overlooking trip correlations. However, real-world conditions frequently introduce strong correlations between trips, influenced by external and internal factors such as weather and the tendencies of drivers. To address this, we propose a deep hierarchical joint probabilistic model ProbETA for travel time estimation, capturing both inter-trip and intra-trip correlations. The joint distribution of travel times across multiple trips is modeled as a low-rank multivariate Gaussian, parameterized by learnable link representations estimated using the empirical Bayes approach. We also introduce a data augmentation method based on trip sub-sampling, allowing for fine-grained gradient backpropagation when learning link representations. During inference, our model estimates the probability distribution of travel time for a queried trip, conditional on spatiotemporally adjacent completed trips. Evaluation on two real-world GPS trajectory datasets demonstrates that ProbETA outperforms state-of-the-art deterministic and probabilistic baselines, with Mean Absolute Percentage Error decreasing by over 12.60%. Moreover, the learned link representations align with the physical network geometry, potentially making them applicable for other tasks.
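The low-rank Gaussian parameterization is what keeps inference tractable: with covariance $UU^\top + \mathrm{diag}(d)$, a trip's total-time distribution can be computed without ever forming the full link-by-link covariance. A hedged sketch with illustrative names:

```python
import numpy as np

def trip_time_distribution(link_ids, mu, U, d):
    """Mean and variance of a trip's total travel time when per-link times
    follow N(mu, U U^T + diag(d)), with U an (L, r) matrix of rank-r link
    representations, r << L. The trip total is Gaussian, and its moments
    never require the L x L covariance. All names are illustrative."""
    a = np.zeros(len(mu))
    a[link_ids] = 1.0                     # indicator of the links traversed
    mean = a @ mu
    var = np.sum((a @ U) ** 2) + np.sum(d[link_ids])  # a'UU'a + a'diag(d)a
    return mean, var
```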
[438] Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning
KaiHui Huang, RunQing Wu, JinHui Sheng, HanYi Zhang, Ling Ge, JinYu Guo, Fei Ye
Main category: cs.LG
TL;DR: OWMMD framework with Multi-Level Feature Matching and Adaptive Regularization Optimization achieves state-of-the-art continual learning performance by mitigating catastrophic forgetting.
Details
Motivation: Continual learning enables models to persistently acquire and retain information, but suffers from catastrophic forgetting that severely impairs model performance when learning new tasks.
Method: Proposes OWMMD framework with Multi-Level Feature Matching Mechanism to penalize representation alterations, plus Adaptive Regularization Optimization strategy that refines adaptive weight vectors to autonomously assess feature layer importance during optimization.
Result: Comprehensive experiments show the approach achieves state-of-the-art performance compared to established baselines, effectively mitigating catastrophic forgetting.
Conclusion: The proposed OWMMD framework with adaptive regularization successfully addresses network forgetting in continual learning, relieving over-regularization problems and promoting future task learning.
Abstract: Continual learning has emerged as a pivotal area of research, primarily due to its advantageous characteristic that allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. In this study, we address network forgetting by introducing a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations via a Multi-Level Feature Matching Mechanism (MLFMM). Furthermore, we propose an Adaptive Regularization Optimization (ARO) strategy to refine the adaptive weight vectors, which autonomously assess the significance of each feature layer throughout the optimization process. The proposed ARO approach can relieve the over-regularization problem and promote future task learning. We conduct a comprehensive series of experiments, benchmarking our proposed method against several established baselines. The empirical findings indicate that our approach achieves state-of-the-art performance.
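For reference, the building block is the (biased) empirical squared MMD between two feature batches; a weighted sum of per-layer MMDs then penalizes representation drift. The RBF kernel and the plain weight vector below are illustrative simplifications of the paper's adaptive ARO weighting:

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased empirical squared MMD between batches X (n, d) and Y (m, d)
    under an RBF kernel."""
    def k(A, B):
        d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def multilevel_penalty(feats_old, feats_new, weights):
    """Weighted multi-level matching: sum_l w_l * MMD^2(old_l, new_l) over
    per-layer feature batches from the previous-task snapshot and the
    current model. `weights` stands in for the adaptive ARO vector."""
    return sum(w * mmd2(a, b) for w, a, b in zip(weights, feats_old, feats_new))
```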
[439] Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression
Pratik Rathore, Zachary Frangella, Jiaming Yang, Michał Dereziński, Madeleine Udell
Main category: cs.LG
TL;DR: ASkotch is a new scalable, accelerated iterative solver for full kernel ridge regression that achieves linear convergence and outperforms state-of-the-art methods on large datasets.
Details
Motivation: Full kernel ridge regression (KRR) is computationally expensive for large datasets, while approximate methods using inducing points sacrifice predictive performance. There's a need for scalable solvers that can handle full KRR without compromising accuracy.
Method: ASkotch is an accelerated iterative method for full KRR that leverages ridge leverage scores and determinantal point processes theory to achieve linear convergence, with condition-number-free convergence under appropriate conditions.
Result: ASkotch outperforms state-of-the-art KRR solvers on 23 large-scale regression and classification tasks, demonstrating the superiority of full KRR over inducing points approximations.
Conclusion: ASkotch enables practical full KRR on large datasets, opening up new applications across various disciplines by providing better solutions faster than existing methods.
Abstract: Kernel ridge regression (KRR) is a fundamental computational tool, appearing in problems that range from computational chemistry to health analytics, and is of particular interest due to its starring role in Gaussian process regression. However, full KRR solvers are challenging to scale to large datasets: both direct (i.e., Cholesky decomposition) and iterative methods (i.e., PCG) incur prohibitive computational and storage costs. The standard approach to scale KRR to large datasets chooses a set of inducing points and solves an approximate version of the problem, inducing points KRR. However, the resulting solution tends to have worse predictive performance than the full KRR solution. In this work, we introduce a new solver, ASkotch, for full KRR that provides better solutions faster than state-of-the-art solvers for full and inducing points KRR. ASkotch is a scalable, accelerated, iterative method for full KRR that provably obtains linear convergence. Under appropriate conditions, we show that ASkotch obtains condition-number-free linear convergence. This convergence analysis rests on the theory of ridge leverage scores and determinantal point processes. ASkotch outperforms state-of-the-art KRR solvers on a testbed of 23 large-scale KRR regression and classification tasks derived from a wide range of application domains, demonstrating the superiority of full KRR over inducing points KRR. Our work opens up the possibility of as-yet-unimagined applications of full KRR across a number of disciplines.
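For context, full KRR means solving the dense $N \times N$ linear system $(K + \lambda I)\alpha = y$. The direct solve below makes the $O(N^2)$ memory and $O(N^3)$ time bottleneck explicit; this is the cost that iterative solvers like ASkotch are designed to avoid (ASkotch's own update rule is not reproduced here):

```python
import numpy as np

def krr_fit_direct(X, y, lam=1e-3, gamma=1.0):
    """Full KRR via a direct solve of (K + lam*I) alpha = y with an RBF
    kernel: O(N^2) memory and O(N^3) time, the scaling that motivates
    iterative solvers."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)
```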
[440] Token Caching for Diffusion Transformer Acceleration
Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Yuming Li, Chenguang Ma
Main category: cs.LG
TL;DR: TokenCache is a novel acceleration method for diffusion transformers that uses token pruning, block selection, and temporal scheduling to reduce redundant computations while maintaining generation quality.
Details
Motivation: Diffusion transformers have excellent performance but suffer from high computational demands due to quadratic attention complexity and multi-step inference, limiting their practical applications.
Method: TokenCache introduces a Cache Predictor that hierarchically addresses three challenges: (1) Token pruning with importance scores to determine which tokens to prune/reuse, (2) Block selection with adaptive pruning ratios for each block, and (3) Temporal scheduling to decide when to apply caching strategies.
Result: Experimental results across various models show that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.
Conclusion: TokenCache successfully addresses computational bottlenecks in diffusion transformers through intelligent caching mechanisms, enabling practical applications while maintaining performance.
Abstract: Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their computational demands, particularly the quadratic complexity of attention mechanisms and multi-step inference processes, present substantial bottlenecks that limit their practical applications. To address these challenges, we propose TokenCache, a novel acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations. TokenCache tackles three critical questions: (1) Which tokens should be pruned and reused by the caching mechanism to eliminate redundancy? (2) Which blocks should be targeted for efficient caching? (3) At which time steps should caching be applied to balance speed and quality? In response to these challenges, TokenCache introduces a Cache Predictor that hierarchically addresses these issues by (1) Token pruning: assigning importance scores to each token to determine which tokens to prune and reuse; (2) Block selection: allocating pruning ratio to each block to adaptively select blocks for caching; (3) Temporal Scheduling: deciding at which time steps to apply caching strategies. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.
[441] Creating a Causally Grounded Rating Method for Assessing the Robustness of AI Models for Time-Series Forecasting
Kausik Lakkaraju, Rachneet Kaur, Parisa Zehtabi, Sunandita Patra, Zhen Zeng, Siva Likitha Valluru, Biplav Srivastava, Marco Valtorta
Main category: cs.LG
TL;DR: Proposes a causally grounded rating framework to evaluate AI model robustness in time-series forecasting, particularly for stock price prediction, by analyzing statistical and confounding biases under various input perturbations.
Details
Motivation: AI models for time-series forecasting are sensitive to input perturbations, leading to prediction errors that undermine trust among stakeholders like investors and analysts. There's a need for systematic robustness evaluation in black-box settings.
Method: Develops a causally grounded rating framework that analyzes statistical and confounding biases under noisy/erroneous inputs. Tests with stock price data across industries, evaluating uni-modal and multi-modal models (including ViT and FMs) using six input perturbation types and twelve data distributions.
Result: Multi-modal and time-series-specific Foundation Models show greater robustness and accuracy than general-purpose models. User study confirms the framework’s ratings help users compare model robustness more easily.
Conclusion: The proposed rating framework enables stakeholders to understand model robustness and accuracy for better decision-making in black-box settings without needing model weights or training data.
Abstract: AI models, including both time-series-specific and general-purpose Foundation Models (FMs), have demonstrated strong potential in time-series forecasting across sectors like finance. However, these models are highly sensitive to input perturbations, which can lead to prediction errors and undermine trust among stakeholders, including investors and analysts. To address this challenge, we propose a causally grounded rating framework to systematically evaluate model robustness by analyzing statistical and confounding biases under various noisy and erroneous input scenarios. Our framework is applied to a large-scale experimental setup involving stock price data from multiple industries and evaluates both uni-modal and multi-modal models, including Vision Transformer-based (ViT) models and FMs. We introduce six types of input perturbations and twelve data distributions to assess model performance. Results indicate that multi-modal and time-series-specific FMs demonstrate greater robustness and accuracy compared to general-purpose models. Further, to validate our framework’s usability, we conduct a user study showcasing time-series models’ prediction errors along with our computed ratings. The study confirms that our ratings reduce the difficulty for users in comparing the robustness of different models. Our findings can help stakeholders understand model behaviors in terms of robustness and accuracy for better decision-making even without access to the model weights and training data, i.e., black-box settings.
[442] Pseudo-Nonlinear Data Augmentation: A Constrained Energy Minimization Viewpoint
Pingbang Hu, Mahito Sugiyama
Main category: cs.LG
TL;DR: Proposes a novel data augmentation method using energy-based modeling and information geometry principles to create geometrically aware latent spaces for controllable augmentation across data modalities.
Details
Motivation: Existing learning-based data augmentation methods typically rely on generative models with learned latent representations, which may lack intuitive geometric structure and fine-grained controllability. The authors aim to develop a more geometrically aware approach that better represents the intrinsic structure of data while enabling explicit control over augmentation.
Method: Uses energy-based modeling combined with information geometry principles to construct geometrically aware latent spaces that represent the data structure itself. The method supports efficient and explicit encoding/decoding procedures, and includes techniques for designing latent spaces that control the augmentation process.
Result: The proposed data augmentation method achieves competitive performance in downstream tasks compared to other baselines. It offers fine-grained controllability over augmentation that is lacking in existing literature, demonstrating effectiveness across general data modalities.
Conclusion: The energy-based modeling approach with information geometry provides a novel framework for data augmentation that combines competitive performance with enhanced controllability, addressing limitations of existing generative model-based methods.
Abstract: We propose a simple yet novel data augmentation method for general data modalities based on energy-based modeling and principles from information geometry. Unlike most existing learning-based data augmentation methods, which rely on learning latent representations with generative models, our proposed framework enables an intuitive construction of a geometrically aware latent space that represents the structure of the data itself, supporting efficient and explicit encoding and decoding procedures. We then present and discuss how to design latent spaces that will subsequently control the augmentation with the proposed algorithm. Empirical results demonstrate that our data augmentation method achieves competitive performance in downstream tasks compared to other baselines, while offering fine-grained controllability that is lacking in the existing literature.
[443] Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery
HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin
Main category: cs.LG
TL;DR: LRMC is a novel scalable non-convex approach for robust matrix completion that handles missing data and outliers, featuring linear convergence, deep unfolding for parameter learning, and a flexible neural framework supporting infinite iterations.
Details
Motivation: Robust matrix completion needs to address both missing data entries and extreme outliers in low-rank data analysis, but existing approaches may lack scalability, learnability, or computational efficiency for large-scale problems.
Method: Proposes Learned Robust Matrix Completion (LRMC), a non-convex approach with low computational complexity and linear convergence. Uses deep unfolding to learn free parameters, and introduces a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from fixed to infinite iterations.
Result: LRMC demonstrates superior empirical performance against state-of-the-art methods on synthetic datasets and real applications including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.
Conclusion: LRMC provides an effective, scalable, and learnable solution for large-scale robust matrix completion problems, combining theoretical guarantees with practical performance across diverse applications.
Abstract: Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from fixed-number iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against the state of the art on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.
[444] CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers for Causally Constrained Predictions
Matthew J. Vowels, Mathieu Rochat, Sina Akbari
Main category: cs.LG
TL;DR: Causal Transformers (CaTs) are neural networks that incorporate causal constraints from DAGs to improve robustness and interpretability while maintaining powerful function approximation.
Details
Motivation: Traditional ANNs and transformers lack inherent causal structure awareness, making them vulnerable to covariate shift and difficult to interpret, which limits their reliability in real-world applications.
Method: Introduce Causal Transformers (CaTs) that operate under predefined causal constraints specified by Directed Acyclic Graphs (DAGs), retaining traditional neural network capabilities while adhering to structural constraints.
Result: CaTs maintain powerful function approximation abilities while improving robustness, reliability, and interpretability at inference time by incorporating causal constraints.
Conclusion: This approach enables deployment of neural networks in demanding real-world scenarios where robustness and explainability are critical, opening new avenues for reliable AI applications.
Abstract: Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret/explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Transformers (CaTs), a general model class designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). CaTs retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.
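One natural way to impose a DAG constraint on a transformer is through the attention mask, letting each variable's token attend only to its causal parents. The sketch below shows this construction as an assumption about the mechanism; the paper's exact masking scheme may differ:

```python
import numpy as np

def dag_attention_mask(adj):
    """adj[i, j] = 1 iff variable i is a direct cause of variable j.
    Returns a boolean mask (queries x keys) letting token j attend only to
    its causal parents and itself, so attention respects the DAG."""
    A = np.asarray(adj, dtype=bool)
    return A.T | np.eye(A.shape[0], dtype=bool)

# Example: chain X0 -> X1 -> X2; X2 may attend to X1 but not to X0 directly.
mask = dag_attention_mask([[0, 1, 0],
                           [0, 0, 1],
                           [0, 0, 0]])
```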
[445] PYRREGULAR: A Unified Framework for Irregular Time Series, with Classification Benchmarks
Francesco Spinnato, Cristiano Landi
Main category: cs.LG
TL;DR: A unified framework and standardized dataset repository for irregular time series classification, featuring 34 datasets and benchmarking 12 classifier models to centralize research efforts.
Details
Motivation: Irregular temporal data with varying frequencies, durations, and missing values poses significant challenges across multiple fields, but existing research communities address these challenges in isolation with fragmented tools and methods.
Method: Introduce a unified framework and the first standardized dataset repository for irregular time series classification, built on a common array format for interoperability. The repository includes 34 datasets and benchmarks 12 classifier models from diverse domains.
Result: Created a centralized repository with 34 datasets and benchmarked 12 classifier models, providing a standardized evaluation platform for irregular temporal data analysis methods.
Conclusion: This work bridges the gap in irregular time series research by providing a unified framework and standardized dataset repository to centralize research efforts and enable more robust evaluation of methods.
Abstract: Irregular temporal data, characterized by varying recording frequencies, differing observation durations, and missing values, presents significant challenges across fields like mobility, healthcare, and environmental science. Existing research communities often overlook or address these challenges in isolation, leading to fragmented tools and methods. To bridge this gap, we introduce a unified framework, and the first standardized dataset repository for irregular time series classification, built on a common array format to enhance interoperability. This repository comprises 34 datasets on which we benchmark 12 classifier models from diverse domains and communities. This work aims to centralize research efforts and enable a more robust evaluation of irregular temporal data analysis methods.
[446] LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster
Main category: cs.LG
TL;DR: LSHBloom is a new deduplication method that combines MinhashLSH with Bloom filters to achieve state-of-the-art deduplication performance with 12× faster runtime and 18× less disk space than MinhashLSH.
Details
Motivation: Current document-level deduplication methods for LLM training datasets are either unreliable or extremely expensive in terms of runtime and memory, making it difficult to scale high-quality deduplication to internet-scale text datasets.
Method: LSHBloom extends MinhashLSH by replacing the expensive LSHIndex with lightweight Bloom filters, maintaining the same deduplication performance while significantly reducing computational and storage requirements.
Result: LSHBloom achieves state-of-the-art deduplication performance comparable to MinhashLSH with only marginal increase in false positives (near zero), while being 12× faster and using 18× less disk space on the peS2o dataset.
Conclusion: LSHBloom enables scaling high-quality document deduplication to internet-scale text datasets by providing the deduplication quality of MinHashLSH at scales that were previously only tractable for less sophisticated heuristic solutions.
Abstract: Contemporary large language model (LLM) training pipelines require the assembly of internet-scale databases full of text data from a variety of sources (e.g., web, academic, and publishers). Preprocessing these datasets via deduplication – detecting and eliminating additional instances of the same content – is a major focus for assembling and curating training datasets for LLMs. Left unchecked, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Unfortunately, contemporary approaches to document-level deduplication are either unreliable at accurately identifying duplicate documents or extremely expensive in terms of both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same state-of-the-art deduplication performance as MinhashLSH, with only a marginal increase in false positives (near zero in our experiments), while boasting competitive runtime (12$\times$ faster than MinhashLSH on peS2o) and, crucially, using 18$\times$ less disk space than MinhashLSH (as measured on peS2o). Based on extrapolation, we show that this advantage in space and runtime remains even at the extreme scale of several billion documents. LSHBloom allows practitioners to access the deduplication quality of MinHashLSH at scales that are normally only tractable for less sophisticated, heuristic solutions. As a result, LSHBloom promises to enable scaling high-quality document deduplication to internet-scale text datasets.
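The core substitution can be sketched compactly: instead of inserting MinHash LSH bands into an exact index, each band is hashed into a per-band Bloom filter, and a document is flagged when any of its bands has been seen before. Filter sizes, hash choices, and names below are illustrative, not the paper's configuration:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 24, n_hashes=7):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def add(self, item: bytes) -> bool:
        """Insert item; return True iff it was (probably) already present."""
        seen = True
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            pos = int.from_bytes(h, "big") % self.n_bits
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

def seen_before(signature, n_bands, rows_per_band, filters):
    """LSH banding over a MinHash signature, one Bloom filter per band: a
    document is flagged as a near-duplicate if any band repeats."""
    dup = False
    for b in range(n_bands):
        band = repr(signature[b * rows_per_band:(b + 1) * rows_per_band])
        dup |= filters[b].add(band.encode("utf8"))
    return dup
```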
[447] From Tables to Time: Extending TabPFN-v2 to Time Series Forecasting
Shi Bin Hoo, Samuel Müller, David Salinas, Frank Hutter
Main category: cs.LG
TL;DR: TabPFN-TS uses tabular foundation models for time series forecasting by treating forecasting as tabular regression with temporal featurization, achieving SOTA on covariate-informed forecasting without time-series-specific pretraining.
Details
Motivation: To demonstrate that tabular foundation models can be effectively applied to time series forecasting, bridging tabular and time-series learning within a unified framework without requiring specialized time-series pretraining.
Method: Treats forecasting as a tabular regression problem by combining lightweight temporal featurization with pretrained TabPFN-v2, supporting both univariate and covariate-informed forecasting with only 11M parameters.
Result: Achieves state-of-the-art performance on covariate-informed forecasting and competitive accuracy on univariate forecasting across GIFT-Eval and fev-bench benchmarks.
Conclusion: Tabular foundation models paired with suitable temporal features offer an efficient and versatile alternative for forecasting, demonstrating capabilities that emerge from tabular models for time series tasks.
Abstract: Recent progress in foundation models has enabled strong zero-shot performance for time series forecasting. In this work, we show that such capabilities can also emerge from tabular foundation models. We introduce TabPFN-TS, a simple method that treats forecasting as a tabular regression problem by combining lightweight temporal featurization with the pretrained TabPFN-v2. This formulation requires no time-series-specific pretraining and naturally supports both univariate and covariate-informed forecasting. Despite its compact size (11M parameters), TabPFN-TS achieves state-of-the-art performance on covariate-informed forecasting and competitive accuracy on univariate forecasting across the GIFT-Eval and fev-bench benchmarks. We further provide controlled analyses examining how the model interprets temporal structure, how featurization choices affect accuracy, and how forecasts change under alternative tabular backbones. Together, our results demonstrate that tabular foundation models, when paired with suitable temporal features, offer an efficient and versatile alternative for forecasting, bridging tabular and time-series learning within a unified framework. Code is available at https://github.com/PriorLabs/tabpfn-time-series.
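The reduction to tabular regression hinges on the temporal featurization. A hedged sketch of one plausible feature map follows (the paper's exact feature set is not reproduced here): fit the tabular model on observed (features, value) rows, then predict on the rows built from future timestamps:

```python
import numpy as np
import pandas as pd

def featurize_timestamps(ts: pd.DatetimeIndex) -> pd.DataFrame:
    """Map timestamps to tabular features: a running index plus sin/cos
    encodings of calendar periodicities. Forecasting then reduces to
    tabular regression over these rows."""
    feats = {"t": np.arange(len(ts), dtype=float)}
    for name, period, value in [
        ("hour", 24.0, ts.hour + ts.minute / 60.0),
        ("dow", 7.0, ts.dayofweek),
        ("doy", 365.25, ts.dayofyear),
    ]:
        ang = 2.0 * np.pi * np.asarray(value, dtype=float) / period
        feats[f"{name}_sin"], feats[f"{name}_cos"] = np.sin(ang), np.cos(ang)
    return pd.DataFrame(feats, index=ts)
```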
[448] Improving LLM-based Global Optimization with Search Space Partitioning
Andrej Schwanke, Lyubomir Ivanov, David Salinas, Fabio Ferreira, Aaron Klein, Frank Hutter, Arber Zela
Main category: cs.LG
TL;DR: HOLLM is a global optimization algorithm that enhances LLM-based sampling by partitioning search space into promising subregions selected via bandit-inspired scoring, improving performance in high-dimensional spaces.
Details
Motivation: LLMs show promise as surrogate models in global optimization but struggle in high-dimensional spaces or without domain-specific priors, leading to sparse/uninformative suggestions.
Method: Partitions search space into subregions (meta-arms), selects promising ones via bandit-inspired scoring mechanism balancing exploration/exploitation, then uses LLM to propose candidate points within selected subregions without domain knowledge.
Result: Empirical evaluation shows HOLLM consistently matches or surpasses leading global optimization methods and substantially outperforms global LLM-based sampling strategies.
Conclusion: HOLLM effectively overcomes limitations of LLM-based optimization by combining space partitioning with bandit selection, enabling better performance in challenging optimization scenarios.
Abstract: Large Language Models (LLMs) have recently emerged as effective surrogate models and candidate generators within global optimization frameworks for expensive blackbox functions. Despite promising results, LLM-based methods often struggle in high-dimensional search spaces or when lacking domain-specific priors, leading to sparse or uninformative suggestions. To overcome these limitations, we propose HOLLM, a novel global optimization algorithm that enhances LLM-driven sampling by partitioning the search space into promising subregions. Each subregion acts as a "meta-arm" selected via a bandit-inspired scoring mechanism that effectively balances exploration and exploitation. Within each selected subregion, an LLM then proposes high-quality candidate points, without any explicit domain knowledge. Empirical evaluation on standard optimization benchmarks shows that HOLLM consistently matches or surpasses leading global optimization methods, while substantially outperforming global LLM-based sampling strategies.
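The meta-arm selection can be illustrated with a generic UCB-style score: exploitation via the best value observed in a region, plus an exploration bonus for under-sampled regions. This is a sketch of the trade-off the paper describes, not its exact scoring rule; the chosen region's bounds would then be serialized into the LLM prompt that requests new candidate points:

```python
import math

def select_subregion(regions):
    """Pick the next subregion ('meta-arm') with a UCB-style score. Each
    region dict tracks 'best_value' (best objective seen inside it) and
    'n_samples' (how often it has been queried)."""
    total = max(sum(r["n_samples"] for r in regions), 1)
    def score(r):
        n = max(r["n_samples"], 1)
        return r["best_value"] + math.sqrt(2.0 * math.log(total) / n)
    return max(regions, key=score)
```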
[449] Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers
Thomas Heap, Tim Lawson, Lucy Farnik, Laurence Aitchison
Main category: cs.LG
TL;DR: SAE metrics can’t distinguish trained from random transformers; high scores don’t guarantee meaningful feature recovery.
Details
Motivation: To test whether commonly used SAE quality metrics and automatic explanation pipelines can reliably distinguish between trained transformers and randomly initialized ones, since current metrics might be misleading.
Method: Train sparse autoencoders on both trained Pythia models and randomly initialized transformers (with various randomization schemes), then compare auto-interpretability scores and reconstruction metrics across both conditions.
Result: SAEs trained on randomly initialized transformers often produce similar auto-interpretability scores and reconstruction metrics as those from trained models, showing current metrics can’t reliably distinguish meaningful features from random ones.
Conclusion: Common SAE metrics are insufficient proxies for mechanistic interpretability; researchers should use randomized baselines and targeted measures of feature ‘abstractness’ to validate interpretability findings.
Abstract: Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue for routine randomized baselines and targeted measures of feature ‘abstractness’.
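For readers unfamiliar with the setup, a minimal SAE of the standard form is sketched below: an overcomplete ReLU code trained to reconstruct activations under an L1 sparsity penalty. Auto-interpretability pipelines then score natural-language explanations of individual latents; the paper's finding is that neither kind of metric separates trained from random models. Details here (shapes, loss coefficients) are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder of the kind trained on transformer
    activations; sparsity is encouraged by an L1 penalty on the code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction error (one common 'SAE quality' metric) plus sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()
```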
[450] JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation
Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo
Main category: cs.LG
TL;DR: JointDiff is a diffusion framework that simultaneously generates continuous spatio-temporal data and synchronous discrete events, bridging the gap between continuous and discrete modeling in complex interactive systems like sports.
Details
Motivation: Current generative models treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously (like in sports where player trajectories and possession events occur together).
Method: JointDiff is a novel diffusion framework that unifies continuous and discrete processes. It introduces CrossGuid, an effective conditioning operation for multi-agent domains, and is applied to sports by simultaneously modeling multi-agent trajectories and key possession events.
Result: The method achieves state-of-the-art performance and is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance (semantic control via intended ball possessors) and text-guidance (language-driven generation). A new unified sports benchmark with textual descriptions is also created.
Conclusion: Joint modeling of continuous and discrete processes is crucial for building realistic and controllable generative models for interactive systems, as demonstrated by JointDiff’s superior performance in sports applications.
Abstract: Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
[451] Generative Modeling with Bayesian Sample Inference
Marten Lienen, Marcel Kollovieh, Stephan Günnemann
Main category: cs.LG
TL;DR: A novel generative model derived from iterative Gaussian posterior inference that treats generated samples as unknown variables and uses Bayesian probability for sampling, connecting to diffusion models and including BFNs as a special case.
Details
Motivation: To develop a principled generative modeling approach based on Bayesian probability theory, treating the sampling process as iterative posterior inference starting from broad initial beliefs about unknown samples.
Method: Derives a generative model from iterative Gaussian posterior inference, formulating sampling as Bayesian probability with prediction and posterior update steps. The model iteratively narrows down unknown samples starting from broad initial beliefs.
Result: Improves sample quality on ImageNet32 over both Bayesian Flow Networks (BFNs) and Variational Diffusion Models, while achieving equivalent log-likelihoods on ImageNet32 and ImageNet64.
Conclusion: The proposed Bayesian iterative inference model provides a principled framework for generative modeling that connects to diffusion models and outperforms related approaches in sample quality while maintaining competitive likelihood performance.
Abstract: We derive a novel generative model from iterative Gaussian posterior inference. By treating the generated sample as an unknown variable, we can formulate the sampling process in the language of Bayesian probability. Our model uses a sequence of prediction and posterior update steps to iteratively narrow down the unknown sample starting from a broad initial belief. In addition to a rigorous theoretical analysis, we establish a connection between our model and diffusion models and show that it includes Bayesian Flow Networks (BFNs) as a special case. In our experiments, we demonstrate that our model improves sample quality on ImageNet32 over both BFNs and the closely related Variational Diffusion Models, while achieving equivalent log-likelihoods on ImageNet32 and ImageNet64. Find our code at https://github.com/martenlienen/bsi.
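The iterative narrowing can be sketched with the standard conjugate-Gaussian update: precisions add, and the new mean is a precision-weighted average. The loop below is a hedged caricature of sampling by posterior inference, with `predict` standing in for the trained network and the precision schedule an assumption:

```python
import numpy as np

def posterior_update(mu, prec, y, noise_prec):
    """Conjugate update for x ~ N(mu, prec^-1 I) after observing
    y = x + eps with eps ~ N(0, noise_prec^-1 I): precisions add and the
    new mean is the precision-weighted average."""
    new_prec = prec + noise_prec
    new_mu = (prec * mu + noise_prec * y) / new_prec
    return new_mu, new_prec

def generate(predict, shape, noise_precisions, seed=0):
    """Sampling as iterative posterior inference: start from a broad belief,
    then repeatedly predict the clean sample from the current belief, emit
    a noisy observation of that prediction, and update the belief."""
    rng = np.random.default_rng(seed)
    mu, prec = np.zeros(shape), 1e-4
    for tau in noise_precisions:
        x_hat = predict(mu, prec)
        y = x_hat + rng.standard_normal(shape) / np.sqrt(tau)
        mu, prec = posterior_update(mu, prec, y, tau)
    return mu
```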
[452] Activation Function Design Sustains Plasticity in Continual Learning
Lute Lillo, Nick Cheney
Main category: cs.LG
TL;DR: Activation function choice is crucial for mitigating plasticity loss in continual learning, with new Smooth-Leaky and Randomized Smooth-Leaky nonlinearities showing strong performance across supervised and reinforcement learning benchmarks.
Details
Motivation: In continual learning, models can progressively lose the ability to adapt (plasticity loss), which differs from i.i.d. training where activation differences shrink with tuning. The role of activation functions in this failure mode remains underexplored.
Method: Introduced two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) based on property-level analysis of negative-branch shape and saturation behavior. Evaluated in supervised class-incremental benchmarks and reinforcement learning with non-stationary MuJoCo environments. Provided stress protocol and diagnostics linking activation shape to adaptation under change.
Result: Activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. The proposed nonlinearities sustain plasticity in continual learning without extra capacity or task-specific tuning.
Conclusion: Thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning, making it a crucial consideration beyond traditional i.i.d. training regimes.
Abstract: In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
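The abstract does not give the formulas for Smooth-Leaky or Randomized Smooth-Leaky, so the sketch below is purely one plausible instantiation of the stated design properties (a smooth, non-saturating negative branch, optionally with a randomized slope) and should not be read as the paper's definitions:

```python
import torch

def smooth_leaky_example(x, alpha=0.1):
    """One plausible 'smooth leaky' shape: a leaky linear term blended with
    a SiLU-style gate, giving a smooth, non-saturating negative branch.
    Illustrative only; not the paper's definition."""
    return alpha * x + (1.0 - alpha) * x * torch.sigmoid(x)

def randomized_smooth_leaky_example(x, alpha_range=(0.05, 0.3)):
    """Randomized variant: resample the negative-branch slope each forward
    pass (an assumption about what the randomization could mean)."""
    alpha = float(torch.empty(1).uniform_(*alpha_range))
    return smooth_leaky_example(x, alpha)
```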
[453] A Comprehensive Survey of Deep Learning for Multivariate Time Series Forecasting: A Channel Strategy Perspective
Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Junkai Lu, Jilin Hu, Chenjuan Guo, Christian S. Jensen, Bin Yang
Main category: cs.LG
TL;DR: This paper provides a systematic review and taxonomy of channel modeling strategies for multivariate time series forecasting, organizing approaches into three hierarchical perspectives and analyzing their advantages/limitations.
Details
Motivation: Multivariate time series forecasting is crucial across many domains, and modeling correlations between different channels is critical for improving prediction accuracy. However, there's a need for systematic organization and analysis of existing channel modeling strategies to provide clear guidance for researchers.
Method: The authors propose a three-level taxonomy: strategy perspective (how channels are modeled), mechanism perspective (underlying computational mechanisms), and characteristic perspective (inherent properties of channels). They conduct structured analysis of existing methods and examine advantages/limitations of different channel strategies.
Result: The paper provides a comprehensive survey and taxonomy of channel modeling strategies for MTSF, offering a systematic framework for understanding existing approaches and their trade-offs.
Conclusion: The review organizes channel modeling strategies into a clear taxonomy, analyzes their strengths and weaknesses, discusses future research directions, and provides an up-to-date GitHub repository for ongoing reference and community contribution.
Abstract: Multivariate Time Series Forecasting (MTSF) plays a crucial role across diverse fields, ranging from economics and energy to traffic. In recent years, deep learning has demonstrated outstanding performance in MTSF tasks. In MTSF, modeling the correlations among different channels is critical, as leveraging information from other related channels can significantly improve the prediction accuracy of a specific channel. This study systematically reviews the channel modeling strategies for time series and proposes a taxonomy organized into three hierarchical levels: the strategy perspective, the mechanism perspective, and the characteristic perspective. On this basis, we provide a structured analysis of these methods and conduct an in-depth examination of the advantages and limitations of different channel strategies. Finally, we summarize and discuss some future research directions to provide useful research guidance. Moreover, we maintain an up-to-date Github repository (https://github.com/decisionintelligence/CS4TS) which includes all the papers discussed in the survey.
[454] Universal Multi-Domain Translation via Diffusion Routers
Duc Kieu, Kien Do, Tuan Hoang, Thao Minh Le, Tung Kieu, Dang Nguyen, Thin Nguyen
Main category: cs.LG
TL;DR: Universal Multi-Domain Translation (UMDT) enables translations between any pair of K domains using only K-1 paired datasets with a central domain, via a novel Diffusion Router framework.
Details
Motivation: Existing multi-domain translation approaches require fully aligned tuples or can only handle domain pairs seen in training, limiting practicality and excluding many cross-domain mappings.
Method: Proposes Diffusion Router (DR), a unified diffusion-based framework that models all central↔non-central translations with a single noise predictor conditioned on source/target domain labels. Enables indirect non-central translations by routing through central domain, with scalable learning strategy using variational-bound objective and efficient Tweedie refinement.
Result: Achieves state-of-the-art results on three large-scale UMDT benchmarks for both indirect and direct translations, while lowering sampling cost and enabling novel tasks like sketch↔segmentation.
Conclusion: DR establishes a scalable and versatile framework for universal translation across multiple domains, overcoming limitations of previous approaches.
Abstract: Multi-domain translation (MDT) aims to learn translations between multiple domains, yet existing approaches either require fully aligned tuples or can only handle domain pairs seen in training, limiting their practicality and excluding many cross-domain mappings. We introduce universal MDT (UMDT), a generalization of MDT that seeks to translate between any pair of $K$ domains using only $K-1$ paired datasets with a central domain. To tackle this problem, we propose Diffusion Router (DR), a unified diffusion-based framework that models all central$\leftrightarrow$non-central translations with a single noise predictor conditioned on the source and target domain labels. DR enables indirect non-central translations by routing through the central domain. We further introduce a novel scalable learning strategy with a variational-bound objective and an efficient Tweedie refinement procedure to support direct non-central mappings. Through evaluation on three large-scale UMDT benchmarks, DR achieves state-of-the-art results for both indirect and direct translations, while lowering sampling cost and unlocking novel tasks such as sketch$\leftrightarrow$segmentation. These results establish DR as a scalable and versatile framework for universal translation across multiple domains.
[455] Exploring Graph Learning Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation
Yuxiang Wang, Xinnan Dai, Wenqi Fan, Yao Ma
Main category: cs.LG
TL;DR: LLMs show strong potential for graph learning tasks, outperforming traditional graph models in few-shot settings with good domain transfer and robustness.
Details
Motivation: Previous studies on LLMs for graph tasks focus mainly on performance benchmarks without comprehensive comparison to graph learning models or exploration of broader capabilities like domain transfer and robustness.
Method: Comprehensive evaluation of both off-the-shelf and instruction-tuned LLMs across various scenarios including few-shot/zero-shot settings, domain transfer, structural understanding, and robustness, while also addressing data leakage and computational overhead.
Result: LLMs, especially instruction-tuned models, significantly outperform traditional graph learning models in few-shot settings, demonstrate strong domain transferability, and show excellent generalization and robustness capabilities.
Conclusion: LLMs have broader capabilities in graph learning than previously recognized, providing a foundation for future research in this area.
Abstract: In recent years, large language models (LLMs) have emerged as promising candidates for graph tasks. Many studies leverage natural language to describe graphs and apply LLMs for reasoning, yet most focus narrowly on performance benchmarks without fully comparing LLMs to graph learning models or exploring their broader potential. In this work, we present a comprehensive study of LLMs on graph learning tasks, evaluating both off-the-shelf and instruction-tuned models across a variety of scenarios. Beyond accuracy, we discuss data leakage concerns and computational overhead, and assess their performance under few-shot/zero-shot settings, domain transfer, structural understanding, and robustness. Our findings show that LLMs, particularly those with instruction tuning, greatly outperform traditional graph learning models in few-shot settings, exhibit strong domain transferability, and demonstrate excellent generalization and robustness. Our study highlights the broader capabilities of LLMs in graph learning and provides a foundation for future research.
[456] Beyond Memorization: Selective Learning for Copyright-Safe Diffusion Model Training
Divya Kothandaraman, Jaclyn Pytlarz
Main category: cs.LG
TL;DR: A gradient projection method for concept-level feature exclusion in diffusion models to prevent memorization of sensitive attributes while preserving training data utility.
Details
Motivation: Memorization in text-to-image diffusion models creates security and IP risks, enabling adversarial extraction of sensitive features. Current dememorization techniques fail to prevent concept-level feature internalization, and discarding all related images wastes valuable training data.
Method: Gradient projection method that operates during backpropagation by identifying and removing training signals aligned with prohibited attribute embeddings. Projects gradient updates onto orthogonal complement of sensitive feature’s embedding space to eliminate its influence on model weights.
Result: The framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. It integrates seamlessly into standard training pipelines and complements existing defenses against feature extraction attacks.
Conclusion: The approach establishes a new paradigm for IP-safe and privacy-preserving generative AI by reframing memorization control as selective learning at the concept level.
Abstract: Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective learning at the concept level. We introduce a gradient projection method designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature’s embedding space, thereby zeroing out its influence on the model’s weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.
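To make the projection step concrete, here is a minimal PyTorch sketch of removing the gradient component aligned with a single prohibited direction; the variable names and the one-vector setting are illustrative assumptions, not the authors' implementation, which operates on an embedding subspace.

```python
import torch

def project_out_sensitive(grad: torch.Tensor, attr_embedding: torch.Tensor) -> torch.Tensor:
    """Project a gradient onto the orthogonal complement of a prohibited
    attribute direction: g <- g - (<g, v>/<v, v>) v."""
    v = attr_embedding.flatten()
    g = grad.flatten()
    coeff = torch.dot(g, v) / (torch.dot(v, v) + 1e-12)
    return (g - coeff * v).reshape(grad.shape)

# Toy usage: scrub every parameter gradient after backward().
w = torch.nn.Parameter(torch.randn(4, 4))
loss = (w.sum() - 1.0) ** 2
loss.backward()
sensitive_dir = torch.randn(16)  # hypothetical embedding of a prohibited concept
with torch.no_grad():
    w.grad = project_out_sensitive(w.grad, sensitive_dir)
```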
[457] Causal Effect Estimation under Networked Interference without Networked Unconfoundedness Assumption
Weilin Chen, Ruichu Cai, Jie Qiao, Yuguang Yan, José Miguel Hernández-Lobato
Main category: cs.LG
TL;DR: Proposes a framework to estimate causal effects under networked interference when latent confounders violate the unconfoundedness assumption, by recovering three types of latent confounders from network interaction patterns.
Details
Motivation: The networked unconfoundedness assumption is often violated in observational data due to latent confounders, preventing accurate estimation of causal effects under networked interference. Existing methods fail when this assumption doesn't hold.
Method: Develops a confounder recovery framework that identifies three categories of latent confounders: unit-specific, neighbor-specific, and shared confounders. Uses identifiable representation learning to design a networked effect estimator based on recovered confounders.
Result: Proves identifiability of all three types of latent confounders and establishes formal identification result for networked effects. Extensive experiments validate theoretical findings and demonstrate method effectiveness.
Conclusion: The proposed framework successfully addresses the challenge of latent confounders in networked causal inference, enabling reliable estimation of causal effects under networked interference when the unconfoundedness assumption is violated.
Abstract: Estimating causal effects under networked interference from observational data is a crucial yet challenging problem. Most existing methods mainly rely on the networked unconfoundedness assumption, which guarantees the identification of networked effects. However, this assumption is often violated due to the latent confounders inherent in observational data, thereby hindering the identification of networked effects. To address this issue, we leverage the rich interaction patterns between units in networks, which provide valuable information for recovering these latent confounders. Building on this insight, we develop a confounder recovery framework that explicitly characterizes three categories of latent confounders in networked settings: those affecting only the unit, those affecting only the unit’s neighbors, and those influencing both. Based on this framework, we design a networked effect estimator using identifiable representation learning techniques. From a theoretical standpoint, we prove the identifiability of all three types of latent confounders and, by leveraging the recovered confounders, establish a formal identification result for networked effects. Extensive experiments validate our theoretical findings and demonstrate the effectiveness of the proposed method.
[458] BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
Rui Li, Xiaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman
Main category: cs.LG
TL;DR: Bootseer reduces LLM training startup overhead by 50% through optimizing container loading, dependency installation, and checkpoint resumption.
Details
Motivation: LLM training suffers from significant startup overhead (3.5% GPU time wasted), especially in industrial-scale systems where failures are frequent and teams work in iterative cycles. Prior research focused only on runtime performance, ignoring startup delays.
Method: Bootseer system-level optimization framework addresses three bottlenecks: (1) container image loading via hot block record-and-prefetch, (2) runtime dependency installation via dependency snapshotting, and (3) model checkpoint resumption via striped HDFS-FUSE.
Result: 50% reduction in startup overhead when deployed in production environment with real LLM training workloads.
Conclusion: Startup overhead is a critical issue in industrial LLM training, and Bootseer’s system-level optimizations effectively address this problem, significantly reducing wasted GPU time.
Abstract: Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
[459] Sample-Efficient Optimization over Generative Priors via Coarse Learnability
Pranjal Awasthi, Sreenivas Gollapudi, Ravi Kumar, Kamesh Munagala
Main category: cs.LG
TL;DR: A framework for zeroth-order optimization with generative priors (like LLMs) that finds solutions minimizing an objective while maintaining high probability under the prior, with polynomial sample complexity guarantees.
Details
Motivation: Traditional zeroth-order optimization lacks theoretical guarantees when incorporating complex generative priors like LLMs, which are needed for problems requiring qualitative constraints or prior distributions.
Method: Introduces “coarse learnability” assumption, then designs iterative algorithm with Metropolis-Hastings correction to approximate target distribution proportional to the prior times the exponentiated negative objective.
Result: Provides polynomial sample complexity guarantees for model-based optimization with deep generative priors, with theoretical support for coarse learnability and empirical validation using LLMs.
Conclusion: First work to establish sample-complexity guarantees for model-based optimization with deep generative priors, enabling principled use of LLMs for constrained zeroth-order optimization.
Abstract: In zeroth-order optimization, we seek to minimize a function $d(\cdot)$, which may encode combinatorial feasibility, using only function evaluations. We focus on the setting where solutions must also satisfy qualitative constraints or conform to a complex prior distribution. To address this, we introduce a new framework in which such constraints are represented by an initial generative prior $\mathcal{L}(\cdot)$, for example, a Large Language Model (LLM). The objective is to find solutions $s$ that minimize $d(s)$ while having high probability under $\mathcal{L}(s)$, effectively sampling from a target distribution proportional to $\mathcal{L}(s) \cdot e^{-T \cdot d(s)}$ for a temperature parameter $T$. While this framework aligns with classical Model-Based Optimization (e.g., the Cross-Entropy method), existing theory is ill-suited for deriving sample complexity bounds in black-box deep generative models. We therefore propose a novel learning assumption, which we term “coarse learnability”, where an agent with access to a polynomial number of samples can learn a model whose point-wise density approximates the target within a polynomial factor. Leveraging this assumption, we design an iterative algorithm that employs a Metropolis-Hastings correction to provably approximate the target distribution using a polynomial number of samples. To the best of our knowledge, this is one of the first works to establish such sample-complexity guarantees for model-based optimization with deep generative priors. We provide two lines of evidence supporting the coarse learnability assumption. Theoretically, we show that maximum likelihood estimation naturally induces the required coverage properties, holding for both standard exponential families and for misspecified models. Empirically, we demonstrate that LLMs can adapt their learned distributions to zeroth-order feedback to solve combinatorial optimization problems.
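For intuition on the sampling target, a toy Metropolis-Hastings loop over bit strings is sketched below; the uniform prior and bit-flip proposal are stand-ins for the generative prior $\mathcal{L}$ and its proposal mechanism, so this illustrates only the shape of the correction step, not the paper's algorithm.

```python
import math
import random

def mh_sample(log_prior, objective, propose, s0, T=1.0, steps=2000):
    """Metropolis-Hastings targeting pi(s) proportional to
    exp(log_prior(s)) * exp(-T * objective(s)), assuming a symmetric proposal."""
    s = s0
    log_p = log_prior(s) - T * objective(s)
    for _ in range(steps):
        s_new = propose(s)
        log_p_new = log_prior(s_new) - T * objective(s_new)
        if log_p_new >= log_p or random.random() < math.exp(log_p_new - log_p):
            s, log_p = s_new, log_p_new  # accept
    return s

def flip_one(s):
    """Symmetric proposal on bit strings: flip one random bit."""
    i = random.randrange(len(s))
    return s[:i] + [1 - s[i]] + s[i + 1:]

# Toy instance: uniform prior, d(s) = number of ones. In the paper's setting,
# log_prior would come from the generative model (e.g., LLM log-probabilities).
best = mh_sample(lambda s: 0.0, lambda s: float(sum(s)), flip_one,
                 s0=[random.randint(0, 1) for _ in range(16)], T=2.0)
```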
[460] Noradrenergic-inspired gain modulation attenuates the stability gap in joint training
Alejandro Rodriguez-Garcia, Anindya Ghosh, Srikanth Ramaswamy
Main category: cs.LG
TL;DR: Dynamic gain scaling optimization technique reduces stability gaps in continual learning by modulating learning rates and flattening local landscapes, inspired by neuromodulatory mechanisms.
Details
Motivation: Address the stability gap problem in continual learning where performance drops on previous tasks when new tasks are introduced, even under ideal joint training. Need optimization mechanisms that balance plasticity (adaptation to new tasks) and stability (retention of old tasks) at task boundaries.
Method: Introduce dynamic gain scaling as a two-timescale optimization technique inspired by noradrenergic (neuromodulatory) bursts that transiently increase neuronal gain under uncertainty. The mechanism modulates effective learning rates and flattens the local landscape through an effective reparameterization.
Result: Dynamic gain scaling effectively attenuates stability gaps while maintaining competitive accuracy across domain- and class-incremental MNIST, CIFAR, and mini-ImageNet benchmarks under task-agnostic joint training. Improves robustness at task transitions.
Conclusion: Dynamic gain scaling provides an effective optimization approach to mitigate stability gaps in continual learning by balancing adaptation and retention, demonstrating improved robustness at task boundaries while maintaining overall performance.
Abstract: Recent work in continual learning has highlighted the stability gap – a temporary performance drop on previously learned tasks when new ones are introduced. This phenomenon reflects a mismatch between rapid adaptation and strong retention at task boundaries, underscoring the need for optimization mechanisms that balance plasticity and stability over abrupt distribution changes. While optimizers such as momentum-SGD and Adam introduce implicit multi-timescale behavior, they still exhibit pronounced stability gaps. Importantly, these gaps persist even under ideal joint training, making it crucial to study them in this setting to isolate their causes from other sources of forgetting. Motivated by how noradrenergic (neuromodulatory) bursts transiently increase neuronal gain under uncertainty, we introduce a dynamic gain scaling mechanism as a two-timescale optimization technique that balances adaptation and retention by modulating effective learning rates and flattening the local landscape through an effective reparameterization. Across domain- and class-incremental MNIST, CIFAR, and mini-ImageNet benchmarks under task-agnostic joint training, dynamic gain scaling effectively attenuates stability gaps while maintaining competitive accuracy, improving robustness at task transitions.
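One plausible reading of the two-timescale gain signal in code, with a fast/slow loss-EMA gap standing in for "uncertainty"; the trigger, constants, and names here are our assumptions rather than the paper's derivation.

```python
class GainScheduler:
    """Sketch: a fast loss EMA rising above a slow EMA is read as uncertainty
    and transiently boosts the effective learning rate (a noradrenergic-style
    burst); the gain decays as the two EMAs reconverge."""

    def __init__(self, base_lr=1e-3, fast=0.1, slow=0.01, boost=3.0):
        self.base_lr, self.fast, self.slow, self.boost = base_lr, fast, slow, boost
        self.ema_fast = self.ema_slow = None

    def step(self, loss: float) -> float:
        if self.ema_fast is None:
            self.ema_fast = self.ema_slow = loss
        self.ema_fast += self.fast * (loss - self.ema_fast)
        self.ema_slow += self.slow * (loss - self.ema_slow)
        surprise = max(0.0, self.ema_fast - self.ema_slow) / (abs(self.ema_slow) + 1e-8)
        return self.base_lr * (1.0 + self.boost * min(surprise, 1.0))

sched = GainScheduler()
for loss in [1.0, 0.5, 0.4, 2.0, 1.8]:  # the spike mimics a task boundary
    lr = sched.step(loss)
```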
[461] Imputation-free Learning of Tabular Data with Missing Values using Incremental Feature Partitions in Transformer
Manar D. Samad, Kazi Fuad B. Akhter, Shourav B. Rabbani, Ibna Kowsar
Main category: cs.LG
TL;DR: Proposes IFIAL, an imputation-free incremental attention learning method for tabular data with missing values that uses attention masks in transformers without imputing missing values.
Details
Motivation: Imputation methods for handling missing values in tabular data raise concerns about data quality and reliability of outcomes. Synthetic values from imputation models may introduce bias or artifacts that affect downstream machine learning results.
Method: IFIAL uses a pair of attention masks retrofitted to a transformer to directly process tabular data without imputing missing values. It incrementally learns partitions of overlapping, fixed-size feature sets to enhance transformer performance.
Result: IFIAL achieved superior average classification performance rank across 17 diverse tabular datasets compared to 11 state-of-the-art methods. It shows robustness to varying types and proportions of missing data, outperforming methods that rely on explicit imputations.
Conclusion: IFIAL enables deep attention models to learn directly from tabular data without imputing missing values, with optimal feature partition size being half the original feature space for best trade-off between computational efficiency and predictive performance.
Abstract: Tabular data sets with varying missing values are prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often raise concerns regarding data quality and the reliability of data-driven outcomes. To address these concerns, this article proposes an imputation-free incremental attention learning (IFIAL) method for tabular data with missing values. A pair of attention masks is derived and retrofitted to a transformer to directly streamline tabular data without imputing or initializing missing values. The proposed method incrementally learns partitions of overlapping and fixed-size feature sets to enhance the performance of the transformer. The average classification performance rank order across 17 diverse tabular data sets highlights the superiority of IFIAL over 11 state-of-the-art learning methods with or without missing value imputations. Additional experiments corroborate the robustness of IFIAL to varying types and proportions of missing data, demonstrating its superiority over methods that rely on explicit imputations. A feature partition size equal to one-half the original feature space yields the best trade-off between computational efficiency and predictive performance. IFIAL is one of the first solutions that enables deep attention models to learn directly from tabular data, eliminating the need to impute missing values. The source code for this paper is publicly available.
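A minimal sketch of the imputation-free masking idea follows: attention logits between observed and missing features are suppressed, so no synthetic values are needed. IFIAL's actual mask pair and incremental feature partitions are more involved; this only shows the mechanism.

```python
import torch

def missingness_attention_mask(x: torch.Tensor) -> torch.Tensor:
    """Additive attention mask from NaN positions, so observed features never
    attend to missing ones. x: (batch, n_features)."""
    observed = ~torch.isnan(x)
    visible = observed.unsqueeze(1) & observed.unsqueeze(2)  # (batch, F, F)
    return torch.zeros(visible.shape).masked_fill(~visible, float("-inf"))

x = torch.tensor([[1.0, float("nan"), 3.0]])
mask = missingness_attention_mask(x)  # add to attention logits before softmax
# Missing entries would also be zero-filled before the value projection,
# so no synthetic (imputed) values ever enter the computation.
```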
[462] SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons
Vipin Singh, Teodor Chiaburu, Einar Eberhardt, Stefan Broda, Joey Prüssing, Frank Haußer, Felix Bießmann
Main category: cs.LG
TL;DR: SoilNet is a multimodal multitask model for soil horizon classification that integrates image data and geotemporal metadata to predict depth markers, segment soil profiles, extract morphological features, and predict hierarchical labels using graph-based representation.
Details
Motivation: Soil horizon classification remains challenging due to its multimodal/multitask nature and complex hierarchical label taxonomy, which hasn't benefited from recent AI foundation model advances despite being crucial for soil condition monitoring.
Method: Structured modular pipeline that: 1) integrates image data and geotemporal metadata to predict depth markers and segment soil profiles into horizon candidates, 2) extracts horizon-specific morphological features from each segment, 3) predicts labels using multimodal concatenated feature vectors with graph-based label representation to handle hierarchical relationships.
Result: Demonstrated effectiveness on real-world soil profile dataset and comprehensive user study with domain experts. SoilNet reliably predicts plausible and accurate soil horizons, achieving predictive performance on par with or better than human experts.
Conclusion: SoilNet provides a transparent, structured approach to complex hierarchical soil horizon classification that outperforms human experts while being inherently interpretable by following human expert task structures.
Abstract: Recent advances in artificial intelligence (AI), in particular foundation models, have improved the state of the art in many application domains including geosciences. Some specific problems, however, could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil condition. In this work, we propose SoilNet, a multimodal multitask model to tackle this problem through a structured modularized pipeline. In contrast to omnipurpose AI foundation models, our approach is designed to be inherently transparent by following the task structure human experts developed for solving this challenging annotation task. The proposed approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset and a comprehensive user study with domain experts. Our empirical evaluations demonstrate that SoilNet reliably predicts soil horizons that are plausible and accurate. User study results indicate that SoilNet achieves predictive performance on par with or better than that of human experts. All code can be found at: https://github.com/calgo-lab/BGR/
[463] Theoretical Investigation on Inductive Bias of Isolation Forest
Qin-Cheng Zheng, Shao-Qun Zhang, Shen-Huan Lyu, Yuan Jiang, Zhi-Hua Zhou
Main category: cs.LG
TL;DR: This paper provides a theoretical analysis of Isolation Forest’s inductive bias, explaining when and why it works well for anomaly detection through random walk modeling.
Details
Motivation: Despite Isolation Forest's widespread use and practical success in anomaly detection, there has been a lack of theoretical understanding about why it works so well. The paper aims to establish a theoretical foundation to explain iForest's effectiveness.
Method: The authors model iForest’s growth process as a random walk, where split dimensions and values are randomly selected. They derive the expected depth function using transition probabilities to analyze iForest’s behavior theoretically.
Result: Case studies reveal key inductive biases: iForest shows lower sensitivity to central anomalies and greater parameter adaptability compared to k-Nearest Neighbor methods. The random walk model successfully explains iForest’s performance characteristics.
Conclusion: The study provides the first theoretical understanding of iForest’s effectiveness, establishes a foundation for further theoretical exploration, and explains the specific inductive biases that make iForest successful for anomaly detection tasks.
Abstract: Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector, primarily owing to its remarkable runtime efficiency and superior performance in large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest’s success remains unclear. This paper focuses on the inductive bias of iForest, which theoretically elucidates under what circumstances and to what extent iForest works well. The key is to formulate the growth process of iForest, where the split dimensions and split values are randomly selected. We model the growth process of iForest as a random walk, enabling us to derive the expected depth function, which is the outcome of iForest, using transition probabilities. The case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to $k$-Nearest Neighbor. Our study provides a theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.
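For context on the quantity the analysis studies: the expected depth feeds into the standard iForest anomaly score from the original formulation (Liu et al., 2008), sketched below.

```python
import math

def c(n: int) -> float:
    """Average path length of an unsuccessful BST search over n points;
    iForest's normalizing constant, using the harmonic-number approximation
    H(i) ~ ln(i) + 0.5772156649 (Euler-Mascheroni)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(expected_depth: float, n: int) -> float:
    """s(x, n) = 2^(-E[h(x)] / c(n)): scores near 1 indicate anomalies,
    scores well below 0.5 indicate normal points."""
    return 2.0 ** (-expected_depth / c(n))

print(anomaly_score(expected_depth=3.2, n=256))   # shallow isolation -> anomalous
print(anomaly_score(expected_depth=12.0, n=256))  # deep isolation -> normal
```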
[464] Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems
Jan Tauberschmidt, Sophie Fellenz, Sebastian J. Vollmer, Andrew B. Duncan
Main category: cs.LG
TL;DR: A framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems through differentiable post-training and joint optimization.
Details
Motivation: To bridge generative modeling and scientific inference by creating physics-aware models that can solve ill-posed inverse problems while maintaining physical consistency, enabling simulation-augmented discovery and data-efficient modeling of physical systems.
Method: Uses differentiable post-training procedure that minimizes weak-form residuals of governing PDEs on pre-trained flow-matching models. For inverse problems, augments generative process with learnable latent parameter predictor and employs joint optimization strategy to produce physically valid field solutions alongside plausible estimates of hidden parameters.
Result: Validated on canonical PDE benchmarks, demonstrating improved satisfaction of PDE constraints and accurate recovery of latent coefficients. The approach successfully addresses ill-posed inverse problems in a data-driven yet physics-aware manner.
Conclusion: The framework bridges generative modeling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modeling of physical systems by producing physically consistent solutions while estimating unknown parameters.
Abstract: We present a framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems. Starting from a model trained on low-fidelity or observational data, we apply a differentiable post-training procedure that minimizes weak-form residuals of governing partial differential equations (PDEs), promoting physical consistency and adherence to boundary conditions without distorting the underlying learned distribution. To infer unknown physical inputs, such as source terms, material parameters, or boundary data, we augment the generative process with a learnable latent parameter predictor and propose a joint optimization strategy. The resulting model produces physically valid field solutions alongside plausible estimates of hidden parameters, effectively addressing ill-posed inverse problems in a data-driven yet physics-aware manner. We validate our method on canonical PDE benchmarks, demonstrating improved satisfaction of PDE constraints and accurate recovery of latent coefficients. Our approach bridges generative modelling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modelling of physical systems.
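The paper penalizes weak-form PDE residuals; the sketch below uses a cruder strong-form finite-difference residual on a 1-D Poisson problem purely to show where such a penalty enters a fine-tuning loss. All names are illustrative.

```python
import torch

def poisson_residual_penalty(u: torch.Tensor, f: torch.Tensor, dx: float) -> torch.Tensor:
    """Strong-form finite-difference residual ||u'' - f||^2 for a 1-D Poisson
    equation on generated fields u of shape (batch, n_grid). The paper minimizes
    weak-form residuals; this simpler penalty only illustrates the structure."""
    u_xx = (u[:, 2:] - 2 * u[:, 1:-1] + u[:, :-2]) / dx**2
    return ((u_xx - f[:, 1:-1]) ** 2).mean()

# Fine-tuning step sketch: total loss = flow-matching loss + lambda * residual.
u = torch.randn(8, 64, requires_grad=True)  # stand-in for model samples
f = torch.zeros(8, 64)
loss = 0.1 * poisson_residual_penalty(u, f, dx=1.0 / 63)
loss.backward()
```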
[465] Risk-Sensitive Agent Compositions
Guruprerana Shabadi, Rajeev Alur
Main category: cs.LG
TL;DR: This paper presents a framework for optimizing agent compositions in workflows by minimizing risk metrics (VaR and CVaR) to ensure safety, fairness, and privacy requirements while maintaining task success.
Details
Motivation: Modern agentic systems decompose complex tasks into subtasks handled by specialized AI agents, but real-world deployment requires not just maximizing task success but also minimizing violations of safety, fairness, and privacy requirements. This demands analyzing low-probability tail behaviors of agent compositions.
Method: The authors formalize agentic workflows as directed acyclic graphs (agent graphs) where edges represent AI agents and paths correspond to feasible agent compositions. They introduce an efficient algorithm that traverses the agent graph using dynamic programming to approximate Value-at-Risk (VaR) of agent compositions by exploiting a union bound. The algorithm also approximates Conditional Value-at-Risk (CVaR) as a byproduct.
Result: The algorithm finds near-optimal agent compositions that minimize risk. The authors prove the approximation is near-optimal asymptotically for a broad class of practical loss functions. Evaluation on video game-like control benchmarks with reinforcement learning agents demonstrates the algorithm’s effectiveness in approximating VaR and identifying optimal agent compositions.
Conclusion: The proposed framework provides an efficient method for risk-aware agent composition selection in complex workflows, balancing task success with safety, fairness, and privacy requirements through formal risk minimization over feasible agent compositions.
Abstract: From software development to robot control, modern agentic systems decompose complex objectives into a sequence of subtasks and choose a set of specialized AI agents to complete them. We formalize agentic workflows as directed acyclic graphs, called agent graphs, where edges represent AI agents and paths correspond to feasible compositions of agents. Real-world deployment requires selecting agent compositions that not only maximize task success but also minimize violations of safety, fairness, and privacy requirements, which demands a careful analysis of the low-probability (tail) behaviors of compositions of agents. In this work, we consider risk minimization over the set of feasible agent compositions and seek to minimize the value-at-risk and the conditional value-at-risk of the loss distribution of the agent composition, where the loss quantifies violations of these requirements. We introduce an efficient algorithm which traverses the agent graph and finds a near-optimal composition of agents. It uses a dynamic programming approach to approximate the value-at-risk of agent compositions by exploiting a union bound. Furthermore, we prove that the approximation is near-optimal asymptotically for a broad class of practical loss functions. We also show how our algorithm can be used to approximate the conditional value-at-risk as a byproduct. To evaluate our framework, we consider a suite of video game-like control benchmarks that require composing several agents trained with reinforcement learning and demonstrate our algorithm’s effectiveness in approximating the value-at-risk and identifying the optimal agent composition.
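The two risk metrics in their plain empirical form are easy to state; the paper's contribution is approximating them over exponentially many path compositions via dynamic programming, which this toy does not attempt.

```python
import numpy as np

def var_cvar(losses: np.ndarray, alpha: float = 0.95):
    """Empirical value-at-risk and conditional value-at-risk at level alpha:
    VaR is the alpha-quantile of the loss; CVaR is the mean loss beyond it."""
    losses = np.sort(losses)
    var = np.quantile(losses, alpha)
    tail = losses[losses >= var]
    return var, tail.mean()

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavy-ish tail
print(var_cvar(samples, alpha=0.95))
```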
[466] Improved Regret Bounds for Linear Bandits with Heavy-Tailed Rewards
Artin Tajdini, Jonathan Scarlett, Kevin Jamieson
Main category: cs.LG
TL;DR: Improved regret bounds for stochastic linear bandits with heavy-tailed rewards, achieving better dependence on dimension d and establishing tighter lower bounds.
Details
Motivation: Prior work on heavy-tailed linear bandits had suboptimal regret bounds, particularly loose dependence on dimension d, and lower bounds didn't properly capture the hardness of linear bandits compared to multi-armed bandits.
Method: Proposed a new elimination-based algorithm guided by experimental design, which achieves improved regret bounds. Also established new lower bounds and extended results to finite action sets, different geometries (l_p-norm balls), and infinite-dimensional settings via kernel trick.
Result: Achieved regret $\tilde{\mathcal{O}}(d^{\frac{1+3\epsilon}{2(1+\epsilon)}} T^{\frac{1}{1+\epsilon}})$, improving dependence on $d$ for all $\epsilon \in (0,1)$. Established lower bound $\Omega(d^{\frac{2\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$ that strictly improves upon the multi-armed bandit rate. Extended results to various settings including the Matérn kernel.
Conclusion: The paper provides improved understanding of heavy-tailed linear bandits with tighter upper and lower bounds, showing the problem is harder than multi-armed bandits, and demonstrates how geometry and kernel methods can further reduce dimension dependence.
Abstract: We study stochastic linear bandits with heavy-tailed rewards, where the rewards have a finite $(1+\epsilon)$-absolute central moment bounded by $\upsilon$ for some $\epsilon \in (0,1]$. We improve both upper and lower bounds on the minimax regret compared to prior work. When $\upsilon = \mathcal{O}(1)$, the best prior known regret upper bound is $\tilde{\mathcal{O}}(d T^{\frac{1}{1+\epsilon}})$. While a lower bound with the same scaling has been given, it relies on a construction using $\upsilon = \mathcal{O}(d)$, and adapting the construction to the bounded-moment regime with $\upsilon = \mathcal{O}(1)$ yields only an $\Omega(d^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$ lower bound. This matches the known rate for multi-armed bandits and is generally loose for linear bandits, in particular being $\sqrt{d}$ below the optimal rate in the finite-variance case ($\epsilon = 1$). We propose a new elimination-based algorithm guided by experimental design, which achieves regret $\tilde{\mathcal{O}}(d^{\frac{1+3\epsilon}{2(1+\epsilon)}} T^{\frac{1}{1+\epsilon}})$, thus improving the dependence on $d$ for all $\epsilon \in (0,1)$ and recovering a known optimal result for $\epsilon = 1$. We also establish a lower bound of $\Omega(d^{\frac{2\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$, which strictly improves upon the multi-armed bandit rate and highlights the hardness of heavy-tailed linear bandit problems. For finite action sets, we derive similarly improved upper and lower bounds for regret. Finally, we provide action-set-dependent regret upper bounds showing that for some geometries, such as $l_p$-norm balls for $p \le 1 + \epsilon$, we can further reduce the dependence on $d$, and we can handle infinite-dimensional settings via the kernel trick, in particular establishing new regret bounds for the Matérn kernel that are the first to be sublinear for all $\epsilon \in (0, 1]$.
[467] KANO: Kolmogorov-Arnold Neural Operator
Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang
Main category: cs.LG
TL;DR: KANO is a dual-domain neural operator combining spectral and spatial bases with symbolic interpretability, overcoming FNO’s limitations on position-dependent dynamics and achieving superior performance in quantum Hamiltonian learning.
Details
Motivation: The paper addresses limitations of Fourier Neural Operator (FNO), which suffers from pure-spectral bottlenecks and requires spectrally sparse operators with fast-decaying Fourier tails. FNO struggles with generic position-dependent dynamics (variable coefficient PDEs), motivating the need for a more expressive and robust neural operator.
Method: Introduces Kolmogorov-Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases. This approach provides intrinsic symbolic interpretability and overcomes the spectral-only limitations of FNO by incorporating spatial domain information.
Result: Theoretical analysis shows KANO remains expressive over generic position-dependent dynamics where FNO fails. Empirical verification on position-dependent differential operators demonstrates KANO’s robust generalization versus FNO’s failure. In quantum Hamiltonian learning, KANO reconstructs ground-truth Hamiltonians with closed-form symbolic representations accurate to fourth decimal place and achieves ≈6×10⁻⁶ state infidelity, substantially outperforming FNO’s ≈1.5×10⁻² even with ideal data.
Conclusion: KANO represents a significant advancement over FNO by overcoming spectral-only limitations through dual-domain parameterization, enabling robust handling of position-dependent dynamics and achieving superior performance in symbolic learning tasks like quantum Hamiltonian reconstruction.
Abstract: We introduce Kolmogorov–Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics (variable coefficient PDEs) for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails to. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity from projective measurement data, substantially outperforming that of the FNO trained with ideal full wave function data, $\approx 1.5\times10^{-2}$, by orders of magnitude.
[468] NIMO: a Nonlinear Interpretable MOdel
Shijian Xu, Marcello Massimo Negri, Volker Roth
Main category: cs.LG
TL;DR: NIMO is a framework that combines neural networks’ expressive power with linear regression’s inherent interpretability, providing flexible and intelligible feature effects through parameter elimination optimization and adaptive ridge regression.
Details
Motivation: There's a growing demand for interpretability in deep learning models. Post-hoc explanations lack guaranteed fidelity and are sensitive to hyperparameters, while inherently interpretable models like linear regression are often outperformed by neural networks. This creates a dilemma between interpretability and performance.
Method: NIMO builds on linear regression to provide flexible feature effects. It uses parameter elimination optimization to effectively optimize both neural network parameters and linear coefficients. Adaptive ridge regression is incorporated to enable sparsity in the model.
Result: Empirical results show that NIMO can provide faithful and intelligible feature effects while maintaining good predictive performance, bridging the gap between interpretability and model performance.
Conclusion: NIMO successfully addresses the interpretability-performance trade-off by combining neural networks’ expressive power with linear regression’s inherent interpretability, offering a practical solution for interpretable deep learning.
Abstract: Deep learning has achieved remarkable success across many domains, but it has also created a growing demand for interpretability in model predictions. Although many explainable machine learning methods have been proposed, post-hoc explanations lack guaranteed fidelity and are sensitive to hyperparameter choices, highlighting the appeal of inherently interpretable models. For example, linear regression provides clear feature effects through its coefficients. However, such models are often outperformed by more complex neural networks (NNs) that usually lack inherent interpretability. To address this dilemma, we introduce NIMO, a framework that combines inherent interpretability with the expressive power of neural networks. Building on the simple linear regression, NIMO is able to provide flexible and intelligible feature effects. Relevantly, we develop an optimization method based on parameter elimination, that allows for optimizing the NN parameters and linear coefficients effectively and efficiently. By relying on adaptive ridge regression we can easily incorporate sparsity as well. We show empirically that our model can provide faithful and intelligible feature effects while maintaining good predictive performance.
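A hedged sketch of the modeling idea: a small network modulates per-feature linear coefficients, so each prediction still decomposes into intelligible feature effects. The paper's parameter-elimination optimization and adaptive ridge penalty are not reproduced here; the architecture below is our illustrative guess at the general shape.

```python
import torch
import torch.nn as nn

class NIMOSketch(nn.Module):
    """Linear coefficients beta, modulated per input by a small network, so the
    prediction decomposes into per-feature effects beta_i * (1 + m_i(x)) * x_i."""

    def __init__(self, d: int, hidden: int = 32):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(d))
        self.bias = nn.Parameter(torch.zeros(1))
        self.modulator = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, d))

    def forward(self, x):
        effects = (self.beta * (1.0 + self.modulator(x))) * x  # per-feature effects
        return effects.sum(dim=-1) + self.bias, effects

model = NIMOSketch(d=5)
y_hat, effects = model(torch.randn(3, 5))  # `effects` is the interpretable part
```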
[469] Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs
Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li
Main category: cs.LG
TL;DR: RL fine-tuning enhances LLMs by increasing activation intensity and diversity, making information flow more redundant and flexible, explaining improved mathematical generalization.
Details
Motivation: To understand why RL fine-tuning improves LLM capabilities beyond SFT alone, and to investigate the internal mechanisms behind these improvements across different model families.
Method: Used edge attribution patching (EAP) to analyze internal differences before/after RL fine-tuning across multiple model families and mathematical datasets, comparing PPO, GRPO, and DPO approaches.
Result: Found two robust effects: (1) increased average activation intensity (more engaged pathways), and (2) greater diversity in activation patterns (higher entropy, less concentrated distributions). DPO showed weaker/inconsistent changes compared to PPO/GRPO.
Conclusion: RL fine-tuning systematically reshapes LLM internal circuitry to be more redundant and flexible, explaining its advantage in mathematical generalization, with online RL (PPO/GRPO) showing stronger effects than preference-based DPO.
Abstract: Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms why RL fine-tuning is able to enhance the capability of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families and mathematical datasets shows two robust effects of online RL post-training: (i) an overall increase in average activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in mathematical generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://github.com/tsinghua-fib-lab/llm_rl_probing_analysis.
[470] GraphRAG-R1: Graph Retrieval-Augmented Generation with Process-Constrained Reinforcement Learning
Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li, Yuzhi Zhang, Jianxin Li, Ziwei Zhang
Main category: cs.LG
TL;DR: GraphRAG-R1: Adaptive GraphRAG framework using process-constrained RL to enhance multi-hop reasoning in LLMs, addressing retrieval and over-thinking problems with specialized rewards and hybrid retrieval.
Details
Motivation: Existing GraphRAG methods struggle with complex multi-hop reasoning problems because they rely on pre-defined heuristics for query/retrieval and don't fully leverage LLMs' reasoning potential, leading to shallow retrieval and over-thinking issues.
Method: 1) Modified GRPO with rollout-with-thinking capability; 2) Two process-constrained rewards: Progressive Retrieval Attenuation (PRA) to encourage essential retrievals and Cost-Aware F1 (CAF) to balance performance with computational costs; 3) Three-stage phase-dependent training strategy; 4) Hybrid graph-textual retrieval approach.
Result: GraphRAG-R1 significantly boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets, and can be flexibly integrated with various existing retrieval methods.
Conclusion: The proposed adaptive GraphRAG framework with process-constrained RL effectively enhances multi-hop reasoning in LLMs, addressing key limitations of existing methods while maintaining flexibility for integration with different retrieval approaches.
Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has shown great effectiveness in enhancing the reasoning abilities of LLMs by leveraging graph structures for knowledge representation and modeling complex real-world relationships. However, existing GraphRAG methods still face significant bottlenecks when handling complex problems that require multi-hop reasoning, as their query and retrieval phases are largely based on pre-defined heuristics and do not fully utilize the reasoning potentials of LLMs. To address this problem, we propose GraphRAG-R1, an adaptive GraphRAG framework by training LLMs with process-constrained outcome-based reinforcement learning (RL) to enhance the multi-hop reasoning ability. Our method can decompose complex problems, autonomously invoke retrieval tools to acquire necessary information, and perform effective reasoning. Specifically, we utilize a modified version of Group Relative Policy Optimization (GRPO) that supports rollout-with-thinking capability. Next, we design two process-constrained reward functions. To handle the shallow retrieval problem, we design a Progressive Retrieval Attenuation (PRA) reward to encourage essential retrievals. Then, to handle the over-thinking problem, we design Cost-Aware F1 (CAF) reward to balance the model performance with computational costs. We further design a phase-dependent training strategy, containing three training stages corresponding to cold start and these two rewards. Lastly, our method adopts a hybrid graph-textual retrieval to improve the reasoning capacity. Extensive experimental results demonstrate that GraphRAG-R1 boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets. Furthermore, our framework can be flexibly integrated with various existing retrieval methods, consistently delivering performance improvements.
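The cost-aware reward idea can be illustrated with a toy: answer F1 minus a penalty on retrieval calls. The exact form of the paper's CAF reward may differ; the linear penalty and the lambda weight below are assumptions.

```python
def cost_aware_f1(pred: set, gold: set, n_retrievals: int, lam: float = 0.05) -> float:
    """Hedged sketch of a cost-aware reward: answer-token F1 minus a linear
    penalty on retrieval calls, discouraging over-thinking/over-retrieval."""
    tp = len(pred & gold)
    if tp == 0:
        f1 = 0.0
    else:
        precision, recall = tp / len(pred), tp / len(gold)
        f1 = 2 * precision * recall / (precision + recall)
    return f1 - lam * n_retrievals

print(cost_aware_f1({"paris"}, {"paris"}, n_retrievals=3))  # 1.0 - 0.15
```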
[471] Prior Distribution and Model Confidence
Maksim Kazanskii, Artem Kasianov
Main category: cs.LG
TL;DR: Embedding Density framework estimates prediction confidence by measuring test sample distance from training distribution in embedding space, improving classification accuracy by filtering low-confidence predictions.
Details
Motivation: To understand how training data distribution affects model confidence and performance, and to develop a model-agnostic method for estimating prediction confidence without retraining.
Method: Introduces Embedding Density framework that measures distance of test samples from training distribution in embedding space. Uses this density estimation to filter low-confidence predictions.
Result: Significantly improves classification accuracy by filtering low-density predictions. Evaluated across multiple architectures and compared favorably with state-of-the-art OOD detection methods.
Conclusion: Embedding Density provides effective confidence estimation without retraining, potentially generalizable beyond computer vision applications.
Abstract: We study how the training data distribution affects confidence and performance in image classification models. We introduce Embedding Density, a model-agnostic framework that estimates prediction confidence by measuring the distance of test samples from the training distribution in embedding space, without requiring retraining. By filtering low-density (low-confidence) predictions, our method significantly improves classification accuracy. We evaluate Embedding Density across multiple architectures and compare it with state-of-the-art out-of-distribution (OOD) detection methods. The proposed approach is potentially generalizable beyond computer vision.
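One simple instantiation of "distance from the training distribution in embedding space" is a k-nearest-neighbor score, sketched below; the paper's estimator and thresholding may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density_scores(train_emb, test_emb, k=10):
    """Score each test embedding by the (negative) mean distance to its k
    nearest training embeddings; higher = closer to the training distribution."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
    dists, _ = nn.kneighbors(test_emb)
    return -dists.mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64))   # stand-ins for penultimate-layer embeddings
test = rng.normal(size=(100, 64))
scores = knn_density_scores(train, test)
keep = scores >= np.quantile(scores, 0.2)  # drop the lowest-density 20% of predictions
```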
[472] Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li, Tobias Schlagenhauf
Main category: cs.LG
TL;DR: This paper systematically studies linear models for time series forecasting, focusing on characteristic roots’ role in temporal dynamics, revealing noise-induced spurious roots, and proposing two robust root restructuring methods that achieve state-of-the-art results.
Details
Motivation: Despite complex models dominating time series forecasting, simple linear models show surprising competitiveness. Their robustness and interpretability warrant deeper theoretical investigation, especially regarding how characteristic roots govern temporal dynamics and how design choices affect model capabilities.
Method: The study analyzes linear models in both noise-free and noisy regimes, revealing that characteristic roots govern long-term behavior and that noise leads to spurious roots. Two complementary strategies are proposed: 1) rank reduction techniques (Reduced-Rank Regression and Direct Weight Rank Reduction) to recover low-dimensional latent dynamics, and 2) Root Purge, a novel adaptive method that encourages learning a noise-suppressing null space during training.
Result: Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating theoretical insights and achieving state-of-the-art results in several settings. The analysis reveals a key data-scaling property: mitigating noise influence requires disproportionately large training data.
Conclusion: The findings underscore the potential of integrating classical linear systems theory with modern learning techniques to build robust, interpretable, and data-efficient forecasting models, highlighting the importance of structural regularization for handling noise-induced spurious roots.
Abstract: Time series forecasting remains a critical challenge across numerous domains, yet the effectiveness of complex models often varies unpredictably across datasets. Recent studies highlight the surprising competitiveness of simple linear models, suggesting that their robustness and interpretability warrant deeper theoretical investigation. This paper presents a systematic study of linear models for time series forecasting, with a focus on the role of characteristic roots in temporal dynamics. We begin by analyzing the noise-free setting, where we show that characteristic roots govern long-term behavior and explain how design choices such as instance normalization and channel independence affect model capabilities. We then extend our analysis to the noisy regime, revealing that models tend to produce spurious roots. This leads to the identification of a key data-scaling property: mitigating the influence of noise requires disproportionately large training data, highlighting the need for structural regularization. To address these challenges, we propose two complementary strategies for robust root restructuring. The first uses rank reduction techniques, including Reduced-Rank Regression and Direct Weight Rank Reduction, to recover the low-dimensional latent dynamics. The second, a novel adaptive method called Root Purge, encourages the model to learn a noise-suppressing null space during training. Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings. Our findings underscore the potential of integrating classical theories for linear systems with modern learning techniques to build robust, interpretable, and data-efficient forecasting models.
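Characteristic roots of a fitted linear forecaster are directly computable: for weights $w_1, \dots, w_p$ of an AR-style model $y_t = \sum_i w_i y_{t-i}$, they are the roots of $z^p - w_1 z^{p-1} - \dots - w_p$. A short sketch (the AR framing is a simplification of the paper's general linear-model setting):

```python
import numpy as np

def characteristic_roots(weights: np.ndarray) -> np.ndarray:
    """Roots of z^p - w_1 z^{p-1} - ... - w_p for y_t = sum_i w_i * y_{t-i}.
    Roots with |z| near or above 1 dominate long-horizon behavior; extra
    roots fit to noise are the 'spurious' ones the paper regularizes away."""
    poly = np.concatenate(([1.0], -np.asarray(weights, dtype=float)))
    return np.roots(poly)

w = np.array([1.2, -0.3])  # toy model; both roots lie inside the unit circle
roots = characteristic_roots(w)
print(roots, np.abs(roots))
```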
[473] LoaQ: Layer-wise Output Approximation Quantization
Li Lin, Xiaojun Wan
Main category: cs.LG
TL;DR: LoaQ is a layer-wise post-training quantization method for LLMs that incorporates output-matching factors when quantizing linear layers, better aligning with the intuition of approximating each component’s quantized output to match its original.
Details
Motivation: Current layer-wise PTQ methods focus on weight approximation at the linear-layer level, which yields insufficient approximations and practical deviations from the guiding intuition of matching original outputs. Recent improvements still fail to achieve alignment with full-model output.
Method: LoaQ incorporates output-matching factors when quantizing linear layers within the layer-wise PTQ framework. It features a simple closed-form solution and is orthogonal to existing techniques, making it easily integrable into existing quantization pipelines.
Result: Experiments on LLaMA and Qwen model families show LoaQ performs effectively in both weight-only and weight-activation quantization. It enhances overall quantization quality when integrated with existing strategies.
Conclusion: LoaQ better aligns with the intuitive goal of output matching in quantization, shows strong potential to advance PTQ frontiers, and can be seamlessly integrated with existing quantization techniques.
Abstract: A natural and intuitive idea in model quantization is to approximate each component’s quantized output to match its original. Motivated by this idea, most layer-wise post-training quantization (PTQ) methods focus on weight approximation at the linear-layer level. As a result, this local objective often yields insufficient approximations and practical deviations from the guiding intuition. Recent work has improved the approximation of linear-layer outputs within the layer-wise PTQ framework, but such refinements remain inadequate for achieving alignment with the full-model output. Based on a deeper understanding of the structure of mainstream LLMs, we propose LoaQ, which incorporates output-matching factors when quantizing linear layers within the layer-wise PTQ framework. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.
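The distinction between weight matching and output matching is simple to state on calibration activations $X$: compare $\|W - W_q\|_F$ with $\|(W - W_q)X\|_F$. The sketch below illustrates that contrast with a crude stand-in quantizer; it is not LoaQ's closed-form solution.

```python
import torch

def output_matching_error(W, W_q, X):
    """Layer-wise PTQ objectives compared on calibration activations X:
    weight matching ||W - W_q||_F vs. output matching ||(W - W_q) X||_F,
    the latter being the quantity output-matching methods target."""
    weight_err = torch.linalg.matrix_norm(W - W_q)
    output_err = torch.linalg.matrix_norm((W - W_q) @ X)
    return weight_err, output_err

torch.manual_seed(0)
W = torch.randn(64, 128)
W_q = (W * 8).round() / 8     # crude uniform quantization stand-in
X = torch.randn(128, 256)     # calibration activations
print(output_matching_error(W, W_q, X))
```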
[474] Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting
Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot
Main category: cs.LG
TL;DR: Super-Linear is a lightweight mixture-of-experts model for time series forecasting that uses frequency-specialized linear experts and spectral gating for efficient, accurate predictions across diverse datasets.
Details
Motivation: Existing large pre-trained models for time series forecasting (like Chronos and Time-MoE) show strong zero-shot performance but suffer from high computational costs, creating a need for more efficient yet accurate forecasting models.
Method: Super-Linear replaces deep architectures with simple frequency-specialized linear experts trained on resampled data across multiple frequency regimes, using a lightweight spectral gating mechanism to dynamically select relevant experts.
Result: Super-Linear demonstrates strong performance across benchmarks while substantially improving efficiency, robustness to sampling rates, and interpretability compared to existing models.
Conclusion: Super-Linear offers an effective lightweight alternative to computationally expensive pre-trained models for time series forecasting, balancing accuracy, efficiency, and interpretability.
Abstract: Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear demonstrates strong performance across benchmarks, while substantially improving efficiency, robustness to sampling rates, and interpretability. The implementation of Super-Linear is available at: https://github.com/azencot-group/SuperLinear
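A hedged sketch of the flavor of frequency-specialized routing: linear experts mixed by a gate that reads the input's amplitude spectrum. Layer sizes, the softmax gate, and the expert form are illustrative assumptions, not Super-Linear's implementation.

```python
import torch
import torch.nn as nn

class SpectralGatedLinearExperts(nn.Module):
    """Several linear forecasting experts mixed by a gate over the input's
    amplitude spectrum, so different frequency regimes route to different experts."""

    def __init__(self, lookback: int, horizon: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(lookback, horizon) for _ in range(n_experts))
        self.gate = nn.Linear(lookback // 2 + 1, n_experts)  # rfft bin count

    def forward(self, x):                                    # x: (batch, lookback)
        spectrum = torch.fft.rfft(x, dim=-1).abs()
        weights = torch.softmax(self.gate(spectrum), dim=-1)        # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)     # (batch, E, horizon)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)

model = SpectralGatedLinearExperts(lookback=96, horizon=24)
y = model(torch.randn(8, 96))
```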
[475] Optimal Scaling Needs Optimal Norm
Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim
Main category: cs.LG
TL;DR: The paper discovers “norm transfer” - that optimal learning rate/batch size pairs for Adam and Scion optimizers maintain constant operator norm of the output layer across model and dataset scaling, providing a unifying principle for hyperparameter transfer.
Details
Motivation: Despite progress in hyperparameter transfer under scaling, there's no unifying explanatory principle. The paper aims to discover such a principle by investigating how optimal hyperparameters scale across model and dataset sizes.
Method: Analyzed Adam and Scion optimizers across models up to 1.3B parameters trained on up to 138B tokens. Measured optimal learning rate/batch size pairs and their relationship to the operator norm of the output layer. Also tuned per-layer-group learning rates and studied scaling rules with dataset size.
Result: Discovered “norm transfer” - optimal (η*, B*) pairs maintain constant operator norm of output layer across scaling. This norm condition is necessary but not sufficient. Found consistent scaling rules between Adam and Scion. Output layer is most sensitive to learning rates, hidden layers benefit from lower rates.
Conclusion: The operator norm of the output layer serves as a unifying invariant for optimal hyperparameter transfer across model and dataset scaling. Provides practical norm-guided scaling insights and releases Distributed Scion (Disco) implementation with extensive training logs for further research.
Abstract: Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(η^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(η, B)$ reach the optimal norm, only a unique $(η^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(η^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
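The invariant itself is easy to monitor during a hyperparameter sweep. A minimal sketch of the implied diagnostic, assuming a PyTorch model whose output layer weight is a 2-D tensor: record the spectral (operator) norm after each run and check whether the best-loss (learning rate, batch size) pairs share the same value across scales.

```python
import torch

def output_operator_norm(weight: torch.Tensor) -> float:
    # largest singular value = operator (spectral) norm of the output layer
    return torch.linalg.matrix_norm(weight, ord=2).item()

# Sweep sketch: log (lr, batch, final_loss, output_operator_norm(W_out)) per
# run, then verify the loss-minimizing pairs hit a common norm value.
```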
[476] Explaining Grokking and Information Bottleneck through Neural Collapse Emergence
Keitaro Sakamoto, Issei Sato
Main category: cs.LG
TL;DR: The paper provides a unified explanation for late-phase training phenomena like grokking and information bottleneck through neural collapse dynamics, showing that contraction of within-class variance is the key factor connecting these behaviors.
Details
Motivation: Deep neural networks exhibit puzzling late-phase training behaviors like grokking (sudden test improvement after training loss plateaus) and information bottleneck (progressive discarding of irrelevant input information), but the underlying mechanisms and their relationships remain poorly understood.Method: The authors analyze these phenomena through the lens of neural collapse, which characterizes learned representation geometry. They show that contraction of population within-class variance connects grokking and information bottleneck, relating this to neural collapse measures on training data. They analyze neural collapse dynamics to explain distinct time scales between training set fitting and neural collapse progression.
Result: The theoretical framework explains that the distinct time scales between fitting the training set and neural collapse progression account for late-phase phenomena behavior. The findings are validated on multiple datasets and architectures.
Conclusion: Neural collapse provides a unified explanation for late-phase training phenomena, with contraction of within-class variance as the key factor connecting grokking and information bottleneck, offering insights into deep learning training dynamics.
Abstract: The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.
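The key quantity, within-class variance contraction, is straightforward to track. A minimal sketch, assuming features and integer labels as NumPy arrays; monitoring this across epochs (on held-out data for the population version) is what links grokking to the collapse dynamics described above.

```python
import numpy as np

def within_class_variance(feats, labels):
    """Average squared distance of each feature to its class mean; its
    contraction over training is the neural-collapse signal discussed above."""
    total, n = 0.0, 0
    for c in np.unique(labels):
        f = feats[labels == c]
        total += ((f - f.mean(axis=0)) ** 2).sum()
        n += len(f)
    return total / n
```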
[477] Temporal Lifting as Latent-Space Regularization for Continuous-Time Flow Models in AI Systems
Jeffrey Camlin
Main category: cs.LG
TL;DR: Latent-space adaptive temporal lifting method that regularizes near-singular flow behavior while preserving conservation laws, enabling globally smooth trajectories for stiff/turbulent systems.
Details
Motivation: To address challenges in modeling continuous-time dynamical systems with near-singular behavior (like turbulent Navier-Stokes equations) and stabilize machine-learning dynamics approaches like physics-informed neural networks.Method: Introduces a smooth monotone mapping t→τ(t) that performs temporal lifting, regularizing near-singular flow behavior while preserving conservation laws. This acts as a continuous-time normalization operator in latent space.
Result: Trajectories of systems like incompressible Navier-Stokes equations on 𝕋³ become globally smooth in the lifted coordinates. The method stabilizes physics-informed neural networks and latent-flow architectures.
Conclusion: Temporal lifting bridges analytic regularity theory with representation-learning methods, providing a framework for handling stiff or turbulent processes in AI systems for dynamical systems.
Abstract: We present a latent-space formulation of adaptive temporal lifting for continuous-time dynamical systems. The method introduces a smooth monotone mapping $t \mapsto τ(t)$ that regularizes near-singular behavior of the underlying flow while preserving its conservation laws. In the lifted coordinate, trajectories such as those of the incompressible Navier-Stokes equations on the torus $\mathbb{T}^3$ become globally smooth. From the standpoint of machine-learning dynamics, temporal lifting acts as a continuous-time normalization operator that can stabilize physics-informed neural networks and other latent-flow architectures used in AI systems. The framework links analytic regularity theory with representation-learning methods for stiff or turbulent processes.
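As a toy illustration of the mapping $t \mapsto τ(t)$: a minimal sketch assuming a user-supplied stiffness estimate $s(t) \ge 0$, so the lifted clock slows wherever the flow is near-singular. The functional form below is an assumption for intuition, not the paper's construction.

```python
import numpy as np

def temporal_lift(t_grid, stiffness):
    # monotone lifting: dtau/dt = 1 / (1 + s(t)), so tau advances slowly where
    # stiffness s(t) is large and tracks t where the flow is mild
    dt = np.diff(t_grid, prepend=t_grid[0])
    return np.cumsum(dt / (1.0 + stiffness(t_grid)))

# toy stiffness spike near t = 0.5
tau = temporal_lift(np.linspace(0.0, 1.0, 101),
                    lambda t: 50.0 * (np.abs(t - 0.5) < 0.05))
```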
[478] Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models
Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan
Main category: cs.LG
TL;DR: Proposes a rank-1 EWC variant for continual learning in diffusion models, leveraging gradient collinearity in low SNR regimes to better capture curvature while being computationally cheap.
Details
Motivation: Address limitations of existing continual learning approaches: replay requires strong generators and suffers from distributional drift, while EWC assumes shared optimum across tasks and uses limited diagonal Fisher approximation.Method: 1) Theoretical/empirical analysis showing per-sample gradients become strongly collinear in low SNR regimes of diffusion models, yielding rank-1 empirical Fisher. 2) Propose rank-1 EWC variant that captures dominant curvature direction at diagonal approximation cost. 3) Pair with replay-based approach to encourage parameter sharing while mitigating drift.
Result: Consistent improvement in average FID and reduced forgetting on class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k). Forgetting nearly eliminated on MNIST/FashionMNIST, more than halved on ImageNet-1k.
Conclusion: Diffusion models admit approximately rank-1 Fisher structure. With better Fisher estimate, EWC becomes strong complement to replay: replay encourages parameter sharing, EWC effectively constrains replay-induced drift.
Abstract: Catastrophic forgetting remains a central obstacle for continual learning in neural models. Popular approaches – replay and elastic weight consolidation (EWC) – have limitations: replay requires a strong generator and is prone to distributional drift, while EWC implicitly assumes a shared optimum across tasks and typically uses a diagonal Fisher approximation. In this work, we study the gradient geometry of diffusion models, which can already produce high-quality replay data. We provide theoretical and empirical evidence that, in the low signal-to-noise ratio (SNR) regime, per-sample gradients become strongly collinear, yielding an empirical Fisher that is effectively rank-1 and aligned with the mean gradient. Leveraging this structure, we propose a rank-1 variant of EWC that is as cheap as the diagonal approximation yet captures the dominant curvature direction. We pair this penalty with a replay-based approach to encourage parameter sharing across tasks while mitigating drift. On class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k), our method consistently improves average FID and reduces forgetting relative to replay-only and diagonal-EWC baselines. In particular, forgetting is nearly eliminated on MNIST and FashionMNIST and is more than halved on ImageNet-1k. These results suggest that diffusion models admit an approximately rank-1 Fisher. With a better Fisher estimate, EWC becomes a strong complement to replay: replay encourages parameter sharing across tasks, while EWC effectively constrains replay-induced drift.
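The rank-1 penalty is cheap to state. A minimal sketch, assuming flattened parameters and a stored unit mean-gradient direction `gbar` from the previous task (so the rank-1 Fisher is $F \approx \bar g \bar g^\top$); this illustrates the idea, not the authors' code.

```python
import torch

def rank1_ewc_penalty(theta, theta_star, gbar, lam):
    """EWC with rank-1 Fisher F ≈ gbar gbar^T:
    penalty = lam * ((theta - theta_star) · gbar)^2, same cost as diagonal EWC."""
    return lam * ((theta - theta_star) @ gbar) ** 2

theta = torch.randn(1000, requires_grad=True)
theta_star = torch.randn(1000)                 # anchor from the previous task
gbar = torch.nn.functional.normalize(torch.randn(1000), dim=0)
loss = rank1_ewc_penalty(theta, theta_star, gbar, lam=10.0)
loss.backward()
```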
[479] Hermes: A Multi-Scale Spatial-Temporal Hypergraph Network for Stock Time Series Forecasting
Xiangfei Qiu, Liu Yang, Xiangyu Xu, Hanyin Cheng, Xingjian Wu, Rongjia Wu, Zhigang Zhang, Ding Tu, Chenjuan Guo, Bin Yang, Christian S. Jensen, Jilin Hu
Main category: cs.LG
TL;DR: Hermes framework improves stock forecasting by better capturing industry correlations through hypergraph-based moving aggregation for lead-lag relationships and multi-scale fusion modules.
Details
Motivation: Stock time series exhibit industry correlations that can improve forecasting accuracy, but existing hypergraph methods capture these correlations superficially. They fail to fully consider inter-industry lead-lag interactions and don't model multi-scale information within and among industries.Method: Hermes framework integrates moving aggregation and multi-scale fusion modules in a hypergraph network. It uses hyperedge-based moving aggregation with sliding windows and dynamic temporal aggregation to capture lead-lag relationships. It employs cross-scale, edge-to-edge message passing to integrate multi-scale information while maintaining scale consistency.
Result: Experimental results on multiple real-world stock datasets show that Hermes outperforms existing state-of-the-art methods.
Conclusion: The Hermes framework effectively addresses limitations in capturing industry correlations for stock time series forecasting by modeling lead-lag relationships and multi-scale information, leading to improved forecasting accuracy.
Abstract: Time series forecasting occurs in a range of financial applications providing essential decision-making support to investors, regulatory institutions, and analysts. Unlike multivariate time series from other domains, stock time series exhibit industry correlation. Exploiting this kind of correlation can improve forecasting accuracy. However, existing methods based on hypergraphs can only capture industry correlation relatively superficially. These methods face two key limitations: they do not fully consider inter-industry lead-lag interactions, and they do not model multi-scale information within and among industries. This study proposes the Hermes framework for stock time series forecasting that aims to improve the exploitation of industry correlation by addressing these limitations. The framework integrates moving aggregation and multi-scale fusion modules in a hypergraph network. Specifically, to more flexibly capture the lead-lag relationships among industries, Hermes proposes a hyperedge-based moving aggregation module. This module incorporates a sliding window and utilizes dynamic temporal aggregation operations to consider lead-lag dependencies among industries. Additionally, to effectively model multi-scale information, Hermes employs cross-scale, edge-to-edge message passing to integrate information from different scales while maintaining the consistency of each scale. Experimental results on multiple real-world stock datasets show that Hermes outperforms existing state-of-the-art methods.
[480] Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning
Ling Zhang, Xianliang Yang, Juwon Yu, Park Cheonyoung, Miran Lee, Lei Song, Jiang Bian
Main category: cs.LG
TL;DR: ICA framework uses in-context approximation to estimate training example value without retraining, enabling efficient data selection and dynamic reweighting for better alignment.
Details
Motivation: Noisy or off-target examples in fine-tuning dilute supervision, and current methods for identifying high-value training data rely on heuristics or expensive retraining.Method: In-Context Approximation (ICA) estimates holdout loss after training on a candidate example by conditioning on a small curated holdout set in context, requiring no reference model or additional finetuning.
Result: ICA-based reweighting consistently improves model alignment across SFT, DPO, and SimPO with diverse backbones and datasets, with minimal overhead.
Conclusion: ICA provides a principled, resource-efficient framework for data selection and reweighting, though limitations exist in rapidly drifting on-policy settings.
Abstract: Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a principled, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning. We define the resulting estimate as the ICA score, and derive per-example weights that dynamically reweight gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the number of in-context holdout examples. We also discuss limitations in rapidly drifting on-policy settings, highlighting directions for future work. Code and prompts will be released.
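The scoring step can be sketched abstractly. Below, `cond_loss(context, example)` is an assumed callable returning the model's loss on `example` given `context`; scoring a candidate by the holdout-loss reduction it buys in context is one natural reading of the ICA score, and the softmax reweighting is likewise an assumption.

```python
import math

def ica_score(cond_loss, candidate, holdout):
    # holdout loss with vs. without the candidate in context (no finetuning,
    # no reference model)
    base = sum(cond_loss("", ex) for ex in holdout) / len(holdout)
    cond = sum(cond_loss(candidate, ex) for ex in holdout) / len(holdout)
    return base - cond              # positive => candidate looks valuable

def reweight(scores, temp=1.0):
    # turn ICA scores into per-example gradient weights
    zs = [math.exp(s / temp) for s in scores]
    return [z / sum(zs) for z in zs]
```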
[481] Monotone and Separable Set Functions: Characterizations and Neural Models
Soutrik Sarangi, Yonatan Sverdlov, Nadav Dym, Abir De
Main category: cs.LG
TL;DR: The paper introduces Monotone and Separating (MAS) set functions that preserve set containment relationships through vector embeddings, establishing theoretical bounds and practical applications.
Details
Motivation: Motivated by set containment problems, the authors aim to design set-to-vector functions that preserve the natural partial order of sets, enabling efficient set containment operations through vector comparisons.Method: The authors define MAS functions that satisfy S⊆T iff F(S)≤F(T), establish theoretical bounds on vector dimensions, propose a “weakly MAS” model for infinite ground sets with Holder continuity, and construct universal monotone models.
Result: Theoretical results show MAS functions don’t exist for infinite ground sets, but a relaxed “weakly MAS” model is provably stable. Experiments demonstrate improved performance on set containment tasks compared to standard set models without containment inductive bias.
Conclusion: MAS functions provide a principled approach to embedding sets while preserving containment relationships, with theoretical guarantees and practical benefits for set containment applications.
Abstract: Motivated by applications for set containment problems, we consider the following fundamental problem: can we design set-to-vector functions so that the natural partial order on sets is preserved, namely $S\subseteq T \text{ if and only if } F(S)\leq F(T)$. We call functions satisfying this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds for the vector dimension necessary to obtain MAS functions, as a function of the cardinality of the multisets and the underlying ground set. In the important case of an infinite ground set, we show that MAS functions do not exist, but provide a model which provably enjoys a relaxed MAS property we name “weakly MAS” and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to construct universal models that are monotone by construction and can approximate all monotone set functions. Experimentally, we consider a variety of set containment tasks. The experiments show the benefit of using our model, in comparison with standard set models which do not incorporate set containment as an inductive bias. Our code is available at https://github.com/yonatansverdlov/Monotone-Embedding.
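For intuition on the MAS property itself: over a finite ground set, the indicator embedding is trivially monotone and separating, a useful mental model even though the paper's contribution concerns dimension bounds and the harder infinite-ground-set case. A minimal sketch:

```python
import numpy as np

def indicator_embed(S, ground):
    # F(S) = indicator vector; then S ⊆ T iff F(S) <= F(T) coordinate-wise
    return np.array([float(g in S) for g in ground])

ground = list("abcde")
FS = indicator_embed({"a", "b"}, ground)
FT = indicator_embed({"a", "b", "d"}, ground)
assert np.all(FS <= FT)   # containment is readable off the embeddings
```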
[482] FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
Junkang Liu, Fanhua Shang, Hongying Liu, Yuxuan Tian, Yuanyuan Liu, Jin Liu, Kewen Zhu, Zhouchen Lin
Main category: cs.LG
TL;DR: FedAdamW is a federated version of AdamW optimizer that addresses challenges of data heterogeneity, local overfitting, and slow convergence in federated learning by using local correction mechanisms, decoupled weight decay, and efficient aggregation of second-moment estimates.
Details
Motivation: AdamW is effective for large-scale models but faces challenges in federated learning: high variance in second-moment estimates due to data heterogeneity, local overfitting causing client drift, and slow convergence from reinitializing moment estimates each round.Method: FedAdamW aligns local updates with global updates using local correction mechanism and decoupled weight decay. It efficiently aggregates the mean of second-moment estimates to reduce variance and reinitialize them properly.
Result: Theoretically proves linear speedup convergence rate without heterogeneity assumption. Empirically validates effectiveness on language and vision Transformer models, significantly reducing communication rounds and improving test accuracy compared to baselines.
Conclusion: FedAdamW successfully addresses AdamW’s challenges in federated learning, providing both theoretical guarantees and empirical improvements for training large models in federated settings.
Abstract: AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L Δσ_l^2)/(S K R ε^2)}+(L Δ)/R)$ without the \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available at https://github.com/junkangLiu0/FedAdamW.
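A minimal sketch of the two mechanisms the summary highlights, decoupled weight decay in the local step and server-side averaging of the second-moment estimates. The local correction mechanism and exact aggregation schedule are omitted, and the update below is standard AdamW rather than the authors' exact algorithm.

```python
import numpy as np

def local_adamw_step(theta, grad, m, v, lr=1e-3, b1=0.9, b2=0.999,
                     wd=0.01, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    theta = theta - lr * (m / (np.sqrt(v) + eps) + wd * theta)  # decoupled decay
    return theta, m, v

def server_round(client_thetas, client_vs):
    # average models and the mean of second moments; broadcasting v_mean lets
    # clients reinitialize v sensibly instead of starting from zero each round
    return np.mean(client_thetas, axis=0), np.mean(client_vs, axis=0)
```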
[483] Causal Graph Neural Networks for Healthcare
Munib Mesinovic, Max Buhlan, Tingting Zhu
Main category: cs.LG
TL;DR: This review paper examines how causal graph neural networks address healthcare AI failures by learning invariant causal mechanisms rather than spurious correlations, with applications in psychiatry, oncology, monitoring, and drug recommendation, while identifying barriers like computational costs and causal-washing risks.
Details
Motivation: Healthcare AI systems fail when deployed across institutions due to learning statistical associations rather than causal mechanisms, leading to performance drops and perpetuation of discriminatory patterns. This creates a triple crisis of distribution shift, discrimination, and inscrutability that needs to be addressed.Method: The paper reviews methodological foundations including structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. It combines graph-based representations of biomedical data with causal inference principles.
Result: Applications demonstrate clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins.
Conclusion: Substantial barriers remain including computational requirements, validation challenges, and risks of causal-washing. The paper proposes tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identifies critical research priorities for making causal rather than purely associational claims.
Abstract: Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.
[484] Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
Mang Li, Wei Lyu
Main category: cs.LG
TL;DR: The paper provides a theoretical explanation for the one-epoch overfitting problem in CTR/CVR models with sparse categorical features, proposes an adaptive regularization method for embedding layers, and demonstrates successful deployment in production systems.
Details
Motivation: The one-epoch overfitting problem is widespread in CTR and CVR estimation models in search, advertising, and recommendation domains. Models relying on large-scale sparse categorical features suffer significant performance decline when trained for multiple epochs, but the fundamental cause remains unclear despite heuristic solutions.Method: The authors present a theoretical explanation grounded in Rademacher complexity for why overfitting occurs in models with large-scale sparse categorical features. Based on this analysis, they propose a regularization method that adaptively constrains the norm budget of embedding layers.
Result: The proposed approach not only prevents severe performance degradation during multi-epoch training, but also improves model performance within a single epoch. The method has already been deployed in online production systems.
Conclusion: The paper provides both theoretical understanding and practical solution to the one-epoch overfitting problem in CTR/CVR models with sparse features, offering an adaptive regularization technique that enables stable multi-epoch training while improving single-epoch performance.
Abstract: The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in search, advertising, and recommendation domains. These models which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, the fundamental cause of this phenomenon remains unclear. In this work, we present a theoretical explanation grounded in Rademacher complexity, supported by empirical experiments, to explain why overfitting occurs in models with large-scale sparse categorical features. Based on this analysis, we propose a regularization method that constrains the norm budget of embedding layers adaptively. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves model performance within a single epoch. This method has already been deployed in online production systems.
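The constraint is easy to picture as a per-row norm projection on the embedding table. A minimal sketch, assuming a PyTorch embedding weight and a scalar or per-row `budget`; how the budget itself is adapted is the paper's contribution and is not reproduced here.

```python
import torch

@torch.no_grad()
def project_embedding_norms(E: torch.Tensor, budget) -> torch.Tensor:
    # rescale any embedding row whose L2 norm exceeds its budget
    norms = E.norm(dim=1, keepdim=True)
    scale = (budget / norms.clamp(min=1e-12)).clamp(max=1.0)
    return E.mul_(scale)
```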
[485] Convolutional Model Trees
William Ward Armstrong, Hongyi Li, Jun Xu
Main category: cs.LG
TL;DR: A method for creating forests of model trees to fit functions on images, with techniques for handling distortions and producing smooth, differentiable approximations.
Details
Motivation: To develop a robust method for fitting functions defined on images that can handle various distortions (small distortions, rotations, perspective changes) while producing smooth, continuously differentiable approximations.Method: Multi-step approach: down-sampling images, determining tree hyperplanes, applying convolutions to hyperplanes to handle small distortions, creating forests of model trees for accuracy and smoothness. Uses 1-to-1 correspondence among pixels, hyperplane coefficients, and leaf functions to handle larger distortions.
Result: Theoretical framework for smoothing forest outputs to produce continuously differentiable approximations, with proven convergence of the training procedure.
Conclusion: The method provides a comprehensive approach for fitting functions on images with robustness to distortions and theoretical guarantees for smooth, differentiable approximations.
Abstract: A method for creating a forest of model trees to fit samples of a function defined on images is described in several steps: down-sampling the images, determining a tree’s hyperplanes, applying convolutions to the hyperplanes to handle small distortions of training images, and creating forests of model trees to increase accuracy and achieve a smooth fit. A 1-to-1 correspondence among pixels of images, coefficients of hyperplanes and coefficients of leaf functions offers the possibility of dealing with larger distortions such as arbitrary rotations or changes of perspective. A theoretical method for smoothing forest outputs to produce a continuously differentiable approximation is described. Within that framework, a training procedure is proved to converge.
[486] Dynamic Correction of Erroneous State Estimates via Diffusion Bayesian Exploration
Yiwei Shi, Hongnan Ma, Mengyue Yang, Cunjia Liu, Weiru Liu
Main category: cs.LG
TL;DR: Proposes diffusion-driven Bayesian exploration framework to correct early state estimation errors in emergency response, overcoming permanent posterior support limitations of bootstrap particle filters.
Details
Motivation: Early state estimates in emergency response are critical but often based on limited/biased information, causing catastrophic delays and resource misallocation. Bootstrap particle filters suffer from Stationarity-Induced Posterior Support Invariance (S-PSI) where regions excluded by initial prior remain permanently unexplorable, making error correction impossible even with contradictory evidence.Method: Diffusion-driven Bayesian exploration framework using entropy-regularized sampling and covariance-scaled diffusion to expand posterior support. Includes Metropolis-Hastings check to validate proposals and keep inference adaptive to unexpected evidence.
Result: Matches reinforcement learning and planning baselines when priors are correct. Substantially outperforms classical SMC perturbations and RL-based methods under misalignment. Provides theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.
Conclusion: Proposed framework enables principled, real-time correction of early state estimation errors in high-stakes applications, overcoming permanent posterior support limitations of traditional methods through diffusion-driven exploration with theoretical guarantees.
Abstract: In emergency response and other high-stakes societal applications, early-stage state estimates critically shape downstream outcomes. Yet, these initial state estimates-often based on limited or biased information-can be severely misaligned with reality, constraining subsequent actions and potentially causing catastrophic delays, resource misallocation, and human harm. Under the stationary bootstrap baseline (zero transition and no rejuvenation), bootstrap particle filters exhibit Stationarity-Induced Posterior Support Invariance (S-PSI), wherein regions excluded by the initial prior remain permanently unexplorable, making corrections impossible even when new evidence contradicts current beliefs. While classical perturbations can in principle break this lock-in, they operate in an always-on fashion and may be inefficient. To overcome this, we propose a diffusion-driven Bayesian exploration framework that enables principled, real-time correction of early state estimation errors. Our method expands posterior support via entropy-regularized sampling and covariance-scaled diffusion. A Metropolis-Hastings check validates proposals and keeps inference adaptive to unexpected evidence. Empirical evaluations on realistic hazardous-gas localization tasks show that our approach matches reinforcement learning and planning baselines when priors are correct. It substantially outperforms classical SMC perturbations and RL-based methods under misalignment, and we provide theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.
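The exploration step can be sketched as covariance-scaled proposals filtered by a Metropolis-Hastings check. The entropy-regularized sampling and the full DEPF loop are omitted, and the proposal scale is an assumption.

```python
import numpy as np

def mh_diffusion_step(particles, loglik, scale=0.5,
                      rng=np.random.default_rng(0)):
    # covariance-scaled Gaussian proposals let particles escape the initial
    # prior's support; the MH check keeps each move statistically valid
    cov = scale * np.cov(particles.T) + 1e-6 * np.eye(particles.shape[1])
    out = particles.copy()
    for i, x in enumerate(particles):
        prop = rng.multivariate_normal(x, cov)
        if np.log(rng.uniform()) < loglik(prop) - loglik(x):  # symmetric proposal
            out[i] = prop
    return out
```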
[487] Conformal Online Learning of Deep Koopman Linear Embeddings
Ben Gao, Jordan Patracone, Stéphane Chrétien, Olivier Alata
Main category: cs.LG
TL;DR: COLoKe is a framework for adaptive online learning of Koopman embeddings from streaming data, using conformal-style mechanisms to trigger updates only when prediction errors exceed dynamic thresholds.
Details
Motivation: To develop an adaptive framework for learning Koopman-invariant representations of nonlinear dynamical systems from streaming data that prevents overfitting and reduces unnecessary updates while maintaining predictive accuracy.Method: Combines deep feature learning with multistep prediction consistency in the lifted linear space. Uses a conformal-style mechanism that assesses model consistency rather than state conformity, triggering updates only when prediction errors exceed dynamically calibrated thresholds to selectively refine the Koopman operator and embedding.
Result: Empirical results on benchmark dynamical systems show COLoKe effectively maintains long-term predictive accuracy while significantly reducing unnecessary updates and avoiding overfitting.
Conclusion: COLoKe provides an effective framework for adaptive online learning of Koopman embeddings that balances predictive accuracy with computational efficiency through selective updates based on model consistency assessment.
Abstract: We introduce Conformal Online Learning of Koopman embeddings (COLoKe), a novel framework for adaptively updating Koopman-invariant representations of nonlinear dynamical systems from streaming data. Our modeling approach combines deep feature learning with multistep prediction consistency in the lifted space, where the dynamics evolve linearly. To prevent overfitting, COLoKe employs a conformal-style mechanism that shifts the focus from evaluating the conformity of new states to assessing the consistency of the current Koopman model. Updates are triggered only when the current model’s prediction error exceeds a dynamically calibrated threshold, allowing selective refinement of the Koopman operator and embedding. Empirical results on benchmark dynamical systems demonstrate the effectiveness of COLoKe in maintaining long-term predictive accuracy while significantly reducing unnecessary updates and avoiding overfitting.
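The update-triggering logic is the part worth sketching. Below, a quantile of recent errors stands in for the dynamically calibrated threshold (an assumption; the paper's calibration rule may differ), and `update` refines the Koopman operator and embedding only when triggered.

```python
import numpy as np

def selective_online_updates(predict_err, update, stream, q=0.9, warmup=20):
    """predict_err(x, y) -> multistep prediction error of the current model;
    update(x, y) refits the Koopman operator/embedding on demand."""
    errs = []
    for step, (x, y) in enumerate(stream):
        e = predict_err(x, y)
        if step >= warmup and e > np.quantile(errs, q):
            update(x, y)        # refine only when the model is inconsistent
        errs.append(e)
    return errs
```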
[488] Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts
Xiaolei Lu, Shamim Nemati
Main category: cs.LG
TL;DR: AdaTTT is an adaptive test-time training framework for EHR-based invasive mechanical ventilation prediction that addresses domain shifts in ICU settings through self-supervised learning and partial optimal transport alignment.
Details
Motivation: Domain shifts caused by variability in patient populations, clinical practices, and EHR systems across institutions degrade the generalization of predictive models for IMV in ICUs, necessitating robust adaptation methods during deployment.Method: AdaTTT combines: 1) Information-theoretic bounds on test-time error, 2) Self-supervised learning with reconstruction and masked feature modeling using dynamic masking, 3) Prototype learning, and 4) Partial Optimal Transport for flexible feature alignment while preserving clinical representations.
Result: Experiments across multi-center ICU cohorts demonstrate competitive classification performance on different test-time adaptation benchmarks for IMV prediction.
Conclusion: AdaTTT provides an effective framework for adapting EHR-based predictive models to domain shifts in clinical settings without requiring labeled target-domain data, enabling more reliable IMV prediction across diverse ICU environments.
Abstract: Accurate prediction of the need for invasive mechanical ventilation (IMV) in intensive care units (ICUs) patients is crucial for timely interventions and resource allocation. However, variability in patient populations, clinical practices, and electronic health record (EHR) systems across institutions introduces domain shifts that degrade the generalization performance of predictive models during deployment. Test-Time Training (TTT) has emerged as a promising approach to mitigate such shifts by adapting models dynamically during inference without requiring labeled target-domain data. In this work, we introduce Adaptive Test-Time Training (AdaTTT), an enhanced TTT framework tailored for EHR-based IMV prediction in ICU settings. We begin by deriving information-theoretic bounds on the test-time prediction error and demonstrate that it is constrained by the uncertainty between the main and auxiliary tasks. To enhance their alignment, we introduce a self-supervised learning framework with pretext tasks: reconstruction and masked feature modeling optimized through a dynamic masking strategy that emphasizes features critical to the main task. Additionally, to improve robustness against domain shifts, we incorporate prototype learning and employ Partial Optimal Transport (POT) for flexible, partial feature alignment while maintaining clinically meaningful patient representations. Experiments across multi-center ICU cohorts demonstrate competitive classification performance on different test-time adaptation benchmarks.
[489] K2-V2: A 360-Open, Reasoning-Enhanced LLM
K2 Team, Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, Zhoujun Cheng, Suqi Sun, Seungwook Han, Bowen Tan, Gurpreet Gosal, Xudong Han, Varad Pimpalkhute, Shibo Hao, Ming Shan Hee, Joel Hestness, Haolong Jia, Liqun Ma, Aaryamonvikram Singh, Daria Soboleva, Natalia Vassilieva, Renxi Wang, Yingquan Wu, Yuekai Sun, Taylor Killian, Alexander Moreno, John Maggs, Hector Ren, Guowei He, Hongyi Wang, Xuezhe Ma, Yuqi Wang, Mikhail Yurochkin, Eric P. Xing
Main category: cs.LG
TL;DR: K2-V2 is a 360-open LLM built from scratch that serves as a superior reasoning-focused base model, rivaling top open-weight models while being fully open-source with complete training transparency.
Details
Motivation: To create a fully open, reasoning-centric foundation model that can serve as a superior base for reasoning adaptation, while providing complete transparency in training data and process to empower the open-source community.Method: Built from scratch with active infusion of domain knowledge, reasoning, long-context, and tool use throughout training. Uses simple supervised fine-tuning to establish strong baselines, with full transparency of training history and data composition.
Result: K2-V2 stands as the strongest fully open model, rivals open-weight leaders in its size class, outperforms Qwen2.5-72B and approaches Qwen3-235B performance. Demonstrates significant headroom for advanced alignment.
Conclusion: K2-V2 provides a capable, reasoning-centric foundation with complete transparency (full training history, data composition, and weights), maximizing effectiveness for continuous training scenarios and empowering the open-source community.
Abstract: We introduce K2-V2, a 360-open LLM built from scratch as a superior base for reasoning adaptation, in addition to functions such as conversation and knowledge retrieval from general LLMs. It stands as the strongest fully open model, rivals open-weight leaders in its size class, outperforms Qwen2.5-72B and approaches the performance of Qwen3-235B. We actively infuse domain knowledge, reasoning, long-context, and tool use throughout the training process. This explicitly prepares the model for complex reasoning tasks. We demonstrate this potential using simple supervised fine-tuning, establishing a strong baseline that indicates significant headroom for advanced alignment. By releasing the full training history and data composition, we maximize the effectiveness of continuous training, a key open source production scenario. We release the model weights and signature LLM360 artifacts, such as complete training data, to empower the community with a capable, reasoning-centric foundation.
[490] Geometric Dynamics of Agentic Loops in Large Language Models
Nicolas Tacheny
Main category: cs.LG
TL;DR: Iterative LLM systems exhibit predictable dynamical behaviors in semantic space that can be classified as contractive (convergent), oscillatory (cycling), or exploratory (divergent), with prompt design controlling these regimes.
Details
Motivation: Prior work only evaluates task performance at convergence, ignoring how semantic content evolves across iterations. Without understanding temporal dynamics, we cannot predict system behavior, guarantee stability, or systematically design iterative architectures.Method: Formalize agentic loops as discrete dynamical systems in semantic space, borrowing from dynamical systems theory to define trajectories, attractors, and dynamical regimes for recursive LLM transformations.
Result: Experiments show iterative paraphrasing produces contractive dynamics with measurable attractor formation, while iterative negation produces exploratory dynamics. Prompt design directly controls the dynamical regime - the same model exhibits fundamentally different geometric behaviors depending on the transformation applied.
Conclusion: Iterative LLM dynamics are predictable and controllable, opening new directions for stability analysis, trajectory forecasting, and principled design of composite loops that balance convergence and exploration.
Abstract: Iterative LLM systems (self-refinement, chain-of-thought, autonomous agents) are increasingly deployed, yet their temporal dynamics remain uncharacterized. Prior work evaluates task performance at convergence but ignores the trajectory: how does semantic content evolve across iterations? Does it stabilize, drift, or oscillate? Without answering these questions, we cannot predict system behavior, guarantee stability, or systematically design iterative architectures. We formalize agentic loops as discrete dynamical systems in semantic space. Borrowing from dynamical systems theory, we define trajectories, attractors, and dynamical regimes for recursive LLM transformations, providing rigorous geometric definitions adapted to this setting. Our framework reveals that agentic loops exhibit classifiable dynamics: contractive (convergence toward stable semantic attractors), oscillatory (cycling among attractors), or exploratory (unbounded divergence). Experiments on singular loops validate the framework. Iterative paraphrasing produces contractive dynamics with measurable attractor formation and decreasing dispersion. Iterative negation produces exploratory dynamics with no stable structure. Crucially, prompt design directly controls the dynamical regime - the same model exhibits fundamentally different geometric behaviors depending solely on the transformation applied. This work establishes that iterative LLM dynamics are predictable and controllable, opening new directions for stability analysis, trajectory forecasting, and principled design of composite loops that balance convergence and exploration.
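Measuring the regime takes only an encoder and a loop. A minimal sketch, assuming `transform` (e.g., a paraphrase or negation prompt against an LLM) and `embed` (a sentence encoder) as user-supplied callables; the step-size trend over iterations separates contractive from exploratory dynamics.

```python
import numpy as np

def trajectory_steps(transform, embed, text, n_iter=20):
    zs = []
    for _ in range(n_iter):
        text = transform(text)
        zs.append(np.asarray(embed(text)))
    # shrinking steps suggest contraction toward an attractor; growing steps
    # suggest exploratory divergence; flat, non-vanishing steps suggest cycling
    return [float(np.linalg.norm(b - a)) for a, b in zip(zs, zs[1:])]
```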
[491] GS-KAN: Parameter-Efficient Kolmogorov-Arnold Networks via Sprecher-Type Shared Basis Functions
Oscar Eliasson
Main category: cs.LG
TL;DR: GS-KAN is a lightweight KAN variant that uses shared parent functions with learnable linear transformations per layer, achieving better parameter efficiency than standard KANs while maintaining strong performance on function approximation, tabular regression, and image classification tasks.
Details
Motivation: Standard Kolmogorov-Arnold Networks (KANs) suffer from parameter inefficiency because they require unique parameterizations for every network edge, making them impractical for high-dimensional applications under parameter constraints.Method: GS-KAN constructs unique edge functions by applying learnable linear transformations to a single learnable, shared parent function per layer, inspired by David Sprecher’s refinement of the superposition theorem.
Result: GS-KAN outperforms MLPs and standard KANs on continuous function approximation tasks while maintaining superior parameter efficiency. It achieves competitive performance with existing KANs on tabular regression and outperforms MLPs on high-dimensional classification tasks.
Conclusion: GS-KAN enables deployment of KAN-based architectures in high-dimensional regimes under strict parameter constraints, addressing the parameter explosion problem of standard KAN implementations.
Abstract: The Kolmogorov-Arnold representation theorem offers a theoretical alternative to Multi-Layer Perceptrons (MLPs) by placing learnable univariate functions on edges rather than nodes. While recent implementations such as Kolmogorov-Arnold Networks (KANs) demonstrate high approximation capabilities, they suffer from significant parameter inefficiency due to the requirement of maintaining unique parameterizations for every network edge. In this work, we propose GS-KAN (Generalized Sprecher-KAN), a lightweight architecture inspired by David Sprecher’s refinement of the superposition theorem. GS-KAN constructs unique edge functions by applying learnable linear transformations to a single learnable, shared parent function per layer. We evaluate GS-KAN against existing KAN architectures and MLPs across synthetic function approximation, tabular data regression and image classification tasks. Our results demonstrate that GS-KAN outperforms both MLPs and standard KAN baselines on continuous function approximation tasks while maintaining superior parameter efficiency. Additionally, GS-KAN achieves competitive performance with existing KAN architectures on tabular regression and outperforms MLPs on high-dimensional classification tasks. Crucially, the proposed architecture enables the deployment of KAN-based architectures in high-dimensional regimes under strict parameter constraints, a setting where standard implementations are typically infeasible due to parameter explosion. The source code is available at https://github.com/rambamn48/gs-impl.
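The sharing scheme reduces to one shared univariate function per layer plus a pair of scalars per edge. A minimal PyTorch sketch under that reading, with a small MLP standing in for the learnable parent function (the paper's exact parametrization may differ):

```python
import torch
import torch.nn as nn

class GSKANLayer(nn.Module):
    """Sketch: every edge function is phi_ij(x) = parent(a_ij * x + b_ij),
    with a single shared learnable `parent` per layer."""
    def __init__(self, d_in, d_out, hidden=16):
        super().__init__()
        self.a = nn.Parameter(torch.randn(d_out, d_in))   # per-edge scale
        self.b = nn.Parameter(torch.zeros(d_out, d_in))   # per-edge shift
        self.parent = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                       # x: (batch, d_in)
        u = self.a * x.unsqueeze(1) + self.b    # (batch, d_out, d_in) edge inputs
        phi = self.parent(u.unsqueeze(-1)).squeeze(-1)  # shared parent on every edge
        return phi.sum(dim=-1)                  # each node sums incoming edges

layer = GSKANLayer(8, 4)
print(layer(torch.randn(32, 8)).shape)          # torch.Size([32, 4])
```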
[492] Generalized Spherical Neural Operators: Green’s Function Formulation
Hao Tang, Hao Chen, Chao Li
Main category: cs.LG
TL;DR: GSNO is a novel spherical neural operator framework using designable Green’s functions with harmonic expansion, enabling flexible balance of equivariance/invariance for real-world spherical PDE problems.
Details
Motivation: Existing spherical neural operators lack flexibility for real-world complexity while needing to preserve intrinsic geometry and avoid distortions that break rotational consistency.Method: Proposed GSNO framework based on designable spherical Green’s function and harmonic expansion, with absolute/relative position-dependent Green’s function for flexible equivariance-invariance balance. Developed SHNet hierarchical architecture with multi-scale spectral modeling and spherical up-down sampling.
Result: GSNO and SHNet consistently outperform state-of-the-art methods on diffusion MRI, shallow water dynamics, and global weather forecasting tasks.
Conclusion: GSNO provides a principled, generalized framework for spherical operator design that bridges rigorous theory with real-world complexity, positioning it as a foundational approach for spherical learning.
Abstract: Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a generalized operator-design framework based on the designable spherical Green’s function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green’s function that enables flexible balance of equivariance and invariance for real-world modeling. The resulting operator, Green’s-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to non-equivariant systems while retaining spectral efficiency and grid invariance. To exploit GSNO, we develop SHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. In evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and SHNet consistently outperform state-of-the-art methods. The theoretical and experimental results position GSNO as a principled and generalized framework for spherical operator design and learning, bridging rigorous theory with real-world complexity.
[493] Learning under Distributional Drift: Reproducibility as an Intrinsic Statistical Resource
Sofiya Zaichyk
Main category: cs.LG
TL;DR: The paper introduces a “reproducibility budget” C_T to quantify statistical reproducibility under distributional drift, derives optimal generalization bounds, and establishes a fundamental reproducibility speed limit.
Details
Motivation: Statistical learning under distributional drift is poorly characterized - when each observation alters the data-generating law, classical generalization bounds can fail. There's a need to quantify how much statistical reproducibility is possible when both exogenous changes and endogenous feedback affect the learning process.Method: Introduces a new statistical primitive called the “reproducibility budget” C_T, defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution. This measures total distributional motion during learning. Uses this construct to derive generalization bounds and prove minimax optimality.
Result: Derives a drift-feedback generalization bound of order O(T^{-1/2} + C_T/T), proves a matching minimax lower bound showing this rate is optimal, and establishes a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than imposed by the average Fisher-Rao drift rate C_T/T.
Conclusion: The reproducibility budget C_T emerges as the intrinsic quantity measuring distributional motion across settings including exogenous drift, adaptive data analysis, and performative prediction, providing a unified geometric framework for understanding statistical learning under distributional change.
Abstract: Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget $C_T$, which quantifies a system’s finite capacity for statistical reproducibility: the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, $C_T$ is defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order $O(T^{-1/2} + C_T/T)$, and we prove a matching minimax lower bound showing that this rate is minimax-optimal. Consequently, the results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate $C_T/T$ of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with $C_T$ emerging as the intrinsic quantity measuring distributional motion across these settings.
[494] Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation
Zikun Guo, Adeyinka P. Adedigba, Rammohan Mallipeddi
Main category: cs.LG
TL;DR: Proposed Cluster Aggregated GAN framework for synthetic appliance data generation that handles intermittent and continuous appliances differently to improve training stability and output fidelity.
Details
Motivation: Scarcity of labeled datasets for non-intrusive load monitoring (NILM) and privacy-preserving energy research. Existing GAN-based methods treat all devices uniformly, neglecting behavioral differences between intermittent and continuous appliances, leading to unstable training and limited output fidelity.Method: Hybrid generative framework that routes appliances to specialized branches based on behavioral characteristics. For intermittent appliances: clustering module groups similar activation patterns with dedicated generators per cluster. For continuous appliances: separate branch with LSTM-based generator using sequence compression for training stability.
Result: Outperforms baseline methods across metrics measuring realism, diversity, and training stability on UVIC smart plug dataset. Integrating clustering as active generative component improves both interpretability and scalability.
Conclusion: The proposed framework establishes an effective approach for synthetic load generation in non-intrusive load monitoring research, addressing limitations of uniform device treatment in existing methods.
Abstract: Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.
[495] Bridging Training and Merging Through Momentum-Aware Optimization
Alireza Moayedikia, Alicia Troncoso
Main category: cs.LG
TL;DR: A unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition, eliminating the need to recompute curvature for merging.
Details
Motivation: Current workflows compute curvature information during training, discard it, then recompute similar information for merging, wasting computation and discarding valuable trajectory data. There's a need to unify training and merging by reusing optimization trajectory information.Method: Maintain factorized momentum and curvature statistics during training with modest memory overhead (~30% over AdamW). Accumulate task saliency scores as a byproduct of optimization that provide importance estimates comparable to post hoc Fisher computation. These scores enable curvature-aware merging directly from training.
Result: Curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels on natural language understanding benchmarks. Multi-task merging improves 1.6% over strong baselines. The framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers.
Conclusion: Training-time curvature information suffices for effective model composition, enabling a unified training-merging pipeline. By treating optimization trajectory as a reusable asset rather than discarding it, the approach demonstrates that training and merging can be efficiently unified.
Abstract: Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging, wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method incurs modest memory overhead (approximately 30% over AdamW) to accumulate task saliency scores that enable curvature-aware merging. These scores, computed as a byproduct of optimization, provide importance estimates comparable to post hoc Fisher computation while producing merge-ready models directly from training. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving 1.6% over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach demonstrates that training-time curvature information suffices for effective model composition, enabling a unified training-merging pipeline.
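The merge-time use of training statistics can be sketched in a few lines. The diagonal g² accumulator below is a simplification of the paper's factorized statistics, and the median-threshold selection rule is an assumption for illustration.

```python
import torch

def accumulate_saliency(saliency, grads, beta=0.99):
    # running diagonal-Fisher-style proxy (EMA of g^2), gathered as a
    # byproduct of training rather than recomputed post hoc
    for name, g in grads.items():
        prev = saliency.get(name, torch.zeros_like(g))
        saliency[name] = beta * prev + (1 - beta) * g ** 2
    return saliency

def curvature_aware_merge(base, task_models, saliencies):
    # graft each task's high-saliency parameters onto the base model; the
    # median threshold and last-writer-wins overlap handling are assumptions
    merged = {k: v.clone() for k, v in base.items()}
    for model, sal in zip(task_models, saliencies):
        for k in merged:
            mask = sal[k] > sal[k].median()
            merged[k][mask] = model[k][mask]
    return merged
```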
[496] Reinforcement Learning to Discover a NorthEast Monsoon Index for Monthly Rainfall Prediction in Thailand
Kiattikun Chobtham
Main category: cs.LG
TL;DR: A novel NorthEast monsoon climate index optimized via reinforcement learning improves long-term rainfall prediction in Thailand by reducing forecast errors.
Details
Motivation: Existing global climate indices like ENSO are insufficient for accurate local-scale rainfall prediction in specific Thai regions, creating a need for region-specific climate indices.
Method: Developed a NorthEast monsoon climate index from sea surface temperature, optimized using Deep Q-Network reinforcement learning to select effective rectangular areas based on correlation with seasonal rainfall. Rainfall stations were clustered into 12 groups, and the optimized index was incorporated into LSTM models.
Result: The optimized index significantly improved long-term monthly rainfall prediction skill in most cluster areas and effectively reduced Root Mean Square Error for 12-month-ahead forecasts.
Conclusion: The reinforcement learning-optimized local climate index approach successfully enhances rainfall prediction accuracy in Thailand, demonstrating the value of region-specific climate indices over global ones for local forecasting.
Abstract: Climate prediction is a challenge due to the intricate spatiotemporal patterns within Earth systems. Global climate indices, such as the El Niño Southern Oscillation, are standard input features for long-term rainfall prediction. However, a significant gap persists regarding local-scale indices capable of improving predictive accuracy in specific regions of Thailand. This paper introduces a novel NorthEast monsoon climate index calculated from sea surface temperature to reflect the climatology of the boreal winter monsoon. To optimise the calculated areas used for this index, a Deep Q-Network reinforcement learning agent explores and selects the most effective rectangles based on their correlation with seasonal rainfall. Rainfall stations were classified into 12 distinct clusters to distinguish rainfall patterns between southern and upper Thailand. Experimental results show that incorporating the optimised index into Long Short-Term Memory models significantly improves long-term monthly rainfall prediction skill in most cluster areas. This approach effectively reduces the Root Mean Square Error for 12-month-ahead forecasts.
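For intuition, a minimal sketch of the reward the DQN agent might optimize when choosing an SST rectangle; the rectangle encoding and reward shaping here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def index_reward(sst, rect, rainfall):
    """Reward for a candidate SST rectangle.

    `sst`: (years, lat, lon) sea-surface temperature; `rect` encodes the
    rectangle as (lat0, lat1, lon0, lon1); `rainfall`: (years,) seasonal
    totals. The index is the rectangle-mean SST, and the reward is its
    absolute correlation with rainfall.
    """
    lat0, lat1, lon0, lon1 = rect
    index = sst[:, lat0:lat1, lon0:lon1].mean(axis=(1, 2))
    return abs(np.corrcoef(index, rainfall)[0, 1])
```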
[497] Learning Neural Operators from Partial Observations via Latent Autoregressive Modeling
Jingren Hou, Hong Wang, Pengyu Xu, Chang Gao, Huafeng Liu, Liping Jing
Main category: cs.LG
TL;DR: LANO introduces the first systematic framework for learning neural operators from partial observations, addressing key challenges in real-world scientific applications with incomplete data.
Details
Motivation: Real-world scientific applications often have incomplete observational data due to sensor limitations, geographic constraints, or measurement costs. Current neural operators assume fully-observed spatial inputs, which severely restricts their applicability in practical scenarios.
Method: Proposes Latent Autoregressive Neural Operator (LANO) with two novel components: (1) mask-to-predict training strategy that creates artificial supervision by strategically masking observed regions, and (2) Physics-Aware Latent Propagator that reconstructs solutions through boundary-first autoregressive generation in latent space. Also introduces POBench-PDE benchmark for evaluation.
Result: LANO achieves state-of-the-art performance with 18-69% relative L2 error reduction across all benchmarks under patch-wise missingness with less than 50% missing rate, including real-world climate prediction. Effectively handles scenarios with up to 75% missing rate.
Conclusion: LANO bridges the gap between idealized research settings and real-world scientific computing by enabling neural operators to work effectively with partial observations, addressing fundamental obstacles of supervision gaps and spatial mismatches in incomplete data scenarios.
Abstract: Real-world scientific applications frequently encounter incomplete observational data due to sensor limitations, geographic constraints, or measurement costs. Although neural operators significantly advanced PDE solving in terms of computational efficiency and accuracy, their underlying assumption of fully-observed spatial inputs severely restricts applicability in real-world applications. We introduce the first systematic framework for learning neural operators from partial observations. We identify and formalize two fundamental obstacles: (i) the supervision gap in unobserved regions that prevents effective learning of physical correlations, and (ii) the dynamic spatial mismatch between incomplete inputs and complete solution fields. Specifically, our proposed Latent Autoregressive Neural Operator (LANO) introduces two novel components designed explicitly to address the core difficulties of partial observations: (i) a mask-to-predict training strategy that creates artificial supervision by strategically masking observed regions, and (ii) a Physics-Aware Latent Propagator that reconstructs solutions through boundary-first autoregressive generation in latent space. Additionally, we develop POBench-PDE, a dedicated and comprehensive benchmark designed specifically for evaluating neural operators under partial observation conditions across three PDE-governed tasks. LANO achieves state-of-the-art performance with 18–69% relative L2 error reduction across all benchmarks under patch-wise missingness with less than 50% missing rate, including real-world climate prediction. Our approach effectively addresses practical scenarios involving up to 75% missing rate, to some extent bridging the existing gap between idealized research settings and the complexities of real-world scientific computing.
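A minimal sketch of the mask-to-predict idea, assuming a pixel-space field for simplicity (the paper operates on latent representations): hide a fraction of the observed points and supervise their reconstruction.

```python
import torch

def mask_to_predict(field, obs_mask, mask_ratio=0.3):
    """Hide a fraction of *observed* points to create artificial supervision.

    `field`: (B, H, W) partially observed snapshot (zeros where unobserved);
    `obs_mask`: (B, H, W) boolean, True where a reading exists. Returns the
    further-masked input and the set of points the model must reconstruct.
    """
    drop = (torch.rand_like(field) < mask_ratio) & obs_mask
    masked_input = field.masked_fill(drop, 0.0)
    return masked_input, drop

# Training step (sketch): loss = MSE(model(masked_input)[drop], field[drop])
```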
[498] HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
Aakriti Lnu, Zhe Li, Dandan Liang, Chao Huang, Rui Li, Haibo Yang
Main category: cs.LG
TL;DR: HOSL is a hybrid-order split learning framework that combines zeroth-order optimization on clients with first-order optimization on servers to reduce memory usage while maintaining performance in collaborative LLM training.
Details
Motivation: Existing split learning systems use first-order optimization requiring clients to store activations for backpropagation, causing substantial memory overhead that negates benefits of model partitioning. Zeroth-order optimization reduces memory but suffers from slow convergence and degraded performance.
Method: HOSL strategically integrates ZO optimization on client side with FO optimization on server side. Clients use memory-efficient ZO gradient estimation to eliminate backpropagation and activation storage, while servers use FO optimization for fast convergence.
Result: HOSL reduces client GPU memory by up to 3.7× compared to FO methods while achieving accuracy within 0.20%-4.23% of FO baseline. It outperforms ZO baseline by up to 15.55%. Theoretically achieves O(√(d_c/TQ)) convergence rate dependent on client-side dimension rather than full model dimension.
Conclusion: HOSL effectively addresses the trade-off between memory efficiency and optimization effectiveness in split learning for LLMs, enabling memory-efficient training on edge devices while maintaining competitive performance through hybrid-order optimization.
Abstract: Split learning (SL) enables collaborative training of large language models (LLMs) between resource-constrained edge devices and compute-rich servers by partitioning model computation across the network boundary. However, existing SL systems predominantly rely on first-order (FO) optimization, which requires clients to store intermediate quantities such as activations for backpropagation. This results in substantial memory overhead, largely negating benefits of model partitioning. In contrast, zeroth-order (ZO) optimization eliminates backpropagation and significantly reduces memory usage, but often suffers from slow convergence and degraded performance. In this work, we propose HOSL, a novel Hybrid-Order Split Learning framework that addresses this fundamental trade-off between memory efficiency and optimization effectiveness by strategically integrating ZO optimization on the client side with FO optimization on the server side. By employing memory-efficient ZO gradient estimation at the client, HOSL eliminates backpropagation and activation storage, reducing client memory consumption. Meanwhile, server-side FO optimization ensures fast convergence and competitive performance. Theoretically, we show that HOSL achieves an $\mathcal{O}(\sqrt{d_c/TQ})$ rate, which depends on client-side model dimension $d_c$ rather than the full model dimension $d$, demonstrating that convergence improves as more computation is offloaded to the server. Extensive experiments on OPT models (125M and 1.3B parameters) across 6 tasks demonstrate that HOSL reduces client GPU memory by up to 3.7$\times$ compared to the FO method while achieving accuracy within 0.20%-4.23% of this baseline. Furthermore, HOSL outperforms the ZO baseline by up to 15.55%, validating the effectiveness of our hybrid strategy for memory-efficient training on edge devices.
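The client-side memory saving comes from replacing backpropagation with a finite-difference gradient estimate. Below is a generic two-point (SPSA-style) zeroth-order update sketch; HOSL's exact estimator and perturbation scheme may differ.

```python
import torch

def zo_client_step(client_params, loss_fn, lr=1e-4, eps=1e-3):
    """Two-point zeroth-order update for the client-side sub-model.

    `loss_fn()` runs the full split forward pass (client layers, then
    server) and returns a scalar loss; the client never backpropagates or
    stores activations.
    """
    zs = [torch.randn_like(p) for p in client_params]
    with torch.no_grad():
        for p, z in zip(client_params, zs):
            p.add_(eps * z)
        loss_plus = loss_fn()
        for p, z in zip(client_params, zs):
            p.sub_(2 * eps * z)
        loss_minus = loss_fn()
        scale = (loss_plus - loss_minus) / (2 * eps)  # directional derivative
        for p, z in zip(client_params, zs):
            p.add_(eps * z)          # restore the original parameters
            p.sub_(lr * scale * z)   # gradient step along the probe direction
```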
[499] Dual-Prototype Disentanglement: A Context-Aware Enhancement Framework for Time Series Forecasting
Haonan Yang, Jianchao Tang, Zhuo Li
Main category: cs.LG
TL;DR: DPAD is a model-agnostic auxiliary framework that enhances time series forecasting models by dynamically disentangling complex temporal patterns through dual prototype banks and context-aware routing.
Details
Motivation: Current deep learning approaches for time series forecasting often fail to dynamically disentangle complex, intertwined temporal patterns, resulting in static, averaged representations that lack context-aware capabilities.
Method: Proposes DPAD with three key components: 1) Dynamic Dual-Prototype bank (DDP) with common pattern bank (strong temporal priors) and rare pattern bank (critical infrequent events), 2) Dual-Path Context-aware routing (DPC) mechanism for selective retrieval of context-specific patterns, and 3) Disentanglement-Guided Loss (DGLoss) to ensure specialization and coverage.
Result: Comprehensive experiments show DPAD consistently improves forecasting performance and reliability of state-of-the-art models across diverse real-world benchmarks.
Conclusion: DPAD successfully addresses the limitation of static pattern representations in time series forecasting by providing model-agnostic pattern disentanglement and context-aware adaptation capabilities.
Abstract: Time series forecasting has witnessed significant progress with deep learning. While prevailing approaches enhance forecasting performance by modifying architectures or introducing novel enhancement strategies, they often fail to dynamically disentangle and leverage the complex, intertwined temporal patterns inherent in time series, thus resulting in the learning of static, averaged representations that lack context-aware capabilities. To address this, we propose the Dual-Prototype Adaptive Disentanglement framework (DPAD), a model-agnostic auxiliary method that equips forecasting models with pattern disentanglement and context-aware adaptation capabilities. Specifically, we construct a Dynamic Dual-Prototype bank (DDP), comprising a common pattern bank with strong temporal priors to capture prevailing trend or seasonal patterns, and a rare pattern bank that dynamically memorizes critical yet infrequent events. A Dual-Path Context-aware routing (DPC) mechanism is then proposed to enhance outputs with selectively retrieved context-specific pattern representations from the DDP. Additionally, we introduce a Disentanglement-Guided Loss (DGLoss) to ensure that each prototype bank specializes in its designated role while maintaining comprehensive coverage. Comprehensive experiments demonstrate that DPAD consistently improves the forecasting performance and reliability of state-of-the-art models across diverse real-world benchmarks.
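A minimal sketch of the context-aware routing step under illustrative assumptions (cosine similarity, top-k softmax weighting, additive fusion of the two banks); the paper's DPC mechanism may use a different retrieval and fusion rule.

```python
import torch
import torch.nn.functional as F

def route_prototypes(h, common_bank, rare_bank, top_k=4):
    """Retrieve context-specific prototypes from the dual banks.

    `h`: (B, D) context embedding from the backbone forecaster; each bank
    is (N, D). Returns an enhancement vector built from the top-k most
    similar prototypes in each bank.
    """
    def retrieve(bank):
        sim = F.cosine_similarity(h.unsqueeze(1), bank.unsqueeze(0), dim=-1)
        w, idx = sim.topk(top_k, dim=-1)          # (B, k)
        w = w.softmax(dim=-1)
        return (w.unsqueeze(-1) * bank[idx]).sum(dim=1)  # (B, D)

    return retrieve(common_bank) + retrieve(rare_bank)
```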
[500] Forcing and Diagnosing Failure Modes of Fourier Neural Operators Across Diverse PDE Families
Lennon Shikhman
Main category: cs.LG
TL;DR: FNOs show poor robustness under distribution shifts, long-horizon rollouts, and structural perturbations. Systematic stress testing reveals vulnerabilities including spectral bias, compounding errors, and overfitting.
Details
Motivation: While Fourier Neural Operators (FNOs) perform well on PDE solution maps, their robustness under various challenging conditions remains poorly understood. The paper aims to systematically identify failure modes and vulnerabilities of FNOs across different PDE families.
Method: Developed a systematic stress-testing framework that probes FNOs across five PDE families (dispersive, elliptic, multi-scale fluid, financial, chaotic). Designed controlled stress tests including parameter shifts, boundary/terminal condition changes, resolution extrapolation with spectral analysis, and iterative rollouts. Trained 1,000 models for large-scale evaluation.
Result: Distribution shifts in parameters or boundary conditions can inflate errors by more than an order of magnitude. Resolution changes primarily concentrate error in high-frequency modes. Input perturbations generally do not amplify error, though worst-case scenarios (e.g., localized Poisson perturbations) remain challenging.
Conclusion: The study provides a comparative failure-mode atlas and actionable insights for improving robustness in operator learning. The systematic stress testing framework reveals critical vulnerabilities in FNOs that need to be addressed for more reliable PDE solution learning.
Abstract: Fourier Neural Operators (FNOs) have shown strong performance in learning solution maps of partial differential equations (PDEs), but their robustness under distribution shifts, long-horizon rollouts, and structural perturbations remains poorly understood. We present a systematic stress-testing framework that probes failure modes of FNOs across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Rather than optimizing in-distribution accuracy, we design controlled stress tests - including parameter shifts, boundary or terminal condition changes, resolution extrapolation with spectral analysis, and iterative rollouts - to expose vulnerabilities such as spectral bias, compounding integration errors, and overfitting to restricted boundary regimes. Our large-scale evaluation (1,000 trained models) reveals that distribution shifts in parameters or boundary conditions can inflate errors by more than an order of magnitude, while resolution changes primarily concentrate error in high-frequency modes. Input perturbations generally do not amplify error, though worst-case scenarios (e.g., localized Poisson perturbations) remain challenging. These findings provide a comparative failure-mode atlas and actionable insights for improving robustness in operator learning.
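As an example of one of the stress tests, here is a minimal sketch of an iterative-rollout probe that exposes compounding integration error; the model and trajectory interfaces are hypothetical.

```python
import torch

@torch.no_grad()
def rollout_errors(model, u0, reference):
    """Autoregressive rollout probe: feed the model its own predictions.

    `reference`: (steps, ...) ground-truth trajectory. Returns per-step
    relative L2 errors, which reveal compounding integration error.
    """
    u, errs = u0, []
    for t in range(reference.shape[0]):
        u = model(u)
        rel = torch.norm(u - reference[t]) / torch.norm(reference[t])
        errs.append(rel.item())
    return errs
```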
[501] SolarGPT-QA: A Domain-Adaptive Large Language Model for Educational Question Answering in Space Weather and Heliophysics
Santosh Chapagain, MohammadReza EskandariNasab, Onur Vural, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
Main category: cs.LG
TL;DR: SolarGPT-QA: A domain-adapted LLM for space science education that combines scientific literature with pedagogical fine-tuning to improve explanations of solar activity concepts.
Details
Motivation: Solar activity impacts critical infrastructure with limited warning, requiring better educational tools. General LLMs lack domain-specific knowledge and pedagogical capability for explaining complex space science concepts clearly.
Method: Built on LLaMA-3 base model, trained with scientific literature and large-scale QA data generated by GPT-4, refined using Grok-3 in student-friendly storytelling style. Combines domain-adaptive pretraining with pedagogical fine-tuning.
Result: SolarGPT-QA outperforms general-purpose models in zero-shot settings, achieves competitive performance vs instruction-tuned models for educational explanations. Pilot study shows improved clarity and accessibility. Ablation experiments confirm importance of both domain adaptation and pedagogical tuning.
Conclusion: SolarGPT-QA represents initial step toward broader SolarGPT framework for space science education and forecasting, demonstrating that combining domain knowledge with pedagogical fine-tuning balances scientific accuracy and educational effectiveness.
Abstract: Solar activity, including solar flares, coronal mass ejections (CMEs), and geomagnetic storms, can significantly impact satellites, aviation, power grids, data centers, and space missions. Extreme solar events can cause substantial economic damage with limited advance warning, underscoring the importance of early-warning systems, accurate forecasting, and effective education in space science. Although large language models (LLMs) perform well on general tasks, they often lack domain-specific knowledge and pedagogical capability to clearly explain complex space science concepts. We introduce SolarGPT-QA, a question answering system based on a domain-adapted large language model built on the LLaMA-3 base model. The model is trained using scientific literature and large-scale question-answer data generated with GPT-4 and refined using Grok-3 in a student-friendly storytelling style. Human pairwise evaluations show that SolarGPT-QA outperforms general-purpose models in zero-shot settings and achieves competitive performance compared to instruction-tuned models for educational explanations in space weather and heliophysics. A small pilot student comprehension study further suggests improved clarity and accessibility of the generated explanations. Ablation experiments indicate that combining domain-adaptive pretraining with pedagogical fine-tuning is important for balancing scientific accuracy and educational effectiveness. This work represents an initial step toward a broader SolarGPT framework for space science education and forecasting.
[502] Adaptive KDE for Real-Time Thresholding: Prioritized Queues for Financial Crime Investigation
Danny Butvinik, Nana Boateng, Achi Hackmon
Main category: cs.LG
TL;DR: Proposes a method for converting continuous risk scores into stable decision thresholds under non-stationary score distributions in detection systems.
Details
Motivation: Detection systems need to partition continuous risk scores into prioritized processing regions while maintaining semantic consistency over time, especially when score distributions are non-stationary.
Method: The paper studies methods for converting continuous risk score streams into stable decision thresholds that adapt to non-stationary distributions while preserving temporal consistency.
Result: The approach enables detection systems to maintain consistent decision boundaries over time despite changing score distributions, improving reliability of prioritized processing regions.
Conclusion: Stable threshold conversion under non-stationarity is crucial for maintaining semantic consistency in detection systems that process continuous risk score streams.
Abstract: We study the problem of converting a continuous stream of risk scores into stable decision thresholds under non-stationary score distributions. This problem arises in a wide range of detection systems where scores must be partitioned into prioritized processing regions while preserving semantic consistency over time.
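One plausible reading of the thresholding step, as a hedged sketch given the title's reference to KDE: fit a kernel density estimate to a recent window of scores and cut queues at density valleys. The paper's adaptive bandwidth and temporal-consistency machinery are not reproduced here.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_thresholds(scores, grid_size=512):
    """Cut prioritized queues at the valleys of the score density.

    Fits a Gaussian KDE over a recent window of risk scores and returns
    the local minima of the estimated density as candidate thresholds.
    """
    kde = gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    density = kde(grid)
    interior = density[1:-1]
    valleys = (interior < density[:-2]) & (interior < density[2:])
    return grid[1:-1][valleys]
```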
[503] Efficient Gaussian process learning via subspace projections
Elsa Cazelles, Felipe Tobar
Main category: cs.LG
TL;DR: A new projected likelihood objective for Gaussian Processes using low-dimensional linear projections improves accuracy and efficiency over exact and sparse GPs.
Details
Motivation: To address computational challenges in training Gaussian Processes on moderately large datasets while maintaining accuracy, by developing a more efficient training objective.
Method: Proposes projected likelihood (PL) using lower-dimensional linear projections of data, provides closed-form expression for information loss, and uses random projections on unit sphere to reduce this loss.
Result: PL shows superiority over exact GP training and variational free energy approach to sparse GPs in terms of both accuracy and computational efficiency across different optimizers, kernels, and datasets.
Conclusion: Projected likelihood offers an effective alternative for GP training on moderately large datasets, balancing computational efficiency with model accuracy through dimensionality reduction techniques.
Abstract: We propose a novel training objective for GPs constructed using lower-dimensional linear projections of the data, referred to as \emph{projected likelihood} (PL). We provide a closed-form expression for the information loss related to the PL and empirically show that it can be reduced with random projections on the unit sphere. We show the superiority of the PL, in terms of accuracy and computational efficiency, over the exact GP training and the variational free energy approach to sparse GPs over different optimisers, kernels and datasets of moderately large sizes.
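A minimal sketch of the projected-likelihood computation, assuming a zero-mean GP with the noise folded into the kernel matrix: project the data onto m random unit-norm directions and evaluate the exact Gaussian likelihood of the projection, reducing the cubic cost from n to m.

```python
import numpy as np

def projected_log_likelihood(y, K, m, seed=0):
    """Projected likelihood of a zero-mean GP.

    `y`: (n,) observations; `K`: (n, n) kernel matrix (noise included).
    Projects onto m << n random unit-sphere directions and evaluates the
    exact Gaussian log-likelihood of the projection at O(m^3) cost.
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((m, y.shape[0]))
    P /= np.linalg.norm(P, axis=1, keepdims=True)   # rows on the unit sphere
    y_p = P @ y
    K_p = P @ K @ P.T
    _, logdet = np.linalg.slogdet(K_p)
    quad = y_p @ np.linalg.solve(K_p, y_p)
    return -0.5 * (quad + logdet + m * np.log(2 * np.pi))
```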
[504] Causal Pre-training Under the Fairness Lens: An Empirical Study of TabPFN
Qinyi Liu, Mohammad Khalil, Naman Goel
Main category: cs.LG
TL;DR: TabPFN foundation models show strong predictive accuracy and robustness to spurious correlations, but fairness improvements are limited and inconsistent, especially under MNAR covariate shifts.
Details
Motivation: The paper investigates the fairness properties of foundation models for tabular data (TabPFN), which are pre-trained on synthetic datasets from structural causal models. While these models show strong predictive performance, their fairness characteristics remain underexplored despite incorporating causal reasoning during pre-training.
Method: The authors conduct a comprehensive empirical evaluation of TabPFN and its fine-tuned variants. They assess predictive performance, fairness, and robustness across varying dataset sizes and distributional shifts, including missing-not-at-random (MNAR) covariate shifts.
Result: TabPFN achieves stronger predictive accuracy compared to baselines and exhibits robustness to spurious correlations. However, improvements in fairness are moderate and inconsistent, particularly under MNAR covariate shifts.
Conclusion: Causal pre-training in TabPFN is helpful but insufficient for algorithmic fairness. The findings highlight important implications for deploying TabPFN and similar models in practice, emphasizing the need for additional fairness interventions beyond causal pre-training alone.
Abstract: Foundation models for tabular data, such as the Tabular Prior-data Fitted Network (TabPFN), are pre-trained on a massive number of synthetic datasets generated by structural causal models (SCM). They leverage in-context learning to offer high predictive accuracy in real-world tasks. However, the fairness properties of these foundational models, which incorporate ideas from causal reasoning during pre-training, remain underexplored. In this work, we conduct a comprehensive empirical evaluation of TabPFN and its fine-tuned variants, assessing predictive performance, fairness, and robustness across varying dataset sizes and distributional shifts. Our results reveal that while TabPFN achieves stronger predictive accuracy compared to baselines and exhibits robustness to spurious correlations, improvements in fairness are moderate and inconsistent, particularly under missing-not-at-random (MNAR) covariate shifts. These findings suggest that the causal pre-training in TabPFN is helpful but insufficient for algorithmic fairness, highlighting implications for deploying TabPFN (and similar) models in practice and the need for further fairness interventions.
[505] Predicting Startup Success Using Large Language Models: A Novel In-Context Learning Approach
Abdurahman Maarouf, Alket Bakiaj, Stefan Feuerriegel
Main category: cs.LG
TL;DR: Proposes kNN-ICL, a k-nearest-neighbor-based in-context learning framework using LLMs for startup success prediction, achieving higher accuracy than supervised ML with only 50 examples.
Details
Motivation: Predicting early-stage startup success is challenging due to data scarcity in VC firms, limiting traditional ML methods that require large labeled datasets.
Method: kNN-ICL framework: uses LLMs with in-context learning, selects most relevant past startups as examples based on similarity, requires no model training, leverages small labeled datasets as demonstrations.
Result: kNN-ICL achieves higher prediction accuracy than supervised ML baselines and vanilla in-context learning using Crunchbase data; high balanced accuracy achieved with as few as 50 examples.
Conclusion: In-context learning can serve as an effective decision-making tool for VC firms operating in data-scarce environments, overcoming limitations of traditional ML methods.
Abstract: Venture capital (VC) investments in early-stage startups that end up being successful can yield high returns. However, predicting early-stage startup success remains challenging due to data scarcity (e.g., many VC firms have information about only a few dozen early-stage startups and whether they were successful). This limits the effectiveness of traditional machine learning methods that rely on large labeled datasets for model training. To address this challenge, we propose an in-context learning framework for startup success prediction using large language models (LLMs) that requires no model training and leverages only a small set of labeled startups as demonstration examples. Specifically, we propose a novel k-nearest-neighbor-based in-context learning framework, called kNN-ICL, which selects the most relevant past startups as examples based on similarity. Using real-world profiles from Crunchbase, we find that the kNN-ICL approach achieves higher prediction accuracy than supervised machine learning baselines and vanilla in-context learning. Further, we study how performance varies with the number of in-context examples and find that a high balanced accuracy can be achieved with as few as 50 examples. Together, we demonstrate that in-context learning can serve as a decision-making tool for VC firms operating in data-scarce environments.
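A minimal sketch of kNN-based example selection and prompt assembly; the embedding model, similarity measure, and prompt template below are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def build_knn_icl_prompt(query_text, query_emb, bank_texts, bank_labels,
                         bank_embs, k=5):
    """Select the k most similar labeled startups and build an ICL prompt."""
    sims = bank_embs @ query_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(query_emb))
    nearest = np.argsort(-sims)[:k]
    demos = "\n\n".join(
        f"Startup: {bank_texts[i]}\nSuccessful: {bank_labels[i]}"
        for i in nearest)
    return f"{demos}\n\nStartup: {query_text}\nSuccessful:"
```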
[506] Rethinking Benchmarks for Differentially Private Image Classification
Sabrina Mokhtari, Sara Kodeiri, Shubhankar Mohapatra, Florian Tramèr, Gautam Kamath
Main category: cs.LG
TL;DR: The paper proposes comprehensive benchmarks for differentially private image classification across various settings and creates a public leaderboard to track progress.
Details
Motivation: There's a need for standardized, comprehensive benchmarks to evaluate differentially private machine learning techniques across different scenarios, as existing benchmarks may not cover the full range of practical settings.
Method: The authors design a comprehensive set of benchmarks covering different settings (with/without additional data, convex/non-convex, various datasets) and test established techniques on these benchmarks to assess their effectiveness.
Result: The paper provides benchmark results showing which techniques remain effective in different settings, and establishes a publicly available leaderboard for the community to track progress in differentially private ML.
Conclusion: The proposed benchmarks and leaderboard will help standardize evaluation and accelerate progress in differentially private machine learning research by providing comprehensive testing grounds and enabling community-wide tracking of advancements.
Abstract: We revisit benchmarks for differentially private image classification. We suggest a comprehensive set of benchmarks, allowing researchers to evaluate techniques for differentially private machine learning in a variety of settings, including with and without additional data, in convex settings, and on a variety of qualitatively different datasets. We further test established techniques on these benchmarks in order to see which ideas remain effective in different settings. Finally, we create a publicly available leaderboard for the community to track progress in differentially private machine learning.
[507] SpecBridge: Bridging Mass Spectrometry and Molecular Representations via Cross-Modal Alignment
Yinkai Wang, Yan Zhou Chen, Xiaohui Chen, Li-Ping Liu, Soha Hassoun
Main category: cs.LG
TL;DR: SpecBridge improves small-molecule identification from MS/MS spectra by aligning spectral embeddings directly into a frozen molecular foundation model’s latent space, achieving 20-25% accuracy gains over neural baselines.
Details
Motivation: Current deep learning approaches for small-molecule identification from tandem mass spectrometry (MS/MS) face limitations: explicit generative models construct molecular graphs atom-by-atom, while joint contrastive models learn cross-modal subspaces from scratch. Both extremes are suboptimal for spectral library matching in untargeted settings.
Method: SpecBridge uses an implicit alignment framework that treats structure identification as a geometric alignment problem. It fine-tunes a self-supervised spectral encoder (DreaMS) to project directly into the latent space of a frozen molecular foundation model (ChemBERTa), then performs retrieval by cosine similarity to precomputed molecular embeddings.
Result: Across MassSpecGym, Spectraverse, and MSnLib benchmarks, SpecBridge improves top-1 retrieval accuracy by roughly 20-25% relative to strong neural baselines, while keeping the number of trainable parameters small.
Conclusion: Aligning to frozen foundation models is a practical, stable alternative to designing new architectures from scratch for small-molecule identification from MS/MS spectra.
Abstract: Small-molecule identification from tandem mass spectrometry (MS/MS) remains a bottleneck in untargeted settings where spectral libraries are incomplete. While deep learning offers a solution, current approaches typically fall into two extremes: explicit generative models that construct molecular graphs atom-by-atom, or joint contrastive models that learn cross-modal subspaces from scratch. We introduce SpecBridge, a novel implicit alignment framework that treats structure identification as a geometric alignment problem. SpecBridge fine-tunes a self-supervised spectral encoder (DreaMS) to project directly into the latent space of a frozen molecular foundation model (ChemBERTa), and then performs retrieval by cosine similarity to a fixed bank of precomputed molecular embeddings. Across MassSpecGym, Spectraverse, and MSnLib benchmarks, SpecBridge improves top-1 retrieval accuracy by roughly 20-25% relative to strong neural baselines, while keeping the number of trainable parameters small. These results suggest that aligning to frozen foundation models is a practical, stable alternative to designing new architectures from scratch. The code for SpecBridge is released at https://github.com/HassounLab/SpecBridge.
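Retrieval itself reduces to a cosine-similarity lookup against the fixed bank of molecular embeddings. A minimal sketch, assuming the embeddings are already computed:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_structures(spec_embs, mol_embs, top_k=1):
    """Rank candidate molecules by cosine similarity.

    `spec_embs`: (B, D) spectral-encoder outputs projected into the frozen
    molecular latent space; `mol_embs`: (N, D) fixed, precomputed molecule
    embeddings. Returns the indices of the top-k candidates per spectrum.
    """
    sims = F.normalize(spec_embs, dim=-1) @ F.normalize(mol_embs, dim=-1).T
    return sims.topk(top_k, dim=-1).indices
```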
[508] Power-based Partial Attention: Bridging Linear-Complexity and Full Attention
Yufeng Huang
Main category: cs.LG
TL;DR: The paper introduces Power-based Partial Attention (PPA), a sub-quadratic attention mechanism with complexity O(L^{1+p}) where 0≤p≤1, showing that attention scaling can be reduced without significant performance loss.
Details
Motivation: To systematically quantify the amount of attention needed in transformers and determine if quadratic O(L²) attention is necessary or if sub-quadratic attention can achieve comparable performance.
Method: Introduces Power-based Partial Attention (PPA) with complexity O(L^{1+p}) where p controls attention scaling (p=0 is linear sliding window, p=1 is full attention), allowing exploration of transformer performance as a function of p.
Result: Performance shows S-curve behavior: transitions from sliding-window to full attention over a narrow window of p values, plateaus as p→1. Exists 0<p<1 where O(L^{1+p}) attention achieves similar results as O(L²) full attention.
Conclusion: Quadratic attention is not always necessary; sub-quadratic attention mechanisms (O(L^{1+p}) with 0<p<1) can achieve comparable performance to full attention, offering potential efficiency improvements for transformers.
Abstract: It is widely accepted from transformer research that “attention is all we need”, but the amount of attention required has never been systematically quantified. Is quadratic $O(L^2)$ attention necessary, or is there a sub-quadratic attention mechanism that can achieve comparable performance? To answer this question, we introduce power-based partial attention (PPA), an attention mechanism of order $O(L^{1+p})$, where $0 \leq p \leq 1$, such that $p=0$ corresponds to sliding window attention with linear complexity, and $p=1$ corresponds to full attention. With this attention construction, we can explore how transformer architecture performance varies as a function of the attention scaling behavior controlled by $p$. The overall trend from our experiments shows an S-curve-like behavior where the performance transitions from sliding-window (linear-complexity) attention to full attention over a narrow window of $p$ values, and plateaus as $p$ approaches $1$. In our experiments, we show that there exists $0<p<1$ such that $O(L^{1+p})$ attention is sufficient to achieve similar results as $O(L^2)$ full attention.
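To see how p interpolates between the two regimes, here is a sketch of a causal mask whose per-token attention budget grows as L^p; the constant factor and the exact construction are assumptions, and only the O(L^{1+p}) scaling is the point.

```python
import torch

def ppa_mask(L, p, base_window=64):
    """Causal mask with a per-token budget of roughly base_window * L**p keys.

    p = 0 gives plain sliding-window attention (linear total cost); p = 1
    covers the full prefix (quadratic cost); in between, the total work
    scales as O(L^(1+p)).
    """
    window = max(1, int(round(base_window * L ** p)))
    i = torch.arange(L).unsqueeze(1)   # query positions
    j = torch.arange(L).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)
```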
[509] Nearly Optimal Bayesian Inference for Structural Missingness
Chen Liang, Donghua Yang, Yutong Zhao, Tianle Zhang, Shenghang Zhou, Zhiyu Liang, Hengtong Zhang, Hongzhi Wang, Ziqi Li, Xiyang Zhang, Zheng Liang, Yifei Li
Main category: cs.LG
TL;DR: Bayesian approach handles structural missingness by decoupling missing-value posterior learning from label prediction, achieving SOTA results with uncertainty propagation.
Details
Motivation: Structural missingness creates causal loops where prediction needs missing features but inferring them depends on the missingness mechanism. Under MNAR, missing data comes from shifted distributions, and single imputation yields overconfident, biased decisions.
Method: Bayesian framework that decouples (1) learning an in-model missing-value posterior from (2) label prediction via posterior predictive distribution. Uses posterior integration rather than single point estimates, preserving uncertainty propagation.
Result: Achieves state-of-the-art on 43 classification and 15 imputation benchmarks, with finite-sample near Bayes-optimality guarantees under the proposed SCM prior.
Conclusion: The Bayesian decoupling approach provides an “almost-free-lunch”: once the posterior is learned, prediction becomes plug-and-play while maintaining proper uncertainty propagation, solving key challenges of structural missingness.
Abstract: Structural missingness breaks ‘just impute and train’: values can be undefined by causal or logical constraints, and the mask may depend on observed variables, unobserved variables (MNAR), and other missingness indicators. It simultaneously brings (i) a catch-22 with a causal loop: prediction needs the missing features, yet inferring them depends on the missingness mechanism; (ii) distribution shift under MNAR: the unseen are different, and the missing part can come from a shifted distribution; and (iii) the plug-in imputation trap: a single fill-in can lock in uncertainty and yield overconfident, biased decisions. In the Bayesian view, prediction via the posterior predictive distribution integrates over the full model posterior uncertainty, rather than relying on a single point estimate. This framework decouples (i) learning an in-model missing-value posterior from (ii) label prediction by optimizing the predictive posterior distribution, enabling posterior integration. This decoupling yields an in-model almost-free-lunch: once the posterior is learned, prediction is plug-and-play while preserving uncertainty propagation. It achieves SOTA on 43 classification and 15 imputation benchmarks, with finite-sample near Bayes-optimality guarantees under our SCM prior.
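A minimal sketch of the posterior-predictive prediction step: average class probabilities over draws from the learned missing-value posterior instead of committing to one imputation. The `sample_missing` interface is hypothetical.

```python
import torch

@torch.no_grad()
def posterior_predictive(x_obs, miss_mask, sample_missing, classifier, S=64):
    """Average predictions over draws from the missing-value posterior.

    `sample_missing(x_obs, miss_mask)` returns one completed input drawn
    from the learned in-model posterior; `classifier` maps it to logits.
    Approximates p(y | x_obs) = E_posterior[ p(y | x_obs, x_miss) ].
    """
    probs = 0.0
    for _ in range(S):
        x_full = sample_missing(x_obs, miss_mask)
        probs = probs + classifier(x_full).softmax(dim=-1)
    return probs / S
```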
[510] LaCoGSEA: Unsupervised deep learning for pathway analysis via latent correlation
Zhiwei Zheng, Kevin Bryson
Main category: cs.LG
TL;DR: LaCoGSEA is an unsupervised framework combining deep representation learning with pathway statistics for gene expression analysis without predefined labels.
Details
Motivation: Standard pathway enrichment methods like GSEA require phenotypic labels, limiting unsupervised applications. Existing unsupervised extensions capture only linear relationships, while deep learning models lack pathway-specific interpretation methods.
Method: Uses autoencoder to capture non-linear manifolds, proposes global gene-latent correlation metric as proxy for differential expression to generate dense gene rankings without prior labels.
Result: Achieves improved cancer subtype clustering, recovers more biologically meaningful pathways at higher ranks than linear methods, and maintains robustness across experimental protocols and dataset sizes.
Conclusion: LaCoGSEA provides state-of-the-art performance in unsupervised pathway enrichment analysis by integrating deep representation learning with robust pathway statistics.
Abstract: Motivation: Pathway enrichment analysis is widely used to interpret gene expression data. Standard approaches, such as GSEA, rely on predefined phenotypic labels and pairwise comparisons, which limits their applicability in unsupervised settings. Existing unsupervised extensions, including single-sample methods, provide pathway-level summaries but primarily capture linear relationships and do not explicitly model gene-pathway associations. More recently, deep learning models have been explored to capture non-linear transcriptomic structure. However, their interpretation has typically relied on generic explainable AI (XAI) techniques designed for feature-level attribution. As these methods are not designed for pathway-level interpretation in unsupervised transcriptomic analyses, their effectiveness in this setting remains limited. Results: To bridge this gap, we introduce LaCoGSEA (Latent Correlation GSEA), an unsupervised framework that integrates deep representation learning with robust pathway statistics. LaCoGSEA employs an autoencoder to capture non-linear manifolds and proposes a global gene-latent correlation metric as a proxy for differential expression, generating dense gene rankings without prior labels. We demonstrate that LaCoGSEA offers three key advantages: (i) it achieves improved clustering performance in distinguishing cancer subtypes compared to existing unsupervised baselines; (ii) it recovers a broader range of biologically meaningful pathways at higher ranks compared with linear dimensionality reduction and gradient-based XAI methods; and (iii) it maintains high robustness and consistency across varying experimental protocols and dataset sizes. Overall, LaCoGSEA provides state-of-the-art performance in unsupervised pathway enrichment analysis. Availability and implementation: https://github.com/willyzzz/LaCoGSEA
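A minimal sketch of the gene-latent correlation ranking, assuming the score for each gene is its strongest absolute Pearson correlation with any latent dimension; LaCoGSEA's exact aggregation may differ.

```python
import numpy as np

def gene_latent_ranking(X, Z):
    """Rank genes by their strongest correlation with any latent dimension.

    `X`: (samples, genes) expression matrix; `Z`: (samples, latents)
    autoencoder codes. Returns gene indices sorted from most to least
    latent-correlated, usable as a dense ranking for enrichment analysis.
    """
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-8)
    Zc = (Z - Z.mean(0)) / (Z.std(0) + 1e-8)
    corr = Xc.T @ Zc / X.shape[0]         # (genes, latents) Pearson r
    scores = np.abs(corr).max(axis=1)     # one score per gene
    return np.argsort(-scores)
```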
[511] TwinPurify: Purifying gene expression data to reveal tumor-intrinsic transcriptional programs via self-supervised learning
Zhiwei Zheng, Kevin Bryson
Main category: cs.LG
TL;DR: TwinPurify is a self-supervised learning framework that uses adjacent-normal tissue profiles as background guidance to learn continuous tumor embeddings from bulk transcriptomics, enabling disentanglement of tumor-specific signals without external references.
Details
Motivation: Bulk transcriptomic studies are limited by tumor purity variation that obscures tumor-intrinsic signals. Existing deconvolution methods perform well on synthetic mixtures but fail to generalize to real patient cohorts due to unmodeled biological and technical variation.
Method: TwinPurify adapts the Barlow Twins self-supervised objective to learn continuous, high-dimensional tumor embeddings. It leverages adjacent-normal profiles within the same cohort as “background” guidance to disentangle tumor-specific signals without relying on external references, representing a departure from traditional deconvolution approaches.
Result: TwinPurify outperforms conventional representation learning baselines like auto-encoders in recovering tumor-intrinsic and immune signals across multiple large cancer cohorts on both RNA-seq and microarray platforms. The purified embeddings improve molecular subtype and grade classification, enhance survival model concordance, and uncover biologically meaningful pathway activities.
Conclusion: TwinPurify provides a transferable framework for decontaminating bulk transcriptomics, extending the utility of existing clinical datasets for molecular discovery by enabling more accurate tumor-specific signal extraction from bulk data.
Abstract: Advances in single-cell and spatial transcriptomic technologies have transformed tumor ecosystem profiling at cellular resolution. However, large scale studies on patient cohorts continue to rely on bulk transcriptomic data, where variation in tumor purity obscures tumor-intrinsic transcriptional signals and constrains downstream discovery. Many deconvolution methods report strong performance on synthetic bulk mixtures but fail to generalize to real patient cohorts because of unmodeled biological and technical variation. Here, we introduce TwinPurify, a representation learning framework that adapts the Barlow Twins self-supervised objective, representing a fundamental departure from the deconvolution paradigm. Rather than resolving the bulk mixture into discrete cell-type fractions, TwinPurify instead learns continuous, high-dimensional tumor embeddings by leveraging adjacent-normal profiles within the same cohort as “background” guidance, enabling the disentanglement of tumor-specific signals without relying on any external reference. Benchmarked against multiple large cancer cohorts across RNA-seq and microarray platforms, TwinPurify outperforms conventional representation learning baselines like auto-encoders in recovering tumor-intrinsic and immune signals. The purified embeddings improve molecular subtype and grade classification, enhance survival model concordance, and uncover biologically meaningful pathway activities compared to raw bulk profiles. By providing a transferable framework for decontaminating bulk transcriptomics, TwinPurify extends the utility of existing clinical datasets for molecular discovery.
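For reference, a minimal sketch of the Barlow Twins objective that TwinPurify adapts; how tumor and adjacent-normal profiles are actually paired and augmented in the paper is simplified away here.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Drive the cross-correlation of two embedding views to the identity.

    `z_a`, `z_b`: (B, D) batch-paired embeddings. Diagonal terms are pushed
    to 1 (invariance); off-diagonal terms to 0 (redundancy reduction).
    """
    B = z_a.shape[0]
    za = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    zb = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = za.T @ zb / B                         # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```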
cs.MA
[512] Reimagining Peer Review Process Through Multi-Agent Mechanism Design
Ahmad Farooq, Kamran Iqbal
Main category: cs.MA
TL;DR: The paper proposes using multi-agent reinforcement learning to fix peer review in software engineering research by designing incentive-compatible protocols and computational solutions.
Details
Motivation: Peer review in software engineering research is failing under growing submissions, misaligned incentives, and reviewer fatigue, and is perceived as "broken" by the research community.
Method: Model the research community as a stochastic multi-agent system and apply multi-agent reinforcement learning (MARL) to design incentive-compatible protocols. Three interventions: credit-based submission economy, MARL-optimized reviewer assignment, and hybrid verification of review consistency.
Result: The paper presents a conceptual framework with threat models, equity considerations, and phased pilot metrics, establishing a research agenda rather than empirical results.
Conclusion: Peer review dysfunctions are mechanism design failures that can be addressed through computational solutions, with MARL offering a path toward sustainable peer review systems.
Abstract: The software engineering research community faces a systemic crisis: peer review is failing under growing submissions, misaligned incentives, and reviewer fatigue. Community surveys reveal that researchers perceive the process as “broken.” This position paper argues that these dysfunctions are mechanism design failures amenable to computational solutions. We propose modeling the research community as a stochastic multi-agent system and applying multi-agent reinforcement learning to design incentive-compatible protocols. We outline three interventions: a credit-based submission economy, MARL-optimized reviewer assignment, and hybrid verification of review consistency. We present threat models, equity considerations, and phased pilot metrics. This vision charts a research agenda toward sustainable peer review.
[513] The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems
Prateek Gupta, Qiankun Zhong, Hiromu Yakura, Thomas Eisenmann, Iyad Rahwan
Main category: cs.MA
TL;DR: LLM agents in common-pool resource games without explicit rewards, using social learning and punishment to evolve cooperative norms endogenously.
Details
Motivation: Most LLM systems in CPR games provide explicit reward functions, but human cooperation emerges without knowing payoff structures, relying on heuristics, communication, and enforcement instead.
Method: CPR simulation framework without explicit reward signals, embedding cultural-evolutionary mechanisms: social learning (adopting from successful peers) and norm-based punishment based on Ostrom’s principles. Agents learn individually from harvesting, monitoring, and punishing via environmental feedback.
Result: Validated simulation reproduces human behavior findings. Systematic model differences in sustaining cooperation and norm formation across environmental/social conditions (resource-rich vs. scarce; altruistic vs. selfish). Framework serves as testbed for emergent norms in LLM societies.
Conclusion: Framework enables rigorous study of emergent norms in mixed-motive LLM societies, informing AI system design for social/organizational contexts where alignment with cooperative norms is critical for stability, fairness, and governance.
Abstract: A growing body of multi-agent studies with LLMs explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without explicit knowledge of the payoff structure or how individual actions translate into long-run outcomes, relying instead on heuristics, communication, and enforcement. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom’s principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a $2\times2$ grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.
cs.MM
[514] Encoder-Free ECG-Language Models
William Han, Tony Chen, Chaojing Duan, Xiaoyu Song, Yihang Yao, Yuzhe Yang, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Main category: cs.MM
TL;DR: ELF is an encoder-free ECG-Language Model that replaces complex ECG encoders with a single projection layer, achieving state-of-the-art performance while revealing limitations in current evaluation practices.
Details
Motivation: Current ECG-Language Models (ELMs) follow Vision-Language Model designs and depend on pretrained ECG encoders, which adds architectural and training complexity. The authors aim to simplify this by exploring encoder-free approaches.
Method: Introduce ELF, an encoder-free ELM that replaces the ECG encoder with a single projection layer trained jointly with the LLM. Test whether adding architectural biases improves performance, and analyze whether ELMs rely on benchmark artifacts vs. ECG-derived information.
Result: ELF matches or exceeds state-of-the-art ELMs across five datasets, despite using far simpler architecture. The single linear projection remains competitive even when architectural biases are added. Analysis shows ELFs often rely more on benchmark artifacts and language priors than ECG-derived information.
Conclusion: Encoder-free ELMs can achieve state-of-the-art performance with simpler architecture, but current evaluation practices and ELM designs have limitations as models often rely on artifacts rather than genuine ECG understanding.
Abstract: ECG-Language Models (ELMs) extend recent progress in Multimodal Large Language Models (MLLMs) to automated ECG interpretation. However, most ELMs follow Vision-Language Model (VLM) designs and depend on pretrained ECG encoders, adding architectural and training complexity. Inspired by encoder-free VLMs, we introduce ELF, an encoder-free ELM that replaces the ECG encoder with a single projection layer trained jointly with the LLM. Across five datasets, ELF matches or exceeds state-of-the-art ELMs that use far more complex encoders and training pipelines. We also test whether adding architectural biases to ELF improves performance and find that the single linear projection remains competitive. Finally, we show that ELF, and potentially other ELMs, often rely more on benchmark artifacts and language priors than ECG-derived information, highlighting limitations in current evaluation practices and ELM design. All data and code are available at https://github.com/willxxy/ECG-Bench.
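A minimal sketch of the encoder-free front end: patchify the raw ECG and map each patch into the LLM's embedding space with one linear layer. Patch length and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ECGProjector(nn.Module):
    """Encoder-free front end: one linear layer from ECG patches to tokens."""

    def __init__(self, n_leads=12, patch_len=50, d_model=4096):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(n_leads * patch_len, d_model)

    def forward(self, ecg):  # ecg: (B, n_leads, T)
        B, C, _ = ecg.shape
        patches = ecg.unfold(2, self.patch_len, self.patch_len)  # (B, C, N, P)
        patches = patches.permute(0, 2, 1, 3).reshape(B, -1, C * self.patch_len)
        return self.proj(patches)  # (B, N, d_model), prepended to text tokens
```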
[515] Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues
Junchen Fu, Wenhao Deng, Kaiwen Zheng, Alexandros Karatzoglou, Ioannis Arapakis, Yu Ye, Yongxin Ni, Joemon M. Jose, Xuri Ge
Main category: cs.MM
TL;DR: MLLMs struggle with fine-grained missing modality completion in e-commerce despite capturing high-level semantics, with performance varying by product category and no clear correlation to model size.
Details
Motivation: Missing product modalities (images/text) on e-commerce platforms impair product presentation and downstream applications; investigate if MLLMs can generate missing modalities for products.
Method: Proposed MMPCBench with two sub-benchmarks (Content Quality Completion and Recommendation), evaluated 6 SOTA MLLMs from Qwen2.5-VL and Gemma-3 families across 9 e-commerce categories for image-to-text and text-to-image completion tasks.
Result: MLLMs capture high-level semantics but struggle with fine-grained word/pixel alignment; performance varies substantially across categories; no trivial correlation between model size and performance; GRPO improves image-to-text but not text-to-image completion.
Conclusion: Current MLLMs have limitations in real-world cross-modal generation for missing-modality product completion, representing an early step toward more effective solutions.
Abstract: Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata, impairing both product presentation and downstream applications such as recommendation systems. Motivated by the multimodal generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We further evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs can capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. In addition, performance varies substantially across product categories and model scales, and we observe no trivial correlation between model size and performance, in contrast to trends commonly reported in mainstream benchmarks. We also explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but does not yield gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.
[516] Subjective Evaluation of Frame Rate in Bitrate-Constrained Live Streaming
Jiaqi He, Zhengfang Duanmu, Kede Ma
Main category: cs.MM
TL;DR: The HFR-LS dataset provides 384 subject-rated 1080p videos with systematic variations in compression strength and frame rate to study the perceptual trade-offs in live streaming under bandwidth constraints.
Details
Motivation: Bandwidth constraints in live streaming force a trade-off between compression strength and frame rate, but the perceptual consequences of this trade-off are not well understood. There's a need for systematic research on how frame rate affects perceived quality in bitrate-constrained scenarios.
Method: Created the HFR-LS dataset with 384 subject-rated 1080p videos encoded at multiple target bitrates by systematically varying compression strength and frame rate. Conducted a single-stimulus, hidden-reference subjective study to assess perceived quality.
Result: Frame rate has a noticeable effect on perceived quality, and interacts with both bitrate and source content. The study provides empirical evidence of the perceptual trade-offs between compression and frame rate.
Conclusion: The HFR-LS dataset is publicly available to facilitate research on bitrate-constrained live streaming, addressing the underexplored area of perceptual consequences in the compression-strength vs. frame rate trade-off.
Abstract: Bandwidth constraints in live streaming require video codecs to balance compression strength and frame rate, yet the perceptual consequences of this trade-off remain underexplored. We present the high frame rate live streaming (HFR-LS) dataset, comprising 384 subject-rated 1080p videos encoded at multiple target bitrates by systematically varying compression strength and frame rate. A single-stimulus, hidden-reference subjective study shows that frame rate has a noticeable effect on perceived quality, and interacts with both bitrate and source content. The HFR-LS dataset is available at https://github.com/real-hjq/HFR-LS to facilitate research on bitrate-constrained live streaming.
eess.AS
[517] Beyond Lips: Integrating Gesture and Lip Cues for Robust Audio-visual Speaker Extraction
Zexu Pan, Xinyuan Qian, Shengkui Zhao, Kun Zhou, Bin Ma
Main category: eess.AS
TL;DR: SeLG is a speaker extraction model that uses both lip movements and upper-body gestures (not just lips) to isolate target speech from multi-talker audio, with cross-attention fusion and contrastive learning to align gestures with speech.
Details
Motivation: Traditional audio-visual speaker extraction relies on synchronized lip recordings, but co-speech gestures also provide temporally aligned visual cues that can be valuable when facial/lip regions are occluded or distant. Gestures offer complementary information beyond just lip movements.
Method: Proposes SeLG model with cross-attention-based fusion mechanism allowing each visual modality (lip and gesture) to query and selectively attend to relevant speech features. Uses contrastive InfoNCE loss to improve alignment of gesture embeddings with speech dynamics by encouraging them to align more closely with corresponding lip embeddings.
Result: Experimental results on YGD dataset (TED talks) show the contrastive learning strategy significantly improves gesture-based speaker extraction. SeLG achieves superior performance compared to baselines across both complete and partial (missing-modality) conditions.
Conclusion: Moving beyond lip-centric approaches by integrating both lip and gesture information with attention mechanisms and contrastive learning enables more robust speaker extraction, especially useful when facial/lip cues are unavailable or limited.
Abstract: Most audio-visual speaker extraction methods rely on synchronized lip recordings to isolate the speech of a target speaker from a multi-talker mixture. However, in natural human communication, co-speech gestures are also temporally aligned with speech, often emphasizing specific words or syllables. These gestures provide complementary visual cues that can be especially valuable when facial or lip regions are occluded or distant. In this work, we move beyond lip-centric approaches and propose SeLG, a model that integrates both lip and upper-body gesture information for robust speaker extraction. SeLG features a cross-attention-based fusion mechanism that enables each visual modality to query and selectively attend to relevant speech features in the mixture. To improve the alignment of gesture representations with speech dynamics, SeLG also employs a contrastive InfoNCE loss that encourages gesture embeddings to align more closely with corresponding lip embeddings, which are more strongly correlated with speech. Experimental results on the YGD dataset, containing TED talks, demonstrate that the proposed contrastive learning strategy significantly improves gesture-based speaker extraction, and that our proposed SeLG model, by effectively fusing lip and gesture cues with an attention mechanism and InfoNCE loss, achieves superior performance compared to baselines across both complete and partial (i.e., missing-modality) conditions.
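The gesture-to-lip alignment described above is a standard InfoNCE setup. A minimal sketch, assuming batch-aligned gesture/lip embedding pairs and an illustrative temperature rather than SeLG's exact loss code:

```python
import torch
import torch.nn.functional as F

def infonce_gesture_lip(gesture: torch.Tensor, lip: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss pulling each gesture embedding toward its time-aligned
    lip embedding and away from the other lip embeddings in the batch.

    gesture, lip: (batch, dim) embeddings from the two visual streams.
    """
    g = F.normalize(gesture, dim=-1)
    l = F.normalize(lip, dim=-1)
    logits = g @ l.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(g.size(0), device=g.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```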
[518] LuSeeL: Language-queried Binaural Universal Sound Event Extraction and Localization
Zexu Pan, Shengkui Zhao, Yukun Ma, Haoxu Wang, Yiheng Jiang, Biao Tian, Bin Ma
Main category: eess.AS
TL;DR: LuSeeL: A language-driven universal sound extraction network that isolates text-described sound events from binaural audio mixtures while jointly predicting direction of arrival (DoA) using spatial cues.
Details
Motivation: Real-world audio is three-dimensional with rich spatial information that binaural audio captures, but most sound extraction algorithms focus on single-channel audio and miss crucial spatial context for understanding complex auditory scenes.Method: Proposes a dual-task network that extracts text-described sound events from binaural mixtures while simultaneously predicting direction of arrival (DoA) by leveraging spatial cues from binaural signals.
Result: The LuSeeL model significantly outperforms single-channel and uni-task baselines on the in-the-wild AudioCaps dataset.
Conclusion: Jointly modeling sound extraction and direction of arrival prediction using binaural spatial cues improves both tasks, demonstrating the value of spatial information for universal sound extraction.
Abstract: Most universal sound extraction algorithms focus on isolating a target sound event from single-channel audio mixtures. However, the real world is three-dimensional, and binaural audio, which mimics human hearing, can capture richer spatial information, including sound source location. This spatial context is crucial for understanding and modeling complex auditory scenes, as it inherently informs sound detection and extraction. In this work, we propose a language-driven universal sound extraction network that isolates text-described sound events from binaural mixtures by effectively leveraging the spatial cues present in binaural signals. Additionally, we jointly predict the direction of arrival (DoA) of the target sound using spatial features from the extraction network. This dual-task approach exploits complementary location information to improve extraction performance while enabling accurate DoA estimation. Experimental results on the in-the-wild AudioCaps dataset show that our proposed LuSeeL model significantly outperforms single-channel and uni-task baselines.
[519] SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
Alexander Polok, Dominik Klement, Samuele Cornell, Matthew Wiesner, Jan Černocký, Sanjeev Khudanpur, Lukáš Burget
Main category: eess.AS
TL;DR: SE-DiCoW improves speaker-attributed ASR by using self-enrollment segments from diarization to better distinguish overlapping speakers, reducing errors by 52.4% compared to previous DiCoW.
Details
Motivation: The paper addresses a key limitation in speaker-attributed ASR: existing approaches like DiCoW struggle with fully overlapping speakers where STNO masks become ambiguous, making it difficult to distinguish speakers with similar conditioning but different transcriptions.Method: Introduces SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper) which uses diarization output to locate enrollment segments where target speakers are most active, then uses these segments as fixed conditioning via cross-attention at each encoder layer. Also refines DiCoW with improved data segmentation, model initialization, and augmentation techniques.
Result: SE-DiCoW achieves substantial improvements, reducing macro-averaged tcpWER by 52.4% relative to the original DiCoW on the EMMA MT-ASR benchmark, demonstrating significantly better performance in multi-speaker environments.
Conclusion: The self-enrollment approach effectively addresses the ambiguity problem in overlapping speaker scenarios, enabling more robust speaker-attributed ASR that generalizes better across different domains and speaker configurations.
Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most active. This enrollment segment is used as fixed conditioning via cross-attention at each encoder layer. We further refine DiCoW with improved data segmentation, model initialization, and augmentation. Together, these advances yield substantial gains: SE-DiCoW reduces macro-averaged tcpWER by 52.4% relative to the original DiCoW on the EMMA MT-ASR benchmark.
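Reduced to its simplest form, the self-enrollment step is a windowed argmax over the target speaker's diarization activity. The sketch below assumes a binary frame-level activity track and a fixed enrollment length; the paper's selection logic may be more elaborate.

```python
import numpy as np

def pick_enrollment(target_activity: np.ndarray, win: int) -> tuple[int, int]:
    """Slide a fixed-length window over the target speaker's frame-level
    diarization activity (1 = speaking) and return the window where the
    speaker is most active, to be used as the enrollment segment.
    """
    # Cumulative sum lets us score every candidate window in O(n).
    cs = np.concatenate([[0], np.cumsum(target_activity)])
    scores = cs[win:] - cs[:-win]              # activity count per window
    start = int(np.argmax(scores))
    return start, start + win

activity = np.array([0, 1, 1, 0, 1, 1, 1, 1, 0, 0])
print(pick_enrollment(activity, win=4))        # -> (4, 8)
```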
[520] Permutation-Invariant Physics-Informed Neural Network for Region-to-Region Sound Field Reconstruction
Xingyu Chen, Sipei Zhao, Fei Ma, Eva Cheng, Ian S. Burnett
Main category: eess.AS
TL;DR: A physics-informed neural network for region-to-region sound field reconstruction that handles continuously varying sound source and receiver positions using deep set architecture and Helmholtz equation constraints.
Details
Motivation: Existing sound field reconstruction methods are limited to point-to-region reconstruction with fixed sound source positions, but real-world acoustic transfer functions vary continuously with both sound source and receiver positions. There's a need for more flexible region-to-region reconstruction.Method: Proposes a permutation-invariant physics-informed neural network using deep set architecture to process receiver and sound source positions as unordered sets (preserving acoustic reciprocity). Incorporates the Helmholtz equation as a physical constraint during training to ensure physically consistent predictions.
Result: The method enables interpolation of acoustic transfer functions across continuously varying sound sources and measurement regions, overcoming limitations of traditional point-to-region approaches.
Conclusion: The proposed approach provides a more flexible and physically consistent solution for sound field reconstruction that handles real-world scenarios where both sound sources and receivers can vary continuously in position.
Abstract: Most existing sound field reconstruction methods target point-to-region reconstruction, interpolating the Acoustic Transfer Functions (ATFs) between a fixed-position sound source and a receiver region. The applicability of these methods is limited because real-world ATFs tend to vary continuously with respect to the positions of sound sources and receiver regions. This paper presents a permutation-invariant physics-informed neural network for region-to-region sound field reconstruction, which aims to interpolate the ATFs across continuously varying sound sources and measurement regions. The proposed method employs a deep set architecture to process the receiver and sound source positions as an unordered set, preserving acoustic reciprocity. Furthermore, it incorporates the Helmholtz equation as a physical constraint to guide network training, ensuring physically consistent predictions.
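A Helmholtz constraint is typically imposed as a squared PDE residual at collocation points. A minimal sketch, assuming a pointwise network over 3-D positions and a real-valued pressure field (in practice ATFs are complex-valued, so real and imaginary parts would each carry a residual term):

```python
import torch

def helmholtz_residual(model, xyz: torch.Tensor, k: float) -> torch.Tensor:
    """Physics-informed loss term: mean squared residual of the Helmholtz
    equation  laplacian(p) + k^2 * p = 0  at collocation points xyz (N, 3).
    `model` maps positions to a scalar pressure value, output shape (N, 1).
    """
    xyz = xyz.clone().requires_grad_(True)
    p = model(xyz)                                                   # (N, 1)
    grad = torch.autograd.grad(p.sum(), xyz, create_graph=True)[0]   # (N, 3)
    lap = 0
    for i in range(3):   # Laplacian = sum of second derivatives per axis
        lap = lap + torch.autograd.grad(grad[:, i].sum(), xyz,
                                        create_graph=True)[0][:, i]
    residual = lap + (k ** 2) * p.squeeze(-1)
    return (residual ** 2).mean()
```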
[521] Audio Deepfake Detection at the First Greeting: “Hi!”
Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang
Main category: eess.AS
TL;DR: S-MGAA: Lightweight audio deepfake detector for ultra-short (0.5-2.0s) degraded speech, optimized for real-world communication scenarios like scam detection.
Details
Motivation: Need to detect synthetic speech in real-world communication scenarios with ultra-short inputs (e.g., scammer saying "Hi") that suffer from communication degradations and processing artifacts.Method: Proposes Short-MGAA (S-MGAA), a lightweight extension of Multi-Granularity Adaptive Time-Frequency Attention with two modules: Pixel-Channel Enhanced Module (PCEM) for fine-grained time-frequency saliency, and Frequency Compensation Enhanced Module (FCEM) for multi-scale frequency modeling and adaptive frequency-temporal interaction.
Result: S-MGAA consistently outperforms nine state-of-the-art baselines, shows strong robustness to degradations, achieves favorable efficiency-accuracy trade-offs with low RTF, competitive GFLOPs, compact parameters, and reduced training cost.
Conclusion: S-MGAA demonstrates strong potential for real-time deployment in communication systems and edge devices for detecting synthetic speech in ultra-short, degraded audio inputs.
Abstract: This paper focuses on audio deepfake detection under real-world communication degradations, with an emphasis on ultra-short inputs (0.5-2.0s), targeting the capability to detect synthetic speech at a conversation opening, e.g., when a scammer says “Hi.” We propose Short-MGAA (S-MGAA), a novel lightweight extension of Multi-Granularity Adaptive Time-Frequency Attention, designed to enhance discriminative representation learning for short, degraded inputs subjected to communication processing and perturbations. The S-MGAA integrates two tailored modules: a Pixel-Channel Enhanced Module (PCEM) that amplifies fine-grained time-frequency saliency, and a Frequency Compensation Enhanced Module (FCEM) to supplement limited temporal evidence via multi-scale frequency modeling and adaptive frequency-temporal interaction. Extensive experiments demonstrate that S-MGAA consistently surpasses nine state-of-the-art baselines while achieving strong robustness to degradations and favorable efficiency-accuracy trade-offs, including low RTF, competitive GFLOPs, compact parameters, and reduced training cost, highlighting its strong potential for real-time deployment in communication systems and edge devices.
[522] SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation
Helin Wang, Bowen Shi, Andros Tjandra, John Hoffman, Yi-Chiao Wu, Apoorv Vyas, Najim Dehak, Ann Lee, Wei-Ning Hsu
Main category: eess.AS
TL;DR: SAJ is a multimodal reference-free metric for audio separation evaluation that aligns with human perception across speech, music, and sound events using text/visual/span prompts.
Details
Motivation: Existing audio separation metrics are misaligned with human perception, coarse-grained, and require ground truth signals, while subjective tests are expensive and unscalable.Method: Proposes SAM Audio Judge (SAJ), a multimodal fine-grained reference-free objective metric that supports three audio domains and three prompt inputs, evaluating four dimensions.
Result: SAJ shows high alignment with human perceptions and demonstrates potential applications in data filtering, pseudo-labeling large datasets, and reranking in audio separation models.
Conclusion: SAJ addresses the need for automated audio separation evaluation without human intervention, offering a scalable alternative to subjective tests while maintaining perceptual alignment.
Abstract: Performance evaluation remains a complex challenge in audio separation: existing evaluation metrics are often misaligned with human perception, coarse-grained, and reliant on ground truth signals. On the other hand, subjective listening tests remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal fine-grained reference-free objective metric, which shows high alignment with human perceptions. SAJ supports three audio domains (speech, music and general sound events) and three prompt inputs (text, visual and span), covering four different dimensions of evaluation (recall, precision, faithfulness, and overall). SAM Audio Judge also shows potential applications in data filtering, pseudo-labeling large datasets and reranking in audio separation models. We release our code and pre-trained models at: https://github.com/facebookresearch/sam-audio.
[523] Rethinking Discrete Speech Representation Tokens for Accent Generation
Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell
Main category: eess.AS
TL;DR: First systematic investigation of accent information in Discrete Speech Representation Tokens (DSRTs), revealing that ASR supervision reduces accent encoding and proposing new content-only and content-accent DSRTs for better accent control.
Details
Motivation: While phonetic and speaker information in DSRTs has been extensively studied, how accent information is encoded remains largely unexplored, creating a gap in understanding for accent-controlled speech generation applications.Method: Proposed unified evaluation framework with Accent ABX task (accessibility) and cross-accent Voice Conversion resynthesis (recoverability). Analyzed DSRTs from various speech encoders, then designed new content-only and content-accent DSRTs based on findings.
Result: Accent information is substantially reduced when ASR supervision fine-tunes encoders, and cannot be effectively disentangled through naive codebook size reduction. New proposed DSRTs significantly outperform existing designs in controllable accent generation.
Conclusion: Highlights importance of accent-aware evaluation and provides practical guidance for designing DSRTs for accent-controlled speech generation, with new proposed DSRTs offering improved performance.
Abstract: Discrete Speech Representation Tokens (DSRTs) have become a foundational component in speech generation. While prior work has extensively studied phonetic and speaker information in DSRTs, how accent information is encoded in DSRTs remains largely unexplored. In this paper, we present the first systematic investigation of accent information in DSRTs. We propose a unified evaluation framework that measures both accessibility of accent information via a novel Accent ABX task and recoverability via cross-accent Voice Conversion (VC) resynthesis. Using this framework, we analyse DSRTs derived from a variety of speech encoders. Our results reveal that accent information is substantially reduced when ASR supervision is used to fine-tune the encoder, but cannot be effectively disentangled from phonetic and speaker information through naive codebook size reduction. Based on these findings, we propose new content-only and content-accent DSRTs that significantly outperform existing designs in controllable accent generation. Our work highlights the importance of accent-aware evaluation and provides practical guidance for designing DSRTs for accent-controlled speech generation.
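A rough sketch of an ABX probe in the accent setting, assuming A shares the accent of X while B does not, with cosine distance over segment embeddings. This is one standard ABX formulation, not necessarily the paper's exact protocol.

```python
import numpy as np

def abx_accuracy(a: np.ndarray, b: np.ndarray, x: np.ndarray) -> float:
    """Accent ABX: A shares the accent of X, B does not. A triplet is
    scored correct if X lies closer (cosine distance) to A than to B.
    Inputs: (n_triplets, dim) segment embeddings.
    """
    def cos_dist(u, v):
        u = u / np.linalg.norm(u, axis=-1, keepdims=True)
        v = v / np.linalg.norm(v, axis=-1, keepdims=True)
        return 1.0 - (u * v).sum(axis=-1)
    return float(np.mean(cos_dist(a, x) < cos_dist(b, x)))
```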
[524] Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective
Hankun Wang, Haoran Wang, Yiwei Guo, Zhihan Li, Chenpeng Du, Kai Yu
Main category: eess.AS
TL;DR: This paper investigates why speech language models (SLMs) underperform compared to text-based LLMs, identifying three key factors: phonetic vs semantic information in speech tokens, longer sequence lengths, and paralinguistic complexity. The study systematically analyzes each factor’s impact through modality transition experiments.
Details
Motivation: Speech language models struggle to generate semantically coherent outputs compared to text-based LLMs, despite LLMs showing human-level writing ability. The researchers want to understand the specific reasons for this performance gap by examining three potential factors: (A) speech tokens providing phonetic rather than semantic information, (B) longer speech sequence lengths, and (C) paralinguistic information complexity.Method: The researchers use an “evolving manner” approach by transiting the modality from text to speech systematically. They explore the influence of the three key factors separately through controlled experiments that isolate each factor’s impact on model performance.
Result: The study reveals varying impacts: Factor A (phonetic vs semantic information) has relatively minor impact; Factor B (sequence length) influences syntactical and semantic modeling more obviously; Factor C (paralinguistic information) exerts the most significant impact, particularly in basic lexical modeling.
Conclusion: The findings provide insights into the unique challenges of training speech language models and highlight pathways to develop more effective end-to-end SLMs by addressing the specific impacts of sequence length and paralinguistic complexity.
Abstract: Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much longer than that of text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of three key factors separately by transitioning the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.
[525] EDM2SE: A Magnitude-Preserving Network Architecture for Diffusion-Based Speech Enhancement
Julius Richter, Danilo de Oliveira, Timo Gerkmann
Main category: eess.AS
TL;DR: Extends EDM2 framework to diffusion-based speech enhancement using Schrodinger bridge formulation with time-dependent preconditioning, magnitude-preserving architecture, and skip-connection variants for predicting noise or clean speech.
Details
Motivation: To improve diffusion-based speech enhancement by adapting the EDM2 framework with Schrodinger bridge formulation, addressing training stability and exploring architectural choices specific to speech enhancement tasks.Method: Uses Schrodinger bridge formulation with time-dependent preconditioning of network inputs/outputs, magnitude-preserving architecture, two skip-connection variants (predicting environmental noise or clean speech), and analyzes EMA parameter smoothing effects.
Result: Achieves competitive signal-to-distortion ratios and perceptual scores on VoiceBank-DEMAND and EARS-WHAM datasets, with skip-connection variants showing complementary strengths. Finds short or absent EMA yields better performance than in image generation.
Conclusion: Provides new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement, demonstrating effective adaptation of EDM2 framework to speech tasks.
Abstract: We study diffusion-based speech enhancement using a Schrodinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of network inputs and outputs to stabilize training and explore two skip-connection configurations that allow the network to predict either environmental noise or clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles post training, finding that, unlike in image generation, short or absent EMA consistently yields better speech enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios and perceptual scores, with the two skip-connection variants exhibiting complementary strengths. These findings provide new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement.
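For reference, below is the classical EMA weight-smoothing update whose profile length the paper analyzes (EDM2 additionally reconstructs different EMA profiles post hoc, which this sketch does not attempt).

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999):
    """One step of exponential moving average parameter smoothing.
    A shorter EMA corresponds to a smaller decay (faster tracking);
    decay -> 0 recovers the raw weights, matching the paper's finding
    that short or absent EMA works better for speech enhancement.
    """
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```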
[526] Confidence intervals for forced alignment boundaries using model ensembles
Matthew C. Kelley
Main category: eess.AS
TL;DR: Neural network ensemble method creates confidence intervals for forced alignment boundaries, improving accuracy and providing uncertainty estimates.
Details
Motivation: Traditional forced alignment tools provide only single boundary estimates without uncertainty measures, making it difficult to assess reliability or identify boundaries needing review.Method: Uses 10 pre-trained neural network classifiers in ensemble alignment, places boundaries at median of ensemble outputs, and constructs 97.85% confidence intervals using order statistics.
Result: Slight overall improvement in boundary accuracy on Buckeye and TIMIT corpora compared to single models, plus ability to output confidence intervals as JSON and Praat TextGrids.
Conclusion: Ensemble-based confidence intervals provide valuable uncertainty estimates for forced alignment boundaries, enabling better quality control and statistical analysis while slightly improving accuracy.
Abstract: Forced alignment is a common tool to align audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single estimate of a boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. The alignment ensemble is then used to place the boundary at the median of the boundaries in the ensemble, and 97.85% confidence intervals are constructed using order statistics. Having confidence intervals provides an estimate of the uncertainty in the boundary placement, facilitating tasks like finding boundaries that should be reviewed. As a bonus, on the Buckeye and TIMIT corpora, the ensemble boundaries show a slight overall improvement over using just a single model. The confidence intervals can be emitted during the alignment process as JSON files and a main table for programmatic and statistical analysis. For familiarity, they are also output as Praat TextGrids using a point tier to represent the intervals.
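The 97.85% figure is consistent with a distribution-free order-statistic interval for the median of a ten-member ensemble: taking the 2nd and 9th order statistics gives exact coverage of sum_{j=2}^{8} C(10, j) / 2^10 = 1002/1024 ≈ 97.85%. A sketch under that assumption (illustrative boundary times, not from the paper):

```python
import numpy as np
from math import comb

def median_ci(boundaries: np.ndarray, lo_rank: int = 2, hi_rank: int = 9):
    """Boundary estimate and distribution-free confidence interval from an
    ensemble of n=10 aligner outputs: place the boundary at the ensemble
    median and take the 2nd and 9th order statistics as the interval."""
    s = np.sort(boundaries)
    n = len(s)
    # Exact coverage of (X_(lo_rank), X_(hi_rank)) for the median.
    coverage = sum(comb(n, j) for j in range(lo_rank, hi_rank)) / 2 ** n
    return np.median(s), (s[lo_rank - 1], s[hi_rank - 1]), coverage

times = np.array([0.512, 0.498, 0.505, 0.520, 0.500,
                  0.515, 0.507, 0.495, 0.510, 0.503])
print(median_ci(times))   # median, (lower, upper), coverage = 0.9785...
```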
[527] Transfer Learning for Paediatric Sleep Apnoea Detection Using Physiology-Guided Acoustic Models
Chaoyue Niu, Veronica Rowe, Guy J. Brown, Heather Elphick, Heather Kenyon, Lowri Thomas, Sam Johnson, Ning Ma
Main category: eess.AS
TL;DR: Transfer learning framework adapts adult sleep acoustic models to pediatric OSA detection using SpO2 integration, showing improved performance over baseline models.
Details
Motivation: Pediatric OSA is clinically significant but difficult to diagnose due to poor tolerance of sensor-based polysomnography in children. Acoustic monitoring offers a non-invasive home-based screening alternative, but limited pediatric data hinders robust deep learning development.Method: Proposes transfer learning framework that adapts acoustic models pretrained on adult sleep data (157 nights) to pediatric OSA detection (15 nights). Incorporates SpO2-based desaturation patterns to enhance training. Systematically evaluates: (1) single- vs multi-task learning, (2) encoder freezing vs full fine-tuning, (3) impact of delaying SpO2 labels to better align with acoustics and capture physiologically meaningful features.
Result: Fine-tuning with SpO2 integration consistently improves pediatric OSA detection compared to baseline models without adaptation. The transfer learning approach demonstrates feasibility for home-based OSA screening in children.
Conclusion: Transfer learning from adult to pediatric data with SpO2 integration is feasible and effective for home-based OSA screening in children, offering potential clinical value for early diagnosis.
Abstract: Paediatric obstructive sleep apnoea (OSA) is clinically significant yet difficult to diagnose, as children poorly tolerate sensor-based polysomnography. Acoustic monitoring provides a non-invasive alternative for home-based OSA screening, but limited paediatric data hinders the development of robust deep learning approaches. This paper proposes a transfer learning framework that adapts acoustic models pretrained on adult sleep data to paediatric OSA detection, incorporating SpO2-based desaturation patterns to enhance model training. Using a large adult sleep dataset (157 nights) and a smaller paediatric dataset (15 nights), we systematically evaluate (i) single- versus multi-task learning, (ii) encoder freezing versus full fine-tuning, and (iii) the impact of delaying SpO2 labels to better align them with the acoustics and capture physiologically meaningful features. Results show that fine-tuning with SpO2 integration consistently improves paediatric OSA detection compared with baseline models without adaptation. These findings demonstrate the feasibility of transfer learning for home-based OSA screening in children and offer its potential clinical value for early diagnosis.
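The encoder-freezing versus full fine-tuning comparison boils down to toggling `requires_grad` on the pretrained encoder. A minimal sketch; the attribute names `encoder` and `head` are illustrative, not the paper's code.

```python
import torch.nn as nn

def configure_transfer(model: nn.Module, freeze_encoder: bool):
    """Two adaptation regimes compared in the paper: freeze the
    adult-pretrained acoustic encoder and train only the task head(s),
    or fine-tune everything on the small paediatric set."""
    for p in model.encoder.parameters():
        p.requires_grad = not freeze_encoder
    for p in model.head.parameters():
        p.requires_grad = True
    # Return only trainable parameters for the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```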
[528] SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes
Dayun Choi, Jung-Woo Choi
Main category: eess.AS
TL;DR: SoundCompass: A TSE framework using SPIN module to capture cross-channel spatial correlations, spherical harmonics encoding for DoA clues, and chain-of-inference iterative refinement for robust target sound extraction.
Details
Motivation: Previous DoA-based TSE methods use hand-crafted features or discrete encodings that lose fine-grained spatial information and limit adaptability. There's a need for better preservation of full spatial information in multichannel signals.Method: 1) SPIN module captures cross-channel spatial correlations in complex spectrogram domain; 2) Uses spherical harmonics encoding for DoA clues; 3) Fuses features across overlapping frequency subbands (band-split architecture); 4) Incorporates chain-of-inference iterative refinement that recursively fuses DoA with sound event activation.
Result: SoundCompass robustly extracts target sources across diverse signal classes and spatial configurations, demonstrating effectiveness of combining SPIN, SH embedding, and CoI.
Conclusion: SoundCompass provides an effective directional clue integration framework that preserves full spatial information and enables robust target sound extraction through spatial correlation capture, spherical harmonics encoding, and iterative refinement.
Abstract: Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), which represent an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain to preserve full spatial information in multichannel signals. The input feature expressed in terms of spatial correlations is fused with a DoA clue represented as spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported in the previous band-split architectures. We also incorporate the iterative refinement strategy, chain-of-inference (CoI), in the TSE framework, which recursively fuses DoA with sound event activation estimated from the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.
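One possible realization of the spherical-harmonics DoA encoding, building the usual real SH basis from SciPy's complex harmonics. The conventions, order, and normalization here are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.special import sph_harm

def sh_encode(azimuth: float, zenith: float, max_order: int = 3) -> np.ndarray:
    """Encode a DoA as a vector of real spherical harmonics up to
    `max_order`, a smooth alternative to discretized direction labels.
    SciPy convention: sph_harm(m, n, azimuth, zenith)."""
    feats = []
    for n in range(max_order + 1):
        for m in range(-n, n + 1):
            y = sph_harm(abs(m), n, azimuth, zenith)
            if m < 0:
                feats.append(np.sqrt(2) * y.imag)
            elif m == 0:
                feats.append(y.real)
            else:
                feats.append(np.sqrt(2) * y.real)
    return np.asarray(feats)                   # (max_order + 1)**2 values

print(sh_encode(np.pi / 4, np.pi / 3).shape)   # (16,)
```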
[529] Short-Segment Speaker Verification with Pre-trained Models and Multi-Resolution Encoder
Jisoo Myoung, Sangwook Han, Kihyuk Kim, Jong Won Shin
Main category: eess.AS
TL;DR: Proposes a speaker verification system combining pre-trained model features with filterbank features and multi-resolution time domain encoder features using very short window shifts (1.56-12.5 ms) to improve performance on short-segment speaker verification.
Details
Motivation: Current self-supervised pre-trained models have low temporal resolution (20 ms), which is problematic for short-segment speaker verification where limited input length requires extracting maximum information. Existing multi-resolution approaches only consider lower resolution features (20, 40, 100 ms).Method: Proposes a speaker verification system that combines: 1) features from pre-trained models, 2) traditional filterbank features, and 3) features from a multi-resolution time domain encoder with very short window shifts (1.56, 3.13, 6.25, and 12.5 ms).
Result: Experimental results on the VoxCeleb dataset with various input lengths showed consistent improvements over systems with various combinations of input features.
Conclusion: The proposed multi-resolution approach with very short window shifts effectively improves speaker verification performance, especially for short input segments, by capturing finer temporal details that are missed by standard pre-trained models.
Abstract: Speaker verification (SV) utilizing features obtained from models pre-trained via self-supervised learning has recently demonstrated impressive performances. However, these pre-trained models (PTMs) usually have a temporal resolution of 20 ms, which is lower than typical filterbank features. It may be problematic especially for short-segment SV with an input segment shorter than 2 s, in which we need to extract as much information as possible from the input with a limited length. Although there have been approaches to utilize multi-resolution features from the HuBERT models, the window shifts were 20, 40, and 100 ms when the sampling rate was 16 kHz and thus only lower resolution features were considered. In this study, we propose an SV system which utilizes PTM features along with filterbank features and those from the multi-resolution time domain encoder with window shifts of 1.56, 3.13, 6.25, and 12.5 ms. Experimental results on the VoxCeleb dataset with various input lengths showed consistent improvements over systems with various combinations of input features.
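The stated shifts correspond to hop sizes of roughly 25/50/100/200 samples at 16 kHz, which a bank of strided 1-D convolutions can realize. The kernel sizes (2x the stride) and channel count below are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiResEncoder(nn.Module):
    """Bank of time-domain conv encoders whose strides realize window
    shifts of 1.56/3.13/6.25/12.5 ms at 16 kHz (~25/50/100/200 samples)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, channels, kernel_size=2 * hop, stride=hop)
            for hop in (25, 50, 100, 200)
        )

    def forward(self, wav: torch.Tensor):        # wav: (batch, 1, samples)
        return [torch.relu(branch(wav)) for branch in self.branches]

enc = MultiResEncoder()
feats = enc(torch.randn(2, 1, 16000))            # four feature maps,
print([f.shape[-1] for f in feats])              # finest to coarsest frame rate
```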
[530] Unsupervised lexicon learning from speech is limited by representations rather than clustering
Danel Slabbert, Simon Malan, Herman Kamper
Main category: eess.AS
TL;DR: The paper investigates whether performance limitations in zero-resource word segmentation and clustering come from word segment representations or clustering methods, finding that representation variability across same-word segments is the primary bottleneck.
Details
Motivation: Despite progress in zero-resource word segmentation and clustering systems, the induced lexicons are still far from perfect. The research aims to identify whether performance limitations stem from how word segments are represented or from the clustering methods that group them into word-like types.Method: The study combines various self-supervised speech features (continuous/discrete, frame/word-level) with different clustering methods (K-means, hierarchical, graph-based) on English and Mandarin data. Experiments are conducted in an idealized setting with gold word boundaries to isolate variables.
Result: The best performing system uses graph clustering with dynamic time warping on continuous features. Faster alternatives include graph clustering with cosine distance on averaged continuous features or edit distance on discrete unit sequences. Controlled experiments show representation variability across segments of the same word type is the primary limiting factor, not clustering methods.
Conclusion: The main bottleneck in zero-resource word segmentation and clustering is representation variability across segments of the same word type, rather than clustering algorithms. Future work should focus on developing more robust representations that better capture word identity despite acoustic variations.
Abstract: Zero-resource word segmentation and clustering systems aim to tokenise speech into word-like units without access to text labels. Despite progress, the induced lexicons are still far from perfect. In an idealised setting with gold word boundaries, we ask whether performance is limited by the representation of word segments, or by the clustering methods that group them into word-like types. We combine a range of self-supervised speech features (continuous/discrete, frame/word-level) with different clustering methods (K-means, hierarchical, graph-based) on English and Mandarin data. The best system uses graph clustering with dynamic time warping on continuous features. Faster alternatives use graph clustering with cosine distance on averaged continuous features or edit distance on discrete unit sequences. Through controlled experiments that isolate either the representations or the clustering method, we demonstrate that representation variability across segments of the same word type – rather than clustering – is the primary factor limiting performance.
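One of the faster pipelines mentioned, cosine distance on averaged continuous features followed by graph clustering, might look like the sketch below. Thresholded connected components stand in for the paper's actual graph-clustering algorithm, and the threshold is an assumed hyperparameter.

```python
import numpy as np
import networkx as nx

def cluster_word_segments(features: list, threshold: float = 0.9):
    """Average each segment's frame-level features (frames, dim) into one
    vector, connect segment pairs whose cosine similarity exceeds the
    threshold, and read word types off the connected components."""
    vecs = np.stack([f.mean(axis=0) for f in features])          # (n, dim)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T
    g = nx.Graph()
    g.add_nodes_from(range(len(vecs)))
    # Upper triangle only, so each pair is considered once.
    g.add_edges_from(zip(*np.where(np.triu(sim, k=1) > threshold)))
    return list(nx.connected_components(g))
```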
[531] Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training
Haixin Zhao, Kaixuan Yang, Nilesh Madhu
Main category: eess.AS
TL;DR: Proposes a gating-based Dynamically Slimmable Network (DSN) with Metric-Guided Training (MGT) for lightweight speech enhancement that adaptively controls computational load based on input signal quality.
Details
Motivation: To further reduce complexity of lightweight speech enhancement models by creating a network that can dynamically adjust its computational load based on input signal quality, avoiding unnecessary computations for simpler inputs.Method: Introduces DSN with static and dynamic components, with dynamic structures targeting common network components (grouped RNN units, multi-head attention, convolutional, and fully connected layers). A policy module controls dynamic parts at frame-wise resolution. Also proposes Metric-Guided Training (MGT) to explicitly guide the policy module in assessing input speech quality.
Result: DSN achieves comparable enhancement performance to state-of-the-art lightweight baseline while using only 73% of its computational load on average. Dynamic component usage ratios show MGT-DSN appropriately allocates network resources according to severity of input signal distortion.
Conclusion: The proposed DSN with MGT effectively reduces computational complexity of speech enhancement models while maintaining performance, demonstrating adaptive resource allocation based on input quality.
Abstract: To further reduce the complexity of lightweight speech enhancement models, we introduce a gating-based Dynamically Slimmable Network (DSN). The DSN comprises static and dynamic components. For architecture-independent applicability, we introduce distinct dynamic structures targeting the commonly used components, namely, grouped recurrent neural network units, multi-head attention, convolutional, and fully connected layers. A policy module adaptively governs the use of dynamic parts at a frame-wise resolution according to the input signal quality, controlling computational load. We further propose Metric-Guided Training (MGT) to explicitly guide the policy module in assessing input speech quality. Experimental results demonstrate that the DSN achieves comparable enhancement performance in instrumental metrics to the state-of-the-art lightweight baseline, while using only 73% of its computational load on average. Evaluations of dynamic component usage ratios indicate that the MGT-DSN can appropriately allocate network resources according to the severity of input signal distortion.
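To make the policy-module idea concrete, here is a minimal sketch of frame-wise gating with a straight-through estimator. It is illustrative only; the paper's policy module, its MGT objective, and the per-component dynamic structures all differ in detail.

```python
import torch
import torch.nn as nn

class FrameGate(nn.Module):
    """Frame-wise slimming gate: a tiny policy head decides, per frame,
    whether a dynamic block runs or is bypassed (identity). Hard 0/1
    decisions are trained with a straight-through estimator; a deployed
    version would execute the block only on gated-on frames to save compute."""
    def __init__(self, dim: int, block: nn.Module):
        super().__init__()
        self.policy = nn.Linear(dim, 1)
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, frames, dim)
        prob = torch.sigmoid(self.policy(x))             # per-frame keep probability
        hard = (prob > 0.5).float()
        gate = hard + prob - prob.detach()               # straight-through gradient
        return gate * self.block(x) + (1.0 - gate) * x
```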
[532] Empowering Multimodal Respiratory Sound Classification with Counterfactual Adversarial Debiasing for Out-of-Distribution Robustness
Heejoon Koo, Miika Toikkanen, Yoon Tae Kim, Soo Yong Kim, June-Woo Kim
Main category: eess.AS
TL;DR: Proposes a counterfactual adversarial debiasing framework for multimodal respiratory sound classification to address spurious correlations from patient metadata and improve generalization across clinical sites.
Details
Motivation: Current multimodal respiratory sound classification methods are vulnerable to spurious correlations from patient metadata (age, sex, acquisition device), which hinders generalization, especially under distribution shifts across different clinical sites.Method: Three-component framework: 1) Causal graph-based counterfactual debiasing to suppress non-causal dependencies from metadata; 2) Adversarial debiasing to learn metadata-insensitive representations; 3) Counterfactual metadata augmentation to further mitigate spurious correlations and strengthen metadata-invariant representations.
Result: The method consistently outperforms strong baselines in evaluations under both in-distribution and distribution shift scenarios.
Conclusion: The proposed counterfactual adversarial debiasing framework effectively addresses spurious correlations in multimodal respiratory sound classification, improving generalization across clinical sites and distribution shifts.
Abstract: Multimodal respiratory sound classification offers promise for early pulmonary disease detection by integrating bioacoustic signals with patient metadata. Nevertheless, current approaches remain vulnerable to spurious correlations from attributes such as age, sex, or acquisition device, which hinder their generalization, especially under distribution shifts across clinical sites. To this end, we propose a counterfactual adversarial debiasing framework. First, we employ a causal graph-based counterfactual debiasing methodology to suppress non-causal dependencies from patient metadata. Second, we introduce adversarial debiasing to learn metadata-insensitive representations and reduce metadata-specific biases. Third, we design counterfactual metadata augmentation to mitigate spurious correlations further and strengthen metadata-invariant representations. By doing so, our method consistently outperforms strong baselines in evaluations under both in-distribution and distribution shifts. Code is available at https://github.com/RSC-Toolkit/BTS-CARD.
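Adversarial debiasing of this kind is commonly implemented with a gradient reversal layer; the sketch below shows that standard construction, with no claim that it matches the paper's exact setup.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so the feature encoder is trained to *hurt* the
    metadata classifier, yielding metadata-insensitive representations."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def adversarial_metadata_loss(features, metadata_labels, clf: nn.Module,
                              lam: float = 1.0):
    """Classifier tries to predict metadata (age/sex/device) from features;
    reversed gradients push the encoder toward metadata invariance."""
    reversed_feats = GradReverse.apply(features, lam)
    return nn.functional.cross_entropy(clf(reversed_feats), metadata_labels)
```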
[533] Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
Main category: eess.AS
TL;DR: This paper studies attention sinks and massive activations in multimodal speech recognition LLMs, identifies their patterns across ASR, VSR, and AVSR, and introduces a decorrelation loss to mitigate these issues while improving performance under feature downsampling.
Details
Motivation: While LLMs have advanced speech recognition (ASR, VSR, AVSR), understanding their internal dynamics during fine-tuning remains limited. Recent NLP research revealed attention sinks and massive activations, but these phenomena haven't been studied in multimodal speech recognition contexts.Method: The authors analyze audio-visual LLMs to identify attention sinks and massive activations across different speech recognition tasks. They discover these phenomena occur not only at BOS tokens but also at intermediate low-semantic tokens. They introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens to mitigate intermediate sinks and massive activations.
Result: The study reveals that massive activations originate in MLP layers and correspond to fixed feature indices across all sink tokens. Intermediate sink tokens show high cosine similarity to BOS tokens, amplifying attention and activation. The proposed decorrelation loss effectively mitigates intermediate sinks and massive activations, improving word error rate under high audio-visual feature downsampling while maintaining stability at lower downsampling rates.
Conclusion: This work provides the first analysis of attention sinks and massive activations in multimodal speech recognition LLMs, revealing consistent patterns across different speech modalities. The proposed decorrelation loss offers a simple yet effective solution to mitigate these issues while enhancing model robustness to feature compression, advancing our understanding of multimodal LLM dynamics.
Abstract: Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
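The decorrelation loss as described reduces cosine similarity between the BOS hidden state and every other token's. A minimal sketch, where squaring the similarity is an illustrative choice rather than the paper's stated form:

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Penalize cosine similarity between the BOS token's hidden state and
    all other tokens', discouraging the intermediate attention sinks the
    paper ties to high BOS similarity. hidden: (batch, seq_len, dim)."""
    bos = F.normalize(hidden[:, :1, :], dim=-1)     # (batch, 1, dim)
    rest = F.normalize(hidden[:, 1:, :], dim=-1)    # (batch, seq_len - 1, dim)
    cos = (bos * rest).sum(dim=-1)                  # (batch, seq_len - 1)
    return cos.pow(2).mean()
```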
[534] Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic
Main category: eess.AS
TL;DR: Omni-AVSR is a unified LLM framework for audio-visual speech recognition that handles ASR, VSR, and AVSR tasks simultaneously with efficient multi-granularity training and parameter adaptation, reducing computational costs while maintaining or improving accuracy.
Details
Motivation: Current LLM-based speech recognition approaches train separate models for different modalities (ASR, VSR, AVSR), which increases computational and deployment costs and misses potential cross-task synergies. Existing methods also use fixed-rate token compression that limits flexibility in balancing accuracy with efficiency.Method: Omni-AVSR adapts matryoshka representation learning to train across multiple audio and visual granularities efficiently. It explores three LoRA-based strategies for adapting the backbone LLM to balance shared and task-specific specialization, enabling elastic inference.
Result: Experiments on LRS2 and LRS3 datasets show Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training only a single model with substantially lower training and deployment resource use. The model remains robust under acoustic noise and shows favorable scaling behavior with increasing LLM size.
Conclusion: Omni-AVSR provides a unified framework for multi-modal speech recognition that addresses the limitations of current approaches by enabling efficient training across modalities, reducing computational costs, and maintaining performance across ASR, VSR, and AVSR tasks.
Abstract: Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training resource use. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model at substantially lower training and deployment resource use. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
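Matryoshka-style multi-granularity training can be emulated by evaluating each batch at several token rates and summing the losses. In the sketch below, average pooling is an assumed granularity mechanism and `loss_fn` is a stand-in for the full LLM forward pass plus ASR loss; neither is taken from the paper.

```python
import torch
import torch.nn.functional as F

def matryoshka_step(tokens: torch.Tensor, loss_fn, rates=(1, 2, 4)):
    """One multi-granularity step: pool the audio-visual token sequence
    (batch, frames, dim) to each rate and accumulate the task loss, so a
    single model supports elastic inference across compression rates."""
    total = 0.0
    for r in rates:
        pooled = tokens if r == 1 else F.avg_pool1d(
            tokens.transpose(1, 2), kernel_size=r, stride=r).transpose(1, 2)
        total = total + loss_fn(pooled)
    return total
```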
[535] Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models
Nikita Kuzmin, Songting Liu, Kong Aik Lee, Eng Siong Chng
Main category: eess.AS
TL;DR: Stream-Voice-Anon adapts causal LM-based neural audio codec architectures for streaming speaker anonymization, achieving better intelligibility and emotion preservation than prior methods while maintaining latency and privacy protection.
Details
Motivation: Streaming speaker anonymization is crucial for online voice applications but remains underexplored. While neural audio codecs with causal language models show promise for streaming tasks, existing systems are designed for voice conversion rather than anonymization, lacking proper privacy protection techniques.Method: Adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques including pseudo-speaker representation sampling, speaker embedding mixing, diverse prompt selection strategies for LM conditioning, and leveraging disentanglement properties of quantized content codes to prevent speaker information leakage. Also explores dynamic vs fixed delay configurations for latency-privacy trade-offs.
Result: Under VoicePrivacy 2024 Challenge protocol, achieves 46% relative WER reduction (intelligibility) and 28% UAR relative improvement (emotion preservation) compared to previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers (though shows 15% relative degradation against semi-informed attackers).
Conclusion: Stream-Voice-Anon successfully adapts causal LM-based NAC architectures for streaming speaker anonymization, demonstrating substantial improvements in intelligibility and emotion preservation while maintaining practical latency, though privacy protection against more sophisticated attackers needs further improvement.
Abstract: Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codec (NAC) provides superior speaker feature disentanglement and linguistic fidelity. NAC can also be used with causal language models (LM) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% UAR relative) compared to the previous state-of-the-art streaming method DarkStream while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.
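A simple realization of pseudo-speaker sampling via convex mixing of speaker embeddings from an external pool; the random weights, pool size, and final normalization are all assumptions, not the paper's procedure.

```python
import torch

def pseudo_speaker(embeddings: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Sample a pseudo-speaker by convexly mixing k speaker embeddings
    drawn at random from a pool. embeddings: (pool_size, dim)."""
    idx = torch.randperm(embeddings.size(0))[:k]
    weights = torch.softmax(torch.randn(k), dim=0)         # random convex weights
    mixed = (weights.unsqueeze(1) * embeddings[idx]).sum(dim=0)
    return torch.nn.functional.normalize(mixed, dim=-1)
```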
eess.IV
[536] Lossy Image Compression – A Frequent Sequence Mining perspective employing efficient Clustering
Avinash Kadimisetty, Oswald C, Sivaselvan B, Alekhya Kadimisetty
Main category: eess.IV
TL;DR: The paper proposes a novel lossy image compression method that replaces JPEG’s DCT with frequent sequence mining and k-means clustering, achieving better compression ratio and quality.
Details
Motivation: To improve lossy image compression by addressing redundancy more effectively than traditional methods like JPEG, which uses DCT transformation.Method: Replaces JPEG’s DCT phase with closed frequent sequence mining and k-means clustering. Uses parallel k-means clustering on all image blocks, refines GSP algorithm with novel pruning strategy to optimize pattern cardinality and reduce code table size.
Result: Simulations show significant gains in both compression ratio and image quality compared to existing alternatives.
Conclusion: The proposed approach successfully integrates data mining techniques (frequent sequence mining and clustering) into image compression, outperforming conventional methods in efficiency and quality.
Abstract: This work explores the scope of Frequent Sequence Mining in the domain of Lossy Image Compression. The proposed work is based on the idea of clustering pixels and using the cluster identifiers in the compression. The DCT phase in JPEG is replaced with a combination of closed frequent sequence mining and k-means clustering to handle the redundant data effectively. This method focuses mainly on applying k-means clustering in parallel to all blocks of each component of the image to reduce the compression time. Conventional GSP algorithm is refined to optimize the cardinality of patterns through a novel pruning strategy, thus achieving a good reduction in the code table size. Simulations of the proposed algorithm indicate significant gains in compression ratio and quality in relation to the existing alternatives.
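The parallel per-block k-means step might look as follows, with 8x8 blocks and k=8 as illustrative choices; the frequent-sequence-mining and code-table stages are omitted.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.cluster import KMeans

def quantize_block(block: np.ndarray, k: int = 8) -> np.ndarray:
    """Cluster one block's pixel values and replace each pixel by its
    cluster centroid, so only k values (plus identifiers) remain."""
    flat = block.reshape(-1, 1).astype(np.float64)
    km = KMeans(n_clusters=k, n_init=3).fit(flat)
    return km.cluster_centers_[km.labels_].reshape(block.shape)

def quantize_blocks(blocks: list, k: int = 8) -> list:
    """Apply k-means to all blocks of an image component in parallel,
    mirroring the paper's strategy for reducing compression time."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(quantize_block, blocks))
```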
[537] OCTA-Based Biomarker Characterization in nAMD
Maria Simona Tivadar, Ioana Damian, Adrian Groza, Simona Delia Nicoara
Main category: eess.IV
TL;DR: Developed three tools for nAMD diagnosis: biomarker extraction, 3D visualization, and white-box ML ensemble for explainable diagnosis with 68% test accuracy.
Details
Motivation: To enhance ophthalmologists' decision-making when diagnosing Neovascular Age-Related Macular Degeneration (nAMD) by providing explainable AI tools that clinicians can understand and trust.Method: Three tools: (1) Image processing for biomarker extraction (mCNV area, vessel density), (2) 3D visualization of neovascularization, (3) Ensemble of three white-box ML algorithms (decision tree, SVM, DL-Learner) for nAMD diagnosis.
Result: The learned models achieved 100% accuracy on training data and 68% accuracy on testing data. The key advantage is the white-box nature ensuring explainability and transparency for clinicians.
Conclusion: The developed tools provide ophthalmologists with explainable AI assistance for nAMD diagnosis, combining quantitative biomarkers, 3D visualization, and transparent machine learning models to support clinical decision-making.
Abstract: We aim to enhance ophthalmologists’ decision-making when diagnosing Neovascular Age-Related Macular Degeneration (nAMD). We developed three tools to analyze Optical Coherence Tomography Angiography images: (1) extracting biomarkers such as mCNV area and vessel density using image processing; (2) generating a 3D visualization of the neovascularization for a better view of the affected regions; and (3) applying an ensemble of three white-box machine learning algorithms (decision tree, support vector machines and DL-Learner) for nAMD diagnosis. The learned expressions reached 100% accuracy for the training data and 68% accuracy in testing. The main advantage is that all the learned models are white-box, which ensures explainability and transparency, allowing clinicians to better understand the decision-making process.
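For concreteness, hedged sketches of the two quantitative biomarkers named above, using common OCTA definitions that may differ from the authors' image-processing pipeline:

```python
import numpy as np

def vessel_density(vessel_mask: np.ndarray, roi_mask: np.ndarray) -> float:
    """Vessel density as the fraction of ROI pixels flagged as vessel
    in a binarized OCTA en-face image (a common definition)."""
    return float(vessel_mask[roi_mask].mean())

def lesion_area_mm2(lesion_mask: np.ndarray, mm_per_pixel: float) -> float:
    """Lesion (e.g. mCNV) area from a binary segmentation mask,
    given the scan's physical pixel pitch."""
    return float(lesion_mask.sum()) * mm_per_pixel ** 2
```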
[538] Advances in Diffusion-Based Generative Compression
Yibo Yang, Stephan Mandt
Main category: eess.IV
TL;DR: Review paper on diffusion-based generative lossy compression methods, focusing on image compression, rate-distortion-perception theory, and connections to inverse problems.
Details
Motivation: Diffusion models have shown strong image generation performance and enabled new approaches to data compression at extremely low bit-rates. There's a need to unify and review recent diffusion-based methods for generative lossy compression.Method: Review of methods that encode source into embeddings, transmit via auxiliary entropy models, and use diffusion models to iteratively refine embeddings during decoding. Also explores diffusion models for information transmission via channel simulation.
Result: Provides a unifying review of diffusion-based compression approaches, analyzing them through rate-distortion-perception theory, highlighting the role of common randomness and connections to inverse problems.
Conclusion: Identifies open challenges in the field while establishing connections between diffusion-based compression methods, rate-distortion-perception theory, and inverse problem solving approaches.
Abstract: Popularized by their strong image generation performance, diffusion and related methods for generative modeling have found widespread success in visual media applications. In particular, diffusion methods have enabled new approaches to data compression, where realistic reconstructions can be generated at extremely low bit-rates. This article provides a unifying review of recent diffusion-based methods for generative lossy compression, with a focus on image compression. These methods generally encode the source into an embedding and employ a diffusion model to iteratively refine it in the decoding procedure, such that the final reconstruction approximately follows the ground truth data distribution. The embedding can take various forms and is typically transmitted via an auxiliary entropy model, and recent methods also explore the use of diffusion models themselves for information transmission via channel simulation. We review representative approaches through the lens of rate-distortion-perception theory, highlighting the role of common randomness and connections to inverse problems, and identify open challenges.
[539] Optimized $k$-means color quantization of digital images in machine-based and human perception-based colorspaces
Ranjan Maitra
Main category: eess.IV
TL;DR: k-means color quantization performs differently across RGB, CIE-XYZ, and CIE-LUV/HCL colorspaces, with RGB best in ~50% of cases, CIE-XYZ better at higher quantization levels, and CIE-LUV sometimes better at lower levels.
Details
Motivation: While k-means is commonly used for color quantization in RGB space, recent studies suggest better performance in human perception-based colorspaces. The paper aims to systematically compare k-means performance across different colorspaces to determine optimal conditions for each.
Method: Tested k-means color quantization at four quantization levels on 148 diverse digital images across RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces. Used Visual Information Fidelity (VIF) measure to numerically assess quantized image quality. Analyzed performance in relation to hue, chromaticity, and luminance distributions.
Result: RGB performed best in about half of cases. CIE-XYZ colorspace generally performed better at higher quantization levels (k). CIE-LUV colorspace sometimes performed best at lower quantization levels. Performance patterns correlate with specific distributions of hue, chromaticity, and luminance in images.
Conclusion: No single colorspace is universally optimal for k-means color quantization. Performance depends on quantization level and image characteristics. The study provides nuanced characterization of which colorspace works best under specific conditions based on image content and quantization requirements.
Abstract: Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The $k$-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of $k$-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that $k$-means color quantization is best in the RGB space in about half of the cases, while in the remaining cases, especially at higher quantization levels ($k$), the CIE-XYZ colorspace usually does better. There are also some cases, especially at lower $k$, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performance in terms of the distributions of hue, chromaticity and luminance in an image offers a nuanced characterization of the images for which each colorspace is better suited to $k$-means color quantization.
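As a concrete picture of the experimental pipeline, the sketch below quantizes an image with $k$-means in each of the three colorspaces and scores the result; it assumes scikit-image and scikit-learn, uses a bundled test image in place of the 148-image corpus, and substitutes MSE for the VIF score used in the paper.

```python
import numpy as np
from skimage import color, data
from sklearn.cluster import KMeans

def quantize(img_rgb, k, space="rgb"):
    """Cluster pixel colors in the chosen colorspace, then map each pixel
    to its cluster center and convert back to RGB for display/scoring."""
    to = {"rgb": lambda x: x, "xyz": color.rgb2xyz, "luv": color.rgb2luv}
    back = {"rgb": lambda x: x, "xyz": color.xyz2rgb, "luv": color.luv2rgb}
    pix = to[space](img_rgb).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pix)
    quant = km.cluster_centers_[km.labels_].reshape(img_rgb.shape)
    return np.clip(back[space](quant), 0, 1)

img = data.astronaut() / 255.0            # any RGB image scaled to [0, 1]
for space in ("rgb", "xyz", "luv"):
    q = quantize(img, k=16, space=space)
    # the paper scores q against img with VIF; MSE is a stand-in here
    print(space, float(np.mean((q - img) ** 2)))
```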
[540] Recover Cell Tensor: Diffusion-Equivalent Tensor Completion for Fluorescence Microscopy Imaging
Chenwei Wang, Zhaoke Huang, Zelin Li, Wenqi Zhu
Main category: eess.IV
TL;DR: A tensor completion framework for 3D fluorescence microscopy imaging that treats sparse anisotropic sampling as a tensor completion problem and uses score-based generative modeling with structural priors for high-quality reconstruction.
Details
Motivation: Fluorescence microscopy imaging of 3D live cells faces challenges due to phototoxicity constraints, leading to sparsely sampled volumes with anisotropic resolution and high noise. Existing inverse problem methods struggle with unknown degradation processes and lack of high-quality reference data.
Method: Proposes a tensor completion framework that treats FM imaging with equidistant Z-axis sampling as a uniformly random sampling tensor completion task. Derives theoretical lower bounds for exact tensor completion and reformulates the problem as a mathematically equivalent score-based generative model with structural consistency priors to guide reconstructions toward denoised and geometrically coherent results.
Result: Demonstrates state-of-the-art performance on SR-CACO-2 and three real in vivo cellular datasets, showing substantial improvements in both signal-to-noise ratio and structural fidelity compared to existing methods.
Conclusion: The proposed tensor completion framework with score-based generative modeling effectively addresses the challenges of sparse, anisotropic FM imaging, providing accurate 3D cell reconstruction without requiring high-quality reference volumes or known degradation processes.
Abstract: Fluorescence microscopy (FM) imaging is a fundamental technique for observing live cell division, one of the most essential processes in the cycle of life and death. Observing 3D live cells requires scanning through the cell volume while minimizing lethal phototoxicity. That limits acquisition time and results in sparsely sampled volumes with anisotropic resolution and high noise. Existing image restoration methods, primarily based on inverse problem modeling, assume known and stable degradation processes and struggle under such conditions, especially in the absence of high-quality reference volumes. In this paper, from a new perspective, we propose a novel tensor completion framework tailored to the nature of FM imaging, which inherently involves nonlinear signal degradation and incomplete observations. Specifically, FM imaging with equidistant Z-axis sampling is essentially a tensor completion task under a uniformly random sampling condition. On one hand, we derive the theoretical lower bound for exact cell tensor completion, validating the feasibility of accurately recovering 3D cell tensor. On the other hand, we reformulate the tensor completion problem as a mathematically equivalent score-based generative model. By incorporating structural consistency priors, the generative trajectory is effectively guided toward denoised and geometrically coherent reconstructions. Our method demonstrates state-of-the-art performance on SR-CACO-2 and three real in vivo cellular datasets, showing substantial improvements in both signal-to-noise ratio and structural fidelity.
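The central modeling move, viewing equidistant Z-axis acquisition as an incompletely observed tensor, is easy to picture with a toy mask; the random volume below is a stand-in for real microscopy data.

```python
import numpy as np

def z_sampling_mask(shape, stride):
    """Binary observation mask for equidistant Z-axis sampling:
    every `stride`-th Z slice is acquired, the rest is unobserved."""
    mask = np.zeros(shape, dtype=bool)
    mask[::stride, :, :] = True
    return mask

volume = np.random.rand(64, 128, 128)      # stand-in for a dense cell volume
mask = z_sampling_mask(volume.shape, stride=4)
observed = np.where(mask, volume, np.nan)  # what the microscope delivers
print(f"observed fraction: {mask.mean():.2f}")  # 0.25 with stride=4
```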
[541] Magnetic Resonance Simulation of Effective Transverse Relaxation (T2*)
Hidenori Takeshima
Main category: eess.IV
TL;DR: Efficient simulation of reversible transverse relaxation (T2’) in MRI using linear phase model and derivative techniques, achieving accurate T2* simulation without requiring 100+ isochromats.
Details
Motivation: T2* relaxation in MRI consists of reversible (T2') and irreversible (T2) components. While T2 simulation is straightforward, T2' simulation is challenging when only simulating individual isochromat magnetizations, typically requiring 100+ isochromats to approximate the Lorentzian function.
Method: Proposed efficient T2' simulation methods using: 1) Linear phase model to directly simulate entire Lorentzian function, avoiding isochromat approximation; 2) Simulation of partial derivatives of magnetizations with respect to frequency axis; 3) Two acceleration techniques: analytic solutions and combined transitions. Validated with one-isochromat simulation and realistic pulse sequence simulations using two phantoms.
Result: One-isochromat simulation demonstrated T2’ simulation feasibility. Realistic cases successfully recovered T2’ without requiring 100+ isochromats per point. Computational time with T2’ simulations was only 2.0-2.7× longer than without T2’ simulations. Acceleration techniques provided 19× speedup with analytic solutions and up to 17× with combined transitions.
Conclusion: Proposed methods efficiently simulate T2’ using linear phase model with Lorentzian function, analytic solutions, and combined transitions, enabling accurate T2* simulation without excessive computational cost of traditional isochromat approaches.
Abstract: Purpose: To simulate effective transverse relaxation ($T_2^*$) as a part of MR simulation. $T_2^*$ consists of reversible ($T_2^{\prime}$) and irreversible ($T_2$) components. Whereas simulations of $T_2$ are easy, $T_2^{\prime}$ is not easily simulated if only magnetizations of individual isochromats are simulated. Theory and Methods: Efficient methods for simulating $T_2^{\prime}$ were proposed. To approximate the Lorentzian function of $T_2^{\prime}$ realistically, conventional simulators require 100+ isochromats. This approximation can be avoided by utilizing a linear phase model for simulating an entire Lorentzian function directly. To represent the linear phase model, the partial derivatives of the magnetizations with respect to the frequency axis were also simulated. To accelerate the simulations with these partial derivatives, the proposed methods introduced two techniques: analytic solutions, and combined transitions. For understanding the fundamental mechanism of the proposed method, a simple one-isochromat simulation was performed. For evaluating realistic cases, several pulse sequences were simulated using two phantoms with and without $T_2^{\prime}$ simulations. Results: The one-isochromat simulation demonstrated that $T_2^{\prime}$ simulations were possible. In the realistic cases, $T_2^{\prime}$ was recovered as expected without using 100+ isochromats for each point. The computational times with $T_2^{\prime}$ simulations were only 2.0 to 2.7 times longer than those without $T_2^{\prime}$ simulations. When the above-mentioned two techniques were utilized, the analytic solutions accelerated the simulations 19-fold, and the combined transitions up to 17-fold. Conclusion: Both theory and results showed that the proposed methods simulated $T_2^{\prime}$ efficiently by utilizing a linear phase model with a Lorentzian function, analytic solutions, and combined transitions.
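The physics the linear phase model exploits is a standard fact worth spelling out: isochromats whose off-resonance frequencies follow a Lorentzian (Cauchy) distribution with half-width $1/T_2^{\prime}$ dephase collectively as $\exp(-t/T_2^{\prime})$, which is exactly what conventional simulators approximate with 100+ isochromats. A quick Monte-Carlo check with illustrative values:

```python
import numpy as np

t2_prime = 0.05                        # seconds, illustrative value
gamma = 1.0 / t2_prime                 # Lorentzian half-width (rad/s)
rng = np.random.default_rng(0)
omega = gamma * rng.standard_cauchy(100_000)  # isochromat off-resonances

t = np.linspace(0, 0.2, 5)
# Monte-Carlo average of isochromat phases vs. the closed form exp(-t/T2')
mc = np.abs(np.exp(1j * np.outer(t, omega)).mean(axis=1))
exact = np.exp(-t / t2_prime)
print(np.round(mc, 3))
print(np.round(exact, 3))
```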
[542] Reinforced Rate Control for Neural Video Compression via Inter-Frame Rate-Distortion Awareness
Wuyang Cong, Junqi Shi, Lizhong Wang, Weijing Shi, Ming Lu, Hao Chen, Zhan Ma
Main category: eess.IV
TL;DR: RL-based rate control for neural video compression that jointly optimizes bitrate allocation and coding parameters frame-by-frame, achieving better rate-distortion performance and bitrate adherence than existing methods.
Details
Motivation: Existing neural video compression rate control schemes overlook inter-frame rate dependencies caused by per-frame coding parameter shifts, leading to suboptimal bitrate allocation and cascading parameter decisions that degrade compression efficiency.
Method: Propose a reinforcement learning framework that formulates rate control as a sequential decision process where an RL agent observes spatiotemporal states at each frame and selects coding parameters to optimize long-term rate-distortion performance and bitrate adherence, independent of GOP structure.
Result: Achieves average relative bitrate error of 1.20%, up to 13.45% bitrate savings at typical GOP sizes, improved robustness to content variation and bandwidth fluctuations, and lower coding overhead across diverse NVC architectures.
Conclusion: The RL-based rate control framework effectively addresses inter-frame rate dependencies, outperforms existing approaches, and demonstrates practical suitability for deployment in neural video compression systems.
Abstract: Neural video compression (NVC) has demonstrated superior compression efficiency, yet effective rate control remains a significant challenge due to complex temporal dependencies. Existing rate control schemes typically leverage frame content to capture distortion interactions, overlooking inter-frame rate dependencies arising from shifts in per-frame coding parameters. This often leads to suboptimal bitrate allocation and cascading parameter decisions. To address this, we propose a reinforcement-learning (RL)-based rate control framework that formulates the task as a frame-by-frame sequential decision process. At each frame, an RL agent observes a spatiotemporal state and selects coding parameters to optimize a long-term reward that reflects rate-distortion (R-D) performance and bitrate adherence. Unlike prior methods, our approach jointly determines bitrate allocation and coding parameters in a single step, independent of group of pictures (GOP) structure. Extensive experiments across diverse NVC architectures show that our method reduces the average relative bitrate error to 1.20% and achieves up to 13.45% bitrate savings at typical GOP sizes, outperforming existing approaches. In addition, our framework demonstrates improved robustness to content variation and bandwidth fluctuations with lower coding overhead, making it highly suitable for practical deployment.
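To fix ideas, here is a toy rendering of the sequential decision process: a policy observes a per-frame state, picks a coding parameter, and is scored on a long-term rate-distortion reward plus a bitrate-adherence penalty. The agent, the synthetic rate-distortion model, and the reward weights are illustrative placeholders, not the paper's formulation.

```python
import random
from dataclasses import dataclass

@dataclass
class Frame:
    complexity: float                  # stand-in for spatiotemporal features

class RandomAgent:                     # placeholder for the trained RL policy
    def select(self, state):
        return random.randint(1, 51)   # pick a quantization parameter

class ToyCodec:                        # synthetic rate-distortion model
    def encode(self, frame, q):
        bits = frame.complexity * 1000.0 / q
        distortion = frame.complexity * q
        return bits, distortion

def run_episode(frames, agent, codec, target_bits, lam=0.01):
    """GOP-agnostic episode: per-frame decisions, one long-term reward."""
    total_reward, spent = 0.0, 0.0
    for i, frame in enumerate(frames):
        state = (i, spent, frame.complexity)       # spatiotemporal state
        q = agent.select(state)                    # per-frame coding parameter
        bits, dist = codec.encode(frame, q)
        spent += bits
        total_reward -= dist + lam * bits          # R-D part of the reward
    total_reward -= abs(spent - target_bits)       # bitrate-adherence term
    return total_reward

frames = [Frame(random.uniform(0.5, 2.0)) for _ in range(8)]
print(run_episode(frames, RandomAgent(), ToyCodec(), target_bits=2000))
```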
[543] AMGFormer: Adaptive Multi-Granular Transformer for Brain Tumor Segmentation with Missing Modalities
Chengxiang Guo, Jian Wang, Junhua Fei, Xiao Li, Chunling Chen, Yun Jin
Main category: eess.IV
TL;DR: AMGFormer addresses the stability crisis in brain tumor segmentation when MRI modalities are missing, achieving consistent performance with <0.5% variance across 15 modality combinations through adaptive fusion and quality-aware modules.
Details
Motivation: Existing brain tumor segmentation methods suffer from >40% performance variance when MRI modalities are missing in clinical practice, making them unreliable for real-world deployment where complete multimodal data is often unavailable.
Method: AMGFormer uses three synergistic modules: (1) QuadIntegrator Bridge for spatially adaptive fusion maintaining consistent predictions across modality combinations, (2) Multi-Granular Attention Orchestrator focusing on pathological regions to reduce background sensitivity, and (3) Modality Quality-Aware Enhancement preventing error propagation from corrupted sequences.
Result: On BraTS 2018: 89.33% WT, 82.70% TC, 67.23% ET Dice scores with <0.5% variance across 15 modality combinations. Single-modality ET segmentation shows 40-81% relative improvements over SOTA. Generalizes to BraTS 2020/2021 with up to 92.44% WT, 89.91% TC, 84.57% ET. Fast inference at 1.2s.
Conclusion: AMGFormer solves the stability crisis in multimodal brain tumor segmentation, achieving clinically reliable performance with minimal variance across missing modalities, demonstrating strong potential for clinical deployment with fast inference times.
Abstract: Multimodal MRI is essential for brain tumor segmentation, yet missing modalities in clinical practice cause existing methods to exhibit >40% performance variance across modality combinations, rendering them clinically unreliable. We propose AMGFormer, achieving significantly improved stability through three synergistic modules: (1) QuadIntegrator Bridge (QIB) enabling spatially adaptive fusion maintaining consistent predictions regardless of available modalities, (2) Multi-Granular Attention Orchestrator (MGAO) focusing on pathological regions to reduce background sensitivity, and (3) Modality Quality-Aware Enhancement (MQAE) preventing error propagation from corrupted sequences. On BraTS 2018, our method achieves 89.33% WT, 82.70% TC, 67.23% ET Dice scores with <0.5% variance across 15 modality combinations, solving the stability crisis. Single-modality ET segmentation shows 40-81% relative improvements over state-of-the-art methods. The method generalizes to BraTS 2020/2021, achieving up to 92.44% WT, 89.91% TC, 84.57% ET. The model demonstrates potential for clinical deployment with 1.2s inference. Code: https://github.com/guochengxiangives/AMGFormer.
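The stability claim suggests a simple evaluation harness: run the model on all 15 non-empty subsets of the four MRI modalities and report the spread of Dice scores. The `segment` callable and toy volumes below are hypothetical stand-ins for an AMGFormer-style model and BraTS data.

```python
from itertools import combinations
import numpy as np

MODALITIES = ("T1", "T1ce", "T2", "FLAIR")

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def stability(segment, volumes, gt):
    """Dice over all 15 non-empty modality subsets: mean and spread."""
    scores = []
    for r in range(1, 5):
        for subset in combinations(MODALITIES, r):
            pred = segment({m: volumes[m] for m in subset})
            scores.append(dice(pred, gt))
    return np.mean(scores), np.max(scores) - np.min(scores)

# toy stand-ins so the sketch runs end to end
rng = np.random.default_rng(0)
gt = rng.random((8, 8, 8)) > 0.5
volumes = {m: gt + 0.3 * rng.standard_normal(gt.shape) for m in MODALITIES}
segment = lambda vols: np.mean(list(vols.values()), axis=0) > 0.5
print(stability(segment, volumes, gt))
```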
[544] Interpretable and backpropagation-free Green Learning for efficient multi-task echocardiographic segmentation and classification
Jyun-Ping Kao, Jiaxing Yang, C. -C. Jay Kuo, Jonghye Woo
Main category: eess.IV
TL;DR: A backpropagation-free multi-task Green Learning framework achieves state-of-the-art LV segmentation and LVEF classification with high accuracy and computational efficiency, outperforming complex 3D DL models.
Details
Motivation: Manual LVEF assessment has high inter-observer variability, while existing Deep Learning models are computationally intensive "black boxes" that lack clinical trust and adoption.
Method: Multi-task Green Learning framework with unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction, combined with multi-level regression decoder and XG-Boost classifier for simultaneous LV segmentation and LVEF classification.
Result: Achieves 94.3% classification accuracy and Dice Similarity Coefficient of 0.912 on EchoNet-Dynamic dataset, significantly outperforming advanced 3D DL models with over an order of magnitude fewer parameters.
Conclusion: Green Learning paradigm can deliver accurate, efficient, and interpretable solutions for medical image analysis, enabling more sustainable and trustworthy AI in clinical practice.
Abstract: Echocardiography is a cornerstone for managing heart failure (HF), with Left Ventricular Ejection Fraction (LVEF) being a critical metric for guiding therapy. However, manual LVEF assessment suffers from high inter-observer variability, while existing Deep Learning (DL) models are often computationally intensive and data-hungry “black boxes” that impede clinical trust and adoption. Here, we propose a backpropagation-free multi-task Green Learning (MTGL) framework that performs simultaneous Left Ventricle (LV) segmentation and LVEF classification. Our framework integrates an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction with a multi-level regression decoder and an XG-Boost classifier. On the EchoNet-Dynamic dataset, our MTGL model achieves state-of-the-art classification and segmentation performance, attaining a classification accuracy of 94.3% and a Dice Similarity Coefficient (DSC) of 0.912, significantly outperforming several advanced 3D DL models. Crucially, our model achieves this with over an order of magnitude fewer parameters, demonstrating exceptional computational efficiency. This work demonstrates that the GL paradigm can deliver highly accurate, efficient, and interpretable solutions for complex medical image analysis, paving the way for more sustainable and trustworthy artificial intelligence in clinical practice.
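The overall pipeline shape, unsupervised feature extraction feeding a boosted-tree classifier with no backpropagation anywhere, can be sketched in a few lines; PCA stands in for the hierarchical VoxelHop encoder here, and the data are random placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
clips = rng.random((200, 16 * 32 * 32))       # flattened echo clips (toy data)
lvef_class = rng.integers(0, 2, 200)          # e.g. reduced vs. preserved LVEF

# Stage 1: unsupervised feature extraction, no gradient descent involved
feats = PCA(n_components=32).fit_transform(clips)

# Stage 2: boosted-tree classifier on the extracted features
clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(feats, lvef_class)
print(clf.score(feats, lvef_class))           # training accuracy on toy data
```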
[545] Extensions on Low-complexity DCT Approximations for Larger Blocklengths Based on Minimal Angle Similarity
A. P. Radünz, L. Portella, R. S. Oliveira, F. M. Bayer, R. J. Cintra
Main category: eess.IV
TL;DR: The paper introduces new 16-, 32-, and 64-point low-complexity DCT approximations that outperform existing methods in image/video coding applications.
Details
Motivation: The DCT is crucial for image/video coding as it approximates the optimal KLT transform. However, exact DCT computation is computationally expensive, creating a need for efficient approximations that maintain good performance.
Method: Developed 16-, 32-, and 64-point DCT approximations by minimizing the angle between rows of the exact DCT matrix and the approximate transform matrix. Also created fast algorithms for these low-complexity transforms.
Result: The proposed transforms outperformed existing DCT approximations according to classical figures of merit. Practical image encoding experiments showed better results than known approximations for 16, 32, and 64 blocklengths.
Conclusion: The introduced low-complexity DCT approximations achieve a good balance between computational cost and performance, making them relevant for practical image/video coding applications.
Abstract: The discrete cosine transform (DCT) is a central tool for image and video coding because it can be related to the Karhunen-Loève transform (KLT), which is the optimal transform in terms of retained transform coefficients and data decorrelation. In this paper, we introduce 16-, 32-, and 64-point low-complexity DCT approximations by minimizing individually the angle between the rows of the exact DCT matrix and the matrix induced by the approximate transforms. According to some classical figures of merit, the proposed transforms outperformed the DCT approximations already known in the literature. Fast algorithms were also developed for the low-complexity transforms, offering a good balance between performance and computational cost. Practical applications in image encoding showed the relevance of the transforms in this context. In fact, the experiments showed that the proposed transforms gave better results than the known approximations in the literature for blocklengths of 16, 32, and 64.
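The design criterion is compact enough to state in code: measure the angle between each row of the exact DCT matrix and an integer-entry candidate row. The thresholding rule below is a simple stand-in for the paper's actual search over low-complexity matrices.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: C[k, j] = c_k * cos(pi*(2j+1)*k / (2n))."""
    k, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.where(k == 0, np.sqrt(1.0 / n), np.sqrt(2.0 / n))
    return c * np.cos(np.pi * (2 * j + 1) * k / (2 * n))

def angle_deg(u, v):
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

n = 16
exact = dct_matrix(n)
# toy low-complexity candidate: entries restricted to {-1, 0, 1}
approx = np.where(np.abs(exact) > 0.15, np.sign(exact), 0.0)
angles = [angle_deg(exact[i], approx[i]) for i in range(n)]
print(np.round(angles, 1))        # per-row angle to the exact DCT rows
```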
[546] Knowledge-enhanced Pretraining for Vision-language Pathology Foundation Model on Cancer Diagnosis
Xiao Zhou, Luoyi Sun, Dexuan He, Wenbin Guan, Ge Wang, Ruifen Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Ya Zhang, Kun Sun, Yanfeng Wang, Weidi Xie
Main category: eess.IV
TL;DR: KEEP integrates medical knowledge graphs into pathology foundation models, reorganizing image-text pairs into disease-aligned groups for improved cancer diagnosis, especially for rare subtypes.
Details
Motivation: Current vision-language foundation models in computational pathology are primarily data-driven and lack explicit integration of medical knowledge, limiting their understanding of disease relationships and morphological patterns.
Method: KEEP systematically incorporates disease knowledge using a comprehensive knowledge graph (11,454 diseases, 139,143 attributes) to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies.
Result: KEEP consistently outperformed existing foundation models across 18 public benchmarks (14,000+ whole-slide images) and 4 institutional rare cancer datasets (926 cases), showing substantial gains for rare subtypes.
Conclusion: Knowledge-enhanced vision-language modeling represents a powerful paradigm for advancing computational pathology by enabling deeper understanding of disease relationships and morphological patterns.
Abstract: Vision-language foundation models have shown great promise in computational pathology but remain primarily data-driven, lacking explicit integration of medical knowledge. We introduce KEEP (KnowledgE-Enhanced Pathology), a foundation model that systematically incorporates disease knowledge into pretraining for cancer diagnosis. KEEP leverages a comprehensive disease knowledge graph encompassing 11,454 diseases and 139,143 attributes to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies. This knowledge-enhanced pretraining aligns visual and textual representations within hierarchical semantic spaces, enabling deeper understanding of disease relationships and morphological patterns. Across 18 public benchmarks (over 14,000 whole-slide images) and 4 institutional rare cancer datasets (926 cases), KEEP consistently outperformed existing foundation models, showing substantial gains for rare subtypes. These results establish knowledge-enhanced vision-language modeling as a powerful paradigm for advancing computational pathology.
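The reorganization step amounts to bucketing image-text pairs by the disease-ontology node their caption links to; the two-level toy ontology and string-matching linker below are purely illustrative.

```python
from collections import defaultdict

# tiny illustrative ontology: child -> parent
ontology_parent = {"lung adenocarcinoma": "lung cancer",
                   "lung squamous cell carcinoma": "lung cancer",
                   "lung cancer": "cancer"}

def link_to_disease(caption):
    """Naive string matching as a stand-in for real entity linking."""
    for disease in ontology_parent:
        if disease in caption.lower():
            return disease
    return None

pairs = [("img_001.png", "Lung adenocarcinoma, acinar pattern."),
         ("img_002.png", "Lung squamous cell carcinoma with keratinization.")]

groups = defaultdict(list)
for image, caption in pairs:
    node = link_to_disease(caption)
    if node is not None:
        groups[node].append((image, caption))   # one group per ontology node
print(dict(groups))
```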
[547] Hint: hierarchical inter-frame correlation for one-shot point cloud sequence compression
Yuchen Gao, Qi Zhang
Main category: eess.IV
TL;DR: HINT: A fast deep learning method for sequential point cloud compression using temporal-spatial correlation, achieving 49.6x encoding and 21.6x decoding acceleration over G-PCC with up to 43.6% bitrate reduction.
Details
Motivation: Existing point cloud compression methods suffer from high decoding latency (10^1-10^2 seconds) due to reliance on parent/sibling contexts and level-wise autoregression, limiting practical applications.
Method: HINT integrates temporal and spatial correlation using two-stage temporal feature extraction: (1) parent-level existence map and (2) child-level neighborhood lookup in previous frame, fused with spatial features via element-wise addition and encoded with group-wise strategy.
Result: Achieves encoding time 105 ms and decoding time 140 ms (49.6x and 21.6x acceleration over G-PCC), with up to 43.6% bitrate reduction, consistently outperforming spatial-only baseline RENO.
Conclusion: HINT demonstrates that integrating temporal correlation significantly improves compression efficiency and speed for sequential point clouds, enabling practical real-time applications.
Abstract: Deep learning has demonstrated strong capability in compressing point clouds. Within this area, entropy modeling for lossless compression is widely investigated. However, most methods rely solely on parent/sibling contexts and level-wise autoregression, which suffers from decoding latency on the order of 10^1-10^2 seconds. We propose HINT, a method that integrates temporal and spatial correlation for sequential point cloud compression. Specifically, it first applies two-stage temporal feature extraction: (i) a parent-level existence map and (ii) a child-level neighborhood lookup in the previous frame. These cues are fused with the spatial features via element-wise addition and encoded with a group-wise strategy. Experimental results show that HINT achieves encoding and decoding times of 105 ms and 140 ms, respectively, equivalent to 49.6x and 21.6x acceleration in comparison with G-PCC, while achieving up to 43.6% bitrate reduction and consistently outperforming the spatial-only baseline (RENO).
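The fusion named in the abstract, temporal cues added element-wise to spatial features, can be sketched as a small PyTorch module; the dense voxel grids, channel counts, and layers below are illustrative simplifications of the actual sparse point-cloud pipeline.

```python
import torch
import torch.nn as nn

class TemporalSpatialFusion(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.spatial = nn.Conv3d(1, channels, 3, padding=1)   # current frame
        self.temporal = nn.Conv3d(2, channels, 3, padding=1)  # cues from t-1

    def forward(self, occupancy_t, parent_map, neighbor_map):
        spat = self.spatial(occupancy_t)
        temp = self.temporal(torch.cat([parent_map, neighbor_map], dim=1))
        return spat + temp          # element-wise addition, as in the paper

fuse = TemporalSpatialFusion()
x = torch.rand(1, 1, 16, 16, 16)          # toy voxelized frame t
prev = torch.rand(1, 1, 16, 16, 16)       # parent-level existence map (t-1)
nbr = torch.rand(1, 1, 16, 16, 16)        # child-level neighborhood lookup
print(fuse(x, prev, nbr).shape)           # fused features for entropy coding
```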
[548] An Energy-Efficient Adiabatic Capacitive Neural Network Chip
Himadri Singh Raghav, Sachin Maheshwari, Mike Smart, Patrick Foster, Alex Serb
Main category: eess.IV
TL;DR: Mixed-signal adiabatic capacitive neural network chip achieves 2.1-6.8x energy savings for image classification on edge devices.
Details
Motivation: Growing demand for high computational performance under stringent energy constraints for battery-powered edge devices, driven by AI advances and increasing data bandwidth requirements in applications like video processing and high-resolution sensing.
Method: Designed a mixed-signal adiabatic capacitive neural network chip in 130nm CMOS technology with dual-layer hardware incorporating 16 single-cycle multiply-accumulate engines for classifying 4 classes of 8x8 1-bit images.
Result: Achieves over 95% classification accuracy (within 2.7% of equivalent software version) and demonstrates average energy savings between 2.1x and 6.8x compared to equivalent CMOS capacitive implementation.
Conclusion: The mixed-signal adiabatic capacitive approach provides significant energy efficiency improvements for neural network inference on edge devices while maintaining high classification accuracy.
Abstract: Recent advances in artificial intelligence, coupled with increasing data bandwidth requirements in applications such as video processing and high-resolution sensing, have created a growing demand for high computational performance under stringent energy constraints, especially for battery-powered and edge devices. To address this, we present a mixed-signal adiabatic capacitive neural network chip, designed in a 130 nm CMOS technology, to demonstrate significant energy savings coupled with high image classification accuracy. Our dual-layer hardware chip, incorporating 16 single-cycle multiply-accumulate engines, can reliably distinguish between 4 classes of 8x8 1-bit images, with classification accuracy over 95%, within 2.7% of an equivalent software version. Energy measurements reveal average energy savings between 2.1x and 6.8x, compared to an equivalent CMOS capacitive implementation.
[549] Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing
Xiang Li, XueHeng Li, Yu Wang, XuanHua He, ZhangChi Hu, WeiWei Yu, ChengJun Xie
Main category: eess.IV
TL;DR: Q-Probe is a new agentic IQA framework that scales image quality assessment to high resolution using context-aware probing, addressing limitations of existing RL-based methods that fail to capture local degradations.
Details
Motivation: Existing RL-based IQA models rely on coarse-grained global views and fail to capture subtle local degradations in high-resolution scenarios. Current "Thinking with Images" paradigms adapted to IQA create spurious biases like "cropping-implies-degradation" and misinterpret natural depth-of-field as artifacts.
Method: Proposes Q-Probe framework with: 1) Vista-Bench benchmark for fine-grained local degradation analysis in high-resolution IQA, and 2) Three-stage training paradigm that progressively aligns with human preferences while eliminating causal bias through context-aware cropping strategy.
Result: Extensive experiments show Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.
Conclusion: Q-Probe successfully addresses the challenges of scaling IQA to high resolution through agentic context-aware probing, outperforming existing methods by better capturing local degradations while avoiding spurious biases.
Abstract: Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging “Thinking with Images” paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious “cropping-implies-degradation” biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.
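One way to picture the context-aware probing idea: every local crop is paired with a cheap global view so a patch is never judged in isolation, avoiding the crop-implies-degradation bias. The random crop placement and the sizes below are illustrative; the paper's agent decides where to probe.

```python
import numpy as np

def probe(image, crop=512, out=224):
    """Return a local crop together with a downsampled global view,
    so quality is assessed with context rather than from the crop alone."""
    h, w, _ = image.shape
    y = np.random.randint(0, max(h - crop, 1))
    x = np.random.randint(0, max(w - crop, 1))
    local = image[y:y + crop, x:x + crop]
    stride = max(h // out, 1)
    global_view = image[::stride, ::stride]   # cheap global downsample
    return local, global_view

img = np.random.randint(0, 256, (2160, 3840, 3), dtype=np.uint8)  # toy 4K image
local, ctx = probe(img)
print(local.shape, ctx.shape)
```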