Daily arXiv Papers - 2025-12-15

AI-enhanced summaries of 23 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages

Subham Kumar, Prakrithi Shivaprakash, Abhishek Manoharan, Astut Kurariya, Diptadhi Mukherjee, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy

Main category: cs.CL

TL;DR: First systematic audit of ASR performance on real-world clinical interviews in Indian languages (Kannada, Hindi, Indian English) reveals substantial variability across models, with systematic performance gaps tied to speaker role and gender, raising equity concerns for healthcare deployment.

Motivation: ASR is increasingly used to document clinical encounters, but its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown, creating a need for systematic evaluation.

Method: Conducted systematic audit comparing leading ASR models (Indic Whisper, Whisper, Sarvam, Google speech-to-text, Gemma3n, Omnilingual, Vaani, Gemini) on real-world clinical interview data spanning Kannada, Hindi, and Indian English, evaluating transcription accuracy across languages, speakers, and demographic subgroups.
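
The summary does not specify the audit's metric pipeline; below is a minimal sketch of the kind of subgroup word-error-rate (WER) comparison such an audit involves, using the jiwer library. The sample fields and the `transcribe` callable are hypothetical stand-ins.

```python
# Sketch of a subgroup WER audit (not the paper's exact pipeline).
# `transcribe` is any ASR model under test; sample fields are hypothetical.
from collections import defaultdict
from jiwer import wer

def subgroup_wer(samples, transcribe):
    """samples: dicts with 'audio', 'reference', 'language', 'role', 'gender'."""
    groups = defaultdict(lambda: {"refs": [], "hyps": []})
    for s in samples:
        key = (s["language"], s["role"], s["gender"])
        groups[key]["refs"].append(s["reference"])
        groups[key]["hyps"].append(transcribe(s["audio"]))
    # One WER per (language, speaker role, gender) cell; gaps between
    # cells are the performance disparities the audit looks for.
    return {k: wer(v["refs"], v["hyps"]) for k, v in groups.items()}
```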

Result: Substantial variability across models and languages; some systems perform competitively on Indian English but fail on code-mixed or vernacular speech. Uncovered systematic performance gaps tied to speaker role (patients vs. clinicians) and gender, with intersectional disparities.

Conclusion: The comprehensive multilingual benchmark and fairness analysis highlight the need for culturally and demographically inclusive ASR development for India’s healthcare ecosystem, as current systems show equity concerns that could affect clinical deployment.

Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we conduct the first systematic audit of ASR performance on real-world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google speech-to-text, Gemma3n, Omnilingual, Vaani, and Gemini. We evaluate transcription accuracy across languages, speakers, and demographic subgroups, with a particular focus on error patterns affecting patients vs. clinicians and gender-based or intersectional disparities. Our results reveal substantial variability across models and languages, with some systems performing competitively on Indian English but failing on code-mixed or vernacular speech. We also uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings. By providing a comprehensive multilingual benchmark and fairness analysis, our work highlights the need for culturally and demographically inclusive ASR development for the healthcare ecosystem in India.

[2] Benchmarking Automatic Speech Recognition Models for African Languages

Alvin Nahabwe, Sulaiman Kagumire, Denis Musinguzi, Bruno Beijuka, Jonah Mubuuke Kyagaba, Peter Nabende, Andrew Katumba, Joyce Nakatumba-Nabende

Main category: cs.CL

TL;DR: Benchmarking 4 state-of-the-art ASR models (Whisper, XLS-R, MMS, W2v-BERT) across 13 African languages with varying data amounts (1-400 hours) reveals their different performance characteristics in low-resource settings.

Motivation: ASR for African languages faces challenges due to limited labeled data and lack of systematic guidance on model selection, data scaling, and decoding strategies. While large pre-trained models exist, their comparative performance in African low-resource contexts hasn't been studied systematically.

Method: Benchmarked four SOTA ASR models (Whisper, XLS-R, MMS, W2v-BERT) across 13 African languages. Fine-tuned models on progressively larger subsets of transcribed data ranging from 1 to 400 hours. Analyzed error rates and model behavior under varying conditions, including external language model decoding.

Result: MMS and W2v-BERT are more data efficient in very low-resource regimes. XLS-R scales more effectively with additional data. Whisper shows advantages in mid-resource conditions. External language model decoding helps in some cases but can plateau or introduce errors depending on acoustic-text alignment.

Conclusion: The study provides practical insights into ASR system design for underrepresented languages by highlighting interactions between pre-training coverage, model architecture, dataset domain, and resource availability. Different models excel under different resource conditions, offering guidance for model selection in African language ASR.

Abstract: Automatic speech recognition (ASR) for African languages remains constrained by limited labeled data and the lack of systematic guidance on model selection, data scaling, and decoding strategies. Large pre-trained systems such as Whisper, XLS-R, MMS, and W2v-BERT have expanded access to ASR technology, but their comparative behavior in African low-resource contexts has not been studied in a unified and systematic way. In this work, we benchmark four state-of-the-art ASR models across 13 African languages, fine-tuning them on progressively larger subsets of transcribed data ranging from 1 to 400 hours. Beyond reporting error rates, we provide new insights into why models behave differently under varying conditions. We show that MMS and W2v-BERT are more data efficient in very low-resource regimes, XLS-R scales more effectively as additional data becomes available, and Whisper demonstrates advantages in mid-resource conditions. We also analyze where external language model decoding yields improvements and identify cases where it plateaus or introduces additional errors, depending on the alignment between acoustic and text resources. By highlighting the interaction between pre-training coverage, model architecture, dataset domain, and resource availability, this study offers practical insights into the design of ASR systems for underrepresented languages.

[3] MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA

Seonok Kim

Main category: cs.CL

TL;DR: MedBioRAG is a retrieval-augmented generation model that combines semantic/lexical search with supervised fine-tuning to significantly improve biomedical question-answering performance across multiple benchmark datasets.

Motivation: To enhance large language models' ability to perform complex biomedical QA tasks by developing a specialized retrieval-augmented approach that can efficiently retrieve and utilize relevant biomedical documents for more accurate and context-aware responses.

Method: Combines semantic and lexical search for document retrieval, implements document ranking, and uses supervised fine-tuning. The model is evaluated across text retrieval, close-ended QA, and long-form QA tasks using biomedical benchmark datasets.
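
The abstract does not state how the semantic and lexical scores are combined; a common scheme is weighted score fusion of BM25 and embedding-cosine similarity. A minimal sketch under that assumption (the model names and alpha weight are illustrative, not MedBioRAG's actual configuration):

```python
# Hypothetical hybrid retrieval sketch: BM25 (lexical) + embedding cosine
# (semantic), fused with a weighted sum. MedBioRAG's exact fusion rule is
# not given in the abstract; alpha and the models below are assumptions.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["aspirin reduces fever", "insulin regulates blood glucose"]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
bm25 = BM25Okapi([d.split() for d in docs])
doc_emb = encoder.encode(docs, normalize_embeddings=True)

def hybrid_search(query, alpha=0.5, k=2):
    lex = np.array(bm25.get_scores(query.split()))
    lex = lex / (lex.max() + 1e-9)                      # normalize to [0, 1]
    sem = doc_emb @ encoder.encode(query, normalize_embeddings=True)
    fused = alpha * sem + (1 - alpha) * lex             # weighted score fusion
    return [docs[i] for i in np.argsort(-fused)[:k]]

print(hybrid_search("what lowers fever?"))
```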

Result: Outperforms previous state-of-the-art models and GPT-4o base model in all evaluated tasks. Shows improvements in NDCG and MRR scores for document retrieval, higher accuracy in close-ended QA, and better ROUGE scores in long-form QA.

Conclusion: Demonstrates the effectiveness of semantic search-based retrieval combined with LLM fine-tuning for biomedical applications, providing a robust solution for improving biomedical question-answering performance.

Abstract: Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications.

[4] KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang

Main category: cs.CL

TL;DR: KBQA-R1 is a reinforcement learning framework for KBQA that shifts from text imitation to interaction optimization, using GRPO for policy refinement and RRS for data synthesis, achieving SOTA performance on major benchmarks.

Motivation: Current LLM-based KBQA approaches suffer from two main failures: (1) generating hallucinated queries without verifying schema existence, and (2) exhibiting rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment.

Method: KBQA-R1 treats KBQA as a multi-turn decision process where the model learns to navigate the knowledge base using a list of actions. It uses Group Relative Policy Optimization (GRPO) to refine strategies based on execution feedback, and introduces Referenced Rejection Sampling (RRS) for data synthesis to align reasoning traces with ground-truth action sequences.
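
A minimal sketch of Referenced Rejection Sampling as described: sample reasoning traces and keep only those whose action sequence exactly matches the ground-truth reference. The model interface and trace structure are hypothetical.

```python
# Sketch of Referenced Rejection Sampling (RRS): keep sampled traces whose
# action sequence matches the ground truth. Model/trace APIs are hypothetical.
def referenced_rejection_sampling(model, question, ref_actions, n_samples=16):
    accepted = []
    for _ in range(n_samples):
        trace = model.sample(question)                 # thoughts + actions
        actions = [s.action for s in trace if s.is_action]
        if actions == ref_actions:                     # strict alignment
            accepted.append(trace)                     # usable cold-start data
    return accepted
```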

Result: Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.

Conclusion: KBQA-R1 successfully addresses the limitations of current approaches by shifting from text imitation to interaction optimization via reinforcement learning, enabling more robust and verifiable knowledge base question answering.

Abstract: Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.

[5] PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

Pawel Batorski, Paul Swoboda

Main category: cs.CL

TL;DR: PIAST: Fast automatic prompt construction using Monte Carlo Shapley estimation to optimize few-shot examples, outperforming existing methods on multiple tasks with limited compute budgets.

Motivation: LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and requires intricate crafting of few-shot examples. Existing methods often require exhaustive instruction search which is computationally expensive.

Method: Proposes an automatic prompt construction algorithm that augments human instructions by generating a small set of few-shot examples. Uses Monte Carlo Shapley estimation to iteratively replace/drop/keep examples based on utility. Employs aggressive subsampling and a replay buffer for faster evaluations. Can be run with different compute time budgets.
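
As a concrete illustration of the core estimator, here is a minimal Monte Carlo Shapley sketch: each example's utility is its marginal contribution to a prompt score, averaged over random permutations. The `evaluate` callable is a hypothetical stand-in for PIAST's subsampled, replay-buffered evaluation.

```python
# Monte Carlo Shapley estimation of few-shot example utility.
# `evaluate(prompt_examples)` -> score on a (sub)sampled dev set (hypothetical).
import random

def mc_shapley(examples, evaluate, n_permutations=50):
    values = [0.0] * len(examples)
    for _ in range(n_permutations):
        order = random.sample(range(len(examples)), len(examples))
        prefix, prev = [], evaluate([])                # empty-prompt baseline
        for i in order:
            prefix.append(examples[i])
            score = evaluate(prefix)
            values[i] += score - prev                  # marginal contribution
            prev = score
    return [v / n_permutations for v in values]        # avg over permutations
```

Examples with low or negative estimated utility become candidates to drop or replace; the aggressive subsampling and replay buffer would live inside `evaluate`.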

Result: On limited budget: outperforms existing automatic prompting methods on text simplification and GSM8K, second best on classification and summarization. With extended but modest compute budget: sets new state-of-the-art among automatic prompting methods on classification, simplification and GSM8K.

Conclusion: Carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data efficient prompt engineering. The method demonstrates that optimizing few-shot examples is more effective than extensive instruction tuning.

Abstract: LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few-shot examples. Our method iteratively replaces/drops/keeps few-shot examples using Monte Carlo Shapley estimation of example utility. For faster execution, we use aggressive subsampling and a replay buffer to speed up evaluations. Our method can be run using different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second-best results on classification and summarization. With an extended, but still modest compute budget we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data-efficient prompt engineering. Our code is available at https://github.com/Batorskq/PIAST.

[6] MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data

Christopher Driggers-Ellis, Detravious Brinkley, Ray Chen, Aashish Dhawan, Daisy Zhe Wang, Christan Grant

Main category: cs.CL

TL;DR: MultiScript30k extends Multi30k dataset to include Arabic, Spanish, Ukrainian, and Chinese (Simplified/Traditional) using NLLB200-3.3B translation, addressing the limitation of original Multi30k’s European language focus.

Motivation: Original Multi30k dataset only covers four European languages (Czech, English, French, German), restricting multimodal machine translation research to Latin-script languages and stalling progress on diverse language families.

Method: Translated English version of Multi30k (Multi30k-En) using NLLB200-3.3B model to create MultiScript30k, covering Arabic, Spanish, Ukrainian, Simplified Chinese, and Traditional Chinese with over 30,000 sentences.
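
For reference, translating with the public NLLB-200-3.3B checkpoint looks roughly like the sketch below (standard Hugging Face usage; batching and device placement omitted). Target languages use FLORES-200 codes such as arb_Arab, spa_Latn, ukr_Cyrl, zho_Hans, and zho_Hant.

```python
# Sketch: translating Multi30k-En sentences with NLLB-200-3.3B via
# Hugging Face transformers. Generation settings are illustrative.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

def translate(sentence, tgt_lang="zho_Hans"):  # e.g. arb_Arab, spa_Latn, ukr_Cyrl
    inputs = tok(sentence, return_tensors="pt")
    out = model.generate(
        **inputs,
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt_lang),
        max_length=128,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

print(translate("Two dogs are playing in the snow."))
```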

Result: Dataset achieves >0.8 cosine similarity and <0.000251 symmetric KL divergence for most languages (except Traditional Chinese). COMETKiwi scores show mixed performance: Arabic comparable to existing ArEnMulti30k, but Ukrainian 6.4% lower than Multi30k-Uk.

Conclusion: MultiScript30k successfully extends multimodal translation resources to diverse scripts and language families, though translation quality varies across languages compared to existing specialized extensions.

Abstract: Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30,000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that the Multi30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all supported languages except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.

[7] Applying NLP to iMessages: Understanding Topic Avoidance, Responsiveness, and Sentiment

Alan Gerber, Sam Cooperman

Main category: cs.CL

TL;DR: Researchers developed an iMessage analyzer tool to extract insights from locally stored iMessage data on Mac, exploring topic modeling, response times, reluctance scoring, and sentiment analysis.

Motivation: As society increasingly relies on short-form electronic communication, it's important to understand what messaging data can reveal about user behavior. Apple's iMessage stores comprehensive message data locally on Mac devices, creating an opportunity to analyze personal messaging patterns that are typically inaccessible in other platforms.

Method: Created an iMessage text message analyzer tool that extracts and analyzes data from the locally stored iMessage database file on Mac computers. The tool focuses on answering five main research questions through topic modeling, response time analysis, reluctance scoring, and sentiment analysis techniques.

Result: The paper demonstrates that the analyzer can successfully extract meaningful insights from iMessage data, answering research questions about conversation topics, communication patterns, user engagement, and emotional content in messages.

Conclusion: The iMessage analyzer provides valuable tools for personal data analysis and has significant potential for future studies on messaging behavior, offering insights that are typically inaccessible due to companies’ data protection policies.

Abstract: What is your messaging data used for? While many users do not often think about the information companies can gather based off of their messaging platform of choice, it is nonetheless important to consider as society increasingly relies on short-form electronic communication. While most companies keep their data closely guarded, inaccessible to users or potential hackers, Apple has opened a door to their walled-garden ecosystem, providing iMessage users on Mac with one file storing all their messages and attached metadata. With knowledge of this locally stored file, the question now becomes: What can our data do for us? In the creation of our iMessage text message analyzer, we set out to answer five main research questions focusing on topic modeling, response times, reluctance scoring, and sentiment analysis. This paper uses our exploratory data to show how these questions can be answered using our analyzer and its potential in future studies on iMessage data.

[8] Extending a Parliamentary Corpus with MPs’ Tweets: Automatic Annotation and Evaluation Using MultiParTweet

Mevlüt Bagci, Ali Abusaleh, Daniel Baumartz, Giuseppe Abrami, Maxim Konca, Alexander Mehler

Main category: cs.CL

TL;DR: MultiParTweet is a multilingual tweet corpus from X that connects politicians’ social media discourse with German parliamentary debates, enriched with automated annotations from 10 models (9 text-based + 1 VLM) for emotion, sentiment, and topic analysis, validated against human annotations.

Motivation: Social media is crucial in modern politics as it reflects politicians' ideologies and facilitates communication with younger generations. There's a need to connect online political discourse with parliamentary debates for comparative analysis.

Method: Created MultiParTweet corpus (39,546 tweets with 19,056 media items) linked to German political corpus GerParCor. Used 9 text-based models and 1 vision-language model (VLM) for automated emotion, sentiment, and topic annotations. Developed TTLABTweetCrawler tool for X data collection. Validated automated annotations against manually annotated subset.
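
The mutual-predictability analysis can be sketched as follows: for each model, predict its labels from the labels of the remaining models. The one-hot encoding and logistic-regression classifier are assumptions; the paper only states that each model's outputs are predicted from the others'.

```python
# Sketch of the mutual-predictability check: predict one model's labels
# from the remaining models' labels. Encoder/classifier are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

def mutual_predictability(label_matrix):
    """label_matrix: (n_tweets, n_models) array of categorical label ids."""
    scores = {}
    for target in range(label_matrix.shape[1]):
        X = np.delete(label_matrix, target, axis=1)    # other models' outputs
        y = label_matrix[:, target]
        clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                            LogisticRegression(max_iter=1000))
        scores[target] = cross_val_score(clf, X, y, cv=5).mean()
    return scores  # high accuracy = model predictable from the others
```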

Result: Models are mutually predictable from each other’s outputs. VLM-based annotations were preferred by human annotators, suggesting multimodal representations align better with human interpretation. Provided both the corpus and data collection tool.

Conclusion: MultiParTweet enables comparative analysis between online political communication and parliamentary debates. VLM-based multimodal annotations show better alignment with human judgment than text-only models, offering improved automated analysis of political social media content.

Abstract: Social media serves as a critical medium in modern politics because it both reflects politicians’ ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians’ social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39,546 tweets, including 19,056 media items. Furthermore, we enriched the annotation with nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic annotations. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether the models can predict each other using the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more closely with human interpretation.

[9] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Jonathan Kamp, Roos Bakker, Dominique Blok

Main category: cs.CL

TL;DR: The paper analyzes biases in feature attribution methods for explaining language models, proposing a framework to evaluate lexical and position biases across different methods and models.

Motivation: Different feature attribution methods produce inconsistent explanations for the same input, leading to user mistrust or inadequate trust. The authors aim to systematically understand and structure these biases rather than just noting superficial inconsistencies.

Method: Proposes a model- and method-agnostic framework with three evaluation metrics to assess lexical and position biases. Tests on two transformers: first with a controlled pseudo-random classification task on artificial data, then with a semi-controlled causal relation detection task on natural data.

Result: Found structural imbalance in lexical vs. position biases: models scoring high on one type score low on the other. Also found that methods producing anomalous explanations are more likely to be biased themselves.

Conclusion: The framework reveals systematic biases in attribution methods, showing that biases are not random but follow patterns that can be quantified, helping users better understand and trust explanation methods.

Abstract: Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are a type of post-hoc explainer that can provide token-level insights. However, explanations of the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type scoring low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.

[10] Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

Longshen Ou, Xichu Ma, Ye Wang

Main category: cs.CL

TL;DR: This paper addresses the singability gap in melody-to-lyric generation by jointly learning wording and formatting through multi-stage training with musicological supervision.

Motivation: There's a substantial singability gap between machine-generated lyrics and human-written lyrics in melody-to-lyric generation tasks. Current approaches fail to properly capture the formatting and structural patterns needed for singable lyrics.

Method: The approach uses multi-stage training: 1) general-domain pretraining, 2) self-supervised length awareness training on large text-only lyric corpus, and 3) supervised melody-to-lyric training with multiple auxiliary supervision objectives based on musicological findings about melody-lyric relationships.

Result: The method improves adherence to line-count requirements by 3.8% and syllable-count requirements by 21.4% absolute compared to naïve fine-tuning, without degrading text quality. Human evaluation shows 42.2% and 74.2% relative gains in overall quality over two task-specific baselines.

Conclusion: Formatting-aware training is crucial for generating singable lyrics, and the proposed multi-stage approach with musicological supervision effectively narrows the singability gap in melody-to-lyric generation.

Abstract: Despite progress in melody-to-lyric generation, a substantial singability gap remains between machine-generated lyrics and those written by human lyricists. In this work, we aim to narrow this gap by jointly learning both wording and formatting for melody-to-lyric generation. After general-domain pretraining, our model acquires length awareness through a self-supervised stage trained on a large text-only lyric corpus. During supervised melody-to-lyric training, we introduce multiple auxiliary supervision objectives informed by musicological findings on melody–lyric relationships, encouraging the model to capture fine-grained prosodic and structural patterns. Compared with naïve fine-tuning, our approach improves adherence to line-count and syllable-count requirements by 3.8% and 21.4% absolute, respectively, without degrading text quality. In human evaluation, it achieves 42.2% and 74.2% relative gains in overall quality over two task-specific baselines, underscoring the importance of formatting-aware training for generating singable lyrics.

[11] FIBER: A Multilingual Evaluation Resource for Factual Inference Bias

Evren Ayberk Munis, Deniz Yılmaz, Arianna Muti, Çağrı Toraman

Main category: cs.CL

TL;DR: FIBER is a multilingual benchmark for evaluating factual knowledge in LLMs across single- and multi-entity settings in English, Italian, and Turkish, revealing language-induced inference biases and performance differences.

Motivation: Existing benchmarks focus on single-entity facts and monolingual data, lacking comprehensive evaluation of factual knowledge in multilingual and multi-entity contexts, which is crucial for assessing LLM reliability and biases.

Method: Created FIBER benchmark with sentence completion, question-answering, and object-count prediction tasks across three languages. Evaluated models on single- vs multi-entity questions and analyzed language-induced inference biases.

Result: Prompt language influences entity selection (31% of topics show bias >0.5), with Turkish prompts showing higher bias than Italian in 83% of topics. Models struggle more with multi-entity questions. English achieves highest performance, larger models (8B, 7B) outperform smaller ones (3B-4B).

Conclusion: Language choice affects LLM factual inference, creating biases tied to language-country associations. Multi-entity questions are more challenging than single-entity ones. Larger models perform better, but performance varies across languages, highlighting the need for multilingual evaluation benchmarks.

Abstract: Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model’s generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across topics such that 31% of the topics exhibit a factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages such that Turkish prompts show higher bias compared to Italian in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.

[12] SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

Luca Foppiano, Sotaro Takeshita, Pedro Ortiz Suarez, Ekaterina Borisova, Raia Abu Ahmad, Malte Ostendorff, Fabio Barth, Julian Moreno-Schneider, Georg Rehm

Main category: cs.CL

TL;DR: SciLaD is a large-scale scientific language dataset with 10M+ English publications and 35M+ multilingual publications, plus an extensible pipeline for dataset generation and a pre-trained RoBERTa model.

Motivation: To create a comprehensive scientific language dataset using open-source tools and publicly available data to advance research in scientific NLP and scholarly document processing.

Method: Constructed dataset using open-source frameworks and public data sources, with curated English split and multilingual TEI XML split. Developed extensible pipeline for dataset generation and pre-trained a RoBERTa model on the dataset.

Result: Created SciLaD dataset with over 10M English publications and 35M+ multilingual publications. Pre-trained RoBERTa model achieves performance comparable to similar-sized scientific language models on comprehensive benchmarks.

Conclusion: SciLaD demonstrates that open-source tools can enable large-scale scientific data curation with high quality. The dataset and evaluation pipeline promote reproducibility and further research in scientific NLP.

Abstract: SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding including scholarly document processing.

Di Wu, Ruiyu Fang, Liting Jiang, Shuangyong Song, Xiaomeng Huang, Shiquan Wang, Zhongqiu Li, Lingling Shi, Mengjiao Bao, Yongxiang Li, Hao Huang

Main category: cs.CL

TL;DR: Survey paper reviewing recent advances in multi-intent spoken language understanding (SLU), covering multiple intent detection and slot filling tasks for utterances with multiple intents.

Motivation: Multi-intent SLU closely reflects real-world applications but lacks comprehensive systematic review. Need to organize existing research, analyze approaches, and guide future work in this growing field.

Method: Survey methodology: provides in-depth overview from two perspectives - decoding paradigms and modeling approaches. Compares performance of representative models and analyzes their strengths/limitations.

Result: Comprehensive review of multi-intent SLU research landscape. Analysis of different approaches, performance comparisons, and identification of current challenges in the field.

Conclusion: Survey offers valuable insights and reference for advancing multi-intent SLU research. Discusses current challenges and outlines promising future directions for the field.

Abstract: Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU.

[14] Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach

Yun-Chung Liu, Rui Yang, Jonathan Chong Kai Liew, Ziran Yin, Henry Foote, Christopher J. Lindsell, Chuan Hong

Main category: cs.CL

TL;DR: A two-stage dynamic few-shot learning approach using LLMs to improve efficiency of title/abstract screening in systematic reviews, balancing performance and computational costs.

Motivation: Systematic reviews are crucial for evidence-based medicine but title/abstract screening has become increasingly time-consuming and resource-intensive due to rapid growth of research publications.

Method: Two-stage dynamic few-shot learning (DFSL) approach: first uses low-cost LLM for initial screening, then re-evaluates low-confidence instances with high-performance LLM to enhance screening while controlling computational costs.
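
The two-stage cascade reduces to a simple confidence-routing loop; a minimal sketch, where both `classify` helpers are hypothetical LLM wrappers returning an include/exclude decision with a confidence score, and the 0.8 threshold is illustrative:

```python
# Sketch of the two-stage screening cascade: a low-cost LLM labels each
# abstract; low-confidence items are re-evaluated by a stronger model.
# Both classify helpers are hypothetical wrappers returning
# (include: bool, confidence: float); the threshold is an assumption.
def screen(records, cheap_classify, strong_classify, threshold=0.8):
    decisions = []
    for rec in records:
        include, conf = cheap_classify(rec)       # stage 1: low-cost LLM
        if conf < threshold:                      # uncertain -> escalate
            include, conf = strong_classify(rec)  # stage 2: high-performance LLM
        decisions.append((rec["id"], include, conf))
    return decisions
```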

Result: Evaluated across 10 systematic reviews, demonstrating strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate systematic review process.

Conclusion: The DFSL approach effectively improves efficiency and performance of LLMs in title/abstract screening for systematic reviews, offering a practical solution to reduce manual workload and speed up evidence synthesis.

Abstract: Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications.

[15] When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Mrinal Rawat, Arkajyoti Chakraborty, Neha Gupta, Roberto Pieraccini

Main category: cs.CL

TL;DR: RL-based approach improves LLM reasoning and tool use by learning from task outcomes, outperforming SFT and base models.

Motivation: SFT struggles with generalization when data distributions change, and collecting high-quality reasoning traces for SFT is costly, subjective, and hard to scale. Reasoning-focused models show better generalization, but need scalable methods to learn reasoning strategies.

Method: Proposes RL pipeline where LLMs generate reasoning steps to guide tool invocation and answer generation. Uses Group Relative Policy Optimization (GRPO) with rewards based on tool accuracy and answer correctness to iteratively refine reasoning and actions.
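
A minimal sketch of the outcome-based reward and GRPO's group-relative advantage follows. The abstract only says rewards are designed around tool accuracy and answer correctness, so the equal weighting and exact-match checks are assumptions.

```python
# Sketch of an outcome reward over tool accuracy + answer correctness, and
# GRPO's group-relative advantage. Weights and exact checks are assumptions.
import numpy as np

def reward(episode, w_tool=0.5, w_answer=0.5):
    tool_ok = float(episode["tool_calls"] == episode["gold_tool_calls"])
    answer_ok = float(episode["answer"] == episode["gold_answer"])
    return w_tool * tool_ok + w_answer * answer_ok

def group_relative_advantages(rewards):
    r = np.asarray(rewards, dtype=float)      # rewards of one prompt's rollouts
    return (r - r.mean()) / (r.std() + 1e-8)  # normalize within the group
```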

Result: Achieves 1.5% relative improvement over SFT model (without explicit thinking) and 40% gain compared to vanilla Qwen3-1.7B base model. Improves both reasoning quality and tool invocation precision.

Conclusion: RL can effectively unify reasoning and action learning to build more capable and generalizable conversational agents, offering a scalable alternative to costly reasoning annotations.

Abstract: Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging: annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the vanilla Qwen3-1.7B base model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.

[16] AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu

Main category: cs.CL

TL;DR: AdaSD is a hyperparameter-free speculative decoding method that dynamically adjusts generation length and acceptance criteria during inference using adaptive thresholds based on token entropy and Jensen-Shannon distance, achieving up to 49% speedup with minimal accuracy loss.

Motivation: Large language models have slow inference due to their large parameter sizes. Existing speculative decoding approaches require additional training, extensive hyperparameter tuning, or prior analysis before deployment, which limits their practicality.

Method: Proposes Adaptive Speculative Decoding (AdaSD) with two adaptive thresholds: one for stopping candidate token generation and another for token acceptance. These thresholds are updated in real-time based on token entropy and Jensen-Shannon distance, requiring no pre-analysis or fine-tuning and working with off-the-shelf models.
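
The two signals the thresholds act on are standard quantities; below is a minimal sketch with fixed thresholds standing in for AdaSD's adaptive update rule, which the abstract does not spell out.

```python
# Draft-token entropy (when to stop drafting) and Jensen-Shannon distance
# between draft and target distributions (token acceptance). Fixed thresholds
# stand in for AdaSD's adaptive updates; the values are illustrative.
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def js_distance(p, q):
    p, q = np.clip(p, 1e-12, 1.0), np.clip(q, 1e-12, 1.0)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))   # sqrt(JSD) is a metric

def keep_drafting(draft_probs, stop_threshold=2.0):
    return entropy(draft_probs) < stop_threshold       # confident -> continue

def accept_token(draft_probs, target_probs, accept_threshold=0.3):
    return js_distance(draft_probs, target_probs) < accept_threshold
```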

Result: Experiments on benchmark datasets show AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%.

Conclusion: AdaSD provides a practical, hyperparameter-free solution for efficient and adaptive LLM inference that eliminates the need for pre-analysis or fine-tuning while maintaining high accuracy.

Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference.

[17] CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise

Qingsen Ma, Dianyun Wang, Ran Jing, Yujun Sun, Zhenbo Xu

Main category: cs.CL

TL;DR: CIP is a lightweight causal prompting framework that reduces hallucinations in LLMs by constructing causal relation sequences to guide reasoning toward relevant evidence, improving factual grounding and efficiency.

Motivation: Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. This leads to unreliable outputs and poor factual grounding.

Method: CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. It uses causal intervention and counterfactual reasoning to suppress non-causal reasoning paths.

Result: Experiments across seven mainstream language models show CIP consistently enhances reasoning quality and reliability: 2.6 points improvement in Attributable Rate, 0.38 improvement in Causal Consistency Score, 4x increase in effective information density, and up to 55.1% reduction in end-to-end response latency.

Conclusion: Causal reasoning serves as a promising paradigm for improving the explainability, stability, and efficiency of large language models, with CIP demonstrating significant practical benefits across multiple metrics.

Abstract: Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non-causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving a 2.6-point improvement in Attributable Rate, a 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API-level profiling further shows that CIP accelerates contextual understanding and reduces end-to-end response latency by up to 55.1%. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.

[18] LegalRikai: Open Benchmark

Shogo Fujita, Yuji Naraki, Yiqing Zhu, Shinsuke Mori

Main category: cs.CL

TL;DR: LegalRikai: Open Benchmark introduces a Japanese corporate legal practice benchmark with 100 complex samples requiring structured outputs, evaluated using human and automated methods to reveal model weaknesses in document editing and validate automated evaluation as a screening tool.

Motivation: To address the gap in practice-oriented legal AI research by creating a benchmark that emulates real Japanese corporate legal practices, moving beyond conventional short-text tasks to evaluate document-level capabilities.

Method: Created a benchmark with 100 complex samples requiring long-form structured outputs, developed by legal professionals under attorney supervision. Evaluated using both human assessment and automated evaluation with leading LLMs (GPT-5, Gemini 2.5 Pro, Claude Opus 4.1) across multiple practical criteria.

Result: Human evaluation revealed that abstract instructions prompted unnecessary modifications, exposing model weaknesses in document-level editing missed by conventional tasks. Automated evaluation aligned well with human judgment on criteria with clear linguistic grounding, though assessing structural consistency remains challenging. Automated evaluation proved useful as a screening tool when expert availability is limited.

Conclusion: The benchmark demonstrates the utility of automated evaluation for legal AI tasks and proposes a dataset evaluation framework to promote more practice-oriented research in the legal domain, highlighting the importance of document-level evaluation beyond short-text tasks.

Abstract: This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. It contains 100 samples that require long-form, structured outputs, which we evaluated against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, while assessing structural consistency remains a challenge. These results demonstrate the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.

[19] Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture

Jiarun Liu, Shiyue Xu, Yang Li, Shangkun Liu, Yongli Yu, Peng Cao

Main category: cs.CL

TL;DR: SMITH is a cognitive architecture that integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization, achieving state-of-the-art performance on the GAIA benchmark.

Motivation: LLM agents struggle with adapting to novel tasks due to limited tool availability and inefficient experience reuse. Existing approaches either use predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to poor exploration and suboptimal performance.

Method: SMITH uses hierarchical memory organization (procedural, semantic, episodic) for systematic capability expansion. It formalizes tool creation as iterative code generation in sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. Also includes curriculum learning with agent-ensemble difficulty re-estimation.
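
Episodic retrieval with semantic similarity matching can be sketched as below. The sentence-transformers encoder and the storage schema are assumptions; SMITH's actual memory format is not given in the summary.

```python
# Sketch of episodic-memory retrieval via semantic similarity: past task
# episodes are embedded, and the most similar ones are retrieved for a new
# task. Encoder choice and episode schema are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

class EpisodicMemory:
    def __init__(self):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.episodes, self.embeddings = [], []

    def store(self, task_description, trace):
        self.episodes.append({"task": task_description, "trace": trace})
        self.embeddings.append(
            self.encoder.encode(task_description, normalize_embeddings=True))

    def retrieve(self, new_task, k=3):
        query = self.encoder.encode(new_task, normalize_embeddings=True)
        sims = np.array(self.embeddings) @ query       # cosine similarity
        return [self.episodes[i] for i in np.argsort(-sims)[:k]]
```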

Result: Achieved 81.8% Pass@1 accuracy on GAIA benchmark, outperforming state-of-the-art baselines Alita (75.2%) and Memento (70.9%).

Conclusion: SMITH establishes a foundation for building truly adaptive agents that continuously evolve capabilities through principled integration of tool creation and experience accumulation.

Abstract: Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. We further propose a curriculum learning strategy based on agent-ensemble difficulty re-estimation. Extensive experiments on the GAIA benchmark demonstrate SMITH’s effectiveness, achieving 81.8% Pass@1 accuracy and outperforming state-of-the-art baselines including Alita (75.2%) and Memento (70.9%). Our work establishes a foundation for building truly adaptive agents that continuously evolve their capabilities through principled integration of tool creation and experience accumulation.

[20] qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs

Shreya Shukla, Aditya Sriram, Milinda Kuppur Narayanaswamy, Hiteshi Jain

Main category: cs.CL

TL;DR: qa-FLoRA: Query-adaptive, training-free LoRA fusion method that dynamically computes layer-level weights by measuring distribution divergence between base model and adapters, outperforming static fusion and training-free baselines.

Motivation: Existing LoRA fusion approaches for multi-domain composite queries either use static weights (equal relevance to all LoRAs) or require data-intensive supervised training for every possible LoRA combination, which is impractical and resource-intensive.

Method: Proposes qa-FLoRA, a query-adaptive, data-and-training-free method that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters, eliminating need for composite training data or domain-representative samples.
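
A minimal sketch of the idea: for a given query, measure how far each adapter shifts the base model's output distribution at a layer, and turn those divergences into fusion weights. The choice of KL divergence and softmax normalization here is an assumption, not necessarily qa-FLoRA's exact rule.

```python
# Sketch of query-adaptive layer-level LoRA fusion weights: adapters whose
# distributions diverge more from the base model at this layer get more
# weight. KL + softmax are assumptions standing in for the paper's rule.
import torch
import torch.nn.functional as F

def fusion_weights(base_logits, adapter_logits_list, temperature=1.0):
    """base_logits: [vocab]; adapter_logits_list: one [vocab] tensor per adapter.
    Returns one fusion weight per adapter for this layer and query."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    divs = torch.stack([
        # KL(adapter || base) at this layer for the current query
        F.kl_div(base_logp, F.softmax(a, dim=-1), reduction="sum")
        for a in adapter_logits_list
    ])
    return F.softmax(divs / temperature, dim=-1)
```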

Result: Outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing gap with supervised baselines. Layer-level analysis reveals interpretable fusion patterns.

Conclusion: qa-FLoRA provides an effective, interpretable, and practical solution for robust multi-domain adaptation without requiring additional training data, making it readily applicable to existing adapter collections for handling complex composite queries.

Abstract: The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains, show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation.

Tomáš Koref, Lena Held, Mahammad Namazov, Harun Kumru, Yassine Thlija, Christoph Burchard, Ivan Habernal

Main category: cs.CL

TL;DR: This paper develops automated NLP methods to analyze judicial reasoning in Czech Supreme Courts, refuting claims about formalistic judging in Central and Eastern Europe through argument mining and classification.

Motivation: Judicial reasoning is difficult to analyze systematically at scale. The study develops automated methods to do so, challenging prevailing narratives about formalistic judging in Central and Eastern Europe.

Method: Created MADON dataset with expert annotations of 9,183 paragraphs from 272 Czech Supreme Court decisions. Used transformer LLMs adapted for Czech legal domain through continued pretraining, addressing dataset imbalance with asymmetric loss and class weighting. Developed a three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based ML.
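
One of the imbalance remedies named here, class weighting, is standard; a minimal PyTorch sketch with hypothetical label counts follows (asymmetric loss is not shown).

```python
# Sketch of class-weighted cross-entropy for imbalanced paragraph labels.
# Label counts are hypothetical; the paper also uses asymmetric loss.
import torch
import torch.nn as nn

label_counts = torch.tensor([7000.0, 900.0, 600.0, 400.0, 283.0])  # hypothetical
weights = label_counts.sum() / (len(label_counts) * label_counts)  # inverse freq
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)            # batch of paragraph logits
targets = torch.randint(0, 5, (8,))   # argument-type labels
loss = criterion(logits, targets)     # rare classes contribute more per error
```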

Result: Best models achieved 82.6% macro-F1 for detecting argumentative paragraphs, 77.5% macro-F1 for classifying traditional legal argument types, and 83.2% macro-F1 for classifying decisions as formalistic/non-formalistic. Successfully challenged narratives about CEE formalism.

Conclusion: Legal argument mining enables reliable judicial philosophy classification and has potential for other computational legal studies tasks. The methodology is replicable across jurisdictions, with all resources publicly available.

Abstract: Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts’ decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs to the Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance, including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic or non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and demonstrates its potential for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at https://github.com/trusthlt/madon.

[22] Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis

Felipe Ribeiro Fujita de Mello, Hideyuki Takada

Main category: cs.CL

TL;DR: Data selection methods significantly impact machine translation fine-tuning for LLMs, with semantic selectors outperforming lexical/geometric ones, and even small data differences (under 3%) causing substantial performance changes.

DetailsMotivation: To understand how different data selection strategies affect fine-tuning outcomes for machine translation tasks using open LLMs, particularly examining whether semantic-based selection outperforms other approaches.

Method: Used Japanese-English corpora to compare five data selectors (TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection) under controlled training conditions, measuring their impact on model performance.

Result: Semantic selectors consistently outperformed lexical and geometry-based heuristics. Even when selected data differed by less than 3%, the impact on model performance was substantial, showing high sensitivity to data quality.

Conclusion: Data selection quality is crucial for machine translation fine-tuning, with semantic-based methods being most effective, and even minor differences in selected data can significantly impact model performance.

Abstract: We investigate the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observe that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
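
For concreteness, a generic sketch of what the lexical baseline among these selectors could look like; the scoring rule (mean TF-IDF weight per sentence) is our assumption, not the authors' implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_select(candidates, k):
    """Rank candidate training sentences by mean TF-IDF weight and keep the top k."""
    X = TfidfVectorizer().fit_transform(candidates)  # [n_candidates, vocab]
    scores = np.asarray(X.mean(axis=1)).ravel()      # density of informative terms
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]

pool = ["I ate sushi in Kyoto.", "Thanks.", "The shinkansen departs at noon."]
print(tfidf_select(pool, k=2))
```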

[23] Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

Main category: cs.CL

TL;DR: Proposes a clip selection method for video summarization using lightweight video captioning and LLMs to identify key moments, achieving near-reference performance with low computational cost.

DetailsMotivation: VLMs struggle with long videos where important visual information gets lost, and there's a need for cost-effective analysis of lengthy video content.

Method: Divide video into short clips, generate compact visual descriptions using lightweight video captioning model, then use LLM to select K most relevant clips for multimodal summary.

Result: Achieves summarization performance close to reference clips (derived from human-annotated screenplays) while capturing more relevant video information than random selection, with low computational cost.

Conclusion: The proposed clip selection method effectively identifies key video moments for multimodal summarization, balancing performance and computational efficiency.

Abstract: Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
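
The selection loop itself is simple to express. In the sketch below, `caption` and `llm_rank` are hypothetical stand-ins for the lightweight captioning model and the LLM selector; the actual prompt and models are described in the paper:

```python
def select_key_clips(video_clips, k, caption, llm_rank):
    """Caption each short clip, then ask an LLM for the k most salient ones."""
    descriptions = [caption(clip) for clip in video_clips]  # compact text per clip
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(descriptions))
    prompt = (
        "Below are captions of consecutive video clips.\n"
        f"{numbered}\n"
        f"Return the indices of the {k} clips most important for a summary."
    )
    indices = llm_rank(prompt)  # hypothetical: parses clip indices from the reply
    return [video_clips[i] for i in indices]
```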

[24] CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, Chirag Agarwal

Main category: cs.CL

TL;DR: CLINIC is a comprehensive multilingual benchmark for evaluating healthcare language models across 5 trustworthiness dimensions (truthfulness, fairness, safety, robustness, privacy) in 15 languages, revealing significant shortcomings in current models.

DetailsMotivation: Current language models are predominantly trained in high-resource languages and lack reliable evaluation of trustworthiness in multilingual healthcare settings, creating barriers to real-world adoption in global healthcare contexts with linguistic diversity.

Method: Developed CLINIC benchmark that systematically evaluates LMs across 5 trustworthiness dimensions (truthfulness, fairness, safety, robustness, privacy) through 18 diverse tasks spanning 15 languages covering major continents, including healthcare topics like diseases, treatments, medications.

Result: Evaluation reveals LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks.

Conclusion: CLINIC highlights critical shortcomings in current healthcare LMs and lays the foundation for enhancing global reach and safety of language models in healthcare across diverse languages.

Abstract: Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.

[25] Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning

Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang

Main category: cs.CL

TL;DR: Mistake Notebook Learning (MNL) is a training-free framework that creates a persistent knowledge base of abstracted error patterns from multiple failures, achieving near-SFT performance without gradient updates.

DetailsMotivation: Current LLM adaptation methods have limitations: gradient fine-tuning requires heavy computation and suffers from catastrophic forgetting, while In-Context Learning has low robustness and poor mistake learning capabilities.

Method: MNL uses batch-wise error abstraction to extract generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation for monotonic improvement.

Result: MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL achieves 28% accuracy (47% relative gain).

Conclusion: MNL proves to be a strong training-free alternative for complex reasoning tasks, offering a practical solution that avoids computational costs of fine-tuning while maintaining high performance.

Abstract: Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To address these limitations, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL reaches 28% accuracy (a 47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1%), establishing it as a strong training-free alternative for complex reasoning.
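
The retention rule is the part that guarantees monotonic improvement. A minimal sketch, where `solve(item, guidance)` is a hypothetical wrapper around the LLM and all names are ours rather than the paper's API:

```python
def update_notebook(notebook, candidate, holdout, solve, baseline_acc):
    """Keep a newly abstracted guidance entry only if it beats the current
    baseline on held-out items, so notebook quality never regresses."""
    correct = sum(solve(item, notebook + [candidate]) == item["answer"]
                  for item in holdout)
    acc = correct / len(holdout)
    if acc > baseline_acc:            # monotonic improvement check
        notebook.append(candidate)
        return notebook, acc          # accepted: new baseline
    return notebook, baseline_acc     # rejected: discard the candidate
```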

[26] Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction

Kai Golan Hashiloni, Brenda Kasabe Nokai, Michal Shevach, Esthy Shemesh, Ronit Bartin, Anna Bergrin, Liran Harel, Nachum Dershowitz, Liat Nadai Arad, Kfir Bar

Main category: cs.CL

TL;DR: A Hebrew medical language model extracts structured clinical timelines from EHRs to construct patient journeys, achieving strong performance on new temporal relation datasets while maintaining privacy through de-identification.

DetailsMotivation: There's a need for Hebrew medical language models to extract structured clinical timelines from electronic health records to enable patient journey construction, addressing the gap in Hebrew medical NLP resources.

Method: Based on DictaBERT 2.0, continually pre-trained on over 5 million de-identified hospital records, with vocabulary adaptation for token efficiency. Evaluated using two new datasets (internal medicine/emergency and oncology) annotated for event temporal relations.

Result: The model achieves strong performance on both temporal relation datasets. Vocabulary adaptation improves token efficiency, and de-identification does not compromise downstream performance, supporting privacy-conscious development.

Conclusion: The Hebrew medical language model successfully extracts clinical timelines for patient journey construction, demonstrates that de-identification maintains performance, and is made available for research under ethical restrictions.

Abstract: We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets – one from internal medicine and emergency departments, and another from oncology – annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.

[27] Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Mohor Banerjee, Nadya Yuki Wangsajaya, Syed Ali Redha Alsagoff, Min Sen Tan, Zachary Choy Kit Chun, Alvin Chan Guo Wei

Main category: cs.CL

TL;DR: Hallucination-reduction techniques (CoVe, DoLa, RAG) have opposing effects on LLM creativity: CoVe enhances divergent thinking, DoLa suppresses it, and RAG has minimal impact.

DetailsMotivation: While many methods reduce LLM hallucinations, their impact on creative generation remains unexplored, creating a critical gap for AI-assisted scientific discovery which requires both factual accuracy and creative hypothesis generation.

Method: Investigated three hallucination-reduction techniques (Chain of Verification, Decoding by Contrasting Layers, Retrieval-Augmented Generation) across multiple LLM families (LLaMA, Qwen, Mistral) at varying scales (1B-70B parameters) using two creativity benchmarks (NeoCoder and CS4).

Result: Hallucination-reduction methods have opposing effects on divergent creativity: CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact.

Conclusion: Findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications where balancing factual accuracy and creative exploration is crucial.

Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.

[28] Visualizing token importance for black-box language models

Paulius Rauba, Qiyao Wei, Mihaela van der Schaar

Main category: cs.CL

TL;DR: DBSA is a lightweight, model-agnostic method for auditing black-box LLMs by analyzing output sensitivity to each input token, enabling visual exploration of token-level dependencies without distributional assumptions.

DetailsMotivation: There's a need to audit black-box LLMs in high-stakes domains (legal, medical, regulatory) where existing approaches focus on isolated aspects like bias detection. The challenge is understanding how outputs depend on input tokens when dealing with stochastic LLMs through inaccessible APIs, where computing prompt-level gradients is infeasible.

Method: Distribution-Based Sensitivity Analysis (DBSA) - a lightweight, model-agnostic procedure that evaluates output sensitivity for each input token without making distributional assumptions about the LLM. It’s designed as a plug-and-play tool for practitioners to visually explore LLM reliance on specific tokens.

Result: DBSA enables users to inspect LLM inputs and identify sensitivities that may be overlooked by existing interpretability methods. Through illustrative examples, it demonstrates practical utility for understanding token-level dependencies in black-box LLMs.

Conclusion: DBSA provides a practical solution for auditing black-box LLMs by offering token-level sensitivity analysis without requiring model internals or gradient computations, making it suitable for real-world applications with inaccessible API endpoints.

Abstract: We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question – can we understand how the outputs of black-box LLMs depend on each input token? There is a critical need to have such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e., two outputs may differ by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight model-agnostic procedure to evaluate the sensitivity of the output of a language model for each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of LLMs’ reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.
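
A toy version of the idea, assuming a leave-one-token-out perturbation and a total-variation distance between empirical output distributions; `query_llm` is a hypothetical sampling call, and the paper's actual procedure is more careful than this sketch:

```python
from collections import Counter

def token_sensitivity(prompt_tokens, query_llm, n_samples=20):
    """Estimate, per input token, how much dropping it shifts the sampled
    output distribution of a black-box LLM (higher = more sensitive)."""
    def output_dist(tokens):
        samples = [query_llm(" ".join(tokens)) for _ in range(n_samples)]
        return {out: c / n_samples for out, c in Counter(samples).items()}

    base = output_dist(prompt_tokens)
    sensitivities = []
    for i in range(len(prompt_tokens)):
        ablated = output_dist(prompt_tokens[:i] + prompt_tokens[i + 1:])
        support = set(base) | set(ablated)
        tv = 0.5 * sum(abs(base.get(k, 0) - ablated.get(k, 0)) for k in support)
        sensitivities.append(tv)
    return sensitivities
```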

[29] Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Björn Deiseroth, Max Henning Höth, Kristian Kersting, Letitia Parcalabescu

Main category: cs.CL

TL;DR: The paper introduces a Merlin-Arthur training framework for RAG systems that treats retrieval as verifiable evidence rather than weak heuristics, improving groundedness and reducing hallucinations.

DetailsMotivation: Current RAG systems treat retrieval as weak heuristics rather than verifiable evidence, leading to unsupported answers, hallucinations under incomplete/misleading context, and reliance on spurious evidence.

Method: Adapts Merlin-Arthur protocol: Arthur (generator LLM) trains on questions with unknown provenance - Merlin provides helpful evidence, Morgana injects adversarial misleading context. Both use linear-time XAI to identify/modify most influential evidence. Introduces Explained Information Fraction (EIF) metric to evaluate explanation fidelity.

Result: Across three RAG datasets and two model families, M/A-trained LLMs show improved groundedness, completeness, soundness, reject behavior, and reduced hallucinations without needing manually annotated unanswerable questions. Retriever improves recall and MRR through automatically generated M/A hard positives/negatives.

Conclusion: Autonomous interactive-proof-style supervision provides principled and practical path toward reliable RAG systems that treat retrieved documents as verifiable evidence rather than suggestions.

Abstract: Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline – both the retriever and the generator – as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations – without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.

[30] Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling

Keerthana Murugaraj, Salima Lamsiyah, Marten During, Martin Theobald

Main category: cs.CL

TL;DR: BERTopic applied to historical newspaper archives (1955-2018) reveals evolving discourse on nuclear power and safety, overcoming limitations of traditional topic modeling like LDA.

DetailsMotivation: Traditional topic modeling methods (e.g., LDA) struggle with historical newspaper archives due to OCR noise, topic evolution, and large volumes. There's a need for better methods to capture dynamic discourse in historical texts.

Method: Used BERTopic, a neural topic-modeling approach leveraging transformer-based embeddings, to analyze articles from 1955-2018 on nuclear power and nuclear safety. Analyzed topic distributions and temporal evolution.

Result: Successfully extracted coherent themes, traced temporal evolution of topics, uncovered long-term trends and shifts in public discourse, including co-occurrence of nuclear power and nuclear weapons themes and their changing importance over time.

Conclusion: BERTopic demonstrates scalability and contextual sensitivity as a superior alternative to traditional topic modeling for historical research, offering richer insights into historical discourses from newspaper archives.

Abstract: Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic, a neural topic-modeling approach that leverages transformer-based embeddings to extract and classify topics and that, despite its growing popularity, remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
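
The underlying workflow is compact with the bertopic package (`pip install bertopic`; usage per its current docs). The loaders below are hypothetical placeholders, and the actual study adds OCR cleaning and corpus-specific configuration:

```python
from bertopic import BERTopic

docs = load_newspaper_articles()   # hypothetical loader -> list[str]
years = load_publication_years()   # one timestamp per article

topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)

# Trace how topic prevalence shifts across 1955-2018
topics_over_time = topic_model.topics_over_time(docs, years)
print(topic_model.get_topic_info().head())
```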

[31] Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Sergey Pankratov, Dan Alistarh

Main category: cs.CL

TL;DR: The paper establishes tight lower bounds on runtime for deterministic speculative generation in LLMs, proving theoretical limits on parallel token generation speedup.

DetailsMotivation: Speculative generation accelerates LLM inference through parallel verification, but fundamental speedup limits remain poorly understood. The paper aims to establish rigorous theoretical bounds on achievable acceleration.

Method: Draws parallel between token generation and branching random walks to analyze optimal draft tree selection. Proves mathematical bounds using entropy and log-moment analysis under basic assumptions.

Result: Proves that expected tokens predicted per speculative iteration is bounded by (μ + μ₂)log(P)/μ² + O(1), where P is verifier capacity, μ is expected entropy, and μ₂ is expected second log-moment. Empirical validation on Llama models confirms bound tightness.

Conclusion: Establishes first tight lower bounds on deterministic speculative generation runtime, providing theoretical insights into parallel token generation limits and guiding future speculative decoding system design.

Abstract: Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier’s capacity, $\mu$ is the expected entropy of the verifier’s output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
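
Plugging illustrative numbers into the bound makes it concrete (the inputs here are invented, not measured):

```python
import math

def expected_tokens_upper_bound(P, mu, mu2):
    """Evaluate E[X] <= (mu + mu_(2)) * log(P) / mu^2, ignoring the O(1) term."""
    return (mu + mu2) * math.log(P) / mu**2

# e.g. a verifier scoring P = 64 draft tokens in parallel, expected entropy
# 1.2 nats, expected second log-moment 2.0
print(expected_tokens_upper_bound(P=64, mu=1.2, mu2=2.0))  # ~9.2 tokens/iteration
```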

[32] SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support

Yuming Feng, Xinrui Jiang

Main category: cs.CL

TL;DR: SUMFORU is a steerable review summarization framework that generates personalized summaries aligned with user personas to support purchase decisions.

DetailsMotivation: Online product reviews are rich but noisy, overwhelming users. Existing LLM-based summarizers are generic and don't account for individual preferences, limiting their practical utility for personalized decision-making.

Method: Two-stage alignment framework: (1) persona-aware Supervised Fine-Tuning via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback using a preference estimator to capture persona-relevant signals. Built on Amazon 2023 Review Dataset.

Result: Achieves highest performance across all evaluation settings (rule-based, LLM-based, human-centered), showing consistent improvements in consistency, grounding, and preference alignment. Generalizes effectively to unseen product categories.

Conclusion: Demonstrates promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.

Abstract: Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.

[33] M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

Main category: cs.CL

TL;DR: M3-Embedding is a versatile embedding model supporting 100+ languages, three retrieval functions (dense, multi-vector, sparse), and different input granularities from sentences to 8k-token documents, achieving SOTA results.

DetailsMotivation: To create a unified embedding model that addresses multiple limitations in current retrieval systems: language constraints (most models support only a few languages), functional limitations (models typically specialize in one retrieval type), and granularity restrictions (handling only short texts).

Method: 1. Self-knowledge distillation: integrates relevance scores from different retrieval functionalities as teacher signals to enhance training quality. 2. Optimized batching strategy: enables large batch sizes and high training throughput to improve embedding discriminativeness. 3. Unified architecture supporting multi-linguality (100+ languages), multi-functionality (dense, multi-vector, sparse retrieval), and multi-granularity (sentences to 8,192-token documents).

Result: M3-Embedding achieves state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks, demonstrating superior performance across diverse retrieval scenarios.

Conclusion: M3-Embedding successfully addresses the three-dimensional challenge of multi-linguality, multi-functionality, and multi-granularity through innovative training techniques, establishing a new versatile embedding model that outperforms existing approaches across various retrieval tasks.

Abstract: In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. It is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve the discriminativeness of embeddings. M3-Embedding exhibits superior performance in our experiments, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
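
The released model exposes all three retrieval modes in one call via the FlagEmbedding package (usage per its README; treat the details as subject to change):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]

out = model.encode(
    sentences,
    return_dense=True,         # dense retrieval vectors
    return_sparse=True,        # lexical (sparse) weights
    return_colbert_vecs=True,  # multi-vector representations
)
print(out["dense_vecs"].shape)
```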

[34] The Expressive Capacity of State Space Models: A Formal Language Perspective

Yash Sarrof, Yana Veitsman, Michael Hahn

Main category: cs.CL

TL;DR: SSMs and transformers have overlapping but distinct capabilities in language modeling, with SSMs excelling at exact state tracking and bounded hierarchical structure while having specific expressive limitations.

DetailsMotivation: To understand the fundamental capabilities of linear state space models (SSMs) compared to transformers and traditional RNNs, providing theoretical guidance for better language model architecture design.

Method: Comprehensive theoretical study analyzing the capacity of SSMs, comparing them to transformers and traditional RNNs, with empirical verification on the Mamba SSM model.

Result: SSMs and transformers have distinct strengths: SSMs can implement exact solutions to star-free state tracking problems that transformers struggle with, and can model bounded hierarchical structure optimally. However, current SSM designs have specific expressive limitations.

Conclusion: The study provides theoretical insights into SSM capabilities, revealing both advantages over transformers and design limitations, offering guidance for future SSM and language model research.

Abstract: Recently, recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. However, there is little understanding of the in-principle abilities of such models, which could provide useful guidance to the search for better LM architectures. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement straightforward and exact solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.

[35] Dynamics of Spontaneous Topic Changes in Next Token Prediction with Self-Attention

Mumin Jia, Jairo Diaz-Rodriguez

Main category: cs.CL

TL;DR: Self-attention models lack spontaneous topic changes like human cognition; theoretical analysis shows topic changes only occur when lower-priority tokens outnumber higher-priority ones, and longer contexts reduce spontaneity - unlike humans.

DetailsMotivation: Human cognition exhibits spontaneous topic shifts driven by emotional/contextual cues, while self-attention models rely on structured input patterns. The paper aims to characterize spontaneous topic changes in self-attention architectures and compare them to human spontaneous thought.

Method: 1) Theoretical analysis using a simplified single-layer self-attention model with Token Priority Graphs (TPGs) to define topics. 2) Empirical validation on modern state-of-the-art LLMs to verify theoretical findings.

Result: 1) Self-attention maintains token priority order related to input topics. 2) Spontaneous topic changes only occur when lower-priority tokens outnumber all higher-priority tokens. 3) Unlike humans, longer context length or more ambiguous input topics reduce likelihood of spontaneous changes. 4) These dynamics persist in modern LLMs.

Conclusion: There’s a fundamental disparity between human cognition and AI behavior regarding spontaneous topic changes. Self-attention architectures lack the spontaneity of human thought, with structural constraints that make topic changes less likely with longer contexts or ambiguous inputs - opposite to human cognitive patterns.

Abstract: Human cognition is punctuated by abrupt, spontaneous shifts between topics, driven by emotional, contextual, or associative cues, a phenomenon known as spontaneous thought in neuroscience. In contrast, self-attention-based models depend on structured patterns over their inputs to predict each next token, lacking spontaneity. Motivated by this distinction, we characterize spontaneous topic changes in self-attention architectures, revealing both their similarities and their divergences from spontaneous human thought. First, we establish theoretical results under a simplified, single-layer self-attention model with suitable conditions by defining the topic as a set of Token Priority Graphs (TPGs). Specifically, we demonstrate that (1) the model maintains the priority order of tokens related to the input topic, (2) a spontaneous topic change can occur only if lower-priority tokens outnumber all higher-priority tokens of the input topic, and (3) unlike human cognition, a longer context length or a more ambiguous input topic reduces the likelihood of spontaneous change. Second, we empirically validate that these dynamics persist in modern, state-of-the-art LLMs, underscoring a fundamental disparity between human cognition and AI behaviour in the context of spontaneous topic changes. To the best of our knowledge, no prior work has explored these questions with a focus as closely aligned to human thought.

[36] Sorting the Babble in Babel: Assessing the Performance of Language Identification Algorithms on the OpenAlex Database

Maxime Holmberg Sainte-Marie, Diego Kozlowski, Lucía Céspedes, Vincent Larivière

Main category: cs.CL

TL;DR: Comparison of Python language identification algorithms for optimizing OpenAlex database indexing, finding FastSpell on Titles corpus best for recall/speed, LangID on greedy corpus best for precision.

DetailsMotivation: To optimize linguistic indexing of OpenAlex database by evaluating different language identification procedures, addressing the lack of truly multilingual large-scale bibliographic databases.

Method: Compared performance of Python-based language identification algorithms on different metadata corpora from manually-annotated article samples, analyzing precision, recall, and processing speeds, then simulated database-level performance using probabilistic confusion matrices and language frequency modeling.

Result: LangID algorithm on greedy corpus performs best when precision is prioritized; FastSpell algorithm on Titles corpus outperforms all alternatives when recall is considered important or processing times matter.

Conclusion: Results confirm OpenAlex database’s potential for cross-linguistic measurement and evaluation, with optimal algorithm selection depending on whether precision or recall/speed is prioritized.

Abstract: This project aims to optimize the linguistic indexing of the OpenAlex database by comparing the performance of various Python-based language identification procedures on different metadata corpora extracted from a manually-annotated article sample. The precision and recall performance of each algorithm, corpus, and language is first analyzed, followed by an assessment of processing speeds recorded for each algorithm and corpus type. These different performance measures are then simulated at the database level using probabilistic confusion matrices for each algorithm, corpus, and language, as well as a probabilistic modeling of relative article language frequencies for the whole OpenAlex database. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on the greedy corpus gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, the procedure that consists in the application of the FastSpell algorithm on the Titles corpus outperforms all other alternatives. Given the lack of truly multilingual large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic and comprehensive measurement and evaluation.
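
One of the compared identifiers is available off the shelf (`pip install langid`); the study's pipeline additionally varies which metadata corpus is fed in (e.g. titles alone versus a greedy concatenation):

```python
import langid

title = "Sorting the Babble in Babel"
lang, score = langid.classify(title)  # e.g. ('en', ...); score is an unnormalized log-probability
print(lang, score)
```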

[37] Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models

Prateek Chhikara

Main category: cs.CL

TL;DR: LLMs are often overconfident in factual QA tasks; adding distractors to prompts improves calibration significantly, with up to 460% relative accuracy improvement and 90% ECE reduction. Larger RLHF models have inherent calibration strengths but can be worse on easy questions, while smaller models benefit more from distractors but remain miscalibrated.

DetailsMotivation: LLMs show remarkable proficiency but suffer from overconfidence (misalignment between predicted confidence and true correctness), which poses significant risks in critical decision-making applications where reliable confidence estimates are essential.

Method: Comprehensive analysis across 9 LLMs and 3 factual QA datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Detailed evaluation of calibration using metrics like Expected Calibration Error (ECE).

Result: Explicitly incorporating distractors substantially mitigates miscalibration, achieving relative accuracy improvements up to 460% and ECE reductions up to 90%. Large RLHF-tuned models show inherent calibration strengths but paradoxically suffer increased miscalibration on easier queries. Smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Persistent calibration failures identified, particularly in person-based queries.

Conclusion: Concrete recommendations for reliable LLM deployments: targeted fine-tuning, structured prompting with distractors, and strategic model choice based on specific use cases and calibration requirements.

Abstract: Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (a misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements up to 460% and ECE reductions up to 90%. Despite general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
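
The ECE figures above follow the standard binned definition, sketched here on illustrative arrays:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and average the
    |accuracy - confidence| gap, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```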

[38] Scalable Best-of-N Selection for Large Language Models via Self-Certainty

Zhewei Kang, Xuandong Zhao, Dawn Song

Main category: cs.CL

TL;DR: Self-certainty is a novel, reward-free metric that uses LLMs’ own probability distributions to select high-quality responses, scaling effectively with sample size and working well for both reasoning and open-ended tasks.

DetailsMotivation: Current methods for improving LLM reasoning through Best-of-N selection either require computationally expensive reward models or have limitations with open-ended generation tasks and scalability. There's a need for an efficient, reward-free alternative that can handle diverse tasks effectively.

Method: Proposes self-certainty metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without external reward models. The approach aggregates distributional self-certainty across multiple samples, hypothesizing that higher certainty correlates with better response accuracy.

Result: Self-certainty (1) scales effectively with increasing sample size N like reward models but without computational overhead, (2) complements chain-of-thought to improve reasoning beyond greedy decoding, and (3) generalizes to open-ended tasks where traditional self-consistency methods fail.

Conclusion: Self-certainty establishes a practical and efficient way to improve LLM reasoning capabilities, offering a reward-free alternative that works across diverse tasks while maintaining computational efficiency.

Abstract: Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
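
One plausible instantiation of the metric, scoring a response by how far each next-token distribution sits from uniform and averaging over tokens (higher = more certain); the exact formula may differ from the released code:

```python
import numpy as np

def self_certainty(token_logprob_dists):
    """token_logprob_dists: list of [vocab] arrays of log-probabilities,
    one per generated token. Returns mean KL(p || uniform)."""
    scores = []
    for logp in token_logprob_dists:
        p = np.exp(logp)
        # KL(p || uniform) = log V - H(p), with H(p) = -sum p * log p
        scores.append(np.log(len(p)) + np.sum(p * logp))
    return float(np.mean(scores))

# Best-of-N selection: sample N responses and keep the highest-scoring one.
```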

[39] PUMA: Discovery of Protein Units via Mutation-Aware Merging

Burak Suyunu, Özdeniz Dolu, Ibukunoluwa Abigail Olaosebikan, Hacer Karatas Bristow, Arzucan Özgür

Main category: cs.CL

TL;DR: PUMA discovers evolutionarily meaningful protein units through mutation-aware merging, creating hierarchical families that correlate with biological function and clinical variants.

DetailsMotivation: Proteins can be viewed as a "language of life" with amino acids as alphabet, but we need to identify fundamental units analogous to words that reflect evolutionary relationships and provide structure for understanding protein function.

Method: PUMA uses iterative merging algorithm guided by substitution matrices to identify protein units and organize them into families linked by plausible mutations, creating hierarchical genealogy with parent units and mutational variants.

Result: PUMA families are biologically meaningful: mutations within families correlate with clinically benign variants and high-scoring mutations in high-throughput assays, align with protein language model preferences, and map to known functional annotations.

Conclusion: PUMA provides evolutionarily grounded units with genealogical framework, offering structured approach for understanding the language of life through biologically meaningful protein families.

Abstract: Proteins are the essential drivers of biological processes. At the molecular level, they are chains of amino acids that can be viewed through a linguistic lens where the twenty standard residues serve as an alphabet combining to form a complex language, referred to as the language of life. To understand this language, we must first identify its fundamental units. Analogous to words, these units are hypothesized to represent an intermediate layer between single residues and larger domains. Crucially, just as protein diversity arises from evolution, these units should inherently reflect evolutionary relationships. We introduce PUMA (Protein Units via Mutation-Aware Merging) to discover these evolutionarily meaningful units. PUMA employs an iterative merging algorithm guided by substitution matrices to identify protein units and organize them into families linked by plausible mutations. This process creates a hierarchical genealogy where parent units and their mutational variants coexist, simultaneously producing a unit vocabulary and the genealogical structure connecting them. We validate that PUMA families are biologically meaningful; mutations within a PUMA family correlate with clinically benign variants and with high-scoring mutations in high-throughput assays. Furthermore, these units align with the contextual preferences of protein language models and map to known functional annotations. PUMA’s genealogical framework provides evolutionarily grounded units, offering a structured approach for understanding the language of life.
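
A toy rendering of mutation-aware merging: count adjacent unit pairs as in byte-pair encoding, but let a substitution matrix fold pairs that differ by one plausible mutation into a shared family. The two-entry matrix and threshold below are invented; PUMA uses full substitution-matrix scores and a more elaborate genealogy:

```python
from collections import Counter

SUBST = {("L", "I"): 2, ("K", "R"): 2}  # symmetric high-scoring swaps (toy)

def substitutable(a, b):
    return a == b or SUBST.get((a, b), SUBST.get((b, a), -4)) >= 1

def family_counts(sequences):
    pair_counts = Counter()
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            pair_counts[(x, y)] += 1
    families = {}
    for pair, n in pair_counts.items():
        # Fold the pair into an existing family reachable by a plausible mutation
        for rep in families:
            if all(substitutable(a, b) for a, b in zip(pair, rep)):
                families[rep] += n
                break
        else:
            families[pair] = n
    return families

print(family_counts(["MKLV", "MRLV"]))  # (K,L) and (R,L) land in one family
```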

[40] MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding

Zhoutong Ye, Mingze Sun, Huan-ang Gao, Xutong Wang, Xiangyang Wang, Yu Mei, Chang Liu, Qinwei Li, Chengwen Zhang, Qinghuan Lan, Chun Yu, Yuanchun Shi

Main category: cs.CL

TL;DR: MOAT is a new benchmark with 1005 complex real-world vision questions that challenge LMMs by requiring integration of multiple VL capabilities, revealing current models perform poorly (44% accuracy) despite being easy for humans.

DetailsMotivation: Current large multimodal models (LMMs) struggle with real-world tasks requiring combined vision-language capabilities and grounding of complex instructions, creating a gap between their potential and practical adoption.

Method: Created MOAT benchmark with 1005 diverse real-world vision questions organized around a taxonomy of 9 VL capabilities, then evaluated 17 proprietary and open-source LMMs to analyze performance patterns and failure causes.

Result: Best performing LMM (Gemini 2.5 Pro) achieved only 44% accuracy, far below acceptable real-world standards. Analysis revealed bottlenecks in text-centric reasoning, specific VL capability limitations, and potential harmful effects of tiling.

Conclusion: MOAT exposes significant limitations in current LMMs for complex real-world tasks, providing a fine-grained diagnostic tool to guide future model development toward better integration of vision-language capabilities.

Abstract: Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, adoption of LMMs in real-world tasks is hindered by their poor performance in tasks that require a combination of VL capabilities, as well as in tasks that involve the grounding of complex text or visual instructions. To thoroughly investigate this gap and its underlying causes, we propose MOAT, a diverse benchmark with 1005 complex real-world vision questions that are straightforward for humans but challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 9 VL capabilities, enabling MOAT to provide a fine-grained view of LMMs’ strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs’ ability to ground complex text and visual instructions, which is essential for many real-world applications. We evaluated 17 proprietary and open source LMMs, finding that the best performing LMM (Gemini 2.5 Pro) achieved only 44% accuracy, far below what would be acceptable in real-world applications. To guide future model development, we analyze common trends in our results and discuss the underlying causes of poor performance, focusing on the impact of text-centric reasoning, which VL capabilities form bottlenecks in complex tasks, and the potential harmful effects of tiling. Code and data are available at https://cambrian-yzt.github.io/MOAT/.

[41] Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA

Abhijeet Sahdev

Main category: cs.CL

TL;DR: A novel statistical method using ASCII codes and PCA compression to analyze sentence structure balance across 11 text corpora without traditional syntactic tools.

DetailsMotivation: Traditional syntactic tools like POS tagging are complex and challenging in NLP. The study aims to understand sentence structure balance (usage of nouns, verbs, determiners, etc.) harmoniously without relying on such complex tools.

Method: Proposes a statistical method using ASCII codes to represent text from 11 diverse corpora, applying PCA compression to analyze lexical category alignment, and evaluating results through histograms and normality tests (Shapiro-Wilk and Anderson-Darling).

Result: Grok-generated story shows near normality indicating balanced sentence structures in LLM outputs, while only 4 out of the remaining 10 corpora pass the normality tests.

Conclusion: The ASCII-based approach simplifies text processing and complements syntactic tools as a resource-efficient method for assessing text balance, with potential applications in text quality evaluation and style analysis when integrated with syntactic approaches.

Abstract: While utilizing syntactic tools such as parts-of-speech (POS) tagging has helped us understand sentence structures and their distribution across diverse corpora, it is quite complex and poses a challenge in natural language processing (NLP). This study focuses on understanding sentence structure balance (the harmonious usage of nouns, verbs, determiners, etc.) without relying on such tools. It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent the text of 11 corpora from various sources, compresses these representations through PCA, examines their lexical category alignment, and analyzes the results through histograms and normality tests such as the Shapiro-Wilk and Anderson-Darling tests. By focusing on ASCII codes, this approach simplifies text processing; it does not replace syntactic tools but complements them as a resource-efficient method for assessing text balance. The story generated by Grok shows near normality, indicating balanced sentence structures in LLM outputs, whereas 4 of the remaining 10 corpora pass the normality tests. Further research could explore potential applications in text quality evaluation and style analysis, with syntactic integration for broader tasks.
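
A compressed version of the pipeline on toy sentences (corpus handling and feature construction are simplified here):

```python
import numpy as np
from scipy.stats import anderson, shapiro
from sklearn.decomposition import PCA

sentences = ["The cat sat.", "Dogs bark loudly.", "Rain fell all night.",
             "She reads books.", "Birds sing at dawn.", "He cooks dinner."]
max_len = max(len(s) for s in sentences)
# ASCII-encode each sentence, padding with spaces to equal length
X = np.array([[ord(c) for c in s.ljust(max_len)] for s in sentences], dtype=float)

pc1 = PCA(n_components=1).fit_transform(X).ravel()  # compressed representation
print(shapiro(pc1))               # Shapiro-Wilk statistic and p-value
print(anderson(pc1, dist="norm"))
```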

[42] Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

Sophia Hager, David Mueller, Kevin Duh, Nicholas Andrews

Main category: cs.CL

TL;DR: Uncertainty distillation: A method to teach LLMs to verbalize calibrated semantic confidences that correlate well with actual error rates, outperforming baselines in both effectiveness and efficiency.

DetailsMotivation: As LLMs are increasingly used for factual QA, they need to communicate meaningful uncertainty estimates. Current LLMs have inconsistent error rates relative to their expressed confidences, highlighting the need for better uncertainty quantification methods.

Method: Uncertainty distillation: Using held-out data to map initial uncertainty estimates to meaningful probabilities, then creating examples annotated with verbalized probabilities for supervised fine-tuning. Can be applied to black-box models via API-based fine-tuning.

Result: The method yields verbalized confidences that correlate well with observed error rates, outperforming strong baselines that are 20+ times slower at inference. Works effectively and efficiently even for black-box models.

Conclusion: Uncertainty distillation successfully teaches LLMs to verbalize calibrated semantic confidences, providing meaningful uncertainty estimates that align with actual error rates while being more efficient than alternative approaches.

Abstract: As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model’s confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model’s confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We find that our method yields verbalized confidences that correlate well with observed error rates, even when compared to strong baselines, some of which are more than twenty times slower at inference time. Additionally, we demonstrate that our method can be applied to black-box models that allow API-based fine-tuning, resulting in estimates of uncertainty that are both more effective and more efficient than any of our baselines.
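
The mapping step lends itself to a short sketch: on held-out data, bin raw uncertainty scores and use each bin's empirical accuracy as the verbalized probability in the fine-tuning targets. The quantile binning below is our assumption, not necessarily the paper's:

```python
import numpy as np

def fit_confidence_map(raw_scores, correct, n_bins=10):
    """Return a function mapping a raw uncertainty score to the empirical
    accuracy of its held-out bin (the probability to verbalize)."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.quantile(raw_scores, np.linspace(0, 1, n_bins + 1))

    def to_probability(score):
        b = int(np.clip(np.searchsorted(edges, score) - 1, 0, n_bins - 1))
        mask = (raw_scores >= edges[b]) & (raw_scores <= edges[b + 1])
        return float(correct[mask].mean())

    return to_probability

# SFT target, e.g.: f"{answer} (confidence: {to_probability(score):.0%})"
```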

[43] Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs

Mikhail Menschikov, Alexander Kharitonov, Maiia Kotyga, Vadim Porvatov, Anna Zhukovskaya, David Kagramanyan, Egor Shvetsov, Evgeny Burnaev

Main category: cs.CL

TL;DR: This paper studies position bias in LLMs across multiple languages and models, finding that bias patterns vary by model, explicit instructions can reduce accuracy, and models remain confident even when failing to use mid-context information.

DetailsMotivation: While LLMs are known to exhibit position bias (systematically underweighting information based on its location), it's unclear how this bias varies across different languages and model architectures, and how it interacts with prompting strategies.

Method: Conducted a multilingual study across five typologically diverse languages (English, Russian, German, Hindi, Vietnamese) and five model architectures, analyzing how position bias interacts with prompting strategies and affects output entropy.
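
To make the setup concrete, a sweep like the one below moves the gold passage through each position among distractors and records accuracy per position; the prompt template, the `ask_model` callable, and the substring-match scoring are assumptions rather than the paper's protocol.

```python
def build_prompt(query, relevant, distractors, position):
    """Place the relevant passage at a chosen index among the distractors."""
    passages = distractors[:position] + [relevant] + distractors[position:]
    context = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {query}\nAnswer:"

def position_sweep(ask_model, examples, n_positions):
    """Measure accuracy as a function of where the gold passage appears.

    ask_model(prompt) -> answer string; examples is a list of
    (query, relevant, distractors, gold) tuples.
    """
    accuracy = []
    for pos in range(n_positions):
        hits = sum(
            gold.lower() in ask_model(build_prompt(q, rel, dis, pos)).lower()
            for q, rel, dis, gold in examples
        )
        accuracy.append(hits / len(examples))
    return accuracy  # a dip in the middle signals "lost in the middle" bias
```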

Result: (1) Position bias is primarily model-driven with language-specific nuances; some models consistently favor late positions, challenging the assumption of universal early-token preference. (2) Explicit instructions about relevant context markers unexpectedly reduce accuracy across all languages. (3) Accuracy drops most when relevant information appears in the middle, but output entropy doesn’t increase correspondingly, suggesting models remain confident even when failing.

Conclusion: Position bias in LLMs is complex and model-dependent, standard prompt-engineering practices may be counterproductive, and models maintain confidence even when failing to use contextual information effectively, highlighting the need for more nuanced understanding of LLM behavior across languages.

Abstract: Large Language Models (LLMs) exhibit position bias, systematically underweighting information based on its location in the context, but how this bias varies across languages and models remains unclear. We conduct a multilingual study across five typologically diverse languages (English, Russian, German, Hindi, Vietnamese) and five model architectures, analyzing how position bias interacts with prompting strategies and affects output entropy. Our key findings are: (1) Position bias is primarily model-driven but shows language-specific nuances. Notably, Qwen2.5-7B-Instruct, DeepSeek 7B Chat and Mistral 7B consistently favor late positions, challenging the common assumption of universal early-token preference. (2) Explicitly instructing the model, in the presence of irrelevant distractors, that “the most relevant context to the query is marked as 1” unexpectedly reduces accuracy across all languages, questioning standard prompt-engineering practices. (3) Accuracy consistently drops most when relevant information appears in the middle of the context, yet this is not reflected in a corresponding increase in output entropy, suggesting the model remains confident even when it fails to use mid-context cues.

[44] SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer, Petr Motlicek

Main category: cs.CL

TL;DR: SDialog is an open-source Python toolkit that provides an end-to-end framework for building and analyzing LLM-based conversational agents, integrating dialog generation, evaluation, and mechanistic interpretability.

DetailsMotivation: There's a need for a unified framework that combines dialog generation, evaluation, and interpretability tools to enable systematic research and development of conversational AI systems.

Method: Built around a standardized Dialog representation, SDialog offers: 1) persona-driven multi-agent simulation with composable orchestration, 2) comprehensive evaluation with linguistic metrics, LLM-as-a-judge, and functional correctness validators, 3) mechanistic interpretability tools for activation inspection and steering, and 4) audio generation with full acoustic simulation.

Result: SDialog provides a complete toolkit that integrates with all major LLM backends, enabling mixed-backend experiments under a unified API for building, benchmarking, and understanding conversational systems.

Conclusion: By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark, and understand conversational systems more systematically.

Abstract: We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

[45] Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

Jiaxin Deng, Qingcheng Zhu, Junbiao Pang, Linlin Yang, Zhongqian Fu, Baochang Zhang

Main category: cs.CL

TL;DR: FMLoRA and EFMLoRA improve LoRA generalization by seeking flat minima through theoretical transfer of perturbations from full parameter space to low-rank subspace, achieving better performance with comparable efficiency.

DetailsMotivation: Little research explores correlation between expressive ability and generalization in LoRA, and connection between sharpness and generalization hasn't been explored for LoRA due to lack of tools to seek flat minima.

Method: Propose Flat Minima LoRA (FMLoRA) and efficient EFMLoRA that theoretically demonstrate perturbations in full parameter space can be transferred to low-rank subspace, eliminating interference from perturbations across multiple matrices.
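
The paper's exact transfer of full-space perturbations into the low-rank subspace is not reproduced here; as a minimal PyTorch sketch, the generic sharpness-aware minimization (SAM) step below is restricted to LoRA parameters, under the assumption that they carry "lora_" in their parameter names.

```python
import torch

def sam_step_on_lora(model, loss_fn, batch, rho=0.05):
    """One sharpness-aware update restricted to LoRA parameters (generic SAM,
    not the paper's exact EFMLoRA derivation)."""
    lora = [p for n, p in model.named_parameters()
            if "lora_" in n and p.requires_grad]

    loss_fn(model, batch).backward()  # gradients at the current point

    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in lora])) + 1e-12
    with torch.no_grad():
        eps = [rho * p.grad / grad_norm for p in lora]
        for p, e in zip(lora, eps):
            p.add_(e)  # ascend to the worst-case point in an L2 ball
    model.zero_grad()

    loss_fn(model, batch).backward()  # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(lora, eps):
            p.sub_(e)  # restore weights; optimizer.step() then uses p.grad
```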

Result: EFMLoRA achieves comparable optimization efficiency to LoRA while attaining better performance: 1.0% improvement over LoRA and 0.5% over full fine-tuning on GLUE with RoBERTa-large; 1.5% and 1.0% improvements on SQA and VizWiz datasets with Qwen-VL-Chat.

Conclusion: Generalization of LoRA is closely related to sharpness, which was omitted by previous methods, and EFMLoRA effectively addresses this by seeking flat minima in low-rank adaptation.

Abstract: Little research explores the correlation between the expressive ability and generalization ability of the low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat Minima LoRA (FMLoRA) and its efficient version, i.e., EFMLoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFMLoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFMLoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, there are performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is omitted by previous methods.

[46] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning

Kushan Mitra, Dan Zhang, Hannah Kim, Estevam Hruschka

Main category: cs.CL

TL;DR: RECAP is a new benchmark for evaluating intent rewriting in conversational AI, converting ambiguous dialogues into clear user goals to improve agent planning.

DetailsMotivation: Real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection challenging for conversational assistants. Traditional classification approaches struggle in open-ended settings, leading to brittle interpretations and poor downstream planning.

Method: Proposed RECAP benchmark with diverse challenges (ambiguity, intent drift, vagueness, mixed-goal conversations) and LLM-based evaluator for planning utility. Developed prompt-based rewriting approach and fine-tuned DPO-based rewriters.

Result: Prompt-based rewriting approach outperforms baselines in plan preference. Fine-tuning two DPO-based rewriters yields additional utility gains.

Conclusion: Intent rewriting is a critical and tractable component for improving agentic planning in open-domain dialogue systems, with RECAP providing a valuable evaluation framework.

Abstract: Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines in terms of plan preference. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agentic planning in open-domain dialogue systems.

[47] MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev

Main category: cs.CL

TL;DR: MixtureVitae is a legally-safe pretraining corpus combining public-domain, permissively licensed, and justified low-risk sources with instruction/reasoning data, achieving strong performance while minimizing legal risks.

DetailsMotivation: To create a practical, legally mitigated foundation for training capable LLMs that reduces reliance on indiscriminate web scraping while maintaining competitive performance.

Method: Risk-mitigated sourcing strategy combining public-domain/permissively licensed text with justified low-risk additions, plus instruction/reasoning/synthetic data. Multi-stage pipeline for license-aware filtering, safety/quality screening, and domain-aware mixing.
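
As a toy illustration of the permissive-first policy, the filter below keeps only documents whose metadata passes a license whitelist or a documented low-risk category; the field names and category labels are assumptions, not MixtureVitae's actual schema.

```python
PERMISSIVE = {"public-domain", "cc-by", "cc-by-sa", "apache-2.0", "mit"}
LOW_RISK = {"government-work", "eu-tdm-eligible"}  # illustrative categories

def license_filter(docs):
    """Yield documents that pass a permissive-first sourcing policy.

    Each doc is assumed to carry 'license' and 'provenance_documented' fields;
    survivors would still go through safety/quality screening and mixing.
    """
    for doc in docs:
        lic = doc.get("license", "").lower()
        if lic in PERMISSIVE:
            yield doc
        elif lic in LOW_RISK and doc.get("provenance_documented", False):
            yield doc
        # everything else is dropped
```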

Result: Models trained on MixtureVitae consistently outperform other permissive datasets across benchmarks, with particularly strong math/code performance and competitive QA results. At 1.7B/300B setting, they surpass FineWeb-Edu and approach DCLM in later training stages.

Conclusion: Permissive-first, risk-mitigated data provides a practical and legally safe foundation for training capable LLMs without sacrificing competitiveness, reducing reliance on indiscriminate web scraping.

Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae

[48] Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova

Main category: cs.CL

TL;DR: Mirror-SD breaks the latency-acceptance tradeoff in speculative decoding by using parallel heterogeneous execution (GPU+NPU) with dual speculation pipelines and multi-token streaming, achieving 2.8x-5.8x speedups.

DetailsMotivation: Current speculative decoding methods face a fundamental tradeoff: increasing draft size improves acceptance rates but adds latency overhead, limiting speed gains. Prior methods like Medusa, Hydra, and EAGLE reduce draft cost but degrade acceptance or introduce overheads that prevent scaling.

Method: Mirror-SD uses: 1) Branch-complete rollouts from early-exit signals in parallel with target model’s suffix, 2) Explicit mapping across heterogeneous accelerators (GPU and NPU) for cross-device parallelism, 3) Dual speculation pipelines where draft speculates forward continuations while target speculates correction paths, 4) Speculative streaming allowing draft to emit multiple tokens per step to reduce latency without weakening acceptance.
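
For orientation, here is the basic draft-then-verify loop that speculative decoding builds on, in a simplified greedy-verification form; Mirror-SD's dual pipelines, early-exit rollouts, and GPU/NPU mapping are not modeled, and both callables are hypothetical.

```python
import torch

def speculative_step(draft_logits_fn, target_logits_fn, prefix, k=4):
    """One generic draft-then-verify step (greedy verification).

    *_logits_fn(tokens) -> logits over the vocabulary for the next position.
    Returns the prefix extended by the accepted tokens.
    """
    proposal = []
    for _ in range(k):  # the draft model proposes k tokens autoregressively
        logits = draft_logits_fn(prefix + proposal)
        proposal.append(int(torch.argmax(logits)))

    accepted = []
    for tok in proposal:  # the target verifies them left to right
        best = int(torch.argmax(target_logits_fn(prefix + accepted)))
        accepted.append(best)
        if best != tok:   # disagreement: keep the target's token and stop
            break
    return prefix + accepted
```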

Result: On SpecBench with server-scale models (14B to 66B parameters), Mirror-SD achieves 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline (EAGLE3).

Conclusion: Mirror-SD successfully breaks the latency-acceptance tradeoff in speculative decoding through parallel heterogeneous execution and multi-token speculative streaming, pushing speculative decoding toward its ideal regime of high acceptance with low overhead.

Abstract: Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model’s suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.

[49] Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention

Shibing Mo, Haoyang Ruan, Kai Wu, Jing Liu

Main category: cs.CL

TL;DR: TSAN is a test-time preference optimization method that uses natural language self-attention to analyze and synthesize strengths from multiple candidate responses without parameter updates, outperforming supervised models with just 3 iterations.

DetailsMotivation: Current test-time methods for aligning LLMs with human preferences typically critique and revise single candidate responses, lacking a principled mechanism to systematically analyze, weigh, and synthesize strengths from multiple promising candidates that may excel in different aspects.

Method: TSAN emulates self-attention entirely in natural language: formats multiple candidates into textual keys and values, weighs their relevance using an LLM-based attention module, and synthesizes their strengths into a new preference-aligned response through textual gradient space optimization.
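
A hedged sketch of how candidates could be formatted into textual keys and values for an LLM "attention" pass; the prompt wording is an assumption, not TSAN's released prompt.

```python
def textual_attention_prompt(query, candidates):
    """Build one prompt that asks an LLM to weigh candidate responses and
    synthesize their strengths, mimicking attention in natural language."""
    blocks = "\n\n".join(
        f"Candidate {i} (key/value):\n{cand}"
        for i, cand in enumerate(candidates, start=1)
    )
    return (
        f"Query: {query}\n\n{blocks}\n\n"
        "For each candidate, note which aspects (clarity, factual accuracy, "
        "tone) it handles best and assign a relevance weight in [0, 1]. "
        "Then write one response that combines the highest-weighted strengths."
    )
```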

Result: With just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses current state-of-the-art test-time alignment methods by effectively leveraging multiple candidate solutions.

Conclusion: TSAN introduces a novel paradigm for test-time preference optimization that operates without parameter updates, enabling interpretable iterative optimization through textual self-attention to combine the best elements of multiple candidate responses.

Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities, but aligning their outputs with human preferences typically requires expensive supervised fine-tuning. Recent test-time methods leverage textual feedback to overcome this, but they often critique and revise a single candidate response, lacking a principled mechanism to systematically analyze, weigh, and synthesize the strengths of multiple promising candidates. Such a mechanism is crucial because different responses may excel in distinct aspects (e.g., clarity, factual accuracy, or tone), and combining their best elements may produce a far superior outcome. This paper proposes the Textual Self-Attention Network (TSAN), a new paradigm for test-time preference optimization that requires no parameter updates. TSAN emulates self-attention entirely in natural language to overcome this gap: it analyzes multiple candidates by formatting them into textual keys and values, weighs their relevance using an LLM-based attention module, and synthesizes their strengths into a new, preference-aligned response under the guidance of the learned textual attention. This entire process operates in a textual gradient space, enabling iterative and interpretable optimization. Empirical evaluations demonstrate that with just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses the current state-of-the-art test-time alignment method by effectively leveraging multiple candidate solutions.

[50] Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning

Darshan Fofadiya

Main category: cs.CL

TL;DR: The Idea-Gated Transformer separates semantic planning from syntactic generation using an auxiliary Idea Head that predicts future bag-of-words distributions, creating Concept Vectors that gate vocabulary during generation to prevent topic drift.

DetailsMotivation: Autoregressive LLMs trained on Next-Token Prediction suffer from Topic Drift - generation wanders away from initial prompts due to reliance on local associations rather than global planning. Scaling model size helps but doesn't solve the fundamental myopia of NTP.

Method: Introduces Idea-Gated Transformer with auxiliary Idea Head trained to predict bag-of-words distribution for future context windows, creating latent “Concept Vectors.” Uses differentiable gating mechanism that suppresses semantically irrelevant tokens in real-time, pruning the search space during generation.
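
One plausible differentiable realization of the gate, assuming an Idea Head that scores every vocabulary token from the current hidden state; the additive log-gate form is illustrative rather than the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def gated_next_token_logits(hidden, lm_head, idea_head, temperature=1.0):
    """Gate next-token logits with an 'idea' relevance score per token.

    hidden: [d] decoder state; lm_head and idea_head: Linear(d -> vocab_size).
    Tokens the idea head deems off-topic receive a large negative offset.
    """
    logits = lm_head(hidden)                 # syntactic next-token scores
    gate = torch.sigmoid(idea_head(hidden))  # semantic relevance in (0, 1)
    gated = logits + torch.log(gate + 1e-9)  # suppress irrelevant tokens
    return F.log_softmax(gated / temperature, dim=-1)
```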

Result: On WikiText-103, achieves comparable validation perplexity to GPT-2 baseline but exhibits significantly superior Domain Retention. Gating mechanism successfully locks generation into specific semantic clusters (Finance, Science) and resists associative drift.

Conclusion: The Idea-Gated Transformer offers a parameter-efficient path toward more controllable language modeling by separating semantic planning from syntactic generation, effectively preventing topic drift while maintaining generation quality.

Abstract: Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from Topic Drift, where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning. While scaling model size mitigates this, the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary Idea Head trained to predict the bag-of-words distribution for a future context window, creating a latent “Concept Vector” that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.

[51] Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5

Huey Sun, Anabel Yong, Lorenzo Gilly, Felipe Jin

Main category: cs.CL

TL;DR: The paper analyzes why encoder-decoder models like FLAN-T5 fail when instructions conflict with memorized training patterns, introduces a gradient-based activation-steering method to fix this, and achieves dramatic improvement from 52% to 99.7% on MemoTrap tasks.

DetailsMotivation: Encoder-decoder models like FLAN-T5 often fail when instructions conflict with memorized continuations ingrained during training, creating a reliability problem where models prioritize memorized patterns over following new instructions.

Method: 1) Adapt DoLa to FLAN-T5 to examine representation evolution in decoder layers; 2) Introduce gradient-based activation-steering method that injects an “instruction-compliance” direction into mid-decoder layers where representations are meaningful yet malleable.
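
A generic activation-steering sketch: a PyTorch forward hook adds a fixed direction to a mid-decoder layer's output. How the paper derives the "instruction-compliance" direction from gradients is omitted here, and the scale `alpha` is an assumed hyperparameter.

```python
import torch

def add_steering_hook(decoder_layer, direction, alpha=4.0):
    """Steer a decoder layer by adding a unit-norm direction to its output.

    Works for layers returning either a tensor or a tuple whose first element
    is the hidden state (as in Hugging Face T5 blocks). Call .remove() on the
    returned handle to undo the intervention.
    """
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden)  # match dtype/device
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return decoder_layer.register_forward_hook(hook)
```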

Result: The intervention dramatically improves MemoTrap performance from 52% to 99.7%, demonstrating that mechanistic steering can succeed where contrastive decoding fails in Seq2Seq architectures.

Conclusion: Mechanistic steering through gradient-based activation injection in mid-decoder layers effectively resolves the instruction-memorization conflict in encoder-decoder models, providing a more reliable solution than contrastive decoding approaches.

Abstract: Encoder-decoder models such as FLAN-T5 are finetuned to follow instructions, but often fail when the instructions conflict with memorized continuations ingrained during training. To understand this behavior, we adapt DoLa to FLAN-T5 and examine how representations evolve in the decoder. Our findings show that T5’s intermediate layers undergo rapid shifts driven by cross-attention to the encoder. When projected through the language modeling head, each depth presents highly volatile token preferences, leading to unreliable behavior with contrastive decoding. Motivated by this, we introduce a gradient-based activation-steering method that injects an “instruction-compliance” direction into mid-decoder layers, where the representation is both meaningful and still malleable. This intervention dramatically improves MemoTrap performance (52% to 99.7%), demonstrating that mechanistic steering can succeed where contrastive decoding fails in Seq2Seq architectures.

[52] DeepSeek’s WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting

James Luther, Donald Brown

Main category: cs.CL

TL;DR: This paper analyzes cultural alignment of various LLMs (DeepSeek-V3, V3.1, GPT-4, GPT-4.1, GPT-4o, GPT-5) with US and Chinese cultures using Hofstede’s VSM13 surveys and cultural prompting strategies.

DetailsMotivation: As LLMs become more human-like in text generation, their cultural alignment becomes crucial for human-computer interaction. Understanding how well these models reflect different cultural perspectives is important for their effective deployment across diverse cultural contexts.

Method: The study uses Hofstede’s VSM13 international surveys to assess cultural alignment. It employs a combination of prompt language (English/Simplified Chinese) and cultural prompting (system prompts that shift model alignment to specific countries) to align LLMs with US and Chinese cultures.

Result: DeepSeek-V3, V3.1, and GPT-5 show close alignment with US survey responses but not with China, even with cultural prompts. GPT-4 aligns closer to China when prompted in English, but cultural prompting can shift it toward US alignment. GPT-4o and GPT-4.1 respond to both prompt language and cultural prompting to achieve acceptable alignments with both cultures.

Conclusion: Different LLMs exhibit varying degrees of cultural alignment, with some showing inherent biases toward specific cultures. Cultural prompting strategies can effectively shift model alignment, but effectiveness varies by model. This highlights the importance of considering cultural factors in LLM development and deployment.

Abstract: Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede’s VSM13 international surveys to understand the cultural alignment of the following models: DeepSeek-V3, V3.1, GPT-4, GPT-4.1, GPT-4o, and GPT-5. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model’s alignment to reflect a specific country, to align these LLMs with the United States and China. Our results show that DeepSeek-V3, V3.1, and OpenAI’s GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.

[53] Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Yining Yang, Ben Maurer, Wenlin Chen, David Recordon, Yilun Du, Minlan Yu, Ying Zhang

Main category: cs.CL

TL;DR: Confucius Code Agent (CCA) is an open-source AI software engineer that achieves state-of-the-art performance (54.3% on SWE-Bench-Pro) while addressing industrial-scale challenges like massive repository reasoning, durable memory, and robust toolchain coordination.

DetailsMotivation: Existing open-source coding agents lack industrial-scale capabilities, while proprietary agents lack extensibility and transparency. There's a need for an open-source solution that bridges the gap between research prototypes and production-grade systems.

Method: Built on Confucius SDK with three perspectives: Agent Experience (AX), User Experience (UX), Developer Experience (DX). Features include hierarchical working memory for long-context reasoning, persistent note-taking for cross-session learning, modular extensions for tool use, and a meta-agent for automated configuration synthesis and refinement.

Result: CCA achieves state-of-the-art Resolve@1 performance of 54.3% on SWE-Bench-Pro, substantially improving over prior coding agents and demonstrating industrial-scale capabilities.

Conclusion: Confucius SDK and CCA provide a transparent, extensible foundation for AI agents that bridges research and production, supporting industrial-scale agent development and deployment while maintaining open-source accessibility.

Abstract: Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

[54] Textual Data Bias Detection and Mitigation – An Extensible Pipeline with Experimental Evaluation

Rebekka Görge, Sujan Sai Gannamaneni, Tabea Naeven, Hammam Abdelwahab, Héctor Allende-Cid, Armin B. Cremers, Lennard Helmer, Michael Mock, Anna Schmitz, Songkai Xue, Elif Yildirir, Maximilian Poretschkin, Stefan Wrobel

Main category: cs.CL

TL;DR: Proposes a comprehensive pipeline for detecting and mitigating two types of data bias (representation bias and explicit stereotypes) in LLM training data, with evaluation showing successful data debiasing but inconsistent model bias reduction.

DetailsMotivation: LLM training data contains harmful biases including representation imbalances and stereotypes, but practical guidance for bias mitigation is lacking despite regulatory requirements like the EU AI Act.

Method: Four-component pipeline: 1) LLM-generated word lists for group label detection, 2) Demographic Representation Score for representation bias quantification, 3) Sociolinguistically informed filtering for stereotype detection/mitigation, 4) Grammar- and Context-Aware Counterfactual Data Augmentation for representation bias compensation.
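
The Demographic Representation Score itself is not defined in this summary, so the sketch below uses divergence from a uniform distribution over word-list mention counts as an illustrative stand-in; the lexicons and the KL form are assumptions.

```python
from collections import Counter
import math

def representation_imbalance(texts, group_lexicons):
    """Count group-label mentions and score imbalance against uniform.

    group_lexicons: {group_name: set of lowercase label words}, e.g. the
    LLM-generated word lists from the pipeline's first component.
    """
    counts = Counter({g: 0 for g in group_lexicons})
    for text in texts:
        tokens = text.lower().split()
        for group, words in group_lexicons.items():
            counts[group] += sum(tok in words for tok in tokens)

    total = sum(counts.values()) or 1
    k = len(group_lexicons)
    # KL divergence from uniform: 0 means perfectly balanced representation.
    kl = sum((c / total) * math.log(k * c / total)
             for c in counts.values() if c > 0)
    return counts, kl

lexicons = {"female": {"she", "her", "woman", "women"},
            "male": {"he", "his", "man", "men"}}
```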

Result: Successfully reduces representation bias and explicit stereotypes in text datasets. However, LLMs fine-tuned on debiased data show inconsistent improvement on bias benchmarks, revealing gaps in current evaluation methodologies.

Conclusion: While the pipeline effectively debiases data, current bias evaluation methods are insufficient, highlighting the need for targeted data manipulation to address specific model bias manifestations and better evaluation frameworks.

Abstract: Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.

[55] Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen

Main category: cs.CL

TL;DR: Intern-S1-MO is a long-horizon math agent that uses multi-round hierarchical reasoning with lemma-based memory to solve IMO-level problems, achieving silver medal performance on IMO2025 and gold medal level on CMO2025.

DetailsMotivation: Current Large Reasoning Models (LRMs) struggle with ultra-hard math problems like IMO due to context length limitations. Existing approaches are prompt-based, use proprietary models, and lack systematic training pipelines for long-horizon reasoning.

Method: 1) Intern-S1-MO: A multi-agent system with reasoning, summary, and verification agents that maintains lemma-based compact memory for multi-round hierarchical reasoning. 2) OREAL-H: An RL framework that uses online explored trajectories to bootstrap LRM reasoning ability and improve overall agent performance.

Result: Achieves 26/35 points on IMO2025 non-geometry problems (silver medal level), surpasses advanced LRMs on HMMT2025, AIME2025, and CNMO2025 benchmarks, and scores 102/126 on CMO2025 (gold medal level) under human expert judgment.

Conclusion: The lemma-based memory approach enables effective long-horizon reasoning for ultra-hard math problems, breaking through context constraints. The multi-agent hierarchical system with RL training significantly advances mathematical reasoning capabilities to competition-level performance.

Abstract: Large Reasoning Models (LRMs) have expanded the mathematical reasoning frontier through Chain-of-Thought (CoT) techniques and Reinforcement Learning with Verifiable Rewards (RLVR), capable of solving AIME-level problems. However, the performance of LRMs is heavily dependent on the extended reasoning context length. For solving ultra-hard problems like those in the International Mathematical Olympiad (IMO), the required reasoning complexity surpasses the space that an LRM can explore in a single round. Previous works attempt to extend the reasoning context of LRMs but remain prompt-based and built upon proprietary models, lacking systematic structures and training pipelines. Therefore, this paper introduces Intern-S1-MO, a long-horizon math agent that conducts multi-round hierarchical reasoning, composed of an LRM-based multi-agent system including reasoning, summary, and verification. By maintaining a compact memory in the form of lemmas, Intern-S1-MO can more freely explore the lemma-rich reasoning spaces in multiple reasoning stages, thereby breaking through the context constraints for IMO-level math problems. Furthermore, we propose OREAL-H, an RL framework for training the LRM using the online explored trajectories to simultaneously bootstrap the reasoning ability of LRM and elevate the overall performance of Intern-S1-MO. Experiments show that Intern-S1-MO can obtain 26 out of 35 points on the non-geometry problems of IMO2025, matching the performance of silver medalists. It also surpasses the current advanced LRMs on inference benchmarks such as HMMT2025, AIME2025, and CNMO2025. In addition, our agent officially participates in CMO2025 and achieves a score of 102/126 under the judgment of human experts, reaching the gold medal level.

cs.CV

[56] Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification

Anoop Krishnan

Main category: cs.CV

TL;DR: Text-guided approaches using image captions improve fairness in facial gender classification without demographic labels.

DetailsMotivation: Address demographic bias in facial gender classification algorithms by leveraging semantic information from image captions to create more equitable AI systems.

Method: Two text-guided strategies: 1) Image Text Matching (ITM) guidance for fine-grained image-text alignment, and 2) Image Text fusion combining both modalities into comprehensive representations.

Result: Methods effectively mitigate bias and improve accuracy across gender and racial groups on benchmark datasets, outperforming existing methods without requiring demographic labels.

Conclusion: Textual guidance provides interpretable training for fairer facial analysis, offering insights into reducing disparities through semantic information in demographic-agnostic applications.

Abstract: In the quest for fairness in artificial intelligence, novel approaches to enhance it in facial image based gender classification algorithms using text guided methodologies are presented. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Extensive experiments conducted on benchmark datasets demonstrate these approaches effectively mitigate bias and improve accuracy across gender and racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic.

[57] SoccerMaster: A Vision Foundation Model for Soccer Understanding

Haolin Yang, Jiayuan Rao, Haoning Wu, Weidi Xie

Main category: cs.CV

TL;DR: SoccerMaster: A unified vision foundation model for diverse soccer understanding tasks that outperforms task-specific expert models.

DetailsMotivation: Soccer understanding has unique domain-specific complexities, and prior works rely on isolated task-specific models rather than a unified approach.

Method: Developed SoccerMaster - a soccer-specific vision foundation model using supervised multi-task pretraining, with automated data curation pipeline (SoccerFactory) to generate scalable spatial annotations from existing soccer video datasets.

Result: SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks including fine-grained perception and semantic reasoning.

Conclusion: The unified approach demonstrates breadth and superiority over specialized models, with data, code, and model to be publicly available.

Abstract: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.

[58] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Yiqing Yang, Kin-Man Lam

Main category: cs.CV

TL;DR: End-to-end trainable framework for task-adaptive video frame selection using SLM-generated queries, set-level optimization, and student-teacher mutual learning to overcome limitations of independent scoring and static pseudo labels.

DetailsMotivation: Traditional top-K frame selection methods score frames independently, leading to temporally clustered and visually redundant selections. Also, training lightweight selectors with static MLLM-generated pseudo labels prevents dynamic adaptation to task objectives.

Method: 1) Chain-of-Thought approach guides SLM to generate task-specific implicit query vectors; 2) Continuous set-level objective function with relevance, coverage, and redundancy optimized via Gumbel-Softmax; 3) Student-teacher mutual learning aligns SLM selector and MLLM reasoner distributions via KL divergence and cross-entropy loss.
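
A minimal sketch of the set-level idea: sample k soft selections with Gumbel-Softmax and score them jointly, rewarding relevance and penalizing pairwise redundancy. The weights, the cosine redundancy penalty, and the omission of the coverage term are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def select_frames(scores, frame_feats, k, tau=0.5, w_rel=1.0, w_red=0.5):
    """Differentiable selection of k frames (k >= 2).

    scores: [T] query-conditioned frame scores; frame_feats: [T, d].
    Returns soft picks [k, T] and a set-level objective to maximize.
    """
    picks = torch.stack(
        [F.gumbel_softmax(scores, tau=tau, hard=False) for _ in range(k)]
    )                                             # [k, T] soft one-hot rows
    relevance = (picks * scores).sum()
    chosen = picks @ frame_feats                  # [k, d] soft frame features
    sim = F.cosine_similarity(chosen.unsqueeze(1), chosen.unsqueeze(0), dim=-1)
    redundancy = (sim.sum() - k) / (k * (k - 1))  # mean off-diagonal similarity
    return picks, w_rel * relevance - w_red * redundancy
```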

Result: Significantly outperforms existing approaches across multiple benchmarks including Video-MME, LongVideoBench, MLVU, and NExT-QA.

Conclusion: Proposed end-to-end trainable framework enables task-adaptive frame selection by addressing independent scoring limitations and eliminating reliance on static pseudo labels through dynamic optimization and mutual learning.

Abstract: Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.

[59] Weakly Supervised Tuberculosis Localization in Chest X-rays through Knowledge Distillation

Marshal Ashif Shawkat, Moidul Hasan, Taufiq Hasan

Main category: cs.CV

TL;DR: Knowledge distillation trains CNN models to reduce spurious correlations and localize TB abnormalities without bounding-box annotations, achieving 0.2428 mIOU and outperforming teacher models.

DetailsMotivation: TB is a major global health issue, especially in resource-limited areas. CXR imaging is accessible but requires expert interpretation. Current ML models for TB classification often rely on spurious correlations and fail to generalize. Creating large, high-quality annotated medical datasets is expensive and logistically challenging, requiring multiple domain experts to reach consensus.

Method: The study repurposes knowledge distillation technique to train CNN models. Uses a teacher-student framework with ResNet50 architecture. Trained on TBX11k dataset. The method reduces spurious correlations and localizes TB-related abnormalities without requiring bounding-box annotations.
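
For reference, a teacher-student setup of this kind typically optimizes the standard Hinton-style distillation objective below; the temperature and mixing weight are illustrative, since the summary does not state the paper's exact loss.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft targets from the teacher at temperature T, plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```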

Result: Achieved impressive 0.2428 mIOU score. The student model consistently outperforms the teacher model, demonstrating improved robustness. Shows potential for broader clinical deployment in diverse settings.

Conclusion: Knowledge distillation is effective for training TB classification models that reduce spurious correlations and localize abnormalities without expensive bounding-box annotations. The student model’s superior performance over the teacher suggests improved generalization capabilities, making it promising for clinical deployment in resource-limited settings.

Abstract: Tuberculosis (TB) remains one of the leading causes of mortality worldwide, particularly in resource-limited countries. Chest X-ray (CXR) imaging serves as an accessible and cost-effective diagnostic tool but requires expert interpretation, which is often unavailable. Although machine learning models have shown high performance in TB classification, they often depend on spurious correlations and fail to generalize. Besides, building large datasets featuring high-quality annotations for medical images demands substantial resources and input from domain specialists, and typically involves several annotators reaching agreement, which results in enormous financial and logistical expenses. This study repurposes the knowledge distillation technique to train CNN models that reduce spurious correlations and localize TB-related abnormalities without requiring bounding-box annotations. By leveraging a teacher-student framework with a ResNet50 architecture, the proposed method, trained on the TBX11k dataset, achieves an impressive 0.2428 mIOU score. Experimental results further reveal that the student model consistently outperforms the teacher, underscoring improved robustness and potential for broader clinical deployment in diverse settings.

[60] E-CHUM: Event-based Cameras for Human Detection and Urban Monitoring

Jack Brady, Andrew Dailey, Kristen Schang, Zo Vic Shong

Main category: cs.CV

TL;DR: Survey paper proposing event-based cameras as a novel sensor for studying urban dynamics, highlighting their advantages over traditional cameras and suggesting multi-sensor fusion approaches.

DetailsMotivation: Current methods for understanding human movement and city dynamics have limitations. Traditional urban monitoring methods (manual observation, cameras, sensors) have evolved but still need improvement. Event-based cameras offer unique capabilities that could unlock better practices for studying urban dynamics.

Method: The paper conducts a survey analysis of event-based cameras, examining their technical characteristics, applications, advantages, challenges, and machine learning applications. It proposes using event-based cameras as a medium for capturing urban dynamics information and suggests multi-sensor fusion approaches.

Result: Event-based cameras are identified as advantageous for urban dynamics studies due to their ability to work in low-light conditions, capture changes in light intensity rather than RGB values, and maintain privacy while capturing important information. Multi-sensor fusion with infrared, LiDAR, or vibration sensors can enhance event-based camera capabilities and overcome their limitations.

Conclusion: Event-based cameras represent a promising new approach for studying urban dynamics, offering privacy-preserving capabilities and unique sensing advantages. Multi-sensor fusion strategies can further improve their effectiveness in urban monitoring applications.

Abstract: Understanding human movement and city dynamics has always been challenging. From traditional methods of manually observing a city's inhabitants, to using cameras, to now using sensors and more complex technology, the field of urban monitoring has evolved greatly. Still, more can be done to unlock better practices for understanding city dynamics. This paper surveys how the landscape of urban dynamics research has evolved, with a particular focus on event-based cameras. Event-based cameras capture changes in light intensity instead of the RGB values that traditional cameras do. They offer unique abilities, like the ability to work in low light, that can make them advantageous compared to other sensors. Through an analysis of event-based cameras, their applications, their advantages and challenges, and machine learning applications, we propose event-based cameras as a medium for capturing information to study urban dynamics. They offer the ability to capture important information while maintaining privacy. We also suggest multi-sensor fusion of event-based cameras and other sensors in the study of urban dynamics. Combining event-based cameras with infrared, event-LiDAR, or vibration sensors has the potential to enhance the capabilities of event-based cameras and overcome their challenges.

[61] Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning

Chenjun Li, Cheng Wan, Laurin Lux, Alexander Berger, Richard B. Rosen, Martin J. Menten, Johannes C. Paetzold

Main category: cs.CV

TL;DR: SVR framework generates synthetic OCTA images with DR features and corresponding reasoning text to train VLMs for medical diagnosis, achieving 89.67% zero-shot accuracy.

DetailsMotivation: VLMs need large-scale image-text datasets for medical reasoning, but specialized domains like OCTA imaging lack precise text descriptions of pathologies.

Method: Synthetic Vasculature Reasoning (SVR) framework controllably synthesizes realistic retinal vasculature with DR features and automatically generates granular reasoning texts, creating OCTA-100K-SVR dataset.

Result: Qwen3-VL-8b trained on OCTA-100K-SVR achieves 89.67% zero-shot balanced classification accuracy on real OCTA images, outperforming supervised baselines and improving explanation quality and pathology localization.

Conclusion: SVR enables training interpretable medical VLMs without large-scale real datasets, demonstrating strong performance on clinical OCTA diagnosis with enhanced reasoning capabilities.

Abstract: Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask about clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text, specifically realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity), while automatically generating granular reasoning texts. Based on this, we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.

[62] EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

Main category: cs.CV

TL;DR: EditMGT: A Masked Generative Transformer-based image editing framework that uses localized decoding to preserve non-target regions, achieving faster editing with comparable quality to diffusion models.

DetailsMotivation: Diffusion models for image editing have global denoising dynamics that conflate local editing targets with full-image context, causing unintended modifications in non-target regions. The authors seek an alternative approach that can explicitly preserve non-relevant areas during editing.

Method: 1) Uses Masked Generative Transformers (MGTs) with localized decoding paradigm; 2) Leverages MGT’s cross-attention maps for edit-relevant region localization with multi-layer attention consolidation; 3) Introduces region-hold sampling to restrict token flipping within low-attention areas; 4) Trains on CrispEdit-2M dataset (7 categories); 5) Adapts pre-trained text-to-image MGT via attention injection without adding parameters.
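
A hedged sketch of one decoding step under region-hold sampling: tokens whose consolidated cross-attention falls below a threshold keep their previous values, so edits stay inside high-attention regions. The threshold and the hard keep/flip rule are assumptions about the mechanism.

```python
import torch

def region_hold_step(logits, prev_tokens, attn_map, threshold=0.2):
    """Re-sample only tokens in edit-relevant (high-attention) regions.

    logits: [N, V] MGT predictions per image token; prev_tokens: [N];
    attn_map: [N] consolidated cross-attention scores in [0, 1].
    """
    proposed = torch.distributions.Categorical(logits=logits).sample()  # [N]
    editable = attn_map >= threshold
    return torch.where(editable, proposed, prev_tokens)  # hold low-attention tokens
```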

Result: With <1B parameters, achieves comparable performance while editing 6x faster than baselines. Improves style change by 3.6% and style transfer by 17.6% on standard benchmarks. Delivers comparable or superior editing quality while better preserving non-target regions.

Conclusion: MGTs offer a promising alternative to diffusion models for image editing, providing inherent capacity for localized modifications with better preservation of non-target regions, faster inference, and competitive editing quality.

Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT’s cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves comparable performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

[63] Learning from a Generative Oracle: Domain Adaptation for Restoration

Yuyang Hu, Mojtaba Sahraee-Ardakan, Arpit Bansal, Kangfu Mei, Christian Qi, Peyman Milanfar, Mauricio Delbracio

Main category: cs.CV

TL;DR: LEGO is a three-stage framework for adapting pre-trained image restoration models to real-world out-of-distribution degradations without paired data, using a generative oracle to create pseudo-ground-truths for fine-tuning.

Motivation: Pre-trained image restoration models fail on real-world out-of-distribution degradations due to domain gaps. Traditional adaptation methods require complex architectural changes or paired data, which is unavailable for unseen domains.

Method: Three-stage framework: 1) Get initial restorations from pre-trained model, 2) Use frozen generative oracle to refine estimates into high-quality pseudo-ground-truths, 3) Fine-tune original model with mixed-supervision combining in-distribution data and pseudo-pairs.
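
A minimal sketch of one adaptation step under our reading of the three stages; the stand-in models, L1 losses, and mixing weight are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def lego_adaptation_step(student, oracle, ood_degraded, id_pairs, optimizer, mix=0.5):
    """One mixed-supervision step of the three-stage recipe (sketch).
    student: trainable restoration model; oracle: frozen generative refiner."""
    # Stages 1-2: build pseudo ground truths for unpaired OOD inputs.
    with torch.no_grad():
        initial = student(ood_degraded)          # stage 1: initial restoration
        pseudo_gt = oracle(initial)              # stage 2: oracle refinement
    # Stage 3: fine-tune on in-distribution pairs + pseudo-pairs.
    id_x, id_y = id_pairs
    loss_id = F.l1_loss(student(id_x), id_y)
    loss_ood = F.l1_loss(student(ood_degraded), pseudo_gt)
    loss = mix * loss_id + (1 - mix) * loss_ood
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy usage with stand-in modules
student = torch.nn.Conv2d(3, 3, 3, padding=1)
oracle = torch.nn.Identity()                     # stand-in for the frozen oracle
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
x_ood = torch.rand(2, 3, 32, 32)
x_id, y_id = torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)
print(lego_adaptation_step(student, oracle, x_ood, (x_id, y_id), opt))
```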

Result: LEGO effectively bridges domain gaps, significantly improving performance on diverse real-world benchmarks without sacrificing original robustness or requiring architectural modifications.

Conclusion: LEGO provides a practical solution for post-training domain adaptation by converting unsupervised challenges into tractable pseudo-supervised problems using generative oracles, enabling effective adaptation to real-world out-of-distribution degradations.

Abstract: Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.

[64] VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Felix O’Mahony, Roberto Cipolla, Ayush Tewari

Main category: cs.CV

TL;DR: VDAWorld: A framework using Vision-Language Models to create tractable world representations from images, enabling adaptive physics simulation and future state prediction.

Motivation: Current generative video models violate physical/logical rules, lack interactivity, and are opaque black boxes unsuitable for building structured, queryable worlds.

Method: VLM acts as intelligent agent to distill image-caption pairs into abstract representations, selects appropriate vision tools for scene construction, and chooses compatible physics simulators for dynamic modeling.

Result: VDAWorld produces high-quality simulations across diverse dynamic scenarios by combining intelligent abstraction with adaptive simulation.

Conclusion: The framework offers a new paradigm for world modeling that overcomes limitations of generative video models through tractable representations and adaptive simulation.

Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high quality simulations across a wide range of dynamic scenarios.

[65] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

Haotian Wang, Yuzhe Weng, Xinyi Yu, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Qingfeng Liu

Main category: cs.CV

TL;DR: REST is the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework that addresses slow inference speeds of diffusion models through compact video latent space, ID-Context Cache mechanism, and Asynchronous Streaming Distillation training.

Motivation: Diffusion models have advanced talking head generation but suffer from slow inference speeds and non-autoregressive paradigms that limit real-time applications. There's a need for diffusion-based models that can operate in real-time streaming scenarios.

Method: 1) Learn compact video latent space through high spatiotemporal VAE compression for real-time generation. 2) Introduce ID-Context Cache mechanism combining ID-Sink and Context-Cache principles for key-value caching to maintain temporal consistency and identity coherence during streaming. 3) Propose Asynchronous Streaming Distillation (ASD) training strategy using a non-streaming teacher with asynchronous noise schedule to supervise streaming student model and mitigate error accumulation.
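
A toy sketch of the caching idea as we understand it: permanent identity "sink" entries are never evicted, while the rest of the key-value cache is a rolling window over recent context. Sizes and the eviction rule are our own assumptions.

```python
import torch

class IDContextCache:
    """KV cache keeping permanent identity 'sink' entries plus a rolling
    context window, in the spirit of combining ID-Sink and Context-Cache."""
    def __init__(self, num_sink: int, max_context: int):
        self.num_sink, self.max_context = num_sink, max_context
        self.k, self.v = None, None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (T_new, D) keys/values for newly generated tokens
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new]); self.v = torch.cat([self.v, v_new])
        # Evict from the middle: keep sink tokens and the most recent context.
        if self.k.shape[0] > self.num_sink + self.max_context:
            self.k = torch.cat([self.k[: self.num_sink], self.k[-self.max_context:]])
            self.v = torch.cat([self.v[: self.num_sink], self.v[-self.max_context:]])

cache = IDContextCache(num_sink=4, max_context=8)
for _ in range(5):
    cache.append(torch.randn(3, 16), torch.randn(3, 16))
print(cache.k.shape)  # capped at (num_sink + max_context, 16)
```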

Result: REST outperforms state-of-the-art methods in both generation speed and overall performance, demonstrating substantial value for real-time talking head generation applications.

Conclusion: REST successfully bridges the gap between autoregressive and diffusion-based approaches, enabling real-time, end-to-end streaming audio-driven talking head generation with improved speed and performance compared to existing methods.

Abstract: Diffusion models have significantly advanced the field of talking head generation (THG). However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining temporal consistency and identity coherence during long-time streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.

[66] Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description

Nazanin Mahjourian, Vinh Nguyen

Main category: cs.CV

TL;DR: VLM-IRIS adapts vision-language models for zero-shot infrared industrial sensing by converting thermal images to RGB-compatible magma representations, enabling workpiece detection without retraining.

Motivation: Manufacturing environments often have low-light conditions where conventional vision systems fail. Infrared cameras work well in these conditions, but supervised AI requires large labeled datasets. While vision-language models offer zero-shot capabilities, they can't process infrared data since they're trained on RGB images.

Method: VLM-IRIS preprocesses infrared images from FLIR Boson sensors into RGB-compatible inputs using magma representation. It applies centroid prompt ensembling with CLIP ViT-B/32 encoder for zero-shot predictions without model retraining.
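
A minimal sketch of the pipeline, assuming the open_clip package as a stand-in CLIP implementation; the prompt wordings, class names, and normalization details are illustrative, not the paper's exact choices.

```python
import numpy as np
import torch
from matplotlib import colormaps
from PIL import Image
import open_clip  # assumed dependency; any CLIP implementation works similarly

def thermal_to_magma(ir: np.ndarray) -> np.ndarray:
    """Render a single-channel thermal frame as an RGB magma image."""
    norm = (ir - ir.min()) / (np.ptp(ir) + 1e-8)
    return (colormaps["magma"](norm)[..., :3] * 255).astype(np.uint8)

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Centroid prompt ensembling: embed several phrasings per class and average.
class_prompts = {  # illustrative prompts, not the paper's exact wording
    "workpiece": ["a thermal image of a workpiece on a printer bed",
                  "an infrared photo of a printed part on a build plate"],
    "empty bed": ["a thermal image of an empty 3d printer bed",
                  "an infrared photo of a bare build plate"],
}
with torch.no_grad():
    centroids = []
    for texts in class_prompts.values():
        emb = model.encode_text(tokenizer(texts))
        centroids.append((emb / emb.norm(dim=-1, keepdim=True)).mean(0))
    centroids = torch.stack(centroids)

    rgb = Image.fromarray(thermal_to_magma(np.random.rand(64, 64).astype(np.float32)))
    img = model.encode_image(preprocess(rgb).unsqueeze(0))
    img = img / img.norm(dim=-1, keepdim=True)
print(list(class_prompts)[(img @ centroids.T).argmax(-1).item()])
```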

Result: The framework successfully demonstrates zero-shot workpiece presence detection on a 3D printer bed, leveraging temperature differences between build plate and workpieces for thermal imaging applications.

Conclusion: The proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring in industrial settings, enabling infrared data processing without the need for large labeled datasets.

Abstract: Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.

[67] Embodied Image Compression

Chunyi Li, Rui Qing, Jianbo Zhang, Yuan Tian, Xiangyang Zhu, Zicheng Zhang, Xiaohong Liu, Weisi Lin, Guangtao Zhai

Main category: cs.CV

TL;DR: This paper introduces Embodied Image Compression (EIC) as a new research problem for compressing visual data for embodied AI agents operating in real-world environments, establishing the EmbodiedComp benchmark for evaluation under ultra-low bitrate conditions.

Motivation: With the evolution of machine intelligence from task-specific models to embodied agents in real-world environments, there's a need to address communication constraints in multi-agent systems and ensure real-time task execution, requiring specialized compression techniques for embodied AI.

Method: The paper introduces the scientific problem of Embodied Image Compression and establishes a standardized benchmark called EmbodiedComp for systematic evaluation under ultra-low bitrate conditions in closed-loop settings. The approach involves extensive empirical studies in both simulated and real-world environments.
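
The paper does not specify a codec, but the following sketch illustrates what operating at an ultra-low bits-per-pixel budget looks like in practice, here by searching JPEG quality settings; the budget value is an arbitrary example, not the paper's Embodied bitrate threshold.

```python
import io
from PIL import Image

def compress_to_bitrate(img: Image.Image, target_bpp: float):
    """Find the highest JPEG quality whose encoded size stays under a
    bits-per-pixel budget (assumes size grows with quality)."""
    w, h = img.size
    best = None
    for q in range(1, 96):
        buf = io.BytesIO()
        img.save(buf, "JPEG", quality=q)
        bpp = 8 * buf.getbuffer().nbytes / (w * h)
        if bpp <= target_bpp:
            best = (q, bpp, buf.getvalue())
        else:
            break
    return best  # None if even quality=1 exceeds the budget

q, bpp, payload = compress_to_bitrate(Image.new("RGB", (320, 240), "gray"), 0.05)
print(q, round(bpp, 4), len(payload))
```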

Result: The research demonstrates that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold, highlighting the need for specialized compression techniques for embodied agents.

Conclusion: EmbodiedComp is anticipated to catalyze the development of domain-specific compression tailored for embodied agents, thereby accelerating the deployment of Embodied AI in real-world applications by addressing communication constraints in multi-agent systems.

Abstract: Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents, thereby accelerating the deployment of Embodied AI in the real world.

[68] VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu

Main category: cs.CV

TL;DR: VGent is a modular visual grounding system that separates reasoning (frozen MLLM encoder) from bounding box prediction (decoder), achieving SOTA performance with 20.6% F1 improvement and fast inference.

Motivation: Current visual grounding models have trade-offs: auto-regressive MLLMs are slow and hallucinate, while re-aligned LLMs compromise pretrained reasoning. Need a solution that preserves reasoning power while enabling efficient, accurate grounding.

Method: Modular encoder-decoder architecture: frozen MLLM encoder for reasoning, decoder with detector-proposed boxes as queries selects targets via cross-attention. Includes QuadThinker (RL training for multi-target reasoning), mask-aware labels for detection-segmentation ambiguity, and global target recognition.
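
A minimal sketch of the decoder side as described: detector-proposed boxes are embedded as queries that cross-attend to the frozen encoder's hidden states, and a head scores each box for selection. Dimensions and the scoring head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BoxSelectionDecoder(nn.Module):
    """Sketch: detector-proposed boxes become queries that cross-attend to
    frozen-MLLM hidden states; a head scores each box as target / not."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.box_embed = nn.Linear(4, d_model)            # (x1, y1, x2, y2) -> query
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, boxes, encoder_hidden):
        # boxes: (B, Q, 4); encoder_hidden: (B, T, d_model) from the frozen MLLM
        q = self.box_embed(boxes)
        attended, _ = self.cross_attn(q, encoder_hidden, encoder_hidden)
        return self.score_head(attended).squeeze(-1)      # (B, Q) selection logits

dec = BoxSelectionDecoder()
logits = dec(torch.rand(1, 5, 4), torch.randn(1, 32, 256))
print(logits.shape)  # (1, 5): one logit per proposed box
```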

Result: Achieves new SOTA: +20.6% F1 improvement over prior methods, +8.2% gIoU and +5.8% cIoU boost under visual reference challenges, while maintaining constant fast inference latency.

Conclusion: VGent successfully disentangles reasoning from grounding, leveraging advances in both object detection and MLLMs without compromising either, enabling modular upgrades and efficient inference.

Abstract: Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM’s pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder’s hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets, which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

[69] Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization

Brennan Flannery, Thomas DeSilvio, Jane Nguyen, Satish E. Viswanath

Main category: cs.CV

TL;DR: Intelligent correlation-guided fusion of multiple pathology foundation models improves cancer grading/staging performance by creating compact, task-specific representations that enhance both accuracy and interpretability.

Motivation: While foundation models show strong performance in pathology, there's limited understanding of their complementarity, redundancy in embedding spaces, or biological feature interpretation. The study aims to systematically evaluate how to best integrate multiple pathology FMs for improved cancer diagnosis.

Method: Used H&E whole-slide images from kidney, prostate, and rectal cancers. Evaluated both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE). Compared three fusion schemes: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Used patient-stratified cross-validation with hold-out testing.
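
A minimal sketch of correlation-guided pruning on concatenated embeddings; the greedy rule and the 0.95 threshold are our own illustrative choices, not necessarily the paper's.

```python
import numpy as np

def correlation_guided_fusion(feature_blocks, threshold=0.95):
    """Concatenate embeddings from multiple foundation models, then greedily
    drop dimensions that are highly correlated with an already-kept one."""
    X = np.concatenate(feature_blocks, axis=1)       # (n_samples, total_dims)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 8))                        # embeddings from FM "A"
b = a[:, :4] + 0.01 * rng.normal(size=(100, 4))      # FM "B", partly redundant with A
fused, kept = correlation_guided_fusion([a, b])
print(a.shape[1] + b.shape[1], "->", fused.shape[1])  # redundant dims pruned
```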

Result: Intelligent fusion of tile-level embeddings consistently outperformed both the best single FMs and naive fusion across all three cancer types. Analysis showed substantial global similarity but lower local neighborhood agreement across FM embedding spaces, indicating complementary fine-grained information. Attention maps revealed intelligent fusion concentrated on tumor regions while reducing focus on benign areas.

Conclusion: Correlation-guided intelligent fusion of pathology foundation models creates compact, task-tailored representations that enhance both predictive performance and interpretability in computational pathology tasks, suggesting this approach can effectively leverage complementary information across multiple FMs.

Abstract: Foundation models (FMs) have demonstrated strong performance across diverse pathology tasks. While there are similarities in the pre-training objectives of FMs, there is still limited understanding of their complementarity, redundancy in embedding spaces, or biological interpretation of features. In this study, we propose an information-driven, intelligent fusion strategy for integrating multiple pathology FMs into a unified representation and systematically evaluate its performance for cancer grading and staging across three distinct diseases. Diagnostic H&E whole-slide images from kidney (519 slides), prostate (490 slides), and rectal (200 slides) cancers were dichotomized into low versus high grade or stage. Both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE) were considered to train downstream classifiers. We then evaluated three FM fusion schemes at both tile and slide levels: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Under patient-stratified cross-validation with hold-out testing, intelligent fusion of tile-level embeddings yielded consistent gains in classification performance across all three cancers compared with the best single FMs and naive fusion. Global similarity metrics revealed substantial alignment of FM embedding spaces, contrasted by lower local neighborhood agreement, indicating complementary fine-grained information across FMs. Attention maps showed that intelligent fusion yielded concentrated attention on tumor regions while reducing spurious focus on benign regions. Our findings suggest that intelligent, correlation-guided fusion of pathology FMs can yield compact, task-tailored representations that enhance both predictive performance and interpretability in downstream computational pathology tasks.

[70] Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Bowen Wen, Shaurya Dewan, Stan Birchfield

Main category: cs.CV

TL;DR: Fast-FoundationStereo achieves real-time stereo matching with strong zero-shot generalization by combining knowledge distillation, neural architecture search, and structured pruning to accelerate foundation models.

Motivation: Stereo foundation models have strong zero-shot generalization but are too slow for real-time applications, while efficient stereo architectures sacrifice robustness and require costly per-domain fine-tuning.

Method: Three-part divide-and-conquer acceleration: (1) knowledge distillation to compress hybrid backbone into efficient student, (2) blockwise neural architecture search for optimal cost filtering designs, (3) structured pruning for iterative refinement module. Plus automatic pseudo-labeling pipeline to curate 1.4M in-the-wild stereo pairs.
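
A minimal sketch of the distillation component only (part 1 of the strategy), assuming a feature-regression objective; the stand-in conv networks and MSE loss are illustrative, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, left, right, optimizer):
    """Sketch of feature distillation: the student backbone regresses the
    frozen teacher's features on (possibly pseudo-labeled) stereo pairs."""
    pair = torch.cat([left, right], dim=1)        # naive early fusion for the sketch
    with torch.no_grad():
        t_feat = teacher(pair)
    s_feat = student(pair)
    loss = F.mse_loss(s_feat, t_feat)             # could also be L1 or cosine
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

teacher = torch.nn.Conv2d(6, 32, 3, padding=1).eval()   # stand-in hybrid teacher
student = torch.nn.Conv2d(6, 32, 3, padding=1)          # lighter student
opt = torch.optim.Adam(student.parameters(), 1e-4)
print(distill_step(student, teacher, torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), opt))
```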

Result: Model runs over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, establishing new state-of-the-art among real-time stereo methods.

Conclusion: Fast-FoundationStereo successfully bridges the gap between strong zero-shot generalization and real-time performance in stereo vision, achieving the first real-time stereo foundation model.

Abstract: Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/

[71] Learning complete and explainable visual representations from itemized text supervision

Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay Rao, Akhil Kondepudi, Honglak Lee, Todd C. Hollon

Main category: cs.CV

TL;DR: ItemizedCLIP is a framework for learning visual representations from itemized text supervision (multiple independent text items per image), achieving better zero-shot performance and interpretability across medical imaging and remote sensing domains.

Motivation: Many visual domains like medical imaging and remote sensing have itemized text annotations where multiple text items describe distinct, semantically independent findings within a single image. Standard vision-language models trained with redundant captions don't handle this type of supervision well.

Method: ItemizedCLIP uses a cross-attention module to produce text item-conditioned visual embeddings and tailored objectives that enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items).
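
A toy sketch of the two ingredients named above; the cross-attention pooling follows the description, while the two loss shapes are our own stand-ins for the paper's tailored independence and completeness objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ItemConditionedPool(nn.Module):
    """Sketch: each text item queries the image patch tokens via cross-attention,
    yielding one item-conditioned visual embedding per item."""
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, item_emb, patch_tokens):
        # item_emb: (B, I, d) text-item embeddings; patch_tokens: (B, P, d)
        pooled, _ = self.attn(item_emb, patch_tokens, patch_tokens)
        return pooled                                   # (B, I, d)

def itemized_losses(vis_items, txt_items, tau=0.07):
    v = F.normalize(vis_items, dim=-1); t = F.normalize(txt_items, dim=-1)
    # Completeness stand-in: each item's visual embedding matches its own text item.
    align = 1 - (v * t).sum(-1).mean()
    # Independence stand-in: penalize similarity between distinct items' embeddings.
    sim = torch.einsum("bid,bjd->bij", v, v) / tau
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    independence = off_diag.logsumexp(dim=-1).mean()
    return align, independence

pool = ItemConditionedPool()
vis = pool(torch.randn(2, 3, 128), torch.randn(2, 49, 128))
print([round(x.item(), 3) for x in itemized_losses(vis, torch.randn(2, 3, 128))])
```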

Result: Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one synthetic dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines.

Conclusion: ItemizedCLIP produces semantically grounded, item-differentiable, complete, and visually interpretable representations that effectively handle itemized text supervision in non-object-centric domains.

Abstract: Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at https://github.com/MLNeurosurg/ItemizedCLIP.

[72] Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context

Anatole Jacquin de Margerie, Alexis Roger, Irina Rish

Main category: cs.CV

TL;DR: The paper presents a reproduction study of Monkey VLM, confirming its tile-based approach for high-resolution image understanding while investigating global context effects and reporting task-dependent performance variations.

Motivation: Address reproducibility concerns in complex multimodal models by transparently replicating and analyzing the Monkey VLM approach for high-resolution image understanding.

Method: Replicated the original Monkey VLM’s tile-based strategy using open checkpoints, reimplemented the training pipeline, and extended the work by investigating the effect of including global context.
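
A minimal sketch of the tiling-plus-global-context preprocessing being studied; the 448-pixel tile size and zero-padding scheme are illustrative assumptions.

```python
from PIL import Image

def tile_with_global(img: Image.Image, tile=448):
    """Split a high-resolution image into fixed-size tiles and append a
    downsampled full-image view as global context."""
    w, h = img.size
    cols, rows = -(-w // tile), -(-h // tile)          # ceil division
    padded = Image.new("RGB", (cols * tile, rows * tile))
    padded.paste(img)                                  # zero-pad to a tile multiple
    tiles = [padded.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    global_view = img.resize((tile, tile))             # coarse whole-image context
    return tiles + [global_view]

views = tile_with_global(Image.new("RGB", (1344, 896)))
print(len(views))  # 3x2 tiles + 1 global view = 7
```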

Result: Confirmed that tiling effectively recovers local details as originally claimed, but found deviations in results depending on task type and tile granularity, with insights on global context inclusion.

Conclusion: The reproduction validates the core Monkey VLM approach while revealing important task-dependent variations and providing practical guidance for future high-resolution multimodal modeling.

Abstract: Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including the global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.

[73] Lightweight 3D Gaussian Splatting Compression via Video Codec

Qi Yang, Geert Van Der Auwera, Zhu Li

Main category: cs.CV

TL;DR: LGSCV is a lightweight 3D Gaussian Splatting compression method using video codecs with two-stage Morton scanning and MiniPLAS optimization for better rate-distortion performance.

Motivation: Current GS compression methods using PLAS are computationally expensive and time-consuming, limiting deployment on lightweight devices. Need efficient compression compatible with standard video codecs.

Method: Two-stage Morton scan (3D then 2D) creates blockwise 2D maps for video codecs. PCA reduces SH dimensionality, and MiniPLAS permutes primitives within blocks to improve RD performance at medium-low bitrates.
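
The first stage rests on the classic Morton (Z-order) code, which interleaves coordinate bits so that spatially close primitives stay close in the 1D ordering; a minimal sketch (bit width is an arbitrary choice here):

```python
def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the bits of (x, y, z) into a single Morton (Z-order) code,
    as used to linearize Gaussian primitives before the 2D mapping."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# Sort primitives by 3D Morton code so spatial neighbors stay adjacent in 1D.
points = [(5, 1, 3), (5, 1, 2), (0, 0, 0), (7, 7, 7)]
print(sorted(points, key=lambda p: morton3d(*p)))
```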

Result: Achieves over 20% RD gain vs SOTA, reduces 2D map generation to ~1 second, cuts encoding time by 50% on MPEG dataset.

Conclusion: LGSCV enables efficient GS compression for lightweight devices by combining Morton scanning with MiniPLAS optimization, achieving superior performance with significantly reduced computational cost.

Abstract: Current video-based GS compression methods rely on using Parallel Linear Assignment Sorting (PLAS) to convert 3D GS into smooth 2D maps, which are computationally expensive and time-consuming, limiting the application of GS on lightweight devices. In this paper, we propose a Lightweight 3D Gaussian Splatting (GS) Compression method based on Video codec (LGSCV). First, a two-stage Morton scan is proposed to generate blockwise 2D maps that are friendly for canonical video codecs in which the coding units (CU) are square blocks. A 3D Morton scan is used to permute GS primitives, followed by a 2D Morton scan to map the ordered GS primitives to 2D maps in a blockwise style. However, although the blockwise 2D maps report close performance to the PLAS map in high-bitrate regions, they show a quality collapse at medium-to-low bitrates. Therefore, a principal component analysis (PCA) is used to reduce the dimensionality of spherical harmonics (SH), and a MiniPLAS, which is flexible and fast, is designed to permute the primitives within certain block sizes. Incorporating SH PCA and MiniPLAS leads to a significant gain in rate-distortion (RD) performance, especially at medium and low bitrates. MiniPLAS can also guide the setting of the codec CU size configuration and significantly reduce encoding time. Experimental results on the MPEG dataset demonstrate that the proposed LGSCV achieves over 20% RD gain compared with state-of-the-art methods, while reducing 2D map generation time to approximately 1 second and cutting encoding time by 50%. The code is available at https://github.com/Qi-Yangsjtu/LGSCV .

[74] UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis

Md Abu Bakr Siddique, Vaishnav Ramesh, Junliang Liu, Piyush Singh, Md Jahidul Islam

Main category: cs.CV

TL;DR: UStyle is the first data-driven framework for waterbody style transfer in underwater images without needing reference images, using depth-aware physics-based synthesis and novel loss functions to preserve scene structure while transferring waterbody styles.

Motivation: Traditional style transfer methods fail in underwater environments due to wavelength-dependent attenuation, depth-dependent backscattering, and inability to preserve object/scene geometry in high-scattering mediums. Waterbody style transfer remains unexplored in underwater imaging literature.

Method: Proposes UStyle with depth-aware whitening and coloring transform (DA-WCT) that integrates physics-based waterbody synthesis. Uses carefully designed loss functions for colorfulness, lightness, structural integrity, frequency-domain characteristics, and high-level content in VGG and CLIP feature spaces.
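
For reference, a minimal sketch of the plain whitening-and-coloring transform that DA-WCT builds on; the depth conditioning and physics-based waterbody synthesis described in the paper are omitted here.

```python
import torch

def wct(content_feat, style_feat, eps=1e-5):
    """Classic whitening-and-coloring transform on (C, H*W) feature maps;
    DA-WCT additionally conditions this on depth, which we omit."""
    def center(f):
        mu = f.mean(1, keepdim=True)
        return f - mu, mu
    fc, _ = center(content_feat)
    fs, mu_s = center(style_feat)
    # Whiten content: remove its covariance structure.
    Uc, Sc, _ = torch.linalg.svd(fc @ fc.T / (fc.shape[1] - 1) + eps * torch.eye(fc.shape[0]))
    whitened = Uc @ torch.diag(Sc.clamp_min(eps).rsqrt()) @ Uc.T @ fc
    # Color with the style covariance, then restore the style mean.
    Us, Ss, _ = torch.linalg.svd(fs @ fs.T / (fs.shape[1] - 1) + eps * torch.eye(fs.shape[0]))
    colored = Us @ torch.diag(Ss.clamp_min(eps).sqrt()) @ Us.T @ whitened
    return colored + mu_s

out = wct(torch.randn(16, 256), torch.randn(16, 256))
print(out.shape)  # (16, 256)
```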

Result: UStyle surpasses state-of-the-art methods that rely solely on end-to-end reconstruction loss. Introduces UF7D dataset with seven distinct waterbody styles as a benchmark. Framework and dataset are publicly released.

Conclusion: UStyle provides a robust no-reference underwater image style transfer framework that addresses domain-specific challenges, enabling waterbody style transfer while preserving scene structure, establishing a foundation for future underwater imaging research.

Abstract: The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: https://github.com/uf-robopi/UStyle.

[75] Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization

Anh-Kiet Duong, Petra Gomez-Krämer

Main category: cs.CV

TL;DR: Extended TSM with background class for temporal action localization in multi-perspective videos, using multi-task learning and ensemble methods to win ICCV 2025 BinEgo-360 Challenge.

Motivation: Address the challenge of temporal action localization in complex multi-perspective and multi-modal video settings, where panoramic, third-person, and egocentric recordings require robust action detection across different viewpoints.

Method: Extended Temporal Shift Module (TSM) with background class for fixed-length interval classification; multi-task learning framework combining scene classification and TAL; weighted ensemble of multiple models for improved robustness.
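
A minimal sketch of the interval labeling implied by "fixed-length non-overlapping intervals plus a background class"; frame rate, clip length, and the annotation format are illustrative assumptions.

```python
import torch

def intervals_to_labels(annotations, num_frames, fps, clip_len, num_classes):
    """Assign each fixed-length, non-overlapping interval an action label,
    with index `num_classes` reserved for the extra background class."""
    n_clips = num_frames // clip_len
    labels = torch.full((n_clips,), num_classes, dtype=torch.long)  # background
    for start_s, end_s, cls in annotations:                         # times in seconds
        first = int(start_s * fps) // clip_len
        last = int(end_s * fps) // clip_len
        labels[first : last + 1] = cls
    return labels

# Two annotated actions in a 300-frame video at 30 fps, 16-frame intervals.
print(intervals_to_labels([(0.5, 2.0, 3), (7.0, 8.5, 1)],
                          num_frames=300, fps=30, clip_len=16, num_classes=10))
```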

Result: Ranked first in both initial and extended rounds of ICCV 2025 BinEgo-360 Challenge, demonstrating superior performance in temporal action localization across multi-perspective video data.

Conclusion: The combination of multi-task learning, efficient TSM backbone, and ensemble methods provides an effective solution for temporal action localization in complex multi-perspective video environments.

Abstract: We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.

[76] CADKnitter: Compositional CAD Generation from Text and Geometry Guidance

Tri Le, Khang Nguyen, Baoru Huang, Tung D. Ta, Anh Nguyen

Main category: cs.CV

TL;DR: CADKnitter is a compositional CAD generation framework that creates complementary CAD parts following both geometric constraints of existing models and semantic constraints of text prompts, outperforming existing methods.

Motivation: Traditional CAD modeling is time-consuming and requires expertise. While 3D generation has helped, existing methods focus on single-part generation, which doesn't suit real-world applications where multiple parts need assembly under semantic and geometric constraints.

Method: Proposes CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. Also creates KnitCAD dataset with 310,000+ CAD models, textual prompts, and assembly metadata for training.

Result: Intensive experiments show CADKnitter outperforms state-of-the-art baselines by a clear margin in generating complementary CAD parts that satisfy both geometric and semantic constraints.

Conclusion: CADKnitter advances CAD generation from single-part to compositional generation, enabling more practical real-world applications by handling assembly constraints and semantic requirements simultaneously.

Abstract: Crafting computer-aided design (CAD) models has long been a painstaking and time-intensive task, demanding both precision and expertise from designers. With the emergence of 3D generation, this task has undergone a transformative impact, shifting not only from visual fidelity to functional utility but also enabling editable CAD designs. Prior works have achieved early success in single-part CAD generation, which is not well-suited for real-world applications, as multiple parts need to be assembled under semantic and geometric constraints. In this paper, we propose CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. CADKnitter is able to generate a complementary CAD part that follows both the geometric constraints of the given CAD model and the semantic constraints of the desired design text prompt. We also curate a dataset, called KnitCAD, containing over 310,000 samples of CAD models, along with textual prompts and assembly metadata that provide semantic and geometric constraints. Intensive experiments demonstrate that our proposed method outperforms other state-of-the-art baselines by a clear margin.

[77] AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

Zhengyang Yu, Akio Hayakawa, Masato Ishii, Qingtao Yu, Takashi Shibuya, Jing Zhang, Yuki Mitsufuji

Main category: cs.CV

TL;DR: AutoRefiner is a noise refinement method for autoregressive video diffusion models that improves sample fidelity through pathwise noise refinement and reflective KV-cache, addressing limitations of naive text-to-image noise refiner extensions.

Motivation: Autoregressive video diffusion models (AR-VDMs) show promise for real-time applications but have room for improvement in sample fidelity. While inference-time alignment methods exist, optimization-based approaches are computationally impractical for AR-VDMs, and naive extensions of text-to-image noise refiners fail for video models.

Method: AutoRefiner introduces two key designs: 1) Pathwise noise refinement that refines noise along stochastic denoising paths, and 2) Reflective KV-cache that enables efficient noise modulation. The method serves as a plug-in module for AR-VDMs without requiring model parameter updates.

Result: Experiments demonstrate that AutoRefiner effectively enhances sample fidelity in AR-VDMs by refining noise along denoising paths, serving as an efficient plug-in solution that improves video generation quality.

Conclusion: AutoRefiner successfully extends noise refinement techniques to autoregressive video diffusion models, providing an efficient inference-time solution to improve sample fidelity without the computational burden of optimization-based methods.

Abstract: Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

[78] SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection

Tianye Qi, Weihao Li, Nick Barnes

Main category: cs.CV

TL;DR: SmokeBench benchmark evaluates MLLMs on wildfire smoke detection and localization, finding models struggle with early-stage smoke localization despite some classification ability.

Motivation: Wildfire smoke is visually similar to clouds and challenging to detect in early stages, creating a need to evaluate multimodal LLMs' capabilities for safety-critical wildfire monitoring.

Method: Created SmokeBench benchmark with four tasks: smoke classification, tile-based localization, grid-based localization, and smoke detection. Evaluated 7 MLLMs including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro.
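
As a concrete example of how grid-based localization answers can be scored, here is one plausible metric (IoU over the set of named grid cells); the benchmark's actual metric may differ.

```python
def grid_localization_score(pred_cells, gt_cells):
    """Score a grid-based smoke localization answer by IoU over the set of
    grid cells the model names (row-major cell indices)."""
    pred, gt = set(pred_cells), set(gt_cells)
    if not pred and not gt:
        return 1.0   # correctly reporting no smoke anywhere
    return len(pred & gt) / len(pred | gt)

# Model says smoke is in cells 0 and 1 of a 3x3 grid; ground truth is 1 and 2.
print(grid_localization_score([0, 1], [1, 2]))  # 1/3
```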

Result: Models can classify smoke when it covers large areas but struggle with accurate localization, especially in early stages. Smoke volume strongly correlates with performance, while contrast has minor impact.

Conclusion: Current MLLMs have critical limitations for safety-critical wildfire monitoring, highlighting need for improved methods for early-stage smoke localization.

Abstract: Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.

[79] VFMF: World Modeling by Forecasting Vision Foundation Model Features

Gabrijel Boduljak, Yushi Lan, Christian Rupprecht, Andrea Vedaldi

Main category: cs.CV

TL;DR: The paper introduces a generative forecasting method that performs autoregressive flow matching in vision foundation model (VFM) feature space, addressing the limitations of both pixel-based video generation and deterministic regression approaches.

Motivation: Existing approaches have trade-offs: pixel-based video generation is computationally intensive and not directly useful for decision-making, while deterministic regression of VFM features averages over plausible futures and fails to capture uncertainty. There's a need for efficient, actionable forecasting that preserves uncertainty.

Method: The method uses generative modeling in VFM feature space through autoregressive flow matching. Key innovation is encoding VFM features into a compact latent space suitable for diffusion, which preserves information better than PCA-based alternatives. This allows stochastic conditional generation of future world states.
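
A minimal sketch of conditional flow matching in a latent feature space: the network regresses the straight-path velocity toward the next latent, conditioned on past context. Dimensions and the MLP are illustrative; the paper's VFM encoder/decoder around this step are omitted.

```python
import torch
import torch.nn as nn

# Minimal conditional flow-matching objective in a latent feature space:
# the network regresses the constant velocity (z1 - z0) along straight paths.
velocity_net = nn.Sequential(nn.Linear(64 + 64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

def flow_matching_loss(z_next, z_context):
    z0 = torch.randn_like(z_next)                  # noise endpoint
    t = torch.rand(z_next.shape[0], 1)
    zt = (1 - t) * z0 + t * z_next                 # point on the straight path
    target_v = z_next - z0
    pred_v = velocity_net(torch.cat([zt, z_context, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

# z_context stands in for encoded past VFM features (autoregressive conditioning).
loss = flow_matching_loss(torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
print(loss.item())
```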

Result: The latent space preserves information more effectively than PCA-based alternatives for both forecasting and other applications like image generation. With matched architecture and compute, the method produces sharper and more accurate predictions than regression across all modalities (semantic segmentation, depth, surface normals, RGB).

Conclusion: Stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models, combining the efficiency and actionability of VFM features with the uncertainty modeling of generative approaches.

Abstract: Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

[80] FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model

Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, Xianming Liu, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: FutureX enhances end-to-end autonomous driving planners by using Chain of Thought reasoning with a latent world model to predict future scenes and refine trajectories, switching between instant mode for simple scenes and thinking mode for complex scenarios.

Motivation: Current end-to-end planners rely only on current scene representations, which can lead to suboptimal responses in dynamic traffic environments where ego actions affect future scenes. There's a need to model scene evolution through complex reasoning about interactions between the ego vehicle and environment.

Method: Proposes FutureX pipeline with Auto-think Switch that decides when additional reasoning is needed. In Thinking mode, uses Latent World Model with Chain of Thought-guided rollout to predict future scene representations, then Summarizer Module refines motion plans. In Instant mode, generates plans in forward pass for simple scenes.
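
A toy sketch of the two-mode control flow; the gate, the GRU stand-in for the latent world model, and the waypoint heads are all our own illustrative components, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AutoThinkPlanner(nn.Module):
    """Sketch of the two-mode idea: a gate scores scene complexity; simple
    scenes get an instant plan, complex ones get a latent rollout first."""
    def __init__(self, d=256, horizon=4):
        super().__init__()
        self.gate = nn.Linear(d, 1)
        self.instant_head = nn.Linear(d, 2 * 8)          # 8 (x, y) waypoints
        self.world_model = nn.GRUCell(d, d)              # stand-in latent world model
        self.refine_head = nn.Linear(d, 2 * 8)
        self.horizon = horizon

    def forward(self, scene, think_threshold=0.5):
        if torch.sigmoid(self.gate(scene)).item() < think_threshold:
            return self.instant_head(scene)              # Instant mode
        h = scene
        for _ in range(self.horizon):                    # Thinking mode: rollout
            h = self.world_model(h, h)                   # predict future latents
        return self.refine_head(h)                       # refined plan

plan = AutoThinkPlanner()(torch.randn(1, 256))
print(plan.shape)  # (1, 16): flattened waypoints
```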

Result: Extensive experiments show FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency. Achieves substantial performance gains, including 6.2 PDMS improvement for TransFuser on NAVSIM benchmark.

Conclusion: FutureX successfully integrates Chain of Thought reasoning with world modeling to improve autonomous driving planners by enabling future scene prediction and trajectory refinement, balancing computational efficiency with complex reasoning for dynamic environments.

Abstract: In autonomous driving, end-to-end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT-driven pipeline that enhances end-to-end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto-think Switch examines the current scene and decides whether additional reasoning is required to yield a higher-quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT-guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes. Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.

[81] RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

Main category: cs.CV

TL;DR: RoomPilot is a unified framework that converts text or CAD floor plans into structured indoor scenes using an Indoor Domain-Specific Language (IDSL), enabling controllable, interactive scene generation with realistic object behaviors.

Motivation: Existing approaches for indoor scene generation are either limited in input modalities or rely on stochastic processes that reduce controllability, hindering applications in game development, architectural visualization, and embodied AI training.

Method: RoomPilot parses diverse multi-modal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL), which serves as a shared semantic representation. It uses a curated dataset of interaction-annotated assets to synthesize environments with realistic object behaviors, unlike conventional procedural methods that produce visually plausible but functionally inert layouts.

Result: Extensive experiments validate RoomPilot’s strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity compared to existing approaches.

Conclusion: RoomPilot represents a significant step toward general-purpose controllable 3D indoor scene generation by enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics through a well-designed IDSL framework.

Abstract: Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.

[82] WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering

Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu

Main category: cs.CV

TL;DR: WildCap enables high-quality facial appearance capture from smartphone videos in uncontrolled lighting using a hybrid inverse rendering framework that combines data-driven preprocessing with model-based optimization and texel grid lighting.

Motivation: Existing facial appearance capture methods require controllable lighting, which increases cost and limits usability. There's a need for high-quality capture from in-the-wild smartphone videos without specialized lighting setups.

Method: Proposes WildCap: 1) Uses SwitchLight (data-driven method) to convert in-the-wild images to constrained conditions, 2) Adopts model-based inverse rendering, 3) Introduces texel grid lighting model to handle non-physical artifacts from network predictions, 4) Jointly samples diffusion prior for reflectance maps and optimizes lighting to resolve scale ambiguity.

Result: Achieves significantly better results than prior methods in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin.

Conclusion: WildCap enables high-quality facial appearance capture from ordinary smartphone videos without specialized lighting, making facial capture more accessible and practical for real-world applications.

Abstract: Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior methods in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released at https://yxuhan.github.io/WildCap/index.html.

[83] Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition

Wen-Jue He, Xiaofeng Zhu, Zheng Zhang

Main category: cs.CV

TL;DR: Cross-modal Prompting (ComP) method for incomplete multi-modal emotion recognition that uses progressive prompts and cross-modal knowledge propagation to handle missing data and improve recognition accuracy.

DetailsMotivation: Incomplete multi-modal emotion recognition faces challenges with performance gaps, modality under-optimization, and missing data issues that hinder effective multi-modal learning.

Method: Progressive prompt generation with dynamic gradient modulator produces modality semantic cues; cross-modal knowledge propagation amplifies consistent information; coordinator dynamically re-weights modality outputs.
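
To make the coordinator concrete, below is a minimal sketch of one plausible design in which a learned scalar score per modality gates the fusion of per-modality logits; the layer sizes and softmax gate are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical coordinator that dynamically re-weights per-modality predictions.
import torch
import torch.nn as nn

class Coordinator(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one confidence score per modality

    def forward(self, modality_feats, modality_logits):
        # modality_feats: list of (batch, feat_dim); modality_logits: list of (batch, classes)
        w = torch.softmax(torch.cat([self.score(f) for f in modality_feats], dim=1), dim=1)
        stacked = torch.stack(modality_logits, dim=1)  # (batch, n_modalities, classes)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)  # confidence-weighted fusion

coord = Coordinator(feat_dim=128)
feats = [torch.rand(4, 128) for _ in range(3)]   # e.g. audio / text / vision
logits = [torch.rand(4, 7) for _ in range(3)]    # e.g. 7 emotion classes
print(coord(feats, logits).shape)                # torch.Size([4, 7])
```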

Result: Extensive experiments on 4 datasets with 7 state-of-the-art methods under different missing rates validate the effectiveness of the proposed method.

Conclusion: The ComP method successfully addresses incomplete multi-modal emotion recognition challenges by enhancing modality-specific features and improving overall recognition accuracy through coherent information emphasis.

Abstract: Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring the partially observed multi-source data. Although the multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are further exacerbated when data is missing. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality’s performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model’s efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.

[84] PersonaLive! Expressive Portrait Image Animation for Live Streaming

Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun

Main category: cs.CV

TL;DR: PersonaLive is a diffusion-based framework for real-time portrait animation that achieves 7-22x speedup over prior methods through hybrid implicit signals, appearance distillation, and streaming generation.

DetailsMotivation: Current diffusion-based portrait animation models focus on visual quality but overlook generation latency and real-time performance, limiting their application in live streaming scenarios.

Method: 1) Hybrid implicit signals (implicit facial representations + 3D implicit keypoints) for expressive motion control. 2) Fewer-step appearance distillation to eliminate redundancy and improve efficiency. 3) Autoregressive micro-chunk streaming generation with sliding training and historical keyframe mechanism for low-latency stable video generation.
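
The streaming paradigm can be illustrated with a toy generation loop; `denoise_chunk` is a hypothetical stand-in for the distilled few-step denoiser, and the chunk and history sizes are illustrative, not the paper's settings.

```python
# Toy micro-chunk streaming loop with a bounded keyframe history.
def stream_animation(motion_signals, denoise_chunk, chunk_size=4, max_keyframes=8):
    history, frames = [], []
    for i in range(0, len(motion_signals), chunk_size):
        cond = motion_signals[i:i + chunk_size]  # driving signals for one micro-chunk
        chunk = denoise_chunk(cond, history)     # few-step, low-latency generation
        frames.extend(chunk)
        history.append(chunk[-1])                # keep the last frame as a keyframe
        history = history[-max_keyframes:]       # bounded memory for long streams
    return frames
```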

Result: Achieves state-of-the-art performance with 7-22x speedup over prior diffusion-based portrait animation models, enabling real-time streaming applications.

Conclusion: PersonaLive successfully addresses the latency limitations of diffusion-based portrait animation, making real-time streaming applications feasible through its multi-stage training approach and efficient generation paradigm.

Abstract: Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

[85] Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers

Ali El Bellaj, Mohammed-Amine Cheddadi, Rhassan Berber

Main category: cs.CV

TL;DR: Reformer-based vision model reduces theoretical complexity from O(n²) to O(n log n) but ViT outperforms it in practical efficiency on larger datasets.

DetailsMotivation: Standard Vision Transformers (ViTs) have quadratic computational complexity in the token count, making them expensive for high-resolution images and resource-constrained settings. The authors investigate the Reformer architecture as a more efficient alternative.

Method: Combine patch-based tokenization with locality-sensitive hashing (LSH) attention to approximate global self-attention while reducing theoretical time complexity. Evaluate on CIFAR-10 (small-scale), ImageNet-100 (accuracy-efficiency trade-off), and high-resolution medical imaging dataset.
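
To see why LSH attention lowers the quadratic cost, the sketch below hashes tokens with random rotations and attends only within buckets. It uses a single hash round and shared queries/keys for brevity, so it is a simplification of a full Reformer layer rather than the authors' exact model.

```python
# Minimal single-head LSH attention: O(n^2) work is confined to small buckets.
import numpy as np

def lsh_attention(x, n_buckets=8):
    n, d = x.shape
    rot = np.random.randn(d, n_buckets // 2)  # random-rotation hashing
    proj = x @ rot
    buckets = np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

    out = np.zeros_like(x)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]   # tokens hashed to bucket b
        q = k = v = x[idx]                # shared QK, as in Reformer
        scores = q @ k.T / np.sqrt(d)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        out[idx] = attn @ v               # attention restricted to the bucket
    return out

tokens = np.random.randn(256, 64)   # e.g. 256 patch tokens of dimension 64
print(lsh_attention(tokens).shape)  # (256, 64)
```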

Result: Reformer achieves higher accuracy on CIFAR-10 compared to ViT baseline, but ViT consistently outperforms Reformer in practical efficiency and end-to-end computation time on larger and higher-resolution settings.

Conclusion: Despite theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images. ViT remains more practical for current vision tasks.

Abstract: Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from O(n²) to O(n log n) in the sequence length n. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy–efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.

[86] Evaluating the Efficacy of Sentinel-2 versus Aerial Imagery in Serrated Tussock Classification

Rezwana Sultana, Manzur Murshed, Kathryn Sheffield, Singarayer Florentine, Tsz-Kwan Lee, Shyh Wei Teng

Main category: cs.CV

TL;DR: Multi-temporal Sentinel-2 satellite imagery with enhanced features achieves comparable accuracy to aerial imagery for landscape-scale monitoring of invasive serrated tussock grass in Australia.

DetailsMotivation: Current ground surveys for invasive serrated tussock are effective but not scalable for landscape monitoring. Aerial imagery is too expensive, while satellite imagery offers cost-effective scalability but with lower resolution. The study aims to determine if multi-temporal Sentinel-2 data can provide comparable monitoring capabilities despite its lower spatial resolution.

Method: Developed eleven models using various combinations of Sentinel-2 spectral bands, texture features, vegetation indices, and seasonal data. Used random forest classifier to evaluate performance. Compared results with aerial imaging models on the same dataset.
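
To reproduce this evaluation style, one can train a random forest on per-pixel features and report the study's two metrics (Overall Accuracy and Kappa), as in the sketch below; the feature layout and random data are placeholders, not the paper's dataset.

```python
# Illustrative random-forest classification with OA and Kappa reporting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
n_pixels = 1000
# Placeholder features: 10 spectral bands x 4 seasons + 4 vegetation indices.
X = rng.random((n_pixels, 10 * 4 + 4))
y = rng.integers(0, 3, n_pixels)  # e.g. tussock / pasture / other

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X[:800], y[:800])
pred = clf.predict(X[800:])
print("OA:", accuracy_score(y[800:], pred))
print("Kappa:", cohen_kappa_score(y[800:], pred))
```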

Result: Best Sentinel-2 model (M76*) achieved 68% Overall Accuracy and 0.55 Overall Kappa, slightly outperforming the best aerial imaging model (67% OA, 0.52 OK). Multi-temporal feature-enhanced satellite models showed comparable performance to higher-resolution aerial imagery.

Conclusion: Multi-seasonal feature-enhanced satellite-based models offer a viable, cost-effective alternative for scalable invasive species classification, demonstrating that lower-resolution satellite data can achieve comparable accuracy to aerial imagery when leveraging spectral and phenological information.

Abstract: Invasive species pose major global threats to ecosystems and agriculture. Serrated tussock (Nassella trichotoma) is a highly competitive invasive grass species that disrupts native grasslands, reduces pasture productivity, and increases land management costs. In Victoria, Australia, it presents a major challenge due to its aggressive spread and ecological impact. While current ground surveys and subsequent management practices are effective at small scales, they are not feasible for landscape-scale monitoring. Although aerial imagery offers high spatial resolution suitable for detailed classification, its high cost limits scalability. Satellite-based remote sensing provides a more cost-effective and scalable alternative, though often with lower spatial resolution. This study evaluates whether multi-temporal Sentinel-2 imagery, despite its lower spatial resolution, can provide a comparable and cost-effective alternative for landscape-scale monitoring of serrated tussock by leveraging its higher spectral resolution and seasonal phenological information. A total of eleven models have been developed using various combinations of spectral bands, texture features, vegetation indices, and seasonal data. Using a random forest classifier, the best-performing Sentinel-2 model (M76*) has achieved an Overall Accuracy (OA) of 68% and an Overall Kappa (OK) of 0.55, slightly outperforming the best-performing aerial imaging model’s OA of 67% and OK of 0.52 on the same dataset. These findings highlight the potential of multi-seasonal feature-enhanced satellite-based models for scalable invasive species classification.

[87] FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

Xiangyang Luo, Qingyu Li, Xiaokun Liu, Wenyu Qin, Miao Yang, Meng Wang, Pengfei Wan, Di Zhang, Kun Gai, Shao-Lun Huang

Main category: cs.CV

TL;DR: FilmWeaver is a novel framework for generating consistent multi-shot videos of arbitrary length using autoregressive diffusion with dual-level cache mechanisms for inter-shot consistency and intra-shot coherence.

DetailsMotivation: Current video generation models struggle with multi-shot videos, particularly in maintaining character/background consistency across shots and generating videos of arbitrary length and shot count.

Method: Uses autoregressive diffusion paradigm with dual-level cache: shot memory for inter-shot consistency (caches keyframes from preceding shots) and temporal memory for intra-shot coherence (retains frame history from current shot). Supports flexible user interaction and downstream tasks like multi-concept injection and video extension.
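
The dual-level cache is easiest to see as a data structure. Below is a minimal sketch that treats keyframes and frames as opaque tensors; the capacities and the clearing policy at shot boundaries are illustrative guesses at the mechanism described above, not the released implementation.

```python
# Sketch of a dual-level cache: shot memory for identity, temporal memory for motion.
from collections import deque

class DualLevelCache:
    def __init__(self, shot_capacity=4, temporal_capacity=16):
        self.shot_memory = deque(maxlen=shot_capacity)          # keyframes of past shots
        self.temporal_memory = deque(maxlen=temporal_capacity)  # recent frames, current shot

    def condition(self):
        # Inter-shot identity cues plus intra-shot motion history.
        return list(self.shot_memory) + list(self.temporal_memory)

    def push_frames(self, frames):
        self.temporal_memory.extend(frames)

    def end_shot(self, keyframe):
        self.shot_memory.append(keyframe)
        self.temporal_memory.clear()  # a new shot starts with fresh motion history
```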

Result: Surpasses existing approaches on both consistency and aesthetic quality metrics. Demonstrates high versatility for downstream tasks and enables creation of more consistent, controllable, and narrative-driven video content.

Conclusion: FilmWeaver addresses critical limitations in multi-shot video generation, opening new possibilities for consistent, controllable, narrative-driven video content through its decoupled consistency design and dual-level cache mechanism.

Abstract: Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce FilmWeaver, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: https://filmweaver.github.io

[88] RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection

Rongcheng Wu, Hao Zhu, Shiying Zhang, Mingzhe Wang, Zhidong Li, Hui Li, Jianlong Zhou, Jiangtao Cui, Fang Chen, Pingyang Sun, Qiyu Liao, Ye Lin

Main category: cs.CV

TL;DR: RcAE: Recursive autoencoder with iterative reconstruction for industrial anomaly detection, using cross-recursion tracking and detail preservation to outperform traditional methods with high efficiency.

DetailsMotivation: Traditional autoencoder-based anomaly detection methods struggle with incomplete anomaly suppression and loss of fine details due to single-pass decoding, which fails to handle anomalies with varying severity and scale effectively.

Method: Proposes Recursive Autoencoder (RcAE) with iterative reconstruction to progressively suppress anomalies while refining normal structures. Includes Cross Recursion Detection (CRD) module to track inconsistencies across recursion steps, and Detail Preservation Network (DPN) to recover high-frequency textures lost during reconstruction.
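
A small sketch shows how recursion yields a sequence of reconstructions whose cross-step inconsistencies can be scored. The toy autoencoder and the additive combination of residual and cross-recursion terms below are illustrative assumptions, not the published CRD/DPN design.

```python
# Recursive reconstruction with a simple cross-recursion anomaly score.
import torch
import torch.nn as nn

def recursive_anomaly_map(ae: nn.Module, x: torch.Tensor, steps: int = 3):
    recons, cur = [], x
    for _ in range(steps):
        cur = ae(cur)  # each pass further suppresses anomalies
        recons.append(cur)
    # Inconsistency across successive passes highlights regions the model
    # keeps "correcting", i.e. likely anomalies.
    diffs = [(recons[i] - recons[i + 1]).abs() for i in range(steps - 1)]
    crd = torch.stack(diffs).mean(dim=0)
    residual = (x - recons[-1]).abs()
    return residual + crd  # combined per-pixel anomaly map

toy_ae = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 3, 3, padding=1))
print(recursive_anomaly_map(toy_ae, torch.rand(1, 3, 64, 64)).shape)
```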

Result: Significantly outperforms existing non-diffusion methods, achieves performance on par with recent diffusion models with only 10% of their parameters, and offers substantially faster inference.

Conclusion: The recursive architecture with cross-recursion tracking and detail preservation provides a practical and efficient approach for real-world industrial anomaly detection applications, balancing performance with computational efficiency.

Abstract: Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage these reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods, and achieves performance on par with recent diffusion models with only 10% of their parameters while offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.

[89] Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context

Cuifeng Shen, Lumin Xu, Xingguo Zhu, Gengdai Liu

Main category: cs.CV

TL;DR: ARVAE is an autoregressive video autoencoder that decouples temporal and spatial information using flow fields and spatial compensation, enabling efficient compression and reconstruction of arbitrary-length videos with high quality.

DetailsMotivation: Existing video autoencoders often entangle spatial and temporal information, which limits their ability to capture temporal consistency and leads to suboptimal performance in video reconstruction and generation tasks.

Method: ARVAE compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner. It uses a temporal-spatial decoupled representation combining downsampled flow fields for temporal coherence with spatial relative compensation for new content. The encoder compresses current and previous frames into temporal motion and spatial supplement, while the decoder reconstructs the original frame from these latent representations given the preceding frame. A multi-stage training strategy is employed for progressive optimization.
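
The autoregressive codec loop can be sketched in a few lines; `encode` and `decode` are hypothetical stand-ins for ARVAE's networks, and sending the first frame uncompressed is a simplifying assumption.

```python
# Per-frame autoregressive coding conditioned on the previously decoded frame.
def compress_video(frames, encode, decode):
    prev = frames[0]
    latents, recons = [], [prev]  # assume the first frame is sent as-is
    for frame in frames[1:]:
        motion, supplement = encode(frame, prev)  # temporal + spatial latents
        latents.append((motion, supplement))
        prev = decode(motion, supplement, prev)   # reconstruct, then condition on it
        recons.append(prev)
    return latents, recons
```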

Result: ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. It demonstrates strong potential for downstream video generation applications.

Conclusion: ARVAE effectively addresses the temporal-spatial entanglement problem in video autoencoders through its autoregressive approach and decoupled representation, offering efficient compression and high-quality reconstruction for arbitrary-length videos with promising applications in video generation tasks.

Abstract: Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos with arbitrary lengths. ARVAE introduces a temporal-spatial decoupled representation that combines a downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current and previous frames into the temporal motion and spatial supplement, while the decoder reconstructs the original frame from the latent representations given the preceding frame. A multi-stage training strategy is employed to progressively optimize the model. Extensive experiments demonstrate that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Moreover, evaluations on video generation tasks highlight its strong potential for downstream applications.

[90] Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining

Yasaman Hashem Pour, Nazanin Mahjourian, Vinh Nguyen

Main category: cs.CV

TL;DR: A few-shot VLM approach for verifying manually generated G-code by simultaneously analyzing both G-code text and HMI screenshots, improving error detection in CNC training.

DetailsMotivation: Existing LLM-based G-code verification only examines programming errors but cannot leverage HMI knowledge since LLMs lack vision capabilities. CNC machining requires understanding both G-code and HMI displays for comprehensive error detection.

Method: Proposes a few-shot VLM approach that evaluates paired G-code text and HMI screenshots from a lathe. Uses structured JSON schema based on prior knowledge for few-shot learning, with correct/error examples as few-shot prompts to guide the VLM.
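
One plausible way to assemble such a few-shot multimodal prompt is sketched below; the JSON schema fields and message layout are assumptions based on this summary and do not reproduce the paper's exact schema or any particular VLM API.

```python
# Building a few-shot prompt that pairs G-code text with HMI screenshots.
import json

schema = {
    "gcode_errors": "list of line-level G-code issues",
    "hmi_errors": "list of HMI alarm/status issues",
    "safe_to_run": "boolean overall judgment",
}

def build_prompt(few_shot_examples, query_gcode, query_hmi_image):
    messages = [{"role": "system",
                 "content": "Check the G-code and HMI screenshot. "
                            "Answer as JSON: " + json.dumps(schema)}]
    for gcode, image, answer in few_shot_examples:  # (text, image, expected JSON)
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": gcode},
                                     {"type": "image", "image": image}]})
        messages.append({"role": "assistant", "content": json.dumps(answer)})
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": query_gcode},
                                 {"type": "image", "image": query_hmi_image}]})
    return messages
```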

Result: Few-shot VLM outperformed zero-shot VLM in detecting HMI errors and G-code discrepancies. Showed enhanced overall error detection capability for more comprehensive debugging.

Conclusion: The framework is suitable for verifying manually generated G-code in CNC training, providing more comprehensive verification by combining visual HMI analysis with G-code text analysis.

Abstract: Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large-Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated against a zero-shot VLM on multiple scenarios of incorrect G-code and HMI errors, measured by per-slot accuracy. Few-shot prompting yielded an overall improvement in detecting HMI errors and discrepancies with the G-code, enabling more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training.

[91] MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction

Bate Li, Houqiang Zhong, Zhengxue Cheng, Qiang Hu, Qiang Wang, Li Song, Wenjun Zhang

Main category: cs.CV

TL;DR: MultiEgo is the first multi-view egocentric dataset for 4D dynamic scene reconstruction, featuring five social interaction scenes with synchronized egocentric videos from AR glasses.

DetailsMotivation: Existing reconstruction datasets focus on static multi-view or single-egocentric setups, lacking multi-view egocentric datasets for dynamic scene reconstruction needed for holographic documentation of social interactions.

Method: Created MultiEgo dataset with five social interaction scenes (meetings, performances, and a presentation) using AR glasses. Developed hardware-based data acquisition system with sub-millisecond temporal synchronization across views and accurate pose annotations.

Result: The dataset provides five authentic egocentric videos per scene with precise synchronization and pose data. Experimental validation demonstrates practical utility for free-viewpoint video applications.

Conclusion: MultiEgo establishes a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research, addressing the gap in existing datasets for dynamic social interaction documentation.

Abstract: Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.

[92] SATMapTR: Satellite Image Enhanced Online HD Map Construction

Bingyuan Huang, Guanyi Zhao, Qian Xu, Yang Lou, Yung-Hui Li, Jianping Wang

Main category: cs.CV

TL;DR: SATMapTR is a novel online HD map construction model that effectively fuses satellite images with BEV features using gated refinement and geometry-aware fusion, achieving state-of-the-art performance on nuScenes dataset.

DetailsMotivation: HD maps need real-time construction but suffer from low-quality onboard sensor data caused by sensor limitations and frequent occlusions. Satellite images offer wide-area views but are degraded by shadows and occlusions, and existing fusion methods remain ineffective.

Method: SATMapTR uses two key components: (1) a gated feature refinement module that adaptively filters satellite image features by combining high-level semantics with low-level structural cues, and (2) a geometry-aware fusion module that consistently fuses satellite and BEV features at grid-to-grid level to minimize interference from irrelevant regions.
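
As a rough sketch of these two ideas, the module below gates refined satellite features with a per-grid-cell confidence before adding them to BEV features; the sigmoid gate and layer shapes are illustrative assumptions, not the released architecture.

```python
# Gated grid-to-grid fusion of satellite and BEV feature maps.
import torch
import torch.nn as nn

class GatedGridFusion(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.refine = nn.Conv2d(c, c, 3, padding=1)  # filter map-relevant cues
        self.gate = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.Sigmoid())

    def forward(self, bev, sat):
        sat = self.refine(sat)
        g = self.gate(torch.cat([bev, sat], dim=1))  # per-grid-cell confidence
        return bev + g * sat                         # suppress low-quality regions

fusion = GatedGridFusion(64)
out = fusion(torch.rand(1, 64, 200, 200), torch.rand(1, 64, 200, 200))
print(out.shape)  # torch.Size([1, 64, 200, 200])
```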

Result: On nuScenes dataset, SATMapTR achieves highest mAP of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.

Conclusion: SATMapTR effectively addresses the challenges of fusing satellite images for online HD map construction through adaptive feature refinement and geometry-aware fusion, demonstrating superior performance and robustness in diverse driving scenarios.

Abstract: High-definition (HD) maps are evolving from pre-annotated to real-time construction to better support autonomous driving in diverse scenarios. However, this process is hindered by low-quality input data caused by onboard sensors' limited capability and frequent occlusions, leading to incomplete, noisy, or missing data, and thus reduced mapping accuracy and robustness. Recent efforts have introduced satellite images as auxiliary input, offering a stable, wide-area view to complement the limited ego perspective. However, satellite images in Bird’s Eye View are often degraded by shadows and occlusions from vegetation and buildings. Prior methods using basic feature extraction and fusion remain ineffective. To address these challenges, we propose SATMapTR, a novel online map construction model that effectively fuses satellite images through two key components: (1) a gated feature refinement module that adaptively filters satellite image features by integrating high-level semantics with low-level structural cues to extract high signal-to-noise ratio map-relevant representations; and (2) a geometry-aware fusion module that consistently fuses satellite and BEV features at a grid-to-grid level, minimizing interference from irrelevant regions and low-quality inputs. Experimental results on the nuScenes dataset show that SATMapTR achieves the highest mean average precision (mAP) of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It also shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.

[93] KeyframeFace: From Text to Expressive Facial Keyframes

Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

Main category: cs.CV

TL;DR: KeyframeFace introduces a large-scale multimodal dataset and LLM-based framework for text-to-facial-animation with keyframe-level supervision, enabling interpretable and expressive 3D facial performance generation from natural language.

DetailsMotivation: Existing datasets and methods for facial animation lack semantic grounding and temporal structure needed for expressive human performance generation from natural language. Current approaches focus on speech-driven animation or unstructured expression sequences, missing the nuanced understanding required for text-to-animation tasks.

Method: 1) Created KeyframeFace dataset with 2,100 expressive scripts paired with videos, ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations using LLMs/MLLMs. 2) Proposed a text-to-animation framework that leverages LLM priors to align semantic understanding with interpretable ARKit coefficient structure for facial motion synthesis.

Result: KeyframeFace provides the first large-scale multimodal dataset with keyframe-level supervision for text-to-animation research. The LLM-based framework enables high-fidelity expressive animation by bridging semantic understanding with interpretable facial motion representation.

Conclusion: KeyframeFace and the proposed LLM-based framework establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation, addressing limitations of existing approaches and enabling more expressive and semantically grounded facial performance generation.

Abstract: Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit’s coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.

[94] DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang

Main category: cs.CV

TL;DR: DentalGPT is a specialized 7B-parameter multimodal LLM for dentistry that achieves state-of-the-art performance through high-quality domain data injection and reinforcement learning, outperforming larger models on dental diagnostic tasks.

DetailsMotivation: Current multimodal LLMs struggle with fine-grained dental visual details and lack sufficient reasoning ability for precise dental diagnosis, creating a need for specialized dental AI systems.

Method: 1) Constructed largest annotated multimodal dental dataset (120k+ images with detailed descriptions highlighting diagnostic features); 2) High-quality domain knowledge injection through training on this dataset; 3) Reinforcement learning stage to enhance multimodal complex reasoning capabilities.

Result: DentalGPT achieves superior performance on intraoral and panoramic benchmarks, and dental subsets of medical VQA benchmarks, outperforming state-of-the-art MLLMs in disease classification and dental VQA tasks despite having only 7B parameters.

Conclusion: High-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs, demonstrating the value of targeted domain expertise in medical AI.

Abstract: Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM’s visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

[95] MLLM Machine Unlearning via Visual Knowledge Distillation

Yuhang Wang, Zhenxing Niu, Haoxuan Ji, Guangyu He, Haichang Gao, Gang Hua

Main category: cs.CV

TL;DR: Proposes a novel machine unlearning method for MLLMs that selectively erases visual knowledge while preserving textual knowledge using Visual Knowledge Distillation from intermediate representations.

DetailsMotivation: Most existing machine unlearning methods are designed for LLMs, leaving MLLM-oriented unlearning underdeveloped. There's a need for approaches that can selectively remove sensitive visual information from MLLMs while maintaining their textual capabilities.

Method: Disentangles visual and textual knowledge in MLLMs, introduces Visual Knowledge Distillation (VKD) that uses intermediate visual representations as supervision signals, and only fine-tunes the visual components of the MLLM for efficiency.
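
A minimal sketch of the idea, assuming intermediate visual features can be extracted from both the student and a reference target, and that visual parameters can be selected by name; the `visual` name filter is a hypothetical convention, not the paper's module naming.

```python
# Intermediate-feature supervision with only visual components trainable.
import torch
import torch.nn.functional as F

def vkd_loss(student_visual_feats, target_visual_feats):
    # Supervise intermediate visual representations instead of output text.
    return F.mse_loss(student_visual_feats, target_visual_feats)

def select_visual_params(mllm: torch.nn.Module, visual_prefix: str = "visual"):
    # Fine-tune only the visual components for efficiency.
    for name, p in mllm.named_parameters():
        p.requires_grad = name.startswith(visual_prefix)
    return [p for p in mllm.parameters() if p.requires_grad]
```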

Result: Outperforms state-of-the-art unlearning methods in both effectiveness and efficiency. First to evaluate robustness against relearning attacks, demonstrating superior performance.

Conclusion: The proposed VKD-based approach provides an effective and efficient solution for MLLM unlearning, enabling selective removal of visual knowledge while preserving textual capabilities, with demonstrated robustness against attacks.

Abstract: Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains in its early stages. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.

[96] Physics-Informed Video Flare Synthesis and Removal Leveraging Motion Independence between Flare and Scene

Junqiao Wang, Yuanfei Huang, Hua Huang

Main category: cs.CV

TL;DR: First video flare removal method with physics-informed synthesis pipeline and Mamba-based temporal modeling that outperforms existing approaches while preserving light sources and spatiotemporal consistency.

DetailsMotivation: Video flare removal is more challenging than image flare removal due to complex independent motion of flares, light sources, and scene content, causing flicker and artifacts. Existing methods focus on images, leaving video flare spatiotemporal characteristics unexplored.

Method: 1) Physics-informed dynamic flare synthesis pipeline using optical flow for light source motion and modeling temporal behaviors of scattering/reflective flares. 2) Video flare removal network with attention module for spatial flare suppression and Mamba-based temporal modeling for long-range spatiotemporal dependencies without multi-frame alignment.
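
One concrete reading of the synthesis step is compositing a flare template along a light-source trajectory derived from optical flow; the additive composition and clipping below are illustrative assumptions about the pipeline, not its exact physics model.

```python
# Paste a flare patch onto each frame at a moving light-source position.
import numpy as np

def synthesize_flare_video(clean_frames, flare, trajectory):
    out = []
    h, w = clean_frames[0].shape[:2]
    fh, fw = flare.shape[:2]
    for frame, (cx, cy) in zip(clean_frames, trajectory):
        canvas = frame.astype(np.float32).copy()
        y0, x0 = int(cy - fh / 2), int(cx - fw / 2)
        ys = slice(max(y0, 0), min(y0 + fh, h))  # clip to image bounds
        xs = slice(max(x0, 0), min(x0 + fw, w))
        canvas[ys, xs] += flare[ys.start - y0:ys.stop - y0,
                                xs.start - x0:xs.stop - x0]  # additive flare
        out.append(np.clip(canvas, 0, 255))
    return out

frames = [np.full((240, 320), 50, np.float32) for _ in range(3)]
flare = 80 * np.ones((60, 60), np.float32)
traj = [(100 + 10 * t, 120) for t in range(3)]  # flow-derived light path (x, y)
video = synthesize_flare_video(frames, flare, traj)
```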

Result: Created first video flare dataset with synthetic paired videos and real-world videos. Method consistently outperforms existing video restoration and image flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and spatiotemporal consistency.

Conclusion: Proposed approach successfully addresses video flare removal challenges by modeling motion-independent spatiotemporal representations, eliminating alignment needs, and reducing temporal aliasing, establishing a new benchmark for video flare removal.

Abstract: Lens flare is a degradation phenomenon caused by strong light sources. Existing research on flare removal has mainly focused on images, while the spatiotemporal characteristics of video flare remain largely unexplored. Video flare synthesis and removal pose significantly greater challenges than in images, owing to the complex and mutually independent motion of flare, light sources, and scene content. This motion independence further affects restoration performance, often resulting in flicker and artifacts. To address this issue, we propose a physics-informed dynamic flare synthesis pipeline, which simulates light source motion using optical flow and models the temporal behaviors of both scattering and reflective flares. Meanwhile, we design a video flare removal network that employs an attention module to spatially suppress flare regions and incorporates a Mamba-based temporal modeling component to capture long-range spatiotemporal dependencies. This motion-independent spatiotemporal representation effectively eliminates the need for multi-frame alignment, alleviating temporal aliasing between flares and scene content and thereby improving video flare removal performance. Building upon this, we construct the first video flare dataset to comprehensively evaluate our method, which includes a large set of synthetic paired videos and additional real-world videos collected from the Internet to assess generalization capability. Extensive experiments demonstrate that our method consistently outperforms existing video-based restoration and image-based flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and maintaining spatiotemporal consistency of the scene.

[97] FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation

Yixuan Zhang, Qing Xu, Yue Li, Xiangjian He, Qian Zhang, Mainul Haque, Rong Qu, Wenting Duan, Zhen Chen

Main category: cs.CV

TL;DR: FreqDINO: A frequency-guided ultrasound image segmentation framework that enhances boundary perception by separating low-frequency structures and multi-scale high-frequency boundary details, addressing DINOv3’s limitations in handling ultrasound-specific boundary degradation.

DetailsMotivation: Ultrasound image segmentation faces challenges from speckle noise and imaging artifacts. While DINOv3 shows promise for medical image segmentation, it lacks sensitivity to ultrasound-specific boundary degradation due to being pre-trained on natural images.

Method: Proposes FreqDINO with three key components: 1) Multi-scale Frequency Extraction and Alignment (MFEA) to separate low-frequency structures and multi-scale high-frequency boundary details with learnable attention alignment; 2) Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components; 3) Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions.
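
The low/high-frequency separation that MFEA builds on can be illustrated with a simple FFT low-pass split; the box-shaped mask and fixed cutoff below are placeholders for the learned multi-scale decomposition.

```python
# Split an image into low-frequency structure and high-frequency boundary detail.
import numpy as np

def frequency_split(img, cutoff=0.1):
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy = np.arange(h)[:, None] - h // 2   # centered frequency coordinates
    xx = np.arange(w)[None, :] - w // 2
    mask = (np.abs(yy) < cutoff * h) & (np.abs(xx) < cutoff * w)  # low-pass box
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = img - low  # residual keeps edges and boundary detail
    return low, high

low, high = frequency_split(np.random.rand(128, 128))
print(low.shape, high.shape)
```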

Result: Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods with superior performance and achieves remarkable generalization capability.

Conclusion: FreqDINO effectively addresses ultrasound-specific boundary degradation by leveraging frequency domain analysis, enhancing boundary perception and structural consistency for improved ultrasound image segmentation.

Abstract: Ultrasound image segmentation is pivotal for clinical diagnosis, yet challenged by speckle noise and imaging artifacts. Recently, DINOv3 has shown remarkable promise in medical image segmentation with its powerful representation capabilities. However, DINOv3, pre-trained on natural images, lacks sensitivity to ultrasound-specific boundary degradation. To address this limitation, we propose FreqDINO, a frequency-guided segmentation framework that enhances boundary perception and structural consistency. Specifically, we devise a Multi-scale Frequency Extraction and Alignment (MFEA) strategy to separate low-frequency structures and multi-scale high-frequency boundary details, and align them via learnable attention. We also introduce a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features. Furthermore, we design a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods with superior performance and achieves remarkable generalization capability. The code is at https://github.com/MingLang-FD/FreqDINO.

[98] UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

Hewen Pan, Cong Wei, Dashuang Liang, Zepeng Huang, Pengfei Gao, Ziqi Zhou, Lulu Xue, Pengfei Yan, Xiaoming Wei, Minghui Li, Shengshan Hu

Main category: cs.CV

TL;DR: UFVideo is the first Video LLM with unified multi-grained cooperative understanding capabilities that can handle video understanding across global, pixel, and temporal scales within a single model, outperforming GPT-4o on specialized benchmarks.

DetailsMotivation: Existing Video LLMs are limited to specialized video understanding tasks and fail to achieve comprehensive, multi-grained video perception. There's a gap in models that can handle video understanding across different scales (global, pixel, temporal) in a unified manner.

Method: The authors introduce UFVideo with unified visual-language guided alignment that dynamically encodes visual and text inputs for different tasks and generates textual responses, temporal localization, or grounded masks. They also construct UFVideo-Bench with three distinct collaborative tasks across different scales.

Result: UFVideo demonstrates flexibility and advantages over GPT-4o on the UFVideo-Bench. The model is validated across 9 public benchmarks covering various common video understanding tasks, showing effectiveness in multi-grained video understanding.

Conclusion: UFVideo represents a significant advancement in Video LLMs by achieving unified multi-grained cooperative understanding capabilities, providing valuable insights for future Video LLM development and bridging the gap in comprehensive video perception.

Abstract: With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo’s flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.

[99] Task-Specific Distance Correlation Matching for Few-Shot Action Recognition

Fei Long, Yao Zhang, Jiaming Lv, Jiangtao Xie, Peihua Li

Main category: cs.CV

TL;DR: TS-FSAR improves few-shot action recognition by introducing a Ladder Side Network for efficient CLIP fine-tuning, Task-Specific Distance Correlation Matching for capturing complex inter-frame dependencies, and a regularization module to improve training under limited data.

DetailsMotivation: Existing few-shot action recognition methods have two key limitations: (1) set matching metrics rely on cosine similarity and instance-level information only, failing to capture complex nonlinear relationships and task-specific cues; (2) efficient CLIP adaptation via skip-fusion layers (side layers) is difficult to optimize under limited data conditions.

Method: TS-FSAR framework with three components: (1) Visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) Task-Specific Distance Correlation Matching (TS-DCM) using α-distance correlation to model both linear and nonlinear inter-frame dependencies with task prototypes; (3) Guiding LSN with Adapted CLIP (GLAC) module to regularize LSN using adapted frozen CLIP for better α-distance correlation estimation.
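
For intuition, the snippet below computes plain distance correlation between two sets of frame features, which captures nonlinear as well as linear dependence; TS-DCM's α-generalization and task conditioning are not reproduced here.

```python
# Distance correlation between two (n_frames, dim) feature matrices.
import numpy as np

def distance_correlation(x, y):
    def centered(a):
        d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(-1))  # pairwise distances
        return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = max((A * B).mean(), 0.0)  # guard against tiny negative values
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / (dvar + 1e-12))

f1, f2 = np.random.randn(8, 512), np.random.randn(8, 512)  # two 8-frame clips
print(distance_correlation(f1, f2))
```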

Result: Extensive experiments on five widely-used benchmarks demonstrate superior performance compared to prior state-of-the-art methods.

Conclusion: TS-FSAR effectively addresses limitations in existing few-shot action recognition by improving CLIP adaptation efficiency and capturing more complex inter-frame relationships through novel distance correlation matching and regularization techniques.

Abstract: Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work performing fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses α-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better α-distance correlation estimation under limited supervision. Extensive experiments on five widely-used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-art methods.

[100] Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture

Tanu Singh, Pranamesh Chakraborty, Long T. Truong

Main category: cs.CV

TL;DR: Proposes a transformer-based accident detection model using a comprehensive dataset and motion cues, achieving 88.3% accuracy with RGB+optical flow features.

DetailsMotivation: Road traffic accidents are a major global mortality cause with rising rates. Traditional computer vision methods have limitations in spatiotemporal understanding and cross-domain generalization. Transformer architectures show promise but are hindered by small, non-diverse datasets. Motion cues are often neglected in existing approaches.

Method: Curated a comprehensive balanced dataset covering diverse traffic environments and accident types. Proposed transformer architecture using pre-extracted spatial video features with convolutional layers for local correlations and transformers for sequential-temporal dependencies. Evaluated multiple methods for incorporating motion cues, with RGB+optical flow concatenation performing best.
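
The best-performing input strategy (concatenating RGB features with optical-flow features) is straightforward to sketch; the dimensions, mean pooling, and two-layer encoder below are illustrative stand-ins for the paper's architecture.

```python
# Fuse per-frame RGB and optical-flow features, then classify with a transformer.
import torch
import torch.nn as nn

rgb_feats = torch.rand(2, 32, 768)   # (batch, frames, feature dim)
flow_feats = torch.rand(2, 32, 256)

x = torch.cat([rgb_feats, flow_feats], dim=-1)  # per-frame concatenation
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2)
logits = nn.Linear(1024, 2)(encoder(x).mean(dim=1))  # accident vs. normal
print(logits.shape)  # torch.Size([2, 2])
```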

Result: Concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. Results were compared with vision language models (GPT, Gemini, LLaVA-NeXT-Video) to assess effectiveness.

Conclusion: The proposed transformer-based approach with comprehensive dataset and motion cue integration effectively addresses limitations of traditional methods, demonstrating superior performance for automated traffic accident detection.

Abstract: Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.

[101] A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection

Qinghan Hu, Haijiang Zhu, Na Sun, Lei Chen, Zhengqiang Fan, Zhiqing Li

Main category: cs.CV

TL;DR: Developed a multi-mode underwater structured light 3D imaging system (UW-SLD) with multi-source information fusion for autonomous pipeline defect detection, achieving superior accuracy and robustness.

DetailsMotivation: Underwater pipelines are highly susceptible to corrosion, posing safety risks and shortening service life. Intelligent real-time imaging systems are needed as more reliable alternatives to manual inspection for precise defect characterization.

Method: 1) Rapid distortion correction (FDC) for underwater image rectification; 2) Factor graph-based parameter optimization for sensor calibration; 3) Multi-mode 3D imaging strategy for pipeline geometric variability; 4) Multi-source information fusion with adaptive extended Kalman filter (AEKF) for stable pose estimation; 5) Edge detection-based ICP (ED-ICP) algorithm combining pipeline edge detection network with enhanced point cloud registration.
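
As a toy illustration of adaptive filtering for pose fusion, the update below inflates the measurement-noise estimate from recent innovations, a common adaptation heuristic; the paper's actual AEKF design is not reproduced here.

```python
# One predict/update cycle with innovation-based adaptation of R.
import numpy as np

def adaptive_kf_step(x, P, z, F, H, Q, R, alpha=0.3):
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    innov = z - H @ x_pred
    R = (1 - alpha) * R + alpha * np.outer(innov, innov)  # adapt noise estimate
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innov
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, R

x, P = np.zeros(2), np.eye(2)                # toy constant-velocity state
F_mat = np.array([[1.0, 1.0], [0.0, 1.0]])
H_mat = np.array([[1.0, 0.0]])
x, P, R = adaptive_kf_step(x, P, np.array([0.4]), F_mat, H_mat,
                           0.01 * np.eye(2), np.array([[0.1]]))
```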

Result: Extensive experiments under different operation modes, velocities, and depths demonstrate the system achieves superior accuracy, adaptability, and robustness for autonomous underwater pipeline detection.

Conclusion: The developed UW-SLD system provides a solid foundation for autonomous underwater pipeline detection by overcoming environmental challenges through multi-source information fusion and adaptive algorithms, enabling robust high-fidelity reconstruction of defect structures even under variable motion conditions.

Abstract: Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, the intelligent real-time imaging system for underwater pipeline detection has become a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can restore sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a rapid distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed. This algorithm integrates a pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability, and robustness, providing a solid foundation for autonomous underwater pipeline detection.

[102] Flowception: Temporally Expansive Flow Matching for Video Generation

Tariq Berrada Ifriqi, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen

Main category: cs.CV

TL;DR: Flowception is a non-autoregressive video generation framework that combines discrete frame insertions with continuous denoising, offering improved efficiency and quality over autoregressive and full-sequence methods.

DetailsMotivation: To address limitations of existing video generation approaches: autoregressive methods suffer from error accumulation/drift, while full-sequence flows are computationally expensive and less flexible for variable-length video generation.

Method: Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. This hybrid approach allows joint learning of video length and content, reduces computational costs, and enables local attention mechanisms.
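
As a rough, runnable illustration of interleaving discrete insertions with continuous denoising (not the paper's learned probability path), the toy sampler below grows the clip while a shared noise level anneals from 1 to 0; `denoise_step`, the fixed insertion schedule, and all sizes are our stand-ins.

```python
import torch

def toy_flowception_sample(denoise_step, t_steps=48, insert_every=6, max_len=16):
    """Toy interleaving of discrete frame insertion with continuous denoising:
    the clip grows while the shared noise level t anneals from 1 to 0.
    `denoise_step(frames, t)` is any flow/denoising step returning (T, C, H, W)."""
    frames = torch.randn(1, 3, 64, 64)            # (T, C, H, W); T grows over time
    for i in range(t_steps):
        t = 1.0 - i / t_steps                     # noise level from 1 -> 0
        if i % insert_every == 0 and len(frames) < max_len:
            k = len(frames) // 2                  # insert a fresh noisy frame mid-clip
            new = torch.randn(1, 3, 64, 64)
            frames = torch.cat([frames[:k], new, frames[k:]])
        frames = denoise_step(frames, t)          # continuous denoising step
    return frames
```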

Result: Quantitative results show improved FVD and VBench metrics over autoregressive and full-sequence baselines. The method reduces training FLOPs three-fold compared to full-sequence flows.

Conclusion: Flowception provides an efficient and flexible framework for variable-length video generation that can handle long-term context, integrates multiple tasks (image-to-video, interpolation), and outperforms existing approaches in both quality and efficiency.

Abstract: We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift, as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces training FLOPs three-fold, while also being more amenable to local attention variants and allowing the length of videos to be learned jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.

[103] Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video

Meng-Li Shih, Ying-Huan Chen, Yu-Lun Liu, Brian Curless

Main category: cs.CV

TL;DR: A fully automatic pipeline for dynamic scene reconstruction from monocular RGB videos that enhances priors for Dynamic Gaussian Splatting using video segmentation, epipolar-error maps, and multiple novel losses to improve reconstruction quality.

DetailsMotivation: To create a fully automatic system for reconstructing dynamic scenes from casually captured monocular RGB videos without requiring specialized scene representations, focusing on improving reconstruction quality through better priors.

Method: Combines video segmentation with epipolar-error maps to generate object-level masks, uses these masks to guide object-depth loss and skeleton-based sampling with mask-guided re-identification for 2D tracks, and introduces virtual-view depth loss and scaffold-projection loss to embed refined priors in reconstruction.
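
A minimal sketch of the epipolar-error idea, assuming a fundamental matrix F between two frames: correspondences on independently moving objects violate the epipolar constraint and receive large Sampson errors, which an epipolar-error map can threshold into motion masks.

```python
import numpy as np

def sampson_epipolar_error(F, pts1, pts2):
    """Sampson distance of correspondences (pts1, pts2), each (N, 2), under
    fundamental matrix F (3, 3). Points moving independently of the camera
    tend to get large errors; thresholding them yields dynamic-object masks."""
    x1 = np.c_[pts1, np.ones(len(pts1))]    # to homogeneous coords, (N, 3)
    x2 = np.c_[pts2, np.ones(len(pts2))]
    Fx1 = x1 @ F.T                           # epipolar lines of pts1 in image 2
    Ftx2 = x2 @ F                            # epipolar lines of pts2 in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2      # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / np.maximum(den, 1e-12)
```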

Result: The system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.

Conclusion: The proposed pipeline successfully enhances Dynamic Gaussian Splatting priors through automated segmentation and multiple novel loss functions, achieving state-of-the-art dynamic scene reconstruction from monocular videos.

Abstract: We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.

[104] Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation

Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke

Main category: cs.CV

TL;DR: Skeleton-Cache is a training-free test-time adaptation framework for skeleton-based zero-shot action recognition that uses a retrieval-based cache with structured skeleton representations and LLM-guided semantic priors.

DetailsMotivation: To improve model generalization to unseen actions during inference without requiring additional training or access to training data, addressing the challenge of zero-shot action recognition.

Method: Reformulates inference as lightweight retrieval over a non-parametric cache storing structured skeleton representations (global + local descriptors), using LLMs to assign class-specific importance weights for descriptor fusion.
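
A minimal sketch of the retrieval-style inference under our own reading of the method: each descriptor queries the cache independently, and the per-descriptor predictions are fused with class-specific weights, assumed here to be precomputed by an LLM.

```python
import torch
import torch.nn.functional as F

def cache_predict(query_desc, cache_keys, cache_labels, llm_weights, n_cls, beta=8.0):
    """Training-free cache lookup: one similarity-based prediction per descriptor,
    fused with per-class importance weights. query_desc[d]: (D,), cache_keys[d]:
    (N, D), cache_labels: (N,) long class ids, llm_weights: (n_desc, n_cls)."""
    logits = torch.zeros(n_cls)
    one_hot = F.one_hot(cache_labels, n_cls).float()         # (N, n_cls)
    for d, (q, K) in enumerate(zip(query_desc, cache_keys)):
        sim = F.cosine_similarity(q[None, :], K, dim=1)       # (N,)
        affinity = torch.exp(-beta * (1.0 - sim))             # sharpened affinity
        pred_d = affinity @ one_hot                           # per-descriptor votes
        logits += llm_weights[d] * pred_d                     # class-specific fusion
    return logits.argmax().item()
```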

Result: Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II show consistent performance boosts for various SZAR backbones in both zero-shot and generalized zero-shot settings.

Conclusion: Skeleton-Cache effectively adapts to unseen actions at test time without training, demonstrating the value of structured skeleton representations combined with LLM semantic reasoning for zero-shot action recognition.

Abstract: We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.

[105] Reliable Detection of Minute Targets in High-Resolution Aerial Imagery across Temporal Shifts

Mohammad Sadegh Gholizadeh, Amir Arsalan Rezapour, Hamidreza Shayegh, Ehsan Pazouki

Main category: cs.CV

TL;DR: Transfer learning with Faster R-CNN enables efficient rice seedling detection in UAV imagery, showing robust performance across different temporal conditions.

DetailsMotivation: Crop detection via UAVs is challenging due to small target sizes and environmental variability, requiring efficient solutions for precision agriculture scaling.

Method: Uses Faster R-CNN architecture with transfer learning initialization, trained on a curated UAV dataset, and evaluated across three distinct temporal test sets.
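
For illustration, a standard torchvision recipe for this kind of transfer learning; the paper does not state its exact backbone, so the ResNet-50 FPN variant and the 2-class head (background + seedling) are our assumptions.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a COCO-pretrained Faster R-CNN and swap the box head for a
# 2-class problem; fine-tuning from these weights is the usual
# transfer-learning setup for small-object detectors.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
```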

Result: Transfer learning enables rapid model convergence and maintains consistent performance despite domain shifts in image acquisition conditions.

Conclusion: Transfer learning with Faster R-CNN is effective for small-scale crop detection in agricultural UAV imagery, demonstrating robustness to temporal variations.

Abstract: Efficient crop detection via Unmanned Aerial Vehicles is critical for scaling precision agriculture, yet it remains challenging due to the small scale of targets and environmental variability. This paper addresses the detection of rice seedlings in paddy fields by leveraging a Faster R-CNN architecture initialized via transfer learning. To overcome the specific difficulties of detecting minute objects in high-resolution aerial imagery, we curate a significant UAV dataset for training and rigorously evaluate the model’s generalization capabilities. Specifically, we validate performance across three distinct test sets acquired at different temporal intervals, thereby assessing robustness against varying imaging conditions. Our empirical results demonstrate that transfer learning not only facilitates the rapid convergence of object detection models in agricultural contexts but also yields consistent performance despite domain shifts in image acquisition.

[106] Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang

Main category: cs.CV

TL;DR: MetaCanvas is a lightweight framework that enables multimodal LLMs to reason and plan directly in spatial/spatiotemporal latent spaces for precise control over image/video generation with diffusion models.

DetailsMotivation: Current multimodal LLMs have strong reasoning capabilities for understanding complex scenes but are reduced to simple text encoders in generation tasks, creating a gap between multimodal understanding and precise generation control.

Method: MetaCanvas lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces, tightly interfacing with diffusion generators as latent-space planners rather than just global text encoders.

Result: MetaCanvas outperforms global-conditioning baselines across six tasks including text-to-image, text/image-to-video generation, editing, and in-context video generation, demonstrating improved layout precision, attribute binding, and reasoning-intensive control.

Conclusion: Treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation, enabling more precise and structured control in visual content creation.

Abstract: Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.

[107] Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection

Kuan Wang, Yanjun Qin, Mengge Lu, Liejun Wang, Xiaoming Tao

Main category: cs.CV

TL;DR: ARNet-v2 proposes a novel architecture for Camouflaged Object Detection with Channel Information Interaction Module and collaborative decoding guided by boundary/region priors to address cross-channel interaction and boundary-region co-modeling limitations.

DetailsMotivation: Current COD methods have two critical issues: 1) insufficient cross-channel information interaction within same-layer features limiting feature expressiveness, and 2) inability to effectively co-model boundary and region information, making accurate reconstruction of complete regions and sharp boundaries difficult.

Method: 1) Channel Information Interaction Module (CIIM) with horizontal-vertical integration mechanism for cross-channel feature reorganization and interaction. 2) Collaborative decoding architecture with Boundary Extraction (BE) and Region Extraction (RE) modules to generate boundary priors and object localization maps, then using hybrid attention to collaboratively calibrate decoded features. 3) Multi-scale Enhancement (MSE) module to enrich contextual feature representations.
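
The CIIM internals are not given in the abstract; the module below is a hypothetical reading of "horizontal-vertical integration in the channel dimension": the C channels are viewed as a grid and mixed along both axes, with a residual connection. All layer choices are ours.

```python
import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Hypothetical horizontal-vertical channel interaction: view the C
    channels as a (groups x C//groups) grid and mix along both axes."""
    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        self.row_mix = nn.Linear(channels // groups, channels // groups)
        self.col_mix = nn.Linear(groups, groups)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        y = x.permute(0, 2, 3, 1).reshape(B, H, W, self.g, C // self.g)
        y = self.row_mix(y)                     # horizontal: within each group
        y = self.col_mix(y.transpose(-1, -2)).transpose(-1, -2)  # vertical: across groups
        y = y.reshape(B, H, W, C).permute(0, 3, 1, 2)
        return x + y                            # residual connection
```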

Result: Extensive experiments on four COD benchmark datasets validate state-of-the-art performance. The model also demonstrates strong adaptability across downstream tasks including Salient Object Detection, polyp segmentation, transparent object detection, and industrial/road defect detection.

Conclusion: The proposed ARNet-v2 effectively addresses key limitations in current COD methods through innovative channel interaction and collaborative boundary-region modeling, achieving superior performance on COD tasks while demonstrating strong transferability to various downstream applications.

Abstract: Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: https://github.com/akuan1234/ARNet-v2.

[108] Out-of-Distribution Segmentation via Wasserstein-Based Evidential Uncertainty

Arnold Brosch, Abdelrahman Eldesokey, Michael Felsberg, Kira Maag

Main category: cs.CV

TL;DR: A novel evidence segmentation framework using Wasserstein loss improves out-of-distribution object recognition and segmentation for safety-critical applications like autonomous driving.

DetailsMotivation: Deep neural networks for semantic segmentation fail when encountering unknown objects in open-world scenarios, which is critical for safety applications like automated driving where recognizing out-of-distribution objects is essential.

Method: Proposes an evidence segmentation framework using Wasserstein loss to capture distributional distances while respecting probability simplex geometry, combined with Kullback-Leibler regularization and Dice structural consistency terms.
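
For context, a minimal evidential head with the standard Dirichlet KL regularizer, shown per-sample for brevity; the paper's Wasserstein data term and Dice consistency term are not reproduced here.

```python
import torch
import torch.nn.functional as F

def evidential_kl_reg(logits, target, n_cls):
    """KL(Dir(alpha_tilde) || Dir(1)) from standard evidential deep learning:
    pushes evidence for wrong classes toward zero. logits: (B, n_cls),
    target: (B,) long class ids."""
    evidence = F.softplus(logits)                        # non-negative evidence
    alpha = evidence + 1.0                               # Dirichlet parameters
    y = F.one_hot(target, n_cls).float()
    alpha_t = y + (1.0 - y) * alpha                      # keep only wrong-class evidence
    s = alpha_t.sum(-1, keepdim=True)
    kl = (torch.lgamma(s.squeeze(-1)) - torch.lgamma(alpha_t).sum(-1)
          + ((alpha_t - 1.0) * (torch.digamma(alpha_t) - torch.digamma(s))).sum(-1)
          - torch.lgamma(torch.tensor(float(n_cls))))
    return kl.mean()
```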

Result: The approach leads to improved out-of-distribution segmentation performance compared to uncertainty-based approaches.

Conclusion: The evidence segmentation framework with Wasserstein loss effectively addresses the limitation of predefined class sets in semantic segmentation, enhancing safety in open-world applications by better recognizing unknown objects.

Abstract: Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open-world scenarios. Recognizing and segmenting these out-of-distribution (OOD) objects is crucial for safety-critical applications such as automated driving. In this work, we present an evidence segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback-Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty-based approaches.

[109] The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

Main category: cs.CV

TL;DR: The paper introduces the N-Body Problem: given one egocentric video, determine how N individuals could parallelize the observed tasks while respecting real-world constraints, and proposes a structured prompting approach for VLMs to generate feasible parallel executions.

DetailsMotivation: Humans can intuitively parallelize complex activities, but current models struggle to learn this from observing a single person. The challenge is to maximize speed-up while avoiding physically impossible scenarios like object conflicts or spatial collisions that occur with naive parallelization approaches.

Method: Formalizes the N-Body Problem and proposes evaluation metrics for both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts, causal constraints). Introduces a structured prompting strategy that guides Vision-Language Models to reason about 3D environment, object usage, and temporal dependencies to produce viable parallel executions.
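
A minimal sketch of two of the proposed metric families, with hypothetical data formats: object conflicts count temporally overlapping segments that need the same object across workers, and speed-up compares the solo duration to the parallel makespan.

```python
def object_conflicts(assignment, uses):
    """Count pairs of segments from different workers that overlap in time
    and need a common object. assignment[worker] is a list of
    (start, end, action_id); uses maps action_id -> set of object names
    (both formats are hypothetical)."""
    events = [(w, s, e, uses[a])
              for w, acts in assignment.items() for (s, e, a) in acts]
    conflicts = 0
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            w1, s1, e1, o1 = events[i]
            w2, s2, e2, o2 = events[j]
            if w1 != w2 and s1 < e2 and s2 < e1 and o1 & o2:
                conflicts += 1
    return conflicts

def speed_up(assignment, solo_duration):
    """Speed-up = single-person duration / parallel makespan."""
    makespan = max(e for acts in assignment.values() for (_, e, _) in acts)
    return solo_duration / makespan
```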

Result: On 100 videos from EPIC-Kitchens and HD-EPIC, the method for N=2 boosts action coverage by 45% over baseline prompts for Gemini 2.5 Pro, while reducing collision rates by 55%, object conflicts by 45%, and causal conflicts by 55%.

Conclusion: The proposed structured prompting approach enables VLMs to generate feasible parallel task executions from single egocentric videos, significantly improving both performance metrics and constraint satisfaction compared to naive methods.

Abstract: Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively.

[110] FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

Yilei Jiang, Zhen Wang, Yanghao Wang, Jun Yu, Yueting Zhuang, Jun Xiao, Long Chen

Main category: cs.CV

TL;DR: FlowDC is a novel text-based image editing method that handles complex editing with multiple targets by decoupling them into parallel sub-editing effects and decomposing velocity to preserve source structure.

DetailsMotivation: Current text-to-image editing models excel at simple editing with single targets but struggle with complex editing containing multiple targets. Existing solutions (single-round and multi-round editing) face issues with long text following and cumulative inconsistency, failing to balance semantic alignment and source consistency.

Method: FlowDC decouples complex editing into multiple sub-editing effects and superposes them in parallel during editing. It also decomposes velocity into components and decays the orthogonal part to better preserve source structure.
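
The velocity-decay step admits a compact sketch: project the velocity onto the editing displacement, keep that component, and shrink the orthogonal remainder. The decay factor below is a placeholder, not the paper's value.

```python
import torch

def decay_orthogonal_velocity(v, d, decay=0.3, eps=1e-8):
    """Split the flow velocity v into the component along the editing
    displacement d and the orthogonal remainder, then decay the latter to
    preserve source structure. v, d: flattened 1-D tensors."""
    d_hat = d / (d.norm() + eps)
    v_par = (v * d_hat).sum() * d_hat       # projection onto the edit direction
    v_orth = v - v_par                      # structure-harming remainder
    return v_par + decay * v_orth
```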

Result: FlowDC shows superior performance on two benchmarks, including the newly constructed Complex-PIE-Bench for complex editing evaluation. Ablation studies confirm the effectiveness of the proposed module designs.

Conclusion: FlowDC effectively addresses complex text-based image editing by parallel decoupling of multiple editing targets and velocity decomposition, achieving better balance between semantic alignment and source consistency than existing methods.

Abstract: With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has gained remarkable improvement, especially for simple editing that only contains a single editing target. To satisfy the exploding editing requirements, complex editing, which contains multiple editing targets, has posed a more challenging task. However, current complex editing solutions, single-round and multi-round editing, are limited by long-text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose FlowDC, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observe that the velocity component orthogonal to the editing displacement harms source-structure preservation. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate the effectiveness of complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.

[111] Multi-temporal Calving Front Segmentation

Marcel Dreier, Nora Gourmelon, Dakota Pyles, Fei Wu, Matthias Braun, Thorsten Seehaus, Andreas Maier, Vincent Christlein

Main category: cs.CV

TL;DR: Proposes a multi-frame temporal information exchange approach for calving front delineation in SAR imagery to address seasonal condition challenges, achieving SOTA performance on CaFFe benchmark.

DetailsMotivation: Current deep learning models for calving front delineation in SAR imagery struggle with seasonal conditions like ice mélange or snow-covered surfaces, which affects accuracy and requires improved temporal consistency.

Method: Process multiple frames from satellite image time series in parallel and exchange temporal information between corresponding feature maps to stabilize predictions, integrated into the Tyrion architecture.
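
One plausible realization of exchanging temporal information between corresponding feature maps, assuming nothing about Tyrion's internals: per spatial location, the T frames of the time series attend to each other, and the result is added back residually.

```python
import torch
import torch.nn as nn

class TemporalExchange(nn.Module):
    """Sketch of cross-frame information exchange for a glacier time series:
    self-attention over time, applied independently per spatial cell.
    `dim` must be divisible by `heads`."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                   # feats: (B, T, C, H, W)
        B, T, C, H, W = feats.shape
        x = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        y, _ = self.attn(x, x, x)               # each frame attends to all others
        y = y.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)
        return feats + y                        # residual exchange
```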

Result: Achieves new state-of-the-art performance on CaFFe benchmark: Mean Distance Error of 184.4 m and mean Intersection over Union of 83.6.

Conclusion: Temporal information exchange across multiple frames improves calving front delineation accuracy and robustness to seasonal variations, advancing automated glacier monitoring capabilities.

Abstract: The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier’s mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6.

[112] Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection

Qishan Wang, Haofeng Wang, Shuyong Gao, Jia Guo, Li Xiong, Jiaqi Li, Dengxuan Bai, Wenqiang Zhang

Main category: cs.CV

TL;DR: CRR framework transforms reconstruction to repairation for multi-class anomaly detection, addressing identity mapping issues by repairing synthesized anomalies and using feature-level masking to achieve SOTA performance.

DetailsMotivation: To address the identity mapping problem in conventional reconstruction-based networks for multi-class industrial anomaly detection, where models directly copy input features regardless of normality, leading to detection failures.

Method: Collaborative Reconstruction and Repair (CRR) framework: 1) Optimizes decoder to reconstruct normal samples while repairing synthesized anomalies, 2) Implements feature-level random masking for sufficient local information, 3) Trains segmentation network supervised by synthetic anomaly masks to minimize detection errors.

Result: Extensive experiments show CRR effectively mitigates identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.

Conclusion: CRR successfully transforms reconstruction to repairation, addressing fundamental limitations of conventional approaches and demonstrating superior performance in challenging multi-class industrial anomaly detection settings.

Abstract: Industrial anomaly detection is a challenging open-set task that aims to identify unknown anomalous patterns deviating from normal data distribution. To avoid the significant memory consumption and limited generalizability brought by building separate models per class, we focus on developing a unified framework for multi-class anomaly detection. However, under this challenging setting, conventional reconstruction-based networks often suffer from an identity mapping problem, where they directly replicate input features regardless of whether they are normal or anomalous, resulting in detection failures. To address this issue, this study proposes a novel framework termed Collaborative Reconstruction and Repair (CRR), which transforms the reconstruction to repairation. First, we optimize the decoder to reconstruct normal samples while repairing synthesized anomalies. Consequently, it generates distinct representations for anomalous regions and similar representations for normal areas compared to the encoder’s output. Second, we implement feature-level random masking to ensure that the representations from the decoder contain sufficient local information. Finally, to minimize detection errors arising from the discrepancies between feature representations from the encoder and decoder, we train a segmentation network supervised by synthetic anomaly masks, thereby enhancing localization performance. Extensive experiments on industrial datasets demonstrate that CRR effectively mitigates the identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.

[113] JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He

Main category: cs.CV

TL;DR: JoyAvatar: A real-time audio-driven avatar generation model using autoregressive diffusion with progressive step bootstrapping, motion condition injection, and unbounded RoPE for infinite-length video generation.

DetailsMotivation: Existing DiT-based audio-driven avatar methods have high computational overhead and cannot synthesize long videos. Autoregressive methods exist but suffer from error accumulation and quality degradation.

Method: Progressive Step Bootstrapping (PSB) allocates more denoising steps to initial frames; Motion Condition Injection (MCI) uses noise-corrupted previous frames as motion condition; Unbounded RoPE via Cache-Resetting (URCR) enables infinite-length generation.
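
A toy version of the step-allocation idea behind PSB, as we read it: front-load denoising steps on the first frames and decay to a cheap steady state. All constants are illustrative, not the paper's.

```python
def progressive_step_schedule(n_frames, base_steps=4, first_steps=16, warmup=4):
    """Spend more denoising steps on the first frames (to stabilize the
    clip and reduce error accumulation), then decay linearly to a cheap
    per-frame budget."""
    steps = []
    for i in range(n_frames):
        if i < warmup:
            frac = i / max(warmup - 1, 1)
            steps.append(round(first_steps - frac * (first_steps - base_steps)))
        else:
            steps.append(base_steps)
    return steps

# e.g. progressive_step_schedule(8) -> [16, 12, 8, 4, 4, 4, 4, 4]
```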

Result: The 1.3B-parameter causal model achieves 16 FPS on a single GPU with competitive results in visual quality, temporal consistency, and lip synchronization.

Conclusion: JoyAvatar enables real-time inference and infinite-length video generation for audio-driven avatars while addressing error accumulation and quality degradation issues in existing methods.

Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

[114] YawDD+: Frame-level Annotations for Accurate Yawn Prediction

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

Main category: cs.CV

TL;DR: Enhanced yawning detection for driver fatigue monitoring using improved dataset labeling and achieving high accuracy on edge hardware.

DetailsMotivation: Driver fatigue causes 24% of road accidents, with yawning as an early indicator. Existing ML approaches suffer from noisy video annotations that limit accuracy.

Method: Developed semi-automated labeling pipeline with human-in-the-loop verification to create YawDD+ dataset. Trained MNasNet classifier and YOLOv11 detector on this improved dataset.

Result: Achieved 99.34% classification accuracy and 95.69% detection mAP, improving frame accuracy by up to 6% and mAP by 5% over video-level supervision. Runs at up to 59.8 FPS on NVIDIA Jetson Nano edge hardware.

Conclusion: Enhanced data quality enables accurate on-device yawning monitoring without server computation, providing practical solution for driver fatigue detection.

Abstract: Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach delivers up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.

[115] DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation

Mohamed Abdelsamad, Michael Ulrich, Bin Yang, Miao Zhang, Yakov Miron, Abhinav Valada

Main category: cs.CV

TL;DR: DOS is a self-supervised learning framework for 3D point clouds that distills semantic relevance softmaps only at observable points, using Zipfian prototypes and Zipf-Sinkhorn algorithm to address unbalanced semantics, achieving state-of-the-art performance across multiple benchmarks.

DetailsMotivation: SSL for 3D point clouds faces challenges: irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. Current methods suffer from information leakage from masked regions and lack rich supervision beyond discrete token assignments.

Method: DOS self-distills semantic relevance softmaps only at observable (unmasked) points to prevent information leakage. Introduces Zipfian prototypes with a power-law prior and Zipf-Sinkhorn algorithm to handle unbalanced semantics in unsupervised settings.
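
A minimal numpy reading of Zipf-Sinkhorn under our assumptions: plain Sinkhorn-Knopp on token-to-prototype scores, but with a power-law column marginal in place of the usual uniform one, so head prototypes absorb more mass than tail ones. The paper's exact variant may differ.

```python
import numpy as np

def zipf_sinkhorn(scores, s=1.0, n_iter=20, eps=0.05):
    """Sinkhorn-Knopp over token-to-prototype scores (N, K) with a Zipfian
    column marginal p_k ~ k^(-s) enforcing a power-law prior over prototype
    usage. Returns row-normalized soft targets per token."""
    N, K = scores.shape
    col = np.arange(1, K + 1, dtype=np.float64) ** (-s)
    col /= col.sum()                            # power-law prototype usage
    row = np.full(N, 1.0 / N)                   # uniform token marginal
    P = np.exp((scores - scores.max()) / eps)   # stabilized kernel
    for _ in range(n_iter):
        P *= (col / P.sum(axis=0))[None, :]     # match prototype marginals
        P *= (row / P.sum(axis=1))[:, None]     # match per-token marginals
    return P / P.sum(axis=1, keepdims=True)
```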

Result: Outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200 benchmarks without extra data or annotations.

Conclusion: Observable-point softmaps distillation provides a scalable and effective paradigm for learning robust 3D representations, addressing key challenges in 3D SSL through careful design of distillation targets and prototype handling.

Abstract: Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.

[116] CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

Weijian Ma, Shizhao Sun, Ruiyu Wang, Jiang Bian

Main category: cs.CV

TL;DR: CADMorph is a framework for geometry-driven parametric CAD editing that uses pretrained foundation models to synchronize geometric shape changes with underlying parametric sequences while preserving structure, ensuring validity, and maintaining shape fidelity.

DetailsMotivation: When editing CAD models, geometric shape adjustments require synchronized edits to the underlying parametric construction sequences. This geometry-driven parametric editing must preserve the original sequence structure, ensure semantic validity of edits, and maintain high shape fidelity to target shapes, all with scarce training data.

Method: CADMorph uses an iterative plan-generate-verify framework with two pretrained domain-specific foundation models: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. Planning uses P2S cross-attention maps to identify segments needing modification. Generation uses MPP to infill masks with valid edits. Verification uses P2S to embed candidates in shape-latent space and select the closest to target shape.
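
The plan-generate-verify loop reduces to a short skeleton; `p2s`, `mpp`, and their methods below are hypothetical stand-ins for the two pretrained models, named only for illustration.

```python
# Skeleton of the plan-generate-verify loop (hypothetical interfaces).
def cadmorph_edit(seq, target_shape, p2s, mpp, n_rounds=3, n_candidates=8):
    best = seq
    for _ in range(n_rounds):
        # Plan: cross-attention maps locate segments that need editing.
        mask = p2s.attention_edit_mask(best, target_shape)
        # Generate: masked-parameter prediction infills valid edits.
        candidates = [mpp.infill(best, mask) for _ in range(n_candidates)]
        # Verify: keep the candidate whose shape latent is closest to the target.
        z_t = p2s.encode_shape(target_shape)
        best = min(candidates,
                   key=lambda c: ((p2s.encode_seq(c) - z_t) ** 2).sum())
    return best
```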

Result: CADMorph surpasses GPT-4o and specialized CAD baselines, supports downstream applications like iterative editing and reverse-engineering enhancement, and works without requiring scarce editing triplet data for training.

Conclusion: The framework successfully addresses the three key challenges of geometry-driven parametric CAD editing by leveraging pretrained priors’ geometric consciousness and design knowledge, while bypassing data-scarcity bottlenecks through foundation models trained without triplet data.

Abstract: A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence’s structure, 2) ensuring each edit’s semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing-triplet data. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Moreover, both the P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.

[117] Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints

Kai Yao, Marc Juarez

Main category: cs.CV

TL;DR: First systematic security evaluation of model fingerprint detection for AI-generated images reveals significant vulnerabilities to adversarial attacks, with removal attacks being highly effective and forgery attacks varying in success.

DetailsMotivation: While model fingerprint detection techniques show promise for attributing AI-generated images to their source models, their robustness under adversarial conditions remains largely unexplored, creating a critical gap in understanding their security.

Method: Formalized threat models covering white- and black-box access with two attack goals (fingerprint removal and forgery). Implemented five attack strategies and evaluated 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators.

Result: Significant gap between clean and adversarial performance: removal attacks achieve >80% success in white-box and >50% in black-box settings; forgery success varies across models. Methods with highest attribution accuracy are often most vulnerable, revealing a utility-robustness trade-off.

Conclusion: No existing technique achieves both high robustness and accuracy across all threat models, highlighting the need for balanced approaches and identifying promising directions for advancing robust fingerprint detection.

Abstract: Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal.

[118] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg

Main category: cs.CV

TL;DR: VLM2GeoVec is a single-encoder vision-language model that unifies multimodal remote sensing tasks by embedding interleaved inputs (images, text, bounding boxes, coordinates) into one joint vector space, enabling both scalable retrieval and region-level reasoning.

DetailsMotivation: Current remote sensing approaches are fragmented: dual-encoder models excel at cross-modal search but can't interleave modalities, while generative assistants support region-level interpretation but lack scalable retrieval. Satellite imagery's unique characteristics (aerial view, high resolution, scale variations, small objects) require both region-level spatial reasoning and holistic scene understanding.

Method: Proposes VLM2GeoVec, an instruction-following single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, geographic coordinates) in a unified vector space. Uses a single encoder that interleaves all inputs into one joint embedding trained with contrastive loss, eliminating multi-stage pipelines and task-specific modules.
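
The training objective is contrastive; a generic symmetric InfoNCE over (interleaved-input, target) pairs looks as follows. The temperature and batching are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, target_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of embedding pairs: matching pairs
    sit on the diagonal of the similarity matrix, everything else is a
    negative. query_emb, target_emb: (B, D)."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / tau
    labels = torch.arange(len(q), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```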

Result: Achieves 26.6% P@1 on region-caption retrieval (+25 percentage points vs dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over 3× prior best). Matches or exceeds specialized baselines on conventional tasks like scene classification and cross-modal retrieval. Introduces RSMEB benchmark covering key remote-sensing embedding applications.

Conclusion: VLM2GeoVec successfully unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. The single-encoder approach eliminates fragmentation between retrieval models and generative assistants, providing a versatile solution for satellite imagery analysis.

Abstract: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves 26.6% P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over 3× prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.

[119] TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition

Yanan Liu, Jun Liu, Hao Zhang, Dan Xu, Hossein Rahmani, Mohammed Bennamoun, Qiuhong Ke

Main category: cs.CV

TL;DR: TSkel-Mamba: A hybrid Transformer-Mamba framework for skeleton-based action recognition that combines Spatial Transformer for spatial features with enhanced Mamba for temporal modeling, achieving SOTA performance with low inference time.

DetailsMotivation: Current skeleton-based action recognition methods need better modeling of both spatial and temporal dynamics. While Mamba shows promise for temporal sequences, it has limitations in modeling inter-channel dependencies which are crucial for skeleton data.

Method: Proposes TSkel-Mamba with Spatial Transformer for spatial feature learning and enhanced Mamba for temporal modeling. Introduces Temporal Dynamic Modeling (TDM) block with Multi-scale Temporal Interaction (MTI) module using multi-scale Cycle operators to capture cross-channel temporal interactions.
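
The Cycle operator is not specified in the abstract; one plain reading of a multi-scale cross-channel temporal interaction (our assumption, not the paper's code) is to roll channel chunks along time by different offsets:

```python
import torch

def multi_scale_cycle(x, shifts=(1, 2, 4)):
    """Split channels into chunks and roll each chunk along time by a
    different offset, so channels see their neighbors' past/future states.
    x: (B, C, T); `shifts` are illustrative scales."""
    assert x.size(1) >= len(shifts)
    chunks = torch.chunk(x, len(shifts), dim=1)
    rolled = [torch.roll(c, s, dims=2) for c, s in zip(chunks, shifts)]
    return torch.cat(rolled, dim=1)
```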

Result: Achieves state-of-the-art performance on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets while maintaining low inference time.

Conclusion: TSkel-Mamba effectively captures both spatial and temporal dynamics in skeleton-based action recognition, offering an efficient and highly effective solution that addresses Mamba’s limitations in modeling inter-channel dependencies.

Abstract: Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages a Spatial Transformer for spatial feature learning while utilizing Mamba for temporal modeling. Mamba, however, employs separate SSM blocks for individual channels, which inherently limits its ability to model inter-channel dependencies. To better adapt Mamba for skeleton data and enhance Mamba's ability to model temporal dependencies, we introduce a Temporal Dynamic Modeling (TDM) block, which is a versatile plug-and-play component that integrates a novel Multi-scale Temporal Interaction (MTI) module. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets demonstrate that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time, making it both efficient and highly effective.

[120] SSA3D: Text-Conditioned Assisted Self-Supervised Framework for Automatic Dental Abutment Design

Mianjie Zheng, Xinquan Yang, Along He, Xuguang Li, Feilie Zhong, Xuefen Liu, Kun Tang, Zhicheng Zhang, Linlin Shen

Main category: cs.CV

TL;DR: SSA³D: A self-supervised assisted framework for automatic dental abutment design that eliminates separate pre-training/fine-tuning, reduces training time by half, and achieves state-of-the-art accuracy.

DetailsMotivation: Manual abutment design is tedious, AI automation is limited by lack of annotated datasets, and traditional self-supervised learning requires costly pre-training/fine-tuning processes.

Method: Dual-branch architecture with reconstruction branch (learns from masked intraoral scans) and regression branch (predicts abutment parameters). Includes Text-Conditioned Prompt module to incorporate clinical information like implant location and system.

Result: Saves half of training time compared to traditional SSL methods, achieves higher accuracy, and shows state-of-the-art performance on collected dataset.

Conclusion: SSA³D framework significantly improves accuracy and efficiency of automated abutment design by eliminating separate pre-training/fine-tuning while incorporating clinical guidance through text prompts.

Abstract: Abutment design is a critical step in dental implant restoration. However, manual design involves tedious measurement and fitting, and research on automating this process with AI is limited, due to the unavailability of large annotated datasets. Although self-supervised learning (SSL) can alleviate data scarcity, its need for pre-training and fine-tuning results in high computational costs and long training times. In this paper, we propose a Self-supervised Assisted Automatic Abutment Design framework (SSA³D), which employs a dual-branch architecture with a reconstruction branch and a regression branch. The reconstruction branch learns to restore masked intraoral scan data and transfers the learned structural information to the regression branch. The regression branch then predicts the abutment parameters under supervised learning, which eliminates the separate pre-training and fine-tuning process. We also design a Text-Conditioned Prompt (TCP) module to incorporate clinical information (such as implant location, system, and series) into SSA³D. This guides the network to focus on relevant regions and constrains the parameter predictions. Extensive experiments on a collected dataset show that SSA³D saves half of the training time and achieves higher accuracy than traditional SSL methods. It also achieves state-of-the-art performance compared to other methods, significantly improving the accuracy and efficiency of automated abutment design.

[121] Particulate: Feed-Forward 3D Object Articulation

Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

Main category: cs.CV

TL;DR: Particulate is a feed-forward transformer network that directly infers articulated structure from a single static 3D mesh, predicting parts, kinematics, and motion constraints in seconds without per-object optimization.

DetailsMotivation: Current approaches for extracting articulated 3D models from static meshes require time-consuming per-object optimization. There's a need for a fast, feed-forward method that can infer complete articulated structure (parts, kinematics, constraints) from single static meshes, especially for everyday objects and AI-generated assets.

Method: Uses a transformer network called Part Articulation Transformer that processes input mesh point clouds with a flexible, scalable architecture. The network is trained end-to-end on diverse articulated 3D assets from public datasets. During inference, predictions are lifted to the input mesh to create fully articulated 3D models.

Result: Significantly outperforms state-of-the-art approaches in quantitative and qualitative evaluations. Can process objects in seconds (vs. optimization-based methods), accurately infer articulated structure of AI-generated 3D assets, and enables full articulated 3D extraction from single images when combined with image-to-3D generators.

Conclusion: Particulate provides a fast, accurate feed-forward solution for articulated 3D structure estimation that works on both real and synthetic 3D assets, with applications in 3D content creation and manipulation.

Abstract: We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network’s feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.

[122] On Geometric Understanding and Learned Data Priors in VGGT

Jelena Bratulić, Sudhanshu Mittal, Thomas Brox, Christian Rupprecht

Main category: cs.CV

TL;DR: VGGT is a 3D foundation model that implicitly learns geometric concepts like correspondence matching and epipolar geometry through global attention, despite being trained without explicit geometric constraints.

DetailsMotivation: To understand whether VGGT builds upon geometric concepts like traditional multi-view methods or relies primarily on learned appearance-based data-driven priors, and to analyze its internal mechanisms for geometric understanding.

Method: Systematic analysis including probing intermediate features, analyzing attention patterns, performing interventions, spatial input masking, and perturbation experiments to examine VGGT’s internal mechanisms and robustness.

Result: VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. It shows robustness to occlusions, appearance variations, and camera configurations.

Conclusion: VGGT internalizes geometric structure while using learned data-driven priors, demonstrating that modern transformer-based 3D foundation models can implicitly learn geometric concepts without explicit geometric constraints.

Abstract: The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT’s internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT’s dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.

[123] Reconstruction as a Bridge for Event-Based Visual Question Answering

Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi

Main category: cs.CV

TL;DR: The paper proposes methods to integrate event cameras with Multimodal LLMs using reconstruction techniques, introduces the EvQA benchmark, and achieves SOTA performance.

DetailsMotivation: Event cameras offer advantages for scene understanding in challenging visual conditions, but integrating them with frame-based MLLMs requires balancing event data preservation with model compatibility.

Method: Two methods: 1) Frame-based Reconstruction and Tokenization (FRT) - straightforward reconstruction approach; 2) Adaptive Reconstruction and Tokenization (ART) - leverages event sparsity for efficiency.

Result: Achieved state-of-the-art performance on EvQA benchmark (1,000 event-Q&A pairs from 22 datasets), demonstrating significant potential of MLLMs in event-based vision.

Conclusion: Reconstruction serves as an effective bridge for integrating event cameras with MLLMs, with the proposed methods enabling robust event-based scene understanding while maintaining compatibility with existing models.

Abstract: Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.

[124] Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using LiDAR HD Reference Data across Metropolitan France

Ekaterina Kalinicheva, Florian Helen, Stéphane Mermoz, Florian Mouret, Milena Planells

Main category: cs.CV

TL;DR: THREASURE-Net is a deep learning framework for tree height regression and super-resolution using Sentinel-2 time series and LiDAR-derived height data, producing high-resolution canopy height maps without needing pretrained models or very high resolution optical imagery.

DetailsMotivation: Fine-scale forest monitoring is crucial for understanding canopy structure dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning can effectively integrate spectral, temporal, and spatial signals to reflect canopy structure.

Method: THREASURE-Net is an end-to-end framework for Tree Height Regression And Super-Resolution. It’s trained on Sentinel-2 time series using LiDAR-derived height metrics at multiple spatial resolutions. The model learns solely from LiDAR height information without pretrained models or very high resolution optical imagery, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution.

Result: The approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods using very high resolution imagery. It achieves mean absolute errors of 2.62 m, 2.72 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution respectively.

Conclusion: THREASURE-Net demonstrates potential for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data, enabling high-precision annual canopy-height map generation.

Abstract: Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.62 m, 2.72 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.

[125] Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: Benchmarking six T2I models shows Infinity-8B achieves best compositional alignment, while VAR models like Infinity-2B match/exceed larger diffusion models in efficiency-performance trade-offs.

DetailsMotivation: Compositional alignment (objects, attributes, spatial relationships) remains a core challenge for T2I models, and while diffusion models have been studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is largely unexamined.

Method: Benchmark six diverse T2I systems (SDXL, PixArt-α, Flux-Dev, Flux-Schnell, Infinity-2B, Infinity-8B) across full T2I-CompBench++ and GenEval suites, evaluating alignment in color/attribute binding, spatial relations, numeracy, and complex multi-object prompts.

Result: Infinity-8B achieves strongest overall compositional alignment across both benchmarks. Infinity-2B matches or exceeds larger diffusion models in several categories, showing favorable efficiency-performance trade-offs. SDXL and PixArt-α show persistent weaknesses in attribute-sensitive and spatial tasks.

Conclusion: Provides first systematic comparison of VAR and diffusion approaches to compositional alignment, establishes unified baselines for future T2I model development, and demonstrates VAR models’ competitive performance with better efficiency trade-offs.

Abstract: Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$\alpha$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$\alpha$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of T2I models.

[126] SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2

Zhendi Gong, Xin Chen

Main category: cs.CV

TL;DR: SSL-MedSAM2: A novel semi-supervised learning framework combining training-free few-shot learning with SAM2 for pseudo label generation and nnUNet for iterative refinement, achieving state-of-the-art liver segmentation results with limited annotations.

DetailsMotivation: Medical image annotation is time-consuming and costly, hindering clinical applications of deep learning models. Semi-supervised learning offers a solution by reducing labeling requirements while maintaining performance.

Method: Two-branch framework: 1) TFFS-MedSAM2 - training-free few-shot learning using pretrained SAM2 foundation model for initial pseudo label generation; 2) FSL-nnUNet - iterative fully-supervised learning with nnUNet for pseudo label refinement through multiple training cycles.

Result: Achieved outstanding performance on MICCAI2025 CARE-LiSeg challenge: Dice scores of 0.9710 (GED4) and 0.9648 (T1 MRI); Hausdorff distances of 20.07 and 21.97 respectively, demonstrating superior liver segmentation with limited annotations.

Conclusion: SSL-MedSAM2 effectively combines foundation models with iterative refinement to achieve state-of-the-art semi-supervised medical image segmentation, significantly reducing annotation requirements while maintaining high performance for clinical applications.

Abstract: Despite the success of deep-learning-based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly relies on large-scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering clinical applications. Semi-supervised learning (SSL) has emerged as an appealing strategy for training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework, SSL-MedSAM2, which contains a training-free few-shot learning branch, TFFS-MedSAM2, based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo-label generation, and an iterative fully-supervised learning branch, FSL-nnUNet, based on nnUNet for pseudo-label refinement. Results on the MICCAI 2025 CARE-LiSeg (Liver Segmentation) challenge demonstrate the outstanding performance of SSL-MedSAM2 relative to other methods. The average Dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via https://github.com/naisops/SSL-MedSAM2/tree/main.
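
The two-branch training loop can be summarized as follows; this is a hypothetical sketch with placeholder helpers (`sam2_few_shot_segment`, `train_nnunet`), not the released code.

```python
# Hypothetical sketch of the two-branch SSL-MedSAM2 loop; helper names are
# illustrative placeholders, not the released API.
def ssl_medsam2(labeled, unlabeled, n_rounds=3):
    # Branch 1 (TFFS-MedSAM2): training-free few-shot pseudo labels from a
    # pretrained SAM2, prompted with the few annotated support cases.
    pseudo = {img: sam2_few_shot_segment(img, support=labeled)
              for img in unlabeled}

    # Branch 2 (FSL-nnUNet): iterative fully-supervised refinement.
    model = None
    for _ in range(n_rounds):
        model = train_nnunet(labeled=labeled, pseudo_labeled=pseudo)
        # Replace each pseudo label with the stronger model's prediction.
        pseudo = {img: model.predict(img) for img in unlabeled}
    return model
```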

[127] Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou

Main category: cs.CV

TL;DR: The paper introduces a new benchmark (HalluSegBench) and method (RobustSeg) to address pixel-grounding hallucinations in segmentation VLMs, using counterfactual reasoning to distinguish vision- vs language-driven failures.

DetailsMotivation: Current segmentation VLMs suffer from pixel-grounding hallucinations but existing evaluations only check label matching, missing spatial footprint and severity analysis, especially for vision-driven hallucinations which are more challenging and prevalent.

Method: Formalizes Counterfactual Segmentation Reasoning (CSR) task, creates HalluSegBench benchmark with visual counterfactuals, introduces new evaluation metrics, and proposes RobustSeg model with counterfactual fine-tuning (CFT) to learn when to segment vs abstain.

Result: RobustSeg reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g) benchmarks.

Conclusion: The work addresses critical limitations in current segmentation VLM evaluation and provides both a benchmark and method to better understand and mitigate pixel-grounding hallucinations through counterfactual reasoning.

Abstract: Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).

[128] 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation

Zhiguo Lu, Jianwen Lou, Mingjun Ma, Hairong Jin, Youyi Zheng, Kun Zhou

Main category: cs.CV

TL;DR: 3DTeethSAM adapts SAM2 for 3D teeth segmentation using 2D rendering, 2D-3D projection, and three lightweight modules to improve segmentation quality and classification.

DetailsMotivation: 3D teeth segmentation is critical but challenging due to complex real-world dentition. Existing methods need improvement in accuracy and robustness for practical dental applications.

Method: Adapt SAM2 by rendering 3D teeth models into 2D images from multiple views, using SAM2 for 2D segmentation, then reconstructing 3D results via 2D-3D projections. Added three learnable modules: prompt embedding generator, mask refiner, and mask classifier, plus Deformable Global Attention Plugins (DGAP) in the encoder.

Result: Achieved 91.90% IoU on 3DTeethSeg benchmark, establishing new state-of-the-art for high-resolution 3D teeth mesh segmentation.

Conclusion: 3DTeethSAM successfully adapts SAM2 for 3D dental segmentation, achieving superior performance through innovative lightweight modules and attention enhancements, advancing digital dentistry capabilities.

Abstract: 3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, and serves as a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2’s performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three lightweight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2’s initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2’s image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.
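
The render/segment-in-2D/back-project pattern the method builds on can be sketched as below; `render` and `segment_2d` are assumed interfaces, and the per-vertex voting fusion is our simplification, not the paper's exact procedure.

```python
# Illustrative sketch (not the authors' code) of multi-view 2D-3D fusion.
import numpy as np

def segment_mesh_multiview(mesh, cameras, segment_2d, num_classes):
    votes = np.zeros((len(mesh.vertices), num_classes), dtype=np.int64)
    for cam in cameras:
        # render() is assumed to return the image, the ids of visible
        # vertices, and the pixel coordinates those vertices project to.
        image, vis_ids, pix = render(mesh, cam)
        label_map = segment_2d(image)                 # (H, W) tooth classes
        votes[vis_ids, label_map[pix[:, 1], pix[:, 0]]] += 1
    return votes.argmax(axis=1)                       # per-vertex label
```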

[129] Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis

Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li, Lívia Baxová, Cees G. M. Snoek, Mohammadreza Salehi

Main category: cs.CV

TL;DR: A novel benchmark for evaluating foundation models’ intrinsic 3D spatial understanding without finetuning, using in-context learning on multi-view images to test segmentation performance across varying viewpoint shifts.

DetailsMotivation: Existing evaluations rely on downstream finetuning with task-specific decoders, which makes it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. There's a need for benchmarks that directly probe dense visual features without requiring model adaptation.

Method: Extends the Hummingbird framework for 2D scene understanding to 3D using the Multi-View ImageNet (MVImgNet) dataset. Given images of objects from specific angles (keys), the benchmark tests segmentation performance on novel views (queries), categorized into easy, medium, hard, and extreme based on key-query view contrast differences.

Result: Benchmarked 8 state-of-the-art foundation models, finding that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments to perform well.

Conclusion: The proposed benchmark provides a more direct way to evaluate intrinsic 3D spatial understanding of foundation models without finetuning, revealing important insights about model capabilities and limitations in handling viewpoint variations.

Abstract: Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images of objects from specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report scores in four categories (easy, medium, hard, extreme) based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.

[130] In-Context Learning for Seismic Data Processing

Fabian Fuchs, Mario Ruben Fernandez, Norman Ettrich, Janis Keuper

Main category: cs.CV

TL;DR: ContextSeisNet uses in-context learning for seismic demultiple processing, conditioning predictions on spatially related example pairs to achieve better lateral consistency and user control than traditional methods and U-Net baselines.

DetailsMotivation: Traditional seismic processing methods face challenges with noisy data and manual parameter tuning, while existing deep learning approaches suffer from spatially inconsistent results across neighboring seismic gathers and lack user control.

Method: ContextSeisNet is an in-context learning model that conditions predictions on a support set of spatially related example pairs - neighboring common-depth point gathers and their corresponding labels. This allows task-specific learning at inference time without retraining.

Result: On synthetic data, ContextSeisNet outperforms U-Net baseline quantitatively with enhanced spatial coherence. On field data, it achieves superior lateral consistency compared to traditional Radon demultiple and U-Net, with improved near-offset performance and more complete multiple removal. It achieves comparable performance despite being trained on 90% less data.

Conclusion: ContextSeisNet establishes a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks, offering flexibility through user-defined examples and improved data efficiency.

Abstract: Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges such as noisy data and manual parameter tuning, among others. Recently, deep learning approaches have offered alternative solutions to some of these problems. However, important limitations of existing deep learning approaches are spatially inconsistent results across neighboring seismic gathers and a lack of user control. We address these limitations by introducing ContextSeisNet, an in-context learning model, for seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.

[131] Annotation-Free Reinforcement Learning Query Rewriting via Verifiable Search Reward

Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han, Byoung-Ki Jeon

Main category: cs.CV

TL;DR: RL-QR is an annotation-free reinforcement learning framework for query rewriting in RAG systems that works across diverse modal indices without human-annotated data.

DetailsMotivation: Optimizing queries for RAG systems is challenging across diverse modal indices, and existing approaches require costly human-annotated data which limits applicability.

Method: Uses reinforcement learning with verifiable search rewards derived from index-aligned synthetic queries, eliminating need for human annotations.

Result: Achieves up to 3.9× improvement on lexical retrievers and 3.5× on semantic retrievers on MTEB VIDORE V2 benchmark, plus 5-10% improvements on MS MARCO v2.1 and industrial datasets.

Conclusion: RL-QR provides a robust, annotation-free framework for query rewriting that works across modalities and domains, overcoming human-annotation dependencies.

Abstract: Optimizing queries for Retrieval-Augmented Generation (RAG) systems poses a significant challenge, particularly across diverse modal indices. We introduce RL-QR, a novel annotation-free reinforcement learning framework for query rewriting that eliminates the need for costly human-annotated data. By leveraging verifiable search rewards derived from index-aligned synthetic queries, RL-QR overcomes human-annotation dependencies, extending its applicability to various modalities and index domains. Experimental results demonstrate the framework’s robustness, achieving substantial retrieval performance gains of up to 3.9$\times$ on lexical retrievers and 3.5$\times$ on semantic retrievers on the MTEB VIDORE V2 benchmark for unstructured visual documents, along with consistent 5% to 10% improvements on MS MARCO v2.1 and internal industrial datasets.
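
The verifiable reward is easy to picture: because each synthetic query is generated from a known index document, retrieval rank alone yields a label-free reward. A minimal sketch, assuming a `retriever.search` interface:

```python
# Minimal sketch of a verifiable search reward. Each synthetic query is
# index-aligned (its gold document id is known by construction), so no
# human annotation is needed. `retriever.search` is an assumed interface.
def search_reward(rewritten_query, gold_doc_id, retriever, k=10):
    ranked = retriever.search(rewritten_query, top_k=k)   # ranked doc ids
    if gold_doc_id not in ranked:
        return 0.0
    return 1.0 / (ranked.index(gold_doc_id) + 1)          # reciprocal rank
```

A scalar reward of this form can then drive any standard policy-gradient update of the rewriting policy.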

[132] Using GUI Agent for Electronic Design Automation

Chunyi Li, Longfei Li, Zicheng Zhang, Xiaohong Liu, Min Tang, Weisi Lin, Guangtao Zhai

Main category: cs.CV

TL;DR: First systematic study deploying GUI agents for EDA workflows, introducing GUI-EDA dataset, comprehensive benchmark, and specialized EDAgent that outperforms EE Ph.D. students.

DetailsMotivation: Existing GUI agents are evaluated primarily on commodity software (Word, Excel), but professional CAD suites promise much higher economic return yet remain the weakest performance domain for agents, far from replacing expert EDA engineers.

Method: Created GUI-EDA dataset with 5 CAD tools across 5 physical domains (2,000+ screenshot-answer-action pairs from real-world component design). Developed comprehensive benchmark evaluating 30+ mainstream GUI agents. Built an EDA-specialized agent, EDAgent, with a reflection mechanism.

Result: EDA tasks constitute a major unsolved challenge for existing agents. EDAgent achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majoring in Electrical Engineering.

Conclusion: Extends GUI agents from generic office automation to specialized, high-value engineering domains, offering new avenue for advancing EDA productivity. Dataset will be publicly released.

Abstract: Graphical User Interface (GUI) agents adopt an end-to-end paradigm that maps a screenshot to an action sequence, thereby automating repetitive tasks in virtual environments. However, existing GUI agents are evaluated almost exclusively on commodity software such as Microsoft Word and Excel. Professional Computer-Aided Design (CAD) suites promise an order-of-magnitude higher economic return, yet remain the weakest performance domain for existing agents and are still far from replacing expert Electronic-Design-Automation (EDA) engineers. We therefore present the first systematic study that deploys GUI agents for EDA workflows. Our contributions are: (1) a large-scale dataset named GUI-EDA, covering 5 CAD tools and 5 physical domains, comprising 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real-world component design; (2) a comprehensive benchmark that evaluates 30+ mainstream GUI agents, demonstrating that EDA tasks constitute a major, unsolved challenge; and (3) an EDA-specialized agent named EDAgent, equipped with a reflection mechanism, that achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majoring in Electrical Engineering. This work extends GUI agents from generic office automation to specialized, high-value engineering domains and offers a new avenue for advancing EDA productivity. The dataset will be released at: https://github.com/aiben-ch/GUI-EDA.

[133] Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling

Maik Dannecker, Steven Jia, Nil Stolt-Ansó, Nadine Girard, Guillaume Auzias, François Rousseau, Daniel Rueckert

Main category: cs.CV

TL;DR: Proposes Gaussian-based explicit representations instead of neural implicit representations for 3D medical image reconstruction, achieving 5-10× speed-up while maintaining quality.

DetailsMotivation: High-resolution 3D reconstruction from motion-corrupted 2D acquisitions is crucial for fetal MRI diagnosis, but current implicit neural representations suffer from computational bottlenecks due to expensive Monte Carlo sampling for PSF approximation.

Method: Shifts from neural implicit representations to Gaussian explicit representations, parameterizing 3D volume as anisotropic Gaussian primitives. Leverages mathematical property that Gaussians are closed under convolution to derive closed-form analytical solution for forward model, reducing acquisition integral to exact covariance addition.

Result: Matches reconstruction quality of state-of-the-art SVR frameworks while achieving 5-10× speed-up on neonatal and fetal data, with convergence often reached in under 30 seconds.

Conclusion: Gaussian-based explicit representation enables fast, high-quality 3D reconstruction from sparse 2D medical images, paving way for clinical translation of real-time fetal 3D MRI.

Abstract: Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the property of Gaussians being closed under convolution and thus derive a closed-form analytical solution for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition ($\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5$\times$–10$\times$ speed-up on neonatal and fetal data. With convergence often reached in under 30 seconds, our framework paves the way towards translation into clinical routine of real-time fetal 3D MRI. Code will be public at https://github.com/m-dannecker/Gaussian-Primitives-for-Fast-SVR.
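
The covariance identity is easy to sanity-check numerically: convolving two Gaussian densities is equivalent in distribution to summing independent samples, so the empirical covariance of the sum should match $\mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$. A minimal NumPy sketch with made-up covariances (ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_hr = np.array([[0.8, 0.2, 0.0],
                     [0.2, 0.5, 0.1],
                     [0.0, 0.1, 0.3]])      # anisotropic HR primitive (made up)
Sigma_psf = np.diag([0.05, 0.05, 1.0])      # slice-profile PSF, thick through-plane axis

# Convolution of densities == distribution of the sum of independent samples,
# so the empirical covariance of the sum should be the sum of covariances.
x = rng.multivariate_normal(np.zeros(3), Sigma_hr, size=200_000)
n = rng.multivariate_normal(np.zeros(3), Sigma_psf, size=200_000)
emp = np.cov((x + n).T)

print(np.allclose(emp, Sigma_hr + Sigma_psf, atol=2e-2))  # True
```

Because the blurred primitive stays Gaussian, the forward model can be evaluated analytically rather than by Monte Carlo sampling, which is where the reported speed-up comes from.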

[134] FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint

Jiapeng Tang, Kai Li, Chengxiang Yin, Liuhao Ge, Fei Jiang, Jiu Xu, Matthias Nießner, Christian Häne, Timur Bagautdinov, Egor Zakharov, Peihong Guo

Main category: cs.CV

TL;DR: FactorPortrait is a video diffusion method for portrait animation that transfers facial expressions and head movements from a driving video to a single portrait image while enabling novel view synthesis from arbitrary camera viewpoints.

DetailsMotivation: The paper aims to create a controllable portrait animation system that can generate lifelike videos from a single portrait image by transferring facial expressions and head movements while allowing novel viewpoint synthesis, addressing limitations in existing portrait animation methods.

Method: Uses a pre-trained image encoder to extract facial expression latents from driving videos, which are injected into a video diffusion transformer via an expression controller. For camera and head pose control, employs Plücker ray maps and normal maps from 3D body mesh tracking. Trained on a large-scale synthetic dataset with diverse camera viewpoints, head poses, and facial expressions.

Result: Extensive experiments show that FactorPortrait outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency, demonstrating superior performance in portrait animation tasks.

Conclusion: FactorPortrait presents an effective video diffusion method for controllable portrait animation that successfully disentangles and controls facial expressions, head movement, and camera viewpoints, enabling high-quality portrait animation with novel view synthesis capabilities.

Abstract: We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
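
For readers unfamiliar with Plücker ray maps, the construction below is a common one; it is our sketch of the general idea, not the paper's implementation. `K` is the 3x3 intrinsics matrix and `c2w` a 4x4 camera-to-world pose.

```python
import numpy as np

def plucker_ray_map(K, c2w, H, W):
    """Per-pixel 6D Pluecker coordinates (d, o x d) for a pinhole camera."""
    i, j = np.meshgrid(np.arange(W), np.arange(H))             # pixel grid
    pix = np.stack([i + 0.5, j + 0.5,
                    np.ones_like(i, dtype=np.float64)], -1)    # homogeneous pixels
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T          # rays in world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)           # camera center
    return np.concatenate([dirs, np.cross(origin, dirs)], -1)  # (H, W, 6)
```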

[135] Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation

Luca Cazzola, Ahed Alboody

Main category: cs.CV

TL;DR: KineMIC adapts Text-to-Motion models for Human Activity Recognition by using CLIP text embeddings to bridge the domain gap, enabling few-shot action synthesis that improves HAR accuracy by 23.1%.

DetailsMotivation: Large annotated motion datasets are expensive to acquire for skeletal-based HAR. While T2M models can generate synthetic data, they focus on artistic motion rather than kinematically precise, class-discriminative actions needed for HAR, creating a domain gap.

Method: KineMIC adapts T2M diffusion models to HAR via transfer learning. It uses CLIP text embeddings to establish semantic correspondences between sparse HAR labels and T2M source data, providing soft supervision for kinematic distillation. This transforms generalist T2M models into specialized few-shot Action-to-Motion generators.

Result: Using HumanML3D as source and an NTU RGB+D 120 subset as target (with only 10 samples per class), KineMIC generates significantly more coherent motions, providing robust data augmentation that delivers a +23.1 percentage-point accuracy improvement.

Conclusion: KineMIC effectively bridges the domain gap between generalist T2M models and HAR requirements, enabling few-shot action synthesis that substantially improves HAR performance through synthetic data augmentation.

Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR’s requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1 percentage-point accuracy improvement. Animated illustrations and supplementary materials are available at https://lucazzola.github.io/publications/kinemic.
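
A sketch of the kinetic-mining step as we read it: embed the sparse HAR labels and the T2M captions with CLIP's text encoder, then mine the source clips whose captions are closest to each target action. The Hugging Face CLIP calls are real; `load_humanml3d_captions` is a hypothetical helper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    inputs = proc(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

har_labels = ["drinking water", "kicking something"]   # sparse target classes
captions = load_humanml3d_captions()                   # hypothetical helper

sim = embed(har_labels) @ embed(captions).T            # cosine similarities
mined = sim.topk(k=50, dim=-1).indices                 # mined source clips per class
```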

[136] Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing

Xu Zhang, Jiabin Fang, Zhuoming Ding, Jin Yuan, Xuan Liu, Qianjun Zhang, Zhiyong Li

Main category: cs.CV

TL;DR: CLV-Net is a visual prompt-guided multimodal image understanding method for remote sensing that uses bounding box cues to generate correlated segmentation masks and captions, with context-aware modeling of inter-object relationships and cross-modal alignment.

DetailsMotivation: Existing multimodal reasoning methods in remote sensing struggle with user intent alignment when only simple text prompts are available, and face challenges with visually similar objects and complex inter-object relationships in large-scale aerial imagery.

Method: CLV-Net allows users to provide bounding box visual cues to indicate regions of interest. It features a Context-Aware Mask Decoder that models inter-object relationships, and a Semantic and Relationship Alignment module with Cross-modal Semantic Consistency Loss for fine-grained discrimination and Relationship Consistency Loss for aligning textual relations with visual interactions.

Result: Comprehensive experiments on two benchmark datasets show CLV-Net outperforms existing methods and establishes new state-of-the-art results, effectively capturing user intent and producing precise, intention-aligned multimodal outputs.

Conclusion: CLV-Net successfully addresses the challenges of user intent alignment and complex object relationships in remote sensing by combining visual prompting with context-aware multimodal learning, demonstrating superior performance in generating accurate segmentation masks and captions.

Abstract: Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.

[137] Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection

Qiushi Guo

Main category: cs.CV

TL;DR: Depth Copy Paste: A multimodal depth-aware augmentation framework that generates realistic face detection training samples by copying full-body persons into semantically compatible backgrounds using semantic coherence assessment, precise segmentation, and depth-guided placement.

DetailsMotivation: Traditional copy-paste augmentation produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics, limiting face detection robustness under challenging conditions like occlusion and illumination variation.

Method: Uses BLIP and CLIP to assess semantic/visual coherence for background retrieval; integrates SAM3 for precise segmentation and Depth-Anything to extract non-occluded person regions; introduces depth-guided sliding window placement mechanism that searches background depth maps for optimal depth continuity and scale alignment.

Result: Extensive experiments show Depth Copy Paste generates more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared to traditional copy-paste and depth-free augmentation methods.

Conclusion: Depth Copy Paste effectively addresses limitations of traditional augmentation by ensuring physical consistency through multimodal depth-aware techniques, producing natural composites that enhance face detection robustness across challenging conditions.

Abstract: Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
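
The depth-guided placement search can be pictured as follows; this is a minimal sketch under our own assumptions (single-channel depth maps, a boolean foreground mask), not the released implementation.

```python
# Slide the person's bounding box over the background depth map and keep the
# location whose local depth best matches the person's depth (best continuity).
import numpy as np

def best_paste_location(bg_depth, fg_depth, fg_mask, stride=16):
    h, w = fg_mask.shape
    fg_med = np.median(fg_depth[fg_mask])            # person's characteristic depth
    best, best_cost = None, np.inf
    H, W = bg_depth.shape
    for y in range(0, H - h, stride):
        for x in range(0, W - w, stride):
            window = bg_depth[y:y + h, x:x + w][fg_mask]
            cost = np.abs(window - fg_med).mean()    # depth-continuity cost
            if cost < best_cost:
                best, best_cost = (y, x), cost
    return best
```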

[138] Text images processing system using artificial intelligence models

Aya Kaysan Bahjat

Main category: cs.CV

TL;DR: A text image classifier device that identifies textual content in images and categorizes them into Invoice, Form, Letter, or Report categories using DBNet++ for text detection and BART for classification, achieving 94.62% recognition rate.

DetailsMotivation: To address practical challenges in document image processing including changing lighting conditions, random orientation, curvature/partial text coverage, low resolution, and barely visible text in real-world scenarios.

Method: Four-step pipeline: 1) Image acquisition/preprocessing, 2) Text detection using DBNet++ (Differentiable Binarization Network Plus), 3) Text classification using BART (Bidirectional Auto-Regressive Transformers), 4) Results presentation via Python/PyQt5 UI. Supports gallery mode (browsing files) and live mode (camera feeds).

Result: Achieved 94.62% text recognition rate when tested over 10 hours on the Total-Text dataset containing high-resolution images with various challenging conditions.

Conclusion: The system effectively handles mixed-source text categorization in uncontrolled imaging conditions, demonstrating practical applicability for real-world document processing tasks.

Abstract: We present a text image classifier device that identifies textual content in images and categorizes each image into one of four predefined categories: Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode, which renders feeds from connected cameras. Its design specifically targets practical challenges such as changing light, random orientation, curvature or partial coverage of text, low resolution, and barely visible text. The processing pipeline consists of four steps: image acquisition and preprocessing, detection of textual elements with the DBNet++ (Differentiable Binarization Network Plus) model, classification of the detected elements with the BART (Bidirectional Auto-Regressive Transformers) model, and presentation of the results through a user interface written in Python and PyQt5. All stages are connected into a smooth workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the Total-Text dataset, which includes high-resolution images representing a wide range of challenging conditions. These experimental results support the practical effectiveness of the proposed methodology for mixed-source text categorization, even in uncontrolled imaging conditions.

[139] Referring Change Detection in Remote Sensing Imagery

Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel

Main category: cs.CV

TL;DR: The paper introduces Referring Change Detection (RCD), a novel approach that uses natural language prompts to detect specific change types in remote sensing images, overcoming limitations of traditional methods that detect all changes indiscriminately or rely on rigid semantic class definitions.

DetailsMotivation: Traditional change detection methods identify all changes without distinguishing types, while semantic change detection methods use rigid class definitions that limit dataset mixing and model reuse. There's a need for flexible, user-specified change detection that can target specific change types of interest.

Method: Proposes a two-stage framework: (1) RCDNet - a cross-modal fusion network for referring change detection that integrates language prompts with visual analysis, and (2) RCDGen - a diffusion-based synthetic data generation pipeline that creates realistic post-change images and change maps from pre-change images without needing semantic segmentation masks.

Result: Experiments across multiple datasets demonstrate that the framework enables scalable and targeted change detection, effectively addressing data scarcity and class imbalance challenges in referring change detection tasks.

Conclusion: The proposed Referring Change Detection approach with its two-stage framework (RCDNet + RCDGen) provides a flexible, scalable solution for targeted change detection in remote sensing, overcoming limitations of traditional methods and enabling user-specified change type detection through natural language prompts.

Abstract: Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) \textbf{RCDNet}, a cross-modal fusion network designed for referring change detection, and (II) \textbf{RCDGen}, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: https://yilmazkorkmaz1.github.io/RCD.

[140] Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation

Yan Zhang, Han Zou, Lincong Feng, Cong Xie, Ruiqi Yu, Zhenpeng Zhan

Main category: cs.CV

TL;DR: Music-to-dance generation reframed as music-token-conditioned multi-channel image synthesis using DiT-style architecture with temporal indexing and reference-pose conditioning.

DetailsMotivation: Existing pose-to-video models can translate 2D poses to realistic dance videos, but generating temporally coherent, rhythm-aligned 2D poses from music is challenging, especially for complex in-the-wild distributions.

Method: 1) Encode 2D pose sequences as one-hot images and compress with pretrained image VAE; 2) Use DiT-style backbone for modeling; 3) Introduce time-shared temporal indexing to synchronize music tokens and pose latents; 4) Implement reference-pose conditioning for subject-specific body proportions and long-horizon generation.

Result: Consistent improvements over existing music-to-dance methods on in-the-wild 2D dance corpus and AIST++2D benchmark in both pose- and video-space metrics and human preference. Ablations validate contributions of representation, temporal indexing, and reference conditioning.

Conclusion: Reframing music-to-dance as image synthesis with proper temporal alignment and conditioning enables better handling of complex pose distributions and generates higher quality, more coherent dance sequences from music.

Abstract: Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
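
One plausible reading of the pose-as-image encoding is sketched below; the exact channel layout is our assumption, not the paper's specification.

```python
# Rasterize a 2D pose sequence into a multi-channel one-hot tensor so a
# pretrained image VAE can compress it like an ordinary image.
import numpy as np

def pose_sequence_to_onehot(poses, H, W):
    """poses: (T, J, 2) joint coordinates normalized to [0, 1].
    Returns (T, J, H, W): per frame, one channel per joint with a single
    hot pixel at that joint's location."""
    T, J, _ = poses.shape
    img = np.zeros((T, J, H, W), dtype=np.float32)
    xs = np.clip((poses[..., 0] * (W - 1)).round().astype(int), 0, W - 1)
    ys = np.clip((poses[..., 1] * (H - 1)).round().astype(int), 0, H - 1)
    img[np.arange(T)[:, None], np.arange(J)[None, :], ys, xs] = 1.0
    return img
```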

[141] Weak-to-Strong Generalization Enables Fully Automated De Novo Training of Multi-head Mask-RCNN Model for Segmenting Densely Overlapping Cell Nuclei in Multiplex Whole-slice Brain Images

Lin Bai, Xiaoyang Li, Liqiang Huang, Quynh Nguyen, Hien Van Nguyen, Saurabh Prasad, Dragan Maric, John Redell, Pramod Dash, Badrinath Roysam

Main category: cs.CV

TL;DR: Weak-to-strong generalization method for automated training of multi-head Mask-RCNN with efficient channel attention for segmenting overlapping cell nuclei in multiplex IF whole-slide images, enabling learning from new instruments/protocols without human annotations.

DetailsMotivation: Need for reliable segmentation of overlapping cell nuclei in multiplex cyclic immunofluorescent whole-slide images without requiring human annotations, especially for new instruments/protocols, and to enable automated quality assessment in production environments where manual proofreading is impractical.

Method: Weak-to-strong generalization methodology applied to a multi-head extension of Mask-RCNN with efficient channel attention, featuring pseudo-label correction and coverage-expansion mechanisms for automated training on new image classes from new instruments/protocols.

Result: Method outperformed five current widely used methods in benchmarks, provides automated self-diagnosis metrics for segmentation quality, and includes open-source code, sample images, and high-resolution results for community adoption.

Conclusion: The approach enables fully automated, annotation-free segmentation of overlapping cell nuclei in multiplex IF WSI across new instruments/protocols, with superior performance to existing methods and practical deployment capabilities including automated quality assessment.

Abstract: We present a weak-to-strong generalization methodology for fully automated training of a multi-head extension of the Mask-RCNN method with efficient channel attention for reliable segmentation of overlapping cell nuclei in multiplex cyclic immunofluorescent (IF) whole-slide images (WSI), and present evidence for pseudo-label correction and coverage expansion, the key phenomena underlying weak-to-strong generalization. This method can learn to segment de novo a new class of images from a new instrument and/or a new imaging protocol without the need for human annotations. We also present metrics for automated self-diagnosis of segmentation quality in production environments, where human visual proofreading of massive WSIs is unaffordable. Our method was benchmarked against five current widely used methods and showed a significant improvement. The code, sample WSIs, and high-resolution segmentation results are provided in open form for community adoption and adaptation.

[142] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: SVG-T2I scales the SVG framework to enable high-quality text-to-image synthesis directly in Visual Foundation Model (VFM) feature space, achieving competitive performance and validating VFMs’ representational power for generative tasks.

DetailsMotivation: Visual generation grounded in VFM representations offers a unified pathway for integrating visual understanding, perception, and generation, but training large-scale text-to-image diffusion models entirely within VFM representation space remains largely unexplored.

Method: Scales the SVG (Self-supervised representations for Visual Generation) framework to propose SVG-T2I, which supports high-quality text-to-image synthesis directly in the VFM feature domain using a standard text-to-image diffusion pipeline.

Result: Achieves competitive performance with 0.75 on GenEval and 85.78 on DPG-Bench, validating the intrinsic representational power of VFMs for generative tasks.

Conclusion: The work demonstrates the viability of representation-driven visual generation, with full open-sourcing of the project including autoencoder, generation model, training/inference/evaluation pipelines, and pre-trained weights to facilitate further research.

Abstract: Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

[143] Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting

Mohammad Dehghanmanshadi, Wallapak Tavanapong

Main category: cs.CV

TL;DR: InST-based style transfer for synthetic microscopy images improves cell counting accuracy by reducing domain gap between synthetic and real data.

DetailsMotivation: Traditional domain adaptation struggles with complex textures in microscopy images; need realistic synthetic data for training deep learning models in label-scarce environments like cell counting.

Method: Adapt Inversion-Based Style Transfer (InST) framework to biomedical microscopy, combining latent-space Adaptive Instance Normalization with stochastic inversion in diffusion models to transfer style from real to synthetic images while preserving content structure.

Result: Models trained with InST-synthesized images achieve 37% lower MAE than hard-coded synthetic data, 52% reduction compared to Cell200-s (53.70 to 25.95 MAE), and outperform real data alone (25.95 vs. 27.74 MAE). Further improvements with DACS+CutMix.

Conclusion: InST-based style transfer effectively reduces domain gap between synthetic and real microscopy data, offering scalable enhancement for cell counting while minimizing manual labeling effort.

Abstract: Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: https://github.com/MohammadDehghan/InST-Microscopy.
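
The latent-space AdaIN at the heart of the adapted InST pipeline can be illustrated on its own: content latents are renormalized to carry the channel-wise statistics of the style latents. A toy sketch; the tensor shapes and latent interpretation are assumptions, not the authors' code:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5):
    """Adaptive Instance Normalization on latent maps (B, C, H, W):
    strip the content's per-channel statistics, impose the style's."""
    c_mu = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mu = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mu) / c_std * s_std + s_mu

synthetic = torch.randn(1, 4, 32, 32)  # latent of a synthetic microscopy image
real = torch.randn(1, 4, 32, 32)       # latent of a real reference image
stylized = adain(synthetic, real)      # content layout, real-image statistics
```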

[144] MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao

Main category: cs.CV

TL;DR: Proposes Matting Quality Evaluator (MQE) for video matting quality assessment without ground truth, enabling creation of VMReal dataset (28K clips, 2.4M frames) and achieving SOTA performance with MatAnyone 2.

DetailsMotivation: Video matting is limited by dataset scale and realism. Existing methods using segmentation data lack effective boundary supervision, resulting in segmentation-like mattes without fine details.

Method: Introduces learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality without ground truth. Uses MQE for: 1) online training feedback to suppress errors, 2) offline data curation to create VMReal dataset. Also introduces reference-frame training strategy for long videos.

Result: Created VMReal dataset with 28K clips and 2.4M frames. MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing all prior methods across all metrics.

Conclusion: The MQE enables large-scale video matting dataset creation and quality assessment without ground truth. The proposed approach effectively handles long videos and achieves superior matting performance through comprehensive supervision and data curation.

Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
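
The online use of the MQE amounts to a pixel-wise reweighting of the matting loss by the evaluator's reliability map. A hedged sketch of that pattern; the evaluator output and all names here are placeholders:

```python
import torch

def mqe_weighted_loss(pred_alpha, pseudo_alpha, mqe_map):
    """Suppress supervision where the quality evaluator flags errors.
    mqe_map in [0, 1]: 1 = reliable pseudo-label pixel, 0 = erroneous."""
    per_pixel = torch.abs(pred_alpha - pseudo_alpha)      # L1 matte error
    weighted = mqe_map * per_pixel                        # mask out bad regions
    return weighted.sum() / mqe_map.sum().clamp(min=1.0)  # mean over trusted px

pred = torch.rand(2, 1, 128, 128, requires_grad=True)
pseudo = torch.rand(2, 1, 128, 128)
quality = (torch.rand(2, 1, 128, 128) > 0.2).float()  # stand-in MQE output
mqe_weighted_loss(pred, pseudo, quality).backward()
```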

[145] Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs

Wentao Jiang, Vamsi Varra, Caitlin Perez-Stable, Harrison Zhu, Meredith Apicella, Nicole Nyamongo

Main category: cs.CV

TL;DR: A trustworthy, frequency-aware segmentation framework for quantifying vitiligo extent in clinical photos, combining domain-adaptive pre-training, architectural refinements with high-frequency spectral gating, and clinical trust mechanisms with uncertainty maps.

DetailsMotivation: Accurate quantification of vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response, requiring a trustworthy automated assessment system.

Method: Three synergistic pillars: (1) Data-efficient training with domain-adaptive pre-training on ISIC 2019 dataset and ROI-constrained dual-task loss; (2) Architectural refinement using ConvNeXt V2-based encoder with High-Frequency Spectral Gating module and stem-skip connections; (3) Clinical trust mechanism employing K-fold ensemble and Test-Time Augmentation for uncertainty maps.

Result: Superior performance with Dice score of 85.05%, significantly reduced boundary error (95% Hausdorff Distance improved from 44.79 px to 29.95 px), consistently outperforming CNN and Transformer baselines. Zero catastrophic failures and interpretable entropy maps for ambiguous regions.

Conclusion: The proposed framework establishes a robust and reliable standard for automated vitiligo assessment, providing trustworthy segmentation with uncertainty quantification for clinical review.

Abstract: Accurately quantifying vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response. We propose a trustworthy, frequency-aware segmentation framework built on three synergistic pillars: (1) a data-efficient training strategy combining domain-adaptive pre-training on the ISIC 2019 dataset with an ROI-constrained dual-task loss to suppress background noise; (2) an architectural refinement via a ConvNeXt V2-based encoder enhanced with a novel High-Frequency Spectral Gating (HFSG) module and stem-skip connections to capture subtle textures; and (3) a clinical trust mechanism employing K-fold ensemble and Test-Time Augmentation (TTA) to generate pixel-wise uncertainty maps. Extensive validation on an expert-annotated clinical cohort demonstrates superior performance, achieving a Dice score of 85.05% and significantly reducing boundary error (95% Hausdorff Distance improved from 44.79 px to 29.95 px), consistently outperforming strong CNN (ResNet-50 and UNet++) and Transformer (MiT-B5) baselines. Notably, our framework demonstrates high reliability with zero catastrophic failures and provides interpretable entropy maps to identify ambiguous regions for clinician review. These results suggest that the proposed framework establishes a robust and reliable standard for automated vitiligo assessment.
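
The clinical-trust pillar (ensembling plus TTA yielding pixel-wise uncertainty) follows a standard recipe: average several augmented predictions and report per-pixel entropy. A minimal flip-based TTA sketch under that assumption:

```python
import torch

def tta_uncertainty(model, image):
    """Mean prediction and per-pixel entropy over flip TTA.
    model: maps (B, 3, H, W) -> foreground probabilities (B, 1, H, W)."""
    flips = [[], [-1], [-2], [-1, -2]]  # identity, h-flip, v-flip, both
    probs = []
    with torch.no_grad():
        for dims in flips:
            x = torch.flip(image, dims=dims) if dims else image
            p = model(x)
            probs.append(torch.flip(p, dims=dims) if dims else p)
    p = torch.stack(probs).mean(0).clamp(1e-6, 1 - 1e-6)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())  # binary entropy map
    return p, entropy  # high entropy marks ambiguous lesion boundaries

seg = lambda x: torch.sigmoid(x.mean(dim=1, keepdim=True))  # stand-in model
mean_p, uncertainty = tta_uncertainty(seg, torch.randn(1, 3, 256, 256))
```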

[146] Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu

Main category: cs.CV

TL;DR: SAM2VideoX distills structure-preserving motion priors from autoregressive video tracking (SAM2) into bidirectional video diffusion (CogVideoX) to generate realistic motion for articulated/deformable objects.

DetailsMotivation: Current video models struggle to generate physically plausible motion for articulated/deformable objects like humans/animals. Scaling training data hasn't solved this, and existing approaches rely on noisy motion representations from imperfect external models.

Method: Two innovations: (1) bidirectional feature fusion module extracts global structure-preserving motion priors from recurrent model SAM2, (2) Local Gram Flow loss aligns how local features move together. Distills motion priors into CogVideoX diffusion model.

Result: SAM2VideoX achieves +2.60% improvement on VBench (95.51% vs 92.91% for REPA), 21-22% lower FVD scores, and 71.4% human preference over prior baselines.

Conclusion: Distilling structure-preserving motion priors from autoregressive tracking models into diffusion models effectively addresses physically implausible motion generation for articulated/deformable objects.

Abstract: Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/ .
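
The Local Gram Flow loss is described only as aligning how local features move together; one natural reading matches Gram (feature co-activation) matrices of local windows between the generator's and the tracker's features. A hedged toy version, with the window size and all names chosen here:

```python
import torch
import torch.nn.functional as F

def local_gram(feats: torch.Tensor, win: int = 8) -> torch.Tensor:
    """Gram matrices of non-overlapping local windows.
    feats: (B, C, H, W) -> (B, nWin, C, C), normalized by window size."""
    B, C, H, W = feats.shape
    patches = F.unfold(feats, kernel_size=win, stride=win)  # (B, C*win*win, nWin)
    patches = patches.view(B, C, win * win, -1).permute(0, 3, 1, 2)
    return patches @ patches.transpose(-1, -2) / (win * win)

def local_gram_loss(student_feats, teacher_feats, win: int = 8):
    """Match how channels co-vary within each local window."""
    return F.mse_loss(local_gram(student_feats, win),
                      local_gram(teacher_feats, win))

s = torch.randn(2, 64, 32, 32, requires_grad=True)  # diffusion features
t = torch.randn(2, 64, 32, 32)                      # SAM2-derived features
local_gram_loss(s, t).backward()
```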

[147] V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang

Main category: cs.CV

TL;DR: V-RGBX is the first end-to-end framework for intrinsic-aware video editing that unifies video inverse rendering, synthesis from intrinsic representations, and keyframe-based editing of scene properties like albedo, normal, material, and irradiance.

DetailsMotivation: Current large-scale video generation models lack a closed-loop framework that jointly understands intrinsic scene properties, leverages them for video synthesis, and supports editable intrinsic representations. Systems that perform intrinsic-aware video editing through physically grounded manipulations therefore remain unexplored.

Method: V-RGBX uses an interleaved conditioning mechanism that enables intuitive video editing through user-selected keyframes. It supports flexible manipulation of any intrinsic modality (albedo, normal, material, irradiance) and propagates edits across sequences in a physically plausible manner.

Result: Extensive qualitative and quantitative results show V-RGBX produces temporally consistent, photorealistic videos. It surpasses prior methods in applications like object appearance editing and scene-level relighting, demonstrating physically plausible propagation of keyframe edits across video sequences.

Conclusion: V-RGBX represents a significant advancement in intrinsic-aware video editing, providing the first end-to-end framework that unifies inverse rendering, synthesis, and editing capabilities while maintaining physical plausibility and temporal consistency in video manipulation.

Abstract: Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

[148] Moment-Based 3D Gaussian Splatting: Resolving Volumetric Occlusion with Order-Independent Transmittance

Jan U. Müller, Robin Tim Landsgesell, Leif Van Holland, Patrick Stotko, Reinhard Klein

Main category: cs.CV

TL;DR: Extends 3D Gaussian Splatting with moment-based transmittance computation for accurate rendering of overlapping semi-transparent objects without ray tracing or per-pixel sorting.

DetailsMotivation: 3DGS has limitations in rendering complex overlapping semi-transparent objects due to simplified alpha blending and coarse density approximations in its rasterizer.

Method: Uses moment-based order-independent transparency to characterize density distribution along camera rays with statistical moments. Computes per-pixel moments from all contributing 3D Gaussians, reconstructs continuous transmittance function for each ray, and samples within each Gaussian independently.

Result: Bridges gap between rasterization and physical accuracy by modeling light attenuation in complex translucent media, significantly improving reconstruction and rendering quality.

Conclusion: Proposes a novel method for high-fidelity transmittance computation in 3D Gaussian representations that avoids ray tracing or per-pixel sorting while enabling accurate rendering of complex semi-transparent objects.

Abstract: The recent success of 3D Gaussian Splatting (3DGS) has reshaped novel view synthesis by enabling fast optimization and real-time rendering of high-quality radiance fields. However, it relies on simplified, order-dependent alpha blending and coarse approximations of the density integral within the rasterizer, thereby limiting its ability to render complex, overlapping semi-transparent objects. In this paper, we extend rasterization-based rendering of 3D Gaussian representations with a novel method for high-fidelity transmittance computation, entirely avoiding the need for ray tracing or per-pixel sample sorting. Building on prior work in moment-based order-independent transparency, our key idea is to characterize the density distribution along each camera ray with a compact and continuous representation based on statistical moments. To this end, we analytically derive and compute a set of per-pixel moments from all contributing 3D Gaussians. From these moments, a continuous transmittance function is reconstructed for each ray, which is then independently sampled within each Gaussian. As a result, our method bridges the gap between rasterization and physical accuracy by modeling light attenuation in complex translucent media, significantly improving overall reconstruction and rendering quality.
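
The key property (per-ray moments can be accumulated in any order and then turned into a continuous transmittance) can be shown with a deliberately simplified toy that fits a single Gaussian optical-depth profile to three power moments. The paper's actual moment reconstruction is more elaborate; this only illustrates the order independence:

```python
import math

def transmittance_from_moments(samples, z_query):
    """samples: (depth z_i, optical depth tau_i) contributions in ANY order.
    Accumulate power moments m0, m1, m2 of the optical-depth distribution,
    fit a Gaussian depth profile, and evaluate T(z)."""
    m0 = sum(t for _, t in samples)                     # total optical depth
    m1 = sum(t * z for z, t in samples) / m0            # mean depth
    m2 = sum(t * z * z for z, t in samples) / m0
    sigma = math.sqrt(max(m2 - m1 * m1, 1e-12))
    # fraction of optical depth in front of z_query, via the Gaussian CDF
    cdf = 0.5 * (1.0 + math.erf((z_query - m1) / (sigma * math.sqrt(2))))
    return math.exp(-m0 * cdf)

gaussians = [(2.0, 0.4), (1.0, 0.7), (3.5, 0.2)]  # unsorted contributions
print(transmittance_from_moments(gaussians, z_query=2.5))
```

Because the moments are plain sums, the result is identical however the Gaussians are rasterized, which is exactly what removes the need for per-pixel sorting.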

[149] Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion

Jiqing Wu, Ingrid Berg, Yawei Li, Ender Konukoglu, Viktor H. Koelzer

Main category: cs.CV

TL;DR: Tera-MIND is a generative framework that creates tera-scale 3D virtual mouse brains using spatial gene expression as input, enabling analysis of spatial molecular interactions and translational applications to human brain samples.

DetailsMotivation: Holistic 3D modeling of molecularly defined brain structures is crucial for understanding complex brain functions, but existing tera-scale volumetric atlases pose computational challenges for modeling intricate brain structures within native spatial context.

Method: A patch-based and boundary-aware diffusion model that takes spatial gene expression as conditional input to generate virtual mouse brains with comprehensive cellular morphological detail at teravoxel scale.

Result: The framework successfully generates tera-scale virtual mouse brains, identifies spatial molecular interactions for key transcriptomic pathways (including glutamatergic and dopaminergic neuronal systems) through 3D gene-gene self-attention, and demonstrates translational applicability on previously unseen human brain samples.

Conclusion: Tera-MIND offers efficient generative modeling of whole virtual organisms, paving the way for integrative applications in biomedical research by enabling comprehensive 3D brain modeling at unprecedented scale and resolution.

Abstract: Holistic 3D modeling of molecularly defined brain structures is crucial for understanding complex brain functions. Using emerging tissue profiling technologies, researchers charted comprehensive atlases of the mammalian brain with sub-cellular resolution and spatially resolved transcriptomic data. However, these tera-scale volumetric atlases pose computational challenges for modeling intricate brain structures within the native spatial context. We propose Tera-MIND, a novel generative framework capable of simulating Tera-scale Mouse braINs in 3D using a patch-based and boundary-aware Diffusion model. Taking spatial gene expression as conditional input, we generate virtual mouse brains with comprehensive cellular morphological detail at teravoxel scale. Through the lens of 3D gene-gene self-attention, we identify spatial molecular interactions for key transcriptomic pathways, including glutamatergic and dopaminergic neuronal systems. Lastly, we showcase the translational applicability of Tera-MIND on previously unseen human brain samples. Tera-MIND offers efficient generative modeling of whole virtual organisms, paving the way for integrative applications in biomedical research. Project website: https://musikisomorphie.github.io/Tera-MIND.html

[150] Efficient Action Counting with Dynamic Queries

Xiaoxuan Ma, Zishi Li, Qiuyan Shang, Wentao Zhu, Hai Ci, Yu Qiao, Yizhou Wang

Main category: cs.CV

TL;DR: DeTRC introduces a novel temporal repetition counting method using action query representation with linear complexity, featuring dynamic query updates and inter-query contrastive learning to handle open-set actions and background noise.

DetailsMotivation: Existing temporal repetition counting methods rely on similarity correlation matrices with quadratic computational complexity, limiting scalability for long videos. There's a need for more efficient approaches that can handle open-set actions and distinguish between actions of interest and background noise.

Method: Proposes an action query representation for localizing repeated action cycles with linear complexity. Introduces two key components: 1) dynamic update scheme on action queries that embeds video features dynamically for open-set counting, and 2) inter-query contrastive learning to regularize video representations and distinguish actions from background noise.

Result: Significantly outperforms previous methods, especially on long video sequences, unseen actions, and actions at various speeds. On RepCountA benchmark: 26.5% improvement in OBO accuracy over state-of-the-art TransRAC, 22.7% mean error decrease, and 94.1% computational burden reduction.

Conclusion: The proposed DeTRC method with action query representation and dynamic learning components achieves superior performance in temporal repetition counting while dramatically reducing computational complexity, making it scalable for practical applications.

Abstract: Temporal repetition counting aims to quantify the repeated action cycles within a video. The majority of existing methods rely on the similarity correlation matrix to characterize the repetitiveness of actions, but their scalability is hindered due to the quadratic computational complexity. In this work, we introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity. Based on this representation, we further develop two key components to tackle the essential challenges of temporal repetition counting. Firstly, to facilitate open-set action counting, we propose the dynamic update scheme on action queries. Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation. Secondly, to distinguish between actions of interest and background noise actions, we incorporate inter-query contrastive learning to regularize the video representations corresponding to different action queries. As a result, our method significantly outperforms previous works, particularly in terms of long video sequences, unseen actions, and actions at various speeds. On the challenging RepCountA benchmark, we outperform the state-of-the-art method TransRAC by 26.5% in OBO accuracy, with a 22.7% mean error decrease and 94.1% computational burden reduction. Code is available at https://github.com/lizishi/DeTRC.
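
The inter-query contrastive term is stated only at a high level; an InfoNCE-style reading, where pooled features of matching action queries from two passes are positives and all other queries are negatives, would look roughly like this (the temperature and pooling are assumptions):

```python
import torch
import torch.nn.functional as F

def inter_query_contrastive(q_a: torch.Tensor, q_b: torch.Tensor,
                            tau: float = 0.1) -> torch.Tensor:
    """q_a, q_b: (Q, D) pooled features of the same Q action queries from two
    views/passes. Matching queries are positives; every other query is a
    negative, pushing different actions (and background) apart."""
    za = F.normalize(q_a, dim=-1)
    zb = F.normalize(q_b, dim=-1)
    logits = za @ zb.t() / tau            # (Q, Q) similarity matrix
    labels = torch.arange(za.size(0))     # query i matches query i
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = inter_query_contrastive(torch.randn(8, 128), torch.randn(8, 128))
```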

[151] SpecDETR: A transformer-based hyperspectral point object detection network

Zhaoxu Li, Wei An, Gaowei Guo, Longguang Wang, Yingqian Wang, Zaiping Lin

Main category: cs.CV

TL;DR: The paper proposes SpecDETR, a novel Transformer-based network for hyperspectral multi-class point object detection, addressing limitations of traditional per-pixel HTD methods by leveraging spatial-spectral joint features.

DetailsMotivation: Existing hyperspectral target detection (HTD) methods use per-pixel binary classification, which ignores the 3D cube structure of hyperspectral images that integrates both spatial and spectral dimensions. This limitation prevents joint expression of spatial-spectral features that are synergistically present in HSIs.

Method: The paper introduces hyperspectral point object detection as a new task framework and proposes SpecDETR - a specialized Transformer network with multi-layer encoder and self-excited subpixel-scale attention modules to directly extract deep spatial-spectral joint features from hyperspectral cubes without relying on pre-trained backbones.

Result: The authors created the SPOD benchmark dataset for evaluation and demonstrated that SpecDETR outperforms state-of-the-art visual object detection networks and HTD methods in hyperspectral point object detection through extensive experiments.

Conclusion: SpecDETR successfully addresses the limitations of traditional HTD methods by leveraging spatial-spectral synergistic representation, establishing a new framework for hyperspectral point object detection that better captures the 3D structure of hyperspectral imagery.

Abstract: Hyperspectral target detection (HTD) aims to identify specific materials based on spectral information in hyperspectral imagery and can detect extremely small objects, some of which occupy less than one pixel. However, existing HTD methods are developed based on per-pixel binary classification, neglecting the three-dimensional cube structure of hyperspectral images (HSIs) that integrates both spatial and spectral dimensions. The synergistic existence of spatial and spectral features in HSIs enables objects to exhibit both simultaneously, yet the per-pixel HTD framework limits the joint expression of these features. In this paper, we rethink HTD from the perspective of spatial-spectral synergistic representation and propose hyperspectral point object detection as an innovative task framework. We introduce SpecDETR, the first specialized network for hyperspectral multi-class point object detection, which eliminates dependence on the pre-trained backbone networks commonly required by vision-based object detectors. SpecDETR uses a multi-layer Transformer encoder with self-excited subpixel-scale attention modules to directly extract deep spatial-spectral joint features from hyperspectral cubes. We develop a simulated hyperspectral point object detection benchmark termed SPOD and, for the first time, evaluate and compare the performance of visual object detection networks and HTD methods on hyperspectral point object detection. Extensive experiments demonstrate that our proposed SpecDETR outperforms SOTA visual object detection networks and HTD methods. Our code and dataset are available at https://github.com/ZhaoxuLi123/SpecDETR.

[152] Visual-Friendly Concept Protection via Selective Adversarial Perturbations

Xiaoyue Mi, Fan Tang, You Wu, Juan Cao, Peng Li, Yang Liu

Main category: cs.CV

TL;DR: VCPro is a framework that protects key concepts in images from unauthorized AI personalization using less perceptible adversarial perturbations, balancing protection effectiveness with visual quality.

DetailsMotivation: Previous adversarial protection methods degrade visual quality with noticeable perturbations. There's a need for concept protection that maintains better visual quality while still preventing unauthorized AI personalization.

Method: VCPro uses a relaxed optimization objective with Lagrangian multiplier method to find minimally perceptible yet effective adversarial perturbations that protect owner-chosen key concepts in images.

Result: VCPro achieves better trade-off between perturbation visibility and protection effectiveness compared to previous methods, effectively protecting target concepts with less noticeable alterations.

Conclusion: The proposed framework successfully prioritizes protection of key concepts while maintaining better visual quality, addressing both privacy/intellectual property concerns and visual degradation issues.

Abstract: Personalized concept generation by tuning diffusion models with a few images raises potential legal and ethical concerns regarding privacy and intellectual property rights. Researchers attempt to prevent malicious personalization using adversarial perturbations. However, previous efforts have mainly focused on the effectiveness of protection while neglecting the visibility of perturbations. They utilize global adversarial perturbations, which introduce noticeable alterations to original images and significantly degrade visual quality. In this work, we propose the Visual-Friendly Concept Protection (VCPro) framework, which prioritizes the protection of key concepts chosen by the image owner through adversarial perturbations with lower perceptibility. To ensure these perturbations are as inconspicuous as possible, we introduce a relaxed optimization objective to identify the least perceptible yet effective adversarial perturbations, solved using the Lagrangian multiplier method. Qualitative and quantitative experiments validate that VCPro achieves a better trade-off between the visibility of perturbations and protection effectiveness, effectively prioritizing the protection of target concepts in images with less perceptible perturbations.
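
The relaxed objective maps onto a textbook Lagrangian scheme: maximize a protection (anti-personalization) loss while a multiplier holds a perceptibility penalty near a budget. A schematic sketch with a toy stand-in for the protection loss; the paper's actual objective, perceptual metric, and key-concept masking are not reproduced here:

```python
import torch

def protect(image, protection_loss, steps=50, lr=0.01, budget=0.01, rho=0.1):
    """Find a barely perceptible delta maximizing `protection_loss`, with a
    Lagrange multiplier `lam` holding mean |delta| near `budget` (schematic)."""
    delta = torch.zeros_like(image, requires_grad=True)
    lam = torch.tensor(1.0)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        percept = delta.abs().mean()              # perceptibility proxy
        loss = -protection_loss(image + delta) + lam * (percept - budget)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                     # dual ascent on lam
            lam = (lam + rho * (delta.abs().mean() - budget)).clamp(min=0.0)
    return delta.detach()

net = torch.nn.Conv2d(3, 8, 3, padding=1).eval()  # frozen toy feature extractor
for p in net.parameters():
    p.requires_grad_(False)
toy_loss = lambda x: net(x).pow(2).mean()         # stand-in protection objective
delta = protect(torch.rand(1, 3, 64, 64), toy_loss)
```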

[153] Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation

Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua

Main category: cs.CV

TL;DR: Any2Caption is a framework that uses multimodal LLMs to interpret diverse inputs (text, images, videos, region/motion/camera cues) into structured captions for better guidance in controllable video generation.

DetailsMotivation: To address the bottleneck of accurate user intent interpretation in current video generation systems, enabling better controllability through diverse input conditions.

Method: Decouples condition interpretation from video synthesis, uses MLLMs to interpret various inputs into dense structured captions, and introduces Any2CapIns dataset (337K instances, 407K conditions) for instruction tuning.

Result: Significant improvements in controllability and video quality across various aspects of existing video generation models, demonstrated through comprehensive evaluations.

Conclusion: Any2Caption effectively bridges the gap between diverse user inputs and video generation by providing structured caption guidance, enhancing both control and quality in video synthesis.

Abstract: To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that provide backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models. Project Page: https://sqwu.top/Any2Cap/

[154] Conditional Text-to-Image Generation with Reference Guidance

Taewook Kim, Ze Wang, Zhengyuan Yang, Jiang Wang, Lijuan Wang, Zicheng Liu, Qiang Qiu

Main category: cs.CV

TL;DR: This paper introduces expert plugins for Stable Diffusion that use visual reference conditions to improve text rendering accuracy, enabling better English/multilingual scene-text and logo generation with minimal parameters.

DetailsMotivation: Text-to-image diffusion models struggle with precise subject rendering, especially text spelling. Text tokenizers have vocabulary limitations that can't adequately represent certain visual elements, and models have difficulty with non-English text generation.

Method: Develop small-scale expert plugins that add reference conditions to Stable Diffusion. Each plugin uses auxiliary networks and customized loss functions for specific applications: English scene-text generation, multilingual scene-text generation, and logo-image generation.

Result: The expert plugins demonstrate superior performance over existing methods on all tasks, with each plugin containing only 28.55M trainable parameters, making them efficient additions to the base model.

Conclusion: Using visual reference conditions through specialized plugins effectively addresses text-to-image models’ limitations in precise subject rendering, extending capabilities to handle vocabulary limitations and novel applications like non-English text generation.

Abstract: Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multilingual scene-text generation, and logo-image generation. Our expert plugins demonstrate superior results to existing methods on all tasks, each containing only 28.55M trainable parameters.

[155] Free-Lunch Color-Texture Disentanglement for Stylized Image Generation

Jiang Qin, Senmao Li, Alexandra Gomez-Villa, Shiqi Yang, Yaxing Wang, Kai Wang, Joost van de Weijer

Main category: cs.CV

TL;DR: SADis is a tuning-free approach for disentangled stylized image generation that separates color and texture attributes using CLIP embeddings and regularized transformations.

DetailsMotivation: Current diffusion-based methods struggle with fine-grained style customization and controlling multiple style attributes (color and texture) independently. There's a need for disentangled stylized image generation (DisIG) without requiring model tuning.

Method: Leverages Image-Prompt Additivity in CLIP embedding space to extract Color-Texture Embeddings (CTE) from reference images. Uses whitening and coloring transformation for color consistency, and introduces noise term in Regularized Whitening and Coloring Transformation (RegWCT) to prevent texture loss from signal-leak bias.

Result: Experiments on WikiArt and StyleDrop datasets show SADis surpasses state-of-the-art methods both qualitatively and quantitatively for the DisIG task.

Conclusion: SADis provides a precise, customizable, and tuning-free solution for disentangled stylized image generation, enabling independent control over color and texture attributes.

Abstract: Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task. Code is released at https://deepffff.github.io/sadis.github.io/.
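
The whitening-and-coloring step used for color consistency is a classical statistics-matching operation: decorrelate the content's covariance, then impose the reference's. Shown here on raw RGB pixel statistics and without the paper's RegWCT noise term, purely as an illustration:

```python
import torch

def wct_color(content: torch.Tensor, reference: torch.Tensor):
    """Match content's RGB mean/covariance to the reference's.
    Both inputs: (3, H, W) in [0, 1]. Returns the recolored content."""
    def stats(img):
        x = img.reshape(3, -1)
        mu = x.mean(dim=1, keepdim=True)
        xc = x - mu
        cov = xc @ xc.t() / (x.shape[1] - 1)
        return x, mu, cov

    x, mu_c, cov_c = stats(content)
    _, mu_r, cov_r = stats(reference)
    # eigendecompositions give the whitening / coloring transforms
    ec, vc = torch.linalg.eigh(cov_c)
    er, vr = torch.linalg.eigh(cov_r)
    whiten = vc @ torch.diag(ec.clamp(min=1e-8).rsqrt()) @ vc.t()
    color = vr @ torch.diag(er.clamp(min=1e-8).sqrt()) @ vr.t()
    out = color @ (whiten @ (x - mu_c)) + mu_r
    return out.reshape_as(content).clamp(0, 1)

content = torch.rand(3, 64, 64)
palette = torch.rand(3, 64, 64)        # color reference image
recolored = wct_color(content, palette)
```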

[156] COSMO-INR: Complex Sinusoidal Modulation for Implicit Neural Representations

Pandula Thennakoon, Avishka Ranasinghe, Mario De Silva, Buwaneka Epakanda, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath

Main category: cs.CV

TL;DR: The paper introduces a new activation function for Implicit Neural Representations (INRs) that addresses spectral bias and improves performance across multiple vision tasks by modulating activations with complex sinusoidal terms and using regularized deep priors.

DetailsMotivation: INR performance is highly sensitive to activation function choice, but theoretical understanding is lacking. Key limitations include spectral bias (reduced high-frequency sensitivity), limited noise robustness, and difficulty capturing both local and global structure simultaneously.

Method: 1) Analyze INR signal representation using harmonic analysis and Chebyshev polynomials; 2) Prove that modulating activation functions with complex sinusoidal terms yields richer spectral support; 3) Introduce new activation function tailored to INRs; 4) Use regularized deep priors from task-specific models to adapt activations for improved convergence.

Result: Significant performance gains across multiple tasks: +5.67 dB PSNR for image reconstruction, +0.46 dB for denoising, +0.64 dB for 6X super-resolution over nearest SOTA, plus improvements in inpainting and 3D shape reconstruction. The activation consistently outperforms existing alternatives.

Conclusion: The proposed activation function, theoretically grounded in harmonic analysis and enhanced with regularized deep priors, effectively addresses INR limitations like spectral bias and improves performance across diverse vision tasks, offering a more robust and efficient representation framework.

Abstract: Implicit neural representations (INRs) are a powerful paradigm for modeling data, offering a continuous alternative to discrete signal representations. Their ability to compactly encode complex signals has led to strong performance in many vision tasks. Prior work shows INR performance is highly sensitive to the choice of activation function in the underlying multilayer perceptron, yet the theoretical reasons remain unclear. Key limitations also persist, including spectral bias (reduced sensitivity to high-frequency content), limited robustness to noise, and difficulty capturing local and global structure jointly. We analyze INR signal representation using harmonic analysis and Chebyshev polynomials. We prove that modulating activation functions with a complex sinusoidal term yields richer and more complete spectral support throughout the network. Building on this, we introduce a new activation function tailored to INRs and validate our theory using Chebyshev analysis and extensive experiments. We additionally use a regularized deep prior, extracted from a task-specific model, to adapt the activations, further improving convergence speed and stability. Across image reconstruction (average PSNR gain of +5.67 dB over the nearest counterpart on a diverse dataset), denoising (+0.46 dB PSNR), super-resolution (+0.64 dB over the nearest SOTA method for 6X upscaling), inpainting, and 3D shape reconstruction, our activation consistently outperforms existing state-of-the-art alternatives.
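
The exact activation is not given in this summary; a representative layer in the stated spirit multiplies a base (Gaussian) nonlinearity by the real part of a complex sinusoid, echoing Gabor-wavelet-style INR activations from prior work. All constants below are assumptions:

```python
import torch
import torch.nn as nn

class SinusoidModulatedLayer(nn.Module):
    """INR layer whose activation is a Gaussian modulated by a complex
    sinusoid: Re[exp(i*omega*z)] * exp(-(s*z)^2) = cos(omega*z) * exp(-(s*z)^2).
    The modulation injects high-frequency support a plain Gaussian lacks."""
    def __init__(self, in_dim, out_dim, omega: float = 30.0, s: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega, self.s = omega, s

    def forward(self, x):
        z = self.linear(x)
        return torch.cos(self.omega * z) * torch.exp(-(self.s * z) ** 2)

inr = nn.Sequential(SinusoidModulatedLayer(2, 256),
                    SinusoidModulatedLayer(256, 256),
                    nn.Linear(256, 3))       # coords (x, y) -> RGB
coords = torch.rand(1024, 2) * 2 - 1
rgb = inr(coords)
```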

[157] MoCA-Video: Motion-Aware Concept Alignment for Consistent Video Editing

Tong Zhang, Juan C Leon Alcazar, Victor Escorcia, Bernard Ghanem

Main category: cs.CV

TL;DR: MoCA-Video is a training-free framework for semantic mixing in videos using latent space manipulation of frozen video diffusion models with class-agnostic segmentation and momentum-based correction for temporal stability.

DetailsMotivation: The paper aims to achieve high-quality semantic mixing in videos without retraining diffusion models, addressing challenges of temporal stability and semantic alignment when editing videos under semantic shifts beyond the trained data distribution.

Method: Operates in latent space of frozen video diffusion model; uses class-agnostic segmentation with diagonal denoising scheduler to localize/track objects; introduces momentum-based correction to approximate novel hybrid distributions; includes light gamma residual module to smooth visual artifacts.

Result: Outperforms both training-free and trained baselines; achieves superior semantic mixing and temporal coherence without retraining; demonstrates controllable, high-quality video editing under semantic shifts through structured manipulation of diffusion noise trajectories.

Conclusion: Structured manipulation of diffusion noise trajectories enables controllable and high-quality video editing under semantic shifts, establishing MoCA-Video as an effective training-free framework for semantic mixing in videos.

Abstract: We present MoCA-Video, a training-free framework for semantic mixing in videos. Operating in the latent space of a frozen video diffusion model, MoCA-Video utilizes class-agnostic segmentation with a diagonal denoising scheduler to localize and track the target object across frames. To ensure temporal stability under semantic shifts, we introduce momentum-based correction to approximate novel hybrid distributions beyond the trained data distribution, alongside a light gamma residual module that smooths out visual artifacts. We evaluate the model's performance using SSIM, LPIPS, and a proposed metric that quantifies semantic alignment between reference and output. Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. These results establish that structured manipulation of diffusion noise trajectories enables controllable and high-quality video editing under semantic shifts.
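
"Momentum-based correction" suggests an exponential moving average over per-step latent updates during denoising, damping abrupt jumps when the trajectory leaves the trained distribution. A schematic sketch, not the authors' exact update rule:

```python
import torch

def denoise_with_momentum(latent, step_fn, timesteps, beta: float = 0.9):
    """step_fn(latent, t) -> proposed next latent (one scheduler step).
    An EMA over per-step deltas smooths the trajectory across frames."""
    velocity = torch.zeros_like(latent)
    for t in timesteps:
        proposed = step_fn(latent, t)
        delta = proposed - latent
        velocity = beta * velocity + (1.0 - beta) * delta  # momentum buffer
        latent = latent + velocity
    return latent

# toy stand-in for a frozen video-diffusion scheduler step
step = lambda z, t: z - 0.1 * z
z = denoise_with_momentum(torch.randn(1, 4, 8, 32, 32), step, range(50))
```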

[158] Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

Bryan Wong, Jong Woo Kim, Huazhu Fu, Mun Yong Yi

Main category: cs.CV

TL;DR: HiVE-MIL is a hierarchical vision-language framework for few-shot WSI classification that addresses limitations in existing VLM-MIL methods by modeling intra-modal scale interactions and cross-modal alignment through unified graph construction and text-guided filtering.

DetailsMotivation: Existing VLM-integrated MIL methods for few-shot WSI classification have two key limitations: (1) insufficient modeling of interactions within same modalities across different scales (e.g., 5x and 20x), and (2) inadequate alignment between visual and textual modalities on the same scale.

Method: HiVE-MIL constructs a unified graph with parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, plus heterogeneous intra-scale edges linking visual and textual nodes on the same scale. It incorporates a two-stage, text-guided dynamic filtering mechanism to remove weakly correlated patch-text pairs, and uses hierarchical contrastive loss to align textual semantics across scales.

Result: Extensive experiments on TCGA breast, lung, and kidney cancer datasets show HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings.

Conclusion: The results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data.

Abstract: Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL.

[159] Temporal In-Context Fine-Tuning with Temporal Reasoning for Versatile Control of Video Diffusion Models

Kinam Kim, Junha Hyung, Jaegul Choo

Main category: cs.CV

TL;DR: TIC-FT is an efficient fine-tuning method for video diffusion models that concatenates condition and target frames temporally with noise-buffered transitions, enabling few-shot conditional generation without architectural changes.

DetailsMotivation: Existing fine-tuning methods for controllable video generation require large datasets, external encoders, or architectural modifications, and are limited to spatially aligned conditioning, making them inflexible and computationally expensive.

Method: Temporal In-Context Fine-Tuning (TIC-FT) concatenates condition and target frames along the temporal axis, inserting intermediate buffer frames with progressively increasing noise levels to enable smooth transitions and align with the pretrained model’s temporal dynamics.

Result: TIC-FT achieves strong performance with only 10-30 training samples, outperforms existing baselines in condition fidelity and visual quality, and works efficiently with large-scale models like CogVideoX-5B and Wan-14B across tasks including image-to-video and video-to-video generation.

Conclusion: TIC-FT provides an efficient, versatile, and scalable approach for adapting pretrained video diffusion models to diverse conditional generation tasks without architectural modifications, enabling high-quality controllable video synthesis with minimal data and compute.

Abstract: Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model’s temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
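
The buffer-frame construction is concrete enough to sketch: clean condition latents, then buffer frames noised at progressively increasing levels, then fully noised target latents, all concatenated on the time axis. The shapes and noising convention here are assumptions:

```python
import torch

def build_tic_ft_sequence(cond, target, n_buffer: int = 4):
    """cond: (B, C, Tc, H, W) clean condition latents;
    target: (B, C, Tt, H, W) target latents (fully noised during training).
    Buffer frames interpolate noise level from near-clean to near-pure noise."""
    buffers = []
    for i in range(n_buffer):
        alpha = (i + 1) / (n_buffer + 1)         # increasing noise level
        base = target[:, :, :1]                  # anchor on first target frame
        noised = (1 - alpha) * base + alpha * torch.randn_like(base)
        buffers.append(noised)
    noise = torch.randn_like(target)             # target fully noised
    return torch.cat([cond] + buffers + [noise], dim=2)  # temporal concat

cond = torch.randn(1, 4, 8, 32, 32)
target = torch.randn(1, 4, 16, 32, 32)
seq = build_tic_ft_sequence(cond, target)        # (1, 4, 8+4+16, 32, 32)
```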

[160] Open-World Object Counting in Videos

Niki Amini-Naieni, Andrew Zisserman

Main category: cs.CV

TL;DR: Open-world object counting in videos: CountVid model counts unique instances of target objects specified by text or image prompts, outperforming baselines on new VideoCount dataset.

DetailsMotivation: The paper addresses the challenging problem of counting unique object instances in videos, especially in crowded scenes with occlusions and similar-looking objects. Current methods struggle with avoiding double counting and identifying reappearances of objects.

Method: CountVid combines an image-based counting model with a promptable video segmentation and tracking model. This enables automated open-world object counting across video frames using text descriptions or image examples as prompts.

Result: The authors introduce VideoCount, a new dataset built from TAO, MOT20 tracking datasets, plus penguin and metal alloy crystallization videos. CountVid demonstrates accurate object counting and significantly outperforms strong baselines on this dataset.

Conclusion: CountVid effectively solves the open-world object counting task in videos, providing accurate counts of unique object instances. The model, dataset, and code are publicly available for further research.

Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and objects of similar appearance, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model, to enable automated open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for this novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://www.robots.ox.ac.uk/~vgg/research/countvid/.

[161] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang

Main category: cs.CV

TL;DR: FlowDirector is a training-free, inversion-free video editing framework that models editing as direct evolution in data space using ODEs, avoiding inaccurate inversion steps with three flow correction strategies for better appearance, motion, and stability.

DetailsMotivation: Existing training-free video editing methods rely on inversion-editing paradigms that map videos to latent spaces, but the inversion process is imperfect and compromises appearance fidelity and motion consistency.

Method: FlowDirector models editing as direct evolution in data space using ordinary differential equations (ODEs) to guide video transitions along spatio-temporal manifolds. It introduces three flow correction strategies: 1) Direction-aware flow correction for structural/textural changes, 2) Motion-appearance decoupling for consistency, and 3) Differential averaging guidance for stability.

Result: Extensive experiments across various editing tasks and benchmarks show FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation.

Conclusion: FlowDirector establishes an efficient new paradigm for coherent video editing without inversion, offering better appearance fidelity and motion consistency than inversion-based approaches.

Abstract: Text-driven video editing aims to modify video content based on natural language instructions. While recent training-free methods have leveraged pretrained diffusion models, they often rely on an inversion-editing paradigm. This paradigm maps the video to a latent space before editing. However, the inversion process is not perfectly accurate, often compromising appearance fidelity and motion consistency. To address this, we introduce FlowDirector, a novel training-free and inversion-free video editing framework. Our framework models the editing process as a direct evolution in the data space. It guides the video to transition smoothly along its inherent spatio-temporal manifold using an ordinary differential equation (ODE), thereby avoiding the inaccurate inversion step. From this foundation, we introduce three flow correction strategies for appearance, motion, and stability: 1) Direction-aware flow correction amplifies components that oppose the source direction and removes irrelevant terms, breaking conservative streamlines and enabling stronger structural and textural changes. 2) Motion-appearance decoupling optimizes motion agreement as an energy term at each timestep, significantly improving consistency and motion transfer. 3) Differential averaging guidance strategy leverages differences among multiple candidate flows to approximate a low variance regime at low cost, suppressing artifacts and stabilizing the trajectory. Extensive experiments across various editing tasks and benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation, establishing an efficient new paradigm for coherent video editing without inversion.
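
The first correction strategy reads as a vector decomposition of the editing velocity against the source velocity: the component opposing the source is amplified, the aligned (conservative) component damped. One plausible geometric sketch, with the boost and damping factors invented here:

```python
import torch

def direction_aware_correction(v_edit, v_src, boost: float = 1.5):
    """Decompose the editing velocity against the source velocity.
    Opposing components are amplified (stronger structural change);
    aligned components, which keep conservative streamlines, are damped."""
    flat_src = v_src.flatten(1)
    flat_edit = v_edit.flatten(1)
    # scalar projection of v_edit onto the unit source direction, per sample
    unit = flat_src / flat_src.norm(dim=1, keepdim=True).clamp(min=1e-8)
    coef = (flat_edit * unit).sum(dim=1, keepdim=True)
    parallel = (coef * unit).view_as(v_edit)
    orthogonal = v_edit - parallel
    scale = torch.where(coef < 0, torch.full_like(coef, boost),
                        torch.full_like(coef, 0.5)).view(-1, 1, 1, 1)
    return orthogonal + scale * parallel

v_e = torch.randn(2, 4, 32, 32)   # velocity toward the edited prompt
v_s = torch.randn(2, 4, 32, 32)   # velocity toward the source prompt
corrected = direction_aware_correction(v_e, v_s)
```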

[162] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, Theo Gevers, Martin R. Oswald, Danda Pani Paudel

Main category: cs.CV

TL;DR: The paper introduces a large-scale benchmark for evaluating Language Gaussian Splatting methods in 3D space, covering 1060 scenes across indoor/outdoor datasets, and proposes GaussianWorld-49K dataset to demonstrate generalizable approaches’ advantages.

DetailsMotivation: Current Language Gaussian Splatting methods are mostly evaluated on rendered 2D views with limited scenes and viewpoints, which restricts understanding of holistic 3D scene understanding capabilities.

Method: Proposes the first large-scale benchmark for systematic 3D evaluation of three Language Gaussian Splatting approaches (per-scene optimization-based, optimization-free, and generalizable), tested on 1060 scenes across multiple datasets, and introduces GaussianWorld-49K dataset with ~49K diverse scenes.

Result: Benchmark results show clear advantage of generalizable paradigm over other approaches, particularly in relaxing scene-specific limitations, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance.

Conclusion: The proposed benchmark and dataset demonstrate that generalizable approaches can harness strong data priors and provide better 3D scene understanding capabilities compared to scene-specific methods.

Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current work on Language Gaussian Splatting falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most methods are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our codes, benchmark, and datasets are released at https://scenesplatpp.gaussianworld.ai/.

[163] Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering

Jun Li, Hongjian Dou, Zhenyu Zhang, Kai Li, Shaoguo Liu, Tingting Gao

Main category: cs.CV

TL;DR: PMTFR framework improves Composed Image Retrieval using Pyramid Matching Model with Training-Free Refinement, outperforming SOTA methods on CIR benchmarks.

DetailsMotivation: Existing CIR methods often require additional training for ranking models, and Chain-of-Thought techniques have limited application in CIR: either compressing visual info to text or needing elaborate prompts. Current approaches only work for zero-shot CIR, struggling with supervised CIR despite well-trained models.

Method: Proposed PMTFR framework with Pyramid Matching Model enhanced by Pyramid Patcher module for multi-granular visual understanding. Uses representation engineering to extract representations from CoT data and inject them into LVLMs for Training-Free Refinement, obtaining refined retrieval scores without explicit textual reasoning.

Result: Extensive experiments on CIR benchmarks show PMTFR surpasses state-of-the-art methods in supervised CIR tasks.

Conclusion: PMTFR effectively addresses CIR challenges by enhancing visual understanding at different granularities and enabling training-free refinement, achieving superior performance in supervised CIR without additional model training.

Abstract: Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited – compressing visual information into text or relying on elaborate prompt designs. Besides, existing works only utilize it for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhance the Pyramid Matching Model’s understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into the LVLMs. This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
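
Representation engineering of this kind is commonly implemented by adding a precomputed activation direction to a layer's hidden states via a forward hook. Below is a minimal sketch with a toy block; the steering vector (e.g., a mean difference between hidden states on CoT-style and plain inputs, computed offline) and the scale `alpha` are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy stand-in for one LVLM transformer block."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)

    def forward(self, x):
        return x + self.ff(x)

def make_injection_hook(steering_vec, alpha=0.5):
    """Shift a layer's output along a precomputed 'reasoning' direction."""
    def hook(module, inputs, output):
        return output + alpha * steering_vec
    return hook

d = 64
block = Block(d)
steering = torch.randn(d)  # hypothetical CoT direction, extracted offline
handle = block.register_forward_hook(make_injection_hook(steering))
refined = block(torch.randn(2, 10, d))  # (batch, tokens, dim), now steered
handle.remove()
```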

[164] Exploring Diffusion with Test-Time Training on Efficient Image Restoration

Rongchang Lu, Tianduo Luo, Yunzhi Jiang, Conghan Yue, Pei Yang, Guibao Liu, Changyang Gu

Main category: cs.CV

TL;DR: DiffRWKVIR: A novel image restoration framework combining Test-Time Training with efficient diffusion, featuring omni-scale 2D state evolution, chunk-optimized flash processing, and prior-guided efficient diffusion for superior performance and efficiency.

DetailsMotivation: Address challenges in image restoration including ineffective feature fusion, computational bottlenecks, and inefficient diffusion processes that limit performance and practical deployment.

Method: Three key innovations: (1) Omni-Scale 2D State Evolution extends RWKV’s location-dependent parameterization to hierarchical multi-directional 2D scanning with linear complexity; (2) Chunk-Optimized Flash Processing accelerates intra-chunk parallelism via contiguous chunk processing; (3) Prior-Guided Efficient Diffusion extracts compact Image Prior Representation in only 5-20 steps.

Result: Outperforms SwinIR, HAT, and MambaIR/v2 across super-resolution and inpainting benchmarks (Set5, Set14, BSD100, Urban100, Places365) in PSNR, SSIM, LPIPS, and efficiency metrics. Achieves 45% faster training/inference than DiffIR.

Conclusion: Establishes a new paradigm for adaptive, high-efficiency image restoration with optimized hardware utilization, solving computational inefficiency in denoising while maintaining superior restoration quality.

Abstract: Image restoration faces challenges including ineffective feature fusion, computational bottlenecks and inefficient diffusion processes. To address these, we propose DiffRWKVIR, a novel framework unifying Test-Time Training (TTT) with efficient diffusion. Our approach introduces three key innovations: (1) Omni-Scale 2D State Evolution extends RWKV’s location-dependent parameterization to hierarchical multi-directional 2D scanning, enabling global contextual awareness with linear complexity O(L); (2) Chunk-Optimized Flash Processing accelerates intra-chunk parallelism by 3.2x via contiguous chunk processing (O(LCd) complexity), reducing sequential dependencies and computational overhead; (3) Prior-Guided Efficient Diffusion extracts a compact Image Prior Representation (IPR) in only 5-20 steps, proving 45% faster training/inference than DiffIR while solving computational inefficiency in denoising. Evaluated across super-resolution and inpainting benchmarks (Set5, Set14, BSD100, Urban100, Places365), DiffRWKVIR outperforms SwinIR, HAT, and MambaIR/v2 in PSNR, SSIM, LPIPS, and efficiency metrics. Our method establishes a new paradigm for adaptive, high-efficiency image restoration with optimized hardware utilization.
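
Reduced to its core, omni-scale 2D scanning flattens a feature map along several directions so that a linear-complexity sequence model (RWKV, per the paper) can traverse the image in multiple orders. A minimal sketch; the averaging fusion rule is a stand-in assumption.

```python
import torch

def omni_scans(x):
    """Flatten a feature map (B, C, H, W) into four 1-D scan orders:
    row-major, reversed row-major, column-major, reversed column-major."""
    fwd = x.flatten(2)                    # row-major scan
    col = x.transpose(2, 3).flatten(2)    # column-major scan
    return [fwd, fwd.flip(-1), col, col.flip(-1)]

def fuse(scans, H, W):
    """Undo each scan and average the four views (stand-in fusion rule)."""
    fwd, bwd, col, colb = scans
    B, C, _ = fwd.shape
    outs = [
        fwd.reshape(B, C, H, W),
        bwd.flip(-1).reshape(B, C, H, W),
        col.reshape(B, C, W, H).transpose(2, 3),
        colb.flip(-1).reshape(B, C, W, H).transpose(2, 3),
    ]
    return torch.stack(outs).mean(0)

x = torch.randn(1, 8, 4, 4)
assert torch.allclose(fuse(omni_scans(x), 4, 4), x)  # round-trip sanity check
```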

[165] MADrive: Memory-Augmented Driving Scene Modeling

Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, Dmitry Baranchuk

Main category: cs.CV

TL;DR: MADrive is a memory-augmented reconstruction framework that enhances autonomous driving scene reconstruction by replacing observed vehicles with similar 3D assets from a large memory bank, enabling photorealistic synthesis of altered driving scenarios.

DetailsMotivation: Current 3D Gaussian splatting methods for autonomous driving scene reconstruction are limited to original observations and cannot support photorealistic synthesis of significantly altered or novel driving scenarios.

Method: Introduces MADrive with MAD-Cars dataset (~70K 360° car videos), a retrieval module that finds similar car instances, reconstructs 3D assets from video, and integrates them into target scenes through orientation alignment and relighting.

Result: The framework provides complete multi-view representations of vehicles, enabling photorealistic synthesis of substantially altered configurations as demonstrated in experiments.

Conclusion: MADrive extends existing scene reconstruction capabilities by leveraging external memory banks to enable realistic synthesis of modified driving scenarios beyond original observations.

Abstract: Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ~70K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/

[166] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Huy Le, Nhat Chung, Tung Kieu, Jingkang Yang, Ngan Le

Main category: cs.CV

TL;DR: UNO is a unified single-stage framework for both box-level and pixel-level Video Scene Graph Generation that uses extended slot attention and temporal consistency learning.

DetailsMotivation: Prior VidSGG approaches require separate architectures for different granularity levels (box-level vs pixel-level), leading to task-specific designs and multi-stage pipelines. The authors aim to create a unified framework that can handle both tasks efficiently.

Method: UNO uses extended slot attention to decompose visual features into object and relation slots. It incorporates object temporal consistency learning to maintain consistent object representations across frames without explicit tracking, and a dynamic triplet prediction module to link relation slots to object pairs for capturing evolving interactions.

Result: UNO achieves competitive performance on both box-level and pixel-level VidSGG benchmarks while offering improved efficiency through its unified, object-centric design.

Conclusion: The proposed UNO framework successfully unifies box-level and pixel-level VidSGG tasks in a single-stage architecture with minimal task-specific modifications, demonstrating that a unified object-centric approach can achieve competitive performance across different visual granularity levels.

Abstract: Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
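
The slot-attention backbone that UNO extends is standard; a compact version is sketched below (Locatello et al.'s formulation, without UNO's relation slots, temporal consistency learning, or triplet head).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Minimal slot attention; UNO extends the idea to separate object
    slots and relation slots over video features."""
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_sigma = nn.Parameter(torch.ones(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):                      # feats: (B, N, D)
        B, N, D = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_sigma * torch.randn(B, self.num_slots, D)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)  # slots compete
            attn = attn / attn.sum(-1, keepdim=True)                     # weighted mean
            updates = attn @ v
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots

slots = SlotAttention(num_slots=6, dim=64)(torch.randn(2, 196, 64))
print(slots.shape)  # torch.Size([2, 6, 64])
```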

[167] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

Zhenghui Zhao, Chen Wu, Xiangyong Cao, Di Wang, Hongruixuan Chen, Datao Tang, Liangpei Zhang, Zhuo Zheng

Main category: cs.CV

TL;DR: ChangeBridge is a conditional spatiotemporal image generation model for remote sensing that generates post-event scenes from pre-event images and multimodal event controls, handling both event-driven changes and cross-temporal variations through a drift-asynchronous diffusion bridge approach.

DetailsMotivation: Existing change generation methods only handle event-driven changes (like new buildings) but fail to model cross-temporal variations (like seasonal shifts). There's a need for a model that can generate both spatially and temporally coherent post-event scenes in remote sensing applications.

Method: ChangeBridge uses a drift-asynchronous diffusion bridge with three main modules: 1) Composed bridge initialization (starting diffusion from composed pre-event state instead of noise), 2) Asynchronous Drift Diffusion (using pixel-wise drift maps to assign different drift magnitudes to event and temporal evolution), and 3) Drift-Aware Denoising (embedding drift maps into denoising network for guided reconstruction).

Result: Experiments show ChangeBridge generates better cross-spatiotemporal aligned scenarios compared to state-of-the-art methods. It demonstrates great potential for land-use planning and as a data generation engine for change detection tasks.

Conclusion: ChangeBridge successfully addresses the limitations of existing methods by modeling both event-driven changes and cross-temporal variations through its novel drift-asynchronous diffusion bridge approach, enabling more realistic spatiotemporal image generation for remote sensing applications.

Abstract: Spatiotemporal image generation aims to generate future scenes conditioned on given observations. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event scenes that are both spatially and temporally coherent. The core idea is a drift-asynchronous diffusion bridge. Specifically, it consists of three main modules: a) Composed bridge initialization, which replaces noise initialization. It starts the diffusion from a composed pre-event state, modeling a diffusion bridge process. b) Asynchronous Drift Diffusion, which uses a pixel-wise drift map, assigning different drift magnitudes to event and temporal evolution. This enables differentiated generation during the pre-to-post transition. c) Drift-Aware Denoising, which embeds the drift map into the denoising network, guiding drift-aware reconstruction. Experiments show that ChangeBridge can generate better cross-spatiotemporal aligned scenarios compared to state-of-the-art methods. Additionally, ChangeBridge shows great potential for land-use planning and as a data generation engine for a series of change detection tasks.
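
Because the exact bridge update is not spelled out here, the sketch below only conveys the flavor of a pixel-wise drift map: each pixel's magnitude controls how fast that region moves from pre-event content toward the predicted post-event image. The update rule and all names are assumptions, not the paper's formulation.

```python
import torch

def asynchronous_drift_step(x_t, x0_pred, pre_event, drift_map, dt):
    """One illustrative bridge update with per-pixel drift magnitudes.

    x_t       : current state                   (B, C, H, W)
    x0_pred   : denoiser's post-event estimate  (B, C, H, W)
    pre_event : composed pre-event state the bridge starts from
    drift_map : per-pixel drift in [0, 1], (B, 1, H, W); large where an
                event occurs, small for slow change such as seasonal shift.
    """
    velocity = drift_map * (x0_pred - x_t) + (1 - drift_map) * (pre_event - x_t)
    return x_t + dt * velocity

x_t = torch.randn(1, 3, 64, 64)
x_next = asynchronous_drift_step(x_t, torch.randn_like(x_t),
                                 torch.randn_like(x_t),
                                 torch.rand(1, 1, 64, 64), dt=0.05)
```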

[168] Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning

Ricardo Cardoso, Plinio Moreno

Main category: cs.CV

TL;DR: Combines sparse point-cloud from depth images with RGB to estimate object mass, using synthetic data generation and outperforms existing benchmarks.

DetailsMotivation: Inertial mass is crucial for robotic applications like grasping and manipulation, but vision-based mass estimation is underexplored. Accurate mass estimation before interaction can enhance robotic task performance.

Method: Proposes combining sparse point-cloud data from depth images with RGB images. Uses synthetic dataset from ShapeNetSem 3D models with Kinect camera simulation. Trains image generation model for dense depth maps to augment existing mass-paired image dataset. Evaluates point-cloud processing architectures and RGB-only methods.

Result: Significantly outperforms existing benchmarks across all evaluated metrics.

Conclusion: The approach successfully combines vision sensors for mass estimation, with synthetic data generation overcoming training data limitations. All code and data generation tools are publicly available.

Abstract: Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object’s mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.

[169] Enhancing Object Discovery for Unsupervised Instance Segmentation and Object Detection

Xingyu Feng, Hebei Gao, Hong Li

Main category: cs.CV

TL;DR: COLER is a zero-shot unsupervised approach for instance segmentation and object detection that uses CutOnce to generate coarse pseudo labels and then trains a detector on them without complex loss functions.

DetailsMotivation: To create a simple yet effective unsupervised method for instance segmentation and object detection that doesn't rely on clustering methods or complex mask post-processing, opening new directions for Normalized Cut algorithms in multi-object segmentation.

Method: Uses CutOnce (applies Normalized Cut only once) to generate multiple object masks from self-supervised features, then trains a detector on these coarse pseudo labels with simple modules and self-training refinement.

Result: Achieves state-of-the-art performance on multiple benchmarks as a zero-shot unsupervised model, outperforming previous methods without requiring specially designed loss functions for pseudo labels.

Conclusion: COLER demonstrates a novel and effective approach for unsupervised object localization that advances the field by showing Normalized Cut can be effectively used for multi-object segmentation without clustering dependencies.

Abstract: We propose Cut-Once-and-LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses our developed CutOnce to generate coarse pseudo labels, then enables the detector to learn from these masks. CutOnce applies Normalized Cut (NCut) only once and does not rely on any clustering methods (e.g., K-Means), but it can generate multiple object masks in an image. Our work opens a new direction for the NCut algorithm in multi-object segmentation. We have designed several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of self-supervised models, but also free it from reliance on mask post-processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self-training. COLER is a zero-shot unsupervised model that outperforms previous state-of-the-art methods on multiple benchmarks. We believe our method can help advance the field of unsupervised object localization. Code is available at: https://github.com/Quantumcraft616/COLER.
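
Applying Normalized Cut once over self-supervised patch features follows a well-known recipe (e.g., TokenCut): build an affinity graph over patches, take the Fiedler vector of the normalized Laplacian, and threshold it. A minimal single-cut sketch, without CutOnce's multi-mask modules; the affinity threshold is an assumption.

```python
import numpy as np

def ncut_once(feats, tau=0.2):
    """One Normalized Cut over self-supervised patch features.

    feats: (N, D) patch embeddings, e.g., from a DINO ViT.
    Returns a boolean foreground assignment over the N patches.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = (f @ f.T > tau).astype(float)            # binarized cosine affinity
    d = W.sum(1)
    D_isqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    L = np.eye(len(W)) - D_isqrt @ W @ D_isqrt   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                         # second-smallest eigenvector
    return fiedler > fiedler.mean()              # bipartition of the patches

mask = ncut_once(np.random.randn(196, 384))
print(mask.sum(), "foreground patches")
```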

[170] Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

Yanghao Wang, Long Chen

Main category: cs.CV

TL;DR: NoOp: A noise optimization method for Diffusion Classifiers that learns matching “good noises” to address noise instability, replacing random sampling with optimized dataset-specific and image-specific noises.

DetailsMotivation: Existing Diffusion Classifiers suffer from noise instability - different random noises lead to significant performance variations, requiring hundreds of noise samples for stable results which severely reduces classification speed.

Method: Proposes NoOp with two principles: Frequency Matching (optimizes dataset-specific noise) and Spatial Matching (trains Meta-Network to output image-specific noise offset). Combines optimized noise and offset to replace random noise in DC.

Result: Extensive ablations on various datasets demonstrate effectiveness of NoOp in achieving stable classification performance without needing to ensemble hundreds of noise samples.

Conclusion: NoOp successfully addresses noise instability in Diffusion Classifiers by learning matching “good noises” through frequency and spatial matching principles, enabling stable and faster classification.

Abstract: Although today’s pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences of denoising effects with different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC and conclude that there are some "good noises" that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Based on both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, it optimizes one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that adopts an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset is used in DC to replace the random noise. Extensive ablations on various datasets demonstrate the effectiveness of NoOp.
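
The frequency-matching step can be pictured as ordinary gradient descent on a noise tensor against a frozen denoiser's epsilon-prediction error. A minimal sketch; the toy denoiser, the stand-in schedule value `a_t`, and the loss choice are assumptions for illustration.

```python
import torch

def optimize_dataset_noise(denoiser, images, t, steps=200, lr=1e-2):
    """Learn one dataset-specific noise that a frozen diffusion denoiser
    reconstructs well at timestep t (frequency-matching sketch)."""
    noise = torch.randn_like(images[0]).unsqueeze(0).requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    a_t = 0.5  # stand-in for the schedule's alpha_bar at timestep t
    for _ in range(steps):
        x_t = a_t ** 0.5 * images + (1 - a_t) ** 0.5 * noise  # broadcast noise
        loss = (denoiser(x_t, t) - noise).pow(2).mean()       # eps-prediction error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()

# Toy denoiser that ignores t; replace with a real pretrained model.
denoiser = lambda x, t: torch.zeros_like(x)
images = torch.randn(8, 3, 32, 32)
good_noise = optimize_dataset_noise(denoiser, images, t=500, steps=10)
```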

[171] Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

Andrew Lee, Ian Chuang, Dechen Gao, Kai Fukazawa, Iman Soltani

Main category: cs.CV

TL;DR: Gaze on the Prize introduces a learnable foveal attention mechanism for visual RL that uses return differences to guide attention toward task-relevant visual features, achieving 2.52x sample efficiency improvement.

DetailsMotivation: Visual RL agents waste exploration and computational resources on irrelevant pixels in high-dimensional image data, leading to sample-inefficient and unstable learning. Human visual foveation provides inspiration for focusing attention on task-relevant features.

Method: The framework augments visual RL with a learnable foveal attention mechanism (Gaze) guided by return differences (the Prize). It uses return-guided contrastive learning: similar visual representations are grouped into positives/negatives based on return differences, creating contrastive triplets that train the attention to distinguish features relevant to success vs failure.

Result: Achieves up to 2.52x improvement in sample efficiency and can solve challenging tasks from the ManiSkill3 benchmark that baseline methods fail to learn, without modifying the underlying algorithm or hyperparameters.

Conclusion: By leveraging return differences to guide attention toward task-relevant visual features through contrastive learning, the method significantly improves sample efficiency and enables solving previously unsolvable visual RL tasks.

Abstract: Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent’s experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.52x improvement in sample efficiency and can solve challenging tasks from the ManiSkill3 benchmark that the baseline fails to learn, without modifying the underlying algorithm or hyperparameters.
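
One way to realize return-guided triplet construction is sketched below: among visually similar embeddings, pairs with similar returns serve as anchor/positive and pairs with divergent returns serve as negatives. The thresholds and the simple first-match mining rule are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def return_guided_triplets(z, returns, sim_thresh=0.9, ret_gap=1.0):
    """Build contrastive triplets from visually similar states whose
    returns differ, then apply a standard triplet margin loss.

    z: (N, D) attended visual embeddings; returns: (N,) episode returns.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.T
    anchors, pos, neg = [], [], []
    for i in range(len(z)):
        similar = (sim[i] > sim_thresh).nonzero().flatten().tolist()
        close = [j for j in similar if j != i and abs(returns[j] - returns[i]) < ret_gap]
        far = [j for j in similar if abs(returns[j] - returns[i]) >= ret_gap]
        if close and far:
            anchors.append(i)
            pos.append(close[0])
            neg.append(far[0])
    if not anchors:
        return torch.tensor(0.0)  # no valid triplets in this batch
    return F.triplet_margin_loss(z[anchors], z[pos], z[neg], margin=0.5)

loss = return_guided_triplets(torch.randn(64, 32), torch.randn(64))
```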

[172] Advancing Weakly-Supervised Change Detection in Satellite Images via Adversarial Class Prompting

Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Cuiqun Chen, Zhuo Zheng, Bo Du, Liangpei Zhang

Main category: cs.CV

TL;DR: AdvCP addresses co-occurring noise in weakly-supervised change detection by using adversarial prompting to identify background variations misclassified as object changes, then rectifying them via online global prototypes.

DetailsMotivation: WSCD methods often misclassify background variations (light, weather, seasonal changes) as object changes due to weak image-level supervision, leading to co-occurring noise problems in complex remote-sensing scenarios.

Method: Two-phase approach: 1) Adversarial Prompt Mining - uses incorrect one-hot labels to activate erroneous feature mappings and identify background variations likely misclassified; 2) Adversarial Sample Rectification - integrates these samples via online global prototype built from exponentially weighted moving average of current and historical data.

Result: Significant performance improvements on ConvNet, Transformer, and SAM-based baselines; generalizable to other multi-class weakly-supervised dense prediction scenarios; no additional inference cost.

Conclusion: AdvCP effectively addresses co-occurring noise in WSCD by identifying and rectifying background variations misclassified as object changes, enhancing performance while maintaining efficiency and generalizability.

Abstract: Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem, including two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Our AdvCP can be seamlessly integrated into current WSCD methods without adding additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements. Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
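
The online global prototype is an exponentially weighted moving average over batch features. A minimal sketch, simplified to a single prototype vector; the momentum value is an assumption.

```python
import torch

class OnlineGlobalPrototype:
    """EMA over batches of feature vectors, of the kind used to rectify
    adversarially prompt-activated background pixels."""
    def __init__(self, dim, momentum=0.99):
        self.proto = torch.zeros(dim)
        self.momentum = momentum
        self.initialized = False

    @torch.no_grad()
    def update(self, feats):                 # feats: (N, D) pixel features
        batch_mean = feats.mean(0)
        if not self.initialized:
            self.proto, self.initialized = batch_mean.clone(), True
        else:
            self.proto = self.momentum * self.proto + (1 - self.momentum) * batch_mean
        return self.proto

proto = OnlineGlobalPrototype(dim=256)
for _ in range(3):
    proto.update(torch.randn(1024, 256))  # blend each batch into the prototype
```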

[173] MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning

Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo

Main category: cs.CV

TL;DR: Lightweight streaming image captioning model using compact 125M-parameter language component with multimodal self-refinement framework inspired by human visual processing.

DetailsMotivation: Existing MLLMs for streaming image captioning are computationally expensive, hindering practical applications like video chatbots and navigation robots. Need for lightweight alternatives.

Method: 1) Replace large language component in MLLMs with compact 125M-parameter model. 2) Propose multimodal self-refinement framework inspired by human visual processing: coarse-to-fine approach where model first generates global understanding, then refines using features from salient regions identified from previous caption.

Result: Compact model achieves comparable performance to MLLMs despite 93x size reduction. Self-refinement framework improves reliability and shows superiority in single-sentence and detailed captioning, extending to long-range video QA tasks.

Conclusion: Factual image captioning doesn’t require complex reasoning abilities of large LLMs. Lightweight models with proper refinement mechanisms can achieve strong performance while being computationally efficient for practical applications.

Abstract: Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application. This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model. Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.

[174] Fine-grained Defocus Blur Control for Generative Image Models

Ayush Shrivastava, Connelly Barnes, Xuaner Zhang, Lingzhi Zhang, Andrew Owens, Sohrab Amirghodsi, Eli Shechtman

Main category: cs.CV

TL;DR: A text-to-image diffusion framework that uses camera EXIF metadata to generate controllable lens blur effects by simulating physical image formation process.

DetailsMotivation: Current text-to-image diffusion models struggle to incorporate fine-grained camera metadata like precise aperture settings, limiting their ability to generate realistic lens blur effects based on actual photographic parameters.

Method: The method mimics physical image formation: 1) generates all-in-focus image, 2) estimates monocular depth, 3) predicts focus distance using novel focus distance transformer, 4) forms defocused image using differentiable lens blur model. Gradients flow backward through entire process for unsupervised learning.

Result: The model enables superior fine-grained control over defocus effects while preserving scene contents, achieving precise interactive user control over lens blur based on EXIF data that existing diffusion models cannot match.

Conclusion: The framework successfully integrates camera metadata into text-to-image generation, allowing realistic and controllable lens blur effects that mimic physical photography processes, overcoming limitations of current diffusion models.

Abstract: Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing us to learn without explicit supervision to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise interactive user control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
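
The differentiable lens-blur stage rests on the standard thin-lens circle-of-confusion relation, which maps per-pixel depth, focus distance, focal length, and f-number to a blur diameter. A small helper illustrating that formula (standard optics, not code from the paper):

```python
import numpy as np

def coc_diameter(depth, focus_dist, focal_len, f_number):
    """Thin-lens circle-of-confusion diameter, all units in meters:

        c = (f / N) * f * |d - d_f| / (d * (d_f - f))

    where d is scene depth, d_f the focus distance, f the focal length,
    and N the f-number (aperture diameter A = f / N).
    """
    aperture = focal_len / f_number
    return aperture * focal_len * np.abs(depth - focus_dist) / (
        depth * (focus_dist - focal_len))

depth = np.array([0.5, 1.0, 2.0, 8.0])   # per-pixel scene depth
c = coc_diameter(depth, focus_dist=1.0, focal_len=0.05, f_number=2.8)
print(c)  # zero at the focus plane, growing with defocus
```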

[175] MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression

Hai Dang Nguyen, Nguyen Dang Huy Pham, The Minh Duc Nguyen, Dac Thai Nguyen, Hang Thi Nguyen, Duong M. Nguyen

Main category: cs.CV

TL;DR: MMAP is a novel framework that improves spatial transcriptomics prediction from H&E images using multi-magnification patches for local detail and prototype embeddings for global context, outperforming existing methods.

DetailsMotivation: Predicting spatial gene expression from histological images is challenging due to the modality gap between visual features and molecular signals. Existing methods have insufficient local feature granularity and inadequate global spatial context coverage.

Method: MMAP (Multi-MAgnification and Prototype-enhanced architecture) uses multi-magnification patch representations to capture fine-grained local details and learns latent prototype embeddings to represent slide-level global context.

Result: MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).

Conclusion: The proposed MMAP framework successfully addresses both local granularity and global context limitations in spatial transcriptomics prediction, demonstrating superior performance over existing approaches.

Abstract: Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).

[176] Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model

Fei Kong

Main category: cs.CV

TL;DR: Generalized Denoising Diffusion Codebook Models (gDDCM) extend DDCM to work with modern continuous-time diffusion variants like Flow Matching and Consistency Models, improving sampling efficiency and tokenization quality.

DetailsMotivation: Existing DDCM only works with traditional discrete-time DDPM architecture, failing to adapt to modern continuous-time diffusion variants and suffering from inefficient sampling in high-noise regions.

Method: Proposes gDDCM with unified theoretical framework and “De-noise and Back-trace” sampling strategy that combines deterministic ODE denoising with residual-aligned noise injection, plus backtracking parameter p for enhanced tokenization.

Result: gDDCM achieves comprehensive compatibility with mainstream diffusion variants and significantly outperforms DDCM in reconstruction quality and perceptual fidelity on CIFAR10 and LSUN Bedroom datasets.

Conclusion: gDDCM successfully addresses DDCM’s limitations by enabling compatibility with modern diffusion architectures while improving tokenization performance and sampling efficiency.

Abstract: Denoising diffusion models have emerged as a dominant paradigm in image generation. Discretizing image data into tokens is a critical step for effectively integrating images with Transformer and other architectures. Although the Denoising Diffusion Codebook Models (DDCM) pioneered the use of pre-trained diffusion models for image tokenization, it strictly relies on the traditional discrete-time DDPM architecture. Consequently, it fails to adapt to modern continuous-time variants, such as Flow Matching and Consistency Models, and suffers from inefficient sampling in high-noise regions. To address these limitations, this paper proposes the Generalized Denoising Diffusion Codebook Models (gDDCM). We establish a unified theoretical framework and introduce a generic “De-noise and Back-trace” sampling strategy. By integrating a deterministic ODE denoising step with a residual-aligned noise injection step, our method resolves the challenge of adaptation. Furthermore, we introduce a backtracking parameter p and significantly enhance tokenization ability. Extensive experiments on CIFAR10 and LSUN Bedroom datasets demonstrate that gDDCM achieves comprehensive compatibility with mainstream diffusion variants and significantly outperforms DDCM in terms of reconstruction quality and perceptual fidelity.
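
In DDCM-style tokenization, each step selects from a fixed noise codebook the entry best aligned with the current residual, and the chosen indices serve as the image's tokens; gDDCM wraps this in a deterministic ODE denoising step plus back-tracing. Only the selection step is sketched below, under assumed shapes; the ODE step and the backtracking parameter p are omitted.

```python
import torch

def residual_aligned_noise(residual, codebook):
    """Pick the codebook noise best aligned with the current residual.

    residual: (D,) flattened target-minus-prediction residual.
    codebook: (K, D) fixed Gaussian noise codebook.
    Returns the chosen index (the image token) and the noise vector.
    """
    scores = codebook @ residual / residual.norm().clamp_min(1e-8)
    idx = scores.argmax()
    return idx, codebook[idx]

codebook = torch.randn(256, 1024)
residual = torch.randn(1024)
token, noise = residual_aligned_noise(residual, codebook)
```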

[177] Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net

Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu Duong

Main category: cs.CV

TL;DR: ATTBHFA-Net: A novel few-shot learning method using Bhattacharyya-Hellinger distance aggregation for disaster image classification, addressing data scarcity and high visual similarity challenges.

DetailsMotivation: Disaster image analysis is crucial for rescue operations but faces data scarcity and high intra-class variation/inter-class similarity. Current few-shot learning methods rely on generic datasets and struggle with disaster imagery's unique challenges.

Method: Proposes ATTBHFA-Net that combines Bhattacharyya coefficient (for inter-class separability) and Hellinger distance (for same-class alignment) to aggregate feature probability distributions. Uses a novel contrastive loss based on these distances alongside categorical cross-entropy.

Result: Superior performance on four FSL benchmarks and two disaster image datasets, demonstrating effective generalization and improved classification accuracy compared to existing approaches.

Conclusion: ATTBHFA-Net effectively addresses disaster image classification challenges through distribution-based feature aggregation, offering a robust solution for few-shot learning in disaster scenarios with limited data.

Abstract: The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to the limited and heterogeneous data that result from the difficulty of collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and the Hellinger distance to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
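
The two quantities at the heart of the method are textbook definitions: the Bhattacharyya coefficient BC(p, q) = sum_i sqrt(p_i * q_i) and the Hellinger distance H(p, q) = sqrt(1 - BC(p, q)). A small reference implementation (how ATTBHFA-Net combines them with attention is not reproduced here):

```python
import torch

def bhattacharyya_coefficient(p, q, eps=1e-8):
    """BC(p, q) = sum_i sqrt(p_i * q_i) for discrete distributions;
    1 for identical distributions, 0 for disjoint support."""
    return (p.clamp_min(eps) * q.clamp_min(eps)).sqrt().sum(-1)

def hellinger_distance(p, q, eps=1e-8):
    """H(p, q) = sqrt(1 - BC(p, q)), a proper metric in [0, 1]."""
    return (1 - bhattacharyya_coefficient(p, q, eps)).clamp_min(0).sqrt()

# Feature maps turned into distributions (e.g., softmax over spatial bins),
# then compared: BC as a separability margin, Hellinger for alignment.
p = torch.softmax(torch.randn(4, 128), dim=-1)
q = torch.softmax(torch.randn(4, 128), dim=-1)
print(bhattacharyya_coefficient(p, q), hellinger_distance(p, q))
```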

[178] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

Ying Cheng, Yu-Ho Lin, Min-Hung Chen, Fu-En Yang, Shang-Hong Lai

Main category: cs.CV

TL;DR: VADER is an LLM-driven framework for video anomaly understanding that integrates object relation features with visual cues to generate detailed, causally grounded descriptions and support anomaly-related question answering.

DetailsMotivation: Traditional video anomaly methods only detect and localize anomalies without deeper causal understanding. Existing approaches neglect object interactions and causal relationships critical for understanding anomalous behaviors.

Method: VADER uses: 1) Anomaly Scorer for per-frame anomaly scores, 2) Context-Aware Sampling (CAES) to capture causal context, 3) Relation Feature Extractor and Contrastive Relation Encoder (CORE) to model object interactions, 4) Integration of visual/relational cues with LLMs for reasoning.

Result: Experiments on multiple real-world VAU benchmarks show VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks.

Conclusion: VADER advances explainable video anomaly analysis by providing detailed, causally grounded interpretations of anomalous events through integrated visual and relational reasoning.

Abstract: Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.

[179] The Finer the Better: Towards Granular-aware Open-set Domain Generalization

Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen

Main category: cs.CV

TL;DR: SeeCLIP improves Open-Set Domain Generalization by addressing the dilemma between known-class structural risk and unknown-class open-space risk through semantic enhancement, duplex contrastive learning, and semantic-guided diffusion for hard negative generation.

DetailsMotivation: Existing methods for Open-Set Domain Generalization face a dilemma between structural risk from known classes and open-space risk from unknown classes, particularly struggling with "hard unknowns" that share fine-grained visual similarities with known classes, leading to over-confidence issues.

Method: Proposes Semantic-enhanced CLIP (SeeCLIP) with three key components: 1) Semantic-aware prompt enhancement module that decomposes images into discriminative semantic tokens for nuanced vision-language alignment; 2) Duplex contrastive learning with complementary objectives (repulsion from known classes and cohesion for semantic proximity); 3) Semantic-guided diffusion module that synthesizes pseudo-unknowns by perturbing extracted semantic tokens to generate challenging hard negatives.

Result: Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.

Conclusion: SeeCLIP effectively addresses the known-unknown dilemma in Open-Set Domain Generalization through semantic enhancement, enabling better handling of hard unknowns and achieving superior performance compared to existing methods.

Abstract: Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between the structural risk of known classes and the open-space risk of unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
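
A duplex objective of this kind can be written as two terms: a margin-based repulsion from known-class prompts and a cohesion term toward the image's semantic tokens. The weighting, margin, and similarity choices below are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def duplex_contrastive_loss(unknown_prompt, known_prompts, semantic_tokens,
                            margin=0.2, lam=1.0):
    """Repulsion keeps the unknown prompt separable from known-class
    prompts; cohesion keeps it near the image's semantic tokens."""
    u = F.normalize(unknown_prompt, dim=-1)   # (D,)
    k = F.normalize(known_prompts, dim=-1)    # (K, D)
    s = F.normalize(semantic_tokens, dim=-1)  # (T, D)
    repulsion = F.relu(k @ u - margin).mean() # push away from known classes
    cohesion = 1 - (s @ u).mean()             # stay semantically close
    return repulsion + lam * cohesion

loss = duplex_contrastive_loss(torch.randn(512), torch.randn(10, 512),
                               torch.randn(5, 512))
```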

[180] Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM

Joe Kileel, Oscar Mickelin, Amit Singer, Sheng Xu

Main category: cs.CV

TL;DR: MoDM framework uses second-order moments from two datasets (uniform and unknown non-uniform orientation distributions) to uniquely determine molecular structures via convex optimization, enabling accurate cryo-EM reconstruction from statistical moments.

DetailsMotivation: Cryo-EM reconstruction typically requires extensive data collection and complex algorithms. The authors aim to develop a more efficient approach that leverages dataset diversity and uses only second-order statistical moments to reconstruct molecular structures, reducing computational complexity while maintaining accuracy.

Method: Method of Double Moments (MoDM) framework that fuses data from two cryo-EM datasets: one with uniform orientation distribution and another with unknown non-uniform orientation distribution. Uses only second-order moments of projection images, proves uniqueness of reconstruction (up to global rotation/reflection), and implements a convex-relaxation-based algorithm for structure recovery.

Result: Theoretical proof that second-order moments from two distinct orientation distributions generically uniquely determine molecular structures. Practical demonstration that the convex-relaxation algorithm achieves accurate recovery using only second-order statistics, showing improved reconstruction quality through dataset diversity.

Conclusion: Collecting and modeling multiple datasets under different experimental conditions can substantially enhance cryo-EM reconstruction quality. The MoDM framework demonstrates that leveraging dataset diversity and using only second-order moments enables efficient and accurate molecular structure determination.

Abstract: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.

[181] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi

Main category: cs.CV

TL;DR: Two methods (MapReduce LoRA and RaTE) for multi-preference alignment that avoid alignment tax by training preference-specific experts and embeddings, achieving SOTA results across image, video, and language generation tasks.

DetailsMotivation: RLHF with reward models improves alignment to human preferences but suffers from alignment tax when optimizing multiple rewards - improving one dimension degrades others. Need methods to align models to multiple preferences without this trade-off.

Method: 1) MapReduce LoRA: Trains preference-specific LoRA experts in parallel, then iteratively merges them to refine a shared base model. 2) Reward-aware Token Embedding (RaTE): Learns reward-specific token embeddings that compose at inference for flexible preference control.

Result: Text-to-Image: 36.1%, 4.6%, 55.7% improvements on GenEval, PickScore, OCR for Stable Diffusion 3.5; 32.7%, 4.3%, 67.1% for FLUX.1-dev. Text-to-Video: 48.1% visual quality, 90.0% motion quality improvements. Language: 43.4% helpfulness, 136.7% harmlessness improvements with Llama-2 7B.

Conclusion: The framework sets new SOTA for multi-preference alignment across modalities, effectively addressing alignment tax by enabling joint optimization of multiple rewards without performance degradation.

Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task (Helpful Assistant, with Llama-2 7B), helpfulness and harmlessness improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
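
The reduce step amounts to folding the preference-specific LoRA deltas back into the shared base weight. A one-layer sketch; simple averaging of the deltas is an assumption here, standing in for the paper's actual merge schedule.

```python
import torch

def merge_lora_experts(base_weight, experts, alpha=1.0):
    """Fold the average of several LoRA deltas (B_i @ A_i) into a shared
    base weight; the paper alternates map (train experts) and reduce
    (merge) phases, of which one merge is shown."""
    delta = torch.stack([B @ A for A, B in experts]).mean(0)
    return base_weight + alpha * delta

d_out, d_in, r = 64, 64, 4
base = torch.randn(d_out, d_in)
experts = [(torch.randn(r, d_in) * 0.01, torch.randn(d_out, r) * 0.01)
           for _ in range(3)]               # one (A, B) pair per preference
merged = merge_lora_experts(base, experts)
```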

[182] Wukong’s 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

Minghao Yin, Yukang Cao, Kai Han

Main category: cs.CV

TL;DR: WUKONG is a training-free framework for high-fidelity textured 3D morphing using flow-based transformers, achieving smooth shape transitions and faithful texture preservation without manual correspondence matching.

DetailsMotivation: Conventional 3D morphing methods rely on manual correspondence matching and deformation trajectory estimation, which limits generalization and requires costly preprocessing. There's a need for a more efficient, high-fidelity approach that can handle diverse geometry and texture variations.

Method: WUKONG leverages flow-based transformers’ generative prior for 3D transitions. It formulates morphing as an optimal transport barycenter problem for smooth shape transitions, uses sequential initialization to prevent geometric distortions, and employs similarity-guided semantic consistency for texture preservation with selective high-frequency detail retention.

Result: Extensive evaluations show WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations with high-fidelity transitions and rich texture details.

Conclusion: WUKONG provides an effective training-free framework for high-quality 3D morphing that overcomes limitations of conventional methods, offering better generalization, reduced preprocessing costs, and superior fidelity in both shape and texture transitions.

Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods – which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) – WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

[183] Defense That Attacks: How Robust Models Become Better Attackers

Mohamed Awad, Mahmoud Akrm, Walid Gomaa

Main category: cs.CV

TL;DR: Adversarial training paradoxically increases transferability of adversarial attacks, creating new ecosystem risks.

DetailsMotivation: While adversarial training improves model robustness, its effect on attack transferability is underexplored. The paper investigates whether adversarial training unintentionally makes adversarial examples more transferable between models.

Method: Trained 36 diverse models (CNNs and ViTs) and conducted comprehensive transferability experiments to test whether adversarially trained models produce more transferable adversarial examples.
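
A transferability experiment of this kind typically crafts perturbations on a source model and measures how often they fool an unseen target. Below is a minimal PGD-based sketch; the attack hyperparameters and evaluation loop are generic assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard L-infinity PGD: craft perturbations on the source model."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def transfer_rate(source, target, loader):
    """Fraction of source-crafted adversarial examples that fool the target."""
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = pgd_attack(source, x, y)
        with torch.no_grad():
            pred = target(x_adv).argmax(dim=1)
        fooled += (pred != y).sum().item()
        total += y.numel()
    return fooled / total
```

The paper's paradox would show up here as `transfer_rate(at_model, target, ...)` exceeding `transfer_rate(standard_model, target, ...)` across the model zoo.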

Result: Revealed a clear paradox: adversarially trained models produce perturbations that transfer more effectively than those from standard models, introducing new ecosystem risks.

Conclusion: Robustness evaluations should assess both a model’s resistance to transferred attacks and its propensity to produce transferable adversarial examples. All models, code, and experimental scripts are released for reproducibility.

Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, introducing a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.

[184] Equivariant symmetry-aware head pose estimation for fetal MRI

Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Polina Golland, Benjamin Billot

Main category: cs.CV

TL;DR: E(3)-Pose is a novel pose estimation method that jointly models rotation equivariance and object symmetry for robust fetal head pose estimation in MRI scans, enabling automatic adaptive prescription of 2D diagnostic slices.

DetailsMotivation: The paper addresses the challenging problem of accounting for fetal head motion during diagnostic MRI scans. The goal is to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by rapidly acquired 3D MRI volumes before each 2D slice.

Method: E(3)-Pose explicitly models rotation equivariance and object symmetry by construction. It captures anatomical symmetries and rigid pose equivariance to yield robust estimates of fetal head pose, addressing pose ambiguities induced by inherent anatomical symmetries.
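
One standard way to handle symmetry-induced pose ambiguity is to score a predicted rotation against every symmetry-equivalent ground truth and keep the closest. The sketch below shows that idea; the symmetry group and loss form are illustrative, not necessarily E(3)-Pose's construction.

```python
import torch

def geodesic_dist(R1, R2):
    """Angle between two batched rotation matrices, in radians."""
    cos = (torch.einsum('bii->b', R1.transpose(-1, -2) @ R2) - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0, 1.0))

def symmetry_aware_loss(R_pred, R_gt, sym_group):
    """Score the prediction against every symmetry-equivalent ground truth
    and keep the closest, so symmetric poses are not spuriously penalized."""
    dists = torch.stack([geodesic_dist(R_pred, R_gt @ S) for S in sym_group])
    return dists.min(dim=0).values.mean()

# Toy usage: approximate left-right symmetry by a 180-degree flip about z.
flip = torch.diag(torch.tensor([-1.0, -1.0, 1.0]))  # det = +1, a rotation
sym_group = [torch.eye(3), flip]
R_pred = torch.eye(3).unsqueeze(0)
loss = symmetry_aware_loss(R_pred, flip.unsqueeze(0), sym_group)  # -> 0
```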

Result: Experiments on publicly available and representative clinical fetal MRI datasets demonstrate superior robustness and generalization across domains. The method achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation.

Conclusion: E(3)-Pose provides a robust solution for fetal head pose estimation in MRI by explicitly modeling rotation equivariance and object symmetry, overcoming limitations of existing methods that struggle with pose ambiguities from anatomical symmetries and clinical imaging challenges.

Abstract: We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.

[185] InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang

Main category: cs.CV

TL;DR: InterAgent is the first end-to-end framework for text-driven physics-based multi-agent humanoid control that generates coherent, physically plausible social behaviors from text prompts.

DetailsMotivation: Existing humanoid control methods are largely confined to single-agent scenarios and overlook physically plausible interplay essential for multi-agent interactions, creating a gap in modeling complex social coordination.

Method: Proposes an autoregressive diffusion transformer with multi-stream blocks that decouples proprioception, exteroception, and action to mitigate cross-modal interference. Introduces an interaction graph exteroception representation capturing joint-to-joint spatial dependencies, and a sparse edge-based attention mechanism that dynamically prunes redundant connections while emphasizing critical inter-agent relations.
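
The sparse edge-based attention idea can be sketched as ordinary attention whose score matrix is pruned to each query's strongest edges before the softmax. The shapes and top-k pruning rule here are assumptions for illustration, not the paper's exact layer.

```python
import torch

def sparse_edge_attention(q, k, v, keep_ratio=0.25):
    """Attention over joint-to-joint edges that keeps only the strongest
    connections: scores below each query's top-k cutoff are masked out
    before the softmax, pruning redundant inter-agent edges.
    q, k, v: (batch, num_joints, dim)."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    k_keep = max(1, int(keep_ratio * scores.shape[-1]))
    cutoff = scores.topk(k_keep, dim=-1).values[..., -1:]  # per-query cutoff
    scores = scores.masked_fill(scores < cutoff, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

# Toy sizes: two agents' joints stacked into one token sequence.
q = k = v = torch.randn(2, 24, 64)
out = sparse_edge_attention(q, k, v)
```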

Result: InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance in generating coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts.

Conclusion: InterAgent successfully bridges the gap in physics-based multi-agent humanoid control, enabling realistic social interactions from text descriptions, with code and data to be released for future research.

Abstract: Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.

[186] MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer

Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, Yue Ma

Main category: cs.CV

TL;DR: MultiMotion: A unified framework using Mask-aware Attention Motion Flow (AMF) with SAM2 masks to disentangle and control motion for multiple objects in Diffusion Transformers, plus RectPC solver for efficient sampling and a new benchmark dataset.

DetailsMotivation: Multi-object video motion transfer is challenging for DiT architectures due to motion entanglement and lack of object-level control. Existing methods struggle with precise, independent motion control for multiple distinct objects in videos.

Method: 1) Mask-aware Attention Motion Flow (AMF) uses SAM2 masks to explicitly disentangle and control motion features for multiple objects within DiT pipeline. 2) RectPC - a high-order predictor-corrector solver for efficient and accurate sampling, especially for multi-entity generation. 3) Created first benchmark dataset for DiT-based multi-object motion transfer evaluation.
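
A much-simplified picture of mask-aware motion control: per-object motion features are written into the latent only inside that object's SAM2 mask, keeping each object's motion disentangled from the others. The blending rule below is an illustrative stand-in for the paper's attention-level mechanism.

```python
import torch

def mask_aware_motion_inject(features, motion_feats, masks):
    """Blend per-object motion features into generation latents using
    object masks (e.g., from SAM2).
    features:     (b, c, h, w) generation latents
    motion_feats: list of (b, c, h, w), one motion stream per object
    masks:        list of (b, 1, h, w) binary object masks
    """
    out = features.clone()
    for m_feat, mask in zip(motion_feats, masks):
        out = out * (1 - mask) + m_feat * mask  # write only inside the mask
    return out

b, c, h, w = 1, 8, 16, 16
feats = torch.randn(b, c, h, w)
masks = [torch.zeros(b, 1, h, w), torch.zeros(b, 1, h, w)]
masks[0][..., :8] = 1.0   # object 1 occupies the left half
masks[1][..., 8:] = 1.0   # object 2 the right half
mixed = mask_aware_motion_inject(feats, [torch.randn_like(feats)] * 2, masks)
```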

Result: MultiMotion achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects while maintaining DiT’s high quality and scalability. The framework overcomes motion entanglement limitations.

Conclusion: MultiMotion provides a unified solution for multi-object video motion transfer in DiT architectures, enabling explicit object-level motion control through mask-aware attention mechanisms and efficient sampling, with demonstrated effectiveness on a new benchmark dataset.

Abstract: Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT’s high quality and scalability. The code is provided in the supplementary material.

[187] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani

Main category: cs.CV

TL;DR: Large-scale automated framework discovers and explains visual concept representations across human cortex using unsupervised pattern discovery and automated explanation generation.

DetailsMotivation: Understanding how the human brain represents visual concepts and where these representations are encoded remains challenging due to brain signal complexity and vast concept space, with most studies being small-scale, manual, and lacking systematic validation.

Method: Two-stage framework: 1) Discover candidate interpretable patterns in fMRI activity through unsupervised data-driven decomposition methods; 2) Explain each pattern by identifying natural images that most strongly elicit it and generating natural-language descriptions. Automated pipeline tests multiple candidate explanations, assigns reliability scores, and selects most consistent descriptions.
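
As one concrete instance of such an unsupervised decomposition, non-negative matrix factorization over a voxel-by-stimulus response matrix yields candidate patterns whose top-activating images can then be captioned. The shapes and component count below are hypothetical, and NMF is only one of several decompositions the pipeline could use.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical shapes: responses of 5000 voxels to 1200 natural images.
responses = np.abs(np.random.randn(5000, 1200))

# Stage 1: unsupervised decomposition into candidate voxel patterns.
model = NMF(n_components=20, init='nndsvd', max_iter=500)
voxel_patterns = model.fit_transform(responses)   # (voxels, patterns)
image_loadings = model.components_                # (patterns, images)

# Stage 2: for each pattern, pull the images that drive it hardest; these
# would then be described in natural language and scored for reliability.
top_images_per_pattern = np.argsort(-image_loadings, axis=1)[:, :16]
```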

Result: Framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

Conclusion: The automated large-scale framework enables systematic discovery and explanation of visual representations across human cortex, overcoming limitations of traditional small-scale, manual approaches.

Abstract: Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

[188] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata

Yihao Liu, Chenyu Gao, Lianrui Zuo, Michael E. Kim, Brian D. Boyd, Lisa L. Barnes, Walter A. Kukull, Lori L. Beason-Held, Susan M. Resnick, Timothy J. Hohman, Warren D. Taylor, Bennett A. Landman

Main category: cs.CV

TL;DR: MetaVoxel is a joint diffusion modeling framework that learns a single diffusion process spanning both imaging data and clinical metadata, enabling flexible zero-shot inference across multiple medical AI tasks without task-specific retraining.

DetailsMotivation: Current deep learning methods for medical imaging typically train separate conditional models for specific predictive tasks (disease classification, biomarker estimation, image generation). This approach is limited as it requires multiple specialized models and lacks flexibility for arbitrary inference tasks.

Method: MetaVoxel learns a joint diffusion process over both imaging data and clinical metadata, modeling the complete joint distribution rather than conditional distributions. This single diffusion framework spans all variables (images and metadata), enabling flexible zero-shot inference using arbitrary subsets of inputs without retraining.
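
A standard way to get zero-shot conditioning out of a joint diffusion model is replacement-style sampling: at every reverse step the observed variables are clamped to their forward-noised values while the rest are generated. Whether MetaVoxel uses exactly this mechanism is not stated in the summary; the denoiser below is a dummy stand-in.

```python
import torch

def joint_sample(denoise, shape, observed, mask, steps=50):
    """Replacement-style conditional sampling from a joint diffusion over
    [image latent, metadata]: entries with mask=1 are clamped to their
    re-noised observed values at every step, the rest are generated.
    `denoise(x, t)` stands in for the trained joint denoiser (hypothetical).
    """
    x = torch.randn(shape)
    for t in torch.linspace(1.0, 0.0, steps):
        x = denoise(x, t)                                  # refine unknowns
        x_obs = observed + t * torch.randn_like(observed)  # re-noise knowns
        x = torch.where(mask.bool(), x_obs, x)             # clamp knowns
    return x

# Toy joint state: 64 latent dims + 2 metadata slots (e.g., age, sex).
mask = torch.zeros(66); mask[64:] = 1            # condition on metadata only
observed = torch.zeros(66); observed[64] = 0.7   # e.g., normalized age
sample = joint_sample(lambda x, t: 0.5 * x, (66,), observed, mask)
```

Swapping which entries the mask marks as observed turns the same model into an image generator, an age estimator, or a sex predictor, which is the flexibility the paper highlights.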

Result: Using over 10,000 T1-weighted MRI scans with clinical metadata from nine datasets, MetaVoxel performs image generation, age estimation, and sex prediction with performance comparable to established task-specific baselines. Additional experiments demonstrate its flexible inference capabilities.

Conclusion: Joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability by supporting flexible zero-shot inference across multiple tasks with a single model.

Abstract: Modern deep learning methods have achieved impressive results across tasks from disease classification, estimating continuous biomarkers, to generating realistic medical images. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.

[189] An Efficient Test-Time Scaling Approach for Image Generation

Vignesh Sundaresha, Akash Haridas, Vikram Appia, Lav R. Varshney

Main category: cs.CV

TL;DR: The paper proposes Verifier-Threshold method for efficient test-time compute allocation in diffusion/flow image generation models, achieving 2-4x speedup over state-of-the-art methods.

DetailsMotivation: Current methods for allocating non-uniform inference-compute budgets across denoising steps in diffusion/flow models use greedy algorithms that allocate compute budget ineffectively, leading to suboptimal efficiency.

Method: Verifier-Threshold method that automatically reallocates test-time compute across different denoising steps, improving computational efficiency in image generation models.
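
In spirit, a verifier-threshold scheme maintains several noise candidates and stops spending denoising steps on any candidate whose verifier score drops below a cutoff, rather than splitting the budget uniformly or greedily. `step_fn` and `verifier` below are hypothetical stand-ins, not the paper's components.

```python
import torch

def verifier_threshold_sample(step_fn, verifier, n_init=8, steps=30, tau=0.4):
    """Keep a pool of noise candidates; after each denoising step, drop any
    candidate whose verifier score falls below `tau`, reallocating compute
    to promising samples."""
    candidates = [torch.randn(4, 8, 8) for _ in range(n_init)]
    for i in range(steps):
        t = 1.0 - i / steps
        candidates = [step_fn(x, t) for x in candidates]
        scores = [verifier(x) for x in candidates]
        kept = [x for x, s in zip(candidates, scores) if s >= tau]
        # never empty the pool: fall back to the single best candidate
        candidates = kept or [candidates[max(range(len(scores)),
                                             key=scores.__getitem__)]]
    return max(candidates, key=verifier)

# Dummy stand-ins: a contraction as "denoiser", sigmoid magnitude as score.
out = verifier_threshold_sample(lambda x, t: 0.9 * x,
                                lambda x: torch.sigmoid(-x.abs().mean()).item())
```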

Result: Achieves 2-4x reduction in computational time over state-of-the-art methods while maintaining the same performance on the GenEval benchmark.

Conclusion: The proposed method effectively solves the inefficient compute allocation problem in diffusion/flow models, delivering substantial efficiency improvements for image generation tasks.

Abstract: Image generation has emerged as a mainstream application of large generative AI models. Just as test-time compute and reasoning have helped language models improve their capabilities, similar benefits have also been observed with image generation models. In particular, searching over noise samples for diffusion and flow models has shown to scale well with test-time compute. While recent works have explored allocating non-uniform inference-compute budgets across different denoising steps, they rely on greedy algorithms and allocate the compute budget ineffectively. In this work, we study this problem and propose solutions to fix it. We propose the Verifier-Threshold method which automatically reallocates test-time compute and delivers substantial efficiency improvements. For the same performance on the GenEval benchmark, we achieve a 2-4x reduction in computational time over the state-of-the-art method.

[190] Unconsciously Forget: Mitigating Memorization; Without Knowing What is being Memorized

Er Jin, Yang Zhang, Yongli Mou, Yanfei Dong, Stefan Decker, Kenji Kawaguchi, Johannes Stegmaier

Main category: cs.CV

TL;DR: UniForget uses model pruning to suppress copyrighted content generation in diffusion models without targeting specific concepts, preserving general generative capabilities while being complementary to existing unlearning methods.

DetailsMotivation: Generated images often resemble training data, leading to copyright infringement, portrait rights violations, and trademark issues. Existing methods have computational overhead or limited scalability, focusing only on specific target concepts.

Method: Identifies specific model parts responsible for copyrighted content generation and applies model pruning to suppress probability of generating copyrighted content without targeting specific concepts.
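
A generic version of localize-then-prune: rank weights by a saliency score on a probe loss over memorized generations and zero out the top fraction. This is a standard attribution-pruning sketch, not UniForget's exact localization procedure.

```python
import torch

def prune_by_attribution(model, probe_loss, prune_frac=0.001):
    """Zero out the small set of weights most responsible for a behavior,
    ranked by |weight * gradient| on a probe loss (e.g., a loss that is low
    when the model reproduces memorized content)."""
    loss = probe_loss(model)
    model.zero_grad()
    loss.backward()
    scores = torch.cat([
        (p * p.grad).abs().flatten()
        for p in model.parameters() if p.grad is not None
    ])
    k = max(1, int(prune_frac * scores.numel()))
    threshold = scores.topk(k).values.min()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p[(p * p.grad).abs() >= threshold] = 0.0  # prune in place
```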

Result: Effectively reduces memorization of copyrighted content while preserving general generative capabilities. The approach is orthogonal and complementary to existing unlearning methods.

Conclusion: UniForget offers a scalable solution to memorization problems in generative models through targeted pruning, providing a foundation for improving current unlearning and de-memorization techniques.

Abstract: Recent advances in generative models have demonstrated an exceptional ability to produce highly realistic images. However, previous studies show that generated images often resemble the training data, and this problem becomes more severe as the model size increases. Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Existing approaches to mitigating memorization mainly focus on manipulating the denoising sampling process to steer image embeddings away from the memorized embedding space or employ unlearning methods that require training on datasets containing specific sets of memorized concepts. However, existing methods often incur substantial computational overhead during sampling, or focus narrowly on removing one or more groups of target concepts, imposing a significant limitation on their scalability. To understand and mitigate these problems, our work, UniForget, offers a new perspective on understanding the root cause of memorization. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts while preserving the general generative capabilities of the model. Additionally, we show that our approach is both orthogonal and complementary to existing unlearning methods, thereby highlighting its potential to improve current unlearning and de-memorization techniques.

[191] Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Cheng Min, Yu Hu

Main category: cs.CV

TL;DR: MaGRoad introduces a path-centric framework for robust off-road road extraction, addressing limitations of node-centric methods, and releases WildRoad dataset for off-road environments.

DetailsMotivation: Current deep learning models for vectorized road extraction fail in off-road environments due to domain gaps, lack of large-scale vectorized datasets, and structural weaknesses in node-centric methods that are fragile to occlusions and ambiguous junctions.

Method: Two complementary approaches: 1) Release WildRoad dataset with interactive annotation tool for efficient road-network labeling, 2) Introduce MaGRoad (Mask-aware Geodesic Road network extractor) - a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly.
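
The path-centric idea can be sketched as scoring a candidate polyline by the mask evidence sampled along its whole length rather than at its two endpoints. Array shapes and the mean/min aggregation are illustrative choices, not MaGRoad's exact scoring rule.

```python
import numpy as np

def path_connectivity_score(road_prob, path, samples_per_edge=16):
    """Score a candidate path by aggregating road-mask evidence along it.
    road_prob: (H, W) per-pixel road probability map
    path: list of (row, col) vertices forming a polyline
    """
    scores = []
    for (r0, c0), (r1, c1) in zip(path[:-1], path[1:]):
        for t in np.linspace(0.0, 1.0, samples_per_edge):
            r = int(round(r0 + t * (r1 - r0)))
            c = int(round(c0 + t * (c1 - c0)))
            scores.append(road_prob[r, c])
    # report both: the min exposes a single occluded gap that the mean
    # would otherwise paper over with high confidence elsewhere
    return float(np.mean(scores)), float(np.min(scores))

prob = np.random.rand(256, 256)
mean_s, min_s = path_connectivity_score(prob, [(10, 10), (50, 80), (120, 200)])
```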

Result: MaGRoad achieves state-of-the-art performance on the challenging WildRoad benchmark while generalizing well to urban datasets, with roughly 2.5x faster inference speed through a streamlined pipeline.

Conclusion: The combination of the WildRoad dataset and path-centric paradigm provides a stronger foundation for mapping roads in wild terrains, addressing previous limitations in off-road road extraction.

Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild. We release both the dataset and code at https://github.com/xiaofei-guan/MaGRoad.

[192] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

Main category: cs.CV

TL;DR: MiSI-Bench is a benchmark for evaluating Vision-Language Models on microscopic spatial intelligence, revealing current VLMs lag behind humans but fine-tuned models show promise in basic spatial tasks while struggling with scientific domain knowledge.

DetailsMotivation: The paper addresses the need to assess Vision-Language Models' capability in microscopic spatial intelligence - the ability to perceive and reason about spatial relationships of invisible microscopic entities, which is crucial for scientific discovery but currently lacks systematic evaluation.

Method: Proposed MiSI-Bench benchmark framework with over 163,000 question-answer pairs and 587,000 images derived from ~4,000 molecular structures, covering nine complementary tasks evaluating abilities from elementary spatial transformations to complex relational identifications.

Result: Current state-of-the-art VLMs perform significantly below human level on the benchmark. However, a fine-tuned 7B model shows substantial potential, surpassing humans in spatial transformation tasks but performing poorly in scientifically-grounded tasks like hydrogen bond recognition.

Conclusion: The benchmark reveals VLMs’ current limitations in microscopic spatial intelligence and demonstrates that while fine-tuned models can excel at basic spatial tasks, integrating explicit domain knowledge is essential for progress toward scientific AGI.

Abstract: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.

cs.AI

[193] CORL: Reinforcement Learning of MILP Policies Solved via Branch and Bound

Akhil S Anand, Elias Aarekol, Martin Mziray Dalseg, Magnus Stalhane, Sebastien Gros

Main category: cs.AI

TL;DR: CORL framework uses reinforcement learning to fine-tune MILP models end-to-end for better real-world operational performance, treating MILP solving as a differentiable stochastic policy.

DetailsMotivation: Traditional MILP models for combinatorial sequential decision making often perform suboptimally in real-world applications due to difficulty in accurately modeling stochastic problems. Existing machine learning approaches rely on supervised learning with access to true optimal decisions and use surrogate gradients.

Method: CORL framework casts MILP solved by branch-and-bound as a differentiable stochastic policy compatible with reinforcement learning, enabling end-to-end fine-tuning on real-world data to maximize operational performance.
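
A toy version of the casting, with a knapsack MILP (solved by the branch-and-bound-based solver behind SciPy's `milp`) standing in for the real problem: the MILP's objective coefficients become the parameters of a stochastic policy and are updated with a score-function (REINFORCE-style) estimate of the realized reward. This illustrates the idea only, not the paper's CORL algorithm.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_knapsack(costs, weights, capacity):
    """Toy MILP (a knapsack), solved by branch and bound inside `milp`."""
    res = milp(c=-costs,  # milp minimizes, so negate to maximize
               constraints=LinearConstraint(weights[None, :], 0, capacity),
               integrality=np.ones(len(costs)), bounds=Bounds(0, 1))
    return np.round(res.x)

rng = np.random.default_rng(0)
weights, capacity = rng.uniform(1, 5, size=8), 10.0
true_values = rng.uniform(0, 1, size=8)   # real-world reward, unknown to MILP
theta = np.zeros(8)                        # learnable MILP objective params
sigma, lr, baseline = 0.1, 0.02, 0.0

for _ in range(200):
    eps = rng.normal(0.0, sigma, size=8)        # stochastic MILP policy
    decision = solve_knapsack(theta + eps, weights, capacity)
    reward = float(true_values @ decision)      # realized performance
    theta += lr * (reward - baseline) * eps / sigma**2  # REINFORCE step
    baseline = 0.9 * baseline + 0.1 * reward    # variance-reducing baseline
```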

Result: The method is validated in a simple illustrative combinatorial sequential decision making example as a proof of concept.

Conclusion: CORL provides a novel approach to optimize MILP models directly for decision quality using reinforcement learning rather than relying on supervised learning with surrogate gradients.

Abstract: Combinatorial sequential decision making problems are typically modeled as mixed integer linear programs (MILPs) and solved via branch and bound (B&B) algorithms. The inherent difficulty of modeling MILPs that accurately represent stochastic real-world problems leads to suboptimal performance in the real world. Recently, machine learning methods have been applied to build MILP models for decision quality rather than how accurately they model the real-world problem. However, these approaches typically rely on supervised learning, assume access to true optimal decisions, and use surrogates for the MILP gradients. In this work, we introduce a proof-of-concept CORL framework that fine-tunes an MILP scheme end to end using reinforcement learning (RL) on real-world data to maximize its operational performance. We enable this by casting an MILP solved by B&B as a differentiable stochastic policy compatible with RL. We validate the CORL method in a simple illustrative combinatorial sequential decision making example.

[194] Deep Learning–Accelerated Multi-Start Large Neighborhood Search for Real-time Freight Bundling

Haohui Zhang, Wouter van Heeswijk, Xinyu Hu, Neil Yorke-Smith, Martijn Mes

Main category: cs.AI

TL;DR: A hybrid learning-search pipeline combining Transformer-based constructive policy with Multi-Start Large Neighborhood Search solves online freight bundling as m1-PDSTSP with <2% optimality gap.

DetailsMotivation: Online freight exchanges need efficient combinatorial bundling of transportation jobs under sub-second latency, but current methods struggle with coupling bundle selection with pickup-and-delivery routing.

Method: Learning-accelerated hybrid search pipeline: Transformer Neural Network constructive policy + Multi-Start Large Neighborhood Search metaheuristic within rolling-horizon scheme, repeatedly freezing marketplace snapshots.
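
Structurally, the search stage looks like the loop below: several learned seeds, each improved by destroy-and-repair moves, with the best solution kept across starts. The toy instance and operators are placeholders for the actual m1-PDSTSP machinery.

```python
import random

def multi_start_lns(seeds, destroy, repair, cost, iters=200):
    """LNS from several seeds (e.g., sampled from a learned constructor);
    each start repeatedly destroys part of its solution and repairs it,
    accepting improving moves; the best solution across starts wins."""
    best = min(seeds, key=cost)
    for seed in seeds:
        current = seed
        for _ in range(iters):
            candidate = repair(destroy(current))
            if cost(candidate) < cost(current):
                current = candidate
        if cost(current) < cost(best):
            best = current
    return best

# Toy instance: order numbers to minimize total adjacent distance,
# a 1-D stand-in for tour length in a routing problem.
cost = lambda s: sum(abs(a - b) for a, b in zip(s, s[1:]))
def destroy(s, k=4):                      # mark k random positions for rework
    return s, sorted(random.sample(range(len(s)), k))
def repair(state):                        # reinsert the removed values sorted
    s, idx = state
    out = list(s)
    for i, v in zip(idx, sorted(s[i] for i in idx)):
        out[i] = v
    return out

seeds = [random.sample(range(12), 12) for _ in range(5)]
best = multi_start_lns(seeds, destroy, repair, cost)
```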

Result: Outperforms state-of-the-art neural combinatorial optimization and metaheuristic baselines with <2% optimality gap in total revenue relative to best exact baseline, with comparable time.

Conclusion: First work showing Deep Neural Network-based constructors can reliably provide high-quality seeds for multi-start improvement heuristics, applicable to broad class of selective TSP and pickup-delivery problems.

Abstract: Online Freight Exchange Systems (OFEX) play a crucial role in modern freight logistics by facilitating real-time matching between shippers and carriers. However, efficient combinatorial bundling of transportation jobs remains a bottleneck. We model the OFEX combinatorial bundling problem as a multi-commodity one-to-one pickup-and-delivery selective traveling salesperson problem (m1-PDSTSP), which optimizes revenue-driven freight bundling under capacity, precedence, and route-length constraints. The key challenge is to couple combinatorial bundle selection with pickup-and-delivery routing under sub-second latency. We propose a learning-accelerated hybrid search pipeline that pairs a Transformer Neural Network-based constructive policy with an innovative Multi-Start Large Neighborhood Search (MSLNS) metaheuristic within a rolling-horizon scheme in which the platform repeatedly freezes the current marketplace into a static snapshot and solves it under a short time budget. This pairing leverages the low-latency, high-quality inference of the learning-based constructor alongside the robustness of improvement search; the multi-start design and plausible seeds help LNS to explore the solution space more efficiently. Across benchmarks, our method outperforms state-of-the-art neural combinatorial optimization and metaheuristic baselines in solution quality with comparable time, achieving an optimality gap of less than 2% in total revenue relative to the best available exact baseline method. To our knowledge, this is the first work to establish that a Deep Neural Network-based constructor can reliably provide high-quality seeds for (multi-start) improvement heuristics, with applicability beyond the m1-PDSTSP to a broad class of selective traveling salesperson problems and pickup and delivery problems.

[195] FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

Dongwon Jung, Peng Shi, Yi Zhang

Main category: cs.AI

TL;DR: FutureWeaver: A framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets, using modularized collaboration and dual-level planning.

DetailsMotivation: While test-time computation scaling improves single-agent LLM performance, there's no principled way to allocate compute across multiple agents in collaborative systems or distribute compute under budget constraints for multi-agent interactions.

Method: FutureWeaver introduces modularized collaboration (callable functions encapsulating reusable multi-agent workflows) derived through self-play reflection, and employs a dual-level planning architecture that optimizes compute allocation by reasoning over current task state while speculating on future steps.

Result: Experiments on complex agent benchmarks show FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

Conclusion: FutureWeaver successfully addresses the gap in principled test-time compute allocation for multi-agent systems, enabling effective collaboration optimization under budget constraints through modular workflows and dual-level planning.

Abstract: Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there does not exist principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

[196] A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation

Hong Je-Gal, Chan-Bin Yi, Hyun-Suk Lee

Main category: cs.AI

TL;DR: A-LAMP is an LLM-based framework that automatically converts natural language task descriptions into formal MDP formulations and trained policies, outperforming single LLM models while maintaining correctness.

DetailsMotivation: Applying RL to real-world tasks requires converting informal descriptions to formal MDPs, implementing environments, and training policies. This process is challenging due to modeling errors, fragile code, and misaligned objectives that impede policy training.

Method: Introduces A-LAMP, an agentic LLM-based framework that decomposes modeling, coding, and training into verifiable stages. It translates free-form natural language task descriptions into MDP formulations and trained policies while ensuring semantic alignment throughout the pipeline.

Result: A-LAMP consistently achieves higher policy generation capability than single state-of-the-art LLM models across classic control and custom RL domains. Even its lightweight variant with smaller language models approaches the performance of much larger models. Failure analysis reveals why improvements occur, and a case study confirms generated environments and policies preserve task optimality.

Conclusion: A-LAMP provides a reliable, automated framework for RL task formulation and policy generation that outperforms single LLM approaches while maintaining correctness and semantic alignment throughout the pipeline.

Abstract: Applying reinforcement learning (RL) to real-world tasks requires converting informal descriptions into a formal Markov decision process (MDP), implementing an executable environment, and training a policy agent. Automating this process is challenging due to modeling errors, fragile code, and misaligned objectives, which often impede policy training. We introduce an agentic large language model (LLM)-based framework for automated MDP modeling and policy generation (A-LAMP), that automatically translates free-form natural language task descriptions into an MDP formulation and trained policy. The framework decomposes modeling, coding, and training into verifiable stages, ensuring semantic alignment throughout the pipeline. Across both classic control and custom RL domains, A-LAMP consistently achieves higher policy generation capability than a single state-of-the-art LLM model. Notably, even its lightweight variant, which is built on smaller language models, approaches the performance of much larger models. Failure analysis reveals why these improvements occur. In addition, a case study also demonstrates that A-LAMP generates environments and policies that preserve the task’s optimality, confirming its correctness and reliability.

[197] TriFlow: A Progressive Multi-Agent Framework for Intelligent Trip Planning

Yuxing Chen, Basem Suleiman, Qifan Chen

Main category: cs.AI

TL;DR: TriFlow is a multi-agent framework for trip planning that combines structured reasoning with LLM flexibility to handle spatial, temporal, and budget constraints while satisfying user preferences.

DetailsMotivation: Existing LLM-based trip planning agents struggle with constraint satisfaction, tool coordination, and efficiency, often producing infeasible or costly itineraries that don't meet real-world requirements.

Method: Three-stage progressive pipeline: retrieval, planning, and governance. Uses rule-LLM collaboration to assemble constraint-consistent itineraries and performs bounded iterative refinement for global feasibility.

Result: Achieved state-of-the-art results with 91.1% final pass rate on TravelPlanner and 97% on TripTailor benchmarks, with over 10x runtime efficiency improvement compared to current SOTA.

Conclusion: TriFlow successfully addresses limitations of existing LLM-based trip planning by unifying structured reasoning with language flexibility, enabling efficient generation of feasible, personalized itineraries.

Abstract: Real-world trip planning requires transforming open-ended user requests into executable itineraries under strict spatial, temporal, and budgetary constraints while aligning with user preferences. Existing LLM-based agents struggle with constraint satisfaction, tool coordination, and efficiency, often producing infeasible or costly plans. To address these limitations, we present TriFlow, a progressive multi-agent framework that unifies structured reasoning and language-based flexibility through a three-stage pipeline of retrieval, planning, and governance. By this design, TriFlow progressively narrows the search space, assembles constraint-consistent itineraries via rule-LLM collaboration, and performs bounded iterative refinement to ensure global feasibility and personalisation. Evaluations on TravelPlanner and TripTailor benchmarks demonstrated state-of-the-art results, achieving 91.1% and 97% final pass rates, respectively, with over 10x runtime efficiency improvement compared to current SOTA.

[198] CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving

Jianyi Zhang, Ziyin Zhou, Xu Ji, Shizhao Liu, Zhangchi Zhao

Main category: cs.AI

TL;DR: CAPTURE is the first comprehensive CAPTCHA benchmark specifically designed for Large Visual Language Models (LVLMs), covering 4 main types and 25 sub-types from 31 vendors to thoroughly evaluate LVLM performance on CAPTCHA solving tasks.

DetailsMotivation: Existing CAPTCHA benchmarks are limited and not comprehensive: they were customized for specific research objectives and don't cover all CAPTCHA types. There's a lack of dedicated benchmarks specifically designed for LVLMs, despite their growing capabilities in multi-modal alignment and visual reasoning.

Method: Created CAPTURE (CAPTCHA for Testing Under Real-world Experiments) benchmark with extensive coverage: 4 main CAPTCHA types, 25 sub-types collected from 31 different vendors. The benchmark features large-scale data, diverse class variety, and unique labels specifically tailored for LVLM evaluation.

Result: When evaluated using the CAPTURE benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs, indicating that despite their strong multi-modal alignment capabilities, they still struggle with comprehensive CAPTCHA challenges.

Conclusion: CAPTURE fills important gaps in CAPTCHA evaluation by providing comprehensive coverage and LVLM-specific labeling. The benchmark reveals significant limitations in current LVLMs’ CAPTCHA-solving abilities, highlighting the need for further research and improvement in this area.

Abstract: Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce CAPTURE (CAPTCHA for Testing Under Real-world Experiments), the first CAPTCHA benchmark designed specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. The diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated on this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.

[199] Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance

Gonca Gürsun

Main category: cs.AI

TL;DR: A framework for making LLM agents more reliable in multi-turn tasks by integrating task profiling, verifiable reasoning, and constraint-compliant generation within RL environments.

DetailsMotivation: LLMs show strong reasoning abilities but lack reliability and verifiability in multi-turn tasks, needing more trustworthy behavior in interactive environments.

Method: Three-component framework: 1) lightweight task profiler for strategy selection, 2) reasoning module learning verifiable observation-action mappings, 3) generation module enforcing constraint-compliant outputs via validation or deterministic synthesis.

Result: The components co-evolve as the agent interacts with the environment, yielding trustworthy behavior in RL environments with defined observation, action, and reward signals.

Conclusion: The framework enables LLM-based agents to act under explicit behavioral guidance, improving reliability and verifiability in multi-turn interactive tasks.

Abstract: Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi-turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM-based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation-action mappings, and a generation module that enforces constraint-compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co-evolve, yielding trustworthy behavior.

[200] AgentBalance: Backbone-then-Topology Design for Cost-Effective Multi-Agent Systems under Budget Constraints

Shuowei Cai, Yansong Ning, Hao Liu

Main category: cs.AI

TL;DR: AgentBalance is a framework for building cost-effective multi-agent systems under explicit token-cost and latency budgets, using a backbone-then-topology approach instead of traditional topology-first designs.

DetailsMotivation: Current LLM-based multi-agent systems focus on communication topologies and agent backbones but don't optimize under explicit token-cost and latency budgets, leading to suboptimal cost-effectiveness when budgets are binding constraints.

Method: Two-stage approach: 1) Backbone-oriented agent generation (LLM pool construction, pool selection, role-backbone matching) to create heterogeneous agents, 2) Adaptive MAS topology generation (agent representation learning, gating, latency-aware topology synthesis) to guide inter-agent communication.
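
The backbone-then-topology ordering starts from a budgeted role-backbone matching. Here is a minimal greedy sketch; the quality/cost fields and the upgrade rule are illustrative assumptions, not AgentBalance's learned components.

```python
def assign_backbones(roles, llm_pool, budget):
    """Greedy role-backbone matching under a total token-cost budget:
    start every role on the cheapest backbone, then repeatedly apply the
    upgrade with the best quality gain per extra cost that still fits.
    roles: list of role names
    llm_pool: {name: {"quality": float, "cost": float}}
    """
    cheapest = min(llm_pool, key=lambda m: llm_pool[m]["cost"])
    plan = {role: cheapest for role in roles}
    spend = llm_pool[cheapest]["cost"] * len(roles)
    while True:
        options = []
        for role in roles:
            cur = llm_pool[plan[role]]
            for name, m in llm_pool.items():
                extra = m["cost"] - cur["cost"]
                gain = m["quality"] - cur["quality"]
                if gain > 0 and spend + extra <= budget:
                    options.append((gain / max(extra, 1e-9), role, name, extra))
        if not options:
            return plan, spend
        _, role, name, extra = max(options)   # best quality-per-cost upgrade
        plan[role], spend = name, spend + extra

pool = {"small": {"quality": 0.55, "cost": 1.0},
        "medium": {"quality": 0.70, "cost": 3.0},
        "large": {"quality": 0.82, "cost": 9.0}}
plan, spend = assign_backbones(["planner", "coder", "critic"], pool, budget=12)
```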

Result: Achieves up to 10% performance gains under matched token-cost budgets and 22% gains under latency budgets, with strong AUC on performance-versus-budget curves across benchmarks. Works as plug-in for existing MAS and generalizes to unseen LLMs.

Conclusion: AgentBalance provides a practical framework for budget-aware deployment of LLM-based multi-agent systems, addressing the gap between theoretical designs and real-world cost constraints.

Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) are becoming indispensable building blocks for web-scale applications such as web search, social network analytics, and online customer support, where cost-effectiveness is increasingly the primary constraint for large-scale deployment. While recent work improves MAS cost-effectiveness by shaping inter-agent communication topologies and selecting agent backbones, it rarely models and optimizes under explicit token-cost and latency budgets that reflect deployment constraints. This often leads to topology-first designs and suboptimal cost-effectiveness when budgets are binding. We present AgentBalance, a framework for constructing cost-effective MAS under explicit token-cost and latency budgets via a backbone-then-topology design. AgentBalance first performs backbone-oriented agent generation, constructing agents with heterogeneous backbones through LLM pool construction, pool selection, and role-backbone matching. It then performs adaptive MAS topology generation, guiding inter-agent communication via agent representation learning, gating, and latency-aware topology synthesis. Experiments on benchmarks with 14 candidate LLM backbones show that AgentBalance achieves up to 10% and 22% performance gains under matched token-cost and latency budgets, respectively, and yields strong AUC on performance-versus-budget curves across benchmarks. AgentBalance also functions as a plug-in for existing MAS, improving performance under the same token-cost and latency constraints, and it generalizes well to unseen LLMs for practical, budget-aware deployment. Code: https://github.com/usail-hkust/AgentBalance

[201] Back to the Baseline: Examining Baseline Effects on Explainability Metrics

Agustin Martin Picard, Thibaut Boissin, Varshini Subhash, Rémi Cadène, Thomas Fel

Main category: cs.AI

TL;DR: The paper reveals a critical flaw in popular XAI evaluation metrics (Insertion/Deletion) - baseline choice biases results, favoring different attribution methods. The authors propose a new model-dependent baseline that better balances information removal and avoiding out-of-distribution images.

DetailsMotivation: Current XAI evaluation using Insertion/Deletion metrics has a fundamental problem: baseline selection inherently biases which attribution methods appear optimal, leading to contradictory results even with simple linear models. This undermines reliable comparison of attribution methods.

Method: The authors analyze baseline properties through two criteria: (1) information removal capability, and (2) avoiding production of overly out-of-distribution images. They test existing baselines against these criteria and introduce a novel model-dependent baseline using feature visualization techniques to create a baseline that removes information without generating OOD images.
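
For reference, the Insertion metric at the center of the analysis looks roughly like this: start from a baseline image, reinsert the highest-attribution pixels step by step, and average the model's confidence along the curve. The `baseline_fn` argument is exactly the degree of freedom whose side effects the paper studies.

```python
import torch

def insertion_auc(model, image, attribution, baseline_fn, steps=20):
    """Insertion: begin at baseline_fn(image), restore pixels in decreasing
    attribution order, and average the target-class probability along the
    way. image: (C, H, W); attribution: (H, W)."""
    with torch.no_grad():
        target = model(image.unsqueeze(0)).argmax(dim=1)
    current = baseline_fn(image).clone()
    order = attribution.flatten().argsort(descending=True)
    chunk = max(1, order.numel() // steps)
    scores = []
    for i in range(steps):
        idx = order[: (i + 1) * chunk]
        current.view(image.shape[0], -1)[:, idx] = \
            image.view(image.shape[0], -1)[:, idx]
        with torch.no_grad():
            prob = torch.softmax(model(current.unsqueeze(0)), dim=1)
        scores.append(prob[0, target].item())
    return sum(scores) / len(scores)

# Swapping baseline_fn = torch.zeros_like for, say, a heavy Gaussian blur
# can flip which attribution method scores best on the very same model.
```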

Result: Existing baselines fail to satisfy both desirable properties, showing a trade-off: they either remove information effectively OR avoid OOD images, but not both. The proposed model-dependent baseline improves this trade-off by removing information while staying closer to the data distribution.

Conclusion: Baseline choice in XAI evaluation metrics is not neutral and significantly impacts attribution method rankings. The proposed model-dependent baseline offers a better solution by balancing information removal and distribution preservation, providing more reliable evaluation of attribution methods.

Abstract: Attribution methods are among the most prevalent techniques in Explainable Artificial Intelligence (XAI) and are usually evaluated and compared using Fidelity metrics, with Insertion and Deletion being the most popular. These metrics rely on a baseline function to alter the pixels of the input image that the attribution map deems most important. In this work, we highlight a critical problem with these metrics: the choice of a given baseline will inevitably favour certain attribution methods over others. More concerningly, even a simple linear model with commonly used baselines contradicts itself by designating different optimal methods. A question then arises: which baseline should we use? We propose to study this problem through two desirable properties of a baseline: (i) that it removes information and (ii) that it does not produce overly out-of-distribution (OOD) images. We first show that none of the tested baselines satisfy both criteria, and there appears to be a trade-off among current baselines: either they remove information or they produce a sequence of OOD images. Finally, we introduce a novel baseline by leveraging recent work in feature visualisation to artificially produce a model-dependent baseline that removes information without being overly OOD, thus improving on the trade-off when compared to other existing baselines. Our code is available at https://github.com/deel-ai-papers/Back-to-the-Baseline

[202] Motif-2-12.7B-Reasoning: A Practitioner’s Guide to RL Training Recipes

Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Minsu Ha, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon

Main category: cs.AI

TL;DR: 12.7B parameter model bridges gap between open and proprietary models in reasoning and long-context tasks through comprehensive training recipe addressing model collapse and instability.

DetailsMotivation: Address the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding, while solving common challenges of model collapse and training instability in reasoning adaptation.

Method: Comprehensive training recipe with: 1) Memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel optimizations, 2) Two-stage SFT curriculum with verified synthetic data to mitigate distribution mismatch, 3) Robust RLFT pipeline with difficulty-aware data filtering and mixed-policy trajectory reuse.

Result: Achieves performance comparable to significantly larger models across mathematics, coding, and agentic benchmarks, offering competitive open model and practical blueprint for scaling reasoning under realistic compute constraints.

Conclusion: Motif-2-12.7B-Reasoning successfully bridges the performance gap between open and proprietary models in reasoning tasks through a reproducible training approach that addresses key technical challenges, providing both a competitive model and methodology for the community.

Abstract: We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.

[203] Three methods, one problem: Classical and AI approaches to no-three-in-line

Pranav Ramanathan, Thomas Prellberg, Matthew Lewis, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

Main category: cs.AI

TL;DR: First systematic comparison of classical optimization (ILP) vs AI methods (PatternBoost transformer, PPO) for the No-Three-In-Line combinatorial geometry problem on n×n grids.

DetailsMotivation: The No-Three-In-Line problem is a famous combinatorial geometry challenge where classical methods like ILP face exponential scaling with grid size. Recent AI advances offer promising alternatives for pattern-based approximation, but no systematic comparison exists between classical and AI approaches for this problem.

Method: Applied three approaches: 1) Classical Integer Linear Programming (ILP) for provably optimal solutions; 2) PatternBoost transformer learning for pattern-based approximation; 3) Reinforcement learning with PPO algorithm. Compared performance across different grid sizes up to 19×19.
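
The ILP formulation is compact enough to state directly: one binary variable per grid cell and one "at most two" constraint per collinear triple. A runnable sketch with SciPy's `milp` (brute-force triple enumeration, fine for small n):

```python
import numpy as np
from itertools import combinations
from scipy.optimize import milp, LinearConstraint, Bounds

def no_three_in_line(n):
    """Maximize placed points on an n-by-n grid with at most 2 per line."""
    pts = [(x, y) for x in range(n) for y in range(n)]
    idx = {p: i for i, p in enumerate(pts)}
    rows = []
    for a, b, c in combinations(pts, 3):
        # collinear iff the cross product of (b - a) and (c - a) vanishes
        if (b[0] - a[0]) * (c[1] - a[1]) == (b[1] - a[1]) * (c[0] - a[0]):
            row = np.zeros(len(pts))
            row[[idx[a], idx[b], idx[c]]] = 1.0
            rows.append(row)
    res = milp(c=-np.ones(len(pts)),               # maximize points placed
               constraints=LinearConstraint(np.array(rows), 0, 2),
               integrality=np.ones(len(pts)), bounds=Bounds(0, 1))
    return [p for p, i in idx.items() if res.x[i] > 0.5]

print(len(no_three_in_line(5)))  # the known optimum 2n = 10 for n = 5
```

The exponential blow-up the paper reports is visible here: the number of collinear-triple constraints, and the branch-and-bound tree above them, grow rapidly with n.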

Result: ILP achieved provably optimal solutions up to 19×19 grids. PatternBoost matched optimal performance up to 14×14 grids with 96% test loss reduction. PPO achieved perfect solutions on 10×10 grids but failed at 11×11 due to constraint violations preventing valid configurations.

Conclusion: Classical optimization remains essential for exact solutions, while AI methods offer competitive performance on smaller instances. Hybrid approaches combining classical and AI methods present the most promising direction for scaling to larger problem sizes.

Abstract: The No-Three-In-Line problem asks for the maximum number of points that can be placed on an n by n grid with no three collinear, representing a famous problem in combinatorial geometry. While classical methods like Integer Linear Programming (ILP) guarantee optimal solutions, they face exponential scaling with grid size, and recent advances in machine learning offer promising alternatives for pattern-based approximation. This paper presents the first systematic comparison of classical optimization and AI approaches to this problem, evaluating their performance against traditional algorithms. We apply PatternBoost transformer learning and reinforcement learning (PPO) to this problem for the first time, comparing them against ILP. ILP achieves provably optimal solutions up to 19 by 19 grids, while PatternBoost matches optimal performance up to 14 by 14 grids with 96% test loss reduction. PPO achieves perfect solutions on 10 by 10 grids but fails at 11 by 11 grids, where constraint violations prevent valid configurations. These results demonstrate that classical optimization remains essential for exact solutions while AI methods offer competitive performance on smaller instances, with hybrid approaches presenting the most promising direction for scaling to larger problem sizes.

[204] General-purpose AI models can generate actionable knowledge on agroecological crop protection

Kris A. G. Wyckhuys

Main category: cs.AI

TL;DR: LLMs like DeepSeek and ChatGPT show promise for agroecological knowledge generation but have accuracy issues including hallucinations and data omissions.

DetailsMotivation: To explore the potential of generative AI for democratizing scientific knowledge in agri-food science, specifically for agroecological crop protection, and to compare the performance of different LLMs in this domain.

Method: Compared web-grounded (DeepSeek) vs non-grounded (free-tier ChatGPT) LLMs on nine globally limiting pests, weeds, and plant diseases. Assessed factual accuracy, data consistency, and breadth of knowledge (data completeness) of generated scientific knowledge.

Result: DeepSeek screened 4.8-49.7× larger literature corpus and reported 1.6-2.4× more biological control agents/management solutions than ChatGPT. DeepSeek showed 21.6% higher efficacy estimates, better lab-to-field consistency, and more realistic effects. Both models hallucinated (fabricated agents/references), reported implausible ecological interactions, confused nomenclatures, and omitted key data.

Conclusion: LLMs can correctly report low-resolution efficacy trends and, with rigorous human oversight, may be powerful tools for farm-level decision-making and scientific creativity in agroecological crop protection.

Abstract: Generative artificial intelligence (AI) offers potential for democratizing scientific knowledge and converting this to clear, actionable information, yet its application in agri-food science remains unexplored. Here, we verify the scientific knowledge on agroecological crop protection that is generated by either web-grounded or non-grounded large language models (LLMs), i.e., DeepSeek versus the free-tier version of ChatGPT. For nine globally limiting pests, weeds, and plant diseases, we assessed the factual accuracy, data consistency, and breadth of knowledge or data completeness of each LLM. Overall, DeepSeek consistently screened a 4.8-49.7-fold larger literature corpus and reported 1.6-2.4-fold more biological control agents or management solutions than ChatGPT. As a result, DeepSeek reported 21.6% higher efficacy estimates, exhibited greater laboratory-to-field data consistency, and showed more realistic effects of pest identity and management tactics. However, both models hallucinated, i.e., fabricated fictitious agents or references, reported on implausible ecological interactions or outcomes, confused old and new scientific nomenclatures, and omitted data on key agents or solutions. Despite these shortcomings, both LLMs correctly reported low-resolution efficacy trends. Overall, when paired with rigorous human oversight, LLMs may prove a powerful tool to support farm-level decision-making and unleash scientific creativity.

[205] BAID: A Benchmark for Bias Assessment of AI Detectors

Priyam Basu, Yunfeng Zhang, Vipul Raheja

Main category: cs.AI

TL;DR: BAID is a comprehensive evaluation framework that systematically assesses bias in AI text detectors across 7 sociolinguistic categories using 200k+ samples, revealing consistent performance disparities against underrepresented groups.

DetailsMotivation: AI-generated text detectors are increasingly used in education and professional settings, but prior research only identified isolated cases of bias (particularly against English Language Learners). There's a lack of systematic evaluation across broader sociolinguistic factors, creating a need for comprehensive bias assessment before deployment.

Method: Developed BAID framework with 200k+ samples spanning 7 bias categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. Created synthetic versions of each sample using carefully crafted prompts to preserve content while reflecting subgroup-specific writing styles. Evaluated four open-source state-of-the-art AI text detectors using this dataset.
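
The reported disparities are essentially per-subgroup recall gaps. A minimal sketch of that audit step, with hypothetical column names:

```python
import pandas as pd

# One row per text: 'group' = sociolinguistic subgroup, 'label' = gold class
# (1 = the class whose recall is audited), 'pred' = detector decision.
def recall_by_group(df):
    pos = df[df["label"] == 1]
    rec = pos.groupby("group")["pred"].mean().rename("recall")
    return rec, rec.max() - rec.min()  # the gap is the disparity being reported

df = pd.DataFrame({"group": ["ELL", "ELL", "native", "native"],
                   "label": [1, 1, 1, 1],
                   "pred":  [0, 1, 1, 1]})
rec, gap = recall_by_group(df)
print(rec.to_dict(), "gap:", gap)  # {'ELL': 0.5, 'native': 1.0} gap: 0.5
```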

Result: Found consistent disparities in detection performance across all evaluated detectors, with particularly low recall rates for texts from underrepresented groups. The framework revealed systematic bias patterns that weren’t apparent in isolated evaluations.

Conclusion: BAID provides a scalable, transparent approach for auditing AI text detectors and demonstrates the critical need for bias-aware evaluation before deploying these tools for public use. The systematic framework reveals biases that isolated testing misses.

Abstract: AI-generated text detectors have recently gained adoption in educational and professional contexts. Prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs); however, there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose BAID, a comprehensive evaluation framework for AI detectors across various types of biases. As a part of the framework, we introduce over 200k samples spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. We also generated synthetic versions of each sample with carefully crafted prompts to preserve the original content while reflecting subgroup-specific writing styles. Using this, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.

[206] EmeraldMind: A Knowledge Graph-Augmented Framework for Greenwashing Detection

Georgios Kaoukis, Ioannis Aris Koufopoulos, Psaroudaki Eleni, Danae Pla Karidi, Evaggelia Pitoura, George Papastefanatos, Panayiotis Tsaparas

Main category: cs.AI

TL;DR: EmeraldMind is a fact-centric framework using domain-specific knowledge graphs and retrieval-augmented generation to detect corporate greenwashing in ESG reports with transparent evidence-based justifications.

DetailsMotivation: As AI systems increasingly influence decision-making, there's a critical need to design intelligent systems that support sustainability while combating misinformation. Greenwashing (misleading corporate sustainability claims) poses a major challenge to environmental progress, requiring automated detection methods.

Method: EmeraldMind integrates a domain-specific knowledge graph (EmeraldGraph) built from diverse corporate ESG reports with retrieval-augmented generation. The framework surfaces verifiable evidence often missing in generic knowledge bases, supports LLMs in claim assessment, and delivers justification-centric classifications with transparent, evidence-backed verdicts while responsibly abstaining from unverifiable claims.

Result: Experiments on a new greenwashing claims dataset show EmeraldMind achieves competitive accuracy, greater coverage, and superior explanation quality compared to generic LLMs, without requiring fine-tuning or retraining.

Conclusion: EmeraldMind provides an effective framework for automated greenwashing detection that combines domain-specific knowledge with LLM capabilities, offering transparent, evidence-based assessments that can help combat sustainability misinformation in corporate reporting.

Abstract: As AI and web agents become pervasive in decision-making, it is critical to design intelligent systems that not only support sustainability efforts but also guard against misinformation. Greenwashing, i.e., misleading corporate sustainability claims, poses a major challenge to environmental progress. To address this challenge, we introduce EmeraldMind, a fact-centric framework integrating a domain-specific knowledge graph with retrieval-augmented generation to automate greenwashing detection. EmeraldMind builds the EmeraldGraph from diverse corporate ESG (environmental, social, and governance) reports, surfacing verifiable evidence, often missing in generic knowledge bases, and supporting large language models in claim assessment. The framework delivers justification-centric classifications, presenting transparent, evidence-backed verdicts and abstaining responsibly when claims cannot be verified. Experiments on a new greenwashing claims dataset demonstrate that EmeraldMind achieves competitive accuracy, greater coverage, and superior explanation quality compared to generic LLMs, without the need for fine-tuning or retraining.

[207] AI-MASLD Metabolic Dysfunction and Information Steatosis of Large Language Models in Unstructured Clinical Narratives

Yuan Shen, Xiaojun Wu, Linghua Yu

Main category: cs.AI

TL;DR: LLMs show functional decline in noisy clinical scenarios similar to metabolic dysfunction, with Qwen3-Max performing best and GPT-4o making serious PE risk misjudgments.

DetailsMotivation: To evaluate LLMs' ability to extract core medical information from noisy patient complaints and test if they exhibit functional decline analogous to MASLD in clinical settings.

Method: Cross-sectional analysis using 20 standardized medical probes across 5 dimensions to simulate clinical communication. Four LLMs (GPT-4o, Gemini 2.5, DeepSeek 3.1, Qwen3-Max) were evaluated with gold-standard answers assessed via double-blind inverse rating by clinicians.

Result: All models showed functional defects, with Qwen3-Max performing best overall and Gemini 2.5 worst. Most models collapsed under extreme noise. GPT-4o made a severe misjudgment in PE risk assessment secondary to DVT.

Conclusion: First empirical confirmation that LLMs exhibit metabolic dysfunction-like features when processing clinical information, introducing the “AI-MASLD” concept. LLMs should only be used as auxiliary tools under human supervision due to the significant gap between theoretical knowledge and practical application.

Abstract: This study aims to simulate real-world clinical scenarios to systematically evaluate the ability of Large Language Models (LLMs) to extract core medical information from patient chief complaints laden with noise and redundancy, and to verify whether they exhibit a functional decline analogous to Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD). We employed a cross-sectional analysis design based on standardized medical probes, selecting four mainstream LLMs as research subjects: GPT-4o, Gemini 2.5, DeepSeek 3.1, and Qwen3-Max. An evaluation system comprising twenty medical probes across five core dimensions was used to simulate a genuine clinical communication environment. All probes had gold-standard answers defined by clinical experts and were assessed via a double-blind, inverse rating scale by two independent clinicians. The results show that all tested models exhibited functional defects to varying degrees, with Qwen3-Max demonstrating the best overall performance and Gemini 2.5 the worst. Under conditions of extreme noise, most models experienced a functional collapse. Notably, GPT-4o made a severe misjudgment in the risk assessment for pulmonary embolism (PE) secondary to deep vein thrombosis (DVT). This research is the first to empirically confirm that LLMs exhibit features resembling metabolic dysfunction when processing clinical information, proposing the innovative concept of “AI-Metabolic Dysfunction-Associated Steatotic Liver Disease (AI-MASLD)”. These findings offer a crucial safety warning for the application of Artificial Intelligence (AI) in healthcare, emphasizing that current LLMs must be used as auxiliary tools under human expert supervision, as there remains a significant gap between their theoretical knowledge and practical clinical application.

[208] AI Benchmark Democratization and Carpentry

Gregor von Laszewski, Wesley Brewer, Jeyan Thiyagalingam, Juri Papay, Armstrong Foundjem, Piotr Luszczek, Murali Emani, Shirley V. Moore, Vijay Janapa Reddi, Matthew D. Sinclair, Sebastian Lobentanzer, Sujata Goswami, Benjamin Hawks, Marco Colombo, Nhan Tran, Christine R. Kirkpatrick, Abdulkareem Alsudais, Gregg Barrett, Tianhao Li, Kirsten Morehouse, Shivaram Venkataraman, Rutwik Jain, Kartik Mathur, Victor Lu, Tejinder Singh, Khojasteh Z. Mirza, Kongtao Chen, Sasidhar Kunapuli, Gavin Farrell, Renato Umeton, Geoffrey C. Fox

Main category: cs.AI

TL;DR: The paper argues that traditional static AI benchmarks are inadequate for modern AI systems and proposes a shift toward dynamic, adaptive benchmarking frameworks and education in “AI Benchmark Carpentry” to address real-world deployment needs.

DetailsMotivation: Current AI benchmarks are increasingly complex and static, causing a gap between benchmark results and real-world performance. Large language models can memorize static benchmarks, and traditional approaches emphasize peak performance on top-tier hardware without addressing diverse deployment scenarios. There's a need for benchmarks that align with evolving AI systems and practical application contexts.

Method: The paper proposes a paradigm shift from static to dynamic, adaptive benchmarking frameworks that incorporate evolving models, updated data, and heterogeneous platforms. It introduces the concept of “AI Benchmark Carpentry” - systematic education and skill development for benchmark design and use, drawing from experiences with MLCommons, educational initiatives, and programs like the DOE’s Trillion Parameter Consortium.

Result: Identifies key barriers including high resource demands, limited hardware access, lack of benchmark design expertise, and uncertainty in relating results to application domains. Current benchmarks fail to provide guidance for diverse real-world scenarios, necessitating a more inclusive and application-relevant approach.

Conclusion: Benchmarking must become dynamic and inclusive to keep pace with AI evolution, supporting responsible, reproducible, and accessible AI deployment. Community efforts should focus on building sustained expertise through AI Benchmark Carpentry education and technical innovation to enable context-sensitive decision making.

Abstract: Benchmarks are a cornerstone of modern machine learning, enabling reproducibility, comparison, and scientific progress. However, AI benchmarks are increasingly complex, requiring dynamic, AI-focused workflows. Rapid evolution in model architectures, scale, datasets, and deployment contexts makes evaluation a moving target. Large language models often memorize static benchmarks, causing a gap between benchmark results and real-world performance. Beyond traditional static benchmarks, continuous adaptive benchmarking frameworks are needed to align scientific assessment with deployment risks. This calls for skills and education in AI Benchmark Carpentry. From our experience with MLCommons, educational initiatives, and programs like the DOE’s Trillion Parameter Consortium, key barriers include high resource demands, limited access to specialized hardware, lack of benchmark design expertise, and uncertainty in relating results to application domains. Current benchmarks often emphasize peak performance on top-tier hardware, offering limited guidance for diverse, real-world scenarios. Benchmarking must become dynamic, incorporating evolving models, updated data, and heterogeneous platforms while maintaining transparency, reproducibility, and interpretability. Democratization requires both technical innovation and systematic education across levels, building sustained expertise in benchmark design and use. Benchmarks should support application-relevant comparisons, enabling informed, context-sensitive decisions. Dynamic, inclusive benchmarking will ensure evaluation keeps pace with AI evolution and supports responsible, reproducible, and accessible AI deployment. Community efforts can provide a foundation for AI Benchmark Carpentry.

[209] Causal Inference in Energy Demand Prediction

Chutian Ma, Grigorii Pomazkin, Giacinto Paolo Saggese, Paul Smith

Main category: cs.AI

TL;DR: A structural causal model for energy demand prediction that incorporates weather and calendar factors, with a Bayesian model achieving state-of-the-art 3.84% MAPE.

DetailsMotivation: Energy demand prediction is critical for grid operations but complex due to causal interdependencies between weather conditions, calendar information, and human activity patterns that simple correlation-based methods can't adequately capture.

Method: Proposed a structural causal model to explain causal relationships between variables, performed full causal analysis, then built a Bayesian model using causal insights as prior knowledge.

Result: Achieved state-of-the-art performance with 3.84% MAPE on test set, with strong robustness shown by 3.88% average MAPE in cross-validation across two years of data. Causal analysis revealed season-dependent temperature sensitivity and lower winter variance due to decoupling effects.
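
For reference, the MAPE figures quoted here follow the standard definition, with $y_t$ the observed demand and $\hat{y}_t$ the prediction over $N$ test points:

```latex
\mathrm{MAPE} \;=\; \frac{100}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
```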

Conclusion: The structural causal modeling approach successfully captures complex causal relationships in energy demand, leading to highly accurate and robust predictions that outperform traditional correlation-based methods.

Abstract: Energy demand prediction is critical for grid operators, industrial energy consumers, and service providers. Energy demand is influenced by multiple factors, including weather conditions (e.g. temperature, humidity, wind speed, solar radiation), and calendar information (e.g. hour of day and month of year), which further affect daily work and life schedules. These factors are causally interdependent, making the problem too complex for simple correlation-based learning techniques to handle satisfactorily. We propose a structural causal model that explains the causal relationship between these variables. A full analysis is performed to validate our causal beliefs, also revealing important insights consistent with prior studies. For example, our causal model reveals that energy demand responds to temperature fluctuations with season-dependent sensitivity. Additionally, we find that energy demand exhibits lower variance in winter due to the decoupling effect between temperature changes and daily activity patterns. We then build a Bayesian model, which takes advantage of the causal insights we learned as prior knowledge. The model is trained and tested on unseen data and yields state-of-the-art performance in the form of a 3.84 percent MAPE on the test set. The model also demonstrates strong robustness, as the cross-validation across two years of data yields an average MAPE of 3.88 percent.

[210] MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl

Main category: cs.AI

TL;DR: TxAgent is an AI system for therapeutic decision-making that uses iterative RAG with a fine-tuned Llama-3.1-8B model to dynamically access biomedical tools, evaluated in the CURE-Bench NeurIPS 2025 Challenge where it won Excellence Award in Open Science.

DetailsMotivation: Therapeutic decision-making requires robust multi-step reasoning with reliable biomedical knowledge, but general-purpose RAG systems don't meet medical safety constraints where reasoning trace and tool invocation accuracy are critical.

Method: TxAgent uses a fine-tuned Llama-3.1-8B model with iterative retrieval-augmented generation (RAG) that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse) integrating FDA Drug API, OpenTargets, and Monarch resources.
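
A minimal sketch of the iterative retrieve-call-observe loop this describes; `model`, `retrieve_tool`, and `execute` are hypothetical stand-ins, not the actual TxAgent or ToolUniverse API:

```python
# The quality of `retrieve_tool` (mapping a requested tool name to the right
# biomedical tool) is the retrieval step whose effect the paper analyzes.
def agent_loop(question, model, retrieve_tool, execute, max_steps=8):
    trace = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model(trace)                # propose reasoning + optional tool call
        if step.get("tool_call") is None:  # no call requested -> final answer
            return step["content"], trace
        tool = retrieve_tool(step["tool_call"]["name"])
        result = execute(tool, step["tool_call"]["args"])
        trace += [step, {"role": "tool", "content": result}]
    return None, trace                     # step budget exhausted
```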

Result: The system was evaluated in CURE-Bench NeurIPS 2025 Challenge, analyzing how retrieval quality for tool calls influences performance, demonstrating performance gains through improved tool-retrieval strategies, and winning Excellence Award in Open Science.

Conclusion: Agentic AI methods like TxAgent address therapeutic reasoning challenges through iterative RAG with specialized biomedical tools, with evaluation protocols treating reasoning and tool-usage behaviors as explicit supervision signals for medical AI safety.

Abstract: Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

[211] Probability Bracket Notation: Multivariable Systems and Static Bayesian Networks

Xing M. Wang

Main category: cs.AI

TL;DR: The paper extends Probability Bracket Notation (PBN) to multivariable systems and Bayesian networks, providing an operator-driven framework for analyzing probabilistic models with applications in education and AI.

DetailsMotivation: Traditional probability notation lacks a unified operator-driven framework for analyzing complex probabilistic models. The authors aim to develop a symbolic framework inspired by quantum mechanics notation to simplify manipulation and analysis of probabilistic systems.

Method: The authors expand Probability Bracket Notation (PBN) to handle multivariable probability systems and Bayesian networks. They define joint, marginal, and conditional probability distributions, as well as marginal and conditional expectations within this framework. They demonstrate applications using the Student BN example and extend to continuous variables and linear Gaussian networks. They also introduce a customized Healthcare BN with mixed discrete/continuous variables and discrete-display nodes.
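
For orientation, the Student BN used here is the standard five-variable example (Difficulty, Intelligence, Grade, SAT, Letter), whose joint distribution factorizes as below; the paper writes the same quantities in PBN's bra-ket style rather than this conventional notation:

```latex
P(D, I, G, S, L) \;=\; P(D)\, P(I)\, P(G \mid D, I)\, P(S \mid I)\, P(L \mid G)
```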

Result: PBN provides a unified operator-driven framework that simplifies analysis of probabilistic models. It enables algebraic manipulation of dependencies among multiple random variables and supports predictions, inferences (bottom-up and top-down), and expectations. The framework works with both discrete and continuous variables and can handle mixed-variable Bayesian networks.

Conclusion: Probability Bracket Notation offers a powerful alternative to traditional probability notation, unifying and simplifying probabilistic model analysis. It has potential as both an educational tool and practical platform for causal reasoning, inference, expectation, data analytics, machine learning, and artificial intelligence applications.

Abstract: We expand the Probability Bracket Notation (PBN), a symbolic framework inspired by the Dirac notation in quantum mechanics, to multivariable probability systems and static Bayesian networks (BNs). By defining joint, marginal, and conditional probability distributions (PDs), as well as marginal and conditional expectations, we demonstrate how to express dependencies among multiple random variables and manipulate them algebraically in PBN. Using the well-known Student BN as an example of probabilistic graphical models (PGMs), we illustrate how to apply PBN to analyze predictions, inferences (using both bottom-up and top-down approaches), and expectations. We then extend PBN to BNs with continuous variables. After reviewing linear Gaussian networks, we introduce a customized Healthcare BN that includes both continuous and discrete random variables, utilizes user-specific data, and provides tailored predictions via discrete-display (DD) nodes that proxy for their continuous-variable parents. Compared to traditional probability notation, PBN offers an operator-driven framework that unifies and simplifies the analysis of probabilistic models, with potential as both an educational tool and a practical platform for causal reasoning, inference, expectation, data analytics, machine learning, and artificial intelligence.

[212] AI and Jobs: Has the Inflection Point Arrived? Evidence from an Online Labor Platform

Dandan Qiao, Huaxia Rui, Qian Xiong

Main category: cs.AI

TL;DR: AI impacts online labor markets differently: displacement effects in translation (reduced work/earnings) vs productivity effects in web development (increased work/earnings), with an inflection point determining whether humans benefit or get replaced.

DetailsMotivation: To understand how AI (specifically ChatGPT) influences different online labor markets over time, examining both positive and negative effects on human workers across various domains.

Method: Difference-in-Differences analysis of ChatGPT’s impact, Cournot competition model to identify inflection points, and heterogeneous analysis comparing ChatGPT 3.5 to 4.0 effects across regions and experience levels.
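
A minimal sketch of a difference-in-differences specification of the kind described, using statsmodels (variable and file names hypothetical; the paper's exact specification may differ):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per market-week, with log earnings, a
# treated-market dummy (e.g. translation = 1) and a post-launch dummy.
df = pd.read_csv("olm_panel.csv")

model = smf.ols("log_earnings ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["market"]}
)
print(model.params["treated:post"])  # the DiD estimate of the launch effect
```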

Result: Two opposite scenarios: displacement in translation markets (reduced volume/earnings) vs productivity enhancement in web development (increased volume/earnings). Found inflection point where AI transitions from complement to substitute. U.S. web developers benefit more than others; experienced translators more likely to exit market.

Conclusion: AI impacts labor markets heterogeneously with both displacement and productivity effects, determined by market-specific inflection points. Policy should consider these differential impacts across occupations, regions, and experience levels.

Abstract: This study investigates how artificial intelligence (AI) influences various online labor markets (OLMs) over time. Employing the Difference-in-Differences method, we discovered two distinct scenarios following ChatGPT’s launch: displacement effects featuring reduced work volume and earnings, exemplified by translation & localization OLM; productivity effects featuring increased work volume and earnings, exemplified by web development OLM. To understand these opposite effects in a unified framework, we developed a Cournot competition model to identify an inflection point for each market. Before this point, human workers benefit from AI enhancements; beyond this point, human workers would be replaced. Further analyzing the progression from ChatGPT 3.5 to 4.0, we found three effect scenarios, reinforcing our inflection point conjecture. Heterogeneous analyses reveal that U.S. web developers tend to benefit more from ChatGPT’s launch compared to their counterparts in other regions. Experienced translators seem more likely to exit the market than less experienced translators.

[213] Grammar-Aligned Decoding

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, Loris D’Antoni

Main category: cs.AI

TL;DR: The paper introduces ASAp, a grammar-aligned decoding algorithm that ensures outputs are grammatical while maintaining proper alignment with the LLM’s probability distribution, addressing the distribution distortion problem in existing grammar-constrained decoding methods.

DetailsMotivation: Existing grammar-constrained decoding (GCD) techniques distort LLM probability distributions, leading to grammatical but low-quality outputs that don't properly reflect the model's likelihood estimates. There's a need for decoding methods that guarantee grammaticality while preserving the LLM's conditional probability distribution.

Method: Proposes Adaptive Sampling with Approximate Expected Futures (ASAp), a decoding algorithm that uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. The method guarantees grammatical outputs while provably matching the conditional probability of the LLM’s distribution conditioned on the grammar constraint.
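
To make the distortion concrete: plain GCD masks illegal tokens and renormalizes at each step, as in the sketch below, which ignores how likely each continuation is to remain grammatical later on; ASAp additionally reweights prefixes by an estimate of that future grammaticality built from prior samples (not shown). Names are illustrative:

```python
import math, random

def gcd_step(logprobs, allowed):
    """One step of plain grammar-constrained decoding.

    logprobs: dict token -> log p(token | prefix) under the LLM.
    allowed:  set of tokens the grammar permits next.
    """
    probs = {t: math.exp(lp) for t, lp in logprobs.items() if t in allowed}
    z = sum(probs.values())
    assert z > 0, "prefix has no grammatical continuation"
    r, acc = random.random() * z, 0.0  # sample from the renormalized masses
    for t, p in probs.items():
        acc += p
        if r <= acc:
            return t
    return t  # numerical fallback
```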

Result: ASAp produces outputs with higher likelihood (according to the LLM’s distribution) than existing GCD techniques while still enforcing grammatical constraints, as demonstrated in evaluations on code generation and structured NLP tasks.

Conclusion: The paper introduces grammar-aligned decoding (GAD) as a solution to the distribution distortion problem in constrained decoding, with ASAp providing a provably correct method for generating grammatical outputs that properly reflect the LLM’s probability distribution.

Abstract: Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM’s output must follow a given grammar. In this paper, we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM’s distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM’s distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM’s distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.

[214] Rolling in the deep of cognitive and AI biases

Nicoleta Tantalaki, Athena Vakali

Main category: cs.AI

TL;DR: Paper proposes integrating human cognitive biases into AI fairness analysis, viewing AI as a sociotechnical system rather than just computational pipelines.

DetailsMotivation: Current AI fairness approaches focus too narrowly on computational aspects while overlooking human and societal factors that contribute to biased outcomes, despite AI being deployed in sensitive domains like healthcare and law enforcement.

Method: Radical new methodology that incorporates human cognitive biases as core entities in AI fairness analysis, mapping human heuristics to AI biases using cognitive science taxonomy, and identifying hidden pathways of human-to-AI bias transmission.

Result: Developed a new mapping framework that reveals how human cognitive biases influence the AI lifecycle, identifies fairness intensities and inter-dependencies, and exposes hidden bias pathways from human heuristics to AI systems.

Conclusion: AI must be understood as a sociotechnical system, and incorporating human cognitive biases into fairness analysis will enable deeper human-centric case studies and reveal hidden bias causes and effects in AI systems.

Abstract: Nowadays, we delegate many of our decisions to Artificial Intelligence (AI) that acts either solo or as a human companion in decisions made to support several sensitive domains, like healthcare, financial services and law enforcement. AI systems, even when carefully designed to be fair, are heavily criticized for delivering misjudged and discriminatory outcomes against individuals and groups. Numerous works on AI algorithmic fairness are devoted to Machine Learning pipelines which address biases and quantify fairness under a purely computational view. However, the continued unfair and unjust AI outcomes indicate that there is an urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed and deployed. Although the synergy of humans and machines seems imperative to make AI work, the significant impact of human and societal factors on AI bias is currently overlooked. We address this critical issue by following a radically new methodology under which human cognitive biases become core entities in our AI fairness overview. Inspired by the cognitive science definition and taxonomy of human heuristics, we identify how harmful human actions influence the overall AI lifecycle and reveal hidden pathways from human biases to AI biases. We introduce a new mapping, which traces how human heuristics are reflected in AI biases, and we detect the relevant fairness intensities and inter-dependencies. We envision that this approach will contribute to revisiting AI fairness through deeper human-centric case studies, revealing hidden bias causes and effects.

[215] Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection

Zongxian Yang, Jiayu Qian, Zegao Peng, Haoyu Zhang, Yu-An Huang, KC Tan, Zhi-An Huang

Main category: cs.AI

TL;DR: Med-REFL is a novel framework that enables large reasoning models to self-correct in medical domains without human labels, using automated structural assessment of reasoning paths to generate preference data for reflection training.

DetailsMotivation: Large reasoning models struggle to self-correct in medicine because evaluating intermediate reasoning is cumbersome and expensive, creating a verification bottleneck that hinders reliable AI reasoners for high-stakes applications.

Method: Med-REFL introduces deterministic structural assessment of the reasoning space to automatically generate preference data for reflection. It globally evaluates all explored reasoning paths in a tree-of-thoughts, quantifies the value of corrective actions, and constructs direct preference optimization pairs to train models to recognize and amend reasoning fallacies.
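
A minimal sketch of turning scored tree-of-thoughts paths into preference pairs for direct preference optimization, in the spirit of the description above (data structures hypothetical):

```python
# Each explored path is (reasoning_text, value), where value comes from the
# structural assessment, e.g. crediting paths that reflect on and fix an error.
def build_dpo_pairs(paths, margin=0.1):
    pairs = []
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    for i, (chosen, v_hi) in enumerate(ranked):
        for rejected, v_lo in ranked[i + 1:]:
            if v_hi - v_lo >= margin:  # keep only clearly separated pairs
                pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

paths = [("...notices and fixes the error...", 0.9),
         ("...repeats the error...", 0.4)]
print(len(build_dpo_pairs(paths)))  # 1
```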

Result: Med-REFL delivers robust gains across diverse model architectures and medical benchmarks, boosting Llama3.1-8B by +5.82% and state-of-the-art Huatuo-o1 by +4.13% on MedQA. Med-REFL-8B achieves SOTA among 7-8B models and competes with models twice its size. It also generalizes to other domains like logical reasoning and mitigates the ‘fake reflection’ phenomenon.

Conclusion: The framework provides a scalable solution to the verification bottleneck, paving the way for more reliable AI reasoners in high-stakes domains like medicine, with publicly available implementation.

Abstract: Large reasoning models excel in domains like mathematics where intermediate reasoning is straightforward to verify, but struggle to self-correct in fields such as medicine, where evaluating intermediate reasoning is cumbersome and expensive. This verification bottleneck hinders the development of reliable AI reasoners for high-stakes applications. Here we propose Med-REFL, a novel framework that learns fine-grained reflection without human labels or model distillation. Med-REFL introduces a deterministic structural assessment of the reasoning space to automatically generate preference data for reflection. By globally evaluating all explored reasoning paths in a tree-of-thoughts, our method quantifies the value of corrective actions, enabling the automated construction of direct preference optimization pairs. This trains the model to recognize and amend its own reasoning fallacies. Extensive experiments show Med-REFL delivers robust gains across diverse model architectures and medical benchmarks, boosting a general-purpose Llama3.1-8B by +5.82% and the state-of-the-art Huatuo-o1 by +4.13% on the MedQA benchmark. Our Med-REFL-8B achieves state-of-the-art performance among 7-8B models while even competing with models twice its size. Crucially, targeted ablations prove its success generalizes to other domains such as logical reasoning and mitigates the ‘fake reflection’ phenomenon in LRMs. Ultimately, our framework provides a scalable solution to the verification bottleneck, paving the way for more reliable AI reasoners in high-stakes domains like medicine. Med-REFL has been made publicly available at https://github.com/TianYin123/Med-REFL.

[216] From Bits to Boardrooms: A Cutting-Edge Multi-Agent LLM Framework for Business Excellence

Zihao Wang, Junming Zhang

Main category: cs.AI

TL;DR: BusiAgent is a multi-agent LLM framework for enterprise decision-making that integrates CTMDP modeling, entropy optimization, Stackelberg games, and Thompson sampling to bridge operational analysis with strategic goals.

DetailsMotivation: Current LLM approaches for business applications struggle to reconcile detailed operational analyses with high-level strategic goals across diverse markets, leading to fragmented workflows and reduced organizational collaboration.

Method: Multi-agent framework with three core innovations: extended Continuous Time Markov Decision Process for dynamic agent modeling, generalized entropy measure for collaborative efficiency optimization, multi-level Stackelberg game for hierarchical decision processes, plus contextual Thompson sampling for prompt optimization and quality assurance system.
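
Of these ingredients, the prompt-optimization step is the easiest to sketch. A Beta-Bernoulli Thompson sampler over candidate prompts is one plausible reading (the paper's contextual variant also conditions on task features); names are illustrative:

```python
import random

class PromptBandit:
    """Thompson sampling over candidate prompts with Beta(1, 1) priors."""
    def __init__(self, prompts):
        self.stats = {p: [1, 1] for p in prompts}  # [alpha, beta] per prompt

    def choose(self):
        # sample a plausible success rate for each prompt, play the best draw
        return max(self.stats, key=lambda p: random.betavariate(*self.stats[p]))

    def update(self, prompt, success):
        self.stats[prompt][0 if success else 1] += 1

bandit = PromptBandit(["terse", "step-by-step", "role-play"])
p = bandit.choose()
bandit.update(p, success=True)  # feedback from the quality-assurance system
```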

Result: Extensive empirical evaluations across diverse business scenarios validate BusiAgent’s efficacy in generating coherent, client-focused solutions that integrate granular insights with high-level strategy, significantly outperforming established approaches in both solution quality and user satisfaction.

Conclusion: BusiAgent represents a substantial step forward in AI-driven enterprise decision-making by fusing cutting-edge AI technologies with deep business insights, empowering organizations to navigate complex business landscapes more effectively.

Abstract: Large Language Models (LLMs) have shown promising potential in business applications, particularly in enterprise decision support and strategic planning, yet current approaches often struggle to reconcile intricate operational analyses with overarching strategic goals across diverse market environments, leading to fragmented workflows and reduced collaboration across organizational levels. This paper introduces BusiAgent, a novel multi-agent framework leveraging LLMs for advanced decision-making in complex corporate environments. BusiAgent integrates three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi-level Stackelberg game to handle hierarchical decision processes. Additionally, contextual Thompson sampling is employed for prompt optimization, supported by a comprehensive quality assurance system to mitigate errors. Extensive empirical evaluations across diverse business scenarios validate BusiAgent’s efficacy, demonstrating its capacity to generate coherent, client-focused solutions that smoothly integrate granular insights with high-level strategy, significantly outperforming established approaches in both solution quality and user satisfaction. By fusing cutting-edge AI technologies with deep business insights, BusiAgent marks a substantial step forward in AI-driven enterprise decision-making, empowering organizations to navigate complex business landscapes more effectively.

[217] The Illusion of Readiness in Health AI

Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel CF Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Hao Cheng, HoHin Lee, Praneeth Sanapathi, Sarah Hilado, Tristan Naumann, Javier Alvarez-Valle, Jiang Bian, Mu Wei, Khalil Malik, Lidong Zhou, Jianfeng Gao, Eric Horvitz, Matthew P. Lungren, Doug Burger, Eric Topol, Hoifung Poon, Paul Vozila

Main category: cs.AI

TL;DR: Medical AI benchmarks show brittleness under adversarial testing, revealing significant gaps between leaderboard performance and real-world healthcare readiness.

DetailsMotivation: Despite impressive performance on medical benchmarks, large language models may have hidden weaknesses in multimodal reasoning and robustness that aren't captured by standard evaluations, raising concerns about their real-world healthcare applicability.

Method: The authors conduct adversarial stress tests on leading medical AI models, using simple adversarial transformations to probe robustness. They employ clinician-guided rubrics to analyze what popular medical benchmarks actually measure versus what they should measure for real-world healthcare applications.

Result: The study reveals significant brittleness: models can guess answers with key inputs removed, get confused by slight prompt alterations, and fabricate convincing but flawed reasoning. Medical benchmarks vary widely in what they truly measure, exposing competency gaps in frontier AI systems.

Conclusion: Current medical AI systems lack real-world readiness for healthcare applications. To earn trust in healthcare, AI must be held accountable for robustness, sound reasoning, and alignment with real medical demands, going beyond leaderboard performance metrics.

Abstract: Large language models have demonstrated remarkable performance in a wide range of medical benchmarks. Yet underneath the seemingly promising results lie salient growth areas, especially in cutting-edge frontiers such as multimodal reasoning. In this paper, we introduce a series of adversarial stress tests to systematically assess the robustness of flagship models and medical benchmarks. Our study reveals prevalent brittleness in the presence of simple adversarial transformations: leading systems can guess the right answer even with key inputs removed, yet may get confused by the slightest prompt alterations, while fabricating convincing yet flawed reasoning traces. Using clinician-guided rubrics, we demonstrate that popular medical benchmarks vary widely in what they truly measure. Our study reveals significant competency gaps of frontier AI in attaining real-world readiness for health applications. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold AI systems accountable to ensure robustness, sound reasoning, and alignment with real medical demands.

[218] MedRule-KG: A Knowledge-Graph–Steered Scaffold for Mathematical Reasoning with a Lightweight Verifier

Crystal Su

Main category: cs.AI

TL;DR: MedRule-KG combines a typed knowledge graph with symbolic verification to enforce mathematical/logical constraints in LLM reasoning, achieving perfect accuracy on FDA benchmark.

DetailsMotivation: LLMs often produce fluent but mathematically/logically incorrect reasoning, violating basic constraints despite appearing coherent.

Method: MedRule-KG: compact typed knowledge graph encoding entities/relations with domain-inspired rules, plus symbolic verifier that checks predictions and applies minimal corrections for consistency.
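
A minimal sketch of the check-and-minimally-correct pattern this describes; the rule and prediction formats are hypothetical, not the paper's actual rule set:

```python
# Each rule inspects a candidate answer and returns None (consistent) or a
# minimally corrected answer.
def verify(prediction, rules):
    for rule in rules:
        fix = rule(prediction)
        if fix is not None:  # violation found: apply the minimal correction
            prediction = fix
    return prediction

def max_dose_rule(pred, limit=100.0):  # illustrative dosing-style rule
    if pred.get("dose_mg", 0.0) > limit:
        return {**pred, "dose_mg": limit}
    return None

print(verify({"dose_mg": 250.0}, [max_dose_rule]))  # {'dose_mg': 100.0}
```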

Result: On 90-example FDA benchmark: grounding in MedRule-KG improves EM from 0.767 to 0.900; adding verifier yields 1.000 EM and eliminates all rule violations.

Conclusion: MedRule-KG provides general scaffold for safe mathematical reasoning; code/data released for reproducibility.

Abstract: Large language models (LLMs) often produce fluent reasoning steps while violating simple mathematical or logical constraints. We introduce MedRule-KG, a compact typed knowledge graph coupled with a symbolic verifier, designed to enforce mathematically interpretable rules in reasoning tasks. MedRule-KG encodes entities, relations, and three domain-inspired rules, while the verifier checks predictions and applies minimal corrections to guarantee consistency. On a 90-example FDA-derived benchmark, grounding in MedRule-KG improves exact match (EM) from 0.767 to 0.900, and adding the verifier yields 1.000 EM while eliminating rule violations entirely. We demonstrate how MedRule-KG provides a general scaffold for safe mathematical reasoning, discuss ablations, and release code and data to encourage reproducibility.

[219] ReCode: Unify Plan and Action for Universal Granularity Control

Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yuyu Luo, Bang Liu, Chenglin Wu

Main category: cs.AI

TL;DR: ReCode is a recursive code generation paradigm that unifies planning and action in LLM-based agents by treating high-level plans as abstract functions that get recursively decomposed into primitive actions, enabling dynamic granularity control and generating rich training data.

DetailsMotivation: Current LLM-based agents lack the ability to operate fluidly across decision granularities like humans do. They enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization capabilities.

Method: ReCode unifies planning and action within a single code representation where high-level plans are treated as abstract placeholder functions. The agent recursively decomposes these functions into finer-grained sub-functions until reaching primitive actions, dissolving the rigid boundary between plan and action.
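
A toy sketch of the recursion this describes: a step is either a primitive action or an abstract placeholder function the model expands into sub-steps. The canned decomposer stands in for the LLM; all names are illustrative:

```python
PRIMITIVES = {
    "open_browser": lambda: print("open browser"),
    "type_query":   lambda: print("type query"),
    "read_result":  lambda: print("read result"),
}

def llm_decompose(step):
    # stand-in for the LLM call that rewrites a plan step into sub-functions
    return {"search_web": ["open_browser", "type_query", "read_result"]}[step]

def run(step, depth=0, max_depth=5):
    if step in PRIMITIVES:           # base case: an executable action
        return PRIMITIVES[step]()
    if depth >= max_depth:
        raise RuntimeError(f"could not ground step: {step}")
    for sub in llm_decompose(step):  # recursive case: expand, then recurse
        run(sub, depth + 1, max_depth)

run("search_web")  # open browser / type query / read result
```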

Result: Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training. The recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes.

Conclusion: Unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control in LLM-based agents, addressing a fundamental limitation in current agent architectures.

Abstract: Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.

[220] Toward Robust EEG-based Intention Decoding during Misarticulated Speech in Dysarthria

Ha-Na Jo, Jung-Sun Lee, Eunyeong Ko

Main category: cs.AI

TL;DR: EEG-based soft multitask learning framework improves dysarthria speech intention decoding by suppressing nonspecific spectral responses and aligning domains.

DetailsMotivation: Dysarthria impairs speech motor control, reducing intelligibility, but EEG-based communication support for dysarthric individuals remains limited. The study aims to address this gap by developing EEG-based assistive technology for people with speech impairments.

Method: Recorded EEG from one dysarthric participant during Korean speech tasks, labeling trials as correct/misarticulated. Used spectral analysis to identify neural patterns, then developed a soft multitask learning framework with maximum mean discrepancy-based alignment to suppress nonspecific spectral responses and enhance class discrimination.
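
The maximum mean discrepancy term in the alignment module has a compact form; a sketch of the biased RBF-kernel estimator such modules typically minimize (bandwidth and feature shapes hypothetical):

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two feature batches with an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

x = torch.randn(32, 64)  # e.g. features from correct trials
y = torch.randn(32, 64)  # e.g. features from misarticulated trials
alignment_loss = mmd_rbf(x, y)  # added to the soft multitask objective
```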

Result: Misarticulated trials showed elevated frontal-central delta/alpha power and reduced temporal gamma activity. The proposed model achieved F1-scores of 52.7% (correct) and 41.4% (misarticulated), improving by 2% and 11% over baseline, enabling more stable intention decoding despite articulation errors.

Conclusion: The study demonstrates the potential of EEG-based assistive systems for communication in language-impaired individuals, showing that neural patterns can be leveraged to decode speech intentions even when articulation fails.

Abstract: Dysarthria impairs motor control of speech, often resulting in reduced intelligibility and frequent misarticulations. Although interest in brain-computer interface technologies is growing, electroencephalogram (EEG)-based communication support for individuals with dysarthria remains limited. To address this gap, we recorded EEG data from one participant with dysarthria during a Korean automatic speech task and labeled each trial as correct or misarticulated. Spectral analysis revealed that misarticulated trials exhibited elevated frontal-central delta and alpha power, along with reduced temporal gamma activity. Building on these observations, we developed a soft multitask learning framework designed to suppress these nonspecific spectral responses and incorporated a maximum mean discrepancy-based alignment module to enhance class discrimination while minimizing domain-related variability. The proposed model achieved F1-scores of 52.7% for correct and 41.4% for misarticulated trials (an improvement of 2% and 11% over the baseline), demonstrating more stable intention decoding even under articulation errors. These results highlight the potential of EEG-based assistive systems for communication in language-impaired individuals.

[221] UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

Darvin Yi, Teng Liu, Mattie Terzolo, Lance Hasson, Ayan Sinha, Pablo Mendes, Andrew Rabinovich

Main category: cs.AI

TL;DR: UpBench is a dynamic benchmark using real Upwork jobs to evaluate LLM agents’ real-world competence, adaptability, and human collaboration capacity through expert-designed rubrics and financial outcome anchoring.

DetailsMotivation: Existing benchmarks are static, synthetic, or domain-limited, providing limited insight into how LLM agents perform in dynamic, economically meaningful environments and real-world work contexts.

Method: Uses real Upwork jobs as tasks, with expert freelancers decomposing jobs into detailed acceptance criteria and rubrics. Employs rubric-based evaluation with per-criterion feedback from human experts, with regular task refreshing to reflect evolving online work.

Result: Enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary metrics, providing a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts.

Conclusion: UpBench offers a path toward collaborative AI-human frameworks where AI amplifies human capability through partnership rather than replacement, with human expertise integrated throughout the evaluation pipeline for professional fidelity.

Abstract: As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.

[222] MedRule-KG: A Knowledge-Graph–Steered Scaffold for Reliable Mathematical and Biomedical Reasoning

Crystal Su

Main category: cs.AI

TL;DR: MedRule-KG is a knowledge-graph scaffold with a verifier that steers LLM generation toward valid scientific outputs, reducing violations by 83.2% while improving exact match across 90 drug discovery tasks.

DetailsMotivation: To impose domain-consistent structure on LLMs used for scientific reasoning and early-stage drug discovery, ensuring mathematically and biomedically valid outputs.

Method: MedRule-KG combines a compact knowledge-graph scaffold with a lightweight verifier that injects curated symbolic facts into prompts and enforces rule satisfaction with deterministic checking. Formalizes generation as constrained inference with soft guidance surrogate for decoding.
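
One plausible reading of "generation as constrained inference" with a soft guidance surrogate (the paper's exact objective may differ): with $r_j(y) \ge 0$ counting violations of rule $j$,

```latex
\hat{y} = \arg\max_{y} \; \log p_\theta(y \mid x)
\;\; \text{s.t.} \;\; r_j(y) = 0 \;\; \forall j
\qquad \Longrightarrow \qquad
\hat{y}_{\text{soft}} = \arg\max_{y} \; \log p_\theta(y \mid x) - \lambda \sum_{j} r_j(y)
```

and the deterministic checker then enforces the hard constraints on the final output.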

Result: Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, reduces violation counts by 83.2% relative to strong chain-of-thought baseline while improving exact match. Results remain stable under stratification and scale with dataset size.

Conclusion: The approach is practical for interactive design as the verifier adds negligible latency, effectively steering LLM generation toward valid scientific outputs in drug discovery applications.

Abstract: We study how to impose domain-consistent structure on large language models (LLMs) used for scientific reasoning and early-stage drug discovery. We present MedRule-KG, a compact knowledge-graph scaffold paired with a lightweight verifier that steers generation toward mathematically and biomedically valid outputs. The system injects curated symbolic facts into prompts and then enforces rule satisfaction with a deterministic checker. We formalize generation as constrained inference, introduce a soft guidance surrogate suitable for decoding, and perform a thorough statistical analysis with uncertainty quantification. Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, MedRule-KG reduces violation counts by 83.2% relative to a strong chain-of-thought baseline while improving exact match. Results remain stable under stratification and scale with dataset size, and the verifier adds negligible latency, making the approach practical for interactive design.

[223] Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression

Qiming Bao, Xiaoxuan Fu

Main category: cs.AI

TL;DR: LLMs show strong base performance but degrade under essential rule deletion, contradictions, and some multi-law logical transformations, revealing brittleness in logical reasoning.

DetailsMotivation: To systematically evaluate LLMs' reasoning reliability under structured perturbations of logical rule systems, going beyond surface-form variation to test logical generalization capabilities.

Method: Controlled evaluation framework with four stress tests: (1) rule deletion (redundant vs essential), (2) contradictory evidence injection, (3) logic-preserving rewrites using equivalence laws, and (4) multi-law equivalence stacking (2-5 transformations). Tested on BERT, Qwen2, and LLaMA-like models.
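
For reference, the single-law rewrites in test (3) are the standard propositional equivalences:

```latex
\begin{aligned}
p \to q \;&\equiv\; \lnot q \to \lnot p              && \text{(contraposition)} \\
\lnot\lnot p \;&\equiv\; p                           && \text{(double negation)} \\
p \to q \;&\equiv\; \lnot p \lor q                   && \text{(implication-to-disjunction)} \\
\lnot (p \land q) \;&\equiv\; \lnot p \lor \lnot q   && \text{(De Morgan)} \\
p \land \top \;&\equiv\; p                           && \text{(identity)} \\
p \land q \;&\equiv\; q \land p                      && \text{(commutativity)}
\end{aligned}
```

Test (4) stacks 2 to 5 of these transformations on a single rule set.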

Result: All models achieve perfect accuracy on base tasks. No degradation under redundant rule deletion. Essential rule deletion drops performance to near-chance. Contradictions reduce accuracy to 0.0000. Single-law transformations largely preserve accuracy with minor degradations. Multi-law stacking shows model-dependent sensitivity: BERT matches base, TinyLlama shows marginal degradation, Qwen2 shows substantial drop.

Conclusion: Contemporary LLMs are generally stable under semantic-preserving reformulations but remain brittle to missing/inconsistent evidence and may degrade under composed logical transformations depending on model family. The framework provides a diagnostic tool for evaluating logical generalization.

Abstract: Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests: (1) rule deletion, removing redundant versus essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites based on equivalence laws (contraposition, double negation, implication-to-disjunction, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that composes 2–5 transformations. Across three representative model families – BERT, Qwen2, and LLaMA-like models – all models attain Acc$=1.0000$ on the base split and show no degradation under redundant rule deletion. In contrast, essential rule deletion yields a pronounced decrease to near-chance performance, and injecting explicit contradictions reduces accuracy to 0.0000. Under logic-preserving rewrites, accuracy is largely preserved for single-law transformations with only small degradations in a few cases, whereas multi-law stacking exposes model-dependent sensitivity: BERT matches the base condition, TinyLlama shows only marginal degradation, and Qwen2 exhibits a substantial drop. Overall, the results indicate that contemporary LLMs are generally stable under semantic-preserving reformulations, yet remain brittle to missing or inconsistent evidence and may degrade under composed logical transformations depending on the model family. The proposed framework provides a concise diagnostic tool for isolating these failure modes and for evaluating logical generalisation beyond surface-form variation.
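
The four stress tests lend themselves to simple programmatic perturbations. A minimal sketch, assuming rules are encoded as plain "A -> B" strings (a simplification of the paper's multi-step inference chains):

```python
# Illustrative perturbations in the spirit of the four stress tests.
def delete_rule(rules: list[str], target: str) -> list[str]:
    """Rule deletion: drop one rule (redundant or essential) from the chain."""
    return [r for r in rules if r != target]

def inject_contradiction(facts: list[str], fact: str) -> list[str]:
    """Contradictory evidence: assert a fact and its negation simultaneously."""
    return facts + [fact, f"not {fact}"]

def contrapose(rule: str) -> str:
    """Logic-preserving rewrite: 'A -> B' becomes 'not B -> not A'."""
    a, b = [s.strip() for s in rule.split("->")]
    return f"not {b} -> not {a}"

base = ["rain -> wet", "wet -> slippery"]
print(contrapose(base[0]))                    # not wet -> not rain
print(delete_rule(base, "wet -> slippery"))   # essential deletion breaks the chain
```

Multi-law stacking then amounts to composing 2-5 such logic-preserving rewrites before querying the model.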

[224] Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin, Sen Hu, Zhenheng Tang, Huacan Wang, Ronghao Chen

Main category: cs.AI

TL;DR: Octopus introduces a new multimodal reasoning paradigm with six orchestrated capabilities that can autonomously explore diverse reasoning pathways and dynamically adapt to task requirements, outperforming existing methods on most benchmark tasks.

DetailsMotivation: Existing multimodal reasoning models have fundamental architectural limitations: they lack the human-like ability to autonomously explore diverse reasoning pathways and struggle to adapt to dynamically changing capability requirements in real-world tasks. Humans exhibit complementary thinking abilities that current methods only partially cover.

Method: Proposes Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm that defines six core capabilities essential for multimodal reasoning and organizes a comprehensive evaluation benchmark called Octopus-Bench. The system can autonomously explore during reasoning and dynamically select the most appropriate capability based on current state.

Result: Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.

Conclusion: The proposed six-capability orchestration paradigm demonstrates superior performance by enabling autonomous exploration and dynamic capability selection, addressing fundamental limitations of existing multimodal reasoning approaches through better coordination of complementary reasoning abilities.

Abstract: Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways-whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.

[225] ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, Jian Luan

Main category: cs.AI

TL;DR: ICPO improves RLVR for LLM reasoning by using intrinsic confidence from generation probabilities to calculate preference advantage scores, addressing issues like coarse-grained rewards, noise, and inefficient exploration.

DetailsMotivation: Existing RLVR methods for enhancing LLM reasoning suffer from coarse-grained rewards, reward noise, and inefficient exploration, leading to unstable training and entropy collapse.

Method: ICPO calculates preference advantage scores by comparing relative generation probabilities of multiple responses under the same prompt, integrating these with verifiable rewards to guide exploration.

Result: ICPO alleviates the issues of coarse-grained rewards and reward noise, curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents overfitting to specific strategies.

Conclusion: Comprehensive experiments across four general-domain and three mathematical benchmarks show ICPO steadily boosts reasoning compared to GRPO.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.
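
A minimal sketch of how a group-relative preference advantage can be derived from generation probabilities and mixed with a verifiable reward; the length normalization and the mixing weight `beta` are illustrative assumptions rather than ICPO's exact formulation:

```python
import numpy as np

def icpo_advantage(logprobs, lengths, rewards, beta=0.5):
    """Combine a GRPO-style reward advantage with an intrinsic-confidence term."""
    logprobs = np.asarray(logprobs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    conf = logprobs / np.asarray(lengths)                       # per-token confidence
    pref = (conf - conf.mean()) / (conf.std() + 1e-8)           # group-relative preference
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # verifiable-reward advantage
    return adv + beta * pref

# Four sampled responses to one prompt; reward 1 = verified correct.
print(icpo_advantage([-42.0, -35.5, -60.2, -38.0], [30, 28, 41, 29], [1, 1, 0, 1]))
```

The preference term lifts undervalued but confidently generated correct responses and penalizes overconfident wrong ones, which is the behavior the Result above attributes to the score.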

[226] TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models

Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi

Main category: cs.AI

TL;DR: TRACE is a framework for transparent reasoning evaluation that diagnoses reasoning trajectories using auxiliary reasoning sets, exposing failures missed by standard final-answer evaluation.

DetailsMotivation: Current large vision-language models struggle with reliable mathematical and scientific reasoning, and standard final-answer evaluation masks reasoning errors, allowing silent failures to persist without detection.

Method: TRACE uses Auxiliary Reasoning Sets (ARS) - compact sub question-answer pairs that decompose complex problems. It evaluates intermediate steps through consistency-based metrics and diagnoses reasoning trajectories rather than only end results.

Result: Experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint where reasoning failures occur. TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths.

Conclusion: TRACE offers actionable signals for model improvement by exposing reasoning failures overlooked by standard evaluation, supporting effective filtering, debugging, and model refinement through transparent reasoning analysis.

Abstract: Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets, compact sub question answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
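
A minimal sketch of an ARS-style consistency score over auxiliary sub question-answer pairs; exact-match agreement and the threshold `tau` are simplifying assumptions, not the paper's metric:

```python
def ars_consistency(sub_answers: list[str], references: list[str]) -> float:
    """Fraction of auxiliary sub-questions the model answers consistently."""
    agree = sum(a.strip().lower() == r.strip().lower()
                for a, r in zip(sub_answers, references))
    return agree / max(len(references), 1)

def in_confidence_region(consistency: float, tau: float = 0.8) -> bool:
    """Treat reasoning trajectories above a consistency threshold as reliable."""
    return consistency >= tau
```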

[227] SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer, Petr Motlicek

Main category: cs.AI

TL;DR: SDialog is an open-source Python toolkit that provides an end-to-end framework for building and analyzing LLM-based conversational agents, integrating dialog generation, evaluation, and mechanistic interpretability.

DetailsMotivation: To create a unified framework that addresses the fragmented nature of conversational AI development by combining generation, evaluation, and interpretability tools into a single system, enabling more systematic research and development of dialog systems.

Method: Built around a standardized Dialog representation, SDialog provides: 1) persona-driven multi-agent simulation with composable orchestration, 2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, 3) mechanistic interpretability tools for activation inspection and steering, and 4) audio generation with full acoustic simulation.

Result: SDialog integrates with all major LLM backends and enables mixed-backend experiments under a unified API, providing researchers with a comprehensive toolkit for building, benchmarking, and understanding conversational systems.

Conclusion: SDialog offers a systematic approach to conversational AI research by coupling generation, evaluation, and interpretability in a dialog-centric architecture, making it easier to build, analyze, and understand LLM-based conversational agents.

Abstract: We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized \texttt{Dialog} representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

[228] CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

Yakun Zhu, Zhongzhen Huang, Qianhan Feng, Linjie Mu, Yannian Gu, Shaoting Zhang, Qi Dou, Xiaofan Zhang

Main category: cs.AI

TL;DR: CP-Env is a controllable agentic hospital environment that evaluates LLMs across end-to-end clinical pathways, revealing that most models struggle with pathway complexity despite excessive reasoning steps.

DetailsMotivation: Current benchmarks focus on static exams or isolated dialogues, which inadequately evaluate LLMs in dynamic clinical scenarios involving complex decision-making and transitions between different stages of care.

Method: CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios from triage to multidisciplinary team meetings, following real hospital adaptive flow with branching, long-horizon task execution. A three-tiered evaluation framework (Clinical Efficacy, Process Competency, Professional Ethics) is proposed.

Result: Most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Excessive reasoning steps can be counterproductive, while top models show reduced tool dependency through internalized knowledge.

Conclusion: CP-Env advances medical AI agent development through comprehensive end-to-end clinical evaluation, providing a benchmark and evaluation tools for further research.

Abstract: Medical care follows complex clinical pathways that extend beyond isolated physician-patient encounters, emphasizing decision-making and transitions between different stages. Current benchmarks focusing on static exams or isolated dialogues inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP-Env, a controllable agentic hospital environment designed to evaluate LLMs across end-to-end clinical pathways. CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios ranging from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings for agent interaction. Following the adaptive flow of care in real hospitals, it enables branching, long-horizon task execution. We propose a three-tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to exhibit reduced tool dependency through internalized knowledge. CP-Env advances medical AI agent development through comprehensive end-to-end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/CP_ENV.

[229] Neuronal Attention Circuit (NAC) for Representation Learning

Waleed Razzaq, Izis Kanjaraway, Yun-Bo Zhao

Main category: cs.AI

TL;DR: NAC (Neuronal Attention Circuit) is a continuous-time attention mechanism that reformulates attention logits as solutions to linear ODEs using biologically-inspired sparse gating, enabling efficient adaptive dynamics with theoretical guarantees and competitive performance across domains.

DetailsMotivation: Attention mechanisms improve representation learning but their discrete nature limits continuous-time modeling. The authors aim to develop a biologically plausible continuous-time attention mechanism that can handle irregular time-series data while maintaining efficiency.

Method: NAC reformulates attention logits computation as solving a linear first-order ODE with nonlinear interlinked gates inspired by C. elegans neuronal circuits. It uses sparse sensory gates for key-query projections and a sparse backbone network with two heads for content-target and learnable time-constant gates. Supports three computation modes: explicit Euler integration, exact closed-form solution, and steady-state approximation. Includes sparse Top-K pairwise concatenation for memory efficiency.

Result: NAC matches or outperforms competing baselines in accuracy across irregular time-series classification, autonomous vehicle lane-keeping, and industrial prognostics. It occupies an intermediate position in runtime and memory efficiency compared to other continuous-time baselines.

Conclusion: NAC provides a biologically plausible, theoretically grounded continuous-time attention mechanism with strong empirical performance, bridging the gap between discrete attention and continuous-time modeling while maintaining computational efficiency.

Abstract: Attention improves representation learning over RNNs, but its discrete nature limits continuous-time (CT) modeling. We introduce Neuronal Attention Circuit (NAC), a novel, biologically plausible CT-Attention mechanism that reformulates attention logits computation as the solution to a linear first-order ODE with nonlinear interlinked gates derived from repurposing \textit{C. elegans} Neuronal Circuit Policies (NCPs) wiring mechanism. NAC replaces dense projections with sparse sensory gates for key-query projections and a sparse backbone network with two heads for computing \textit{content-target} and \textit{learnable time-constant} gates, enabling efficient adaptive dynamics. NAC supports three attention logit computation modes: (i) explicit Euler integration, (ii) exact closed-form solution, and (iii) steady-state approximation. To improve memory intensity, we implemented a sparse Top-\emph{K} pairwise concatenation scheme that selectively curates key-query interactions. We provide rigorous theoretical guarantees, including state stability, bounded approximation errors, and universal approximation. Empirically, we implemented NAC in diverse domains, including irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics. We observed that NAC matches or outperforms competing baselines in accuracy and occupies an intermediate position in runtime and memory efficiency compared with several CT baselines.
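
The three logit-computation modes follow directly from the linear first-order ODE. A minimal sketch, writing the dynamics as dz/dt = (g - z) / tau with a content-target gate g and learnable time constant tau (plain arrays here, standing in for the sparse NCP-derived gate networks of the paper):

```python
import numpy as np

def euler_logits(z0, g, tau, t, steps=16):
    """Mode (i): explicit Euler integration of dz/dt = (g - z) / tau."""
    z, dt = z0.copy(), t / steps
    for _ in range(steps):
        z = z + dt * (g - z) / tau
    return z

def closed_form_logits(z0, g, tau, t):
    """Mode (ii): exact solution of the linear ODE."""
    return g + (z0 - g) * np.exp(-t / tau)

def steady_state_logits(g):
    """Mode (iii): the t -> infinity limit, z* = g."""
    return g

z0, g, tau = np.zeros(4), np.array([1.0, -0.5, 2.0, 0.3]), np.full(4, 0.5)
print(closed_form_logits(z0, g, tau, t=1.0))
print(euler_logits(z0, g, tau, t=1.0))   # converges toward the closed form
```

The closed form makes the stability guarantee plausible: for positive tau the logits decay exponentially toward the gate value rather than diverging.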

[230] EpiPlanAgent: Agentic Automated Epidemic Response Planning

Kangkun Mao, Fang Xu, Jinru Ding, Yidong Jiang, Yujun Yao, Yirong Chen, Junming Liu, Xiaoqin Wu, Qian Wu, Xiaoyan Huang, Jie Xu

Main category: cs.AI

TL;DR: EpiPlanAgent: An agent-based LLM system that automates digital emergency response plan generation and validation for epidemics, improving completeness and reducing development time.

DetailsMotivation: Traditional epidemic response planning relies on labor-intensive manual methods, creating a need for automated, scalable solutions to improve public health preparedness.

Method: Multi-agent framework using LLMs with task decomposition, knowledge grounding, and simulation modules, tested by public health professionals with real-world outbreak scenarios.

Result: Significantly improved plan completeness and guideline alignment while drastically reducing development time; expert evaluation confirmed high consistency with human-authored content.

Conclusion: EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.

Abstract: Epidemic response planning is essential yet traditionally reliant on labor-intensive manual methods. This study aimed to design and evaluate EpiPlanAgent, an agent-based system using large language models (LLMs) to automate the generation and validation of digital emergency response plans. The multi-agent framework integrated task decomposition, knowledge grounding, and simulation modules. Public health professionals tested the system using real-world outbreak scenarios in a controlled evaluation. Results demonstrated that EpiPlanAgent significantly improved the completeness and guideline alignment of plans while drastically reducing development time compared to manual workflows. Expert evaluation confirmed high consistency between AI-generated and human-authored content. User feedback indicated strong perceived utility. In conclusion, EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.

[231] Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation

Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang

Main category: cs.AI

TL;DR: Training-free LLM agent architecture for zero-shot PCG parameter configuration using Actor-Critic agents to bridge semantic gap between natural language instructions and technical parameters.

DetailsMotivation: PCG tools require precise configuration of opaque technical parameters, and while LLMs promise natural language interfaces, off-the-shelf models fail to bridge the semantic gap between abstract user instructions and strict parameter specifications.

Method: Proposes a training-free architecture with Actor and Critic LLM agents that work iteratively: Actor reasons over tool parameters, Critic refines configurations to align with human design preferences through autonomous reasoning.

Result: Outperforms single-agent baselines, produces diverse and structurally valid 3D environments from natural language descriptions, establishes new benchmark for instruction-following in PCG.

Conclusion: Off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools without task-specific fine-tuning, shifting burden from model training to architectural reasoning for scalable mastery of complex software.

Abstract: Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
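
A minimal sketch of the Actor-Critic refinement loop; the `actor_llm` and `critic_llm` callables, the JSON parameter schema, and the "OK" stopping convention are hypothetical stand-ins for the paper's agents:

```python
import json

def configure_pcg(instruction: str, schema: dict, actor_llm, critic_llm, rounds=3):
    """Actor proposes PCG parameters; Critic critiques until aligned or out of rounds."""
    params = json.loads(actor_llm(
        f"Instruction: {instruction}\nSchema: {schema}\n"
        "Return a JSON parameter configuration."))
    for _ in range(rounds):
        critique = critic_llm(
            f"Instruction: {instruction}\nParams: {params}\n"
            "Point out mismatches with the design intent, or say OK.")
        if critique.strip() == "OK":
            break
        params = json.loads(actor_llm(
            f"Revise the parameters.\nCritique: {critique}\nCurrent: {params}"))
    return params
```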

[232] Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen

Main category: cs.AI

TL;DR: InternGeometry is an LLM agent that achieves medalist-level performance on IMO geometry problems using iterative proposition generation, symbolic verification, and reinforcement learning with minimal training data.

DetailsMotivation: Current AI geometry problem-solving is dominated by expert models like AlphaGeometry 2 that require massive data synthesis and search. LLMs have strong mathematical reasoning but lack good heuristics for auxiliary constructions in geometry. The authors aim to build the first medalist-level LLM agent for geometry.

Method: InternGeometry uses iterative proposition generation and auxiliary construction proposals, verified by a symbolic engine. It incorporates reflection on engine feedback and dynamic memory for extensive interactions (200+ per problem). Complexity-Boosting Reinforcement Learning (CBRL) gradually increases problem complexity during training.

Result: Solves 44/50 IMO geometry problems (2000-2024), exceeding average gold medalist score (40.9). Achieves this with only 13K training examples (0.004% of AlphaGeometry 2’s data). Can propose novel auxiliary constructions not found in human solutions.

Conclusion: Demonstrates LLM agents’ potential for expert-level geometry tasks with minimal training data. The approach overcomes heuristic limitations through iterative verification and feedback. Model, data, and symbolic engine will be released to support future research.

Abstract: Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine’s feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.
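
A minimal sketch of the propose-verify-reflect loop against a symbolic engine; the `llm_propose` and `engine` interfaces are hypothetical, and the list-based memory abstracts the paper's dynamic memory mechanism:

```python
def solve_geometry(problem: str, llm_propose, engine, max_steps: int = 200):
    """Iteratively propose propositions/constructions, verify, and reflect."""
    memory: list[str] = []  # verified propositions and auxiliary constructions
    for _ in range(max_steps):
        proposal = llm_propose(problem, memory)        # proposition or construction
        ok, feedback = engine.check(problem, memory, proposal)
        if ok:
            memory.append(proposal)
            if engine.goal_proved(problem, memory):
                return memory                          # complete verified proof trace
        else:
            memory.append(f"[failed] {proposal}: {feedback}")  # reflect on feedback
    return None
```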

[233] Unified Smart Factory Model: A model-based Approach for Integrating Industry 4.0 and Sustainability for Manufacturing Systems

Ishaan Kaushal, Amaresh Chakrabarti

Main category: cs.AI

TL;DR: USFM is a framework that translates sustainability goals into measurable factory indicators using MBSE to model manufacturing activities, demonstrated through PCB assembly case study.

DetailsMotivation: There's a gap between high-level sustainability goals and practical factory implementation. Industries, especially SMEs, need systematic ways to translate sustainability targets into measurable indicators and data collection processes.

Method: Developed Unified Smart Factory Model (USFM) using Object Process Methodology (MBSE language) to model manufacturing activities as processes. Integrated Manufacturing Process and System, Data Process, and KPI Selection/Assessment. Demonstrated through PCB assembly factory case study.

Result: Successfully demonstrated how environmental sustainability KPIs (energy consumption, environmental impact) can be selected, modeled, and mapped to necessary data. The systematic approach reduces redundancy, minimizes risk of missing critical information, and enhances data collection.

Conclusion: USFM bridges the gap between sustainability goals and practical implementation, providing significant benefits for industries, particularly SMEs aiming to achieve sustainability targets through systematic information mapping.

Abstract: This paper presents the Unified Smart Factory Model (USFM), a comprehensive framework designed to translate high-level sustainability goals into measurable factory-level indicators with a systematic information map of manufacturing activities. The manufacturing activities were modelled as a set of manufacturing, assembly, and auxiliary processes using Object Process Methodology, a Model Based Systems Engineering (MBSE) language. USFM integrates Manufacturing Process and System, Data Process, and Key Performance Indicator (KPI) Selection and Assessment in a single framework. Through a detailed case study of a Printed Circuit Board (PCB) assembly factory, the paper demonstrates how environmental sustainability KPIs can be selected, modelled, and mapped to the necessary data, highlighting energy consumption and environmental impact metrics. The model’s systematic approach can reduce redundancy, minimize the risk of missing critical information, and enhance data collection. The paper concludes that USFM bridges the gap between sustainability goals and practical implementation, providing significant benefits for industries, specifically SMEs, aiming to achieve sustainability targets.

[234] HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition

Wang Lu, Yao Zhu, Jindong Wang

Main category: cs.AI

TL;DR: HAROOD is a comprehensive benchmark for human activity recognition in out-of-distribution settings, evaluating 16 methods across 4 OOD scenarios using 6 datasets to assess OOD algorithm effectiveness for HAR.

DetailsMotivation: Current HAR research lacks comprehensive evaluation of OOD algorithms across different distribution shift scenarios (cross-person, device, environment, time), making it unclear which methods work best and whether OOD approaches are necessary for HAR.

Method: Proposed HAROOD benchmark with 4 defined OOD scenarios (cross-person, cross-position, cross-dataset, cross-time), built testbed covering 6 datasets, implemented 16 comparative methods using CNN-based and Transformer-based architectures, and established two model selection protocols.

Result: Extensive experiments revealed no single OOD method consistently outperforms others across all scenarios, indicating substantial room for improvement in OOD-based HAR research.

Conclusion: HAROOD provides a modular, extensible benchmark to facilitate OOD-based HAR research, highlighting the need for more robust methods that can handle diverse distribution shifts in real-world activity recognition scenarios.

Abstract: Sensor-based human activity recognition (HAR) mines activity patterns from time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD necessary for HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope to facilitate the research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.
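
A minimal sketch of one of the four scenarios, a cross-person OOD split in which the held-out subject is entirely unseen during training; the sample fields are hypothetical:

```python
def cross_person_split(samples: list[dict], test_person: str):
    """Hold every sample from the target subject out of training."""
    train = [s for s in samples if s["person"] != test_person]
    test = [s for s in samples if s["person"] == test_person]
    return train, test

data = [{"person": "p1", "x": [0.1], "y": "walk"},
        {"person": "p2", "x": [0.7], "y": "run"}]
train, test = cross_person_split(data, "p2")  # p2 is unseen at training time
```

Cross-position, cross-dataset, and cross-time splits follow the same pattern with the grouping key swapped out.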

cs.SD

[235] The TCG CREST – RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge

Nikhil Raghav, Arnab Banerjee, Janojit Chakraborty, Avisek Gupta, Swami Punyeshwarananda, Md Sahidullah

Main category: cs.SD

TL;DR: The paper presents a multilingual audio processing pipeline for speaker diarization, identification, transcription and translation, focusing on real-world applicability in low-resource multilingual/code-mixed scenarios.

DetailsMotivation: To address the NCIIPC Startup India AI Grand Challenge Problem Statement 06 on language-agnostic speaker identification and diarization with transcription/translation, and study real-world applicability of in-house speaker diarization systems.

Method: Developed integrated pipeline with: robust VAD technique, fine-tuned speaker embedding models for low-resource settings, multi-kernel consensus spectral clustering framework for diarization, plus speaker/language identification, ASR, and neural machine translation modules with post-processing refinements.

Result: The multi-kernel consensus spectral clustering framework substantially improved diarization performance across all recordings in the training corpus provided by the organizers.

Conclusion: The integrated multilingual audio processing pipeline successfully addresses language-agnostic speaker identification and diarization challenges, demonstrating improved performance in multilingual and code-mixed scenarios through robust techniques and system integration.

Abstract: In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI Grand Challenge, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization, a critical component for multilingual and code-mixed scenarios. The main intent of this work was to study the real-world applicability of our in-house speaker diarization (SD) systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We leveraged our own recently proposed multi-kernel consensus spectral clustering framework, which substantially improved the diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation were integrated in the pipeline. Post-processing refinements further improved system robustness.
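
A minimal sketch of a consensus over a bank of kernel affinities before spectral clustering, in the spirit of the multi-kernel consensus framework the abstract mentions; the RBF kernel bank and the simple averaging consensus are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def consensus_diarize(embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """Cluster segment embeddings with a consensus of several RBF kernels."""
    d2 = ((embeddings[:, None] - embeddings[None, :]) ** 2).sum(-1)
    kernels = [np.exp(-d2 / (2 * s ** 2)) for s in (0.5, 1.0, 2.0)]  # kernel bank
    affinity = np.mean(kernels, axis=0)                  # consensus affinity matrix
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed")
    return sc.fit_predict(affinity)                      # speaker label per segment
```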

[236] Mitigation of multi-path propagation artefacts in acoustic targets with cepstral adaptive filtering

Lucas C. F. Domingos, Russell S. A. Brinkworth, Paulo E. Santos, Karl Sammut

Main category: cs.SD

TL;DR: Proposed adaptive cepstral filtering method separates target signals from reflections in spectrograms, improving SNR and classification performance in multi-path acoustic environments.

DetailsMotivation: Passive acoustic sensing for monitoring moving targets is hindered by multi-path reflections and motion artifacts. Existing filtering techniques don't properly incorporate environmental characteristics or account for medium property variability, limiting their ability to separate source and reflection components.

Method: Temporal filtering applied to cepstral coefficients using an adaptive band-stop filter that dynamically adjusts its bandwidth based on the relative intensity of quefrency components. This separates target signals from reflections in spectrograms.

Result: Improved SNR, log-spectral distance (LSD), and Itakura-Saito (IS) distance across velocities from 10 to 100 m/s in aircraft noise with simulated motion. Enhanced ship-type classification performance by 2.28 and 2.62 MCC percentage points for the DeepShip and VTUAD v2 datasets, respectively.

Conclusion: The method demonstrates potential to improve acoustic target classification and time-delay estimation in multi-path environments. Future work includes amplitude preservation and multi-sensor applications.

Abstract: Passive acoustic sensing is a cost-effective solution for monitoring moving targets such as vessels and aircraft, but its performance is hindered by complex propagation effects like multi-path reflections and motion-induced artefacts. Existing filtering techniques do not properly incorporate the characteristics of the environment or account for variability in medium properties, limiting their effectiveness in separating source and reflection components. This paper proposes a method for separating target signals from their reflections in a spectrogram. Temporal filtering is applied to cepstral coefficients using an adaptive band-stop filter, which dynamically adjusts its bandwidth based on the relative intensity of the quefrency components. The method improved the signal-to-noise ratio (SNR), log-spectral distance (LSD), and Itakura-Saito (IS) distance across velocities ranging from 10 to 100 metres per second in aircraft noise with simulated motion. It also enhanced the performance of ship-type classification in underwater tasks by 2.28 and 2.62 Matthews Correlation Coefficient percentage points for the DeepShip and VTUAD v2 datasets, respectively. These results demonstrate the potential of the proposed pipeline to improve acoustic target classification and time-delay estimation in multi-path environments, with future work aimed at amplitude preservation and multi-sensor applications.
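
A minimal sketch of quefrency-domain echo suppression: compute the real cepstrum of a frame, zero the quefrency bins whose relative energy exceeds a threshold, and resynthesize the magnitude spectrum. The fixed threshold here stands in for the paper's intensity-adaptive bandwidth rule:

```python
import numpy as np

def cepstral_bandstop(frame: np.ndarray, rel_thresh: float = 4.0) -> np.ndarray:
    """Suppress multi-path peaks in the real cepstrum of one frame."""
    n = len(frame)
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
    ceps = np.fft.irfft(log_mag, n=n)                  # real cepstrum
    half = n // 2
    rel = ceps[1:half] ** 2 / (np.mean(ceps[1:half] ** 2) + 1e-12)
    for q in np.where(rel > rel_thresh)[0] + 1:        # echo peaks in quefrency
        ceps[q] = ceps[n - q] = 0.0                    # band-stop liftering (symmetric)
    return np.exp(np.fft.rfft(ceps, n=n).real)         # echo-suppressed magnitude

rng = np.random.default_rng(0)
sig = rng.standard_normal(1024)
sig[64:] += 0.8 * sig[:-64]                            # simulated single reflection
clean_mag = cepstral_bandstop(sig)
```

A reflection with delay d shows up as a peak near quefrency d, so zeroing that band removes the comb-like ripple it imprints on the spectrum.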

[237] The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection

Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller

Main category: cs.SD

TL;DR: Proposes emotion-informed learning as a unified training framework for speech deepfake detection, leveraging emotion as a bridge between conventional features and emotion-oriented representations to improve performance and interpretability.

DetailsMotivation: Current speech deepfake detection uses diverse low-level acoustic features that lack unification and human interpretability. As deepfake generation improves, distinguishing real from synthetic speech becomes harder. Emotion remains a uniquely human attribute that deepfake generators struggle to replicate, providing a promising detection cue.

Method: A novel training framework that uses emotion as a bridge between conventional deepfake features and emotion-oriented representations. The approach leverages implicit correlations between existing acoustic/semantic features and emotional expression to create a unified training strategy.

Result: Consistent improvements on FakeOrReal (up to ~6% accuracy increase, ~4% EER reduction) and In-the-Wild datasets (up to ~2% accuracy increase, ~1% EER reduction), with comparable results on ASVspoof2019. Provides both performance gains and interpretable feature directions.

Conclusion: Emotion-informed learning offers an effective unified framework for speech deepfake detection that improves model performance while providing interpretable feature directions, addressing current limitations of disparate feature sets and lack of human-intuitive representations.

Abstract: Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select different feature sets, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a unique human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements in accuracy, up to approximately 6% and 2% increases, respectively, and in equal error rate (EER), showing reductions of up to about 4% and 1%, respectively, while achieving comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and interpretable feature direction for deepfake detection while improving model performance through emotion-informed learning.

[238] PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation

Longshen Ou, Ye Wang

Main category: cs.SD

TL;DR: PhraseVAE and PhraseLDM introduce phrase-level latent diffusion for full-song symbolic music generation, addressing long-sequence issues by compressing polyphonic music into compact phrase representations and generating complete songs in one pass.

DetailsMotivation: Existing symbolic music models suffer from extremely long sequences, limited context length, and weak support for long-range structure due to operating on note-attribute tokens.

Method: PhraseVAE compresses variable-length polyphonic note sequences into 64-dimensional phrase-level representations. PhraseLDM then uses latent diffusion on this space to generate entire multi-track songs in a single pass without autoregressive components.

Result: The system supports up to 128 bars (8 minutes), generates full songs within seconds with only 45M parameters, and produces coherent local texture, idiomatic instrument patterns, and clear global structure with competitive musical quality and diversity.

Conclusion: Phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation, encouraging future research to move beyond note-attribute tokens to phrase-level units as more musically meaningful modeling targets.

Abstract: This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses variable-length polyphonic note sequences into compact 64-dimensional phrase-level representations with high reconstruction fidelity, allowing efficient training and a well-structured latent space. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.

[239] Graph Embedding with Mel-spectrograms for Underwater Acoustic Target Recognition

Sheng Feng, Shuqing Ma, Xiaoqian Zhu

Main category: cs.SD

TL;DR: Proposes UATR-GTransformer, a non-Euclidean deep learning model combining Transformer and graph neural networks for underwater acoustic target recognition, achieving competitive performance on benchmark datasets.

DetailsMotivation: Underwater acoustic target recognition is challenging due to complex ship noise and ocean environments. Existing deep learning models assume Euclidean space, which is unsuitable for the inherently complex topology of underwater acoustic signals with non-stationary, non-Gaussian, and nonlinear characteristics.

Method: UATR-GTransformer integrates Transformer architectures with graph neural networks. It has three components: 1) Mel patchify block partitions Mel-spectrogram into overlapping patches, 2) GTransformer block uses Transformer Encoder to capture mutual information between patches to generate Mel-graph embeddings, then GNN enhances embeddings by modeling local neighborhood relationships, and 3) feed-forward network for feature transformation and classification.

Result: Experiments on two widely used benchmark datasets demonstrate that UATR-GTransformer achieves performance competitive with state-of-the-art methods. Interpretability analysis shows the model effectively extracts rich frequency-domain information.

Conclusion: The proposed non-Euclidean DL model successfully addresses limitations of Euclidean assumptions for underwater acoustic signals, showing potential for ocean engineering applications through its ability to capture complex topological relationships in acoustic data.

Abstract: Underwater acoustic target recognition (UATR) is extremely challenging due to the complexity of ship-radiated noise and the variability of ocean environments. Although deep learning (DL) approaches have achieved promising results, most existing models implicitly assume that underwater acoustic data lie in a Euclidean space. This assumption, however, is unsuitable for the inherently complex topology of underwater acoustic signals, which exhibit non-stationary, non-Gaussian, and nonlinear characteristics. To overcome this limitation, this paper proposes the UATR-GTransformer, a non-Euclidean DL model that integrates Transformer architectures with graph neural networks (GNNs). The model comprises three key components: a Mel patchify block, a GTransformer block, and a classification head. The Mel patchify block partitions the Mel-spectrogram into overlapping patches, while the GTransformer block employs a Transformer Encoder to capture mutual information between split patches to generate Mel-graph embeddings. Subsequently, a GNN enhances these embeddings by modeling local neighborhood relationships, and a feed-forward network (FFN) further performs feature transformation. Experimental results on two widely used benchmark datasets demonstrate that the UATR-GTransformer achieves performance competitive with state-of-the-art methods. In addition, interpretability analysis reveals that the proposed model effectively extracts rich frequency-domain information, highlighting its potential for applications in ocean engineering.

[240] Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation

Xinyi Tong, Yiran Zhu, Jishang Chen, Chunru Zhan, Tianle Wang, Sirui Zhang, Nian Liu, Tiezheng Ge, Duo Xu, Xin Jin, Feng Yu, Song-Chun Zhu

Main category: cs.SD

TL;DR: VeM is a latent music diffusion model that generates high-quality, semantically and rhythmically aligned background music for videos using hierarchical video parsing and beat synchronization mechanisms.

DetailsMotivation: Current video-to-music generation approaches suffer from incomplete video representation leading to weak alignment, and inadequate temporal/rhythmic correspondence, especially in beat synchronization.

Method: Uses hierarchical video parsing as a “music conductor,” modality-specific encoders with storyboard-guided cross-attention (SG-CAtt), and frame-level transition-beat aligner/adapter (TB-As) for dynamic beat synchronization with visual transitions.

Result: Experimental results demonstrate superiority over existing methods, particularly in semantic relevance and rhythmic precision.

Conclusion: VeM effectively addresses alignment challenges in video-to-music generation through comprehensive video parsing and precise beat synchronization mechanisms.

Abstract: Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address the challenges, we propose Video Echoed in Music (VeM), a latent music diffusion that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs a hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. Meanwhile, we introduce novel metrics tailored to the task. Experimental results demonstrate superiority, particularly in semantic relevance and rhythmic precision.

[241] Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models

Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, Xiangdong Wang

Main category: cs.SD

TL;DR: TimeAudio enhances Large Audio-Language Models with precise temporal localization and long audio understanding through temporal markers, absolute time-aware encoding, and segment-level token merging.

DetailsMotivation: Current LALMs struggle with timestamp understanding for temporal localization and are limited to short audio perception, restricting their capabilities on fine-grained audio tasks that require precise temporal reasoning.

Method: Three key innovations: (1) unique temporal markers for time-sensitive reasoning, (2) absolute time-aware encoding to ground acoustic features with time information, and (3) segment-level token merging to reduce audio token redundancy for efficient long audio processing.

Result: Strong performance across fine-grained tasks including dense captioning, temporal grounding, and timeline speech summarization, demonstrating robust temporal localization and reasoning capabilities.

Conclusion: TimeAudio successfully addresses the limitations of current LALMs by enabling precise temporal perception and long audio understanding, establishing a new approach for fine-grained audio-language tasks.

Abstract: Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, leading to constrained capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information. Moreover, to achieve end-to-end long audio understanding, we introduce a segment-level token merging module to substantially reduce audio token redundancy and enhance the efficiency of information extraction. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing audio datasets into a new dataset focused on temporal tasks and establish a series of metrics to evaluate the fine-grained performance. Evaluations show strong performance across a variety of fine-grained tasks, such as dense captioning, temporal grounding, and timeline speech summarization, demonstrating TimeAudio’s robust temporal localization and reasoning capabilities.
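
A minimal sketch of absolute time-aware encoding followed by segment-level token merging; the sinusoidal timestamp embedding, feature dimension, and merge factor are illustrative assumptions rather than TimeAudio's exact configuration:

```python
import numpy as np

def time_aware_encode(feats: np.ndarray, frame_sec: float, merge: int = 4):
    """Ground frame features in absolute time, then merge frames into segments."""
    n, d = feats.shape                                  # d assumed even
    t = np.arange(n) * frame_sec                        # absolute time per frame (s)
    freqs = 1.0 / (10000.0 ** (np.arange(d // 2) * 2.0 / d))
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(t[:, None] * freqs)            # sinusoidal timestamp code
    pe[:, 1::2] = np.cos(t[:, None] * freqs)
    x = feats + pe
    n_seg = n // merge
    return x[: n_seg * merge].reshape(n_seg, merge, d).mean(axis=1)  # token merging

tokens = time_aware_encode(np.random.randn(100, 64), frame_sec=0.02)
print(tokens.shape)  # (25, 64): 4x fewer tokens, easing long-audio processing
```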

[242] Diffusion-based Surrogate Model for Time-varying Underwater Acoustic Channels

Kexin Li, Mandar Chitre

Main category: cs.SD

TL;DR: StableUASim: A pre-trained conditional latent diffusion model for generating realistic underwater acoustic channel realizations, enabling data-efficient adaptation to new environments.

DetailsMotivation: Existing methods for modeling underwater acoustic channels have limitations: physics models require detailed environmental knowledge, while stochastic replay methods lack diversity and fail to generalize to unseen scenarios, reducing practical applicability.

Method: Proposes StableUASim, a pre-trained conditional latent diffusion surrogate model that captures stochastic dynamics of underwater acoustic channels using generative modeling. Features autoencoder latent representation for efficient analysis and supports conditional generation from specific measurements.

Result: Experimental results show StableUASim accurately reproduces key channel characteristics and communication performance, providing a scalable, data-efficient, and physically consistent surrogate model.

Conclusion: StableUASim offers a practical solution for underwater communication system design and machine learning applications by generating diverse, statistically realistic channel realizations with rapid adaptation to new environments using minimal data.

Abstract: Accurate modeling of time-varying underwater acoustic channels is essential for the design, evaluation, and deployment of reliable underwater communication systems. Conventional physics models require detailed environmental knowledge, while stochastic replay methods are constrained by the limited diversity of measured channels and often fail to generalize to unseen scenarios, reducing their practical applicability. To address these challenges, we propose StableUASim, a pre-trained conditional latent diffusion surrogate model that captures the stochastic dynamics of underwater acoustic communication channels. Leveraging generative modeling, StableUASim produces diverse and statistically realistic channel realizations, while supporting conditional generation from specific measurement samples. Pre-training enables rapid adaptation to new environments using minimal additional data, and the autoencoder latent representation facilitates efficient channel analysis and compression. Experimental results demonstrate that StableUASim accurately reproduces key channel characteristics and communication performance, providing a scalable, data-efficient, and physically consistent surrogate model for both system design and machine learning-driven underwater applications.

cs.LG

[243] Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering

Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, Xia Song

Main category: cs.LG

TL;DR: WebSTAR: A scalable data synthesis pipeline that transforms noisy computer use agent rollouts into reliable training data through step-level filtering and reasoning augmentation, achieving state-of-the-art performance on WebVoyager benchmark.

DetailsMotivation: Training computer use agents (CUAs) is difficult due to high GUI interaction costs and scarcity of quality trajectory data. Existing human demonstration datasets don't scale, while synthetic data from strong CUAs contains too many incorrect/suboptimal actions for effective imitation learning.

Method: Developed a scalable data synthesis pipeline with step-level filtering that evaluates individual actions to retain only correct steps, plus reasoning augmentation for improved planning. Created WebSTAR dataset (13.3K trajectories, 100K graded steps) from OpenAI’s computer-use-preview model. Also created WebSCORE dataset of graded step-level actions and trained StepRM, a 7B multimodal reward model distilled from o4-mini.
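
A minimal sketch of the step-level filtering idea; `grade_step` below is a hypothetical stand-in for the step grader (in the paper, grades come from o4-mini and later from the distilled StepRM):

```python
def grade_step(observation: str, action: str) -> float:
    """Hypothetical step grader returning a correctness score in [0, 1];
    in the actual pipeline this role is played by o4-mini / StepRM."""
    return 1.0 if "click" in action else 0.5   # placeholder heuristic

def filter_rollout(rollout, threshold=0.8):
    """Step-level filtering: keep only actions whose grade clears the
    threshold, so imitation learning never sees the noisy steps."""
    return [{"observation": obs, "action": act}
            for obs, act in rollout if grade_step(obs, act) >= threshold]

rollout = [("login page", "click #submit"), ("home page", "scroll down")]
print(filter_rollout(rollout))   # retains only the graded-correct step
```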

Result: Qwen-2.5-VL-Instruct models trained on WebSTAR achieved >15% improvement over SoTA open-source CUA model UI-TARS-1.5-7B on WebVoyager with only supervised finetuning. StepRM matches o4-mini’s grading quality while being far more efficient for deployment.

Conclusion: Step-level filtering is a key principle for scalable CUA training. The paper provides practical tools (WebSTAR, WebSCORE datasets and StepRM reward model) to advance robust and efficient computer use agents without human annotation.

Abstract: Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions constituting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI’s computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets (WebSTAR, WebSCORE) and a lightweight reward model (StepRM) as practical tools to advance robust and efficient CUAs.

[244] Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer’s Disease Diagnosis

Farica Zhuang, Dinara Aliyeva, Shu Yang, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen

Main category: cs.LG

TL;DR: MREF-AD: A Multimodal Regional Expert Fusion model for Alzheimer’s disease diagnosis using Mixture-of-Experts framework to adaptively fuse amyloid PET and MRI biomarkers across brain regions with interpretable region-specific weights.

DetailsMotivation: Current multimodal fusion approaches for AD diagnosis rely on simple feature concatenation, which cannot adaptively balance contributions of different biomarkers (amyloid PET and MRI) across brain regions, unlike clinical practice that integrates complementary information.

Method: Proposes MREF-AD: a Mixture-of-Experts framework that models meso-scale brain regions in each modality as independent experts, using two-level gating networks to learn subject-specific fusion weights for adaptive multimodal integration.
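
A toy sketch of the two-level gating idea, with regional linear experts and softmax gates over regions and modalities; all names and dimensions are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRegionalMoE(nn.Module):
    """Each (modality, region) pair is a tiny expert; two softmax gates
    produce subject-specific weights over regions, then over modalities."""
    def __init__(self, n_mod=2, n_reg=4, d=8, n_cls=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.ModuleList([nn.Linear(d, n_cls) for _ in range(n_reg)])
            for _ in range(n_mod)
        ])
        self.reg_gates = nn.ModuleList([nn.Linear(n_reg * d, n_reg)
                                        for _ in range(n_mod)])
        self.mod_gate = nn.Linear(n_mod * n_reg * d, n_mod)

    def forward(self, x):
        # x: (batch, n_mod, n_reg, d) regional features per modality
        b, n_mod, n_reg, d = x.shape
        mod_w = F.softmax(self.mod_gate(x.reshape(b, -1)), dim=-1)
        out = 0.0
        for m in range(n_mod):
            reg_w = F.softmax(self.reg_gates[m](x[:, m].reshape(b, -1)), dim=-1)
            mod_out = sum(reg_w[:, r:r + 1] * self.experts[m][r](x[:, m, r])
                          for r in range(n_reg))
            out = out + mod_w[:, m:m + 1] * mod_out
        return out  # (batch, n_cls) diagnosis logits

logits = ToyRegionalMoE()(torch.randn(3, 2, 4, 8))
```

The learned gate weights are what give the region- and modality-level interpretability: they can be read off per subject.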

Result: Using ADNI data, MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, demonstrating utility as a general framework for adaptive multimodal fusion.

Conclusion: MREF-AD offers both improved diagnostic performance and interpretable insights into how structural and molecular imaging jointly contribute to AD diagnosis, serving as a general framework for adaptive multimodal fusion in neuroimaging.

Abstract: Accurate and early diagnosis of Alzheimer’s disease (AD) can benefit from integrating complementary information from multiple modalities, mirroring clinical practice. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models each meso-scale brain region in each modality as an independent expert and employs two-level gating networks to learn subject-specific fusion weights. Beyond improving diagnostic performance, MREF-AD provides modality- and region-level insight into how structural and molecular imaging jointly contribute to disease diagnosis. Using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, underscoring its utility as a general framework for adaptive and interpretable multimodal fusion in neuroimaging.

[245] Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems

Matvey Nepomnyaschiy, Oleg Pereziabov, Anvar Tliamov, Stanislav Mikhailov, Ilya Afanasyev

Main category: cs.LG

TL;DR: A multi-agent framework for multimodal emotion recognition that uses autonomous agents for each modality encoder and fusion classifier, coordinated by a central supervisor, enabling modular integration, component replacement, and reduced computational overhead.

DetailsMotivation: Current multimodal deep learning models for emotion recognition are computationally intensive and inflexible to modality changes, making them difficult to maintain and adapt for human-agent interaction scenarios.

Method: Proposes a multi-agent framework where each modality encoder (vision, audio, text) and the fusion classifier operate as autonomous agents coordinated by a central supervisor, enabling modular integration and component replacement.

Result: Demonstrated feasibility through proof-of-concept implementation supporting vision, audio, and text modalities, with improved training efficiency and flexibility for modality integration.

Conclusion: The framework enables more flexible, scalable, and maintainable perception modules for embodied and virtual agents in human-agent interaction scenarios.

Abstract: Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models, leveraging facial expressions, speech, and textual cues, offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios.

[246] MoB: Mixture of Bidders

Dev Vyas

Main category: cs.LG

TL;DR: MoB replaces learned gating in MoE with VCG auctions where experts bid true costs (execution + forgetting), achieving stateless routing immune to catastrophic forgetting and 4.5× improvement over baselines.

DetailsMotivation: Mixture of Experts architectures suffer from catastrophic forgetting in continual learning because the learned gating network itself forgets previous tasks, limiting their applicability to continual learning scenarios.

Method: Replace learned gating networks with Vickrey-Clarke-Groves (VCG) auctions where experts bid their true cost (execution cost = predicted loss + forgetting cost = Elastic Weight Consolidation penalty). Experts compete for data batches through decentralized economic mechanisms. Extended with autonomous self-monitoring experts that detect knowledge consolidation boundaries.
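
A toy sketch of one auction round, assuming a single-item VCG mechanism (equivalent to a second-price auction) and zero initial Fisher information; helpers and setup are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

def ewc_penalty(expert, fisher, anchor, lam=1.0):
    """Forgetting cost: an Elastic Weight Consolidation penalty measuring
    how far the expert would move from parameters important to old tasks."""
    return lam * sum((f * (p - a) ** 2).sum()
                     for p, f, a in zip(expert.parameters(), fisher, anchor))

def run_auction(experts, x, y, loss_fn, ewc_terms):
    """Single-item VCG auction (= second-price): each expert bids its true
    cost, execution cost (predicted loss) plus forgetting cost (EWC), and
    the lowest bidder wins the batch at the second-lowest price."""
    with torch.no_grad():
        bids = [float(loss_fn(e(x), y) + ewc_penalty(e, f, a))
                for e, (f, a) in zip(experts, ewc_terms)]
    winner = min(range(len(bids)), key=bids.__getitem__)
    price = sorted(bids)[1] if len(bids) > 1 else bids[winner]
    return winner, price

experts = [nn.Linear(4, 2) for _ in range(3)]
# Zero Fisher information here, so forgetting costs start at zero (toy setup).
ewc_terms = [([torch.zeros_like(p) for p in e.parameters()],
              [p.detach().clone() for p in e.parameters()]) for e in experts]
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
winner, price = run_auction(experts, x, y, nn.CrossEntropyLoss(), ewc_terms)
```

Because the winner's payment depends only on the others' bids, truthful bidding is a dominant strategy, and no gating parameters exist to forget.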

Result: On Split-MNIST benchmarks: MoB achieves 88.77% average accuracy vs 19.54% for Gated MoE and 27.96% for Monolithic EWC (4.5× improvement over strongest baseline). Provides stateless routing immune to catastrophic forgetting, truthful bidding guaranteed by incentive compatibility, and emergent specialization without explicit task boundaries.

Conclusion: MoB framework successfully addresses catastrophic forgetting in MoE architectures through game-theoretic routing mechanisms, enabling effective continual learning with decentralized expert selection and autonomous task boundary detection.

Abstract: Mixture of Experts (MoE) architectures have demonstrated remarkable success in scaling neural networks, yet their application to continual learning remains fundamentally limited by a critical vulnerability: the learned gating network itself suffers from catastrophic forgetting. We introduce Mixture of Bidders (MoB), a novel framework that reconceptualizes expert routing as a decentralized economic mechanism. MoB replaces learned gating networks with Vickrey-Clarke-Groves (VCG) auctions, where experts compete for each data batch by bidding their true cost – a principled combination of execution cost (predicted loss) and forgetting cost (Elastic Weight Consolidation penalty). This game-theoretic approach provides three key advantages: (1) stateless routing that is immune to catastrophic forgetting, (2) truthful bidding guaranteed by dominant-strategy incentive compatibility, and (3) emergent specialization without explicit task boundaries. On Split-MNIST benchmarks, MoB achieves 88.77% average accuracy compared to 19.54% for Gated MoE and 27.96% for Monolithic EWC, representing a 4.5 times improvement over the strongest baseline. We further extend MoB with autonomous self-monitoring experts that detect their own knowledge consolidation boundaries, eliminating the need for explicit task demarcation.

[247] TECM*: A Data-Driven Assessment to Reinforcement Learning Methods and Application to Heparin Treatment Strategy for Surgical Sepsis

Jiang Liu, Yujie Li, Chan Zhou, Yihao Xie, Qilong Sun, Xin Shu, Peiwei Li, Chunyong Yang, Yiziting Zhu, Jiaqi Zhu, Yuwen Chen, Bo An, Hao Wu, Bin Yi

Main category: cs.LG

TL;DR: RL framework with continuous cxSOFA scoring and TECM evaluation optimizes heparin therapy for surgical sepsis patients, reducing mortality and hospital stay.

DetailsMotivation: Sepsis is life-threatening and requires optimized heparin therapy. Current SOFA scoring is discrete and lacks nuance for RL-based treatment optimization.

Method: Convert discrete SOFA to continuous cxSOFA for nuanced state/reward functions; define good/bad strategies stepwise; propose Treatment Effect Comparison Matrix (TECM) for evaluation; apply Q-Learning, DQN, DDQN, BCQ, CQL algorithms using MIMIC-IV and eICU data.
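
A heavily hedged sketch of the confusion-matrix analogy behind TECM: the step label ("good" when the continuous severity score falls) and the dose-agreement criterion below are illustrative assumptions, not the paper's stepwise definitions:

```python
import numpy as np

def tecm(cxsofa_before, cxsofa_after, policy_dose, observed_dose, tol=0.1):
    """Cross-tabulate step quality against policy/clinician agreement,
    in the spirit of a confusion matrix. The 'good step' rule (severity
    fell) and the dose-agreement tolerance are illustrative assumptions."""
    good = np.asarray(cxsofa_after) < np.asarray(cxsofa_before)
    agree = np.abs(np.asarray(policy_dose) - np.asarray(observed_dose)) <= tol
    m = np.zeros((2, 2), dtype=int)
    for g, a in zip(good, agree):
        m[int(g), int(a)] += 1
    return m  # rows: bad/good step; columns: policy disagrees/agrees

print(tecm([8.2, 7.5, 7.5], [7.5, 7.9, 7.0], [10, 12, 8], [10, 8, 8.05]))
```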

Result: cxSOFA-CQL model performed best: reduced mortality from 1.83% to 0.74% and average hospital stay from 11.11 to 9.42 days. TECM showed consistent outcomes across models.

Conclusion: RL framework enables interpretable, robust heparin therapy optimization. Continuous cxSOFA and TECM provide nuanced treatment assessment, promising improved clinical outcomes and decision-support reliability.

Abstract: Objective: Sepsis is a life-threatening condition caused by severe infection leading to acute organ dysfunction. This study proposes a data-driven metric and a continuous reward function to optimize personalized heparin therapy in surgical sepsis patients. Methods: Data from the MIMIC-IV v1.0 and eICU v2.0 databases were used for model development and evaluation. The training cohort consisted of abdominal surgery patients receiving unfractionated heparin (UFH) after postoperative sepsis onset. We introduce a new RL-based framework: first, converting the discrete SOFA score to a continuous cxSOFA for more nuanced state and reward functions; second, defining “good” or “bad” strategies based on cxSOFA in a stepwise manner; third, proposing a Treatment Effect Comparison Matrix (TECM), analogous to a confusion matrix for classification tasks, to evaluate the treatment strategies. We applied different RL algorithms, Q-Learning, DQN, DDQN, BCQ and CQL, to optimize the treatment and comprehensively evaluated the framework. Results: Among the AI-derived strategies, the cxSOFA-CQL model achieved the best performance, reducing mortality from 1.83% to 0.74% and the average hospital stay from 11.11 to 9.42 days. TECM demonstrated consistent outcomes across models, highlighting robustness. Conclusion: The proposed RL framework enables interpretable and robust optimization of heparin therapy in surgical sepsis. Continuous cxSOFA scoring and TECM-based evaluation provide nuanced treatment assessment, showing promise for improving clinical outcomes and decision-support reliability.

[248] Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning

Wei Duan, Jie Lu, En Yu, Junyu Xuan

Main category: cs.LG

TL;DR: BVME introduces variational message encoding for bandwidth-limited multi-agent RL, achieving comparable performance with 67-83% fewer message dimensions by learning Gaussian posteriors with KL regularization.

DetailsMotivation: Current graph-based MARL methods focus on learning sparse coordination graphs (who communicates) but don't address what information to transmit under hard bandwidth constraints. Naive dimensionality reduction degrades coordination performance, and deterministic projections lack control over compression.

Method: Bandwidth-constrained Variational Message Encoding (BVME) treats messages as samples from learned Gaussian posteriors, regularized via KL divergence to an uninformative prior. This variational framework provides principled, tunable control over compression strength through interpretable hyperparameters.
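
A minimal sketch of the message encoder, assuming the uninformative prior is a standard normal and a beta-weighted KL term controls compression; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class VariationalMessageEncoder(nn.Module):
    """Map an agent's hidden state to a low-dimensional stochastic
    message; a KL term to N(0, I) penalizes informative messages,
    giving a tunable knob on compression strength."""
    def __init__(self, d_hidden=64, d_msg=8):
        super().__init__()
        self.mu = nn.Linear(d_hidden, d_msg)
        self.log_var = nn.Linear(d_hidden, d_msg)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        msg = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterized
        # KL( N(mu, sigma^2) || N(0, I) ), summed over message dimensions
        kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(-1)
        return msg, kl

enc = VariationalMessageEncoder()
msg, kl = enc(torch.randn(5, 64))
# total_loss = marl_loss + beta * kl.mean()   # beta tunes the bandwidth penalty
```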

Result: Across SMACv1, SMACv2, and MPE benchmarks, BVME achieves comparable or superior performance while using 67-83% fewer message dimensions. Gains are most pronounced on sparse graphs where message quality critically impacts coordination. Ablations show U-shaped sensitivity to bandwidth.

Conclusion: BVME provides an effective solution for bandwidth-limited MARL by enabling selective encoding with principled compression control, excelling at extreme bandwidth ratios while adding minimal overhead to existing coordination graph methods.

Abstract: Graph-based multi-agent reinforcement learning (MARL) enables coordinated behavior under partial observability by modeling agents as nodes and communication links as edges. While recent methods excel at learning sparse coordination graphs (determining who communicates with whom), they do not address what information should be transmitted under hard bandwidth constraints. We study this bandwidth-limited regime and show that naive dimensionality reduction consistently degrades coordination performance. Hard bandwidth constraints force selective encoding, but deterministic projections lack mechanisms to control how compression occurs. We introduce Bandwidth-constrained Variational Message Encoding (BVME), a lightweight module that treats messages as samples from learned Gaussian posteriors regularized via KL divergence to an uninformative prior. BVME’s variational framework provides principled, tunable control over compression strength through interpretable hyperparameters, directly constraining the representations used for decision-making. Across SMACv1, SMACv2, and MPE benchmarks, BVME achieves comparable or superior performance while using 67–83% fewer message dimensions, with gains most pronounced on sparse graphs where message quality critically impacts coordination. Ablations reveal U-shaped sensitivity to bandwidth, with BVME excelling at extreme ratios while adding minimal overhead.

[249] MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax

Zhanpeng Chen, Weihao Gao, Shunyu Wang, Yanan Zhu, Hong Meng, Yuexian Zou

Main category: cs.LG

TL;DR: MolSculpt is a novel framework that sculpts 3D molecular geometries from chemical syntax by integrating a frozen 1D molecular foundation model with a 3D diffusion model, achieving SOTA performance in de novo and conditional 3D molecule generation.

DetailsMotivation: Prior methods using 1D representations like SELFIES fail to fully exploit rich chemical knowledge in 1D models, creating a disconnect between 1D syntactic generation and 3D geometric realization. There's a need to bridge this gap for more precise 3D molecular geometry generation crucial for drug discovery and material science.

Method: MolSculpt uses a frozen 1D molecular foundation model and a 3D molecular diffusion model. It introduces learnable queries to extract chemical knowledge from the foundation model, and a trainable projector injects this cross-modal information into the diffusion model’s conditioning space to guide 3D geometry generation through end-to-end optimization.
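
A small sketch of the query-extraction step, assuming standard cross-attention from learnable queries into the frozen model's hidden states; dimensions and module choices are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    """Learnable queries cross-attend into frozen 1D-model hidden states;
    a projector maps the result into the diffusion model's conditioning
    space. All sizes here are illustrative."""
    def __init__(self, d_lm=256, d_cond=128, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_lm))
        self.attn = nn.MultiheadAttention(d_lm, num_heads=4, batch_first=True)
        self.proj = nn.Linear(d_lm, d_cond)

    def forward(self, h_frozen):            # (batch, seq, d_lm) from frozen model
        q = self.queries.expand(h_frozen.size(0), -1, -1)
        out, _ = self.attn(q, h_frozen, h_frozen)
        return self.proj(out)               # (batch, n_queries, d_cond) conditioning

cond = QueryExtractor()(torch.randn(2, 40, 256))
```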

Result: MolSculpt achieves state-of-the-art performance in both de novo 3D molecule generation and conditional 3D molecule generation, showing superior 3D fidelity and stability on GEOM-DRUGS and QM9 datasets.

Conclusion: The proposed MolSculpt framework successfully bridges the gap between 1D chemical syntax and 3D geometric realization by deeply integrating 1D latent chemical knowledge into the 3D generation process, enabling more precise molecular geometry generation for drug discovery and material science applications.

Abstract: Generating precise 3D molecular geometries is crucial for drug discovery and material science. While prior efforts leverage 1D representations like SELFIES to ensure molecular validity, they fail to fully exploit the rich chemical knowledge entangled within 1D models, leading to a disconnect between 1D syntactic generation and 3D geometric realization. To bridge this gap, we propose MolSculpt, a novel framework that “sculpts” 3D molecular geometries from chemical syntax. MolSculpt is built upon a frozen 1D molecular foundation model and a 3D molecular diffusion model. We introduce a set of learnable queries to extract inherent chemical knowledge from the foundation model, and a trainable projector then injects this cross-modal information into the conditioning space of the diffusion model to guide the 3D geometry generation. In this way, our model deeply integrates 1D latent chemical knowledge into the 3D generation process through end-to-end optimization. Experiments demonstrate that MolSculpt achieves state-of-the-art (SOTA) performance in de novo 3D molecule generation and conditional 3D molecule generation, showing superior 3D fidelity and stability on both the GEOM-DRUGS and QM9 datasets. Code is available at https://github.com/SakuraTroyChen/MolSculpt.

[250] Memoryless Policy Iteration for Episodic POMDPs

Roy van Zuijlen, Duarte Antunes

Main category: cs.LG

TL;DR: New policy-iteration algorithms for POMDPs using memoryless/finite-memory policies with periodic improvement patterns, achieving computational efficiency and model-free learning.

DetailsMotivation: Memoryless and finite-memory policies are practical for POMDPs but extending classical policy iteration methods is difficult due to the non-Markovian output process and interdependent policy-improvement steps.

Method: Introduces a family of monotonically improving policy-iteration algorithms that alternate between single-stage output-based policy improvements and policy evaluations according to prescribed periodic patterns. Also develops a model-free variant that estimates values from data and learns memoryless policies directly.
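
A schematic sketch of the pattern-driven loop; the pattern string and the stage-cycling rule below are illustrative assumptions, since the paper's contribution is precisely the analysis of which periodic patterns are optimal:

```python
def run_pattern(pattern, improve_stage, evaluate, n_stages, n_periods):
    """Alternate single-stage policy improvements ('I') and policy
    evaluations ('E') according to a periodic pattern, cycling the
    improvement through the episode's stages."""
    stage = 0
    for _ in range(n_periods):
        for op in pattern:
            if op == "I":
                improve_stage(stage)           # improve policy at one stage
                stage = (stage + 1) % n_stages
            elif op == "E":
                evaluate()                     # refresh value estimates

run_pattern("IIE", lambda s: None, lambda: None, n_stages=4, n_periods=3)
```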

Result: Identifies optimal patterns that maximize computational-efficiency index and finds simplest pattern with minimal period. Achieves significant computational speedups over policy-gradient baselines and recent specialized algorithms in both model-based and model-free settings across several POMDP examples.

Conclusion: The proposed family of policy-iteration algorithms provides an effective approach for solving POMDPs with memoryless/finite-memory policies, offering computational efficiency advantages over existing methods in both model-based and model-free contexts.

Abstract: Memoryless and finite-memory policies offer a practical alternative for solving partially observable Markov decision processes (POMDPs), as they operate directly in the output space rather than in the high-dimensional belief space. However, extending classical methods such as policy iteration to this setting remains difficult; the output process is non-Markovian, making policy-improvement steps interdependent across stages. We introduce a new family of monotonically improving policy-iteration algorithms that alternate between single-stage output-based policy improvements and policy evaluations according to a prescribed periodic pattern. We show that this family admits optimal patterns that maximize a natural computational-efficiency index, and we identify the simplest pattern with minimal period. Building on this structure, we further develop a model-free variant that estimates values from data and learns memoryless policies directly. Across several POMDP examples, our method achieves significant computational speedups over policy-gradient baselines and recent specialized algorithms in both model-based and model-free settings.

[251] Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerating Neural Network Verification

Duo Zhou, Jorge Chavez, Hesun Chen, Grani A. Hanasusanto, Huan Zhang

Main category: cs.LG

TL;DR: Clip-and-Verify introduces a linear constraint-driven clipping framework that enhances neural network verification by efficiently using linear constraints to reduce search space and improve bounds, achieving state-of-the-art performance.

DetailsMotivation: Current neural network verifiers rely heavily on branch-and-bound procedures, but there's a need for more efficient methods to handle challenging verification properties and reduce computational complexity.

Method: Developed a linear constraint-driven clipping framework with two novel algorithms: 1) reduces verified/irrelevant input space portions during branch-and-bound, and 2) directly improves intermediate bounds throughout the network using linear constraints from bound propagation and other sources, implemented with specialized GPU procedures.
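
A minimal sketch of one constraint-driven clipping step in the spirit of the method (not the paper's exact GPU procedure): tightening an input box under a single linear constraint a·x ≤ b by interval arithmetic:

```python
import numpy as np

def clip_box(l, u, a, b):
    """Tighten the box [l, u] under the linear constraint a @ x <= b:
    for each coordinate, push the others to their minimum contribution
    and solve for the implied bound (one interval-tightening pass)."""
    l, u = l.copy(), u.copy()
    contrib_min = np.where(a > 0, a * l, a * u)   # per-coordinate minima
    total_min = contrib_min.sum()
    for j in range(len(a)):
        if a[j] == 0.0:
            continue
        slack = b - (total_min - contrib_min[j])  # budget left for x_j
        if a[j] > 0:
            u[j] = min(u[j], slack / a[j])        # upper bound tightens
        else:
            l[j] = max(l[j], slack / a[j])        # lower bound tightens
    return l, u

l, u = clip_box(np.array([-1.0, -1.0]), np.array([1.0, 1.0]),
                a=np.array([1.0, 1.0]), b=-0.5)
print(l, u)   # upper bounds tighten from 1.0 to 0.5
```

Shrinking the box this way can tighten intermediate bounds everywhere downstream, which is what prunes branch-and-bound subproblems.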

Result: Achieves up to 96% reduction in branch-and-bound subproblems, consistently tightens bounds across benchmarks, and achieves state-of-the-art verified accuracy; part of the VNN-COMP 2025 winning verifier α,β-CROWN.

Conclusion: The Clip-and-Verify framework provides an efficient, scalable approach to neural network verification that significantly improves performance by leveraging linear constraints, demonstrating practical effectiveness across diverse benchmarks.

Abstract: State-of-the-art neural network (NN) verifiers demonstrate that applying the branch-and-bound (BaB) procedure with fast bounding techniques plays a key role in tackling many challenging verification properties. In this work, we introduce the linear constraint-driven clipping framework, a class of scalable and efficient methods designed to enhance the efficacy of NN verifiers. Under this framework, we develop two novel algorithms that efficiently utilize linear constraints to 1) reduce portions of the input space that are either verified or irrelevant to a subproblem in the context of branch-and-bound, and 2) directly improve intermediate bounds throughout the network. The process novelly leverages linear constraints that often arise from bound propagation methods and is general enough to also incorporate constraints from other sources. It efficiently handles linear constraints using a specialized GPU procedure that can scale to large neural networks without the use of expensive external solvers. Our verification procedure, Clip-and-Verify, consistently tightens bounds across multiple benchmarks and can significantly reduce the number of subproblems handled during BaB. We show that our clipping algorithms can be integrated with BaB-based verifiers such as $α,β$-CROWN, utilizing either the split constraints in activation-space BaB or the output constraints that denote the unverified input space. We demonstrate the effectiveness of our procedure on a broad range of benchmarks where, in some instances, we witness a 96% reduction in the number of subproblems during branch-and-bound, and also achieve state-of-the-art verified accuracy across multiple benchmarks. Clip-and-Verify is part of the $α,β$-CROWN verifier (http://abcrown.org), the VNN-COMP 2025 winner. Code available at https://github.com/Verified-Intelligence/Clip_and_Verify.

[252] Investigating ECG Diagnosis with Ambiguous Labels using Partial Label Learning

Sana Rahmani, Javad Hashemi, Ali Etemad

Main category: cs.LG

TL;DR: First systematic study of Partial Label Learning (PLL) methods for ECG diagnosis, evaluating 9 PLL algorithms on ECG data with various clinically-motivated ambiguity generation strategies.

DetailsMotivation: Label ambiguity is inherent in real-world ECG diagnosis due to overlapping conditions and diagnostic disagreement, but current ECG models assume clean annotations, limiting development and evaluation under real-world conditions. PLL frameworks are designed for ambiguous labels but remain unexplored in medical time-series domains like ECG.

Method: Adapted 9 PLL algorithms to multi-label ECG diagnosis and evaluated them using diverse clinically-motivated ambiguity generation strategies, including both unstructured (random) and structured ambiguities (cardiologist-derived similarities, treatment relationships, diagnostic taxonomies). Experiments conducted on PTB-XL and Chapman datasets.
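
A small sketch of one structured ambiguity generator, assuming a class-similarity matrix (e.g., cardiologist-derived) from which the k most similar classes are added as distractors; the paper's generation strategies are more varied:

```python
import numpy as np

def make_candidate_sets(y_true, similarity, k=2):
    """Turn clean labels into ambiguous candidate sets by adding the k
    most similar classes as distractors (one structured strategy; random
    distractors would be the unstructured analogue)."""
    sets = []
    for y in y_true:
        ranked = np.argsort(-similarity[y])            # most similar first
        distractors = [c for c in ranked if c != y][:k]
        sets.append({y, *distractors})
    return sets

sim = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(make_candidate_sets([0, 2], sim, k=1))   # [{0, 1}, {1, 2}]
```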

Result: PLL methods vary substantially in their robustness to different types and degrees of ambiguity. The study identifies key limitations of current PLL approaches in clinical settings.

Conclusion: Outlines future directions for developing robust and clinically aligned ambiguity-aware learning frameworks for ECG diagnosis, highlighting the need for improved PLL methods that can handle real-world label ambiguity in medical applications.

Abstract: Label ambiguity is an inherent problem in real-world electrocardiogram (ECG) diagnosis, arising from overlapping conditions and diagnostic disagreement. However, current ECG models are trained under the assumption of clean and non-ambiguous annotations, which limits both the development and the meaningful evaluation of models under real-world conditions. Although Partial Label Learning (PLL) frameworks are designed to learn from ambiguous labels, their effectiveness in medical time-series domains, ECG in particular, remains largely unexplored. In this work, we present the first systematic study of PLL methods for ECG diagnosis. We adapt nine PLL algorithms to multi-label ECG diagnosis and evaluate them using a diverse set of clinically motivated ambiguity generation strategies, capturing both unstructured (e.g., random) and structured ambiguities (e.g., cardiologist-derived similarities, treatment relationships, and diagnostic taxonomies). Our experiments on the PTB-XL and Chapman datasets demonstrate that PLL methods vary substantially in their robustness to different types and degrees of ambiguity. Through extensive analysis, we identify key limitations of current PLL approaches in clinical settings and outline future directions for developing robust and clinically aligned ambiguity-aware learning frameworks for ECG diagnosis.

[253] Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Mohammadjavad Ahmadpour, Amirmahdi Meighani, Payam Taebi, Omid Ghahroodi, Amirmohammad Izadi, Mahdieh Soleymani Baghshah

Main category: cs.LG

TL;DR: TTS improves LLM reasoning but its effectiveness for VLMs varies: closed-source models benefit from structured reasoning and self-refinement, while open-source models only reliably improve with external verification, and gains depend on task type.

DetailsMotivation: Test-time scaling (TTS) has shown promise for improving LLM reasoning by adding computation at inference, but its application to multimodal Vision-Language Models (VLMs) remains underexplored, creating a gap in understanding how TTS methods work across different VLM architectures and tasks.

Method: Conducted systematic empirical study of inference time reasoning methods across both open-source and closed-source VLMs on different benchmarks, evaluating structured reasoning, iterative self-refinement, and external verification techniques.

Result: Closed-source models consistently benefit from structured reasoning and iterative self-refinement, while open-source VLMs show inconsistent behavior - external verification provides reliable gains but iterative refinement often degrades performance. TTS effectiveness is dataset-dependent, with clear improvements on multi-step reasoning tasks but limited gains on perception-focused benchmarks.

Conclusion: TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models for more effective inference-time computation allocation.

Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.

[254] In-Context Multi-Objective Optimization

Xinyu Zhang, Conor Hassan, Julien Martinelli, Daolang Huang, Samuel Kaski

Main category: cs.LG

TL;DR: TAMO is a transformer-based, fully amortized policy for multi-objective Bayesian optimization that eliminates per-task surrogate fitting and acquisition engineering by pretraining on diverse corpora and proposing designs with single forward passes.

DetailsMotivation: Current multi-objective Bayesian optimization methods require problem-specific tuning of surrogates and acquisition functions, are myopic (lacking multi-step planning), and have significant computational overhead, especially in parallel or time-sensitive applications.

Method: TAMO uses a transformer architecture that operates across varying input and objective dimensions, pretrained with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories. The model conditions on the entire query history to approximate the Pareto frontier and proposes new designs with single forward passes.
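
The RL pretraining reward is cumulative hypervolume improvement; a minimal two-objective (minimization) hypervolume routine that could back such a reward is sketched below (the implementation is ours; only the reward definition comes from the paper):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (minimization) dominated by `points` w.r.t. `ref`
    for two objectives, via a sweep over the Pareto-sorted points."""
    pts = points[np.all(points < ref, axis=1)]
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]
    hv, best_y = 0.0, ref[1]
    for fx, fy in pts:
        if fy < best_y:                         # non-dominated so far
            hv += (ref[0] - fx) * (best_y - fy)
            best_y = fy
    return hv

ref = np.array([1.0, 1.0])
before = hypervolume_2d(np.array([[0.0, 0.5]]), ref)             # 0.5
after = hypervolume_2d(np.array([[0.0, 0.5], [0.5, 0.0]]), ref)  # 0.75
reward = after - before   # per-step hypervolume improvement = 0.25
```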

Result: TAMO reduces proposal time by 50-1000x compared to alternatives while matching or improving Pareto quality under tight evaluation budgets across synthetic benchmarks and real-world tasks.

Conclusion: Transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, opening a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.

Abstract: Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.

[255] Refining Graphical Neural Network Predictions Using Flow Matching for Optimal Power Flow with Constraint-Satisfaction Guarantee

Kshitiz Khanal

Main category: cs.LG

TL;DR: Two-stage physics-informed learning framework combining GNNs and Continuous Flow Matching for fast, near-optimal DC-OPF solutions with guaranteed feasibility.

DetailsMotivation: Traditional DC-OPF solvers are computationally expensive for large-scale systems requiring frequent updates, while existing ML approaches struggle with constraint satisfaction and optimality.

Method: Two-stage approach: 1) Physics-informed GNN learns feasible initial solutions using losses encoding power system constraints (economic dispatch, Kirchhoff’s laws, KKT conditions); 2) Continuous Flow Matching refines solutions toward optimality through learned vector field regression.
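
A minimal sketch of two physics-informed penalty terms, assuming a PTDF formulation of DC power flow; the paper's losses additionally encode economic dispatch optimality and KKT complementarity:

```python
import torch

def dc_opf_physics_loss(p_gen, load, ptdf, line_limits):
    """Two illustrative physics penalties for DC-OPF predictions:
    (i) total generation must equal total load (power balance), and
    (ii) PTDF-implied line flows must respect thermal limits."""
    balance = (p_gen.sum(-1) - load.sum(-1)) ** 2          # system balance
    flows = (p_gen - load) @ ptdf.T                        # line flows
    overload = torch.relu(flows.abs() - line_limits) ** 2  # limit violations
    return balance.mean() + overload.sum(-1).mean()

ptdf = torch.randn(5, 3)   # 5 lines, 3 buses (toy numbers)
loss = dc_opf_physics_loss(torch.rand(4, 3), torch.rand(4, 3),
                           ptdf, line_limits=torch.ones(5))
```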

Result: Achieves near-optimal solutions with cost gaps below 0.1% for nominal loads and below 3% for extreme conditions (70-130% nominal load), while maintaining 100% feasibility on IEEE 30-bus system.

Conclusion: Framework bridges gap between fast approximate neural predictions and slow optimal solvers, offering practical solution for modern power systems with high renewable penetration requiring frequent dispatch updates.

Abstract: The DC Optimal Power Flow (DC-OPF) problem is fundamental to power system operations, requiring rapid solutions for real-time grid management. While traditional optimization solvers provide optimal solutions, their computational cost becomes prohibitive for large-scale systems requiring frequent recalculations. Machine learning approaches offer promise for acceleration but often struggle with constraint satisfaction and cost optimality. We present a novel two-stage learning framework that combines physics-informed Graph Neural Networks (GNNs) with Continuous Flow Matching (CFM) for solving DC-OPF problems. Our approach embeds fundamental physical principles, including economic dispatch optimality conditions, Kirchhoff’s laws, and Karush-Kuhn-Tucker (KKT) complementarity conditions, directly into the training objectives. The first stage trains a GNN to produce feasible initial solutions by learning from physics-informed losses that encode power system constraints. The second stage employs CFM, a simulation-free continuous normalizing flow technique, to refine these solutions toward optimality through learned vector field regression. Evaluated on the IEEE 30-bus system across five load scenarios ranging from 70% to 130% nominal load, our method achieves near-optimal solutions with cost gaps below 0.1% for nominal loads and below 3% for extreme conditions, while maintaining 100% feasibility. Our framework bridges the gap between fast but approximate neural network predictions and optimal but slow numerical solvers, offering a practical solution for modern power systems with high renewable penetration requiring frequent dispatch updates.

[256] Fairness-Regularized Online Optimization with Switching Costs

Pengfei Li, Yuelin Han, Adam Wierman, Shaolei Ren

Main category: cs.LG

TL;DR: This paper introduces FairOBD, an online algorithm that simultaneously addresses fairness and action smoothness (switching costs) in online convex optimization, proving theoretical guarantees and demonstrating practical effectiveness in AI resource provisioning.

DetailsMotivation: Fairness and action smoothness are both important considerations in online optimization problems, but existing approaches haven't addressed them simultaneously. The paper aims to bridge this gap by developing algorithms that can handle fairness constraints while minimizing switching costs between actions.

Method: The authors propose FairOBD (Fairness-regularized Online Balanced Descent), which decomposes long-term fairness costs into a sequence of online costs using an auxiliary variable. This auxiliary variable then regularizes online actions to achieve fair outcomes while balancing hitting costs and switching costs.

Result: Theoretical analysis shows that without switching costs, no online algorithm can achieve sublinear regret or finite competitive ratio. However, FairOBD achieves a worst-case asymptotic competitive ratio against a novel benchmark - the optimal offline algorithm with parameterized constraints. Empirical evaluation in dynamic computing resource provisioning for AI inference shows FairOBD effectively reduces total fairness-regularized costs and promotes fair outcomes compared to baselines.

Conclusion: FairOBD successfully reconciles the tension between minimizing hitting costs, switching costs, and fairness costs in online optimization, providing both theoretical guarantees and practical effectiveness for applications like socially responsible AI inference.

Abstract: Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on the entire sequence of actions, we prove that even without switching costs, no online algorithms can possibly achieve a sublinear regret or finite competitive ratio compared to the offline optimal algorithm as the problem episode length $T$ increases. Then, we propose FairOBD (Fairness-regularized Online Balanced Descent), which reconciles the tension between minimizing the hitting cost, switching cost, and fairness cost. Concretely, FairOBD decomposes the long-term fairness cost into a sequence of online costs by introducing an auxiliary variable and then leverages the auxiliary variable to regularize the online actions for fair outcomes. Based on a new approach to account for switching costs, we prove that FairOBD offers a worst-case asymptotic competitive ratio against a novel benchmark – the optimal offline algorithm with parameterized constraints – by considering $T\to\infty$. Finally, we run trace-driven experiments of dynamic computing resource provisioning for socially responsible AI inference to empirically evaluate FairOBD, showing that FairOBD can effectively reduce the total fairness-regularized cost and better promote fair outcomes compared to existing baseline solutions.

[257] The Vekua Layer: Exact Physical Priors for Implicit Neural Representations via Generalized Analytic Functions

Vladimer Khasia

Main category: cs.LG

TL;DR: Vekua Layer (VL) is a differentiable spectral method using Generalized Analytic Functions that transforms neural field learning into convex least-squares problems, achieving machine precision on PDEs and acting as a physics-informed spectral filter.

DetailsMotivation: Implicit Neural Representations (INRs) suffer from spectral bias and computational expense of non-convex optimization. The authors aim to address these limitations by leveraging classical mathematical theory to create a more efficient and accurate approach.

Method: VL restricts the hypothesis space to the kernel of governing differential operators using Harmonic and Fourier-Bessel bases. This transforms learning from iterative gradient descent to a strictly convex least-squares problem solved via linear projection, grounded in Generalized Analytic Functions theory.
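
A minimal sketch of the core mechanism, assuming a harmonic-polynomial basis (the real and imaginary parts of z^k): because every basis element solves Laplace's equation exactly, fitting boundary data reduces to a plain linear least-squares projection:

```python
import numpy as np

def harmonic_design(x, y, degree):
    """Design matrix of harmonic basis functions Re(z^k), Im(z^k):
    each column solves Laplace's equation exactly, so any least-squares
    fit stays inside the PDE's kernel."""
    z = x + 1j * y
    cols = [np.ones_like(x)]
    for k in range(1, degree + 1):
        cols += [np.real(z ** k), np.imag(z ** k)]
    return np.stack(cols, axis=-1)

theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
xb, yb = np.cos(theta), np.sin(theta)        # samples on the unit circle
target = xb ** 2 - yb ** 2                   # = Re(z^2), itself harmonic
A = harmonic_design(xb, yb, degree=4)
coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
print(np.mean((A @ coeffs - target) ** 2))   # ~1e-31: machine precision
```

The fitted coefficients define the field everywhere in the domain, which is what enables the "holographic" extrapolation from boundary data.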

Result: VL achieves machine precision (MSE ≈ 10⁻³³) on exact reconstruction tasks and shows superior stability with incoherent sensor noise (MSE ≈ 0.03). It enables “holographic” extrapolation of global fields from partial boundary data via analytic continuation, outperforming Sinusoidal Representation Networks (SIRENs).

Conclusion: VL provides a physics-informed spectral filtering approach that overcomes spectral bias and computational limitations of traditional INRs, offering exact solutions to homogeneous elliptic PDEs and enabling analytic continuation capabilities not available in standard coordinate-based approximations.

Abstract: Implicit Neural Representations (INRs) have emerged as a powerful paradigm for parameterizing physical fields, yet they often suffer from spectral bias and the computational expense of non-convex optimization. We introduce the Vekua Layer (VL), a differentiable spectral method grounded in the classical theory of Generalized Analytic Functions. By restricting the hypothesis space to the kernel of the governing differential operator – specifically utilizing Harmonic and Fourier-Bessel bases – the VL transforms the learning task from iterative gradient descent to a strictly convex least-squares problem solved via linear projection. We evaluate the VL against Sinusoidal Representation Networks (SIRENs) on homogeneous elliptic Partial Differential Equations (PDEs). Our results demonstrate that the VL achieves machine precision ($\text{MSE} \approx 10^{-33}$) on exact reconstruction tasks and exhibits superior stability in the presence of incoherent sensor noise ($\text{MSE} \approx 0.03$), effectively acting as a physics-informed spectral filter. Furthermore, we show that the VL enables “holographic” extrapolation of global fields from partial boundary data via analytic continuation, a capability absent in standard coordinate-based approximations.

[258] Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles

Lennard Manuel, Hamid Gadirov, Steffen Frey

Main category: cs.LG

TL;DR: Enhanced autoencoder framework with clustering and contrastive losses improves visualization of high-dimensional scientific ensemble datasets.

DetailsMotivation: Scientific ensemble datasets are high-dimensional and complex, making analysis and visualization challenging. Traditional dimensionality reduction and autoencoders struggle with such data.

Method: Proposes an enhanced autoencoder framework with: 1) EfficientNetV2 for pseudo-labeling unlabeled data, 2) Joint optimization of reconstruction, clustering (soft silhouette score), and contrastive losses, 3) UMAP for 2D projection from latent space, 4) Evaluation using silhouette score.
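
A sketch of the joint objective, with two stated substitutions: a centroid-pull term stands in for the paper's soft-silhouette clustering loss, and a supervised InfoNCE over pseudo-labels stands in for its contrastive loss:

```python
import torch
import torch.nn.functional as F

def joint_loss(x, x_hat, z, pseudo_labels, temperature=0.5,
               w_rec=1.0, w_clu=0.1, w_con=0.1):
    """Reconstruction + clustering + contrastive objective. The clustering
    term pulls latents toward their pseudo-label centroid (standing in for
    the soft silhouette score); the contrastive term is supervised InfoNCE."""
    rec = F.mse_loss(x_hat, x)

    clu = torch.zeros(())
    for c in pseudo_labels.unique():
        members = z[pseudo_labels == c]
        clu = clu + ((members - members.mean(0)) ** 2).sum()
    clu = clu / len(z)

    zn = F.normalize(z, dim=-1)
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = (zn @ zn.T / temperature).masked_fill(eye, -1e9)
    pos = (pseudo_labels[:, None] == pseudo_labels[None, :]) & ~eye
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    con = -(log_prob * pos).sum(-1) / pos.sum(-1).clamp(min=1)
    return w_rec * rec + w_clu * clu + w_con * con.mean()

x = torch.randn(16, 32)
loss = joint_loss(x, x_hat=x + 0.1 * torch.randn_like(x),
                  z=torch.randn(16, 10),
                  pseudo_labels=torch.randint(0, 3, (16,)))
```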

Result: Models with clustering or contrastive loss marginally outperform baselines on two scientific ensemble datasets: soil channel structures from MCMC and droplet-on-film impact dynamics.

Conclusion: The enhanced autoencoder framework with clustering and contrastive losses improves feature extraction and visualization for high-dimensional scientific ensemble datasets.

Abstract: Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.

[259] Harnessing Rich Multi-Modal Data for Spatial-Temporal Homophily-Embedded Graph Learning Across Domains and Localities

Takuya Kurihana, Xiaojian Zhang, Wing Yee Au, Hon Yung Wong

Main category: cs.LG

TL;DR: Proposes a heterogeneous data pipeline for cross-domain urban data fusion to address complex city problems using over 50 diverse data sources, demonstrating strong predictive performance and transferability across localities.

DetailsMotivation: Modern cities need data-driven insights but face challenges with heterogeneous, multi-modal data collected independently by different agencies with varying standards. National datasets are consumable but exhibit significant heterogeneity, making integrated analysis difficult for smart city applications.

Method: Develops a heterogeneous data pipeline for cross-domain data fusion across time-varying, spatial-varying, and spatial-varying time-series datasets. Includes a data-learning module that integrates homophily from spatial-varying datasets into graph-learning, embedding locality information into models. Uses over 50 diverse data sources.

Result: Demonstrates generalizability and flexibility through five real-world observations using publicly accessible datasets (ride-share, traffic crash, crime reports) from multiple cities. Shows strong predictive performance with minimal reconfiguration when transferred to new localities or domains.

Conclusion: The framework advances scalable data-informed urban systems, addressing pressing challenges in smart city analytics by enabling effective cross-domain data fusion that transfers well across different cities and problem domains.

Abstract: Modern cities are increasingly reliant on data-driven insights to support decision making in areas such as transportation, public safety, and environmental impact. However, city-level data often exists in heterogeneous formats, collected independently by local agencies with diverse objectives and standards. Although national-level datasets are numerous, wide-ranging, and uniformly consumable, they exhibit significant heterogeneity and multi-modality. This research proposes a heterogeneous data pipeline that performs cross-domain data fusion over time-varying, spatial-varying and spatial-varying time-series datasets. We aim to address complex urban problems across multiple domains and localities by harnessing the rich information of over 50 data sources. Specifically, our data-learning module integrates homophily from spatial-varying datasets into graph learning, embedding information about various localities into the models. We demonstrate the generalizability and flexibility of the framework through five real-world observations using a variety of publicly accessible datasets (e.g., ride-share, traffic crash, and crime reports) collected from multiple cities. The results show that our proposed framework demonstrates strong predictive performance while requiring minimal reconfiguration when transferred to new localities or domains. This research advances the goal of building data-informed urban systems in a scalable way, addressing one of the most pressing challenges in smart city analytics.

[260] Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement

Andrea Procopio, Marco Esposito, Sara Raggiunto, Andrey Gizdov, Alberto Belli, Paola Pierleoni

Main category: cs.LG

TL;DR: Ultra-light 1D CNNs (as few as 305 parameters) outperform magnitude thresholding for Parkinson’s disease gait detection on resource-constrained wearables, achieving ~94% PR-AUC with 10x fewer parameters than baseline models while meeting strict edge deployment constraints.

DetailsMotivation: Need for accurate, efficient gait detection in Parkinson's disease using short acceleration windows from wearables, targeting resource-constrained edge devices where traditional methods (thresholding) have poor precision and high false positives.

Method: Compared magnitude thresholding to three 1D CNN architectures: literature baseline (separable convolutions), ultra-light separable model (305 params), and residual separable model (533 params). Used BioStampRC21 dataset with 2-second windows at 30Hz, subject-independent LOSO validation on 16 PD patients with chest-worn IMUs.
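
A sketch of an ultra-light depthwise-separable 1D CNN in the same few-hundred-parameter regime (467 parameters here); layer sizes are illustrative, not the paper's exact models:

```python
import torch
import torch.nn as nn

def sep_block(c_in, c_out, k=5):
    """Depthwise conv (per-channel temporal filter) + pointwise 1x1 mix."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_in, k, padding=k // 2, groups=c_in),
        nn.Conv1d(c_in, c_out, 1),
        nn.ReLU(),
    )

model = nn.Sequential(
    sep_block(3, 16),            # triaxial accelerometer input
    sep_block(16, 16),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 1),            # gait / no-gait logit
)
print(sum(p.numel() for p in model.parameters()))  # 467 parameters
logit = model(torch.randn(8, 3, 60))               # 2 s windows at 30 Hz
```

Separable convolutions are what keep the parameter and MAC counts within the memory and latency budgets of STM32-class MCUs.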

Result: Residual separable model (533 params) achieved PR-AUC=94.5%, F1=91.2%, MCC=89.4%, matching/surpassing baseline (5,552 params) with 10x fewer parameters. Thresholding had high recall (89.0%) but low precision (76.5%). Chest and thigh sensors most reliable; forearms degraded performance due to arm motion. Models executed in sub-10ms on STM32 MCUs.

Conclusion: Ultra-light separable CNNs provide superior accuracy-efficiency-generalization trade-off for wearable PD gait detection compared to fixed thresholds, enabling on-sensor processing for transmission/storage gating and demonstrating the value of tailored time-series models for edge deployment.

Abstract: We study on-device time-series analysis for gait detection in Parkinson’s disease (PD) from short windows of triaxial acceleration, targeting resource-constrained wearables and edge nodes. We compare magnitude thresholding to three 1D CNNs for time-series analysis: a literature baseline (separable convolutions) and two ultra-light models - one purely separable and one with residual connections. Using the BioStampRC21 dataset, 2 s windows at 30 Hz, and subject-independent leave-one-subject-out (LOSO) validation on 16 PwPD with chest-worn IMUs, our residual separable model (Model 2, 533 params) attains PR-AUC = 94.5%, F1 = 91.2%, MCC = 89.4%, matching or surpassing the baseline (5,552 params; PR-AUC = 93.7%, F1 = 90.5%, MCC = 88.5%) with approximately 10x fewer parameters. The smallest model (Model 1, 305 params) reaches PR-AUC = 94.0%, F1 = 91.0%, MCC = 89.1%. Thresholding obtains high recall (89.0%) but low precision (76.5%), yielding many false positives and high inter-subject variance. Sensor-position analysis (train-on-all) shows chest and thighs are most reliable; forearms degrade precision/recall due to non-gait arm motion; naive fusion of all sites does not outperform the best single site. Both compact CNNs execute within tight memory/latency budgets on STM32-class MCUs (sub-10 ms on low-power boards), enabling on-sensor gating of transmission/storage. Overall, ultra-light separable CNNs provide a superior accuracy-efficiency-generalization trade-off to fixed thresholds for wearable PD gait detection and underscore the value of tailored time-series models for edge deployment.

[261] Progress over Points: Reframing LM Benchmarks Around Scientific Objectives

Alwin Jin, Sean M. Hendryx, Vaskar Nath

Main category: cs.LG

TL;DR: The paper introduces “progress-oriented benchmarks” that measure scientific advancement rather than just static problem-solving, using NanoGPT speedrun as an example to catalyze improvements in language modeling infrastructure.

DetailsMotivation: Current benchmarks focus on static, already-solved problems which constrain the kinds of advances we can measure and incentivize. The field needs benchmarks where progress on the benchmark directly advances the scientific field itself.

Method: Created a progress-oriented benchmark environment based on NanoGPT speedrun, standardizing dataset slice, reference model, training harness, telemetry, with run-time verification and anti-gaming checks. Evaluation focuses on scientific delta: best-attained loss and efficiency frontier.

Result: Achieved new state-of-the-art training time, improving previous record by 3 seconds, and qualitatively observed emergence of novel algorithmic ideas. The environment enables comparisons between models/agents as means to catalyze reusable improvements to language modeling stack.

Conclusion: Proposes shifting from static problem leaderboards to test-time research on open-ended yet measurable scientific problems, reframing “benchmarking” as a vehicle for scientific advancement where progress on the benchmark equals progress on the science.

Abstract: Current benchmarks that test LLMs on static, already-solved problems (e.g., math word problems) have effectively demonstrated basic capability acquisition. The natural progression has been toward larger, more comprehensive and challenging collections of static problems, an approach that inadvertently constrains the kinds of advances we can measure and incentivize. To address this limitation, we argue for progress-oriented benchmarks, problem environments whose objectives are themselves the core targets of scientific progress, so that achieving state of the art on the benchmark advances the field. As an introductory step, we instantiate an environment based on the NanoGPT speedrun. The environment standardizes a dataset slice, a reference model and training harness, and rich telemetry, with run-time verification and anti-gaming checks. Evaluation centers on the scientific delta achieved: best-attained loss and the efficiency frontier. Using this environment, we achieve a new state-of-the-art training time, improving upon the previous record by 3 seconds, and qualitatively observe the emergence of novel algorithmic ideas. Moreover, comparisons between models and agents remain possible, but they are a means, not the end; the benchmark’s purpose is to catalyze reusable improvements to the language modeling stack. With this release, the overarching goal is to seed a community shift from static problem leaderboards to test-time research on open-ended yet measurable scientific problems. In this new paradigm, progress on the benchmark is progress on the science, thus reframing “benchmarking” as a vehicle for scientific advancement.

[262] On the failure of ReLU activation for physics-informed machine learning

Conor Rowan

Main category: cs.LG

TL;DR: ReLU activation functions perform poorly in physics-informed machine learning due to issues with automatic differentiation of discontinuous fields, not just because of piecewise linear limitations.

DetailsMotivation: To diagnose why ReLU activation functions consistently underperform compared to sigmoid, tanh, and swish functions in physics-informed machine learning, despite its widespread success in other ML domains.

Method: Analyzed ReLU’s performance on physics-informed problems, examining both second-order differential equations and variational problems with only first derivatives. Investigated how automatic differentiation in PyTorch handles discontinuous fields during training.
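
The failure mode is easy to reproduce: double-differentiating ReLU with autograd loses the Dirac delta that differentiating a step function should produce. A minimal PyTorch demonstration:

```python
import torch

# ReLU'(x) is a step function; differentiating it again should yield a
# Dirac delta at 0, but autograd returns 0 everywhere. Any training
# gradient that implicitly contains this second derivative is therefore
# mis-specified, which is the paper's diagnosis.
x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)
y = torch.relu(x)
(g,) = torch.autograd.grad(y.sum(), x, create_graph=True)
(h,) = torch.autograd.grad(g.sum(), x)
print(g)  # tensor([0., 0., 0., 1., 1.]) -- the step function
print(h)  # tensor([0., 0., 0., 0., 0.]) -- the delta is lost
```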

Result: Found that ReLU fails even on first-derivative variational problems. The failure stems from automatic differentiation issues with discontinuous fields, causing mis-specified gradients of the physics-informed loss function during training.

Conclusion: ReLU’s poor performance in physics-informed ML is caused by automatic differentiation problems with discontinuous fields, not just its piecewise linear nature. This explains why smooth activation functions like sigmoid, tanh, and swish perform better.

Abstract: Physics-informed machine learning uses governing ordinary and/or partial differential equations to train neural networks to represent the solution field. Like any machine learning problem, the choice of activation function influences the characteristics and performance of the solution obtained from physics-informed training. Several studies have compared common activation functions on benchmark differential equations, and have unanimously found that the rectified linear unit (ReLU) is outperformed by competitors such as the sigmoid, hyperbolic tangent, and swish activation functions. In this work, we diagnose the poor performance of ReLU on physics-informed machine learning problems. While it is well-known that the piecewise linear form of ReLU prevents it from being used on second-order differential equations, we show that ReLU fails even on variational problems involving only first derivatives. We identify the cause of this failure as second derivatives of the activation, which are taken not in the formulation of the loss, but in the process of training. Namely, we show that automatic differentiation in PyTorch fails to characterize derivatives of discontinuous fields, which causes the gradient of the physics-informed loss to be mis-specified, thus explaining the poor performance of ReLU.
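
The diagnosed failure is straightforward to reproduce: PyTorch’s autograd returns the almost-everywhere derivative of ReLU, so the second derivative comes back identically zero and the Dirac delta at the kink is silently dropped. A minimal illustration of this behavior:

```python
import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)
u = torch.relu(x)

# First derivative: the almost-everywhere value (0 or 1), with an
# arbitrary choice of 0 at the kink x = 0.
du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]

# Second derivative: identically zero. The Dirac delta at x = 0 carried by
# the distributional derivative is dropped, which is what mis-specifies the
# gradient of a physics-informed loss built on such fields.
d2u = torch.autograd.grad(du.sum(), x)[0]
print(du)   # tensor([0., 0., 0., 1., 1.])
print(d2u)  # tensor([0., 0., 0., 0., 0.])
```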

[263] Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models

Divya Kothandaraman, Jaclyn Pytlarz

Main category: cs.LG

TL;DR: A Gradient Projection Framework for concept-level feature exclusion in diffusion models that prevents memorization of sensitive attributes while preserving generation quality.

DetailsMotivation: Memorization in large-scale text-to-image diffusion models poses security and IP risks, enabling adversarial attribute extraction and unauthorized reproduction of sensitive features. Existing dememorization techniques fail to prevent internalization of prohibited concept-level features, and discarding all images containing sensitive features wastes valuable training data.

Method: Introduces a Gradient Projection Framework that operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Projects each gradient update onto the orthogonal complement of the sensitive feature’s embedding space, zeroing out its influence on model weights.

Result: The framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. It integrates seamlessly into standard diffusion model training pipelines and complements existing defenses.

Conclusion: By reframing memorization control as selective learning, the approach establishes a new paradigm for IP-safe and privacy-preserving generative AI through concept-level feature exclusion.

Abstract: Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective unlearning at the concept level. To address this, we introduce a Gradient Projection Framework designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature’s embedding space, thereby zeroing out its influence on the model’s weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.
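
The core operation is a projection onto the orthogonal complement of a prohibited embedding direction. A minimal sketch of that step in isolation, assuming a single sensitive direction `v`; how parameter gradients are matched up with embedding directions is the paper’s mechanism and is not reproduced here:

```python
import torch

def excise_direction(g: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Project g onto the orthogonal complement of the prohibited
    direction v, zeroing g's influence along v."""
    v_hat = v / v.norm()
    return g - (g @ v_hat) * v_hat

g, v = torch.randn(512), torch.randn(512)
g_safe = excise_direction(g, v)
print(torch.isclose(g_safe @ (v / v.norm()), torch.tensor(0.0), atol=1e-5))  # True
```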

[264] Fast EXP3 Algorithms

Ryoma Sato, Shinji Ito

Main category: cs.LG

TL;DR: EXP3 algorithm can be implemented in constant time per round; paper proposes more practical algorithms and analyzes trade-offs between regret bounds and time complexities.

DetailsMotivation: The motivation is to improve the practical efficiency of the EXP3 algorithm by reducing its time complexity while maintaining good regret bounds, making it more suitable for real-world applications.

Method: The paper proposes constant-time implementations of EXP3 and introduces more practical algorithms, then analyzes the trade-offs between their regret bounds and computational complexities.

Result: The paper shows that EXP3 can be implemented in constant time per round and presents algorithms with different trade-offs between regret performance and computational efficiency.

Conclusion: There exists a spectrum of algorithms with varying trade-offs between regret bounds and time complexity, enabling practitioners to choose appropriate algorithms based on their specific computational constraints and performance requirements.

Abstract: We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms.
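
For reference, the textbook EXP3 loop, whose naive cost is O(K) per round (the normalization and sampling touch every arm); the paper’s observation is that this per-round cost can be driven down to constant time. A standard-form sketch:

```python
import numpy as np

def exp3(K, T, gamma, pull):
    """Textbook EXP3, O(K) per round as written.

    `pull(i, t)` returns the bandit reward in [0, 1] for arm i at round t.
    """
    w = np.ones(K)
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K    # exploration mixture
        i = np.random.choice(K, p=p)
        x = pull(i, t)
        w[i] *= np.exp(gamma * x / (K * p[i]))       # importance-weighted update
    return w
```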

[265] Latent Variable Causal Discovery under Selection Bias

Haoyue Dai, Yiwen Qiu, Ignavier Ng, Xinshuai Dong, Peter Spirtes, Kun Zhang

Main category: cs.LG

TL;DR: Rank constraints in covariance matrices can identify latent variable causal structures under selection bias, enabling identification of one-factor models despite selection mechanisms.

DetailsMotivation: Selection bias in latent variable causal discovery is an important but underexplored problem due to a lack of statistical tools that can handle both latent variables and selection bias simultaneously.

Method: The paper studies rank constraints as a generalization of conditional independence constraints, analyzing ranks of covariance submatrices in linear Gaussian models under selection bias. Provides graph-theoretic characterization of rank constraints that preserve information about causal structures and selection mechanisms.

Result: Shows that rank constraints in biased covariance matrices preserve meaningful information about causal structures and selection mechanisms. Demonstrates that the one-factor model (classical latent variable model) can be identified under selection bias using these rank constraints.

Conclusion: Rank constraints provide an effective tool for latent variable causal discovery under selection bias, with simulations and real-world experiments confirming their effectiveness in identifying causal structures despite selection mechanisms.

Abstract: Addressing selection bias in latent variable causal discovery is important yet underexplored, largely due to a lack of suitable statistical tools: While various tools beyond basic conditional independencies have been developed to handle latent variables, none have been adapted for selection bias. We make an attempt by studying rank constraints, which, as a generalization to conditional independence constraints, exploits the ranks of covariance submatrices in linear Gaussian models. We show that although selection can significantly complicate the joint distribution, interestingly, the ranks in the biased covariance matrices still preserve meaningful information about both causal structures and selection mechanisms. We provide a graph-theoretic characterization of such rank constraints. Using this tool, we demonstrate that the one-factor model, a classical latent variable model, can be identified under selection bias. Simulations and real-world experiments confirm the effectiveness of using our rank constraints.
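
The flavor of a rank constraint is easy to see in a toy one-factor model: every cross-covariance block factors through the single latent, so it has rank 1, and this kind of algebraic signature is what the paper shows survives selection. A minimal simulation (without any selection mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
factor = rng.normal(size=n)                   # the single latent variable
lam = np.array([0.9, 0.8, 0.7, 0.6])          # factor loadings
X = np.outer(factor, lam) + 0.3 * rng.normal(size=(n, 4))

S = np.cov(X, rowvar=False)
block = S[:2, 2:]                             # cross-covariance of {X1,X2} vs {X3,X4}
print(np.linalg.matrix_rank(block, tol=1e-2)) # -> 1, since block ~ lam_A @ lam_B.T
```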

[266] Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

Adilet Metinov, Gulida M. Kudakeeva, Bolotbek uulu Nursultan, Gulnara D. Kabaeva

Main category: cs.LG

TL;DR: ASR-KF-EGR is a training-free inference framework that reduces KV cache memory usage by 55-67% through reversible soft-freezing of low-importance tokens, with sublinear freeze scheduling and entropy-guided recovery.

DetailsMotivation: To enable efficient large language model generation in memory-constrained environments without requiring fine-tuning or permanently discarding context information.

Method: Uses reversible soft-freeze mechanism to temporarily suspend KV updates for low-importance tokens within sliding attention window, stores tokens off-GPU, restores on demand, with sublinear freeze scheduling that increases freeze duration sublinearly with repeated low-importance detections.

Result: 55-67% reduction in active KV cache size on LLaMA-3 8B while maintaining generation quality and passing needle-in-haystack retrieval tests.

Conclusion: Provides practical, architecture-agnostic solution for memory-efficient long-context LLM deployment without training or fine-tuning requirements.

Abstract: We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.
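
The sublinear freeze schedule can be pictured as a per-token counter whose freeze duration grows more slowly than linearly with repeated low-importance detections. A toy sketch under that reading; the square-root schedule and reset rule here are illustrative assumptions, not the paper’s exact formulas:

```python
import math

class SoftFreezeScheduler:
    """Toy sublinear schedule: the k-th consecutive low-importance
    detection freezes a token's KV entry for ~sqrt(k) * base steps."""

    def __init__(self, base_steps=4):
        self.base = base_steps
        self.hits = {}                        # token position -> detection count

    def freeze_steps(self, pos, is_low_importance):
        if not is_low_importance:
            self.hits.pop(pos, None)          # importance recovered: reset counter
            return 0
        k = self.hits.get(pos, 0) + 1
        self.hits[pos] = k
        return int(self.base * math.sqrt(k))  # duration grows sublinearly in k
```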

[267] Task-Aware Multi-Expert Architecture For Lifelong Deep Learning

Jianyu Wang, Jacob Nean-Hua Sheikh, Cat P. Le, Hoda Bidkhori

Main category: cs.LG

TL;DR: TAME is a continual learning algorithm that uses task similarity to select relevant pretrained experts, integrates them via shared dense layers, and employs replay buffers with attention to prevent forgetting.

DetailsMotivation: To address catastrophic forgetting in lifelong deep learning where neural networks learn sequentially across tasks while preserving prior knowledge, and to enable flexible adaptation to new tasks while retaining important knowledge from previous ones.

Method: Task-Aware Multi-Expert (TAME) maintains a pool of pretrained neural networks, activates the most relevant expert for each new task based on task similarity, uses a shared dense layer to integrate features from chosen experts, employs a replay buffer storing representative samples and embeddings from previous tasks, and uses an attention mechanism to prioritize the most relevant stored information for predictions.

Result: Experiments on binary classification tasks from CIFAR-100 show that TAME improves accuracy on new tasks while sustaining performance on earlier ones, demonstrating effective balance between adaptation and retention in lifelong learning.

Conclusion: TAME effectively addresses catastrophic forgetting in lifelong learning by leveraging task similarity for expert selection, replay buffers for knowledge retention, and attention mechanisms for relevant information prioritization, achieving a good balance between adaptation to new tasks and preservation of prior knowledge.

Abstract: Lifelong deep learning (LDL) trains neural networks to learn sequentially across tasks while preserving prior knowledge. We propose Task-Aware Multi-Expert (TAME), a continual learning algorithm that leverages task similarity to guide expert selection and knowledge transfer. TAME maintains a pool of pretrained neural networks and activates the most relevant expert for each new task. A shared dense layer integrates features from the chosen expert to generate predictions. To reduce catastrophic forgetting, TAME uses a replay buffer that stores representative samples and embeddings from previous tasks and reuses them during training. An attention mechanism further prioritizes the most relevant stored information for each prediction. Together, these components allow TAME to adapt flexibly while retaining important knowledge across evolving task sequences. Experiments on binary classification tasks derived from CIFAR-100 show that TAME improves accuracy on new tasks while sustaining performance on earlier ones, highlighting its effectiveness in balancing adaptation and retention in lifelong learning settings.
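
The expert-selection step reduces to a nearest-neighbor lookup over task representations. A minimal sketch, assuming each task is summarized by an embedding vector and similarity is cosine (the similarity measure is an assumption for illustration):

```python
import numpy as np

def select_expert(task_emb: np.ndarray, expert_embs: np.ndarray) -> int:
    """Return the index of the pretrained expert whose task embedding is
    most similar (cosine) to the new task's embedding."""
    t = task_emb / np.linalg.norm(task_emb)
    E = expert_embs / np.linalg.norm(expert_embs, axis=1, keepdims=True)
    return int(np.argmax(E @ t))
```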

[268] Insight Miner: A Time Series Analysis Dataset for Cross-Domain Alignment with Natural Language

Yunkai Zhang, Yawen Zhang, Ming Zheng, Kezhen Chen, Chongyang Gao, Ruian Ge, Siyuan Teng, Amine Jelloul, Jinmeng Rao, Xiaoyuan Guo, Chiang-Wei Fang, Zeyu Zheng, Jie Yang

Main category: cs.LG

TL;DR: Insight Miner is a multimodal model that generates comprehensive time-series descriptions using domain knowledge, trained on the new TS-Insights dataset created via an agentic workflow.

DetailsMotivation: Time-series analysis requires deep domain expertise and is time-consuming/labor-intensive. There's a need for automated systems that can generate high-quality insights from time-series data without requiring specialized knowledge.

Method: 1) Created TS-Insights dataset (100k time-series windows from 20 forecasting datasets) using an agentic workflow: statistical tools extract features, GPT-4 synthesizes coherent trend descriptions. 2) Developed Insight Miner LMM and instruction-tuned it on TS-Insights dataset.

Result: Insight Miner outperforms state-of-the-art multimodal models (LLaVA and GPT-4) in generating time-series descriptions and insights. The model shows promise for enabling LLMs to interpret time series as a native input modality.

Conclusion: The work demonstrates a promising direction for leveraging LMMs in time series analysis and serves as a foundational step toward enabling LLMs to interpret time series as a native input modality, potentially democratizing time-series analysis across domains.

Abstract: Time-series data is critical across many scientific and industrial domains, including environmental analysis, agriculture, transportation, and finance. However, mining insights from this data typically requires deep domain expertise, a process that is both time-consuming and labor-intensive. In this paper, we propose Insight Miner, a large-scale multimodal model (LMM) designed to generate high-quality, comprehensive time-series descriptions enriched with domain-specific knowledge. To facilitate this, we introduce TS-Insights (available at https://huggingface.co/datasets/zhykoties/time-series-language-alignment), the first general-domain dataset for time series and language alignment. TS-Insights contains 100k time-series windows sampled from 20 forecasting datasets. We construct this dataset using a novel agentic workflow, where we use statistical tools to extract features from raw time series before synthesizing them into coherent trend descriptions with GPT-4. Following instruction tuning on TS-Insights, Insight Miner outperforms state-of-the-art multimodal models, such as LLaVA (Liu et al., 2023) and GPT-4, in generating time-series descriptions and insights. Our findings suggest a promising direction for leveraging LMMs in time series analysis, and serve as a foundational step toward enabling LLMs to interpret time series as a native input modality.
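
The dataset-construction workflow first extracts statistical features from each window and only then asks GPT-4 to verbalize them. A minimal sketch of the extraction half; the specific features and prompt wording are illustrative assumptions:

```python
import numpy as np

def window_features(y: np.ndarray) -> dict:
    """Simple statistics from which a description-synthesis prompt is built."""
    t = np.arange(len(y))
    slope = np.polyfit(t, y, 1)[0]            # linear trend of the window
    return {
        "mean": float(y.mean()),
        "std": float(y.std()),
        "trend_slope": float(slope),
        "min": float(y.min()),
        "max": float(y.max()),
    }

feats = window_features(np.sin(np.linspace(0, 6, 200)))
prompt = f"Describe the trend of a time series with these statistics: {feats}"
# `prompt` would then be sent to an LLM to synthesize a coherent description.
```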

[269] A Simple Generalisation of the Implicit Dynamics of In-Context Learning

Francesco Innocenti, El Mehdi Achour

Main category: cs.LG

TL;DR: This paper extends the theory that transformer blocks implicitly update feedforward network weights during in-context learning, generalizing it to all sequence positions, any transformer block, and more realistic architectures including layer normalization.

DetailsMotivation: Previous theories of in-context learning (ICL) relied on simplified toy models and data settings. Recent work by Dherin et al. (2025) showed that transformer blocks can be seen as implicitly updating feedforward network weights based on context, but this theory needs extension to more realistic scenarios.

Method: The authors provide a simple generalization of Dherin et al.’s result that works for: (i) all sequence positions (not just the last), (ii) any transformer block (not just the first), and (iii) more realistic residual blocks including layer normalization. They empirically verify their theory on simple in-context linear regression tasks and investigate relationships between implicit updates across different tokens and blocks.

Result: The paper successfully generalizes the implicit weight update theory to more comprehensive and realistic transformer architectures. Empirical verification on linear regression tasks supports the theoretical extensions, and the analysis reveals relationships between implicit updates across different tokens and transformer blocks.

Conclusion: This work brings the theory of implicit weight updates in transformer blocks closer to practical applications, potentially enabling validation on large-scale models and providing deeper understanding of how in-context learning works in realistic transformer architectures.

Abstract: In-context learning (ICL) refers to the ability of a model to learn new tasks from examples in its input without any parameter updates. In contrast to previous theories of ICL relying on toy models and data settings, recently it has been shown that an abstraction of a transformer block can be seen as implicitly updating the weights of its feedforward network according to the context (Dherin et al., 2025). Here, we provide a simple generalisation of this result for (i) all sequence positions beyond the last, (ii) any transformer block beyond the first, and (iii) more realistic residual blocks including layer normalisation. We empirically verify our theory on simple in-context linear regression tasks and investigate the relationship between the implicit updates related to different tokens within and between blocks. These results help to bring the theory of Dherin et al. (2025) even closer to practice, with potential for validation on large-scale models.
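
The algebraic kernel of the implicit-update view is a rank-1 identity: for a linear layer, absorbing an additive attention contribution into the input is exactly equivalent to a rank-1 weight update. A worked statement of that step (a simplification; the paper’s generalization handles full residual blocks with layer normalization):

```latex
% Linear layer W, input x \neq 0, attention contribution a:
W(x + a) \;=\; Wx + Wa
         \;=\; \Big(W + \underbrace{\tfrac{(Wa)\,x^{\top}}{x^{\top}x}}_{\Delta W,\ \text{rank }1}\Big)\,x,
\qquad\text{since}\qquad \tfrac{(Wa)\,x^{\top}}{x^{\top}x}\,x \;=\; Wa .
```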

[270] Features Emerge as Discrete States: The First Application of SAEs to 3D Representations

Albert Miao, Chenliang Zhou, Jiawei Zhou, Cengiz Oztireli

Main category: cs.LG

TL;DR: First application of Sparse Autoencoders (SAEs) to 3D domain, analyzing VAE features for 3D reconstruction, discovering discrete state space with phase transitions.

DetailsMotivation: SAEs are powerful for decomposing neural activations but rarely applied outside text domain, limiting feature decomposition theory. Need to explore SAEs in 3D to understand feature learning dynamics.

Method: Applied SAEs to analyze features of state-of-the-art 3D reconstruction VAE trained on 53k 3D models from Objaverse dataset. Used SAEs to decompose hidden activations and study feature behavior.

Result: Found network encodes discrete rather than continuous features, approximating discrete state space with phase-like transitions. Explained three unintuitive behaviors: preference for positional encoding, sigmoidal reconstruction loss from ablation, and bimodal phase transition distribution showing interference redistribution.

Conclusion: First SAE application to 3D domain provides framework explaining feature learning dynamics, showing models use discrete state spaces with phase transitions and redistribute interference to prioritize feature saliency.

Abstract: Sparse Autoencoders (SAEs) are a powerful dictionary learning technique for decomposing neural network activations, translating the hidden state into human ideas with high semantic value despite no external intervention or guidance. However, this technique has rarely been applied outside of the textual domain, limiting theoretical explorations of feature decomposition. We present the first application of SAEs to the 3D domain, analyzing the features used by a state-of-the-art 3D reconstruction VAE applied to 53k 3D models from the Objaverse dataset. We observe that the network encodes discrete rather than continuous features, leading to our key finding: such models approximate a discrete state space, driven by phase-like transitions from feature activations. Through this state transition framework, we address three otherwise unintuitive behaviors – the inclination of the reconstruction model towards positional encoding representations, the sigmoidal behavior of reconstruction loss from feature ablation, and the bimodality in the distribution of phase transition points. This final observation suggests the model redistributes the interference caused by superposition to prioritize the saliency of different features. Our work not only compiles and explains unexpected phenomena regarding feature decomposition, but also provides a framework to explain the model’s feature learning dynamics. The code and dataset of encoded 3D objects will be available on release.
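
For readers unfamiliar with the probe, a sparse autoencoder in the standard interpretability form: an overcomplete dictionary trained to reconstruct activations under an L1 sparsity penalty. A generic sketch, not the authors’ exact configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: d_model activations -> d_dict sparse features."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h):
        f = torch.relu(self.enc(h))           # nonnegative feature activations
        return self.dec(f), f

def sae_loss(model, h, l1=1e-3):
    h_hat, f = model(h)
    return ((h_hat - h) ** 2).mean() + l1 * f.abs().mean()  # reconstruction + sparsity
```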

[271] SRLR: Symbolic Regression based Logic Recovery to Counter Programmable Logic Controller Attacks

Hao Zhou, Suman Sourav, Binbin Chen, Ke Yu

Main category: cs.LG

TL;DR: SRLR is a symbolic regression-based approach for recovering PLC logic from input-output data to detect controller logic attacks with explainable rules, outperforming existing methods by up to 39% in challenging ICS environments.

DetailsMotivation: PLC controllers in industrial control systems are vulnerable to cyber-attacks. Existing detection methods have limitations: specification-based approaches require expert knowledge or source code access, while machine learning models lack explainability for their decisions.

Method: SRLR uses symbolic regression to recover PLC logic from input-output data only. It enhances deep symbolic regression with ICS-specific properties: frequency domain representation for control logic, handling multiple operational modes, filtering outlier inputs, and reducing formula complexity for effective search.

Result: SRLR consistently outperforms all existing methods across various ICS settings, achieving up to 39% higher recovery accuracy in challenging environments. It also demonstrates stability in large-scale systems, successfully handling a distribution grid with hundreds of voltage regulators.

Conclusion: SRLR provides an effective, explainable solution for PLC logic recovery and attack detection that doesn’t require source code or expert specifications, making it practical for real-world industrial control system security.

Abstract: Programmable Logic Controllers (PLCs) are critical components in Industrial Control Systems (ICSs). Their potential exposure to the external world makes them susceptible to cyber-attacks. Existing detection methods against controller logic attacks use either specification-based or learnt models. However, specification-based models require experts’ manual efforts or access to PLC’s source code, while machine learning-based models often fall short of providing explanations for their decisions. We design SRLR – a Symbolic Regression based Logic Recovery solution to identify the logic of a PLC based only on its inputs and outputs. The recovered logic is used to generate explainable rules for detecting controller logic attacks. SRLR enhances the latest deep symbolic regression methods using the following ICS-specific properties: (1) some important ICS control logic is best represented in frequency domain rather than time domain; (2) an ICS controller can operate in multiple modes, each using different logic, where mode switches usually do not happen frequently; (3) a robust controller usually filters out outlier inputs as ICS sensor data can be noisy; and (4) with the above factors captured, the degree of complexity of the formulas is reduced, making effective search possible. Thanks to these enhancements, SRLR consistently outperforms all existing methods in a variety of ICS settings that we evaluate. In terms of the recovery accuracy, SRLR’s gain can be as high as 39% in some challenging environments. We also evaluate SRLR on a distribution grid containing hundreds of voltage regulators, demonstrating its stability in handling large-scale, complex systems with varied configurations.

[272] QGEC : Quantum Golay Code Error Correction

Hideo Mukai, Hoshitaro Ohnishi

Main category: cs.LG

TL;DR: Proposed Quantum Golay code Error Correction (QGEC) using Transformer decoders, showing Golay code (23 qubits, distance 7) outperforms toric code (50 qubits, distance 5) in decoding accuracy across various noise models.

DetailsMotivation: Quantum error correction is essential for fault-tolerant quantum computation since qubits are highly susceptible to noise. Traditional QEC methods use syndrome measurements instead of direct data qubit measurements, but efficient decoding remains challenging. The Golay code, known for efficiency in classical information theory, may offer advantages for quantum error correction.

Method: Proposed QGEC method using Golay code with Transformer-based decoders. Evaluated decoder accuracy across: 1) three different weight sets for generative polynomials, 2) three noise models with varying correlations between bit-flip and phase-flip errors, and 3) compared Golay code (23 data qubits, distance 7) against toric code (50 data qubits, distance 5) under discrete uniform distribution noise.

Result: 1) Noise models with smaller correlation between bit-flip and phase-flip errors gave better decoding accuracy. 2) Weights of generative polynomials had little effect on decoder accuracy. 3) Golay code achieved higher decoding accuracy than toric code despite requiring fewer qubits (23 vs 50) and having larger code distance (7 vs 5).

Conclusion: Transformer-based decoding enables efficient quantum error correction with Golay codes, potentially allowing more efficient fault-tolerant quantum computation compared to toric codes. The Golay code’s superior performance with fewer qubits suggests it could be a promising approach for practical quantum error correction implementations.

Abstract: Quantum computers offer the possibility of a much reduced calculation load compared with classical computers on specific problems. Quantum error correction (QEC) is vital for handling qubits, which are vulnerable to external noise. In QEC, actual errors are predicted from the results of syndrome measurements by stabilizer generators, in place of making direct measurements of the data qubits. Here, we propose Quantum Golay code Error Correction (QGEC), a QEC method using Golay code, which is an efficient coding method in classical information theory. We investigated our method’s ability in decoding calculations with the Transformer. We evaluated the accuracy of the decoder in a code space defined by the generative polynomials with three different weight sets and three noise models with different correlations of bit-flip error and phase-flip error. Furthermore, under a noise model following a discrete uniform distribution, we compared the decoding performance of Transformer decoders with identical architectures trained respectively on Golay and toric codes. The results showed that the noise model with the smaller correlation gave better accuracy, while the weights of the generative polynomials had little effect on the accuracy of the decoder. In addition, they showed that the Golay code, requiring 23 data qubits and having a code distance of 7, achieved higher decoding accuracy than the toric code, which requires 50 data qubits and has a code distance of 5. This suggests that implementing quantum error correction using a Transformer may enable the Golay code to realize fault-tolerant quantum computation more efficiently.
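
A Transformer decoder of this kind is supervised on (syndrome, error) pairs generated from the code’s parity-check structure. A generic sketch of that data generation under an i.i.d. bit-flip model; the parity-check matrix and noise model are placeholders, not the paper’s exact setup:

```python
import numpy as np

def training_pair(H: np.ndarray, p: float, rng: np.random.Generator):
    """Sample one (syndrome, error) supervision pair for a code with
    parity-check matrix H of shape (r, n) under bit-flip probability p."""
    n = H.shape[1]
    e = (rng.random(n) < p).astype(np.uint8)   # i.i.d. bit-flip error pattern
    s = (H @ e) % 2                            # syndrome the decoder observes
    return s, e                                # model input and prediction target
```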

[273] TV2TV: A Unified Framework for Interleaved Language and Video Generation

Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

Main category: cs.LG

TL;DR: TV2TV is a novel video generation framework that interleaves text and video generation, allowing the model to “think in words” before “acting in pixels” to improve video quality and controllability.

DetailsMotivation: Current video generation models struggle with complex outputs requiring semantic branching and high-level reasoning about what should happen next in videos.

Method: TV2TV uses a Mixture-of-Transformers (MoT) architecture to jointly learn language modeling (next-token prediction) and video flow matching (next-frame prediction), allowing dynamic alternation between generating text and video frames during inference.

Result: TV2TV shows substantial improvements in visual quality and prompt alignment on video game data, and scales to natural videos (sports) with strong visual quality and prompt alignment for complex real-world action sequences.

Conclusion: TV2TV represents a promising step toward video generation with open-ended textual reasoning and control by offloading reasoning to language modeling and enabling fine-grained user intervention.

Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to “think in words” about subsequent content before “acting in pixels” to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model’s ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
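
At inference, the two towers are interleaved: at each step the model picks a modality, emitting either text tokens or video frames. A schematic sketch of that loop; all method names are hypothetical stand-ins for the model’s actual decision mechanism:

```python
def generate(model, prompt, max_steps):
    """Interleaved decoding: 'think in words', then 'act in pixels'."""
    stream = list(prompt)
    for _ in range(max_steps):
        if model.choose_modality(stream) == "text":   # hypothetical API
            stream.append(model.next_token(stream))   # language tower: next token
        else:
            stream.append(model.next_frame(stream))   # video tower: flow matching
    return stream
```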

[274] Benchmarking the Generality of Vision-Language-Action Models

Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang

Main category: cs.LG

TL;DR: MultiNet v1.0 is a unified benchmark for evaluating cross-domain generalization of vision-language models across six capability regimes, revealing current models fail to generalize beyond training distributions despite strong in-distribution performance.

DetailsMotivation: Current evaluation practices for multimodal agents are fragmented across isolated benchmarks, making it difficult to assess whether foundation models truly generalize beyond their training distributions. There's a need for unified evaluation to measure cross-domain generality.

Method: Introduced MultiNet v1.0 benchmark covering six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluated models including GPT-5, Pi0, and Magma on this unified framework.

Result: No model demonstrated consistent generality across domains. All exhibited substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts despite strong performance within training distributions. Failures included modality misalignment, output format instability, and catastrophic knowledge degradation under domain transfer.

Conclusion: There’s a persistent gap between the aspiration of generalist intelligence and actual capabilities of current foundation models. MultiNet v1.0 provides standardized evaluation for diagnosing these gaps and guiding development of future generalist agents.

Abstract: Generalist multimodal agents are expected to unify perception, language, and control - operating robustly across diverse real world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today’s foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross domain generality of vision language models (VLMs) and vision language action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi agent coordination, and continuous robot control. Evaluating GPT-5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross domain task shifts despite strong performance within their training distributions. These failures manifest as modality misalignment, output format instability, and catastrophic knowledge degradation under domain transfer. Our findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation models. MultiNet v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist agents. Code, data, and leaderboards are publicly available.

[275] Condensation-Concatenation Framework for Dynamic Graph Continual Learning

Tingxu Yan, Ye Yuan

Main category: cs.LG

TL;DR: CCC framework for continual learning on dynamic graphs uses condensation-concatenation to preserve historical knowledge and mitigate catastrophic forgetting from topological changes.

DetailsMotivation: Dynamic graphs experience continuous structural changes that cause catastrophic forgetting in GNNs. Existing continual learning methods for dynamic graphs overlook how topological changes affect existing nodes.

Method: CCC condenses historical graph snapshots into compact semantic representations preserving label distribution and topological properties, then concatenates these with current graph representations selectively. Also refines forgetting measure (FM) to quantify predictive performance degradation of existing nodes due to structural updates.

Result: CCC demonstrates superior performance over state-of-the-art baselines across four real-world datasets in extensive experiments.

Conclusion: The proposed CCC framework effectively addresses catastrophic forgetting in dynamic graphs by preserving historical knowledge through condensation-concatenation and better quantifying forgetting in dynamic graph scenarios.

Abstract: Dynamic graphs are prevalent in real-world scenarios, where continuous structural changes induce catastrophic forgetting in graph neural networks (GNNs). While continual learning has been extended to dynamic graphs, existing methods overlook the effects of topological changes on existing nodes. To address this, we propose a novel framework for continual learning on dynamic graphs, named Condensation-Concatenation-based Continual Learning (CCC). Specifically, CCC first condenses historical graph snapshots into compact semantic representations while aiming to preserve the original label distribution and topological properties. Then it concatenates these historical embeddings with current graph representations selectively. Moreover, we refine the forgetting measure (FM) to better adapt to dynamic graph scenarios by quantifying the predictive performance degradation of existing nodes caused by structural updates. CCC demonstrates superior performance over state-of-the-art baselines across four real-world datasets in extensive experiments.

[276] Pace: Physics-Aware Attentive Temporal Convolutional Network for Battery Health Estimation

Sara Sameer, Wei Zhang, Kannan Dhivya Dharshini, Xin Lou, Yulin Gao, Terence Goh, Qingyu Yan

Main category: cs.LG

TL;DR: Pace: A physics-aware attentive temporal convolutional network that integrates sensor data with battery physics for accurate battery health estimation, outperforming the two best-performing baselines with average improvements of 6.5x and 2.0x.

DetailsMotivation: Batteries are critical for modern energy systems (EVs, grid storage), and effective health management is essential for safety, cost-efficiency, and sustainability. Current methods need improvement for accurate health estimation across various usage conditions.

Method: Propose Pace network that integrates raw sensor measurements with battery physics features from equivalent circuit model. Includes three battery-specific modules: dilated temporal blocks for efficient temporal encoding, chunked attention blocks for context modeling, and dual-head output block for fusing short- and long-term degradation patterns.

Result: Outperforms existing models on a large public dataset, with average performance improvements of 6.5x and 2.0x over the two best-performing baseline models. Successfully deployed in a real-time edge implementation on a Raspberry Pi, demonstrating practical viability.

Conclusion: Pace establishes itself as a practical, high-performance solution for battery health analytics that accurately predicts battery health across various usage conditions through physics-aware deep learning architecture.

Abstract: Batteries are critical components in modern energy systems such as electric vehicles and power grid energy storage. Effective battery health management is essential for battery system safety, cost-efficiency, and sustainability. In this paper, we propose Pace, a physics-aware attentive temporal convolutional network for battery health estimation. Pace integrates raw sensor measurements with battery physics features derived from the equivalent circuit model. We develop three battery-specific modules, including dilated temporal blocks for efficient temporal encoding, chunked attention blocks for context modeling, and a dual-head output block for fusing short- and long-term battery degradation patterns. Together, the modules enable Pace to predict battery health accurately and efficiently in various battery usage conditions. In a large public dataset, Pace performs much better than existing models, achieving average performance improvements of 6.5x and 2.0x compared to the two best-performing baseline models. We further demonstrate its practical viability with a real-time edge deployment on a Raspberry Pi. These results establish Pace as a practical and high-performance solution for battery health analytics.

[277] Rethinking Expert Trajectory Utilization in LLM Post-training

Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin

Main category: cs.LG

TL;DR: The paper proposes a Plasticity-Ceiling Framework for post-training LLMs, establishing Sequential SFT-then-RL as optimal, with scaling guidelines for transitioning timing, data importance, and trajectory selection.

DetailsMotivation: The optimal use of expert trajectories in post-training (combining SFT and RL) remains unresolved, with unclear mechanisms for maximizing performance from these trajectories.

Method: Proposes Plasticity-Ceiling Framework to decompose performance into foundational SFT performance and RL plasticity. Uses extensive benchmarking to evaluate different approaches and derive scaling guidelines.

Result: Sequential SFT-then-RL pipeline is superior to synchronized approaches. Three key scaling guidelines: 1) Transition to RL at SFT Stable/Mild Overfitting phase maximizes final ceiling; 2) Data scale determines primary potential, trajectory difficulty acts as multiplier; 3) Minimum SFT validation loss indicates best expert trajectories.

Conclusion: Provides actionable guidelines for maximizing value from expert trajectories in post-training, establishing SFT-then-RL as standard with specific timing, scaling, and selection criteria.

Abstract: While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting “Less is More” in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.

[278] Spectral entropy prior-guided deep feature fusion architecture for magnetic core loss

Cong Yao, Chunye Gong, Jin Zhang

Main category: cs.LG

TL;DR: SEPI-TFPNet: A hybrid model combining empirical models with deep learning for improved magnetic core loss prediction, achieving better accuracy and robustness than existing methods.

DetailsMotivation: Traditional core loss modeling methods have accuracy limitations, and while data-driven models from the MagNet Challenge show good fitting performance, they lack interpretability and cross-distribution generalization capabilities.

Method: Hybrid model with physical-prior submodule using spectral entropy discrimination to select appropriate empirical models, and data-driven submodule with CNN, multi-head attention, and BiLSTM networks to extract flux-density time-series features, plus adaptive feature fusion module for multimodal integration.

Result: The method outperforms 21 representative models from the 2023 MagNet Challenge and three advanced methods from 2024-2025, demonstrating improved modeling accuracy and robustness.

Conclusion: SEPI-TFPNet successfully addresses limitations of purely data-driven approaches by integrating physical priors with deep learning, achieving better core loss prediction while maintaining interpretability and generalization capabilities.

Abstract: Accurate core loss modeling is critical for the design of high-efficiency power electronic systems. Traditional core loss modeling methods have limitations in prediction accuracy. To advance this field, the IEEE Power Electronics Society launched the MagNet Challenge in 2023, the first international competition focused on data-driven power electronics design methods, aiming to uncover complex loss patterns in magnetic components through a data-driven paradigm. Although purely data-driven models demonstrate strong fitting performance, their interpretability and cross-distribution generalization capabilities remain limited. To address these issues, this paper proposes a hybrid model, SEPI-TFPNet, which integrates empirical models with deep learning. The physical-prior submodule employs a spectral entropy discrimination mechanism to select the most suitable empirical model under different excitation waveforms. The data-driven submodule incorporates convolutional neural networks, multi-head attention mechanisms, and bidirectional long short-term memory networks to extract flux-density time-series features. An adaptive feature fusion module is introduced to improve multimodal feature interaction and integration. Using the MagNet dataset containing various magnetic materials, this paper evaluates the proposed method and compares it with 21 representative models from the 2023 challenge and three advanced methods from 2024-2025. The results show that the proposed method achieves improved modeling accuracy and robustness.
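
Spectral entropy, the gating quantity of the discrimination mechanism, is the normalized Shannon entropy of a signal’s power spectrum: low for a near-sinusoidal excitation, high for broadband ones. A minimal implementation of the standard definition; how entropy values are mapped to empirical loss models is the paper’s contribution and is not reproduced here:

```python
import numpy as np

def spectral_entropy(x: np.ndarray) -> float:
    """Normalized Shannon entropy of the power spectrum, in [0, 1]."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    p = psd / psd.sum()
    p = p[p > 0]                              # drop empty bins to avoid log(0)
    return float(-(p * np.log(p)).sum() / np.log(len(psd)))

t = np.linspace(0, 1, 1024, endpoint=False)
print(spectral_entropy(np.sin(2 * np.pi * 50 * t)))                  # near 0: one tone
print(spectral_entropy(np.random.default_rng(0).normal(size=1024)))  # near 1: broadband
```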

[279] DAPO: Design Structure-Aware Pass Ordering in High-Level Synthesis with Graph Contrastive and Reinforcement Learning

Jinming Ge, Linfeng Du, Likith Anaparty, Shangkun Li, Tingyuan Liang, Afzal Ahmad, Vivek Chaturvedi, Sharad Sinha, Zhiyao Xie, Jiang Xu, Wei Zhang

Main category: cs.LG

TL;DR: DAPO is a design structure-aware pass ordering framework for HLS that uses program semantics, contrastive learning, and reinforcement learning to discover design-specific optimization strategies, achieving 2.36× speedup over Vitis HLS.

DetailsMotivation: Existing HLS tools use fixed optimization strategies from software compilers, which are ineffective for FPGA-based accelerators. They lack the ability to tailor optimizations to specific designs due to missing capabilities in semantic understanding, hardware metric estimation, and advanced search algorithms.

Method: DAPO extracts program semantics from control and data flow graphs, uses contrastive learning to generate embeddings, employs an analytical model for hardware metric estimation, and guides a reinforcement learning agent to discover design-specific optimization strategies.

Result: The framework achieves a 2.36× speedup over Vitis HLS on average when evaluated on classic HLS designs.

Conclusion: DAPO demonstrates that design-aware optimization strategies guided by program semantics and reinforcement learning can significantly outperform fixed optimization approaches in HLS tools.

Abstract: High-Level Synthesis (HLS) tools are widely adopted in FPGA-based domain-specific accelerator design. However, existing tools rely on fixed optimization strategies inherited from software compilation, limiting their effectiveness. Tailoring optimization strategies to specific designs requires deep semantic understanding, accurate hardware metric estimation, and advanced search algorithms – capabilities that current approaches lack. We propose DAPO, a design structure-aware pass ordering framework that extracts program semantics from control and data flow graphs, employs contrastive learning to generate rich embeddings, and leverages an analytical model for accurate hardware metric estimation. These components jointly guide a reinforcement learning agent to discover design-specific optimization strategies. Evaluations on classic HLS designs demonstrate that our end-to-end flow delivers a 2.36x speedup over Vitis HLS on average.

[280] Symmetry-Aware Steering of Equivariant Diffusion Policies: Benefits and Limits

Minwoo Park, Junwoo Chang, Jongeun Choi, Roberto Horowitz

Main category: cs.LG

TL;DR: EDPs combine diffusion models with geometric symmetries for efficient policy learning. The paper introduces symmetry-aware RL steering for EDPs, showing improved sample efficiency and stability compared to standard RL approaches.

DetailsMotivation: Standard RL applied to equivariant diffusion policies can be sample-inefficient and unstable because it ignores the geometric symmetries that EDPs are designed to exploit. There's a need for symmetry-aware steering methods to properly leverage EDPs' equivariant properties during fine-tuning.

Method: The authors theoretically establish that EDP diffusion processes are equivariant, which induces a group-invariant latent-noise MDP. They introduce a principled symmetry-aware steering framework and compare standard, equivariant, and approximately equivariant RL strategies across tasks with varying symmetry degrees.

Result: Exploiting symmetry during steering yields substantial benefits: enhanced sample efficiency, prevention of value divergence, and strong policy improvements even with extremely limited demonstration data. The paper identifies practical boundaries of strict equivariance under symmetry breaking.

Conclusion: Symmetry-aware RL steering is crucial for effectively fine-tuning equivariant diffusion policies, offering significant advantages over standard RL approaches while maintaining robustness to symmetry breaking in practical applications.

Abstract: Equivariant diffusion policies (EDPs) combine the generative expressivity of diffusion models with the strong generalization and sample efficiency afforded by geometric symmetries. While steering these policies with reinforcement learning (RL) offers a promising mechanism for fine-tuning beyond demonstration data, directly applying standard (non-equivariant) RL can be sample-inefficient and unstable, as it ignores the symmetries that EDPs are designed to exploit. In this paper, we theoretically establish that the diffusion process of an EDP is equivariant, which in turn induces a group-invariant latent-noise MDP that is well-suited for equivariant diffusion steering. Building on this theory, we introduce a principled symmetry-aware steering framework and compare standard, equivariant, and approximately equivariant RL strategies through comprehensive experiments across tasks with varying degrees of symmetry. While we identify the practical boundaries of strict equivariance under symmetry breaking, we show that exploiting symmetry during the steering process yields substantial benefits: enhancing sample efficiency, preventing value divergence, and achieving strong policy improvements even when EDPs are trained from extremely limited demonstrations.

[281] CAT: Can Trust be Predicted with Context-Awareness in Dynamic Heterogeneous Networks?

Jie Wang, Zheng Yan, Jiahe Lan, Xuyan Li, Elisa Bertino

Main category: cs.LG

TL;DR: CAT is a context-aware GNN-based trust prediction model that addresses limitations in existing approaches by handling trust dynamicity, network heterogeneity, and context-awareness through continuous-time representations, dual attention mechanisms, and meta-path-based contextual feature extraction.

DetailsMotivation: Current GNN-based trust prediction models have three key limitations: they fail to capture trust dynamicity (leading to questionable inferences), rarely consider network heterogeneity (losing rich semantics), and don't support context-awareness (making predictions coarse-grained). These gaps motivate the development of a more comprehensive trust prediction model.

Method: CAT consists of four layers: graph construction, embedding, heterogeneous attention, and prediction layers. It handles dynamic graphs using continuous-time representations with time encoding, models heterogeneity through dual attention mechanisms (node type importance and intra-type node importance), and achieves context-awareness via meta-paths for contextual feature extraction, context embeddings, and context-aware aggregation.

Result: Extensive experiments on three real-world datasets show CAT outperforms five groups of baselines in trust prediction. The model also demonstrates strong scalability to large-scale graphs and robustness against both trust-oriented and GNN-oriented attacks.

Conclusion: CAT is the first context-aware GNN-based trust prediction model that successfully addresses trust dynamicity, network heterogeneity, and context-awareness simultaneously, providing more accurate and robust trust predictions for real-world applications.

Abstract: Trust prediction provides valuable support for decision-making, risk mitigation, and system security enhancement. Recently, Graph Neural Networks (GNNs) have emerged as a promising approach for trust prediction, owing to their ability to learn expressive node representations that capture intricate trust relationships within a network. However, current GNN-based trust prediction models face several limitations: (i) Most of them fail to capture trust dynamicity, leading to questionable inferences. (ii) They rarely consider the heterogeneous nature of real-world networks, resulting in a loss of rich semantics. (iii) None of them support context-awareness, a basic property of trust, making prediction results coarse-grained. To this end, we propose CAT, the first Context-Aware GNN-based Trust prediction model that supports trust dynamicity and accurately represents real-world heterogeneity. CAT consists of a graph construction layer, an embedding layer, a heterogeneous attention layer, and a prediction layer. It handles dynamic graphs using continuous-time representations and captures temporal information through a time encoding function. To model graph heterogeneity and leverage semantic information, CAT employs a dual attention mechanism that identifies the importance of different node types and nodes within each type. For context-awareness, we introduce a new notion of meta-paths to extract contextual features. By constructing context embeddings and integrating a context-aware aggregator, CAT can predict both context-aware trust and overall trust. Extensive experiments on three real-world datasets demonstrate that CAT outperforms five groups of baselines in trust prediction, while exhibiting strong scalability to large-scale graphs and robustness against both trust-oriented and GNN-oriented attacks.

[282] Attacking and Securing Community Detection: A Game-Theoretic Framework

Yifan Niu, Aochuan Chen, Tingyang Xu, Jia Li

Main category: cs.LG

TL;DR: This paper extends adversarial attacks from graph classification to community detection, proposing novel attack/defense techniques and a game-theoretic framework (CD-GAME) to model interactive behaviors between attackers hiding individuals and defenders protecting community detection models.

DetailsMotivation: The motivation is to address adversarial attacks in community detection, which is more challenging than graph classification. This has practical applications for protecting personal privacy in social networks and understanding camouflage patterns in transaction networks.

Method: The paper proposes novel attack techniques to hide targeted individuals from detection models and defense techniques to enhance robustness. It introduces CD-GAME, a game-theoretic framework with two players: a graph attacker and a Rayleigh Quotient defender, modeling mutual influence and feedback mechanisms until Nash equilibrium.

Result: Extensive experiments show the proposed attack and defense methods outperform existing baselines significantly. CD-GAME reveals that at Nash equilibrium, attackers adopt more imperceptible strategies that maintain satisfactory effectiveness even after defense, unlike traditional single-step attacks.

Conclusion: The work successfully extends adversarial graph concepts to community detection, providing effective attack/defense techniques and valuable insights through the CD-GAME framework for understanding interactive scenarios in community detection problems.

Abstract: It has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations, can cause deep graph models to fail on classification tasks. In this work, we extend the concept of adversarial graphs to the community detection problem, which is more challenging. We propose novel attack and defense techniques for the community detection problem, with the objective of hiding targeted individuals from detection models and enhancing the robustness of community detection models, respectively. These techniques have many applications in real-world scenarios, for example, protecting personal privacy in social networks and understanding camouflage patterns in transaction networks. To simulate interactive attack and defense behaviors, we further propose a game-theoretic framework, called CD-GAME. One player is a graph attacker, while the other player is a Rayleigh Quotient defender. The CD-GAME models the mutual influence and feedback mechanisms between the attacker and the defender, revealing the dynamic evolutionary process of the game. Both players dynamically update their strategies until they reach the Nash equilibrium. Extensive experiments demonstrate the effectiveness of our proposed attack and defense methods, and both outperform existing baselines by a significant margin. Furthermore, CD-GAME provides valuable insights for understanding interactive attack and defense scenarios in community detection problems. We found that in traditional single-step attack or defense, the attacker tends to employ strategies that are most effective but easily detected and countered by the defender. When the interactive game reaches a Nash equilibrium, the attacker adopts more imperceptible strategies that can still achieve satisfactory attack effectiveness even after defense.
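
The interactive loop in CD-GAME, players alternately updating strategies until neither can improve, can be illustrated with a generic best-response iteration on a toy payoff matrix. This is a schematic stand-in, not the paper's attacker/defender models:

```python
# A minimal sketch of alternating best-response dynamics on a toy zero-sum
# payoff matrix; in CD-GAME the "strategies" are graph perturbations and
# Rayleigh-Quotient defense configurations rather than matrix indices.
import numpy as np

rng = np.random.default_rng(0)
payoff = rng.normal(size=(6, 6))   # payoff[a, d]: attacker's gain vs. defense d

a, d = 0, 0                        # initial strategies
for step in range(100):
    a_new = int(np.argmax(payoff[:, d]))       # attacker best-responds
    d_new = int(np.argmin(payoff[a_new, :]))   # defender best-responds (zero-sum)
    if (a_new, d_new) == (a, d):               # fixed point: pure Nash equilibrium
        print(f"pure Nash equilibrium after {step} rounds: ({a}, {d})")
        break
    a, d = a_new, d_new
# note: best-response dynamics can cycle when no pure equilibrium exists;
# richer strategy spaces (as in CD-GAME) are needed for convergence in general
```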

[283] Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization

Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, Jia Li

Main category: cs.LG

TL;DR: NSPO is a novel RL framework that projects safety policy gradients into the null space of general tasks to mitigate alignment tax, preserving LLMs’ core abilities while ensuring effective safety alignment.

DetailsMotivation: Current RL-based safety alignment methods for LLMs suffer from "alignment tax" - forgetting learned general abilities when aligning models with human values and ethical principles.

Method: Null-Space constrained Policy Optimization (NSPO) projects safety policy gradients into the null space of general tasks, theoretically preserving original capabilities while ensuring descent direction for safety alignment.

Result: NSPO outperforms existing methods by large margins, achieves SOTA safety performance without sacrificing accuracy on math, code, and instruction-following tasks, and is data-efficient (requires only 40% of PKU-SafeRLHF data).

Conclusion: NSPO effectively addresses the alignment tax problem in LLM safety alignment, enabling preservation of core abilities while achieving strong safety performance with reduced data requirements.

Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities, which is also known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment that preserves the model’s core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model’s original core capabilities, while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms existing methods by a large margin, achieving state-of-the-art safety performance without sacrificing accuracy on general tasks, including math, code, and instruction-following tasks. Notably, NSPO is data-efficient: it requires only 40% of the public human-annotated safety data from PKU-SafeRLHF to achieve promising safety performance, without the large amounts of mixed general-task data required by existing alignment methods.
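
The core projection step admits a compact sketch. Assuming the standard orthogonal projector onto the null space of the general-task gradient matrix (the paper's exact construction may differ), the safety gradient is stripped of any component that would perturb general tasks to first order:

```python
# A minimal sketch, assuming the textbook null-space projector:
# g_proj = g - G^T (G G^T)^{-1} G g, where rows of G are general-task gradients.
import torch

def null_space_project(g: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Project safety gradient g onto the null space of general-task gradients G."""
    GGt = G @ G.T
    x = torch.linalg.solve(GGt + 1e-6 * torch.eye(G.shape[0]), G @ g)
    return g - G.T @ x

# toy check: the projected gradient is orthogonal to every general-task gradient,
# so a step along it leaves general-task behavior unchanged to first order
G = torch.randn(5, 100)    # 5 general tasks, 100 parameters
g = torch.randn(100)       # safety policy gradient
g_proj = null_space_project(g, G)
print(torch.allclose(G @ g_proj, torch.zeros(5), atol=1e-4))   # True
```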

[284] Bhargava Cube–Inspired Quadratic Regularization for Structured Neural Embeddings

S Sairam, Prateek P Kulkarni

Main category: cs.LG

TL;DR: Novel neural representation learning using Bhargava cubes from number theory to impose algebraic constraints on 3D latent spaces, achieving 99.46% accuracy on MNIST with interpretable embeddings.

DetailsMotivation: Traditional deep learning methods create unstructured latent spaces lacking interpretability and mathematical consistency. The paper aims to incorporate structured mathematical priors into neural representation learning for better interpretability and mathematical grounding.

Method: Framework maps input data to constrained 3D latent spaces where embeddings are regularized to satisfy learned quadratic relationships derived from Bhargava’s combinatorial structures. Uses differentiable auxiliary loss function independent of classification objectives to guide models toward mathematically structured representations.

Result: Achieves 99.46% accuracy on MNIST while producing interpretable 3D embeddings that naturally cluster by digit class and satisfy learned quadratic constraints. Unlike existing manifold learning requiring explicit geometric supervision, this method imposes weak algebraic priors through differentiable constraints.

Conclusion: First application of number-theoretic constructs to neural representation learning, establishing a foundation for incorporating structured mathematical priors in neural networks. The approach ensures compatibility with standard optimization while providing mathematical consistency and interpretability.

Abstract: We present a novel approach to neural representation learning that incorporates algebraic constraints inspired by Bhargava cubes from number theory. Traditional deep learning methods learn representations in unstructured latent spaces lacking interpretability and mathematical consistency. Our framework maps input data to constrained 3-dimensional latent spaces where embeddings are regularized to satisfy learned quadratic relationships derived from Bhargava’s combinatorial structures. The architecture employs a differentiable auxiliary loss function operating independently of classification objectives, guiding models toward mathematically structured representations. We evaluate on MNIST, achieving 99.46% accuracy while producing interpretable 3D embeddings that naturally cluster by digit class and satisfy learned quadratic constraints. Unlike existing manifold learning approaches requiring explicit geometric supervision, our method imposes weak algebraic priors through differentiable constraints, ensuring compatibility with standard optimization. This represents the first application of number-theoretic constructs to neural representation learning, establishing a foundation for incorporating structured mathematical priors in neural networks.
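
A minimal sketch of the auxiliary loss idea, with the quadratic relation written as a learned quadric residual $z^\top A z + b^\top z + c \approx 0$ on 3D embeddings; the specific Bhargava-cube parameterization is in the paper, so the form below is only illustrative:

```python
# A minimal sketch (illustrative form, not the paper's exact parameterization):
# regularize 3-D embeddings z toward a learned quadric surface, added with a
# small weight to the usual classification objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadraticConstraint(nn.Module):
    """Penalize deviation of 3-D embeddings from a learned quadric."""
    def __init__(self):
        super().__init__()
        self.A = nn.Parameter(0.1 * torch.randn(3, 3))   # learned quadratic form
        self.b = nn.Parameter(torch.zeros(3))
        self.c = nn.Parameter(torch.zeros(1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        quad = torch.einsum("bi,ij,bj->b", z, self.A, z)          # z^T A z
        return ((quad + z @ self.b + self.c) ** 2).mean()         # squared residual

constraint = QuadraticConstraint()
z = torch.randn(32, 3, requires_grad=True)          # encoder embeddings (toy)
logits, labels = torch.randn(32, 10), torch.randint(0, 10, (32,))
loss = F.cross_entropy(logits, labels) + 0.1 * constraint(z)
loss.backward()
```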

[285] Sliced ReLU attention: Quasi-linear contextual expressivity via sorting

Siwan Boufadène, François-Xavier Vialard

Main category: cs.LG

TL;DR: Sliced ReLU attention is a new attention mechanism with O(n log n) complexity that uses one-dimensional projections and sorting instead of softmax or ReLU on dot products.

DetailsMotivation: The paper aims to develop an attention mechanism that combines computational efficiency for long contexts with strong theoretical expressive power, addressing limitations of existing softmax and ReLU-based attention methods.

Method: Instead of applying nonlinearities to pairwise dot products, the method operates on one-dimensional projections of key-query differences and leverages sorting to achieve quasi-linear O(n log n) complexity through a differentiable, non-symmetric kernel.

Result: The sliced ReLU attention preserves theoretical expressive power with two in-context expressivity results: ability to perform nontrivial sequence-to-sequence disentangling tasks and contextual universal approximation property, previously known only for softmax attention.

Conclusion: Sliced ReLU attention offers a promising alternative to existing attention mechanisms with computational benefits for long contexts while maintaining strong theoretical guarantees, as demonstrated in small-scale experiments.

Abstract: We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key–query differences and leverage sorting to obtain quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be computed in O(n log(n)) through a sorting procedure, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small-scale experiments.
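
The sorting trick can be made concrete. For one projection direction $u$, the unnormalized output $\sum_j \mathrm{ReLU}(u \cdot q_i - u \cdot k_j)\, v_j$ splits into two prefix sums over keys sorted along the projection, giving O(n log n) total; the paper's full construction (normalization, multiple slices) is richer than this sketch:

```python
# A minimal sketch of the O(n log n) evaluation for a single slice:
# sum_j ReLU(s_i - t_j) v_j = s_i * sum(v_j) - sum(t_j v_j), over keys t_j < s_i,
# computed with one sort, prefix sums, and a binary search per query.
import numpy as np

def sliced_relu_attention(Q, K, V, u):
    s, t = Q @ u, K @ u                          # 1-D projections
    order = np.argsort(t)                        # O(n log n)
    t_s, V_s = t[order], V[order]
    zero = np.zeros((1, V.shape[1]))
    c0 = np.concatenate([zero, np.cumsum(V_s, axis=0)])            # prefix sum of v_j
    c1 = np.concatenate([zero, np.cumsum(t_s[:, None] * V_s, 0)])  # prefix sum of t_j v_j
    idx = np.searchsorted(t_s, s)                # number of keys with t_j < s_i
    return s[:, None] * c0[idx] - c1[idx]

rng = np.random.default_rng(0)
n, d, dv = 512, 16, 8
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, dv))
u = rng.normal(size=d)
out = sliced_relu_attention(Q, K, V, u)
S = np.maximum((Q @ u)[:, None] - (K @ u)[None, :], 0.0)   # quadratic reference
assert np.allclose(out, S @ V)
```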

[286] Hyperbolic Gaussian Blurring Mean Shift: A Statistical Mode-Seeking Framework for Clustering in Curved Spaces

Arghya Pratihar, Arnab Seal, Swagatam Das, Inesh Chattopadhyay

Main category: cs.LG

TL;DR: HypeGBMS extends Gaussian Blurring Mean Shift to hyperbolic space for hierarchical clustering, outperforming Euclidean methods on tree-like datasets.

DetailsMotivation: Standard GBMS works well in Euclidean space but fails to capture hierarchical/tree-like structures in data. There's a need for clustering methods that can handle non-Euclidean geometries while maintaining density-seeking behavior.

Method: Extends GBMS to hyperbolic space by replacing Euclidean distances with hyperbolic distances and using Möbius-weighted means to ensure geometric consistency. All updates respect hyperbolic geometry constraints.

Result: HypeGBMS significantly outperforms conventional mean-shift methods on 11 real-world datasets with hierarchical structures. Provides theoretical convergence guarantees and computational complexity analysis.

Conclusion: Bridges classical mean-shift clustering with hyperbolic representation learning, offering a principled approach for density-based clustering in curved spaces that effectively captures latent hierarchies.

Abstract: Clustering is a fundamental unsupervised learning task for uncovering patterns in data. While Gaussian Blurring Mean Shift (GBMS) has proven effective for identifying arbitrarily shaped clusters in Euclidean space, it struggles with datasets exhibiting hierarchical or tree-like structures. In this work, we introduce HypeGBMS, a novel extension of GBMS to hyperbolic space. Our method replaces Euclidean computations with hyperbolic distances and employs Möbius-weighted means to ensure that all updates remain consistent with the geometry of the space. HypeGBMS effectively captures latent hierarchies while retaining the density-seeking behavior of GBMS. We provide theoretical insights into convergence and computational complexity, along with empirical results that demonstrate improved clustering quality in hierarchical datasets. This work bridges classical mean-shift clustering and hyperbolic representation learning, offering a principled approach to density-based clustering in curved spaces. Extensive experimental evaluations on $11$ real-world datasets demonstrate that HypeGBMS significantly outperforms conventional mean-shift clustering methods in non-Euclidean settings, underscoring its robustness and effectiveness.
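
A sketch of one blurring step on the Poincaré ball, with my own simplification: the Möbius-weighted mean is replaced by the closed-form Einstein midpoint computed in Klein coordinates, a common stand-in in hyperbolic machine learning:

```python
# A minimal sketch (not the authors' code): Gaussian weights from hyperbolic
# distances, and a weighted mean via the Einstein midpoint in Klein coordinates.
import numpy as np

def poincare_dist(X, Y):
    # pairwise d(x, y) = arccosh(1 + 2|x-y|^2 / ((1-|x|^2)(1-|y|^2)))
    sq = np.sum((X[:, None] - Y[None, :]) ** 2, axis=-1)
    denom = (1 - np.sum(X**2, -1))[:, None] * (1 - np.sum(Y**2, -1))[None, :]
    return np.arccosh(1 + 2 * sq / denom)

def weighted_hyperbolic_mean(X, W):
    K = 2 * X / (1 + np.sum(X**2, -1, keepdims=True))        # Poincare -> Klein
    gamma = 1 / np.sqrt(1 - np.sum(K**2, -1))                # Lorentz factors
    M = (W * gamma) @ K / (W * gamma).sum(-1, keepdims=True) # Einstein midpoint
    return M / (1 + np.sqrt(1 - np.sum(M**2, -1, keepdims=True)))  # Klein -> Poincare

def hype_gbms_step(X, bandwidth=0.5):
    W = np.exp(-(poincare_dist(X, X) / bandwidth) ** 2)      # Gaussian blurring weights
    return weighted_hyperbolic_mean(X, W)

rng = np.random.default_rng(0)
X = rng.normal(scale=0.2, size=(200, 2))                     # points inside the unit ball
for _ in range(10):
    X = hype_gbms_step(X)                                    # points contract toward modes
```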

[287] NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics

Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Guangliang Liu, Yuxuan Liang, Xiaomeng Huang

Main category: cs.LG

TL;DR: NeuralOGCM combines differentiable physics solvers with deep learning to create fast, stable ocean models that outperform traditional numerical methods in speed and pure AI baselines in accuracy.

DetailsMotivation: To address the long-standing trade-off between computational efficiency and physical fidelity in high-precision scientific simulation, particularly for ocean modeling.

Method: A hybrid framework with: 1) A fully differentiable dynamical solver using physics knowledge as inductive bias, with learnable physical parameters, 2) A deep neural network to correct subgrid-scale processes and discretization errors, 3) Both components integrated by a unified ODE solver.

Result: NeuralOGCM maintains long-term stability and physical consistency while significantly outperforming traditional numerical models in speed and pure AI baselines in accuracy.

Conclusion: The work paves a new path for building fast, stable, and physically-plausible models for scientific computing by fusing differentiable programming with deep learning.

Abstract: High-precision scientific simulation faces a long-standing trade-off between computational efficiency and physical fidelity. To address this challenge, we propose NeuralOGCM, an ocean modeling framework that fuses differentiable programming with deep learning. At the core of NeuralOGCM is a fully differentiable dynamical solver, which leverages physics knowledge as its core inductive bias. The learnable physics integration captures large-scale, deterministic physical evolution, and transforms key physical parameters (e.g., diffusion coefficients) into learnable parameters, enabling the model to autonomously optimize its physical core via end-to-end training. Concurrently, a deep neural network learns to correct for subgrid-scale processes and discretization errors not captured by the physics model. Both components work in synergy, with their outputs integrated by a unified ODE solver. Experiments demonstrate that NeuralOGCM maintains long-term stability and physical consistency, significantly outperforming traditional numerical models in speed and pure AI baselines in accuracy. Our work paves a new path for building fast, stable, and physically-plausible models for scientific computing.
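
The hybrid structure can be sketched on a toy 1-D diffusion core: a learnable physical parameter (here a diffusivity), a neural correction term, and a single shared ODE step. The architecture below is hypothetical and far smaller than an ocean model:

```python
# A minimal sketch of the hybrid idea: physics core with a learnable diffusion
# coefficient plus a neural correction, advanced by one explicit-Euler step.
import torch
import torch.nn as nn

class HybridStep(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_kappa = nn.Parameter(torch.tensor(-2.0))    # learnable diffusivity
        self.correction = nn.Sequential(                     # subgrid / error model
            nn.Conv1d(1, 16, 5, padding=2), nn.GELU(), nn.Conv1d(16, 1, 5, padding=2))

    def forward(self, u: torch.Tensor, dt: float = 0.01) -> torch.Tensor:
        # u: (batch, n_grid) periodic field; 1-D diffusion as the "physics core"
        lap = torch.roll(u, 1, -1) - 2 * u + torch.roll(u, -1, -1)
        physics = self.log_kappa.exp() * lap                  # du/dt from physics
        learned = self.correction(u.unsqueeze(1)).squeeze(1)  # du/dt from the NN
        return u + dt * (physics + learned)                   # unified Euler step

model = HybridStep()
u = torch.randn(8, 64)
u_next = model(u)   # trained end to end against reference trajectories
```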

[288] Contrastive Time Series Forecasting with Anomalies

Joel Ekstrand, Zahra Taghiyarrenani, Slawomir Nowaczyk

Main category: cs.LG

TL;DR: Co-TSFA is a contrastive learning framework that helps time series forecasting models distinguish between forecast-relevant anomalies (that should influence predictions) and forecast-irrelevant anomalies (that should be ignored), improving robustness to anomalous events.

DetailsMotivation: Standard forecasting models fail to distinguish between anomalies that have lasting effects on future values (forecast-relevant) and those that are short-lived noise (forecast-irrelevant), leading to either overreaction to noise or missing important distributional shifts.

Method: Proposes Co-TSFA with input-only and input-output augmentations to model different anomaly types, and introduces a latent-output alignment loss that ties representation changes to forecast changes, encouraging invariance to irrelevant perturbations while preserving sensitivity to meaningful shifts.

Result: Experiments on Traffic and Electricity benchmarks and a real-world cash-demand dataset show Co-TSFA improves forecasting performance under anomalous conditions while maintaining accuracy on normal data.

Conclusion: Co-TSFA provides an effective regularization framework for time series forecasting that learns to distinguish between forecast-relevant and irrelevant anomalies, enhancing model robustness in real-world scenarios with anomalous events.

Abstract: Time series forecasting predicts future values from past data. In real-world settings, some anomalous events have lasting effects and influence the forecast, while others are short-lived and should be ignored. Standard forecasting models fail to make this distinction, often either overreacting to noise or missing persistent shifts. We propose Co-TSFA (Contrastive Time Series Forecasting with Anomalies), a regularization framework that learns when to ignore anomalies and when to respond. Co-TSFA generates input-only and input-output augmentations to model forecast-irrelevant and forecast-relevant anomalies, and introduces a latent-output alignment loss that ties representation changes to forecast changes. This encourages invariance to irrelevant perturbations while preserving sensitivity to meaningful distributional shifts. Experiments on the Traffic and Electricity benchmarks, as well as on a real-world cash-demand dataset, demonstrate that Co-TSFA improves performance under anomalous conditions while maintaining accuracy on normal data. An anonymized GitHub repository with the implementation of Co-TSFA is provided and will be made public upon acceptance.
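
One plausible formalization of the latent-output alignment loss: make the size of the representation change track the size of the forecast change, so input-only augmentations (target unchanged) push latents toward invariance while input-output augmentations preserve sensitivity. Details here are assumptions, not the paper's exact loss:

```python
# A minimal sketch (my formalization) of a latent-output alignment term.
import torch
import torch.nn.functional as F

def latent_output_alignment(z_clean, z_aug, y_clean, y_aug):
    dz = (z_clean - z_aug).flatten(1).norm(dim=1)   # representation change
    dy = (y_clean - y_aug).flatten(1).norm(dim=1)   # forecast change
    return F.mse_loss(dz, dy)                       # tie the two together

# input-only augmentation: spike the history, keep the target, so dy = 0
# input-output augmentation: shift history and target jointly, so dy > 0
z_c, z_a = torch.randn(32, 64), torch.randn(32, 64)   # latents (toy)
y_c, y_a = torch.randn(32, 24), torch.randn(32, 24)   # forecasts (toy)
loss = latent_output_alignment(z_c, z_a, y_c, y_a)
```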

[289] xGR: Efficient Generative Recommendation Serving at Scale

Qingxiao Sun, Tongxuan Liu, Shen Zhang, Siyu Wu, Peijun Yang, Haotian Liang, Menxin Li, Xiaolong Ma, Zhiwei Liang, Ziyi Ren, Minchao Zhang, Xinyu Liu, Ke Zhang, Depei Qian, Hailong Yang

Main category: cs.LG

TL;DR: xGR is a serving system for generative recommendation that optimizes LLM-based recommendation workloads with specialized techniques for handling long prompts, short outputs, and large beam search spaces to achieve high throughput under strict latency constraints.

DetailsMotivation: Generative recommendation using LLMs has different workload characteristics than traditional LLM serving - it processes long prompts but produces short, fixed-length outputs, with high computational costs in decode phases due to large beam width and time-consuming sorting overhead from vast item spaces.

Method: xGR uses three key techniques: 1) Unifies prefill and decode phases through staged computation and separated KV cache, 2) Enables early sorting termination and mask-based item filtering with data structure reuse, and 3) Reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism.

Result: Experiments with real-world recommendation service datasets show xGR achieves at least 3.49x throughput compared to state-of-the-art baselines under strict latency constraints.

Conclusion: xGR successfully addresses the unique serving challenges of generative recommendation systems by optimizing for their specific workload patterns, enabling efficient LLM-based recommendation serving with high throughput and low latency.

Abstract: Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR’s workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since the beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x throughput compared to the state-of-the-art baseline under strict latency constraints.
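
The early-termination idea for sorting over a vast item space is essentially top-k selection: a linear-time partial partition followed by sorting only the k survivors, as in this generic sketch:

```python
# A minimal sketch of avoiding a full sort over the item space: beam search
# only needs the top-k scores, so an O(n) partial selection plus an O(k log k)
# sort of the survivors replaces an O(n log n) full sort.
import numpy as np

def topk_scores(scores: np.ndarray, k: int):
    idx = np.argpartition(scores, -k)[-k:]        # unordered top-k, O(n)
    idx = idx[np.argsort(scores[idx])[::-1]]      # order only k items
    return idx, scores[idx]

scores = np.random.default_rng(0).normal(size=1_000_000)   # vast item space
beam_idx, beam_scores = topk_scores(scores, k=32)
```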

[290] Parametric Numerical Integration with (Differential) Machine Learning

Álvaro Leitao, Jonatan Ráfales

Main category: cs.LG

TL;DR: Differential machine learning approach for parametric integrals outperforms standard methods across statistical functionals, Chebyshev expansions, and differential equation integrals.

DetailsMotivation: To develop more efficient and accurate machine learning methods for solving parametric integrals, which are fundamental in many scientific and engineering applications but can be computationally challenging.

Method: A differential learning framework that incorporates derivative information during training, applied to three problem classes: statistical functionals (moments, CDFs), Chebyshev expansions, and integrals from differential equations.

Result: The differential machine learning approach consistently outperforms standard architectures with lower mean squared error, enhanced scalability, and improved sample efficiency across all tested cases.

Conclusion: Incorporating derivative information in machine learning training provides significant advantages for solving parametric integrals, making it a superior approach for various applications from smooth benchmarks to challenging numerical integrals.

Abstract: In this work, we introduce a machine/deep learning methodology to solve parametric integrals. Besides classical machine learning approaches, we consider a differential learning framework that incorporates derivative information during training, emphasizing its advantageous properties. Our study covers three representative problem classes: statistical functionals (including moments and cumulative distribution functions), approximation of functions via Chebyshev expansions, and integrals arising directly from differential equations. These examples range from smooth closed-form benchmarks to challenging numerical integrals. Across all cases, the differential machine learning-based approach consistently outperforms standard architectures, achieving lower mean squared error, enhanced scalability, and improved sample efficiency.
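
A minimal sketch of differential training on a toy parametric integral with a closed form, $I(\theta) = \int_0^1 \sin(\theta x)\,dx = (1-\cos\theta)/\theta$: the loss penalizes errors in both the value and the derivative, with the model's derivative obtained by autograd:

```python
# A minimal sketch of differential (derivative-augmented) training for a
# parametric integral; the toy target and architecture are mine.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

theta = torch.rand(256, 1) * 3 + 0.5
I = (1 - torch.cos(theta)) / theta                                    # target values
dI = (theta * torch.sin(theta) - (1 - torch.cos(theta))) / theta**2   # target derivatives

for step in range(500):
    th = theta.clone().requires_grad_(True)
    pred = net(th)
    grad = torch.autograd.grad(pred.sum(), th, create_graph=True)[0]  # dI/dtheta of the model
    loss = ((pred - I) ** 2).mean() + ((grad - dI) ** 2).mean()       # value + derivative loss
    opt.zero_grad(); loss.backward(); opt.step()
```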

[291] A Multi-Criteria Automated MLOps Pipeline for Cost-Effective Cloud-Based Classifier Retraining in Response to Data Distribution Shifts

Emmanuel K. Katalay, David O. Dimandja, Jordan F. Masakuna

Main category: cs.LG

TL;DR: Automated MLOps pipeline for neural network retraining using statistical drift detection to optimize computational resources and maintain model performance during data distribution changes.

DetailsMotivation: Manual MLOps processes for model retraining are inefficient when data distributions change over time (distribution drift), leading to performance deterioration in ML models that requires automated solutions.

Method: Developed an automated MLOps pipeline that uses multi-criteria statistical techniques to detect significant data distribution shifts and triggers neural network classifier retraining only when necessary.

Result: Experiments on benchmark anomaly detection datasets show significant improvements in model accuracy and robustness compared to traditional retraining strategies.

Conclusion: Provides a foundation for deploying reliable and adaptive ML systems in dynamic real-world settings where data distribution changes are common, with automated drift detection and optimized retraining.

Abstract: The performance of machine learning (ML) models often deteriorates when the underlying data distribution changes over time, a phenomenon known as data distribution drift. When this happens, ML models need to be retrained and redeployed. ML Operations (MLOps) is often manual, i.e., humans trigger the process of model retraining and redeployment. In this work, we present an automated MLOps pipeline designed to address neural network classifier retraining in response to significant data distribution changes. Our MLOps pipeline employs multi-criteria statistical techniques to detect distribution shifts and triggers model updates only when necessary, ensuring computational efficiency and resource optimization. We demonstrate the effectiveness of our framework through experiments on several benchmark anomaly detection data sets, showing significant improvements in model accuracy and robustness compared to traditional retraining strategies. Our work provides a foundation for deploying more reliable and adaptive ML systems in dynamic real-world settings, where data distribution changes are common.
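
A minimal sketch of a multi-criteria trigger; the specific criteria here (a Kolmogorov-Smirnov test plus a population stability index, both of which must fire) are my choice for illustration, not necessarily the paper's:

```python
# A minimal sketch of multi-criteria drift detection: retrain only when two
# independent statistical criteria agree, damping spurious retraining.
import numpy as np
from scipy.stats import ks_2samp

def psi(ref, cur, bins=10):
    """Population stability index between reference and current samples."""
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    r, _ = np.histogram(ref, edges)
    c, _ = np.histogram(cur, edges)
    r = np.clip(r / r.sum(), 1e-6, None)
    c = np.clip(c / c.sum(), 1e-6, None)
    return float(np.sum((c - r) * np.log(c / r)))

def should_retrain(ref, cur, ks_alpha=0.01, psi_thresh=0.2):
    ks_flag = ks_2samp(ref, cur).pvalue < ks_alpha
    psi_flag = psi(ref, cur) > psi_thresh
    return ks_flag and psi_flag          # multi-criteria: both must agree

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5000)       # training-time feature distribution
incoming = rng.normal(0.8, 1, 1000)      # shifted production window
if should_retrain(reference, incoming):
    print("distribution shift confirmed: trigger retraining pipeline")
```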

[292] Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting

Federico Pennino, Maurizio Gabbrielli

Main category: cs.LG

TL;DR: Data-centric optimization framework discovers optimal training data mixtures that outperform training on entire datasets, achieving 19.41% improvement on PMSM dataset.

DetailsMotivation: Raw sensor data is often imbalanced and redundant, with not all data points contributing equally to model generalization. The standard "more data is better" paradigm may be suboptimal for deep learning models on sensor data.

Method: Framework that optimizes training data composition: 1) Uses large-scale encoder and k-means clustering to partition dataset into behaviorally consistent clusters, 2) Employs Optuna optimization to search high-dimensional space of possible data mixtures, 3) For each trial, constructs training set based on proposed cluster sampling ratios, 4) Trains and evaluates smaller target model.

Result: Data-centric search consistently discovers data mixtures yielding significantly higher performance than baselines trained on entire dataset. On PMSM dataset: improved MSE from 1.70 to 1.37 (19.41% improvement).

Conclusion: “Less is more” approach can be superior for training deep learning models on sensor data. Optimizing training data composition rather than model hyperparameters represents a promising data-centric paradigm shift.

Abstract: The standard paradigm for training deep learning models on sensor data assumes that more data is always better. However, raw sensor streams are often imbalanced and contain significant redundancy, meaning that not all data points contribute equally to model generalization. In this paper, we show that, in some cases, “less is more” when considering datasets. We do this by reframing the data selection problem: rather than tuning model hyperparameters, we fix the model and optimize the composition of the training data itself. We introduce a framework for discovering the optimal “training diet” from a large, unlabeled time series corpus. Our framework first uses a large-scale encoder and k-means clustering to partition the dataset into distinct, behaviorally consistent clusters. These clusters represent the fundamental ‘ingredients’ available for training. We then employ the Optuna optimization framework to search the high-dimensional space of possible data mixtures. For each trial, Optuna proposes a specific sampling ratio for each cluster, and a new training set is constructed based on this recipe. A smaller target model is then trained and evaluated. Our experiments reveal that this data-centric search consistently discovers data mixtures that yield models with significantly higher performance compared to baselines trained on the entire dataset. Specifically, evaluated on the PMSM dataset, our method improved performance from a baseline MSE of 1.70 to 1.37, a 19.41% improvement.
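
The search loop is straightforward to sketch with Optuna on a toy corpus: each trial proposes per-cluster sampling ratios, a training set is assembled accordingly, and a small model's validation error is the objective. All data and the target model here are toys, not the paper's setup:

```python
# A minimal sketch of the "training diet" search with Optuna on synthetic clusters.
import numpy as np
import optuna

rng = np.random.default_rng(0)

def make_cluster(f, n=500):
    x = rng.uniform(0, 1, n)
    return np.stack([x, f(x)], axis=1)

clusters = [make_cluster(lambda x: np.sin(6 * x)),                  # clean regime
            make_cluster(lambda x: rng.normal(0, 0.1, x.shape)),    # noise regime
            make_cluster(lambda x: np.sin(6 * x) + 0.5)]            # biased regime
xv = np.linspace(0, 1, 200)
val = np.stack([xv, np.sin(6 * xv)], axis=1)                        # validation set

def objective(trial):
    w = np.array([trial.suggest_float(f"w{i}", 0.01, 1.0) for i in range(3)])
    w /= w.sum()                                                    # proposed diet
    train = np.concatenate([c[rng.choice(len(c), int(r * 3000))]
                            for c, r in zip(clusters, w)])
    coef = np.polyfit(train[:, 0], train[:, 1], deg=7)              # small target model
    pred = np.polyval(coef, val[:, 0])
    return float(np.mean((pred - val[:, 1]) ** 2))                  # validation MSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)    # best cluster sampling ratios
```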

[293] Elastic-Net Multiple Kernel Learning: Combining Multiple Data Sources for Prediction

Janaina Mourão-Miranda, Zakria Hussain, Konstantinos Tsirlis, Christophe Phillips, John Shawe-Taylor

Main category: cs.LG

TL;DR: The paper introduces a new elastic-net regularized multiple kernel learning (ENMKL) formulation with analytical kernel weight updates, implemented for SVM and KRR in neuroimaging toolbox PRoNTo, showing improved performance and interpretability over existing methods.

DetailsMotivation: Existing elastic-net MKL methods use complex two-stage procedures for kernel weight optimization. There's a need for simpler, more efficient ENMKL formulations that provide analytical solutions for kernel weights, especially in neuroimaging where interpretability and handling correlated kernels are crucial.

Method: Proposed an alternative ENMKL formulation that yields simple analytical updates for kernel weights. Derived explicit algorithms for both SVM and kernel ridge regression (KRR) under this framework. Implemented methods in open-source Pattern Recognition for Neuroimaging Toolbox (PRoNTo).

Result: ENMKL matches or outperforms l1-norm MKL in all tasks and only underperforms standard SVM in one scenario. ENMKL produces sparser, more interpretable models by selectively weighting correlated kernels.

Conclusion: The new ENMKL formulation provides efficient analytical solutions for kernel weights, offering improved performance and interpretability for neuroimaging applications where handling correlated kernels and model sparsity are important.

Abstract: Multiple Kernel Learning (MKL) models combine several kernels in supervised and unsupervised settings to integrate multiple data representations or sources, each represented by a different kernel. MKL seeks an optimal linear combination of base kernels that maximizes a generalized performance measure under a regularization constraint. Various norms have been used to regularize the kernel weights, including $\ell_1$, $\ell_2$, and $\ell_p$, as well as the “elastic-net” penalty, which combines the $\ell_1$- and $\ell_2$-norms to promote both sparsity and the selection of correlated kernels. This property makes elastic-net regularized MKL (ENMKL) especially valuable when model interpretability is critical and kernels capture correlated information, such as in neuroimaging. Previous ENMKL methods have followed a two-stage procedure: fix kernel weights, train a support vector machine (SVM) with the weighted kernel, and then update the weights via gradient descent, cutting-plane methods, or surrogate functions. Here, we introduce an alternative ENMKL formulation that yields a simple analytical update for the kernel weights. We derive explicit algorithms for both SVM and kernel ridge regression (KRR) under this framework, and implement them in the open-source Pattern Recognition for Neuroimaging Toolbox (PRoNTo). We evaluate these ENMKL algorithms against $\ell_1$-norm MKL and against SVM (or KRR) trained on the unweighted sum of kernels across three neuroimaging applications. Our results show that ENMKL matches or outperforms $\ell_1$-norm MKL in all tasks and only underperforms standard SVM in one scenario. Crucially, ENMKL produces sparser, more interpretable models by selectively weighting correlated kernels.
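
For orientation, the two-stage MKL skeleton the paper simplifies looks as follows; the closed-form weight update shown is the classical $\ell_p$-norm rule, standing in for the paper's elastic-net analytical update, whose exact form is derived there:

```python
# A minimal sketch of two-stage MKL with KRR: fit on the weighted kernel sum,
# then update kernel weights in closed form. The update below is the known
# lp-norm MKL rule, NOT the paper's elastic-net update.
import numpy as np

def mkl_krr(kernels, y, lam=1e-2, p=1.5, iters=20):
    M, n = len(kernels), len(y)
    d = np.full(M, 1.0 / M)                               # kernel weights
    for _ in range(iters):
        K = sum(dm * Km for dm, Km in zip(d, kernels))
        alpha = np.linalg.solve(K + lam * np.eye(n), y)   # KRR dual solution
        norms = np.array([dm**2 * alpha @ Km @ alpha      # ||w_m||^2 per kernel
                          for dm, Km in zip(d, kernels)])
        d = norms ** (1 / (p + 1))                        # closed-form lp update
        d /= np.linalg.norm(d, ord=p)                     # renormalize ||d||_p = 1
    return d, alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
kernels = [np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * s**2))
           for s in (0.5, 1.0, 2.0)]                      # RBF kernels, 3 widths
d, alpha = mkl_krr(kernels, y)
print(d)    # learned kernel weights
```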

[294] Fully Inductive Node Representation Learning via Graph View Transformation

Dooho Lee, Myeong Kong, Minho Jeong, Jaemin Yoo

Main category: cs.LG

TL;DR: The paper introduces Graph View Transformation (GVT) and Recurrent GVT, a fully inductive graph model that generalizes to unseen datasets without retraining by operating in a novel “view space” representation.

DetailsMotivation: Current graph models struggle with cross-dataset generalization due to varying feature spaces across different graph datasets. Feature space transformations often violate inductive applicability to unseen datasets, limiting model design.

Method: Introduces the “view space” as a unified representation for arbitrary graphs, and proposes Graph View Transformation (GVT) - a node- and feature-permutation-equivariant mapping in this space. Uses Recurrent GVT as a building block for fully inductive node representation learning.

Result: Pretrained on OGBN-Arxiv and evaluated on 27 node-classification benchmarks, Recurrent GVT outperforms GraphAny (prior fully inductive model) by +8.93% and surpasses 12 individually tuned GNNs by at least +3.30%.

Conclusion: The view space is established as a principled and effective foundation for fully inductive node representation learning, enabling cross-dataset generalization without retraining.

Abstract: Generalizing a pretrained model to unseen datasets without retraining is an essential step toward a foundation model. However, achieving such cross-dataset, fully inductive inference is difficult in graph-structured data where feature spaces vary widely in both dimensionality and semantics. Any transformation in the feature space can easily violate the inductive applicability to unseen datasets, strictly limiting the design space of a graph model. In this work, we introduce the view space, a novel representational axis in which arbitrary graphs can be naturally encoded in a unified manner. We then propose Graph View Transformation (GVT), a node- and feature-permutation-equivariant mapping in the view space. GVT serves as the building block for Recurrent GVT, a fully inductive model for node representation learning. Pretrained on OGBN-Arxiv and evaluated on 27 node-classification benchmarks, Recurrent GVT outperforms GraphAny, the prior fully inductive graph model, by +8.93% and surpasses 12 individually tuned GNNs by at least +3.30%. These results establish the view space as a principled and effective ground for fully inductive node representation learning.

[295] Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents

Stefan Tabakov, Asen Popov, Dimitar Dimitrov, S. Ensiye Kiyamousavi, Vladimir Hristov, Boris Kraychev

Main category: cs.LG

TL;DR: AAS decomposes long-horizon VLA demonstrations into typed atomic actions, creating a validated dataset that improves policy learning and task success rates.

DetailsMotivation: Current VLA models generalize poorly when tasks require new compositions of skills or objects, needing better decomposition methods for long-horizon demonstrations.

Method: Atomic Action Slicing (AAS) decomposes LIBERO demonstrations into short, typed atomic actions with labels for action type, temporal span, and confidence. Uses Gemini 2.5 Pro for segmentation and fine-tunes CLIP-RT+ on the atomic dataset.

Result: Created validated dataset of 2,124 atomic segments. Stronger segmenter (Gemini 2.5 Pro) matches planner-defined plans and is robust to keyframe jitter. Fine-tuning improved task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long.

Conclusion: AAS provides effective decomposition of long-horizon demonstrations into atomic actions, improving VLA model generalization and task success. The GATE-VLAP dataset is publicly released.

Abstract: Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace (https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets).

[296] Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter

Main category: cs.LG

TL;DR: Brain-Semantoks is a self-supervised framework for fMRI time series that learns abstract brain dynamics representations using semantic tokenization and self-distillation, enabling strong downstream task performance with linear probes and showing reliable scaling benefits.

DetailsMotivation: Current fMRI foundation models focus on low-level information through mask-and-reconstruct objectives on small brain regions, making representations sensitive to noise and temporal fluctuations, requiring extensive fine-tuning for downstream tasks.

Method: Two core innovations: 1) semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and 2) self-distillation objective that enforces representational stability across time, stabilized through a novel training curriculum.

Result: Learned representations enable strong performance on various downstream tasks using only linear probes, and comprehensive scaling analyses show more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.

Conclusion: Brain-Semantoks provides a robust self-supervised framework for learning abstract representations of brain dynamics from fMRI time series, addressing noise sensitivity issues and enabling effective transfer learning with minimal fine-tuning.

Abstract: The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.
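
The self-distillation objective is not spelled out in the summary; a standard realization (assumed here, not the authors' code) trains a student encoder to match an exponential-moving-average teacher applied to a temporally adjacent window:

```python
# A minimal sketch of self-distillation for temporal stability: student encodes
# one window, an EMA teacher encodes a neighboring window, student matches teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(400, 256), nn.GELU(), nn.Linear(256, 128))
teacher = nn.Sequential(nn.Linear(400, 256), nn.GELU(), nn.Linear(256, 128))
teacher.load_state_dict(student.state_dict())
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def ema_update(m=0.996):
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)     # teacher trails the student

x_t = torch.randn(16, 400)                       # network-level tokens at time t (toy)
x_t1 = x_t + 0.1 * torch.randn(16, 400)          # adjacent window: similar content
loss = 1 - F.cosine_similarity(student(x_t), teacher(x_t1).detach()).mean()
opt.zero_grad(); loss.backward(); opt.step(); ema_update()
```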

[297] Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration

Alexander Tyurin

Main category: cs.LG

TL;DR: GD for neural networks reduces to generalized perceptron algorithms, revealing implicit acceleration: nonlinear models achieve $\tilde{O}(\sqrt{d})$ iteration complexity vs linear models’ $\Omega(d)$.

DetailsMotivation: Understanding optimization dynamics of gradient descent in neural networks, including convergence rates, trajectories, oscillations, and implicit acceleration, remains challenging despite GD's widespread use.

Method: Analyze nonlinear models with logistic loss, showing GD steps reduce to generalized perceptron algorithms. Use classical linear algebra tools to analyze simplified algorithmic steps on a minimalistic two-layer model example.

Result: Nonlinearity in two-layer models provably yields faster iteration complexity $\tilde{O}(\sqrt{d})$ compared to linear models’ $\Omega(d)$, explaining the implicit acceleration phenomenon. Theoretical results supported by extensive numerical experiments.

Conclusion: The reduction of GD to generalized perceptron algorithms provides a new perspective on neural network optimization dynamics and explains implicit acceleration, offering an alternative view to advance research in this area.

Abstract: Even for the gradient descent (GD) method applied to neural network training, understanding its optimization dynamics, including convergence rate, iterate trajectories, function value oscillations, and especially its implicit acceleration, remains a challenging problem. We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms (Rosenblatt, 1958), providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear algebra tools. Using these tools, we demonstrate on a minimalistic example that the nonlinearity in a two-layer model can provably yield a faster iteration complexity $\tilde{O}(\sqrt{d})$ compared to $\Omega(d)$ achieved by linear models, where $d$ is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks. The theoretical results are supported by extensive numerical experiments. We believe that this alternative view will further advance research on the optimization of neural networks.
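
For a linear model the reduction is easy to see in code: the GD step on the logistic loss is a perceptron-style update in which each example is weighted by $\sigma(-y_i\, w \cdot x_i)$, its soft misclassification degree; the classical perceptron is the hard-threshold limit of this rule:

```python
# A minimal sketch: GD on the logistic loss as a softly weighted perceptron.
# Gradient of mean log(1 + exp(-y w.x)) is -mean sigma(-y w.x) y x, so the
# GD step upweights exactly the examples the current w gets wrong.
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X @ rng.normal(size=10) + 0.1 * rng.normal(size=200))
w, eta = np.zeros(10), 0.1

for _ in range(100):
    weights = sigmoid(-y * (X @ w))         # soft misclassification indicator
    w += eta * (weights * y) @ X / len(y)   # generalized perceptron update
```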

[298] A Fast Interpretable Fuzzy Tree Learner

Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez

Main category: cs.LG

TL;DR: Proposes fuzzy greedy trees for interpretable rule mining, combining computational efficiency of greedy algorithms with interpretability of fuzzy logic, achieving competitive accuracy with lower cost than evolutionary methods.

DetailsMotivation: Existing fuzzy rule-mining algorithms don't guarantee both sensible linguistic partitions and small rule-base sizes needed for interpretability. Evolutionary approaches are computationally expensive, while neural methods like ANFIS lose linguistic interpretability.

Method: Adapts classical tree-based splitting algorithms from crisp rules to fuzzy trees, combining computational efficiency of greedy algorithms with interpretability advantages of fuzzy logic.

Result: Achieves interpretable linguistic partitions, substantially improves running time compared to evolutionary approaches, maintains competitive predictive performance, produces more interpretable rule bases with constrained complexity.

Conclusion: Fuzzy greedy trees offer an effective balance between interpretability and computational efficiency for fuzzy rule-based systems, achieving comparable accuracy to state-of-the-art fuzzy classifiers with significantly lower computational cost.

Abstract: Fuzzy rule-based systems have been mostly used in interpretable decision-making because of their interpretable linguistic rules. However, interpretability requires both sensible linguistic partitions and small rule-base sizes, which are not guaranteed by many existing fuzzy rule-mining algorithms. Evolutionary approaches can produce high-quality models but suffer from prohibitive computational costs, while neural-based methods like ANFIS have problems retaining linguistic interpretations. In this work, we propose an adaptation of classical tree-based splitting algorithms from crisp rules to fuzzy trees, combining the computational efficiency of greedy algorithms with the interpretability advantages of fuzzy logic. This approach achieves interpretable linguistic partitions and substantially improves running time compared to evolutionary-based approaches while maintaining competitive predictive performance. Our experiments on tabular classification benchmarks show that our method achieves comparable accuracy to state-of-the-art fuzzy classifiers with significantly lower computational cost and produces more interpretable rule bases with constrained complexity. Code is available at: https://github.com/Fuminides/fuzzy_greedy_tree_public
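
The crisp-to-fuzzy adaptation of a split criterion can be sketched directly (formalization mine, not the paper's exact criterion): replace the hard $x \le t$ test with a sigmoidal membership and score splits by membership-weighted Gini impurity, searched greedily exactly like CART thresholds:

```python
# A minimal sketch of a fuzzy split score: each sample belongs to the left
# child with a soft membership degree, and impurity is membership-weighted.
import numpy as np

def fuzzy_gini(x, y, t, s=0.25):
    mu = 1 / (1 + np.exp((x - t) / s))            # soft membership in left child
    def child_impurity(w):
        tot = w.sum()
        if tot == 0:
            return 0.0
        p = np.array([w[y == c].sum() for c in np.unique(y)]) / tot
        return tot * (1 - (p ** 2).sum())
    return (child_impurity(mu) + child_impurity(1 - mu)) / len(y)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
y = np.repeat([0, 1], 100)
best_t = min(np.linspace(x.min(), x.max(), 50), key=lambda t: fuzzy_gini(x, y, t))
print(best_t)   # lands near the class boundary around 1.5
```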

[299] Bridging Streaming Continual Learning via In-Context Large Tabular Models

Afonso Lourenço, João Gama, Eric P. Xing, Goreti Marreiros

Main category: cs.LG

TL;DR: The paper proposes using large in-context tabular models (LTMs) as a bridge between Continual Learning (CL) and Stream Learning (SL) for Streaming Continual Learning (SCL), with data selection principles of distribution matching and compression.

DetailsMotivation: Existing research communities address continual learning and stream learning challenges in isolation - CL focuses on long-term retention without real-time constraints, while SL emphasizes rapid adaptation but neglects forgetting. There's a need to bridge these paradigms for effective streaming continual learning.

Method: Proposes using large in-context tabular models (LTMs) that summarize unbounded streams into compact sketches on-the-fly. Structures SCL around two core data selection principles: (1) distribution matching to balance plasticity and stability, and (2) distribution compression to control memory size through diversification and retrieval mechanisms.

Result: The paper presents a conceptual framework showing how LTMs provide a natural bridge for SCL, recovering classical SL motivation of compressing massive streams while aligning with CL’s experience-replay requirements. It clarifies how both communities implicitly use divide-to-conquer strategies for managing plasticity-stability trade-offs.

Conclusion: Large in-context tabular models offer a promising approach for Streaming Continual Learning by bridging the gap between Continual Learning and Stream Learning through principled data selection mechanisms that balance adaptation, retention, and memory efficiency.

Abstract: In streaming scenarios, models must learn continuously, adapting to concept drifts without erasing previously acquired knowledge. However, existing research communities address these challenges in isolation. Continual Learning (CL) focuses on long-term retention and mitigating catastrophic forgetting, often without strict real-time constraints. Stream Learning (SL) emphasizes rapid, efficient adaptation to high-frequency data streams, but typically neglects forgetting. Recent efforts have tried to combine these paradigms, yet no clear algorithmic overlap exists. We argue that large in-context tabular models (LTMs) provide a natural bridge for Streaming Continual Learning (SCL). In our view, unbounded streams should be summarized on-the-fly into compact sketches that can be consumed by LTMs. This recovers the classical SL motivation of compressing massive streams with fixed-size guarantees, while simultaneously aligning with the experience-replay desiderata of CL. To clarify this bridge, we show how the SL and CL communities implicitly adopt a divide-to-conquer strategy to manage the tension between plasticity (performing well on the current distribution) and stability (retaining past knowledge), while also imposing a minimal complexity constraint that motivates diversification (avoiding redundancy in what is stored) and retrieval (re-prioritizing past information when needed). Within this perspective, we propose structuring SCL with LTMs around two core principles of data selection for in-context learning: (1) distribution matching, which balances plasticity and stability, and (2) distribution compression, which controls memory size through diversification and retrieval mechanisms.
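
The fixed-size-guarantee motivation has a classical kernel: reservoir sampling maintains a uniform summary of an unbounded stream in O(k) memory. The paper argues for distribution-matching and diversity-aware selection on top of such sketches, but the baseline mechanism looks like this:

```python
# A minimal sketch of the fixed-size stream summary underlying the proposal:
# reservoir sampling keeps each seen row with probability k / (rows seen).
import random

def reservoir(stream, k, seed=0):
    rng = random.Random(seed)
    buf = []
    for i, row in enumerate(stream):
        if i < k:
            buf.append(row)
        else:
            j = rng.randint(0, i)     # uniform slot; lands in [0, k) w.p. k/(i+1)
            if j < k:
                buf[j] = row
    return buf

sketch = reservoir(((t, t % 7) for t in range(1_000_000)), k=256)
# `sketch` would be the in-context set handed to a large tabular model (LTM)
```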

[300] High-Dimensional Surrogate Modeling for Closed-Loop Learning of Neural-Network-Parameterized Model Predictive Control

Sebastian Hirt, Valentinus Suwanto, Hendrik Alsmeier, Maik Pfefferkorn, Rolf Findeisen

Main category: cs.LG

TL;DR: Bayesian neural networks outperform Gaussian processes as surrogate models in Bayesian optimization for learning high-dimensional controller parameters, enabling successful optimization of hundreds to thousands of parameters.

DetailsMotivation: Bayesian optimization struggles with dense high-dimensional controller parameterizations (like in model predictive controllers) because standard Gaussian process surrogate models fail to capture the structure of such high-dimensional spaces.

Method: Proposes using Bayesian neural networks as surrogate models instead of standard Gaussian processes. Compares three approaches: Gaussian processes with Matern kernels, finite-width Bayesian neural networks, and infinite-width Bayesian neural networks on a cart-pole control task.

Result: Bayesian neural networks achieve faster and more reliable convergence of closed-loop cost and enable successful optimization of parameterizations with hundreds of dimensions. Infinite-width Bayesian neural networks maintain performance with over 1000 parameters, while Matern-kernel Gaussian processes rapidly lose effectiveness.

Conclusion: Bayesian neural network surrogate models are suitable for learning dense high-dimensional controller parameterizations and offer practical guidance for selecting surrogate models in learning-based controller design.

Abstract: Learning controller parameters from closed-loop data has been shown to improve closed-loop performance. Bayesian optimization, a widely used black-box and sample-efficient learning method, constructs a probabilistic surrogate of the closed-loop performance from few experiments and uses it to select informative controller parameters. However, it typically struggles with dense high-dimensional controller parameterizations, as they may appear, for example, in tuning model predictive controllers, because standard surrogate models fail to capture the structure of such spaces. This work suggests that the use of Bayesian neural networks as surrogate models may help to mitigate this limitation. Through a comparison between Gaussian processes with Matern kernels, finite-width Bayesian neural networks, and infinite-width Bayesian neural networks on a cart-pole task, we find that Bayesian neural network surrogate models achieve faster and more reliable convergence of the closed-loop cost and enable successful optimization of parameterizations with hundreds of dimensions. Infinite-width Bayesian neural networks also maintain performance in settings with more than one thousand parameters, whereas Matern-kernel Gaussian processes rapidly lose effectiveness. These results indicate that Bayesian neural network surrogate models may be suitable for learning dense high-dimensional controller parameterizations and offer practical guidance for selecting surrogate models in learning-based controller design.

[301] SpectralKrum: A Spectral-Geometric Defense Against Byzantine Attacks in Federated Learning

Aditya Tripathi, Karan Sharma, Rahul Mishra, Tapas Kumar Maiti

Main category: cs.LG

TL;DR: SpectralKrum: A federated learning defense combining spectral subspace estimation with Krum selection to filter Byzantine attacks in non-IID data settings.

DetailsMotivation: Existing robust aggregation methods (Krum, Bulyan, etc.) lose effectiveness when client data is heterogeneous (non-IID) and adversaries can observe or approximate the defense mechanism. There's a need for defenses that work under realistic non-IID conditions.

Method: SpectralKrum fuses spectral subspace estimation with geometric neighbor-based selection. It learns a low-dimensional manifold from historical aggregates, projects incoming updates into this subspace, applies Krum selection in compressed coordinates, and filters candidates whose orthogonal residual energy exceeds a data-driven threshold.

Result: Evaluated against 8 baselines across 7 attack scenarios on CIFAR-10 with non-IID partitions. SpectralKrum is competitive against directional and subspace-aware attacks (adaptive-steer, buffer-drift) but offers limited advantage under label-flip and min-max attacks where malicious updates remain spectrally indistinguishable from benign ones.

Conclusion: SpectralKrum provides an effective defense against certain Byzantine attacks in non-IID FL settings by leveraging spectral subspace estimation, but has limitations when malicious updates are spectrally similar to benign ones.

Abstract: Federated Learning (FL) distributes model training across clients who retain their data locally, but this architecture exposes a fundamental vulnerability: Byzantine clients can inject arbitrarily corrupted updates that degrade or subvert the global model. While robust aggregation methods (including Krum, Bulyan, and coordinate-wise defenses) offer theoretical guarantees under idealized assumptions, their effectiveness erodes substantially when client data distributions are heterogeneous (non-IID) and adversaries can observe or approximate the defense mechanism. This paper introduces SpectralKrum, a defense that fuses spectral subspace estimation with geometric neighbor-based selection. The core insight is that benign optimization trajectories, despite per-client heterogeneity, concentrate near a low-dimensional manifold that can be estimated from historical aggregates. SpectralKrum projects incoming updates into this learned subspace, applies Krum selection in compressed coordinates, and filters candidates whose orthogonal residual energy exceeds a data-driven threshold. The method requires no auxiliary data, operates entirely on model updates, and preserves FL privacy properties. We evaluate SpectralKrum against eight robust baselines across seven attack scenarios on CIFAR-10 with Dirichlet-distributed non-IID partitions ($\alpha = 0.1$). Experiments spanning over 56,000 training rounds show that SpectralKrum is competitive against directional and subspace-aware attacks (adaptive-steer, buffer-drift), but offers limited advantage under label-flip and min-max attacks where malicious updates remain spectrally indistinguishable from benign ones.
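
A reconstruction of the pipeline from the description (details such as the rank and the residual threshold are my guesses): PCA on historical aggregates gives the subspace, Krum runs in the compressed coordinates, and high orthogonal-residual candidates are filtered first:

```python
# A minimal sketch of SpectralKrum-style aggregation (my reconstruction).
import numpy as np

def krum_index(U, f):
    # Krum: pick the update closest to its n - f - 2 nearest neighbors
    D = ((U[:, None] - U[None, :]) ** 2).sum(-1)
    n = len(U)
    scores = np.sort(D, axis=1)[:, 1:n - f - 1].sum(1)   # skip self-distance 0
    return int(np.argmin(scores))

def spectral_krum(updates, history, f, r=8):
    H = history - history.mean(0)
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    B = Vt[:r]                                    # top-r spectral basis
    Z = updates @ B.T                             # projection into the subspace
    resid = ((updates - Z @ B) ** 2).sum(1)       # orthogonal residual energy
    med = np.median(resid)
    mad = np.median(np.abs(resid - med))
    keep = np.where(resid <= med + 10 * mad)[0]   # data-driven residual cutoff
    return keep[krum_index(Z[keep], f)]

rng = np.random.default_rng(0)
history = rng.normal(size=(50, 1000))                   # past aggregated updates
updates = rng.normal(size=(20, 1000)) * 0.1 + history.mean(0)
updates[:3] += rng.normal(size=(3, 1000)) * 5           # Byzantine clients
print(spectral_krum(updates, history, f=3))             # index of selected update
```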

[302] The Adaptive Vekua Cascade: A Differentiable Spectral-Analytic Solver for Physics-Informed Representation

Vladimer Khasia

Main category: cs.LG

TL;DR: AVC is a hybrid neural architecture that combines deep learning with classical approximation theory to overcome spectral bias and curse of dimensionality in coordinate-based networks by learning domain warping and using differentiable linear solvers.

DetailsMotivation: Coordinate-based neural networks suffer from two fundamental limitations: spectral bias (difficulty learning high-frequency dynamics) and curse of dimensionality (parameter explosion in discrete feature grids), which hinder their effectiveness for representing complex physical fields.

Method: AVC decouples manifold learning from function approximation using a deep network to learn a diffeomorphic warping of the physical domain, projecting dynamics onto a latent manifold where solutions are represented by generalized analytic functions. It replaces gradient-descent output layers with a differentiable linear solver that optimally resolves spectral coefficients in closed form during forward pass.

Result: AVC achieves state-of-the-art accuracy on five physics benchmarks (Helmholtz wave propagation, medical reconstruction, 3D Navier-Stokes turbulence) while reducing parameters by orders of magnitude (840 vs 4.2 million for 3D grids) and converging 2-3x faster than implicit neural representations.

Conclusion: The work establishes a new paradigm for memory-efficient, spectrally accurate scientific machine learning by bridging deep learning with classical approximation theory through adaptive domain warping and differentiable linear solvers.

Abstract: Coordinate-based neural networks have emerged as a powerful tool for representing continuous physical fields, yet they face two fundamental pathologies: spectral bias, which hinders the learning of high-frequency dynamics, and the curse of dimensionality, which causes parameter explosion in discrete feature grids. We propose the Adaptive Vekua Cascade (AVC), a hybrid architecture that bridges deep learning and classical approximation theory. AVC decouples manifold learning from function approximation by using a deep network to learn a diffeomorphic warping of the physical domain, projecting complex spatiotemporal dynamics onto a latent manifold where the solution is represented by a basis of generalized analytic functions. Crucially, we replace the standard gradient-descent output layer with a differentiable linear solver, allowing the network to optimally resolve spectral coefficients in a closed form during the forward pass. We evaluate AVC on a suite of five rigorous physics benchmarks, including high-frequency Helmholtz wave propagation, sparse medical reconstruction, and unsteady 3D Navier-Stokes turbulence. Our results demonstrate that AVC achieves state-of-the-art accuracy while reducing parameter counts by orders of magnitude (e.g., 840 parameters vs. 4.2 million for 3D grids) and converging 2-3x faster than implicit neural representations. This work establishes a new paradigm for memory-efficient, spectrally accurate scientific machine learning. The code is available at https://github.com/VladimerKhasia/vecua.
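
The differentiable-linear-solver head can be sketched compactly (toy 1-D version, names hypothetical): a small network warps coordinates, a fixed sinusoidal basis is evaluated on the warped points, and basis coefficients are solved in closed form inside the forward pass via regularized normal equations, so only the warp is learned by gradient descent:

```python
# A minimal sketch of a learned-warp + closed-form-coefficients head.
import torch
import torch.nn as nn

class WarpedSpectralHead(nn.Module):
    def __init__(self, n_basis: int = 16):
        super().__init__()
        self.warp = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
        self.freqs = torch.arange(1, n_basis + 1).float() * torch.pi

    def forward(self, x, y):
        u = x + self.warp(x)                       # learned domain warping
        Phi = torch.sin(u * self.freqs)            # fixed basis on warped coords
        A = Phi.T @ Phi + 1e-6 * torch.eye(Phi.shape[1])
        coef = torch.linalg.solve(A, Phi.T @ y)    # closed-form, differentiable
        return Phi @ coef                          # reconstruction at x

x = torch.linspace(0, 1, 256).unsqueeze(1)
y = torch.sin(40 * x ** 2)                         # high-frequency target
model = WarpedSpectralHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(300):
    loss = ((model(x, y) - y) ** 2).mean()         # only the warp trains by GD
    opt.zero_grad(); loss.backward(); opt.step()
```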

[303] Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Etienne Boursier, Claire Boyer

Main category: cs.LG

TL;DR: Softmax attention converges to linear attention in infinite-prompt limit, enabling transfer of linear attention analysis tools to softmax attention for large prompts.

DetailsMotivation: Softmax attention's nonlinear structure makes theoretical analysis challenging. Need a framework to understand its behavior, especially in large-prompt regimes common in practice.

Method: Develop measure-based framework for softmax attention. Show softmax converges to linear operator in infinite-prompt limit. Establish non-asymptotic concentration bounds for outputs/gradients. Prove stability along training trajectory for sub-Gaussian tokens.
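
A small numerical illustration of the concentration phenomenon, in NumPy (the Gaussian setup and the key-to-value map are illustrative toy choices): finite-prompt softmax attention is compared against a large-sample proxy of its infinite-prompt limit, and the gap shrinks roughly at the 1/√n rate the bounds describe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d) / np.sqrt(d)              # fixed query

def softmax_attn(K, V, q):
    """Softmax attention output for a single query."""
    w = np.exp(K @ q)
    return (w[:, None] * V).sum(0) / w.sum()

# Values are tied to keys through a fixed linear map (an arbitrary choice),
# and a very long prompt serves as a proxy for the infinite-prompt limit
# E[exp(q.k) v] / E[exp(q.k)] over the token measure.
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
K_big = rng.normal(size=(500_000, d))
limit = softmax_attn(K_big, K_big @ W_v.T, q)

for n in (100, 1_000, 10_000, 100_000):
    errs = []
    for _ in range(20):                          # average over prompt draws
        K = rng.normal(size=(n, d))
        errs.append(np.linalg.norm(softmax_attn(K, K @ W_v.T, q) - limit))
    print(f"n={n:>7}  mean error={np.mean(errs):.4f}")
# The error decays roughly like 1/sqrt(n), matching the concentration bounds.
```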

Result: Softmax attention approaches linear attention behavior as prompt length increases. Optimization analyses for linear attention transfer to softmax attention with sufficiently long prompts. Provides toolkit for studying training dynamics in large-prompt regimes.

Conclusion: Large-prompt softmax attention inherits analytical structure of linear attention, enabling principled theoretical analysis previously limited to linear attention models.

Abstract: Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.

[304] A General Algorithm for Detecting Higher-Order Interactions via Random Sequential Additions

Ahmad Shamail, Claire McWhite

Main category: cs.LG

TL;DR: A geometric method using random sequential addition of elements reveals interaction patterns through L-shaped distributions, with an L-score quantifying synergy (-1), independence (0), and redundancy (+1).

DetailsMotivation: Many systems have complex component interactions (amplification, redundancy, independence) that are difficult to quantify and visualize systematically.

Method: Add elements in random sequential orders, plot contributions over trials to reveal L-shaped patterns. Formalize with L-score ranging from -1 (synergy) to +1 (redundancy), using pairwise measurements to infer higher-order interactions.
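
A rough sketch of the random-sequential-addition procedure (the incremental metric and data below are illustrative choices; the paper's exact L-score formula is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(feats, y):
    """Incremental metric: R^2 of least squares on the selected features plus
    their pairwise products (an illustrative choice so multiplicative
    interactions are representable; the method itself is metric-agnostic)."""
    if not feats:
        return 0.0
    cols = list(feats)
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            cols.append(feats[i] * feats[j])
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - (y - X @ beta).var() / y.var()

def position_profile(x_a, x_b, y, trials=200):
    """Add the pair in random orders; record each element's marginal gain,
    split by whether it was added first or second."""
    gains = {(el, pos): [] for el in "ab" for pos in (0, 1)}
    for _ in range(trials):
        order = list(rng.permutation(["a", "b"]))
        feats, prev = [], 0.0
        for pos, el in enumerate(order):
            feats.append(x_a if el == "a" else x_b)
            cur = score(feats, y)
            gains[(el, pos)].append(cur - prev)
            prev = cur
    return {k: round(float(np.mean(v)), 3) for k, v in gains.items()}

n = 2000
x1 = rng.normal(size=n)
# Redundant pair: only whichever element is added *first* contributes.
print("redundant:", position_profile(x1, x1 + 0.05 * rng.normal(size=n),
                                     x1 + 0.1 * rng.normal(size=n)))
# Synergistic pair (y = x1 * x2): contributions appear only *jointly*.
x2 = rng.normal(size=n)
print("synergy:  ", position_profile(x1, x2, x1 * x2))
```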

Result: Characteristic L-shaped patterns emerge: redundant pairs show only first-added element contributes, synergistic pairs show only joint contribution, independent elements show order-invariant distributions.

Conclusion: Provides a unified geometric, metric-agnostic approach to quantify interaction structure across domains where performance can be evaluated incrementally over element sequences.

Abstract: Many systems exhibit complex interactions between their components: some features or actions amplify each other’s effects, others provide redundant information, and some contribute independently. We present a simple geometric method for discovering interactions and redundancies: when elements are added in random sequential orders and their contributions plotted over many trials, characteristic L-shaped patterns emerge that directly reflect interaction structure. The approach quantifies how the contribution of each element depends on those added before it, revealing patterns that distinguish interaction, independence, and redundancy on a unified scale. When pairwise contributions are visualized as two–dimensional point clouds, redundant pairs form L–shaped patterns where only the first-added element contributes, while synergistic pairs form L–shaped patterns where only elements contribute together. Independent elements show order–invariant distributions. We formalize this with the L–score, a continuous measure ranging from $-1$ (perfect synergy, e.g. $Y=X_1X_2$) to $0$ (independence) to $+1$ (perfect redundancy, $X_1 \approx X_2$). The relative scaling of the L–shaped arms reveals feature dominance in which element consistently provides more information. Although computed only from pairwise measurements, higher–order interactions among three or more elements emerge naturally through consistent cross–pair relationships (e.g. AB, AC, BC). The method is metric–agnostic and broadly applicable to any domain where performance can be evaluated incrementally over non-repeating element sequences, providing a unified geometric approach to uncovering interaction structure.

[305] Large Continual Instruction Assistant

Jingyang Qiao, Zhizhong Zhang, Xin Tan, Yanyun Qu, Shouhong Ding, Yuan Xie

Main category: cs.LG

TL;DR: CoIN proposes a stable-plasticity balanced continual instruction tuning framework that automatically determines optimal balance weights via gradient analysis and parameter allocation based on semantic similarity.

DetailsMotivation: Existing continual instruction tuning methods suffer from catastrophic forgetting when using gradient updates, while EMA approaches with fixed weights cannot adapt to changing datasets, leading to imbalance between plasticity (learning new tasks) and stability (preserving old knowledge).

Method: Proposes a framework with: 1) Stable-plasticity balanced coefficient automatically determined via Taylor expansion analysis of gradients and learned parameters; 2) Semantic similarity-based parameter allocation that decides whether to retrain or expand parameters; 3) Optimal parameter selection for testing instances based on instruction similarity.
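
A minimal sketch of an EMA weight update with an adaptive balance coefficient, assuming PyTorch (the closed form below is an illustrative proxy based on gradient and drift magnitudes, not the paper's Taylor-expansion-derived coefficient):

```python
import torch

@torch.no_grad()
def adaptive_ema_update(ema_params, live_params, grads, eps=1e-12):
    """EMA of model weights with a per-step balance coefficient. The paper
    derives the optimal weight from a Taylor expansion of the loss; the proxy
    here (large gradients relative to EMA drift -> favor plasticity, small
    -> favor stability) is only an illustrative stand-in for that formula."""
    g2 = sum((g ** 2).sum() for g in grads)
    d2 = sum(((p - e) ** 2).sum() for p, e in zip(live_params, ema_params))
    beta = (d2 / (d2 + g2 + eps)).clamp(0.0, 1.0)       # stability weight
    for e, p in zip(ema_params, live_params):
        e.mul_(beta).add_((1.0 - beta) * p)
    return float(beta)

# After each optimizer step on `model`, with `ema_model` holding the EMA copy:
#   beta = adaptive_ema_update(list(ema_model.parameters()),
#                              list(model.parameters()),
#                              [p.grad for p in model.parameters()])
```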

Result: Extensive experiments across multiple continual instruction tuning benchmarks show the approach enhances anti-forgetting capabilities while significantly improving overall continual tuning performance.

Conclusion: The proposed CoIN framework effectively addresses the plasticity-stability trade-off in continual instruction tuning through adaptive balancing and semantic-aware parameter management, offering a general solution for continual learning with large language models.

Abstract: Continual Instruction Tuning (CIT) is adopted to continually instruct large models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. Exponential Moving Average (EMA), by contrast, can trace previous parameters and thereby reduce forgetting. Nonetheless, its fixed balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we derive an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be determined automatically from the gradients and learned parameters. We therefore propose a stable-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we determine whether to retrain or expand the training parameters, and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.

[306] Large Language Model Agent for Modular Task Execution in Drug Discovery

Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani

Main category: cs.LG

TL;DR: A modular LLM-powered framework automates early-stage drug discovery tasks including data retrieval, molecular generation, property prediction, refinement, and 3D structure generation.

DetailsMotivation: To streamline and automate the computationally intensive early-stage drug discovery pipeline by leveraging LLMs combined with domain-specific tools, addressing challenges in data retrieval, molecular design, and property optimization.

Method: A modular framework combining LLM reasoning with specialized tools for biomedical data retrieval, RAG-based question answering, molecular generation, multi-property prediction (75 properties including ADMET), iterative molecular refinement, and 3D protein-ligand structure generation using Boltz-2.

Result: The agent successfully retrieved biomolecular data and improved contextual accuracy in Q&A. Molecular refinement increased QED > 0.6 molecules from 34 to 55 (out of 100), and Ghose filter compliance rose from 32 to 55. The framework generated 3D complexes and provided binding affinity estimates.

Conclusion: The LLM-powered framework effectively supports molecular screening, prioritization, and structure evaluation in drug discovery. Its modular design enables flexible integration of evolving tools, providing a scalable foundation for AI-assisted therapeutic discovery.

Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, literature-grounded question answering via retrieval-augmented generation, molecular generation, multi-property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. The agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 75 properties, including ADMET-related and general physicochemical descriptors, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.

[307] The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Alexander Xiong, Xuandong Zhao, Aneesh Pappu, Dawn Song

Main category: cs.LG

TL;DR: This paper provides a comprehensive survey of LLM memorization, covering its causes, detection methods, implications, and mitigation strategies.

DetailsMotivation: LLMs demonstrate remarkable capabilities but also exhibit memorization of training data, raising critical questions about model behavior, privacy risks, and the boundary between learning and memorization.

Method: The paper synthesizes recent studies and investigates the memorization landscape through analysis of key drivers (training data duplication, training dynamics, fine-tuning), detection methodologies (prefix-based extraction, membership inference, adversarial prompting), and broader implications.
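
As a concrete example of one detection methodology, here is a minimal prefix-based extraction probe using Hugging Face transformers (the model choice, sample text, and token budgets are arbitrary illustrations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def verbatim_memorized(model, tok, text, prefix_tokens=32, suffix_tokens=32):
    """Prefix-based extraction test: feed the first `prefix_tokens` of a
    (suspected) training example and check whether greedy decoding reproduces
    the true continuation verbatim."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_tokens + suffix_tokens:
        return False
    prefix = ids[:prefix_tokens].unsqueeze(0)
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=suffix_tokens,
                             do_sample=False)           # greedy decoding
    generated = out[0, prefix_tokens:prefix_tokens + suffix_tokens]
    target = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    return bool(torch.equal(generated, target))

# Usage (model and probe text are illustrative):
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
sample = ("We the People of the United States, in Order to form a more "
          "perfect Union, establish Justice, insure domestic Tranquility, "
          "provide for the common defence, promote the general Welfare, ...")
print(verbatim_memorized(model, tok, sample))
```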

Result: The paper provides a comprehensive overview of current research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.

Conclusion: The paper discusses mitigation strategies including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the need to minimize harmful memorization with model utility.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the need to minimize harmful memorization with model utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.

[308] HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance

Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Heng Zhang, Shuyang Zhang, Bo Huang, Yuhang Wu, Tianyang Wang, Hao Xu

Main category: cs.LG

TL;DR: HyperAdaLoRA accelerates AdaLoRA convergence using hypernetwork-generated SVD parameters with dynamic rank allocation via pruning.

DetailsMotivation: Existing LoRA methods use uniform rank allocation and AdaLoRA has slow convergence/high computational overhead despite dynamic rank allocation via SVD.

Method: HyperAdaLoRA uses attention-based hypernetwork to dynamically generate SVD parameters (P, Λ, Q) instead of direct optimization, with pruning of singular value outputs for dynamic rank allocation.
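
A minimal sketch of the generate-then-prune idea, assuming PyTorch (a small MLP hypernetwork stands in for the paper's attention-based one; all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SVDHyperLoRA(nn.Module):
    """Hypernetwork emits the SVD factors (P, lambda, Q) of a low-rank update
    Delta W = P diag(lambda) Q^T; pruning small singular values reallocates
    rank dynamically. (Minimal stand-in for the attention-based hypernetwork.)"""
    def __init__(self, d_out, d_in, max_rank=8, ctx_dim=16):
        super().__init__()
        self.max_rank, self.d_out, self.d_in = max_rank, d_out, d_in
        n_out = max_rank * (d_out + d_in + 1)
        self.hyper = nn.Sequential(nn.Linear(ctx_dim, 128), nn.GELU(),
                                   nn.Linear(128, n_out))
        self.ctx = nn.Parameter(torch.randn(ctx_dim))   # learned layer embedding

    def delta_w(self, keep_threshold=1e-3):
        r, do, di = self.max_rank, self.d_out, self.d_in
        flat = self.hyper(self.ctx)
        P = flat[:r * do].view(do, r)
        Q = flat[r * do:r * (do + di)].view(di, r)
        lam = flat[-r:]
        mask = (lam.abs() > keep_threshold).float()     # prune -> dynamic rank
        return P @ torch.diag(lam * mask) @ Q.T

# Toy usage: apply the generated low-rank update to a frozen base layer.
base = nn.Linear(64, 64)
adapter = SVDHyperLoRA(64, 64)
x = torch.randn(4, 64)
y = x @ (base.weight + adapter.delta_w()).T + base.bias
```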

Result: Faster convergence without performance loss across various datasets/models; broad applicability validated on other LoRA-based approaches.

Conclusion: HyperAdaLoRA effectively addresses AdaLoRA’s convergence issues while maintaining performance, demonstrating general applicability to LoRA-based PEFT methods.

Abstract: Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models (LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank $r$ for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during the training process, it often encounters issues of slow convergence speed and high computational overhead. To address this issue, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of Singular Value Decomposition $(P, \Lambda, Q)$, HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.

[309] LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin

Main category: cs.LG

TL;DR: LaDiR is a novel reasoning framework that uses latent diffusion models to generate diverse reasoning trajectories, enabling holistic refinement of reasoning steps and improving accuracy over traditional autoregressive methods.

DetailsMotivation: LLMs using chain-of-thought generation have limitations: autoregressive decoding prevents holistic refinement of earlier tokens and leads to inefficient exploration of diverse solutions. There's a need for a framework that allows iterative refinement and parallel generation of multiple reasoning paths.

Method: 1. Construct latent reasoning space using VAE to encode text reasoning steps into blocks of thought tokens. 2. Use latent diffusion model with blockwise bidirectional attention mask to denoise latent thought tokens, enabling longer horizon reasoning and iterative refinement. 3. Framework supports parallel generation of diverse reasoning trajectories with adaptive test-time compute.

Result: Empirical evaluations on mathematical reasoning and planning benchmarks show LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods.

Conclusion: LaDiR represents a new paradigm for text reasoning with latent diffusion, demonstrating superior performance by enabling holistic refinement and diverse solution exploration through continuous latent representations and iterative diffusion processes.

Abstract: Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLM’s autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.

[310] MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

Victor Rambaud, Salvador Mascarenhas, Yair Lakretz

Main category: cs.LG

TL;DR: MapFormers are Transformer-based architectures that learn cognitive maps from observational data through self-supervised learning, enabling superior out-of-distribution generalization by disentangling structure from content using input-dependent positional encoding.

DetailsMotivation: Current AI systems lack the strong out-of-distribution generalization capabilities that humans and animals possess through cognitive maps, which encode abstract relationships among entities and provide flexibility to adapt to new situations.

Method: Developed MapFormers based on Transformers that learn cognitive maps by disentangling structural relationships from content through input-dependent positional encoding updates. Created two variants: one for episodic memory (absolute positional encoding) and one for working memory (relative positional encoding).
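
A loose sketch of input-dependent positional encoding in a single attention head, assuming PyTorch (a GRU stands in for the paper's input-dependent positional-update matrices; this illustrates the idea, not the MapFormer architecture itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputDependentPEAttention(nn.Module):
    """Single-head attention where each token's positional embedding is
    *generated from the input* by a small recurrent network, so structural
    (positional/relational) information can be learned separately from
    content. (Illustrative reduction of the idea, not the paper's model.)"""
    def __init__(self, d=32):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d, d) for _ in range(3))
        self.pos_net = nn.GRU(d, d, batch_first=True)   # input-dependent positions

    def forward(self, x):                # x: (batch, seq, d)
        pos, _ = self.pos_net(x)         # positions accumulate along the sequence
        h = x + pos
        att = F.softmax(self.q(h) @ self.k(h).transpose(1, 2)
                        / h.shape[-1] ** 0.5, dim=-1)
        return att @ self.v(h)

layer = InputDependentPEAttention()
out = layer(torch.randn(2, 10, 32))      # works for any sequence length
```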

Result: MapFormers achieved near-perfect performance on tasks including 2D navigation, learning cognitive maps of underlying spaces and generalizing to out-of-distribution scenarios (e.g., longer sequences) where current architectures fail.

Conclusion: Models designed to learn cognitive maps with structural bias for structure-content disentanglement (achievable through input-dependent positional encoding in Transformers) demonstrate superiority. MapFormers have broad applications in neuroscience and AI for explaining neural mechanisms and scaling relational model learning.

Abstract: A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.

[311] BAMBO: Construct Ability and Efficiency LLM Pareto Set via Bayesian Adaptive Multi-objective Block-wise Optimization

Kesheng Chen, Wenjian Luo, Zhenqian Zhu, Yamin Hu, Yiya Xi

Main category: cs.LG

TL;DR: BAMBO is a Bayesian optimization framework that automatically constructs Pareto sets for LLMs by adaptively partitioning models into optimal blocks, balancing granularity and computational tractability.

DetailsMotivation: Existing LLM merging techniques are inadequate for constructing Pareto sets - model-level methods yield sparse suboptimal solutions, while layer-wise approaches suffer from computational intractability due to high dimensionality.

Method: BAMBO uses Hybrid Optimal Block Partitioning (formulated as 1D clustering with dynamic programming) to balance intra-block homogeneity and inter-block information distribution, reducing dimensionality while preserving granularity. It operates within an evolutionary loop driven by qEHVI acquisition function.
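
A minimal sketch of the 1D-clustering-by-dynamic-programming step (the per-layer statistic being clustered is made up for illustration):

```python
import numpy as np

def optimal_blocks(values, k):
    """Optimal partition of a 1D sequence into k contiguous blocks minimizing
    within-block sum of squared deviations, via dynamic programming; this is
    the 1D clustering step used to partition layers into blocks."""
    n = len(values)
    pre = np.concatenate([[0.0], np.cumsum(values)])
    pre2 = np.concatenate([[0.0], np.cumsum(np.square(values))])

    def sse(i, j):                       # cost of the block values[i:j]
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m

    cost = np.full((k + 1, n + 1), np.inf)
    split = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1, i] + sse(i, j)
                if c < cost[b, j]:
                    cost[b, j], split[b, j] = c, i
    bounds, j = [n], n                   # backtrack the block boundaries
    for b in range(k, 0, -1):
        j = split[b, j]
        bounds.append(j)
    return bounds[::-1]                  # e.g. [0, 8, 16, 24, 32]

# Toy usage: per-layer importance scores for a 32-layer model (made up).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(m, 0.1, 8) for m in (0.2, 0.8, 0.5, 1.2)])
print(optimal_blocks(scores, k=4))
```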

Result: BAMBO discovers superior and more comprehensive Pareto frontiers than baselines, enabling agile model selection tailored to diverse operational constraints.

Conclusion: BAMBO provides an automated framework for tractable Pareto set construction in LLMs, effectively navigating the capability-efficiency trade-off by overcoming limitations of existing merging techniques.

Abstract: Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the “curse of dimensionality,” rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/xin8coder/BAMBO.

[312] Personalized Federated Learning with Exact Stochastic Gradient Descent

Sotirios Nikoloutsopoulos, Iordanis Koutsopoulos, Michalis K. Titsias

Main category: cs.LG

TL;DR: PFLEGO: A low-energy SGD-type algorithm for personalized federated learning with separate common and client-specific weights, achieving O(1/√T) convergence with reduced per-client computation.

DetailsMotivation: Mobile energy-limited regimes need efficient federated learning with low per-client computational cost. Existing methods like FedAvg and FedPer have computational overhead that makes them unsuitable for energy-constrained mobile devices.

Method: Propose PFLEGO algorithm with two weight sets: common (shared) and personalized (client-specific). Clients perform multiple full gradient updates only on personalized weights locally, then compute joint gradient for both weights at final step, returning common weight gradients to server for distributed SGD update.
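
A toy sketch of one optimization round, assuming PyTorch and a linear model with a per-client bias (the model and all names are illustrative, not the paper's code):

```python
import torch

def pflego_round(server_common, clients, lr_local=0.05, lr_server=0.1, local_steps=5):
    """One PFLEGO-style round (illustrative): clients take cheap local steps on
    their personalized weights only, then a single joint gradient at the end
    supplies the common-weight gradient for an exact distributed SGD step."""
    common_grads = []
    for client in clients:
        w = server_common.clone().requires_grad_(True)        # common weights
        v = client["personal"].clone().requires_grad_(True)   # client-specific
        X, y = client["data"]
        for _ in range(local_steps):                          # local phase:
            loss = ((X @ w.detach() + v - y) ** 2).mean()     # only v updates
            g_v, = torch.autograd.grad(loss, v)
            v = (v - lr_local * g_v).detach().requires_grad_(True)
        loss = ((X @ w + v - y) ** 2).mean()                  # final joint grad
        g_w, g_v = torch.autograd.grad(loss, (w, v))
        client["personal"] = (v - lr_local * g_v).detach()
        common_grads.append(g_w)
    return server_common - lr_server * torch.stack(common_grads).mean(0)

# Toy usage: shared weight vector, per-client bias.
torch.manual_seed(0)
w_star = torch.randn(5)
clients = []
for _ in range(3):
    X, b_true = torch.randn(64, 5), torch.randn(1)
    clients.append({"personal": torch.zeros(1), "data": (X, X @ w_star + b_true)})
common = torch.zeros(5)
for _ in range(100):
    common = pflego_round(common, clients)
print("distance to w*:", torch.norm(common - w_star).item())
```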

Result: Prove O(1/√T) convergence rate even for non-convex settings (neural networks). Experiments show substantially lower per-round wall-clock time (energy proxy) and superior performance vs FedAvg/FedPer on Omniglot, CIFAR-10, MNIST, Fashion-MNIST, EMNIST datasets.

Conclusion: PFLEGO provides energy-efficient personalized federated learning with theoretical convergence guarantees and practical performance improvements, making it suitable for mobile energy-limited environments.

Abstract: We propose a Stochastic Gradient Descent (SGD)-type algorithm for Personalized Federated Learning which can be particularly attractive for mobile energy-limited regimes due to its low per-client computational cost. The model to be trained includes a set of common weights for all clients, and a set of personalized weights that are specific to each client. At each optimization round, randomly selected clients perform multiple full gradient-descent updates over their client-specific weights towards optimizing the loss function on their own datasets, without updating the common weights. This procedure is energy-efficient since it has low computational cost per client. At the final update of each round, each client computes the joint gradient over both the client-specific and the common weights and returns the gradient of the common weights to the server, which allows an exact SGD step over the full set of weights to be performed in a distributed manner. For the overall optimization scheme, we rigorously prove convergence, even in non-convex settings such as those encountered when training neural networks, with a rate of $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$ with respect to communication rounds $T$. In practice, PFLEGO exhibits substantially lower per-round wall-clock time, used as a proxy for energy. Our theoretical guarantees translate to superior performance in practice against baselines such as FedAvg and FedPer, as evaluated in several multi-class classification datasets, in particular, Omniglot, CIFAR-10, MNIST, Fashion-MNIST, and EMNIST.

[313] Data as Voters: Core Set Selection Using Approval-Based Multi-Winner Voting

Luis Sánchez-Fernández, Jesús A. Fisteus, Rafael López-Zaragoza

Main category: cs.LG

TL;DR: Novel core set selection method using approval-based multi-winner election principles where instances act as both voters and candidates, selecting representatives via voting rules to reduce training sets.

DetailsMotivation: To address the core set/instance selection problem in machine learning by leveraging concepts from multi-winner election systems to identify representative instances that can effectively reduce training set size while maintaining or improving classifier performance.

Method: Instances serve dual roles as voters and candidates. Each training instance (voter) defines approval sets based on local set concepts from existing literature. Representative voting rules then select winners, which become the reduced training set.
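
A simplified sketch of the election view, in NumPy (same-class k-nearest neighbors stand in for the paper's local-set construction, and a greedy coverage rule stands in for the representative voting rules it actually studies):

```python
import numpy as np

def select_core_set(X, y, n_winners, k=10):
    """Core-set selection as an approval election (illustrative sketch).

    Each instance 'approves' its k nearest neighbors of the same class,
    then winners are chosen greedily to cover as many not-yet-represented
    voters as possible (a simple surrogate for proportional-representation
    multi-winner rules)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    approvals = []
    for i in range(n):
        same = np.where(y == y[i])[0]
        nearest = same[np.argsort(d2[i, same])[:k]]
        approvals.append(set(nearest.tolist()))
    winners, covered = [], set()
    for _ in range(n_winners):
        best = max((c for c in range(n) if c not in winners),
                   key=lambda c: sum(1 for v in range(n)
                                     if c in approvals[v] and v not in covered))
        winners.append(best)
        covered |= {v for v in range(n) if best in approvals[v]}
    return winners

# Toy usage: two Gaussian classes, reduce 100 instances to 10 representatives.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(select_core_set(X, y, n_winners=10))
```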

Result: The approach improves performance over state-of-the-art methods in several cases when tested with neural network classifiers, KNN, and SVM, with statistically significant differences.

Conclusion: Multi-winner election principles provide an effective framework for instance selection, offering a novel approach that can outperform existing methods for training set reduction in machine learning.

Abstract: We present a novel approach to the core set/instance selection problem in machine learning. Our approach is based on recent results on (proportional) representation in approval-based multi-winner elections. In our model, instances play a double role as voters and candidates. The approval set of each instance in the training set (acting as a voter) is defined from the concept of local set, which already exists in the literature. We then select the election winners by using a representative voting rule, and such winners are the data instances kept in the reduced training set. We evaluate our approach in two experiments involving neural network classifiers and classic machine learning classifiers (KNN and SVM). Our experiments show that, in several cases, our approach improves the performance of state-of-the-art methods, and the differences are statistically significant.

[314] M2NO: An Efficient Multi-Resolution Operator Framework for Dynamic Multi-Scale PDE Solvers

Zhihao Li, Zhilu Lai, Xiaobo Zhang, Wei Wang

Main category: cs.LG

TL;DR: M2NO is a deep learning framework that combines multigrid structure with multiwavelet spaces to efficiently solve high-dimensional PDEs by handling multi-scale features across resolutions.

DetailsMotivation: Solving high-dimensional PDEs efficiently requires handling multi-scale features across varying resolutions, which is challenging for existing methods.

Method: Integrates multigrid structure with predefined multiwavelet spaces, using multi-resolution analysis to selectively transfer low-frequency errors to coarser grids while preserving high-frequency details at finer levels.

Result: Outperforms existing models on diverse PDE benchmarks including high-resolution, super-resolution tasks, and preconditioning settings; also serves as effective preconditioner for iterative solvers.

Conclusion: M2NO is a robust and versatile solution for complex PDE simulations that efficiently captures both fine-scale variations and large-scale structures without additional complexity.

Abstract: Solving high-dimensional partial differential equations (PDEs) efficiently requires handling multi-scale features across varying resolutions. To address this challenge, we present the Multiwavelet-based Multigrid Neural Operator (M2NO), a deep learning framework that integrates a multigrid structure with predefined multiwavelet spaces. M2NO leverages multi-resolution analysis to selectively transfer low-frequency error components to coarser grids while preserving high-frequency details at finer levels. This design enhances both accuracy and computational efficiency without introducing additional complexity. Moreover, M2NO serves as an effective preconditioner for iterative solvers, further accelerating convergence in large-scale PDE simulations. Through extensive evaluations on diverse PDE benchmarks, including high-resolution, super-resolution tasks, and preconditioning settings, M2NO consistently outperforms existing models. Its ability to efficiently capture fine-scale variations and large-scale structures makes it a robust and versatile solution for complex PDE simulations. Our code and datasets are available on https://github.com/lizhihao2022/M2NO.

[315] TAEGAN: Generating Synthetic Tabular Data For Data Augmentation

Jiayu Li, Zilong Zhao, Kevin Yee, Uzair Javaid, Biplab Sikdar

Main category: cs.LG

TL;DR: TAEGAN is a novel GAN-based framework for synthetic tabular data generation that uses a masked auto-encoder as generator with self-supervised warmup training, outperforming 7 SOTA methods on 5/8 datasets with 27% utility boost and <5% model size.

DetailsMotivation: While diffusion and transformer models have advanced synthetic tabular data generation, GANs remain competitive due to training efficiency and strong generation capabilities. The paper aims to improve GAN-based tabular data generation by addressing stability issues and enhancing data distribution learning.

Method: TAEGAN introduces a masked auto-encoder as the generator with self-supervised warmup training (first in tabular GANs), a novel sampling method for imbalanced/skewed data, and an improved loss function to better capture data distribution and correlations.

Result: TAEGAN outperforms 7 state-of-the-art synthetic tabular data generation algorithms on 5 out of 8 datasets, achieving a 27% overall utility boost over the best-performing baseline while maintaining a model size less than 5% of the best baseline model.

Conclusion: TAEGAN demonstrates superior performance in synthetic tabular data generation by combining GAN efficiency with self-supervised warmup training and improved sampling/loss techniques, offering a lightweight yet powerful solution for data augmentation and privacy-preserving data sharing.

Abstract: Synthetic tabular data generation has gained significant attention for its potential in data augmentation and privacy-preserving data sharing. While recent methods like diffusion and auto-regressive models (i.e., transformer) have advanced the field, generative adversarial networks (GANs) remain highly competitive due to their training efficiency and strong data generation capabilities. In this paper, we introduce Tabular Auto-Encoder Generative Adversarial Network (TAEGAN), a novel GAN-based framework that leverages a masked auto-encoder as the generator. TAEGAN is the first to incorporate self-supervised warmup training of generator into tabular GANs. It enhances GAN stability and exposes the generator to richer information beyond the discriminator’s feedback. Additionally, we propose a novel sampling method tailored for imbalanced or skewed data and an improved loss function to better capture data distribution and correlations. We evaluate TAEGAN against seven state-of-the-art synthetic tabular data generation algorithms. Results from eight datasets show that TAEGAN outperforms all baselines on five datasets, achieving a 27% overall utility boost over the best-performing baseline while maintaining a model size less than 5% of the best-performing baseline model. Code is available at: https://github.com/BetterdataLabs/taegan.

[316] WARPD: World model Assisted Reactive Policy Diffusion

Shashank Hegde, Satyajeet Das, Gautam Salhotra, Gaurav S. Sukhatme

Main category: cs.LG

TL;DR: WARPD is a new method that generates closed-loop neural policy weights directly instead of open-loop trajectories, offering extended action horizons with robustness to perturbations while dramatically reducing inference costs compared to Diffusion Policy.

DetailsMotivation: Diffusion models for robotic policies face challenges: large model sizes and slow inference limit high-frequency control, and Diffusion Policy suffers from a trade-off between performance and action horizon where fewer diffusion queries lead to larger trajectory chunks that accumulate tracking errors.

Method: WARPD (World model Assisted Reactive Policy Diffusion) generates closed-loop policies (neural policy weights) directly instead of open-loop trajectories. It learns behavioral distributions in parameter space rather than trajectory space, enabling extended action horizons with robustness to perturbations.

Result: WARPD outperforms Diffusion Policy in long-horizon and perturbed environments, and achieves multitask performance on par with DP while requiring only ~1/45th of the inference-time FLOPs per step.

Conclusion: Learning policies in parameter space rather than trajectory space offers significant advantages for robotic control: extended action horizons with robustness to perturbations while maintaining high task performance, plus dramatically reduced inference costs compared to trajectory-based diffusion approaches.

Abstract: With the increasing availability of open-source robotic data, imitation learning has become a promising approach for both manipulation and locomotion. Diffusion models are now widely used to train large, generalized policies that predict controls or trajectories, leveraging their ability to model multimodal action distributions. However, this generality comes at the cost of larger model sizes and slower inference, an acute limitation for robotic tasks requiring high control frequencies. Moreover, Diffusion Policy (DP), a popular trajectory-generation approach, suffers from a trade-off between performance and action horizon: fewer diffusion queries lead to larger trajectory chunks, which in turn accumulate tracking errors. To overcome these challenges, we introduce WARPD (World model Assisted Reactive Policy Diffusion), a method that generates closed-loop policies (weights for neural policies) directly, instead of open-loop trajectories. By learning behavioral distributions in parameter space rather than trajectory space, WARPD offers two major advantages: (1) extended action horizons with robustness to perturbations, while maintaining high task performance, and (2) significantly reduced inference costs. Empirically, WARPD outperforms DP in long-horizon and perturbed environments, and achieves multitask performance on par with DP while requiring only ~ 1/45th of the inference-time FLOPs per step.

[317] Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining

Haochen Zhang, Junze Yin, Guanchu Wang, Zirui Liu, Lin F. Yang, Tianyi Zhang, Anshumali Shrivastava, Vladimir Braverman

Main category: cs.LG

TL;DR: Proposes importance sampling for low-rank optimization in LLM pretraining to overcome limitations of dominant subspace methods, with provable convergence guarantees and better empirical performance.

DetailsMotivation: Existing low-rank optimization methods for memory-efficient LLM training use dominant subspace projection, but this approach has limitations: the dominant subspace stops changing during pretraining, constraining weight updates to similar subspaces and lacking convergence guarantees.

Method: Importance sampling for low-rank optimization in LLM pretraining, which selects subspaces differently than dominant subspace approaches, with provable convergence guarantees that previous methods lack.
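
A minimal sketch of the contrast with dominant-subspace projection, assuming PyTorch (the sampling distribution here, proportional to squared singular values, is an illustrative choice; the paper's estimator and any reweighting may differ):

```python
import torch

def importance_sampled_projection(grad, rank):
    """Pick a rank-r projection by *sampling* singular directions with
    probability proportional to their squared singular values, instead of
    always taking the top-r (dominant) subspace, so the update subspace keeps
    changing across steps."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    probs = (S ** 2) / (S ** 2).sum()
    idx = torch.multinomial(probs, rank, replacement=False)
    P = U[:, idx]                       # sampled left singular vectors
    return P, P.T @ grad                # projected gradient for optimizer states

g = torch.randn(256, 128)
P, g_low = importance_sampled_projection(g, rank=16)
print(g_low.shape)                      # torch.Size([16, 128])
```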

Result: Empirically demonstrates that the proposed importance sampling method significantly outperforms previous low-rank optimization methods in LLM pretraining tasks.

Conclusion: Importance sampling provides a more effective approach to low-rank optimization for memory-efficient LLM training, overcoming the limitations of dominant subspace methods while offering theoretical convergence guarantees.

Abstract: Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. A key challenge in these methods is selecting suitable subspaces to ensure an effective optimization trajectory. Most existing approaches select the dominant subspace to preserve gradient information, as this intuitively provides the best approximation. However, we find that in practice, the dominant subspace stops changing during pretraining, thereby constraining weight updates to similar subspaces. In this paper, we propose importance sampling for low-rank optimization in LLM pretraining with a provable convergence guarantee, which the dominant subspace approach does not have. Empirically, we demonstrate that our method significantly outperforms previous methods in LLM pretraining tasks.

[318] FuncGenFoil: Airfoil Generation and Editing Model in Function Space

Jinouwen Zhang, Junjie Ren, Qianhong Ma, Jianyu Wu, Aobo Yang, Yan Lu, Lu Chen, Hairun Xie, Jing Wang, Miao Zhang, Wanli Ouyang, Shixiang Tang

Main category: cs.LG

TL;DR: FuncGenFoil: A function-space generative model for high-fidelity airfoil geometry generation with arbitrary-resolution sampling and smoothness.

DetailsMotivation: Aircraft manufacturing needs high-fidelity airfoil geometries with controllable and editable representations. Existing deep learning methods face a trade-off between expressive power and resolution adaptability when using predefined parametric representations or discrete point sets.

Method: Introduces FuncGenFoil, a novel function-space generative model that directly reconstructs airfoil geometries as function curves, combining advantages of parametric functions (arbitrary-resolution sampling, smoothness) with expressiveness of discrete point-based representations.

Result: Achieves 74.4% reduction in label error and 23.2% increase in diversity on AF-200K dataset compared to state-of-the-art methods.

Conclusion: Function-space modeling offers powerful and flexible framework for high-fidelity airfoil design and aerodynamic shape optimization.

Abstract: Aircraft manufacturing is the jewel in the crown of industry, in which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. Existing deep learning methods, which typically rely on predefined parametric representations (e.g., Bézier) or discrete point sets, face an inherent trade-off between expressive power and resolution adaptability. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly reconstructs airfoil geometries as function curves. Our method inherits the advantages of arbitrary-resolution sampling and smoothness from parametric functions, as well as the strong expressiveness of discrete point-based representations. Empirical evaluations demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation, achieving a relative 74.4% reduction in label error and a 23.2% increase in diversity on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design.

[319] Meta-Statistical Learning: Supervised Learning of Statistical Estimators

Maxime Peyrard, Kyunghyun Cho

Main category: cs.LG

TL;DR: Meta-statistical learning: A framework that uses supervised learning with permutation-invariant neural networks to automatically discover statistical estimators by treating estimator design as an optimization problem.

DetailsMotivation: Traditional statistical estimator design is analytically challenging and sometimes impossible (e.g., no universally unbiased estimator for standard deviation). There's a need for an automated, empirical approach to discover estimators with desirable frequentist properties.

Method: Amortized learning framework where entire datasets are input to permutation-invariant neural networks (like Set Transformers) trained via supervised learning to predict target statistical properties. The trained model becomes the estimator that can be analyzed through classical frequentist lens.
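
A small end-to-end sketch in PyTorch, using a DeepSets-style network instead of a Set Transformer (all sizes and the target property, standard deviation, are illustrative):

```python
import torch
import torch.nn as nn

class SetEstimator(nn.Module):
    """DeepSets-style permutation-invariant network that maps an entire
    dataset (a set of scalars) to a parameter estimate; a lighter stand-in
    for the Set Transformers used in the paper."""
    def __init__(self, h=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, h), nn.ReLU(), nn.Linear(h, h))
        self.rho = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, 1))

    def forward(self, x):                    # x: (batch, n, 1) -> (batch,)
        return self.rho(self.phi(x).mean(dim=1)).squeeze(-1)

model = SetEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):                     # supervised meta-training:
    sigma = torch.rand(128) * 2 + 0.1        # ground-truth parameter per dataset
    x = torch.randn(128, 50, 1) * sigma[:, None, None]
    loss = ((model(x) - sigma) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The trained network *is* the estimator; probe its bias empirically.
with torch.no_grad():
    x = torch.randn(4096, 50, 1) * 1.5
    print("mean estimate for sigma=1.5:", model(x).mean().item())
```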

Result: Demonstrated strong results on two tasks: learning a normality test (classification) and estimating mutual information (regression), achieving good performance even with small models.

Conclusion: This paradigm opens a path to automate the discovery of generalizable and flexible statistical estimators, potentially revolutionizing how statistical inference tools are developed.

Abstract: Statistical inference, a central tool of science, revolves around the study and the usage of statistical estimators: functions that map finite samples to predictions about unknown distribution parameters. In the frequentist framework, estimators are evaluated based on properties such as bias, variance (for parameter estimation), accuracy, power, and calibration (for hypothesis testing). However, crafting estimators with desirable properties is often analytically challenging, and sometimes impossible, e.g., there exists no universally unbiased estimator for the standard deviation. In this work, we introduce meta-statistical learning, an amortized learning framework that recasts estimator design as an optimization problem via supervised learning. This takes a fully empirical approach to discovering statistical estimators; entire datasets are input to permutation-invariant neural networks, such as Set Transformers, trained to predict the target statistical property. The trained model is the estimator, and can be analyzed through the classical frequentist lens. We demonstrate the approach on two tasks: learning a normality test (classification) and estimating mutual information (regression), achieving strong results even with small models. Looking ahead, this paradigm opens a path to automate the discovery of generalizable and flexible statistical estimators.

[320] TRKM: Twin Restricted Kernel Machines for Classification and Regression

A. Quadir, M. Tanveer

Main category: cs.LG

TL;DR: TRKM (Twin Restricted Kernel Machine) enhances RKMs by combining twin models with RKM framework, using Fenchel-Young inequality for conjugate feature duality to improve classification/regression on complex data.

DetailsMotivation: RKMs face generalization challenges with unevenly distributed/complexly clustered data and computational burdens with large datasets. Need improved method that maintains RKM benefits while addressing these limitations.

Method: Proposes TRKM combining twin models with RKM framework. Uses Fenchel-Young inequality to introduce conjugate feature duality, formulating problems in dual variables. Employs energy function similar to RBM with visible/hidden variables, kernel trick for high-dimensional mapping, and regularized least squares for optimal hyperplane.

Result: Experiments on UCI and KEEL datasets show TRKM’s superiority over baselines in robustness and efficiency. Successful implementation on brain age dataset demonstrates efficacy in predicting brain age.

Conclusion: TRKM effectively addresses RKM limitations by integrating twin models with conjugate feature duality, offering improved generalization and computational efficiency for complex data classification/regression tasks.

Abstract: Restricted kernel machines (RKMs) have considerably improved generalization in machine learning. Recent advancements explored various techniques within the RKM framework, integrating kernel functions with least squares support vector machines (LSSVM) to mirror the energy function of restricted Boltzmann machines (RBM), leading to enhanced performance. However, RKMs may face challenges in generalization when dealing with unevenly distributed or complexly clustered data. Additionally, as the dataset size increases, the computational burden of managing high-dimensional feature spaces can become substantial, potentially hindering performance in large-scale datasets. To address these challenges, we propose the twin restricted kernel machine (TRKM). TRKM combines the benefits of twin models with the robustness of the RKM framework to enhance classification and regression tasks. By leveraging the Fenchel-Young inequality, we introduce a novel conjugate feature duality, allowing the formulation of classification and regression problems in terms of dual variables. This duality provides an upper bound to the objective function of the TRKM problem, resulting in a new methodology under the RKM framework. The model uses an energy function similar to that of RBM, incorporating both visible and hidden variables corresponding to both classes. Additionally, the kernel trick is employed to map data into a high-dimensional feature space, where the model identifies an optimal separating hyperplane using a regularized least squares approach. Experiments on UCI and KEEL datasets confirm TRKM’s superiority over baselines, showcasing its robustness and efficiency in handling complex data. Furthermore, we implement the TRKM model on the brain age dataset, demonstrating its efficacy in predicting brain age.

[321] Geometry-Informed Neural Operator Transformer

Qibang Liu, Weiheng Zhong, Hadi Meidani, Diab Abueidda, Seid Koric, Philippe Geubelle

Main category: cs.LG

TL;DR: GINOT integrates transformers with neural operators for efficient PDE predictions on arbitrary geometries using surface point clouds.

DetailsMotivation: Traditional numerical methods for PDEs are computationally expensive, especially for repeated evaluations on varying geometries. Machine learning surrogates offer efficiency but need to handle arbitrary geometries with unordered, non-uniform point clouds.

Method: Geometry-Informed Neural Operator Transformer (GINOT) combines transformer architecture with neural operator framework. Uses sampling/grouping strategy and attention mechanism to encode unordered surface point clouds with varying densities and point counts. Geometry information integrates with query points via attention in solution decoder.

Result: Validated on multiple challenging datasets, GINOT demonstrates high accuracy and strong generalization capabilities for complex arbitrary 2D and 3D geometries.

Conclusion: GINOT provides an effective transformer-based neural operator approach for efficient PDE predictions on arbitrary geometries, overcoming challenges of unordered, non-uniform point clouds.

Abstract: Machine-learning-based surrogate models offer significant computational efficiency and faster simulations compared to traditional numerical methods, especially for problems requiring repeated evaluations of partial differential equations. This work introduces the Geometry-Informed Neural Operator Transformer (GINOT), which integrates the transformer architecture with the neural operator framework to enable forward predictions on arbitrary geometries. GINOT employs a sampling and grouping strategy together with an attention mechanism to encode surface point clouds that are unordered, exhibit non-uniform point densities, and contain varying numbers of points for different geometries. The geometry information is seamlessly integrated with query points in the solution decoder through the attention mechanism. The performance of GINOT is validated on multiple challenging datasets, showcasing its high accuracy and strong generalization capabilities for complex and arbitrary 2D and 3D geometries.

[322] FT-MoE: Sustainable-learning Mixture of Experts for Fault-Tolerant Computing

Wenjing Xiao, Wenhao Song, Miaojiang Chen, Min Chen

Main category: cs.LG

TL;DR: FT-MoE: A sustainable-learning fault-tolerant computing framework using mixture-of-experts architecture for high-accuracy fault detection and classification in edge networks.

DetailsMotivation: Existing deep learning-based fault-tolerant algorithms struggle with heterogeneous fault knowledge, dynamic workloads, and limited data support, leading to poor fault detection quality and training efficiency due to homogenized fault knowledge perception.

Method: Proposes FT-MoE framework with dual-path architecture using mixture-of-experts (MoE) to enable different parameters to learn distinct fault knowledge. Uses two-stage learning combining comprehensive offline training with continual online tuning for adaptive parameter optimization.

Result: Constructed new fault detection dataset for edge networks with 10,000 intervals and fine-grained resource features. Experimental results show FT-MoE outperforms state-of-the-art methods on fault benchmark.

Conclusion: FT-MoE effectively addresses challenges in fault-tolerant computing by leveraging MoE architecture and sustainable learning approach, achieving superior fault detection and classification performance.

Abstract: Intelligent fault-tolerant (FT) computing has recently demonstrated significant advantages in predicting and diagnosing faults proactively, thereby ensuring reliable service delivery. However, due to the heterogeneity of fault knowledge, dynamic workloads, and limited data support, existing deep learning-based FT algorithms face challenges in fault detection quality and training efficiency. This is primarily because their homogenized perception of fault knowledge makes it difficult to fully capture diverse and complex fault patterns. To address these challenges, we propose FT-MoE, a sustainable-learning fault-tolerant computing framework based on a dual-path architecture for high-accuracy fault detection and classification. This model employs a mixture-of-experts (MoE) architecture, enabling different parameters to learn distinct fault knowledge. Additionally, we adopt a two-stage learning scheme that combines comprehensive offline training with continual online tuning, allowing the model to adaptively optimize its parameters in response to evolving real-time workloads. To facilitate realistic evaluation, we construct a new fault detection and classification dataset for edge networks, comprising 10,000 intervals with fine-grained resource features, surpassing existing datasets in both scale and granularity. Finally, we conduct extensive experiments on the FT benchmark to verify the effectiveness of FT-MoE. Results demonstrate that our model outperforms state-of-the-art methods.

[323] SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong

Main category: cs.LG

TL;DR: Saturn is a SAT-based RL framework that uses Boolean Satisfiability problems to train LLMs’ reasoning, addressing scalability, verifiability, and difficulty control limitations of existing RL tasks.

DetailsMotivation: Existing RL tasks for LLMs have three key limitations: (1) Scalability - they rely on expensive human/LLM annotation for training data, (2) Verifiability - LLM outputs are hard to verify automatically, and (3) Controllable Difficulty - most tasks lack fine-grained difficulty control for progressive training.

Method: Saturn uses Boolean Satisfiability (SAT) problems as RL tasks, enabling scalable construction, rule-based verification, and precise difficulty control. It features a curriculum learning pipeline that constructs SAT tasks of increasing difficulty and trains LLMs from easy to hard, with a principled mechanism for stable difficulty transitions.
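
A minimal sketch of the two ingredients that make SAT attractive here, scalable instance generation with controllable difficulty and rule-based verification (the clause-ratio curriculum below is a standard construction, not necessarily the paper's exact generator):

```python
import random

def random_ksat(n_vars, n_clauses, k=3, seed=0):
    """Random k-SAT instance as lists of signed literals. Difficulty is
    controlled by n_vars and the clause-to-variable ratio (for random 3-SAT,
    instances are hardest near a ratio of about 4.26)."""
    rng = random.Random(seed)
    return [[v if rng.random() < 0.5 else -v
             for v in rng.sample(range(1, n_vars + 1), k)]
            for _ in range(n_clauses)]

def verify(clauses, assignment):
    """Rule-based verification of a proposed assignment {var: bool}; no human
    or LLM judge is needed, which is the point of using SAT as the RL task."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

# Curriculum-style scaling from easy to hard by growing the instance size.
for n in (5, 10, 20):
    inst = random_ksat(n_vars=n, n_clauses=int(4.26 * n))
    guess = {v: bool(random.getrandbits(1)) for v in range(1, n + 1)}
    print(f"n_vars={n:>2}  clauses={len(inst):>3}  random guess ok: "
          f"{verify(inst, guess)}")
```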

Result: Created Saturn-2.6k dataset of 2,660 SAT problems with varying difficulty. Applied to DeepSeek-R1-Distill-Qwen to create Saturn-1.5B and Saturn-7B models, achieving: +14.0/+28.1 average pass@3 improvements on SAT problems, +4.9/+1.8 score improvements on math/programming benchmarks, and +8.8% improvement over SOTA RL task construction approaches.

Conclusion: Saturn provides an effective framework for training LLM reasoning capabilities through SAT-based RL tasks, addressing key limitations of existing approaches and demonstrating strong performance improvements across multiple reasoning domains.

Abstract: How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs’ outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs’ reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.

[324] iPINNER: An Iterative Physics-Informed Neural Network with Ensemble Kalman Filter

Binghang Lu, Changhong Mou, Guang Lin

Main category: cs.LG

TL;DR: Proposed iPINNER framework combines PINNs with ensemble Kalman filter and NSGA-III to improve robustness against noisy data and missing physics in PDE problems.

Motivation: Standard PINNs struggle with noisy observational data and missing physics in real-world inverse problems, limiting their practical application.

Method: Iterative multi-objective PINN ensemble Kalman filter (iPINNER) using NSGA-III for generating diverse PINN ensembles along Pareto front, then EnKF for data assimilation, with iterative refinement of data loss.
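
A sketch of the assimilation step at the core of the loop: a generic stochastic EnKF analysis applied to an ensemble of PINN solution states, assuming a linear observation operator `H` (the paper's coupling to NSGA-III-generated members is not reproduced here).

```python
import numpy as np

def enkf_analysis(ensemble, y_obs, H, obs_std, rng):
    """One stochastic EnKF analysis step.
    ensemble: (N, d) PINN member states; H: (m, d); y_obs: (m,) noisy data."""
    N = ensemble.shape[0]
    Hx = ensemble @ H.T                          # predicted observations (N, m)
    X = ensemble - ensemble.mean(axis=0)
    Y = Hx - Hx.mean(axis=0)
    P_xy = X.T @ Y / (N - 1)                     # state-obs cross-covariance
    P_yy = Y.T @ Y / (N - 1) + obs_std**2 * np.eye(len(y_obs))
    K = P_xy @ np.linalg.inv(P_yy)               # Kalman gain (d, m)
    y_pert = y_obs + rng.normal(0.0, obs_std, size=(N, len(y_obs)))
    return ensemble + (y_pert - Hx) @ K.T        # analysis ensemble (N, d)
```

The analysis ensemble can then supply the refined targets in the PINN data loss for the next retraining round.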

Result: Tested on 1D viscous Burgers equation and time-fractional mixed diffusion-wave equation, showing superior performance over standard PINNs in handling noisy data and missing physics.

Conclusion: iPINNER framework enhances PINN robustness and accuracy for both forward and inverse PDE problems with noisy data and incomplete physics.

Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving forward and inverse problems involving partial differential equations (PDEs) by incorporating physical laws into the training process. However, the performance of PINNs is often hindered in real-world scenarios involving noisy observational data and missing physics, particularly in inverse problems. In this work, we propose an iterative multi-objective PINN ensemble Kalman filter (iPINNER) framework that improves the robustness and accuracy of PINNs in both forward and inverse problems by using the ensemble Kalman filter and the non-dominated sorting genetic algorithm III (NSGA-III). Specifically, NSGA-III is used as a multi-objective optimizer that can generate various ensemble members of PINNs along the optimal Pareto front, while accounting for the model uncertainty in the solution space. These ensemble members are then utilized within the EnKF to assimilate noisy observational data. The EnKF’s analysis is subsequently used to refine the data loss component for retraining the PINNs, thereby iteratively updating their parameters. The iterative procedure generates improved solutions to the PDEs. The proposed method is tested on two benchmark problems: the one-dimensional viscous Burgers equation and the time-fractional mixed diffusion-wave equation (TFMDWE). The numerical results show it outperforms standard PINNs in handling noisy data and missing physics.

[325] REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

Annabelle Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh

Main category: cs.LG

TL;DR: LLM-guided reasoning compiler using Monte Carlo tree search achieves substantial speedups with fewer samples than existing neural compilers.

Motivation: High cost of serving large-scale models is a barrier to accessibility and innovation. Existing compilers struggle with neural workloads due to exponentially large optimization spaces, and stochastic search techniques are sample-inefficient and lack context awareness.

Method: Introduces Reasoning Compiler framework that formulates optimization as sequential, context-aware decision process guided by LLM and structured Monte Carlo tree search (MCTS). LLM acts as proposal mechanism suggesting hardware-informed transformations based on current program state and performance feedback, while MCTS balances exploration/exploitation.
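
A compact sketch of how LLM proposals can slot into the MCTS loop; `llm_propose` and `measure_speedup` are hypothetical stand-ins for the paper's proposal mechanism and its compile-and-benchmark feedback.

```python
import math

def uct_select(node, c=1.4):
    """Pick the child maximizing UCT (mean value plus exploration bonus)."""
    return max(
        node["children"],
        key=lambda ch: ch["value"] / (ch["visits"] + 1e-9)
        + c * math.sqrt(math.log(node["visits"] + 1) / (ch["visits"] + 1e-9)),
    )

def mcts_iteration(root, llm_propose, measure_speedup):
    """Select a leaf, expand it with LLM-suggested transformations,
    evaluate by compiling and benchmarking, then backpropagate."""
    path, node = [root], root
    while node["children"]:
        node = uct_select(node)
        path.append(node)
    for program in llm_propose(node["program"]):   # hardware-informed proposals
        node["children"].append(
            {"program": program, "children": [], "visits": 0, "value": 0.0}
        )
    reward = measure_speedup(node["program"])
    for n in path:                                  # backpropagation
        n["visits"] += 1
        n["value"] += reward
```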

Result: Achieves substantial speedups with markedly fewer samples than leading neural compilers, demonstrating improved sample efficiency.

Conclusion: LLM-guided reasoning has potential to transform compiler optimization landscape by leveraging context-aware decision spaces and improving sample efficiency without retraining.

Abstract: While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We set out to investigate the research question of whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimizations to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed Reasoning Compiler) that formulates optimization as a sequential, context-aware decision process guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-informed transformations that reflect the current program state and accumulated performance feedback. MCTS incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.

[326] GoalLadder: Incremental Goal Discovery with Vision-Language Models

Alexey Zakharov, Shimon Whiteson

Main category: cs.LG

TL;DR: GoalLadder uses vision-language models to train RL agents from single language instructions by incrementally discovering progress states and ranking them with ELO-based ratings, achieving ~95% success vs ~45% for competitors.

Motivation: Natural language offers concise, interpretable RL task specification, but extracting rewards from language in visual environments remains challenging. Existing approaches using large language models either need non-visual representations, require excessive feedback, or produce noisy rewards.

Method: GoalLadder leverages VLMs to discover states showing task progress incrementally. It queries VLMs to identify improvement states and ranks them via pairwise comparisons using an ELO-based rating system. Agents minimize distance to top-ranked goals in a learned embedding space trained on unlabeled visual data.
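
The ELO mechanics are standard and easy to pin down; a sketch, assuming the VLM's pairwise verdict arrives as a boolean (all other plumbing omitted):

```python
def elo_update(r_a, r_b, a_wins, k=32.0):
    """Update two goal-state ratings after one VLM pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# The VLM judges state A closer to task completion than state B:
r_a, r_b = elo_update(1000.0, 1000.0, a_wins=True)   # -> (1016.0, 984.0)
```

Because a single verdict nudges ratings by at most `k`, repeated comparisons wash out inconsistent VLM feedback.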

Result: GoalLadder outperforms existing methods on classic control and robotic manipulation environments, achieving ~95% average final success rate compared to only ~45% for the best competitor.

Conclusion: GoalLadder enables effective RL training from single language instructions by using VLMs with ELO-based ranking to handle noisy feedback, bypassing the need for abundant accurate feedback typically required for well-shaped reward functions.

Abstract: Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that can learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, GoalLadder, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent’s task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM’s feedback completely; instead, it uses it to rank potential goal states using an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments with an average final success rate of ~95% compared to only ~45% for the best competitor.

[327] Koopman operator-based discussion on partial observation in stochastic systems

Jun Ohkubo

Main category: cs.LG

TL;DR: The paper discusses partial observation effects in stochastic systems using Koopman operator theory, showing delay-embedding benefits and power-law error behavior with additive noise.

Motivation: Partial observations are often necessary when complete observation of all system observables is difficult. While Mori-Zwanzig formalism handles partial observations for deterministic systems, and Koopman operator theory has made progress, there's a need to understand partial observation effects in stochastic systems using Koopman operator theory.

Method: The authors use Koopman operator theory to analyze partial observation effects in stochastic systems. They emphasize distinguishing state space and function space, apply delay-embedding techniques for partial observations, and conduct numerical experiments to study error behavior with additive noise.
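
A minimal sketch of delay embedding plus a least-squares (EDMD-style) Koopman approximation on the delay coordinates; this is a generic construction, not the paper's specific experiments.

```python
import numpy as np

def delay_embed(x, lags):
    """Stack a scalar series x (T,) into delay vectors (T - lags, lags + 1)."""
    T = len(x)
    return np.column_stack([x[i : T - lags + i] for i in range(lags + 1)])

def koopman_edmd(x, lags):
    """Least-squares operator K with Z[1:] ≈ Z[:-1] @ K on delay coordinates,
    a common data-driven Koopman approximation under partial observation."""
    Z = delay_embed(x, lags)
    K, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
    return K
```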

Result: Delay-embedding techniques are beneficial for partial observations even in stochastic systems. Numerical experiments reveal power-law behavior of error with respect to additive noise amplitude. The paper also discusses the relationship between the power-law exponent and partial observation effects.

Conclusion: The Koopman operator theory provides a useful framework for understanding partial observation effects in stochastic systems. Distinguishing between state space and function space is crucial, and delay-embedding remains effective. The power-law error behavior with noise amplitude reveals important relationships between observation completeness and system performance.

Abstract: It is sometimes difficult to achieve a complete observation for a full set of observables, and partial observations are necessary. For deterministic systems, the Mori-Zwanzig formalism provides a theoretical framework for handling partial observations. Recently, data-driven algorithms based on the Koopman operator theory have made significant progress, and there is ongoing discussion on connecting the Mori-Zwanzig formalism with the Koopman operator theory. In this work, we discuss the effects of partial observation in stochastic systems using the Koopman operator theory. The discussion clarifies the importance of distinguishing the state space and the function space in stochastic systems. Even in stochastic systems, the delay-embedding technique is beneficial for partial observation, and several numerical experiments show a power-law behavior of error with respect to the amplitude of the additive noise. We also discuss the relation between the exponent of the power-law behavior and the effects of partial observation.

[328] REDELEX: A Framework for Relational Deep Learning Exploration

Jakub Peleška, Gustav Šír

Main category: cs.LG

TL;DR: REDELEX is a comprehensive framework for evaluating Relational Deep Learning models on diverse relational databases, analyzing performance factors like model complexity and database characteristics.

Motivation: Relational databases are the gold standard for structured data, and Relational Deep Learning (RDL) shows promise for predictive tasks, but there's a lack of analysis on how RDL model performance relates to underlying database characteristics.

Method: Developed REDELEX framework to evaluate RDL models of varying complexity on over 70 diverse relational databases, benchmarking against classic methods and analyzing performance factors.

Result: Confirmed generally superior performance of RDL compared to classic methods, with insights into main performance factors including model complexity, database sizes, and structural properties.

Conclusion: REDELEX provides a comprehensive evaluation framework for RDL models, offering valuable insights into performance relationships with database characteristics and making diverse RDB collection available to the community.

Abstract: Relational databases (RDBs) are widely regarded as the gold standard for storing structured information. Consequently, predictive tasks leveraging this data format hold significant application promise. Recently, Relational Deep Learning (RDL) has emerged as a novel paradigm wherein RDBs are conceptualized as graph structures, enabling the application of various graph neural architectures to effectively address these tasks. However, given its novelty, there is a lack of analysis into the relationships between the performance of various RDL models and the characteristics of the underlying RDBs. In this study, we present REDELEX, a comprehensive exploration framework for evaluating RDL models of varying complexity on the most diverse collection of over 70 RDBs, which we make available to the community. Benchmarked alongside key representatives of classic methods, we confirm the generally superior performance of RDL while providing insights into the main factors shaping performance, including model complexity, database sizes and their structural properties.

[329] Class-wise Balancing Data Replay for Federated Class-Incremental Learning

Zhuang Qi, Ying-Peng Tang, Lei Meng, Han Yu, Xiaoxiao Li, Xiangxu Meng

Main category: cs.LG

TL;DR: FedCBDR is a federated class incremental learning method that addresses class imbalance in data replay through global coordination and task-aware temperature scaling.

Motivation: Current federated class incremental learning methods using data replay suffer from class imbalance issues: (1) within replay buffers due to limited global awareness, and (2) between replayed and newly arrived classes, which limits their performance.

Method: FedCBDR has two key components: 1) Global-perspective data replay module that reconstructs global representations of prior tasks in a privacy-preserving way, then uses class-aware and importance-sensitive sampling for balanced replay; 2) Task-aware temperature scaling module that adaptively adjusts logit temperatures at both class and instance levels based on task dynamics to reduce overconfidence in majority classes and enhance sensitivity to minority classes.
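
A sketch of the second component's core operation, per-class temperature scaling of logits; the temperatures here are fixed placeholders, whereas the paper derives them adaptively from task dynamics at both class and instance levels.

```python
import torch

def class_temperature_scale(logits, class_temps, instance_temp=1.0):
    """Divide each class logit by its own temperature; larger temperatures
    soften the model's confidence for that (majority) class."""
    return logits / (class_temps.unsqueeze(0) * instance_temp)

logits = torch.randn(4, 5)                             # (batch, C)
class_temps = torch.tensor([2.0, 2.0, 1.5, 1.0, 0.8])  # majority classes softened
probs = torch.softmax(class_temperature_scale(logits, class_temps), dim=-1)
```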

Result: FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.

Conclusion: FedCBDR effectively addresses class imbalance in federated class incremental learning through coordinated global replay and adaptive temperature scaling, significantly outperforming existing methods in accuracy.

Abstract: Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class-wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior tasks in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) Subsequently, to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model’s overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verify that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.

[330] Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces

Hyo-Jeong Jang, Hye-Bin Shin, Seong-Whan Lee

Main category: cs.LG

TL;DR: Proposes a cross-modal knowledge distillation framework for EEG learning that addresses modality gap and label misalignment issues by aligning feature semantics and resolving label inconsistencies, improving EEG-based emotion analysis performance.

Motivation: EEG signals are prone to intrinsic errors and human labeling errors, causing label noise that degrades model performance. While multimodal knowledge distillation can transfer knowledge from visual models to EEG models, it faces modality gap (heterogeneous feature spaces) and soft label misalignment (inconsistencies between ground truth and distillation targets) challenges.

Method: A novel cross-modal knowledge distillation framework with two key components: 1) A prototype-based similarity module to align feature semantics across modalities, addressing the modality gap; 2) A task-specific distillation head to resolve label-induced inconsistency in supervision, handling soft label misalignment.
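
One plausible reading of the prototype-based similarity module, sketched as a contrastive-style alignment loss between EEG features and class prototypes from the visual teacher; the function names and temperature `tau` are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(eeg_feats, labels, visual_protos, tau=0.1):
    """Pull each EEG embedding toward its class prototype from the visual
    modality via cross-entropy over cosine similarities to all prototypes."""
    eeg = F.normalize(eeg_feats, dim=-1)          # (B, d)
    protos = F.normalize(visual_protos, dim=-1)   # (C, d)
    logits = eeg @ protos.T / tau                 # (B, C) similarities
    return F.cross_entropy(logits, labels)
```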

Result: Experimental results show the approach improves EEG-based emotion regression and classification performance, outperforming both unimodal and multimodal baselines on a public multimodal dataset.

Conclusion: The proposed framework effectively addresses semantic uncertainty from ambiguous features and weakly defined labels, demonstrating potential for BCI applications by enhancing EEG learning through better cross-modal knowledge transfer.

Abstract: Electroencephalography (EEG) is a fundamental modality for cognitive state monitoring in brain-computer interfaces (BCIs). However, it is highly susceptible to intrinsic signal errors and human-induced labeling errors, which lead to label noise and ultimately degrade model performance. To enhance EEG learning, multimodal knowledge distillation (KD) has been explored to transfer knowledge from visual models with rich representations to EEG-based models. Nevertheless, KD faces two key challenges: modality gap and soft label misalignment. The former arises from the heterogeneous nature of EEG and visual feature spaces, while the latter stems from label inconsistencies that create discrepancies between ground truth labels and distillation targets. This paper addresses semantic uncertainty caused by ambiguous features and weakly defined labels. We propose a novel cross-modal knowledge distillation framework that mitigates both modality and label inconsistencies. It aligns feature semantics through a prototype-based similarity module and introduces a task-specific distillation head to resolve label-induced inconsistency in supervision. Experimental results demonstrate that our approach improves EEG-based emotion regression and classification performance, outperforming both unimodal and multimodal baselines on a public multimodal dataset. These findings highlight the potential of our framework for BCI applications.

[331] Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing

Rodrigo Tertulino, Ricardo Almeida

Main category: cs.LG

TL;DR: A Federated Learning framework for early dropout prediction in distance education that preserves student data privacy while achieving strong predictive performance.

Motivation: Address persistently high dropout rates in distance education while respecting student data privacy and sovereignty, as traditional centralized approaches raise privacy concerns.

Method: Proposes a Federated Learning framework using the OULAD dataset, simulating privacy-centric scenarios where models are trained locally on early academic performance and digital engagement patterns. Investigates trade-offs between model complexity (Logistic Regression vs. Deep Neural Network) and impact of local data balancing.
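
The aggregation backbone of such a setup is standard FedAvg; a sketch (the OULAD feature pipeline and local data balancing are omitted):

```python
def fedavg(client_params, client_sizes):
    """Server-side FedAvg: average each parameter tensor across clients,
    weighted by local dataset size, so no raw student data leaves a client."""
    total = sum(client_sizes)
    return [
        sum(params[k] * (n / total)
            for params, n in zip(client_params, client_sizes))
        for k in range(len(client_params[0]))
    ]
```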

Result: The federated model achieves strong predictive power with ROC AUC approximately 85%, demonstrating FL is a practical and scalable solution for early-warning systems.

Conclusion: Federated Learning provides an effective privacy-preserving solution for student dropout prediction that respects data sovereignty while maintaining strong predictive performance, making it suitable for institutional adoption.

Abstract: This study proposes and validates a Federated Learning (FL) framework to proactively identify at-risk students while preserving data privacy. Persistently high dropout rates in distance education remain a pressing institutional challenge. Using the large-scale OULAD dataset, we simulate a privacy-centric scenario where models are trained on early academic performance and digital engagement patterns. Our work investigates the practical trade-offs between model complexity (Logistic Regression vs. a Deep Neural Network) and the impact of local data balancing. The resulting federated model achieves strong predictive power (ROC AUC approximately 85%), demonstrating that FL is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty.

[332] Towards Practical Multi-label Causal Discovery in High-Dimensional Event Sequences via One-Shot Graph Aggregation

Hugo Math, Rainer Lienhart

Main category: cs.LG

TL;DR: CARGO: A scalable multi-label causal discovery method for high-dimensional event sequences using pretrained causal Transformers and adaptive frequency fusion to infer causal graphs.

Motivation: Understanding causality in event sequences (like symptoms leading to diseases or error codes leading to system failures) is critical but remains unsolved across domains like healthcare and vehicle diagnostics, especially with sparse, high-dimensional data with thousands of unique event types.

Method: CARGO uses two pretrained causal Transformers as domain-specific foundation models for event sequences. It infers one-shot causal graphs per sequence in parallel and aggregates them using adaptive frequency fusion to reconstruct the global Markov boundaries of labels, enabling efficient probabilistic reasoning at scale.
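
A sketch of the aggregation idea: count how often each directed edge appears across the per-sequence one-shot graphs and keep the well-supported ones. A fixed support threshold stands in here for the paper's adaptive frequency fusion.

```python
from collections import Counter

def fuse_graphs(per_sequence_edges, min_support=0.05):
    """per_sequence_edges: list of edge lists [(cause, effect), ...], one per
    sequence. Returns edges appearing in at least min_support of sequences."""
    counts = Counter(e for edges in per_sequence_edges for e in set(edges))
    n = len(per_sequence_edges)
    return {e for e, c in counts.items() if c / n >= min_support}
```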

Result: Tested on a challenging real-world automotive fault prediction dataset with over 29,100 unique event types and 474 imbalanced labels, CARGO demonstrates ability to perform structured reasoning while bypassing intractable full-dataset conditional independence testing.

Conclusion: CARGO provides a scalable solution for multi-label causal discovery in high-dimensional event sequences, enabling efficient probabilistic reasoning at scale for domains like healthcare and vehicle diagnostics.

Abstract: Understanding causality in event sequences, where outcome labels such as diseases or system failures arise from preceding events like symptoms or error codes, is critical, yet it remains an unsolved challenge across domains like healthcare and vehicle diagnostics. We introduce CARGO, a scalable multi-label causal discovery method for sparse, high-dimensional event sequences comprising thousands of unique event types. Using two pretrained causal Transformers as domain-specific foundation models for event sequences, CARGO infers one-shot causal graphs per sequence in parallel and aggregates them using an adaptive frequency fusion to reconstruct the global Markov boundaries of labels. This two-stage approach enables efficient probabilistic reasoning at scale while bypassing the intractable cost of full-dataset conditional independence testing. Our results on a challenging real-world automotive fault prediction dataset with over 29,100 unique event types and 474 imbalanced labels demonstrate CARGO’s ability to perform structured reasoning.

[333] DFCA: Decentralized Federated Clustering Algorithm

Jonas Kirch, Sebastian Becker, Tiago Koketsu Rodrigues, Stefan Harmeling

Main category: cs.LG

TL;DR: DFCA is a fully decentralized clustered federated learning algorithm that eliminates central server dependency, enabling clients to collaboratively train cluster-specific models through sequential running average aggregation from neighbors.

Motivation: Existing clustered FL methods like IFCA rely on central servers, creating bottlenecks and single points of failure that limit applicability in realistic decentralized settings where central coordination may not be available or desirable.

Method: DFCA uses a sequential running average to aggregate models from neighbors as updates arrive, providing communication-efficient decentralized clustering without batch aggregation. Clients collaboratively train cluster-specific models through peer-to-peer interactions rather than central coordination.
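
The sequential running average is simple enough to state exactly; a sketch for one client folding in neighbor models as they arrive (network plumbing omitted):

```python
def fold_in(theta_avg, theta_neighbor, count):
    """Incremental mean: update the cluster-model average with one incoming
    neighbor model, so no batch of neighbor updates needs to be buffered."""
    count += 1
    theta_avg = theta_avg + (theta_neighbor - theta_avg) / count
    return theta_avg, count
```

Each arriving model is merged in a single O(d) pass, which is what makes the scheme a lightweight alternative to batch aggregation.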

Result: Experiments on various datasets show DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity conditions.

Conclusion: DFCA demonstrates robustness and practicality for dynamic real-world decentralized networks by eliminating central server dependency while maintaining clustering performance through efficient decentralized communication.

Abstract: Clustered Federated Learning has emerged as an effective approach for handling heterogeneous data across clients by partitioning them into clusters with similar or identical data distributions. However, most existing methods, including the Iterative Federated Clustering Algorithm (IFCA), rely on a central server to coordinate model updates, which creates a bottleneck and a single point of failure, limiting their applicability in more realistic decentralized learning settings. In this work, we introduce DFCA, a fully decentralized clustered FL algorithm that enables clients to collaboratively train cluster-specific models without central coordination. DFCA uses a sequential running average to aggregate models from neighbors as updates arrive, providing a communication-efficient alternative to batch aggregation while maintaining clustering performance. Our experiments on various datasets demonstrate that DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity, highlighting its robustness and practicality for dynamic real-world decentralized networks.

[334] MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu

Main category: cs.LG

TL;DR: The paper proposes a new benchmark for evaluating LLM continual learning abilities using simulated user feedback across multiple domains, languages, and task types, revealing current methods are inadequate.

Motivation: Traditional scaling approaches for LLMs (data, parameters, compute) are reaching limits, while human/AI systems learn from practice and memory. Existing benchmarks focus on homogeneous reading comprehension rather than learning from accumulated user feedback during service.

Method: Proposes a user feedback simulation framework and comprehensive benchmark covering multiple domains, languages, and task types to evaluate continual learning abilities of LLM systems.

Result: Experiments show state-of-the-art baselines have unsatisfactory effectiveness and efficiency in continual learning from user feedback.

Conclusion: The benchmark aims to pave the way for future research on LLM memory and optimization algorithms, addressing the gap in evaluating continual learning from real-world user interactions.

Abstract: Scaling up data, parameters, and test-time computation have been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfactory, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.

[335] More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning

Wanhao Yu, Zheng Wang, Shuteng Niu, Sen Lin, Li Yang

Main category: cs.LG

TL;DR: ZO optimization for continual learning addresses plasticity-stability-efficiency trade-off, showing ZO leads to flatter loss landscapes reducing forgetting but hurts plasticity; proposed ZO-FC uses ZO for adapter modules with FO classifier to balance both.

Motivation: Investigate zeroth-order optimization as a novel approach to address the plasticity-stability-efficiency trilemma in continual learning, leveraging ZO's memory efficiency and potential stability benefits.

Method: Theoretical analysis and empirical evaluation of ZO optimization applied to various CL methods; propose ZO-FC: ZO optimization for adapter-based PEFT modules with FO-optimized classifier.
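
For reference, the memory-light estimator at the heart of ZO methods is the two-point (SPSA-style) gradient estimate; a sketch on a flat parameter vector:

```python
import numpy as np

def zo_grad(loss_fn, theta, mu=1e-3, rng=None):
    """Two-point zeroth-order estimate:
    g ≈ [L(theta + mu*u) - L(theta - mu*u)] / (2*mu) * u, with u ~ N(0, I).
    Only forward evaluations are needed, hence the memory savings."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(theta.shape)
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
    return g * u
```

In ZO-FC, an update like this would drive only the adapter parameters, while the classifier keeps exact first-order gradients.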

Result: ZO optimization naturally leads to flatter loss landscapes reducing forgetting (stability), but impairs plasticity due to imprecise gradient estimates; ZO-FC achieves effective balance between stability and plasticity with negligible memory overhead.

Conclusion: ZO-FC offers practical memory-efficient solution for on-device continual learning by leveraging ZO stability benefits while preserving FO adaptability through hybrid optimization approach.

Abstract: Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods, particularly in settings where gradient computation is expensive or even impractical. Beyond its memory efficiency, in this work, we investigate ZO optimization for continual learning (CL) as a novel approach to address the plasticity-stability-efficiency trilemma. Through theoretical analysis and empirical evidence, we show that ZO optimization naturally leads to flatter loss landscapes, which in turn reduce forgetting in CL. However, this stability comes at a cost of plasticity: due to its imprecise gradient estimates and slower convergence, ZO optimization tends to be less effective than FO in acquiring new task-specific knowledge, particularly under constrained training budgets. To better understand this trade-off, we conduct a holistic evaluation of ZO optimization applied to various existing CL methods. Our findings reveal that ZO optimization enhances stability but often undermines plasticity, particularly when used with learnable classifiers. Motivated by this insight, we propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with an FO-optimized classifier. This design leverages the stability benefits of ZO while preserving the adaptability of FO updates with negligible memory overhead. Experiments demonstrate that ZO-FC achieves an effective balance between stability and plasticity, offering a practical and memory-efficient solution for on-device CL.

[336] A Unified Model for Multi-Task Drone Routing in Post-Disaster Road Assessment

Huatian Gong, Jiuh-Biing Sheu, Zheng Wang, Xiaoguang Yang, Ran Yan

Main category: cs.LG

TL;DR: A unified deep reinforcement learning model for drone routing in post-disaster road assessment that handles eight problem variants with a single neural network, outperforming specialized models and traditional methods.

Motivation: Current drone routing methods for post-disaster road assessment face scalability issues: exact/heuristic methods don't scale well and require expertise, while existing DRL approaches need separate models for each problem variant and lack adaptability to changing operational needs.

Method: Proposes a unified transformer encoder-decoder architecture trained across multiple problem configurations, using a lightweight adapter mechanism for efficient finetuning to unseen attributes without full retraining.
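
A sketch of the lightweight adapter pattern described here, as a standard bottleneck module (dimensions are illustrative); only these parameters would be finetuned for an unseen attribute.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted into the frozen encoder-decoder:
    down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, d_model, d_bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```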

Result: Reduces training time and parameters by 8x vs separate models; outperforms single-task DRL by 6-14%, heuristics by 22-42%, and commercial solvers by 24-82% in solution quality; handles networks up to 1,000 nodes in 1-10 seconds.

Conclusion: The unified model provides an efficient, scalable solution for drone routing in dynamic disaster scenarios, with adaptability to new attributes through minimal finetuning while maintaining high performance.

Abstract: Post-disaster road assessment (PDRA) is essential for emergency response, enabling rapid evaluation of infrastructure conditions and efficient allocation of resources. Although drones provide a flexible and effective tool for PDRA, routing them in large-scale networks remains challenging. Exact and heuristic optimization methods scale poorly and demand domain expertise, while existing deep reinforcement learning (DRL) approaches adopt a single-task paradigm, requiring separate models for each problem variant and lacking adaptability to evolving operational needs. This study proposes a unified model (UM) for drone routing that simultaneously addresses eight PDRA variants. By training a single neural network across multiple problem configurations, UM captures shared structural knowledge while adapting to variant-specific constraints through a modern transformer encoder-decoder architecture. A lightweight adapter mechanism further enables efficient finetuning to unseen attributes without retraining, enhancing deployment flexibility in dynamic disaster scenarios. Extensive experiments demonstrate that the UM reduces training time and parameters by a factor of eight compared with training separate models, while consistently outperforming single-task DRL methods by 6-14%, heuristic algorithms by 22-42%, and commercial solvers by 24-82% in terms of solution quality (total collected information value). The model achieves rapid solutions (1-10 seconds) across networks of up to 1,000 nodes, with robustness confirmed through sensitivity analyses. Moreover, finetuning experiments show that unseen attributes can be effectively incorporated with minimal cost while retaining high solution quality. The source code for UM is publicly available at https://github.com/PJ-HTU/UM_PDRA.

[337] How Muon’s Spectral Design Benefits Generalization: A Study on Imbalanced Data

Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, Christos Thrampoulidis

Main category: cs.LG

TL;DR: Spectral optimizers (SpecGD, Muon, Shampoo) outperform vanilla GD on imbalanced data by learning all principal components equally, unlike GD which prioritizes dominant components first.

Motivation: To understand when spectrum-aware matrix-valued optimizers (like Muon and Shampoo) outperform competitive algorithms in deep learning, particularly their generalization properties on imbalanced data.

Method: Study Spectral Gradient Descent (SpecGD) as canonical form of such optimizers, analyze Gaussian mixture data model with linear/bilinear models, extend to deep linear models, and empirically validate on imbalanced datasets comparing Muon, Shampoo vs Euclidean counterparts and Adam.
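
The canonical update is easy to state; a sketch of one SpecGD step (with optional truncation), which discards the singular values so every singular direction advances at the same rate:

```python
import numpy as np

def specgd_step(W, grad, lr, rank=None):
    """Spectral GD: step along U @ V^T from the SVD of the gradient."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is not None:                      # optional truncation
        U, Vt = U[:, :rank], Vt[:rank, :]
    return W - lr * (U @ Vt)
```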

Result: SpecGD learns all principal components at equal rates while GD prioritizes dominant components first, creating early training gap in class balanced loss favoring SpecGD. Depth amplifies these effects. Empirical validation shows spectral optimizers achieve superior generalization by balanced learning.

Conclusion: Spectral optimizers outperform Euclidean counterparts on imbalanced data due to their ability to learn all data components equally rather than prioritizing dominant ones, leading to better generalization.

Abstract: The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) – each update step is $UV^T$ where $UΣV^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in class balanced loss favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data’s underlying components.

[338] CaberNet: Causal Representation Learning for Cross-Domain HVAC Energy Prediction

Kaiyuan Zhai, Jiacheng Cui, Zhehao Zhang, Junyu Xue, Yang Deng, Kui Wu, Guoming Tang

Main category: cs.LG

TL;DR: CaberNet: A causal interpretable deep sequence model for cross-domain HVAC energy prediction that learns invariant representations to handle data scarcity and heterogeneity across buildings in different climates.

Motivation: Cross-domain HVAC energy prediction is challenging due to costly labeled data collection for each new building, data scarcity and heterogeneity across different buildings, climate zones, and seasons. Existing methods overfit to spurious correlations, require expert intervention, or compromise on data diversity.

Method: CaberNet integrates: 1) a global feature gate with self-supervised Bernoulli regularization to distinguish causal features from inferior ones, and 2) a domain-wise training scheme that balances domain contributions, minimizes cross-domain loss variance, and promotes latent factor independence. It learns invariant (Markov blanket) representations in a purely data-driven fashion without prior knowledge.
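
A sketch of the gating idea in component 1), using a deterministic sigmoid relaxation with a sparsity penalty; the paper's gate is trained with self-supervised Bernoulli regularization, which this simplification does not reproduce.

```python
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    """Global per-feature gate: learned logits squashed to (0, 1) multiply the
    inputs; a penalty on open gates pushes inferior features toward zero."""
    def __init__(self, n_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                      # x: (batch, n_features)
        return x * torch.sigmoid(self.logits)

    def sparsity_penalty(self):
        return torch.sigmoid(self.logits).sum()
```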

Result: CaberNet consistently outperforms all baselines on real-world datasets from three buildings in climatically diverse cities, achieving a 22.9% reduction in normalized mean squared error (NMSE) compared to the best benchmark.

Conclusion: CaberNet provides a robust, interpretable solution for cross-domain HVAC energy prediction by learning invariant causal representations, addressing data heterogeneity challenges without requiring expert intervention or compromising data diversity.

Abstract: Cross-domain HVAC energy prediction is essential for scalable building energy management, particularly because collecting extensive labeled data for every new building is both costly and impractical. Yet, this task remains highly challenging due to the scarcity and heterogeneity of data across different buildings, climate zones, and seasonal patterns. In particular, buildings situated in distinct climatic regions introduce variability that often leads existing methods to overfit to spurious correlations, rely heavily on expert intervention, or compromise on data diversity. To address these limitations, we propose CaberNet, a causal and interpretable deep sequence model that learns invariant (Markov blanket) representations for robust cross-domain prediction. In a purely data-driven fashion and without requiring any prior knowledge, CaberNet integrates i) a global feature gate trained with a self-supervised Bernoulli regularization to distinguish superior causal features from inferior ones, and ii) a domain-wise training scheme that balances domain contributions, minimizes cross-domain loss variance, and promotes latent factor independence. We evaluate CaberNet on real-world datasets collected from three buildings located in three climatically diverse cities, and it consistently outperforms all baselines, achieving a 22.9% reduction in normalized mean squared error (NMSE) compared to the best benchmark. Our code is available at https://github.com/SusCom-Lab/CaberNet-CRL.

[339] Integrating Ontologies with Large Language Models for Enhanced Control Systems in Chemical Engineering

Crystal Su, Kuai Yu, Jingrui Zhang, Mingyuan Shao, Daniel Bauer

Main category: cs.LG

TL;DR: An ontology-integrated LLM framework for chemical engineering that combines structured domain knowledge (COPE ontology) with generative AI for process control and safety applications.

Motivation: To create a transparent, auditable approach for applying LLMs to critical engineering contexts like process control and safety analysis by integrating symbolic structure with neural generation.

Method: A pipeline that aligns LLM training/inference with COPE ontology through data acquisition, semantic preprocessing, information extraction, and ontology mapping to produce templated QA pairs for fine-tuning. Includes control-focused decoding and citation gate for syntactic/factual grounding.
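
A toy sketch of what such a citation gate can reduce to at its simplest: rejecting any generated domain term that cannot be linked to the ontology vocabulary (the term sets and matching scheme here are hypothetical; the actual gate constrains decoding itself).

```python
def citation_gate(answer_terms, ontology_vocab):
    """Return (passes, unlinked): the answer passes only if every domain term
    is grounded in the ontology's vocabulary."""
    unlinked = [t for t in answer_terms if t.lower() not in ontology_vocab]
    return len(unlinked) == 0, unlinked

ok, missing = citation_gate(
    ["PID controller", "setpoint"],
    {"pid controller", "setpoint", "relief valve"},
)
```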

Result: The framework produces outputs constrained to ontology-linked terms, with evaluation metrics quantifying both linguistic quality and ontological accuracy. Future extensions include semantic retrieval and iterative validation.

Conclusion: Integrating symbolic structure (ontology) with neural generation (LLMs) provides a transparent, auditable approach for applying AI to critical chemical engineering applications, enhancing interpretability and reliability.

Abstract: This work presents an ontology-integrated large language model (LLM) framework for chemical engineering that unites structured domain knowledge with generative reasoning. The proposed pipeline aligns model training and inference with the COPE ontology through a sequence of data acquisition, semantic preprocessing, information extraction, and ontology mapping steps, producing templated question-answer pairs that guide fine-tuning. A control-focused decoding stage and citation gate enforce syntactic and factual grounding by constraining outputs to ontology-linked terms, while evaluation metrics quantify both linguistic quality and ontological accuracy. Feedback and future extensions, including semantic retrieval and iterative validation, further enhance the system’s interpretability and reliability. This integration of symbolic structure and neural generation provides a transparent, auditable approach for applying LLMs to process control, safety analysis, and other critical engineering contexts.

[340] Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli

Main category: cs.LG

TL;DR: The paper proposes using well-designed behavior policies for off-policy data collection to achieve provably lower-variance return estimates in online RL, improving sample efficiency and training stability.

Motivation: Many RL algorithms suffer from poor sample efficiency and training instability due to high-variance return estimates. Traditional on-policy data collection is not variance-optimal, and recent off-policy evaluation results show that well-designed behavior policies can provide lower-variance estimates.

Method: Extends off-policy evaluation insights to online RL by using a single behavior policy to collect data for policy improvement with provably lower-variance return estimates. The approach is applied to policy-gradient methods, focusing on variance reduction through optimal behavior policy design rather than parallel worker reconciliation.
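
For orientation, the estimator being improved is the ordinary importance-sampled return; a trajectory-level sketch (the paper's contribution, choosing the behavior policy so this estimate has provably lower variance, is not shown):

```python
import numpy as np

def is_return(rewards, logp_target, logp_behavior, gamma=0.99):
    """Trajectory-level importance sampling: the product of per-step ratios
    pi(a|s)/b(a|s) reweights a return collected under behavior policy b."""
    ratios = np.exp(np.asarray(logp_target) - np.asarray(logp_behavior))
    weight = np.prod(ratios)
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))
    return weight * discounted
```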

Result: Experiments extending two policy-gradient methods with this regime demonstrate better sample efficiency and performance across diverse environments compared to standard approaches.

Conclusion: On-policy data collection is not variance-optimal for RL; using well-designed behavior policies for off-policy data collection can provide provably lower-variance return estimates, leading to improved sample efficiency and training stability in online RL settings.

Abstract: Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation; it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. This result is surprising as it means collecting data on-policy is not variance optimal. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance weighted samples for de-biasing and managing variance appropriately. Generally, these approaches are concerned with reconciling data collected from multiple workers in parallel while the policy is updated asynchronously; the mismatch between the workers and the policy is corrected in a mathematically sound way. Here we consider only one worker - the behaviour policy, which is used to collect data for policy improvement, with provably lower variance return estimates. In our experiments we extend two policy-gradient methods with this regime, demonstrating better sample efficiency and performance over a diverse set of environments.

[341] Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle

Crystal Su, Kuai Yu, Mingyuan Shao, Daniel Bauer

Main category: cs.LG

TL;DR: A Bayesian model deconvolves bulk RNA-seq data using single-cell reference to reveal cell-type-specific expression and proportions in human endometrial tissue across menstrual cycle phases.

Motivation: Bulk RNA-seq averages gene expression across heterogeneous cell types, obscuring cell-specific dynamics. This is particularly problematic in tissues like endometrium with dramatic hormone-driven cellular composition changes during the menstrual cycle.

Method: Probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions using high-resolution single-cell reference. Extended framework for inferring cell type proportions and cell-specific gene expression changes across biological conditions.
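
As a point-estimate analogue of the hierarchical model (not the paper's Bayesian inference), deconvolution can be read as a non-negative regression of bulk expression onto single-cell signatures:

```python
from scipy.optimize import nnls

def deconvolve_bulk(bulk, reference):
    """Solve bulk ≈ reference @ proportions with non-negativity, then
    normalize to obtain cell-type fractions.
    bulk: (G,) bulk expression; reference: (G, C) cell-type signatures."""
    props, _ = nnls(reference, bulk)
    return props / props.sum()
```

The Bayesian version replaces this single solve with priors over proportions and expression, which is what yields the uncertainty estimates and robustness to reference mismatch reported below.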

Result: Model reveals dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases. Identifies cell-type-specific differential gene expression (e.g., decidualization markers in stromal cells during secretory phase). Bayesian approach shows resilience to reference mismatches and noise.

Conclusion: The Bayesian deconvolution framework successfully uncovers cell-type-specific dynamics in heterogeneous tissues. Findings have biological significance for endometrial function, potential clinical implications for fertility/endometrial disorders, and future integration with spatial transcriptomics.

Abstract: Bulk tissue RNA sequencing of heterogeneous samples provides averaged gene expression profiles, obscuring cell type-specific dynamics. To address this, we present a probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions, leveraging a high-resolution single-cell reference. We apply our model to human endometrial tissue across the menstrual cycle, a context characterized by dramatic hormone-driven cellular composition changes. Our extended framework provides a principled inference of cell type proportions and cell-specific gene expression changes across cycle phases. We demonstrate the model’s structure, priors, and inference strategy in detail, and we validate its performance with simulations and comparisons to existing methods. The results reveal dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identify cell-type-specific differential gene expression associated with endometrial function (e.g., decidualization markers in stromal cells during the secretory phase). We further conduct robustness tests and show that our Bayesian approach is resilient to reference mismatches and noise. Finally, we discuss the biological significance of our findings, potential clinical implications for fertility and endometrial disorders, and future directions, including integration of spatial transcriptomics.

[342] Sensitivity Analysis for Climate Science with Generative Flow Models

Alex Dobra, Jakiw Pidstrigach, Tim Reichelt, Paolo Fraccaro, Anne Jones, Johannes Jakubik, Christian Schroeder de Witt, Philip Torr, Philip Stier

Main category: cs.LG

TL;DR: Adjoint state method enables efficient sensitivity analysis in generative flow models for climate science, reducing computation from weeks on supercomputers to hours on GPUs.

Motivation: Traditional physical models for climate sensitivity analysis are computationally expensive, while AI-based generative models lack efficient gradient computation methods for sensitivity analysis.

Method: Applied adjoint state method to compute gradients in generative flow models, specifically using the cBottle model trained on ERA5 and ICON data for sensitivity analysis of atmospheric variables with respect to sea surface temperatures.
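
A toy illustration of the gradient being computed: reverse-mode differentiation through an Euler-integrated flow, with a stand-in velocity field; the adjoint state method yields the same gradient while avoiding storage of every intermediate step.

```python
import torch

def integrate_flow(v, x0, n_steps=50):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to 1 (a generative flow)."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

v = lambda x, t: -x                       # toy field standing in for cBottle
sst = torch.randn(8, requires_grad=True)  # inputs, e.g. sea surface temperatures
out = integrate_flow(v, sst).sum()        # scalar output statistic
out.backward()                            # sst.grad holds the sensitivities
```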

Result: Quantitatively validated computed sensitivities against model outputs, showing reliable gradient computation with dramatic speedup from weeks on supercomputers to hours on GPUs.

Conclusion: Adjoint method enables efficient and reliable sensitivity analysis in generative climate models, simplifying critical climate science workflows while maintaining accuracy.

Abstract: Sensitivity analysis is a cornerstone of climate science, essential for understanding phenomena ranging from storm intensity to long-term climate feedbacks. However, computing these sensitivities using traditional physical models is often prohibitively expensive in terms of both computation and development time. While modern AI-based generative models are orders of magnitude faster to evaluate, computing sensitivities with them remains a significant bottleneck. This work addresses this challenge by applying the adjoint state method for calculating gradients in generative flow models. We apply this method to the cBottle generative model, trained on ERA5 and ICON data, to perform sensitivity analysis of any atmospheric variable with respect to sea surface temperatures. We quantitatively validate the computed sensitivities against the model’s own outputs. Our results provide initial evidence that this approach can produce reliable gradients, reducing the computational cost of sensitivity analysis from weeks on a supercomputer with a physical model to hours on a GPU, thereby simplifying a critical workflow in climate science. The code can be found at https://github.com/Kwartzl8/cbottle_adjoint_sensitivity.

[343] Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games

Runyu Lu, Peng Zhang, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao, Yang Liu, Dong Wang, Cesare Alippi

Main category: cs.LG

TL;DR: EPG framework enables zero-shot cross-graph generalization for pursuit-evasion games using equilibrium-guided RL training across different graph structures.

Motivation: Existing RL methods for pursuit-evasion games require recomputation or fine-tuning when graph structures change, which is time-consuming and impairs real-time applicability. There's a need for policies that generalize across different graph structures without retraining.

Method: Proposes Equilibrium Policy Generalization (EPG) framework: 1) Uses dynamic programming algorithm to generate pure-strategy Nash equilibrium policies for single graphs, 2) Trains RL policy across different graph structures against these equilibrium policies, 3) Introduces grouping mechanism and sequence model for scalability with multiple pursuers, 4) Uses distance features for cross-graph training.

Result: EPG achieves robust zero-shot performance on various unseen real-world graphs. The generalized pursuer policy matches performance of fine-tuned state-of-the-art methods when trained with equilibrium heuristic for graphs with exits.

Conclusion: EPG framework successfully enables cross-graph generalization for pursuit-evasion games, achieving zero-shot performance comparable to fine-tuned methods while being applicable to both pursuer/evader sides and both no-exit/multi-exit scenarios.

Abstract: Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion game (PEG), as an important class of real-world games from the fields of robotics and security, requires exponential time to be accurately solved. When the underlying graph structure varies, even the state-of-the-art RL methods require recomputation or at least fine-tuning, which can be time-consuming and impair real-time applicability. This paper proposes an Equilibrium Policy Generalization (EPG) framework to effectively learn a generalized policy with robust cross-graph zero-shot performance. In the context of PEGs, our framework is generally applicable to both pursuer and evader sides in both no-exit and multi-exit scenarios. These two generalizability properties, to our knowledge, are the first to appear in this domain. The core idea of the EPG framework is to train an RL policy across different graph structures against the equilibrium policy for each single graph. To construct an equilibrium oracle for single-graph policies, we present a dynamic programming (DP) algorithm that provably generates a pure-strategy Nash equilibrium with near-optimal time complexity. To guarantee scalability with respect to pursuer number, we further extend DP and RL by designing a grouping mechanism and a sequence model for joint policy decomposition, respectively. Experimental results show that, using equilibrium guidance and a distance feature proposed for cross-graph PEG training, the EPG framework guarantees desirable zero-shot performance in various unseen real-world graphs. Besides, when trained under an equilibrium heuristic proposed for the graphs with exits, our generalized pursuer policy can even match the performance of the fine-tuned policies from the state-of-the-art PEG methods.

[344] Methodological Precedence in Health Tech: Why ML/Big Data Analysis Must Follow Basic Epidemiological Consistency. A Case Study

Marco Roccetti

Main category: cs.LG

TL;DR: Advanced ML/big data analyses amplify methodological flaws rather than correct them; applying basic descriptive statistics to a COVID-19 vaccine study reveals selection bias invalidating cancer risk claims.

Motivation: To demonstrate that sophisticated analytical tools (ML/big data) cannot overcome fundamental methodological flaws in study design, and that advanced analyses actually amplify rather than correct basic errors in epidemiological research.

Method: Applied simple descriptive statistical methods and compared results to established national epidemiological benchmarks to re-analyze a published cohort study on COVID-19 vaccine outcomes and severe adverse events (like cancer). Focused on verifying basic methodological coherence and adherence to STROBE Statement protocols before considering advanced analyses.
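
To make the kind of consistency check concrete, here is a small Python sketch of the crude-incidence-rate comparison it describes; the numbers are invented for illustration and are not taken from the study under review.

```python
def crude_incidence_rate(cases, person_years, per=100_000):
    """Crude incidence rate: new cases per `per` person-years of follow-up."""
    return cases / person_years * per

# Hypothetical figures for illustration only.
cohort_cir = crude_incidence_rate(cases=480, person_years=120_000)   # 400.0
national_cir = 650.0   # assumed national benchmark per 100,000 person-years

# A cohort rate far below the national benchmark, combined with an *elevated*
# rate inside one exposure subgroup, is exactly the kind of internal
# contradiction that signals selection bias in cohort construction.
print(f"cohort CIR = {cohort_cir:.1f} vs national CIR = {national_cir:.1f}")
```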

Result: Revealed statistically irreconcilable paradoxes in the original study: contradictory findings of increased cancer incidence in an exposure subgroup while showing suppressed overall Crude Incidence Rate compared to national standards. Demonstrated these effects are mathematical artifacts from uncorrected selection bias in cohort construction, invalidating the reported risk of increased cancer in the total population.

Conclusion: Complex health studies must first pass basic epidemiological consistency tests before conclusions from advanced statistical modeling can be considered valid. Methodological rigor in study design is foundational: sophisticated analyses amplify rather than correct fundamental flaws, making adherence to established protocols like STROBE essential for credible research.

Abstract: The integration of advanced analytical tools, including Machine Learning (ML) and massive data processing, has revolutionized health research, promising unprecedented accuracy in diagnosis and risk prediction. However, the rigor of these complex methods is fundamentally dependent on the quality and integrity of the underlying datasets and the validity of their statistical design. We propose an emblematic case where advanced analysis (ML/Big Data) must necessarily be subsequent to the verification of basic methodological coherence and adherence to established medical protocols, such as the STROBE Statement. This study highlights a crucial cautionary principle: sophisticated analyses amplify, rather than correct, severe methodological flaws rooted in basic design choices, leading to misleading or contradictory findings. By applying simple, standard descriptive statistical methods and established national epidemiological benchmarks to a recently published cohort study on COVID-19 vaccine outcomes and severe adverse events, like cancer, we expose multiple, statistically irreconcilable paradoxes. These paradoxes, specifically the contradictory finding of an increased cancer incidence within an exposure subgroup, concurrent with a suppressed overall Crude Incidence Rate compared to national standards, definitively invalidate the reported risk of increased cancer in the total population. We demonstrate that the observed effects are mathematical artifacts stemming from an uncorrected selection bias in the cohort construction. This analysis serves as a robust reminder that even the most complex health studies must first pass the test of basic epidemiological consistency before any conclusion drawn from subsequent advanced statistical modeling can be considered valid or publishable.

[345] The Final-Stage Bottleneck: A Systematic Dissection of the R-Learner for Network Causal Inference

S Sairam, Sara Girdhar, Shivam Soni

Main category: cs.LG

TL;DR: R-Learner’s performance on graph data is dominated by the inductive bias of the final-stage CATE estimator, not nuisance models. Graph-blind final stages cause catastrophic failure, while graph-aware final stages succeed and outperform baselines.

DetailsMotivation: The R-Learner framework assumes a well-specified final-stage model, but this assumption breaks down on network data where causal heterogeneity depends on graph structure. There's a need to systematically understand how R-Learners perform on graphs and identify bottlenecks.

Method: Large-scale empirical study analyzing R-Learner on graphs. Proposed Graph R-Learner with graph-aware final stage. Used synthetic and semi-synthetic benchmarks. Conducted “Hub-Periphery Trade-off” analysis to explain topology-dependent nuisance bottlenecks.
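
For readers unfamiliar with the setup, the R-Learner's two-stage recipe can be sketched in a few lines of Python; the final-stage regression below is deliberately graph-blind, which is exactly the configuration the paper shows to fail on network data (the synthetic data and model choices are illustrative, not the paper's benchmark).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.5, size=n).astype(float)   # randomized treatment
tau_true = 1.0 + X[:, 0]                         # heterogeneous effect
Y = X[:, 1] + tau_true * T + rng.normal(size=n)

# Stage 1: cross-fitted nuisance estimates of E[Y|X] and E[T|X].
m_hat = cross_val_predict(
    RandomForestRegressor(n_estimators=100, random_state=0), X, Y, cv=3)
e_hat = np.full(n, T.mean())

# Stage 2: the R-loss reduces to a weighted regression of the pseudo-outcome
# (Y - m_hat) / (T - e_hat) on X with weights (T - e_hat)^2.
resid_t = T - e_hat
final = LinearRegression().fit(X, (Y - m_hat) / resid_t,
                               sample_weight=resid_t**2)

# The paper's point: on graphs, this final-stage model of tau(x) is the
# bottleneck; a graph-blind regressor fails there even with strong GNN
# nuisance models, while a graph-aware (GNN) final stage succeeds.
print("intercept ~1:", final.intercept_.round(2),
      "coef on X0 ~1:", final.coef_[0].round(2))
```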

Result: 1) Final-stage inductive bias dominates performance, not nuisance models. 2) Graph-blind final stages fail catastrophically (MSE > 4.0, p < 0.001). 3) Graph R-Learner succeeds and outperforms GNN T-Learner baseline. 4) Identified topology-dependent “nuisance bottleneck” linked to GNN over-squashing.

Conclusion: The “final-stage bottleneck” is critical for R-Learner success on graphs. Graph-aware final stages are essential, and researchers must address both representation and nuisance bottlenecks for effective heterogeneous treatment effect estimation on network data.

Abstract: The R-Learner is a powerful, theoretically-grounded framework for estimating heterogeneous treatment effects, prized for its robustness to nuisance model errors. However, its application to network data, where causal heterogeneity is often graph-dependent, presents a critical challenge to its core assumption of a well-specified final-stage model. In this paper, we conduct a large-scale empirical study to systematically dissect the R-Learner framework on graphs. We provide the first rigorous evidence that the primary driver of performance is the inductive bias of the final-stage CATE estimator, an effect that dominates the choice of nuisance models. Our central finding is the quantification of a catastrophic “representation bottleneck”: we prove with overwhelming statistical significance (p < 0.001) that R-Learners with a graph-blind final stage fail completely (MSE > 4.0), even when paired with powerful GNN nuisance models. Conversely, our proposed end-to-end Graph R-Learner succeeds and significantly outperforms a strong, non-DML GNN T-Learner baseline. Furthermore, we identify and provide a mechanistic explanation for a subtle, topology-dependent “nuisance bottleneck,” linking it to GNN over-squashing via a targeted “Hub-Periphery Trade-off” analysis. Our findings are validated across diverse synthetic and semi-synthetic benchmarks. We release our code as a reproducible benchmark to facilitate future research on this critical “final-stage bottleneck.”

[346] Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect

Yuwen Zhang, Viet Tran, Paul Weng

Main category: cs.LG

TL;DR: The paper addresses the Rashomon Effect in clinical ML where multiple models have similar performance, making selection uncertain. It proposes Intervention Efficiency (IE) and Perturbation Validation Framework (PVF) for robust model assessment that considers clinical utility and stability under data perturbations.

DetailsMotivation: Clinical ML faces the Rashomon Effect where multiple models show comparable performance, especially with small, imbalanced, noisy datasets and high-dimensional features. Conventional validation schemes become unreliable, and model selection becomes uncertain when resource constraints and operational priorities aren't considered by standard metrics like F1 score.

Method: Two complementary tools: 1) Intervention Efficiency (IE) - a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible, linking predictive performance with clinical utility. 2) Perturbation Validation Framework (PVF) - a structured approach to assess model stability under data perturbations, identifying models with most invariant performance across noisy or shifted validation sets.
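
The summary does not spell out the IE formula, but a natural capacity-aware reading can be sketched as follows; treat the definition as a hypothetical reconstruction rather than the paper's exact metric.

```python
import numpy as np

def intervention_efficiency(y_true, risk_scores, capacity):
    """Fraction of a limited intervention budget that reaches true positives.

    Rank patients by predicted risk, intervene on the top `capacity`, and
    measure how many interventions land on actual positives. (Illustrative
    formula; the paper's precise definition of IE may differ.)
    """
    treated = np.argsort(risk_scores)[::-1][:capacity]   # highest risk first
    return y_true[treated].sum() / capacity

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1])
print(intervention_efficiency(y, scores, capacity=3))   # 2 of 3 slots -> 0.667
```

Under this reading, two models with identical F1 scores can have very different IE once the intervention budget is smaller than the number of predicted positives, which is the gap the metric is designed to expose.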

Result: Empirical results on synthetic and real-world healthcare datasets show that using IE and PVF facilitates selection of models that generalize more robustly and align with capacity constraints.

Conclusion: The proposed tools offer a new direction for tackling the Rashomon Effect in clinical settings by providing robust model assessment and selection methods that consider clinical utility and stability under data perturbations.

Abstract: In clinical machine learning, the coexistence of multiple models with comparable performance – a manifestation of the Rashomon Effect – poses fundamental challenges for trustworthy deployment and evaluation. Small, imbalanced, and noisy datasets, coupled with high-dimensional and weakly identified clinical features, amplify this multiplicity and make conventional validation schemes unreliable. As a result, selecting among equally performing models becomes uncertain, particularly when resource constraints and operational priorities are not considered by conventional metrics like F1 score. To address these issues, we propose two complementary tools for robust model assessment and selection: Intervention Efficiency (IE) and the Perturbation Validation Framework (PVF). IE is a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible, thereby linking predictive performance with clinical utility. PVF introduces a structured approach to assess the stability of models under data perturbations, identifying models whose performance remains most invariant across noisy or shifted validation sets. Empirical results on synthetic and real-world healthcare datasets show that using these tools facilitates the selection of models that generalize more robustly and align with capacity constraints, offering a new direction for tackling the Rashomon Effect in clinical settings.

[347] Adversarial Signed Graph Learning with Differential Privacy

Haobin Ke, Sen Zhang, Qingqing Ye, Xun Ran, Haibo Hu

Main category: cs.LG

TL;DR: ASGL is a privacy-preserving adversarial signed graph learning method that achieves node-level differential privacy while maintaining high utility for signed graphs with positive/negative edges.

DetailsMotivation: Existing differential privacy methods for unsigned graphs are unsuitable for signed graphs because edge perturbation causes cascading errors in sign inference under balance theory, and gradient perturbation has high sensitivity due to node interdependence and sign flips.

Method: 1) Decompose signed graphs into positive/negative subgraphs based on edge signs; 2) Design gradient-perturbed adversarial module to approximate true signed connectivity distribution; 3) Use constrained breadth-first search tree strategy fused with balance theory to identify edge signs between generated node pairs; 4) Implement gradient decoupling to lower sensitivity.
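
Step 1 and the balance-theory sign rule are simple enough to show directly; here is a dependency-free Python sketch (the data structures are illustrative, not the paper's implementation).

```python
import math

def split_signed_graph(edges):
    """Decompose a signed edge list into positive and negative subgraphs.

    edges: iterable of (u, v, sign) with sign in {+1, -1}. This separation
    is what later enables the per-subgraph sensitivity reduction.
    """
    positive = [(u, v) for u, v, s in edges if s > 0]
    negative = [(u, v) for u, v, s in edges if s < 0]
    return positive, negative

def balance_sign(path_signs):
    """Balance theory: the sign inferred between a path's endpoints is the
    product of edge signs along the path (e.g. two negatives make a positive),
    the rule the constrained BFS-tree strategy relies on."""
    return math.prod(path_signs)

edges = [(0, 1, +1), (1, 2, -1), (2, 3, +1), (0, 3, -1)]
print(split_signed_graph(edges))
print(balance_sign([+1, -1, -1]))   # -> 1
```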

Result: Extensive experiments on real-world datasets show ASGL achieves favorable privacy-utility trade-offs across multiple downstream tasks.

Conclusion: ASGL effectively addresses privacy concerns in signed graph learning by combining adversarial learning with differential privacy, overcoming limitations of existing methods through subgraph separation and gradient decoupling techniques.

Abstract: Signed graphs with positive and negative edges can model complex relationships in social networks. Leveraging balance theory, which deduces edge signs from multi-hop node pairs, signed graph learning can generate node embeddings that preserve both structural and sign information. However, training on sensitive signed graphs raises significant privacy concerns, as model parameters may leak private link information. Existing protection methods with differential privacy (DP) typically rely on edge or gradient perturbation for unsigned graph protection. Yet, they are not well-suited for signed graphs, mainly because edge perturbation tends to cause cascading errors in edge sign inference under balance theory, while gradient perturbation increases sensitivity due to node interdependence and gradient polarity change caused by sign flips, resulting in larger noise injection. In this paper, motivated by the robustness of adversarial learning to noisy interactions, we present ASGL, a privacy-preserving adversarial signed graph learning method that preserves high utility while achieving node-level DP. We first decompose signed graphs into positive and negative subgraphs based on edge signs, and then design a gradient-perturbed adversarial module to approximate the true signed connectivity distribution. In particular, the gradient perturbation helps mitigate cascading errors, while the subgraph separation facilitates sensitivity reduction. Further, we devise a constrained breadth-first search tree strategy that fuses with balance theory to identify the edge signs between generated node pairs. This strategy also enables gradient decoupling, thereby effectively lowering gradient sensitivity. Extensive experiments on real-world datasets show that ASGL achieves favorable privacy-utility trade-offs across multiple downstream tasks.

[348] A Variance-Based Analysis of Sample Complexity for Grid Coverage

Lyu Yuhuan

Main category: cs.LG

TL;DR: The paper presents a new sample complexity bound for uniform random sampling on d-dimensional unit hypercubes with logarithmic dependence on failure probability δ, improving over classical linear 1/δ bounds.

DetailsMotivation: Classical coverage analyses for verifying uniform conditions over continuous spaces yield conservative bounds, especially at small failure probabilities. This is problematic for algorithms relying on grid-based coverage guarantees in high-confidence regimes.

Method: Study uniform random sampling on d-dimensional unit hypercubes, analyze uncovered subcubes after discretization, apply concentration inequality to uncovered-count statistic, derive sample complexity bound with logarithmic δ dependence.

Result: Derived sample complexity bound M = O(C̃ ln(2C̃/δ)) with logarithmic dependence on δ, contrasting with classical linear 1/δ dependence. Numerical studies show the bound tracks practical coverage requirements more tightly and scales favorably as δ→0.
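
To see the scaling concretely, the bound can be evaluated numerically; the sketch below assumes C̃ counts the grid cells (e.g. C̃ = m^d for m cells per axis) and ignores constant factors.

```python
import math

def sample_bound(c_tilde, delta):
    """M = C~ * ln(2*C~/delta), the stated bound up to constants."""
    return c_tilde * math.log(2 * c_tilde / delta)

c = 10**4   # e.g. m = 10 cells per axis in d = 4 dimensions
for delta in (1e-2, 1e-4, 1e-6):
    # Each 100x tightening of delta adds only an additive log term per cell,
    # unlike a classical bound with multiplicative 1/delta dependence.
    print(f"delta={delta:.0e}  M ~ {sample_bound(c, delta):,.0f}")
```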

Conclusion: The new bound offers a sharper theoretical tool for algorithms relying on grid-based coverage guarantees, enabling more efficient sampling especially in high-confidence regimes, with favorable scaling as failure probability decreases.

Abstract: Verifying uniform conditions over continuous spaces through random sampling is fundamental in machine learning and control theory, yet classical coverage analyses often yield conservative bounds, particularly at small failure probabilities. We study uniform random sampling on the $d$-dimensional unit hypercube and analyze the number of uncovered subcubes after discretization. By applying a concentration inequality to the uncovered-count statistic, we derive a sample complexity bound with a logarithmic dependence on the failure probability ($\delta$), i.e., $M = O(\tilde{C}\ln(\frac{2\tilde{C}}{\delta}))$, which contrasts sharply with the classical linear $1/\delta$ dependence. Under standard Lipschitz and uniformity assumptions, we present a self-contained derivation and compare our result with classical coupon-collector rates. Numerical studies across dimensions, precision levels, and confidence targets indicate that our bound tracks practical coverage requirements more tightly and scales favorably as $\delta \to 0$. Our findings offer a sharper theoretical tool for algorithms that rely on grid-based coverage guarantees, enabling more efficient sampling, especially in high-confidence regimes.

[349] HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs

Ningning Chen, Weicai Ye, Ying Jiang

Main category: cs.LG

TL;DR: HBLLM is a wavelet-enhanced 1-bit quantization method for LLMs that uses Haar wavelet transforms and structure-aware grouping to achieve high fidelity with minimal storage overhead.

DetailsMotivation: To develop an efficient 1-bit post-training quantization method for LLMs that maintains high fidelity while minimizing storage requirements, addressing the challenge of model compression for large language models.

Method: Uses Haar wavelet transforms for frequency decomposition, with two structure-aware grouping strategies: frequency-aware multi-parameter intra-row grouping and ℓ₂-norm-based saliency-driven column selection. Non-salient weights use shared means across quantization groups within each frequency band.
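
A minimal Python sketch of the two core ingredients, a one-level Haar split and 1-bit sign quantization with a shared scale per group, may help fix ideas; HBLLM's actual grouping (frequency-aware intra-row groups plus saliency-selected columns) is finer than this.

```python
import numpy as np

def haar_1d(w):
    """One level of the Haar wavelet transform along the last axis
    (assumes an even number of columns)."""
    even, odd = w[..., 0::2], w[..., 1::2]
    low = (even + odd) / np.sqrt(2)    # smooth / low-frequency band
    high = (even - odd) / np.sqrt(2)   # detail / high-frequency band
    return low, high

def one_bit_quantize(group):
    """1-bit quantization: keep only the signs plus one shared scale
    (here the mean absolute value) for the whole quantization group."""
    scale = np.abs(group).mean()
    return np.sign(group), scale

W = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
for name, band in zip(("low", "high"), haar_1d(W)):
    signs, scale = one_bit_quantize(band)
    err = np.abs(band - signs * scale).mean()
    print(f"{name}: shared scale={scale:.3f}, mean abs error={err:.3f}")
```

Quantizing per frequency band rather than on the raw weights is the point: the low and high bands have different magnitude statistics, so each gets a scale that fits it.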

Result: Achieves state-of-the-art 1-bit quantization performance with perplexity of 6.71 on LLaMA2-13B while using only 1.08 bits average weight storage. Demonstrated effectiveness on OPT and LLaMA models.

Conclusion: HBLLM provides an effective wavelet-enhanced quantization approach that significantly improves fidelity in 1-bit LLM quantization with minimal storage overhead, making it a practical solution for deploying large language models efficiently.

Abstract: We introduce HBLLM, a wavelet-enhanced high-fidelity $1$-bit post-training quantization method for Large Language Models (LLMs). By leveraging Haar wavelet transforms to enhance expressive capacity through frequency decomposition, HBLLM significantly improves quantization fidelity while maintaining minimal overhead. This approach features two innovative structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping and (2) $\ell_2$-norm-based saliency-driven column selection. For non-salient weights, a shared mean is employed across quantization groups within each frequency band to optimize storage efficiency. Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in $1$-bit quantization, attaining a perplexity of $6.71$ on LLaMA$2$-$13$B with an average weight storage of only $1.08$ bits. Code available at: https://github.com/Yeyke/HBLLM.

[350] Diffusion for Fusion: Designing Stellarators with Generative AI

Misha Padidar, Teresa Huang, Andrew Giuliani, Marina Spivak

Main category: cs.LG

TL;DR: A machine learning approach using conditional diffusion models to rapidly generate high-quality stellarator designs with desirable characteristics, achieving less than 5% deviation from target parameters.

DetailsMotivation: Traditional stellarator design is time-consuming (hours on computing clusters), and machine learning approaches using large datasets of optimized stellarators offer potential for rapid design generation.

Method: Train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with specific characteristics (aspect ratio and mean rotational transform). The model is applied to design stellarators with characteristics not seen during training.

Result: Many generated stellarators show solid performance with less than 5% deviation from quasisymmetry and target characteristics. The modest deviation suggests potential to reach the sub 1% target.

Conclusion: The study presents a promising machine learning approach for rapid stellarator design and identifies multiple avenues for generative modeling to advance fusion research.

Abstract: Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs which have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub 1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.

[351] CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

Main category: cs.LG

TL;DR: CUDA-L2 combines LLMs and RL to automatically optimize HGEMM CUDA kernels, outperforming major baselines including torch.matmul, cuBLAS, and cuBLASLt by up to 28.7% in server mode.

DetailsMotivation: Even heavily-optimized performance-critical kernels like HGEMM can be further improved through automated exploration of configuration spaces at scales impractical for human optimization.

Method: Combines large language models (LLMs) and reinforcement learning (RL) with CUDA execution speed as the RL reward to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels across 1,000 configurations.
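
The reward signal is simply measured kernel speed; below is a sketch of how such a reward could be computed with CUDA events in PyTorch (the candidate is a stand-in for an LLM-generated kernel, and the reward shaping is an assumption, not the paper's exact formula).

```python
import torch

def time_kernel(fn, warmup=10, iters=100):
    """Average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def rl_reward(candidate_fn, baseline_fn):
    """Relative speedup of the generated kernel over a reference such as
    cuBLAS (reached here through torch.matmul); positive means faster."""
    return time_kernel(baseline_fn) / time_kernel(candidate_fn) - 1.0

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
    b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
    baseline = lambda: a @ b
    candidate = lambda: a @ b   # stand-in for an LLM-generated HGEMM kernel
    print("reward:", rl_reward(candidate, baseline))
```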

Result: Outperforms major matmul baselines: +22.0% over torch.matmul, +19.2% over cuBLAS, +16.8% over cuBLASLt-heuristic, +11.4% over cuBLASLt-AutoTuning in offline mode. Speedups increase to +28.7%, +26.0%, +22.4%, +15.9% respectively in server mode.

Conclusion: LLM-guided RL automation can systematically explore configuration spaces at scales impractical for humans, demonstrating that even the most performance-critical, heavily-optimized kernels like HGEMM can be further improved.

Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art Nvidia’s closed-source libraries, i.e., cuBLAS, cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries cuBLASLt library and selects the algorithm based on the heuristic’s suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt’s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2

[352] ASPEN: An Adaptive Spectral Physics-Enabled Network for Ginzburg-Landau Dynamics

Julian Evan Chrisnanto, Nurfauzi Fadillah, Yulison Herry Chrisnanto

Main category: cs.LG

TL;DR: ASPEN introduces adaptive spectral layers to overcome PINNs’ spectral bias, successfully solving stiff nonlinear PDEs like the Ginzburg-Landau equation where standard PINNs fail.

DetailsMotivation: Standard PINNs struggle with stiff, multi-scale, and nonlinear PDEs due to spectral bias in MLP architectures, which prevents adequate representation of high-frequency components needed for complex physical systems.

Method: ASPEN integrates an adaptive spectral layer with learnable Fourier features at the network’s input stage, allowing dynamic tuning of the spectral basis during training to efficiently learn required frequency content.
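
The core architectural idea fits in a short PyTorch sketch: a Fourier-feature input layer whose frequency matrix is a trainable parameter (layer and dimension names here are illustrative).

```python
import torch
import torch.nn as nn

class AdaptiveSpectralLayer(nn.Module):
    """Learnable Fourier-feature input layer.

    The frequency matrix B is trained with the rest of the network, so the
    model can tune its own spectral basis instead of being limited by the
    low-frequency bias of a plain MLP."""

    def __init__(self, in_dim, n_features, init_scale=10.0):
        super().__init__()
        self.B = nn.Parameter(init_scale * torch.randn(in_dim, n_features))

    def forward(self, x):                        # x: (batch, in_dim)
        proj = 2 * torch.pi * x @ self.B         # (batch, n_features)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# Typical PINN-style usage: spectral features feed a small MLP that outputs
# the real and imaginary parts of the Ginzburg-Landau field.
net = nn.Sequential(AdaptiveSpectralLayer(2, 64),
                    nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 2))
xt = torch.rand(16, 2)                  # (x, t) collocation points
print(net(xt).shape)                    # torch.Size([16, 2])
```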

Result: ASPEN successfully solves the complex Ginzburg-Landau equation with exceptional accuracy (median physics residual: 5.10×10⁻³), while standard PINNs catastrophically fail. The solution captures emergent physical properties like free energy relaxation and domain wall stability.

Conclusion: Incorporating adaptive spectral basis enables robust, physically-consistent solvers for complex dynamical systems where standard PINNs fail, opening new possibilities for machine learning in challenging physical domains.

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful, mesh-free paradigm for solving partial differential equations (PDEs). However, they notoriously struggle with stiff, multi-scale, and nonlinear systems due to the inherent spectral bias of standard multilayer perceptron (MLP) architectures, which prevents them from adequately representing high-frequency components. In this work, we introduce the Adaptive Spectral Physics-Enabled Network (ASPEN), a novel architecture designed to overcome this critical limitation. ASPEN integrates an adaptive spectral layer with learnable Fourier features directly into the network’s input stage. This mechanism allows the model to dynamically tune its own spectral basis during training, enabling it to efficiently learn and represent the precise frequency content required by the solution. We demonstrate the efficacy of ASPEN by applying it to the complex Ginzburg-Landau equation (CGLE), a canonical and challenging benchmark for nonlinear, stiff spatio-temporal dynamics. Our results show that a standard PINN architecture catastrophically fails on this problem, diverging into non-physical oscillations. In contrast, ASPEN successfully solves the CGLE with exceptional accuracy. The predicted solution is visually indistinguishable from the high-resolution ground truth, achieving a low median physics residual of 5.10 x 10^-3. Furthermore, we validate that ASPEN’s solution is not only pointwise accurate but also physically consistent, correctly capturing emergent physical properties, including the rapid free energy relaxation and the long-term stability of the domain wall front. This work demonstrates that by incorporating an adaptive spectral basis, our framework provides a robust and physically-consistent solver for complex dynamical systems where standard PINNs fail, opening new options for machine learning in challenging physical domains.

[353] Advancing physiological time series reconstruction and imputation via mixture of receptive fields and experts fusion

Ci Zhang, Huayu Li, Changdi Yang, Jiangnan Xia, Yanzhi Wang, Xiaolong Ma, Jin Lu, Ao Li, Geng Yuan

Main category: cs.LG

TL;DR: A novel Mixture of Experts (MoE)-based diffusion framework for medical time series reconstruction that uses RFAMoE for adaptive receptive fields and Fusion MoE for parallel noise generation, achieving SOTA performance with single-inference efficiency.

DetailsMotivation: Diffusion models show promise for time series reconstruction but remain unexplored in medical domains. Medical physiological signals have unique challenges: multivariate, high temporal variability, noisy, and artifact-prone, making deep learning approaches difficult for tasks like imputation.

Method: Proposes a MoE-based noise estimator within a score-based diffusion framework. Two key components: 1) RFAMoE module enables each channel to adaptively select desired receptive fields throughout diffusion process, 2) Fusion MoE module leverages MoE nature to generate K noise signals in parallel, fuse them using routing mechanism, and complete reconstruction in single inference step.
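
The Fusion MoE idea, K parallel noise estimates fused by a router in one forward pass, can be sketched as follows in PyTorch (linear experts stand in for the paper's full noise-estimator blocks).

```python
import torch
import torch.nn as nn

class FusionMoE(nn.Module):
    """K expert heads predict K noise estimates in parallel; a router fuses
    them, replacing the 'run inference K times and average' recipe with a
    single inference step."""

    def __init__(self, dim, k_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(k_experts))
        self.router = nn.Linear(dim, k_experts)

    def forward(self, h):                                  # h: (batch, time, dim)
        noises = torch.stack([e(h) for e in self.experts], dim=-1)   # (b, t, d, K)
        weights = torch.softmax(self.router(h), dim=-1)              # (b, t, K)
        return (noises * weights.unsqueeze(2)).sum(dim=-1)           # fused estimate

moe = FusionMoE(dim=32)
print(moe(torch.randn(8, 100, 32)).shape)   # torch.Size([8, 100, 32])
```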

Result: Extensive results demonstrate the framework consistently outperforms diffusion-based SOTA works on different tasks and datasets. The approach not only improves performance but also eliminates substantial computational cost and latency associated with multiple inference processes.

Conclusion: The proposed MoE-based diffusion framework effectively addresses challenges in medical time series reconstruction, achieving superior performance while maintaining computational efficiency through innovative single-inference parallel noise generation.

Abstract: Recent studies show that using diffusion models for time series signal reconstruction holds great promise. However, such approaches remain largely unexplored in the domain of medical time series. The unique characteristics of the physiological time series signals, such as multivariate, high temporal variability, highly noisy, and artifact-prone, make deep learning-based approaches still challenging for tasks such as imputation. Hence, we propose a novel Mixture of Experts (MoE)-based noise estimator within a score-based diffusion framework. Specifically, the Receptive Field Adaptive MoE (RFAMoE) module is designed to enable each channel to adaptively select desired receptive fields throughout the diffusion process. Moreover, recent literature has found that when generating a physiological signal, performing multiple inferences and averaging the reconstructed signals can effectively reduce reconstruction errors, but at the cost of significant computational and latency overhead. We design a Fusion MoE module and innovatively leverage the nature of MoE module to generate K noise signals in parallel, fuse them using a routing mechanism, and complete signal reconstruction in a single inference step. This design not only improves performance over previous methods but also eliminates the substantial computational cost and latency associated with multiple inference processes. Extensive results demonstrate that our proposed framework consistently outperforms diffusion-based SOTA works on different tasks and datasets.

[354] RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting

Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, Depei Qian

Main category: cs.LG

TL;DR: RLHFSpec accelerates RLHF generation stage using speculative decoding with workload-aware drafting strategy selection and sample reallocation, achieving higher throughput and overall RLHF speedup.

DetailsMotivation: RLHF execution comprises three stages (generation, inference, training), and the generation stage is its bottleneck. Optimizing this bottleneck is crucial for improving overall RLHF performance.

Method: Proposes RLHFSpec system that integrates speculative decoding into RLHF generation stage. Features: 1) workload-aware drafting strategy selection mechanism that chooses near-optimal strategy considering verification cost and accepted tokens, 2) sample reallocation to fully utilize GPU resources with efficient sample migration mechanism.
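
The selection mechanism amounts to scoring each drafting strategy by its expected throughput under the current workload; a toy Python sketch of that trade-off is below (the profiling interface and numbers are hypothetical).

```python
def select_drafting_strategy(profile):
    """Pick the drafting strategy with the best expected token throughput.

    profile: maps strategy name -> (expected_accepted_tokens, draft_cost_ms,
    verify_cost_ms), measured under the current generation workload. The
    ratio jointly weighs verification cost against accepted tokens, as the
    paper's selector does (exact scoring rule assumed here).
    """
    def throughput(name):
        accepted, draft_ms, verify_ms = profile[name]
        return accepted / (draft_ms + verify_ms)
    return max(profile, key=throughput)

profile = {
    "draft_2_tokens": (1.8, 0.6, 1.0),   # conservative: little wasted work
    "draft_8_tokens": (4.8, 1.5, 2.2),   # aggressive: higher verification cost
}
print(select_drafting_strategy(profile))   # -> draft_8_tokens (4.8/3.7 > 1.8/1.6)
```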

Result: RLHFSpec achieves higher throughput in the generation stage compared to state-of-the-art works. Also shows significant performance speedup in entire RLHF execution due to effective alleviation of the generation bottleneck.

Conclusion: RLHFSpec successfully optimizes the RLHF generation bottleneck through speculative decoding with adaptive strategy selection and resource optimization, demonstrating substantial performance improvements in both generation stage and overall RLHF pipeline.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we realize the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with efficient speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that the RLHFSpec can achieve higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.

[355] Empowering GNNs for Domain Adaptation via Denoising Target Graph

Haiyang Yu, Meng-Chieh Lee, Xiang Song, Qi Zhu, Christos Faloutsos

Main category: cs.LG

TL;DR: GraphDeT framework improves GNN generalization in graph domain adaptation by adding an auxiliary edge denoising loss on target graphs, which tightens the generalization bound and enhances performance on time/regional domain shifts.

DetailsMotivation: Graph domain adaptation faces challenges with structure domain shifts when graphs are collected at different times or from varying areas, causing poor GNN performance on target graphs. The authors discovered that simple edge denoising on target graphs can significantly improve GNN generalization.

Method: Proposed GraphDeT framework integrates an auxiliary loss function for denoising graph edges on target graphs into GNN training for node classification under domain adaptation. The auxiliary edge task is theoretically connected to a distribution-distance-based graph generalization bound, showing that it imposes a constraint which tightens the bound.
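
A sketch of the combined objective in PyTorch is shown below; the dot-product edge scorer and the negative-sampling interface are illustrative choices, not necessarily GraphDeT's.

```python
import torch
import torch.nn.functional as F

def graphdet_style_loss(logits_src, y_src, z_tgt, pos_pairs, neg_pairs, lam=0.5):
    """Source-graph node classification plus auxiliary edge denoising on the
    target graph.

    z_tgt: target node embeddings from the GNN; pos_pairs / neg_pairs:
    (2, m) index tensors for observed target edges and sampled non-edges.
    """
    cls = F.cross_entropy(logits_src, y_src)
    score = lambda pairs: (z_tgt[pairs[0]] * z_tgt[pairs[1]]).sum(-1)
    edge = F.binary_cross_entropy_with_logits(
        torch.cat([score(pos_pairs), score(neg_pairs)]),
        torch.cat([torch.ones(pos_pairs.shape[1]),
                   torch.zeros(neg_pairs.shape[1])]))
    return cls + lam * edge

# Toy shapes: 10 source nodes / 3 classes, 12 target nodes, 5 pos + 5 neg pairs.
loss = graphdet_style_loss(torch.randn(10, 3), torch.randint(0, 3, (10,)),
                           torch.randn(12, 16),
                           torch.randint(0, 12, (2, 5)),
                           torch.randint(0, 12, (2, 5)))
print(loss.item())
```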

Result: Experimental results demonstrate superior performance compared to existing baselines in handling both time and regional domain graph shifts, showing the effectiveness of the simple edge denoising approach.

Conclusion: A simple auxiliary edge denoising task on target graphs can significantly enhance GNN generalization in graph domain adaptation scenarios, with theoretical justification and empirical validation across different types of domain shifts.

Abstract: We explore the node classification task in the context of graph domain adaptation, which uses both source and target graph structures along with source labels to enhance the generalization capabilities of Graph Neural Networks (GNNs) on target graphs. Structure domain shifts frequently occur, especially when graph data are collected at different times or from varying areas, resulting in poor performance of GNNs on target graphs. Surprisingly, we find that simply incorporating an auxiliary loss function for denoising graph edges on target graphs can be extremely effective in enhancing GNN performance on target graphs. Based on this insight, we propose GraphDeT, a framework that integrates this auxiliary edge task into GNN training for node classification under domain adaptation. Our theoretical analysis connects this auxiliary edge task to a distribution-distance-based graph generalization bound, demonstrating that such an auxiliary task can impose a constraint which tightens the bound and thereby improves generalization. The experimental results demonstrate superior performance compared to the existing baselines in handling both time and regional domain graph shifts.

[356] Small-Gain Nash: Certified Contraction to Nash Equilibria in Differentiable Games

Vedansh Sharma

Main category: cs.LG

TL;DR: SGN introduces a block small-gain condition and custom block-weighted geometry to certify convergence in non-monotone games where Euclidean monotonicity fails.

DetailsMotivation: Classical convergence guarantees require pseudo-gradient monotonicity in Euclidean geometry, which often fails in games with strong cross-player couplings, limiting analysis of many practical game scenarios.

Method: Introduces Small-Gain Nash (SGN), a block small-gain condition that converts local curvature and cross-player Lipschitz bounds into contraction certificates. Constructs weighted block metrics where pseudo-gradient becomes strongly monotone even when non-monotone in Euclidean sense.
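
The summary does not state the certificate itself, but weighted block small-gain conditions have a standard shape in contraction analysis; the following LaTeX sketch shows that shape, with per-player curvature lower bounds and cross-player Lipschitz bounds as inputs (the paper's precise condition and metric construction may differ).

```latex
% Sketch of a weighted block small-gain condition (standard form in
% contraction analysis; the paper's exact certificate may differ).
% Given curvature lower bounds \mu_i for each player's own block and
% Lipschitz bounds L_{ij} on the cross-player couplings, ask for
% positive block weights w_1, \dots, w_n such that
\[
  w_i \,\mu_i \;>\; \sum_{j \neq i} w_j \, L_{ij}
  \qquad \text{for every player } i .
\]
% When such weights exist, the pseudo-gradient is strongly monotone in the
% weighted block norm
\[
  \lVert z \rVert_w^2 \;=\; \sum_i w_i \,\lVert z_i \rVert^2 ,
\]
% and the game's gradient flow contracts exponentially in that metric.
```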

Result: Continuous flow is exponentially contracting in designed geometry; projected Euler and RK4 discretizations converge under explicit step-size bounds. Framework successfully certifies convergence in quadratic games where Euclidean analysis fails, and extends to mirror/Fisher geometries for entropy-regularized policy gradient in Markov games.

Conclusion: Provides offline certification pipeline that estimates parameters, optimizes block weights, and returns structural convergence certificates (metric, contraction rate, safe step-sizes) for non-monotone games, offering a TTUR-like “timescale band” without requiring asymptotic timescale separation.

Abstract: Classical convergence guarantees for gradient-based learning in games require the pseudo-gradient to be (strongly) monotone in Euclidean geometry as shown by Rosen (1965), a condition that often fails even in simple games with strong cross-player couplings. We introduce Small-Gain Nash (SGN), a block small-gain condition in a custom block-weighted geometry. SGN converts local curvature and cross-player Lipschitz coupling bounds into a tractable certificate of contraction. It constructs a weighted block metric in which the pseudo-gradient becomes strongly monotone on any region where these bounds hold, even when it is non-monotone in the Euclidean sense. The continuous flow is exponentially contracting in this designed geometry, and projected Euler and RK4 discretizations converge under explicit step-size bounds derived from the SGN margin and a local Lipschitz constant. Our analysis reveals a certified “timescale band”, a non-asymptotic, metric-based certificate that plays a TTUR-like role: rather than forcing asymptotic timescale separation via vanishing, unequal step sizes, SGN identifies a finite band of relative metric weights for which a single-step-size dynamics is provably contractive. We validate the framework on quadratic games where Euclidean monotonicity analysis fails to predict convergence, but SGN successfully certifies it, and extend the construction to mirror/Fisher geometries for entropy-regularized policy gradient in Markov games. The result is an offline certification pipeline that estimates curvature, coupling, and Lipschitz parameters on compact regions, optimizes block weights to enlarge the SGN margin, and returns a structural, computable convergence certificate consisting of a metric, contraction rate, and safe step-sizes for non-monotone games.

[357] A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown

Main category: cs.LG

TL;DR: Novel weighted sampling algorithm for multi-label datasets that accounts for label dependencies using multivariate Bernoulli distribution to create balanced samples while preserving frequency order and reducing imbalance.

DetailsMotivation: Multi-label datasets often have imbalanced label frequencies and label dependencies, making it challenging to obtain representative samples that include sufficient observations of scarcer labels for reliable inference.

Method: Proposes a sampling algorithm using multivariate Bernoulli distribution to model multi-label data. Estimates distribution parameters from observed label frequencies, calculates weights for each label combination, and performs weighted sampling that accounts for label dependencies while achieving target distribution characteristics.
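
A compact Python sketch of the weighting-and-sampling step follows; the target distribution interface is a simplification, since the paper derives target combination probabilities from fitted multivariate Bernoulli parameters rather than taking them as given.

```python
from collections import Counter
import random

def combination_weights(label_sets, target_probs):
    """Weight each observed label combination by how under-represented it is
    relative to the target distribution over combinations."""
    observed = Counter(label_sets)
    n = len(label_sets)
    return {c: target_probs[c] / (observed[c] / n) for c in observed}

def weighted_sample(label_sets, weights, k, seed=0):
    """Draw k observation indices with probability proportional to the
    weight of each observation's label combination."""
    rng = random.Random(seed)
    w = [weights[c] for c in label_sets]
    return rng.choices(range(len(label_sets)), weights=w, k=k)

data = [frozenset({"A"}), frozenset({"A"}), frozenset({"A", "B"}), frozenset({"B"})]
target = {frozenset({"A"}): 0.4, frozenset({"A", "B"}): 0.3, frozenset({"B"}): 0.3}
w = combination_weights(data, target)       # {A}: 0.8, {A,B}: 1.2, {B}: 1.2
print(weighted_sample(data, w, k=3))
```

Weighting whole label combinations, rather than individual labels, is what lets the procedure respect label dependencies while boosting scarce combinations.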

Result: Applied to Web of Science biomedical research articles with 64 topic categories. Successfully created more balanced sub-samples that preserved category frequency order, reduced frequency differences between most and least common categories, and accounted for category dependencies, enhancing representation of minority categories.

Conclusion: The proposed approach effectively addresses multi-label sampling challenges by incorporating label dependencies through multivariate Bernoulli modeling, producing balanced samples that improve minority category representation while maintaining important distribution characteristics.

Abstract: Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculate weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.

[358] Learning Unmasking Policies for Diffusion Language Models

Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, João Monteiro, Victor Turrisi, Jason Ramapuram, Marco Cuturi

Main category: cs.LG

TL;DR: Training RL-based sampling policies for masked diffusion language models outperforms heuristic methods in full diffusion settings and shows transferability across models and sequence lengths.

DetailsMotivation: Current heuristic sampling strategies for masked discrete diffusion language models require manual tuning, degrade with larger buffer sizes, and have limitations in balancing quality and efficiency. There's a need for learned sampling procedures that can automatically optimize the accuracy-efficiency trade-off.

Method: Formalize masked diffusion sampling as a Markov decision process where the dLLM serves as the environment. Train lightweight policy networks (single-layer transformers) using reinforcement learning to map dLLM token confidences to unmasking decisions, replacing heuristic strategies.
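
A PyTorch sketch of such a policy network is below; the exact input featurization and action parameterization are assumptions in the spirit of the description.

```python
import torch
import torch.nn as nn

class UnmaskingPolicy(nn.Module):
    """Single-transformer-layer policy mapping per-position dLLM token
    confidences to unmasking probabilities."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)        # scalar confidence -> vector
        self.encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, confidences):               # (batch, buffer_len)
        h = self.encoder(self.embed(confidences.unsqueeze(-1)))
        return torch.sigmoid(self.head(h)).squeeze(-1)   # unmask probabilities

policy = UnmaskingPolicy()
conf = torch.rand(2, 32)                   # confidences for 32 masked positions
probs = policy(conf)
action = torch.bernoulli(probs).bool()     # stochastic unmasking during RL
print(probs.shape, int(action.sum()), "positions unmasked")
```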

Result: Trained RL policies match state-of-the-art heuristic performance with semi-autoregressive generation and outperform them in full diffusion settings. Policies show transferability to new dLLMs and longer sequences, but degrade on out-of-domain data and have challenges with fine-grained accuracy-efficiency tuning.

Conclusion: Reinforcement learning offers a promising alternative to heuristic sampling for masked diffusion language models, providing competitive performance and transferability, though challenges remain with domain adaptation and precise trade-off control.

Abstract: Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model’s vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach.

[359] Decoupled Q-Chunking

Qiyang Li, Seohong Park, Sergey Levine

Main category: cs.LG

TL;DR: Proposes decoupling critic and policy chunk lengths to address bootstrapping bias in TD methods, using distilled critics for partial action chunks to maintain reactivity while benefiting from multi-step value propagation.

DetailsMotivation: TD methods suffer from bootstrapping bias where errors accumulate across steps. While chunked critics speed up value backup by estimating values for action sequences, they force policies to output entire chunks open-loop, which is sub-optimal for reactive environments and challenging to model for long chunks.

Method: Decouples critic chunk length from policy chunk length. Optimizes policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate maximum value achievable when partial chunks are extended to complete ones.
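
The distillation target can be sketched as a sample-based optimistic backup; everything below (the sampler interface, the placeholder critic) is illustrative scaffolding around that one idea.

```python
import torch

def distilled_partial_target(q_chunk, state, partial_chunk, completion_sampler,
                             n_samples=8):
    """Optimistic target for the partial-chunk critic.

    Approximates max over completions of Q_chunk(s, [partial; completion]) by
    sampling candidate completions and taking the best, i.e. the value the
    partial chunk could achieve if extended optimally to a full chunk.
    """
    completions = completion_sampler(state, partial_chunk, n_samples)
    partial = partial_chunk.unsqueeze(0).expand(n_samples, -1, -1)
    full_chunks = torch.cat([partial, completions], dim=1)   # (n, h, act_dim)
    values = q_chunk(state.unsqueeze(0).expand(n_samples, -1), full_chunks)
    return values.max()

# Toy usage: full chunk length h=4, policy chunk length k=2, 3-dim actions.
q_chunk = lambda s, a: -(a ** 2).sum(dim=(1, 2))             # placeholder critic
sampler = lambda s, p, n: torch.randn(n, 4 - p.shape[0], 3)  # stand-in prior
print(distilled_partial_target(q_chunk, torch.zeros(5), torch.zeros(2, 3), sampler))
```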

Result: Method reliably outperforms prior methods on challenging, long-horizon offline goal-conditioned tasks, retaining benefits of multi-step value propagation while avoiding open-loop sub-optimality and difficulties of learning long action chunking policies.

Conclusion: The proposed decoupling approach successfully addresses limitations of chunked critics by allowing policies to operate over shorter action chunks while maintaining the efficiency benefits of multi-step value estimation through distilled critics.

Abstract: Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences (“chunks”) rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal for environments that require policy reactivity and also challenging to model especially when the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.

[360] Text2Graph: Combining Lightweight LLMs and GNNs for Efficient Text Classification in Label-Scarce Scenarios

João Lucas Luz Lima Sarcinelli, Ricardo Marcondes Marcacini

Main category: cs.LG

TL;DR: Text2Graph is an open-source Python package that combines LLM-based partial annotation with GNN label propagation for sustainable, energy-efficient zero-shot text classification.

DetailsMotivation: LLMs are effective zero-shot classifiers but have high computational requirements and environmental costs that limit their practicality for large-scale annotation in HPC environments. There's a need for more sustainable workflows.

Method: Text2Graph provides modular implementation of text-to-graph classification approaches, combining LLM-based partial annotation with Graph Neural Network label propagation. The framework allows flexible swapping of components like feature extractors, edge construction methods, and sampling strategies.
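
The propagation half of the pipeline can be illustrated with a dependency-free stand-in; Text2Graph itself uses a GNN, but classic normalized-adjacency label spreading shows the same mechanics of pushing a few LLM-provided labels through a text-similarity graph.

```python
import numpy as np

def propagate_labels(adjacency, seed_labels, n_classes, iters=50, alpha=0.9):
    """Diffuse a handful of seed labels over a similarity graph.

    adjacency: (n, n) symmetric weight matrix, e.g. kNN over text embeddings.
    seed_labels: dict node -> class for the few nodes the LLM annotated.
    """
    n = adjacency.shape[0]
    P = adjacency / (adjacency.sum(axis=1, keepdims=True) + 1e-8)
    Y = np.zeros((n, n_classes))
    for node, cls in seed_labels.items():
        Y[node, cls] = 1.0
    F = Y.copy()
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * Y   # diffuse, keep seeds anchored
    return F.argmax(axis=1)

# Two clusters of documents; the LLM labeled one node in each.
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]],
             dtype=float)
print(propagate_labels(A, seed_labels={0: 0, 3: 1}, n_classes=2))  # [0 0 0 1 1 1]
```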

Result: Benchmarked on five datasets spanning topic classification and sentiment analysis tasks, graph-based propagation achieves competitive results at a fraction of the energy and environmental cost compared to other zero-shot approaches.

Conclusion: Text2Graph enables sustainable, energy-efficient zero-shot text classification by combining LLM partial annotation with GNN label propagation, offering competitive performance with significantly reduced computational and environmental costs.

Abstract: Large Language Models (LLMs) have become effective zero-shot classifiers, but their high computational requirements and environmental costs limit their practicality for large-scale annotation in high-performance computing (HPC) environments. To support more sustainable workflows, we present Text2Graph, an open-source Python package that provides a modular implementation of existing text-to-graph classification approaches. The framework enables users to combine LLM-based partial annotation with Graph Neural Network (GNN) label propagation in a flexible manner, making it straightforward to swap components such as feature extractors, edge construction methods, and sampling strategies. We benchmark Text2Graph on a zero-shot setting using five datasets spanning topic classification and sentiment analysis tasks, comparing multiple variants against other zero-shot approaches for text classification. In addition to reporting performance, we provide detailed estimates of energy consumption and carbon emissions, showing that graph-based propagation achieves competitive results at a fraction of the energy and environmental cost.

cs.MA

[361] Multi-Objective Reinforcement Learning for Large-Scale Mixed Traffic Control

Iftekharul Islam, Weizi Li

Main category: cs.MA

TL;DR: Hierarchical framework combining multi-objective RL for intersection control with strategic routing improves fairness, safety, and efficiency in mixed traffic, reducing wait times by 53%, starvation by 86%, and conflicts by 86%.

DetailsMotivation: Existing mixed traffic control approaches optimize efficiency and safety but lack fairness mechanisms, leading to systematic starvation of vehicles on low-demand approaches. There's a need for equitable service across all traffic streams in mixed-autonomy environments.

Method: Hierarchical framework with multi-objective reinforcement learning for local intersection control and strategic routing for network-level coordination. Introduces Conflict Threat Vector for proactive risk signals and queue parity penalty for equitable service.
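
A toy sketch of such a multi-objective reward is below; the weights and functional forms (standard deviation for parity, max for threat) are illustrative guesses at the shape of the objective, not the paper's formulation.

```python
import numpy as np

def intersection_reward(wait_times, queue_lengths, threat_vector,
                        w_eff=1.0, w_fair=0.5, w_safe=2.0):
    """Combine efficiency, fairness, and safety into one scalar reward.

    Efficiency: negative mean wait. Fairness: a queue parity penalty that
    grows when approaches are served unevenly. Safety: penalize the worst
    entry of a per-conflict-point threat signal.
    """
    efficiency = -np.mean(wait_times)
    parity = -np.std(queue_lengths)     # equal queues -> small penalty
    safety = -np.max(threat_vector)     # react to the single worst threat
    return w_eff * efficiency + w_fair * parity + w_safe * safety

print(intersection_reward(wait_times=[12.0, 30.0, 8.0, 9.0],
                          queue_lengths=[3, 11, 2, 2],
                          threat_vector=[0.1, 0.7, 0.0]))
```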

Result: Up to 53% reduction in average wait time, 86% reduction in maximum starvation, 86% reduction in conflict rate compared to baselines, while maintaining fuel efficiency. Strategic routing effectiveness scales with robot vehicle penetration rates.

Conclusion: Multi-objective optimization through curated reward functions paired with strategic robot vehicle routing yields significant benefits in fairness and safety metrics critical for equitable mixed-autonomy deployment.

Abstract: Effective mixed traffic control requires balancing efficiency, fairness, and safety. Existing approaches excel at optimizing efficiency and enforcing safety constraints but lack mechanisms to ensure equitable service, resulting in systematic starvation of vehicles on low-demand approaches. We propose a hierarchical framework combining multi-objective reinforcement learning for local intersection control with strategic routing for network-level coordination. Our approach introduces a Conflict Threat Vector that provides agents with explicit risk signals for proactive conflict avoidance, and a queue parity penalty that ensures equitable service across all traffic streams. Extensive experiments on a real-world network across different robot vehicle (RV) penetration rates demonstrate substantial improvements: up to 53% reductions in average wait time, up to 86% reductions in maximum starvation, and up to 86% reduction in conflict rate compared to baselines, while maintaining fuel efficiency. Our analysis reveals that strategic routing effectiveness scales with RV penetration, becoming increasingly valuable at higher autonomy levels. The results demonstrate that multi-objective optimization through well-curated reward functions paired with strategic RV routing yields significant benefits in fairness and safety metrics critical for equitable mixed-autonomy deployment.

[362] Evaluating Cooperative Resilience in Multiagent Systems: A Comparison Between Humans and LLMs

Manuela Chacon-Chamorro, Juan Sebastián Pinzón, Rubén Manrique, Luis Felipe Giraldo, Nicanor Quijano

Main category: cs.MA

TL;DR: Comparative analysis shows human groups with communication achieve highest cooperative resilience in social dilemmas, outperforming LLM agents even when they communicate. Human decision-making under adversity can inform design of more prosocial AI agents.

DetailsMotivation: To systematically compare cooperative resilience between human groups and LLM-based agents in mixed-motive social dilemmas, establishing a benchmark for evaluating agent architectures and interaction modalities, and understanding how human decision-making under adverse conditions can inform AI agent design.

Method: Used Tragedy of the Commons environment from Melting Pot suite with mixed-motive social dilemmas. Compared human groups vs LLM-based agents, each evaluated with/without explicit communication. Assessed cooperative resilience under continuous disruption (unsustainable consumption bot) plus intermittent environmental shocks (stochastic resource removal). Also examined long-horizon setting with harsher conditions.

Result: Human groups with communication achieved highest cooperative resilience compared to all other groups. Communication improved LLM agent resilience but performance remained below human levels. In long-horizon harsh conditions, humans sustained shared resources and maintained high resilience across diverse disruption scenarios.

Conclusion: Human decision-making under adverse social conditions provides valuable insights for designing artificial agents that exhibit more prosocial and resilient behaviors, suggesting that current LLM agents still lag behind human cooperative capabilities in challenging social dilemmas.

Abstract: This paper presents a comparative analysis of cooperative resilience in multi-agent systems, defined as the ability to anticipate, resist, recover from, and transform to disruptive events that affect collective well-being. We focus on mixed-motive social dilemmas instantiated as a Tragedy of the Commons environment from the Melting Pot suite, where we systematically compare human groups and Large Language Model (LLM)-based agents, each evaluated with and without explicit communication. Cooperative resilience is assessed under a continuously disruptive condition induced by a persistent unsustainable consumption bot, together with intermittent environmental shocks implemented as stochastic removal of shared resources across scenarios. This experimental design establishes a benchmark for cooperative resilience across agent architectures and interaction modalities, constituting a key step toward systematically comparing humans and LLM-based agents. Using this framework, we find that human groups with communication achieve the highest cooperative resilience compared to all other groups. Communication also improves the resilience of LLM agents, but their performance remains below human levels. Motivated by the performance of humans, we further examine a long-horizon setting with harsher environmental conditions, where humans sustain the shared resource and maintain high resilience in diverse disruption scenarios. Together, these results suggest that human decision-making under adverse social conditions can inform the design of artificial agents that promote prosocial and resilient behaviors.

[363] CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen

Main category: cs.MA

TL;DR: CREW-Wildfire is a new benchmark for evaluating LLM-based multi-agent systems in complex wildfire response scenarios, addressing limitations of existing benchmarks by featuring large-scale, partially observable environments with heterogeneous agents and long-horizon planning.

DetailsMotivation: Current benchmarks for LLM-based multi-agent systems are inadequate for evaluating scalability, robustness, and coordination in complex real-world tasks. Existing environments focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing next-generation multi-agent Agentic AI frameworks.

Method: Built on the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios with large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules.

Result: Evaluation of state-of-the-art LLM-based multi-agent frameworks reveals significant performance gaps, highlighting unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty.

Conclusion: CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence by providing realistic complexity, scalable architecture, and behavioral evaluation metrics. All code, environments, data, and baselines will be released to support future research.

Abstract: Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.

[364] Osprey: Production-Ready Agentic AI for Safety-Critical Control Systems

Thorsten Hellert, João Montenegro, Antonin Sulc

Main category: cs.MA

TL;DR: Osprey is a framework for deploying agentic AI in safety-critical facility operations, featuring plan-first orchestration, coordination layer, dynamic tool selection, and connector abstractions for production use.

DetailsMotivation: Large-scale scientific facilities need to coordinate diverse subsystems, translate operator intent into hardware actions, and maintain safety oversight. Language model-driven agents offer a natural interface but existing approaches lack reliability and safety for production use.

Method: Osprey addresses challenges through: 1) Plan-first orchestrator generating complete execution plans for human review, 2) Coordination layer managing data flows and consistency, 3) Dynamic tool classifier for compact prompts, 4) Connector abstractions and deployment patterns across control systems.
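
A minimal sketch of the plan-first pattern: the complete plan, with dependencies, passes through a human-approval gate before any tool touches hardware. The `PlanStep`, `tools`, and `approve` names are illustrative stand-ins, not Osprey's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    tool: str                      # tool name chosen by the classifier
    args: dict
    depends_on: list = field(default_factory=list)

def execute_plan(steps, tools, approve):
    """Plan-first orchestration sketch: nothing runs without sign-off,
    and each step only runs after its dependencies have completed."""
    if not approve(steps):                 # human review gate
        return None
    results = {}
    for i, step in enumerate(steps):
        assert all(d in results for d in step.depends_on)
        results[i] = tools[step.tool](**step.args)
    return results

# Example with stub tools standing in for control-system calls.
tools = {"read_channel": lambda name: 42.0,
         "set_channel": lambda name, value: True}
plan = [PlanStep("read_channel", {"name": "BPM:01"}),
        PlanStep("set_channel", {"name": "MAG:01", "value": 1.2},
                 depends_on=[0])]
print(execute_plan(plan, tools, approve=lambda p: True))
```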

Result: Demonstrated through two case studies: control-assistant tutorial showing semantic channel mapping and historical data integration, and production deployment at Advanced Light Source managing real-time operations across hundreds of thousands of control channels.

Conclusion: The case studies establish Osprey as a production-ready framework for deploying agentic AI in complex, safety-critical environments, addressing reliability and safety concerns in facility operations.

Abstract: Operating large-scale scientific facilities requires coordinating diverse subsystems, translating operator intent into precise hardware actions, and maintaining strict safety oversight. Language model-driven agents offer a natural interface for these tasks, but most existing approaches are not yet reliable or safe enough for production use. In this paper, we introduce Osprey, a framework for using agentic AI in large, safety-critical facility operations. Osprey is built around the needs of control rooms and addresses these challenges in four ways. First, it uses a plan-first orchestrator that generates complete execution plans, including all dependencies, for human review before any hardware is touched. Second, a coordination layer manages complex data flows, keeps data types consistent, and automatically downsamples large datasets when needed. Third, a classifier dynamically selects only the tools required for a given task, keeping prompts compact as facilities add capabilities. Fourth, connector abstractions and deployment patterns work across different control systems and are ready for day-to-day use. We demonstrate the framework through two case studies: a control-assistant tutorial showing semantic channel mapping and historical data integration, and a production deployment at the Advanced Light Source, where Osprey manages real-time operations across hundreds of thousands of control channels. These results establish Osprey as a production-ready framework for deploying agentic AI in complex, safety-critical environments.

[365] The Emergence of Complex Behavior in Large-Scale Ecological Environments

Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley, Aaron Walsman

Main category: cs.MA

TL;DR: Researchers scale evolutionary multi-agent simulations to 60k+ agents to study emergent behaviors in ecological environments without explicit rewards, finding that larger scales enable more complex and stable behaviors.

DetailsMotivation: To understand how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments, moving beyond optimizing single policies to examine natural evolution through competition and environmental pressures.

Method: Use modern hardware with a new multi-agent simulator to scale environments and populations to over 60,000 agents, each with evolved neural network policies, using unsupervised evolution through reproduction, mutation, and selection in dynamic ecological settings.
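
A minimal sketch of the unsupervised evolutionary loop described above (reproduction, mutation, and selection with no explicit reward). The survival rule, flat weight vectors, and mutation scale are assumptions, not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve(pop, fitness, mutation_scale=0.05):
    """One generation: agents with positive survival state reproduce,
    offspring inherit Gaussian-mutated policy weights, the rest die."""
    survivors = [w for w, f in zip(pop, fitness) if f > 0.0]
    children = [w + rng.normal(0, mutation_scale, size=w.shape)
                for w in survivors]          # mutation at reproduction
    return survivors + children             # selection + reproduction

# Toy population of flat policy-weight vectors.
population = [rng.normal(size=8) for _ in range(4)]
energy = rng.uniform(-1, 1, size=4)          # stand-in for survival state
population = evolve(population, energy)
print(len(population))
```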

Result: Identified emergent behaviors like long-range resource extraction, vision-based foraging, and predation that arise under competitive pressures; found these behaviors appear only in sufficiently large environments/populations, with larger scales increasing behavioral stability and consistency.

Conclusion: Scaling evolutionary simulations on modern hardware provides promising new directions for exploring ecology as an instrument of machine learning, leveraging abundant computational resources to study emergent behaviors at unprecedented scales.

Abstract: We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. We use modern hardware along with a new multi-agent simulator to scale the environment and population to sizes much larger than previously attempted, reaching populations of over 60,000 agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors and find that some of them appear only in sufficiently large environments and populations, and that larger scales increase the stability and consistency of these emergent behaviors. While there is a rich history of research in evolutionary settings, our scaling results on modern hardware provide promising new directions to explore ecology as an instrument of machine learning in an era of increasingly abundant computational resources and efficient machine frameworks. Experimental code is available at https://github.com/jbejjani2022/ecological-emergent-behavior.

[366] MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Barak Or

Main category: cs.MA

TL;DR: The paper introduces MTTR-A (Mean Time-to-Recovery for Agentic Systems) to quantify cognitive recovery latency in multi-agent systems, adapting classical reliability metrics to measure how quickly agentic workflows restore reasoning coherence after drift.

DetailsMotivation: Existing observability tools monitor system outputs but cannot quantify how rapidly agentic workflows recover once reasoning coherence has been lost. There's a need to measure cognitive stability and recovery in autonomous multi-agent systems.

Method: Adapt classical reliability metrics (MTTR, MTBF) into the cognitive domain, defining MTTR-A as a runtime measure of cognitive recovery latency. Conduct benchmark simulation using AG News corpus and LangGraph orchestration framework, modeling recovery latencies across multiple reflex modes (automated vs human-approval interventions).
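
A minimal sketch of how MTTR-A and MTBF might be computed from a log of drift/recovery events, adapting the classical definitions. The event format and the downtime-ratio stand-in for NRR are assumptions; the paper's exact formulas are not reproduced here.

```python
import statistics

def recovery_metrics(events, horizon):
    """Compute MTTR-A, MTBF, and a downtime ratio from a log of
    (drift_detected_at, recovered_at) pairs, in seconds."""
    recovery_times = [rec - det for det, rec in events]
    mttr_a = statistics.mean(recovery_times)     # mean time to recovery
    # MTBF: mean operating time between successive drift onsets.
    onsets = [det for det, _ in events]
    gaps = [b - a for a, b in zip(onsets, onsets[1:])]
    mtbf = statistics.mean(gaps) if gaps else horizon
    # Stand-in for NRR; the paper's exact definition may differ.
    nrr = sum(recovery_times) / horizon
    return mttr_a, mtbf, nrr

events = [(10.0, 16.2), (30.0, 36.5), (55.0, 61.0)]  # seconds
print(recovery_metrics(events, horizon=200.0))
```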

Result: Automated reflexes restored stability within ~6s on average, while human-approval interventions required ~12s. Across 200 runs: median simulated MTTR-A = 6.21±2.14s, MTBF = 6.7±2.14s, NRR = 0.08, demonstrating measurable runtime resilience across reflex strategies.

Conclusion: Formalizes recovery latency as a quantifiable property of distributed reasoning, establishing a foundation for runtime dependability in agentic cognition. Transforms cognitive recovery from an ad-hoc process into a standardized, interpretable performance metric with reliability bounds linking recovery time and cognitive uptime.

Abstract: Ensuring cognitive stability in autonomous multi-agent systems (MAS) is a central challenge for large-scale, distributed AI. While existing observability tools monitor system outputs, they cannot quantify how rapidly agentic workflows recover once reasoning coherence has been lost. We adapt classical reliability metrics, namely Mean Time-to-Recovery (MTTR), Mean Time Between Failures (MTBF), and related ratios, into the cognitive domain, defining MTTR-A (Mean Time-to-Recovery for Agentic Systems) as a runtime measure of cognitive recovery latency. MTTR-A quantifies the time required for a MAS to detect reasoning drift and restore consistent operation, capturing the recovery of reasoning coherence rather than infrastructural repair. A benchmark simulation using the AG News corpus and the LangGraph orchestration framework was conducted, modeling recovery latencies across multiple reflex modes. Automated reflexes restored stability within approximately 6s on average, while human-approval interventions required about 12s. Across 200 runs, the median simulated MTTR-A was 6.21±2.14s, MTBF = 6.7±2.14s, and NRR = 0.08, demonstrating measurable runtime resilience across reflex strategies. By formalizing recovery latency as a quantifiable property of distributed reasoning, and deriving reliability bounds linking recovery time and cognitive uptime, this work establishes a foundation for runtime dependability in agentic cognition, transforming cognitive recovery from an ad-hoc process into a standardized, interpretable performance metric.

[367] Understanding LLM Agent Behaviours via Game Theory: Strategy Recognition, Biases and Multi-Agent Dynamics

Trung-Kiet Huynh, Duy-Minh Dao-Sy, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

Main category: cs.MA

TL;DR: The paper extends FAIRGAME framework to evaluate LLM strategic behavior in repeated social dilemmas using Prisoner’s Dilemma and Public Goods Game, revealing systematic cooperation biases and linguistic effects on decision-making.

DetailsMotivation: As LLMs become autonomous decision-makers in multi-agent systems and human societies, understanding their strategic behavior is crucial for safety, coordination, and AI-driven social/economic infrastructure design. Current methods need to capture not just outputs but underlying intentions guiding LLM decisions.

Method: Extended FAIRGAME framework with two complementary advances: 1) payoff-scaled Prisoner’s Dilemma to isolate sensitivity to incentive magnitude, and 2) integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. Trained traditional supervised classification models on canonical repeated-game strategies and applied them to FAIRGAME trajectories.
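
A minimal sketch of payoff scaling in the Prisoner's Dilemma: multiplying the canonical payoff matrix by a scalar varies incentive magnitude while preserving the game's ordinal structure. The scaling scheme is an assumption about what "payoff-scaled" means here.

```python
import numpy as np

def scaled_pd(base=((3, 0), (5, 1)), k=1.0):
    """Scale the canonical PD payoffs (T=5 > R=3 > P=1 > S=0) by k.
    Rows: my action (C, D); columns: opponent's action (C, D)."""
    return k * np.array(base)

for k in (0.1, 1.0, 10.0):
    m = scaled_pd(k=k)
    # Temptation minus reward grows with k, but the C/D ordering
    # (and hence the equilibrium structure) is unchanged.
    print(k, m[1, 0] - m[0, 0])
```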

Result: Revealed consistent behavioral signatures across models and languages: incentive-sensitive cooperation, cross-linguistic divergence, and end-game alignment toward defection. LLMs exhibit systematic, model- and language-dependent behavioral intentions, with linguistic framing sometimes having effects as strong as architectural differences.

Conclusion: Provides unified methodological foundation for auditing LLMs as strategic agents and reveals systematic cooperation biases with direct implications for AI governance, collective decision-making, and safe multi-agent system design.

Abstract: As Large Language Models (LLMs) increasingly operate as autonomous decision-makers in interactive and multi-agent systems and human societies, understanding their strategic behaviour has profound implications for safety, coordination, and the design of AI-driven social and economic infrastructures. Assessing such behaviour requires methods that capture not only what LLMs output, but the underlying intentions that guide their decisions. In this work, we extend the FAIRGAME framework to systematically evaluate LLM behaviour in repeated social dilemmas through two complementary advances: a payoff-scaled Prisoner’s Dilemma isolating sensitivity to incentive magnitude, and an integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. These environments reveal consistent behavioural signatures across models and languages, including incentive-sensitive cooperation, cross-linguistic divergence, and end-game alignment toward defection. To interpret these patterns, we train traditional supervised classification models on canonical repeated-game strategies and apply them to FAIRGAME trajectories, showing that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences. Together, these findings provide a unified methodological foundation for auditing LLMs as strategic agents and reveal systematic cooperation biases with direct implications for AI governance, collective decision-making, and the design of safe multi-agent systems.

cs.MM

[368] Q-BAR: Blogger Anomaly Recognition via Quantum-enhanced Manifold Learning

Maida Wang

Main category: cs.MM

TL;DR: Quantum-enhanced framework detects semantic mutations in creator content using variational quantum circuits, achieving robust anomaly detection with minimal training data and parameters.

DetailsMotivation: Creators face semantic mutation attacks where malicious edits preserve visual appearance but alter meaning, requiring detection of anomalies in individual creators' semantic manifolds. Classical methods struggle with data scarcity as creators typically have fewer than 50 training samples.

Method: Proposes Q-BAR (quantum-enhanced blogger anomaly recognition), a hybrid quantum-classical framework using variational quantum circuits. Maps multimodal features into Hilbert space hypersphere with parameter-efficient quantum anomaly detection strategy for low-data regimes.
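
For intuition, here is a classical stand-in for the hypersphere decision rule (the paper's encoder is a variational quantum circuit, omitted here): enclose a creator's training embeddings in a hypersphere and flag test points by distance to its center. The quantile-based radius rule is an assumption.

```python
import numpy as np

def fit_hypersphere(embeddings):
    """Fit a center and radius to normal (in-distribution) embeddings;
    the 95th-percentile radius is an illustrative choice."""
    center = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - center, axis=1)
    radius = np.quantile(dists, 0.95)
    return center, radius

def is_anomalous(x, center, radius):
    return np.linalg.norm(x - center) > radius

rng = np.random.default_rng(1)
train = rng.normal(0, 1, size=(40, 8))    # ~40 samples per creator
c, r = fit_hypersphere(train)
print(is_anomalous(rng.normal(0, 1, size=8), c, r),   # likely normal
      is_anomalous(rng.normal(6, 1, size=8), c, r))   # likely anomalous
```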

Result: On curated dataset of 100 creators, achieves robust detection performance with significantly fewer trainable parameters compared to classical baselines. Uses only hundreds of quantum parameters, effectively mitigating overfitting.

Conclusion: Demonstrates potential of quantum machine learning for personalized media forensics, showing quantum-enhanced approaches can effectively detect semantic anomalies in low-data scenarios where classical methods struggle.

Abstract: In recommendation-driven online media, creators increasingly suffer from semantic mutation, where malicious secondary edits preserve visual fidelity while altering the intended meaning. Detecting these mutations requires modeling a creator’s unique semantic manifold. However, training robust detector models for individual creators is challenged by data scarcity, as a distinct blogger may typically have fewer than 50 representative samples available for training. We propose quantum-enhanced blogger anomaly recognition (Q-BAR), a hybrid quantum-classical framework that leverages the high expressivity and parameter efficiency of variational quantum circuits to detect semantic anomalies in low-data regimes. Unlike classical deep anomaly detectors that often struggle to generalize from sparse data, our method employs a parameter-efficient quantum anomaly detection strategy to map multimodal features into a Hilbert space hypersphere. On a curated dataset of 100 creators, our quantum-enhanced approach achieves robust detection performance with significantly fewer trainable parameters compared to classical baselines. By utilizing only hundreds of quantum parameters, the model effectively mitigates overfitting, demonstrating the potential of quantum machine learning for personalized media forensics.

eess.AS

[369] All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR

Takafumi Moriya, Masato Mimura, Tomohiro Tanaka, Hiroshi Sato, Ryo Masumura, Atsunori Ogawa

Main category: eess.AS

TL;DR: A unified ASR framework that supports multiple paradigms (CTC, AED, Transducer) in both offline and streaming modes within a single model, reducing deployment costs while maintaining or improving performance.

DetailsMotivation: Different ASR architectures have distinct advantages for different applications, but maintaining separate models for each scenario incurs substantial development and deployment costs. There's a need for a unified solution that can handle multiple paradigms efficiently.

Method: Proposes All-in-One ASR framework with a multi-mode joiner that enables seamless integration of various ASR modes (CTC, attention-based encoder-decoder, and Transducer) within a single unified model, supporting both offline and streaming modes.
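
A hedged PyTorch sketch of one model dispatching between paradigm-specific heads over a shared encoder. The real multi-mode joiner's wiring is not detailed in the abstract, so this design is an assumption; the AED branch and the full (t, u) transducer lattice are omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiModeJoiner(nn.Module):
    """Sketch: shared encoder with per-paradigm output heads."""
    def __init__(self, feat_dim=80, hidden=256, vocab=500):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab + 1)    # +1 for blank
        self.joiner = nn.Linear(hidden * 2, vocab + 1)  # transducer join

    def forward(self, feats, mode, dec_state=None):
        enc, _ = self.encoder(feats)
        if mode == "ctc":
            return self.ctc_head(enc)
        if mode == "transducer":
            # Single prediction-network state for brevity; a real
            # transducer joins every (time, label) pair.
            joined = torch.cat([enc, dec_state.expand_as(enc)], dim=-1)
            return self.joiner(joined)
        raise ValueError(mode)

m = MultiModeJoiner()
x = torch.randn(2, 50, 80)                       # (batch, frames, feats)
print(m(x, "ctc").shape)
print(m(x, "transducer", dec_state=torch.randn(2, 1, 256)).shape)
```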

Result: The unified model significantly reduces total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Joint decoding leverages complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.

Conclusion: All-in-One ASR provides an efficient unified solution that reduces deployment complexity and costs while maintaining or improving ASR performance across multiple paradigms and modes, demonstrating the viability of multi-paradigm integration in speech recognition.

Abstract: This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.

[370] End-to-end transfer learning for speaker-independent cross-language and cross-corpus speech emotion recognition

Duowei Tang, Peter Kuppens, Lucca Geurts, Toon van Waterschoot

Main category: eess.AS

TL;DR: Proposes transfer learning DNN with wav2vec 2.0 and Deep-WCCN layer for cross-language/cross-corpus speech emotion recognition, achieving state-of-the-art performance across English, German, and Chinese datasets.

DetailsMotivation: Current SER models perform poorly when testing data differs from training data in language or dataset source. There's a need for robust models that can handle cross-language and cross-corpus scenarios while reducing variabilities like language, speaker, and channel differences.

Method: End-to-end DNN with transfer learning using wav2vec 2.0 pre-trained model to create language-shared feature space, plus novel Deep-WCCN layer to reduce speaker/channel variabilities. Fine-tuned with combined loss on multi-language datasets.
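
For context, a NumPy sketch of the classical WCCN transform that Deep-WCCN adapts into a trainable layer: whiten embeddings by the estimated within-class (e.g., within-speaker) covariance so that nuisance variability shrinks. The shrinkage term is an illustrative assumption.

```python
import numpy as np

def wccn_projection(X, labels):
    """Estimate the average within-class covariance W and whiten
    embeddings with B such that B @ B.T = inv(W)."""
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        W += np.cov(X[labels == c], rowvar=False)
    W = W / len(classes) + 1e-3 * np.eye(dim)   # shrinkage for stability
    B = np.linalg.cholesky(np.linalg.inv(W))
    return X @ B                                # normalized embeddings

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
labels = np.repeat(np.arange(3), 20)            # three "speakers"
print(wccn_projection(X, labels).shape)
```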

Result: Outperforms baseline acoustic feature models in both within-language and cross-language settings. Deep-WCCN further improves performance. Achieves 15.6% improvement with only 160s of target language data. Beats state-of-the-art models in cross-language SER.

Conclusion: The proposed transfer learning approach with wav2vec 2.0 and Deep-WCCN effectively addresses cross-language/cross-corpus SER challenges, demonstrating strong performance, data efficiency, and superiority over existing methods.

Abstract: Data-driven models achieve successful results in Speech Emotion Recognition (SER). However, these models, which are often based on general acoustic features or end-to-end approaches, show poor performance when the testing set has a different language than the training set or when these sets are taken from different datasets. To alleviate these problems, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language and cross-corpus SER. We use the wav2vec 2.0 pre-trained model to transform audio time-domain waveforms from different languages, different speakers and different recording conditions into a feature space shared by multiple languages, thereby reducing the language variabilities in the speech embeddings. Next, we propose a new Deep-Within-Class Covariance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model and aims to reduce other variabilities including speaker variability, channel variability and so on. The entire model is fine-tuned in an end-to-end manner on a combined loss and is validated on datasets from three languages (i.e., English, German, and Chinese). Experimental results show that our proposed method outperforms the baseline model that is based on common acoustic feature sets for SER in both the within-language setting and the cross-language setting. In addition, we also experimentally validate the effectiveness of Deep-WCCN, which can further improve the model performance. Next, we show that the proposed transfer learning method has good data efficiency when merging target language data into the fine-tuning process. The model's speaker-independent SER performance increases by up to 15.6% when only 160s of target-language data is used. Finally, our proposed model shows significantly better performance than other state-of-the-art models in cross-language SER.

[371] Recent Advances in Discrete Speech Tokens: A Review

Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

Main category: eess.AS

TL;DR: Survey paper on discrete speech tokens for speech representation in LLM era, covering acoustic vs semantic tokens, comparing strengths/limitations, and proposing future directions.

DetailsMotivation: Speech generation technologies have advanced with LLMs, making discrete speech tokens a key paradigm. They offer efficient transmission/storage and compatibility with text-based LLM architectures, but need systematic analysis of different token types and their applications.

Method: Systematic survey synthesizing existing taxonomy and innovations in discrete speech tokenization. Critical examination of strengths/limitations of acoustic vs semantic tokens, with experimental comparisons across token types.

Result: Comprehensive analysis of two principal classes: acoustic tokens (capturing acoustic features) and semantic tokens (capturing linguistic meaning). Each has evolved into rich research domains with unique design philosophies and methodological approaches.

Conclusion: Identifies persistent challenges in discrete speech tokenization field and proposes potential research directions to inspire future advancements in development and application of discrete speech tokens.

Abstract: The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.

eess.IV

[372] An Open Source Realtime GPU Beamformer for Row-Column and Top Orthogonal to Bottom Electrode (TOBE) Arrays

Randy Palamar, Darren Dahunsi, Tyler Henry, Mohammad Rahim Sobhani, Roger Zemp

Main category: eess.IV

TL;DR: Open-source GPU-accelerated ultrasound reconstruction software integrated with programmable platform and novel TOBE arrays enables real-time navigation for 2D arrays like row-column arrays.

DetailsMotivation: Research ultrasound platforms lack real-time navigation capabilities for emerging 2D arrays such as row-column arrays, limiting the practical application of next-generation imaging sequences.

Method: Developed an open-source, GPU-accelerated reconstruction and rendering software suite integrated with a programmable ultrasound platform and novel electrostrictive TOBE arrays, using OpenGL compute shaders for beamforming and rendering kernels.
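
A NumPy sketch of the delay-and-sum kernel that such compute shaders parallelize per image pixel. Plane-wave transmit (a depth-only transmit delay) is an illustrative assumption; the suite's actual aperture-encoded sequences are more involved.

```python
import numpy as np

def delay_and_sum(rf, elem_x, px, pz, fs, c=1540.0):
    """Beamform one pixel at (px, pz): sum RF samples across elements
    at their round-trip delays (transmit + receive)."""
    t_tx = pz / c                                    # plane-wave transmit
    t_rx = np.sqrt((px - elem_x) ** 2 + pz ** 2) / c
    idx = np.round((t_tx + t_rx) * fs).astype(int)
    idx = np.clip(idx, 0, rf.shape[1] - 1)
    return rf[np.arange(rf.shape[0]), idx].sum()     # coherent sum

rng = np.random.default_rng(0)
n_elem, n_samp = 64, 2048
rf = rng.normal(size=(n_elem, n_samp))               # channel data
elem_x = np.linspace(-9.5e-3, 9.5e-3, n_elem)        # element x (m)
print(delay_and_sum(rf, elem_x, px=0.0, pz=20e-3, fs=40e6))
```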

Result: The system supports advanced real-time modes including cross-plane aperture-encoded synthetic-aperture imaging and aperture-encoded volumetric scanning, with TOBE-enabled methods demonstrating improved image quality and expanded field of view compared to conventional RCA techniques.

Conclusion: The software suite provides maximum data throughput with minimized stalls and latency, and includes sample datasets and example scripts to facilitate external testing and adoption of real-time ultrasound navigation for 2D arrays.

Abstract: Research ultrasound platforms have enabled many next-generation imaging sequences but have lacked real-time navigation capabilities for emerging 2D arrays such as row-column arrays (RCAs). We present an open-source, GPU-accelerated reconstruction and rendering software suite integrated with a programmable ultrasound platform and novel electrostrictive Top-Orthogonal-to-Bottom-Electrode (TOBE) arrays. The system supports advanced real-time modes, including cross-plane aperture-encoded synthetic-aperture imaging and aperture-encoded volumetric scanning. TOBE-enabled methods demonstrate improved image quality and expanded field of view compared with conventional RCA techniques. The software implements beamforming and rendering kernels using OpenGL compute shaders and is designed for maximum data throughput, helping to minimize stalls and latency. Accompanying sample datasets and example scripts for offline reconstruction are provided to facilitate external testing.

[373] Feature Compression for Machines with Range-Based Channel Truncation and Frame Packing

Juan Merlos, Fabien Racapé, Hyomin Choi, Mateen Ulhaq, Hari Kalva

Main category: eess.IV

TL;DR: Proposes channel truncation and packing method for MPEG-FCM standard to improve feature compression for split computing, achieving 10.59% average rate reduction while preserving task accuracy.

DetailsMotivation: The MPEG-FCM standard aims to provide interoperable compressed bitstreams of features for split computing scenarios where neural network inference is divided between devices. Current methods convert 3D feature tensors to 2D video frames for compression, but there's a need to better preserve relevant channels and reduce bandwidth while maintaining task performance.

Method: Introduces an additional channel truncation and packing method that preserves relevant channels based on feature statistics at inference time. This method is integrated into the MPEG-FCM test model, working alongside existing neural layer reduction and video compression components to optimize feature compression for split computing.
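
A minimal sketch of range-based channel truncation and raster packing into a 2D frame for a video codec. The ranking statistic (per-channel dynamic range) and tiling order are illustrative assumptions, not the normative MPEG-FCM process.

```python
import numpy as np

def truncate_and_pack(tensor, keep, cols):
    """Rank (C, H, W) feature channels by dynamic range at inference
    time, keep the top `keep`, and tile them into one 2D frame."""
    c, h, w = tensor.shape
    rng_per_ch = np.ptp(tensor.reshape(c, -1), axis=1)
    keep_idx = np.sort(np.argsort(rng_per_ch)[::-1][:keep])
    kept = tensor[keep_idx]                  # preserve channel order
    rows = int(np.ceil(keep / cols))
    frame = np.zeros((rows * h, cols * w), dtype=tensor.dtype)
    for i, ch in enumerate(kept):            # raster-scan tiling
        r, cc = divmod(i, cols)
        frame[r * h:(r + 1) * h, cc * w:(cc + 1) * w] = ch
    return frame, keep_idx                   # indices needed to unpack

features = np.random.default_rng(0).normal(size=(256, 16, 16))
frame, idx = truncate_and_pack(features, keep=64, cols=8)
print(frame.shape)   # (128, 128): 8x8 tiles of 16x16 channels
```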

Result: The proposed method yields an average 10.59% reduction in rate for a given accuracy across multiple computer vision tasks and datasets when implemented within the MPEG-FCM test model.

Conclusion: The channel truncation and packing method effectively enhances compression performance for the MPEG-FCM standard, enabling better bandwidth efficiency while preserving computer vision task accuracy in split computing scenarios.

Abstract: This paper proposes a method that enhances the compression performance of the current model under development for the upcoming MPEG standard on Feature Coding for Machines (FCM). This standard aims at providing interoperable compressed bitstreams of features in the context of split computing, i.e., when the inference of a large computer vision neural-network (NN)-based model is split between two devices. Intermediate features can consist of multiple 3D tensors that can be reduced and entropy coded to limit the required bandwidth of such transmission. In the envisioned design for the MPEG-FCM standard, intermediate feature tensors may be reduced using neural layers before being converted into 2D video frames that can be coded using existing video compression standards. This paper introduces an additional channel truncation and packing method which enables the system to preserve the relevant channels, depending on the statistics of the features at inference time, while preserving the computer vision task performance at the receiver. Implemented within the MPEG-FCM test model, the proposed method yields an average rate reduction of 10.59% at a given accuracy on multiple computer vision tasks and datasets.

[374] mViSE: A Visual Search Engine for Analyzing Multiplex IHC Brain Tissue Images

Liqiang Huang, Rachel W. Mills, Saikiran Mandula, Lin Bai, Mahtab Jeyhani, John Redell, Hien Van Nguyen, Saurabh Prasad, Dragan Maric, Badrinath Roysam

Main category: eess.IV

TL;DR: mViSE is a query-driven visual search engine for whole-slide multiplex brain imaging that enables programming-free analysis by learning tissue architecture and retrieving similar cellular communities.

DetailsMotivation: Whole-slide multiplex imaging generates massive, information-dense brain tissue images that are challenging to analyze and require custom software, creating a need for more accessible analysis tools.

Method: Divide-and-conquer strategy organizing data into molecular marker panels, using self-supervised learning to train multiplex encoders for each panel with visual confirmation, then combining panels for visual queries using information-theoretic methods.

Result: Validated the ability to retrieve single cells, proximal cell pairs, and tissue patches, and to delineate cortical layers, brain regions, and sub-regions, all without programming.

Conclusion: mViSE provides an open-source, programming-free solution for analyzing multiplex brain imaging data through visual search capabilities, enabling diverse tissue exploration and analysis tasks.

Abstract: Whole-slide multiplex imaging of brain tissue generates massive information-dense images that are challenging to analyze and require custom software. We present an alternative query-driven, programming-free strategy using a multiplex visual search engine (mViSE) that learns the multifaceted brain tissue chemoarchitecture, cytoarchitecture, and myeloarchitecture. Our divide-and-conquer strategy organizes the data into panels of related molecular markers and uses self-supervised learning to train a multiplex encoder for each panel with explicit visual confirmation of successful learning. Multiple panels can be combined to process visual queries for retrieving similar communities of individual cells or multicellular niches using information-theoretic methods. The retrievals can be used for diverse purposes, including tissue exploration, delineating brain regions and cortical cell layers, and profiling and comparing brain regions, all without computer programming. We validated mViSE’s ability to retrieve single cells, proximal cell pairs, and tissue patches, and to delineate cortical layers, brain regions, and sub-regions. mViSE is provided as an open-source QuPath plug-in.

[375] MarsQE: Semantic-Informed Quality Enhancement for Compressed Martian Image

Chengfeng Liu, Mai Xu, Qunliang Xing, Xin Zou

Main category: eess.IV

TL;DR: MarsQE is a semantic-informed quality enhancement approach for Martian images that uses texture-similar reference images to reduce compression artifacts, outperforming Earth-focused methods.

DetailsMotivation: Lossy image compression for Mars missions introduces artifacts that hinder geological analysis. Existing Earth-focused enhancement methods fail to account for unique Martian semantics and textures.

Method: Two-phase approach: 1) Semantic-based matching of texture-similar reference images, 2) Texture pattern transfer from references to compressed images, plus a post-enhancement network for artifact reduction.
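
A minimal sketch of the first phase, under the assumption that semantic matching reduces to nearest-neighbor search over feature vectors; the feature extractor and similarity measure are not specified in the abstract.

```python
import numpy as np

def match_reference(query_feat, ref_feats):
    """Return the index of the texture-most-similar reference image by
    cosine similarity between semantic feature vectors."""
    q = query_feat / np.linalg.norm(query_feat)
    R = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return int(np.argmax(R @ q))

rng = np.random.default_rng(0)
refs = rng.normal(size=(100, 64))      # reference library features
print(match_reference(rng.normal(size=64), refs))
```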

Result: MarsQE significantly outperforms existing Earth-focused approaches, establishing a new benchmark for quality enhancement on Martian images.

Conclusion: The semantic-informed MarsQE approach effectively addresses the unique challenges of Martian image enhancement, providing superior compression artifact reduction for geological analysis.

Abstract: Lossy image compression is essential for Mars exploration missions, due to the limited bandwidth between Earth and Mars. However, the compression may introduce visual artifacts that complicate the geological analysis of the Martian surface. Existing quality enhancement approaches, primarily designed for Earth images, fall short for Martian images due to a lack of consideration for the unique Martian semantics. In response to this challenge, we conduct an in-depth analysis of Martian images, yielding two key insights based on semantics: the presence of texture similarities and the compact nature of texture representations in Martian images. Inspired by these findings, we introduce MarsQE, an innovative, semantic-informed, two-phase quality enhancement approach specifically designed for Martian images. The first phase involves the semantic-based matching of texture-similar reference images, and the second phase enhances image quality by transferring texture patterns from these reference images to the compressed image. We also develop a post-enhancement network to further reduce compression artifacts and achieve superior compression quality. Our extensive experiments demonstrate that MarsQE significantly outperforms existing approaches designed for Earth images, establishing a new benchmark for quality enhancement of Martian images.

[376] Multimodal Learning for Scalable Representation of High-Dimensional Medical Data

Areej Alsaafin, Abubakr Shafique, Saghir Alfasly, Krishna R. Kalari, H. R. Tizhoosh

Main category: eess.IV

TL;DR: MarbliX is a self-supervised multimodal framework that learns binary embeddings (monograms) from whole slide images and immunogenomic data, enabling efficient patient similarity retrieval and outperforming unimodal approaches in cancer diagnostics.

DetailsMotivation: Current AI diagnostic models typically use unimodal data, missing critical cross-modal interactions between histopathology (WSIs) and genomics that could provide richer clinical insights. There's a need for scalable, interpretable frameworks to leverage multimodal healthcare data effectively.

Method: MarbliX uses self-supervised learning with a triplet contrastive objective to embed WSIs and immunogenomic profiles into compact binary codes called “monograms.” This creates a unified latent space capturing high-resolution patient similarity across modalities.
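
A hedged PyTorch sketch of a cross-modal triplet objective with a tanh relaxation of the binary "monogram": pull a patient's WSI embedding toward their own immunogenomic embedding and away from another patient's. The margin and the tanh/sign relaxation are illustrative assumptions, not MarbliX's published training recipe.

```python
import torch
import torch.nn.functional as F

def triplet_monogram_loss(wsi_emb, gen_emb, neg_emb, margin=0.5):
    """Cross-modal triplet loss on tanh-relaxed codes: anchor = WSI,
    positive = same patient's genomics, negative = another patient's."""
    a, p, n = torch.tanh(wsi_emb), torch.tanh(gen_emb), torch.tanh(neg_emb)
    return F.triplet_margin_loss(a, p, n, margin=margin)

def to_monogram(emb):
    """Binarize at indexing time into a compact boolean code."""
    return torch.sign(torch.tanh(emb)) > 0

a = torch.randn(4, 128, requires_grad=True)
p, n = torch.randn(4, 128), torch.randn(4, 128)
loss = triplet_monogram_loss(a, p, n)
loss.backward()                        # gradients flow through tanh
print(loss.item(), to_monogram(p)[0, :8])
```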

Result: In lung cancer: MarbliX achieves 85-89% across all metrics, outperforming histopathology alone (69-71%) and immunogenomics alone (73-76%). In kidney cancer: Real-valued monograms perform best (F1: 80-83%, Accuracy: 87-90%), with binary monograms slightly lower (F1: 78-82%).

Conclusion: MarbliX successfully integrates multimodal healthcare data through binary embeddings, enabling efficient case retrieval and case-based reasoning while significantly outperforming unimodal approaches in cancer diagnostics.

Abstract: Integrating artificial intelligence (AI) with healthcare data is rapidly transforming medical diagnostics and driving progress toward precision medicine. However, effectively leveraging multimodal data, particularly digital pathology whole slide images (WSIs) and genomic sequencing, remains a significant challenge due to the intrinsic heterogeneity of these modalities and the need for scalable and interpretable frameworks. Existing diagnostic models typically operate on unimodal data, overlooking critical cross-modal interactions that can yield richer clinical insights. We introduce MarbliX (Multimodal Association and Retrieval with Binary Latent Indexed matriX), a self-supervised framework that learns to embed WSIs and immunogenomic profiles into compact, scalable binary codes, termed “monogram.” By optimizing a triplet contrastive objective across modalities, MarbliX captures high-resolution patient similarity in a unified latent space, enabling efficient retrieval of clinically relevant cases and facilitating case-based reasoning. In lung cancer, MarbliX achieves 85-89% across all evaluation metrics, outperforming histopathology (69-71%) and immunogenomics (73-76%). In kidney cancer, real-valued monograms yield the strongest performance (F1: 80-83%, Accuracy: 87-90%), with binary monograms slightly lower (F1: 78-82%).

[377] Denoising Diffusion Models for Anomaly Localization in Medical Images

Cosmin I. Bercea, Philippe C. Cattin, Julia A. Schnabel, Julia Wolleb

Main category: eess.IV

TL;DR: Review paper on using denoising diffusion models for anomaly localization in medical images, covering methods, datasets, evaluation metrics, supervision schemes, and open challenges.

DetailsMotivation: To provide a comprehensive overview of current state-of-the-art approaches using denoising diffusion models for anomaly localization in medical images, identify research gaps, and highlight the potential of these models for robust anomaly detection.

Method: Literature review methodology covering: 1) Background on denoising diffusion models for image reconstruction and conditioning mechanisms, 2) Available datasets and evaluation metrics for medical anomaly localization, 3) Analysis of supervision schemes from fully supervised to unsupervised methods, 4) Discussion of effectiveness and limitations of different approaches.

Result: Provides systematic overview of diffusion model applications for medical anomaly localization, identifies key supervision schemes, discusses effectiveness/limitations of approaches, and highlights open challenges including detection bias, domain shift, computational cost, and model interpretability.

Conclusion: Denoising diffusion models show significant potential for robust anomaly localization in medical images, but several challenges remain. The review outlines current state-of-the-art, identifies research gaps, and provides direction for future work in this emerging field.

Abstract: This review explores anomaly localization in medical images using denoising diffusion models. After providing a brief methodological background of these models, including their application to image reconstruction and their conditioning using guidance mechanisms, we provide an overview of available datasets and evaluation metrics suitable for their application to anomaly localization in medical images. In this context, we discuss supervision schemes ranging from fully supervised segmentation to semi-supervised, weakly supervised, self-supervised, and unsupervised methods, and provide insights into the effectiveness and limitations of these approaches. Furthermore, we highlight open challenges in anomaly localization, including detection bias, domain shift, computational cost, and model interpretability. Our goal is to provide an overview of the current state of the art in the field, outline research gaps, and highlight the potential of diffusion models for robust anomaly localization in medical images.

[378] Bayesian Multifractal Image Segmentation

Kareth M. León-López, Abderrahim Halimi, Jean-Yves Tourneret, Herwig Wendt

Main category: eess.IV

TL;DR: Unsupervised Bayesian multifractal segmentation method that jointly estimates multifractal parameters and pixel-level labels using wavelet leaders and multiscale Potts Markov random field.

DetailsMotivation: Natural images often contain multiple textures with different multifractal properties, but existing multifractal analysis methods assume homogeneous textures and cannot handle segmentation of multiple multifractal textures within the same image.

Method: 1) Develop efficient multifractal parameter estimation for wavelet leaders with region-specific parameters; 2) Introduce multiscale Potts Markov random field to model spatial and cross-scale correlations between wavelet leader labels; 3) Use Gibbs sampling to sample from posterior distribution of unknown parameters.
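
A minimal sketch of one Gibbs sweep for a single-scale Potts prior over labels; the paper's multiscale, cross-scale coupling is omitted for brevity, and the per-class data log-likelihoods are assumed already computed from the wavelet leaders.

```python
import numpy as np

def gibbs_potts_sweep(labels, loglik, beta, rng):
    """Resample each site from its conditional: data log-likelihood plus
    a Potts term rewarding agreement with the 4 in-scale neighbors."""
    H, W, K = loglik.shape
    for i in range(H):
        for j in range(W):
            logp = loglik[i, j].copy()
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W:
                    logp += beta * (np.arange(K) == labels[ni, nj])
            p = np.exp(logp - logp.max())        # stable softmax
            labels[i, j] = rng.choice(K, p=p / p.sum())
    return labels

rng = np.random.default_rng(0)
H, W, K = 16, 16, 2
loglik = rng.normal(size=(H, W, K))    # per-class data log-likelihoods
labels = rng.integers(0, K, size=(H, W))
labels = gibbs_potts_sweep(labels, loglik, beta=1.0, rng=rng)
print(np.bincount(labels.ravel(), minlength=K))
```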

Result: The method achieves superior performance compared to traditional unsupervised segmentation techniques and modern deep learning-based approaches on synthetic multifractal images.

Conclusion: The proposed Bayesian multifractal segmentation framework effectively models and segments multiple multifractal textures in images, demonstrating practical value for texture analysis applications.

Abstract: Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the interest of using MFA for the description of homogeneous textures in images. Nevertheless, natural images can be composed of several textures and, in turn, multifractal properties associated with those textures. This paper introduces an unsupervised Bayesian multifractal segmentation method to model and segment multifractal textures by jointly estimating the multifractal parameters and labels on images, at the pixel level. For this, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is first developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. A Gibbs sampling methodology is finally used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.

Last updated: 2025-12-19
Built with Hugo; theme modified from Stack