Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 56]
- cs.CV [Total: 156]
- cs.AI [Total: 36]
- cs.SD [Total: 8]
- cs.LG [Total: 116]
- cs.MA [Total: 4]
- cs.MM [Total: 1]
- eess.AS [Total: 4]
- eess.IV [Total: 14]
cs.CL
[1] A Women’s Health Benchmark for Large Language Models
Victoria-Elisabeth Gruber, Razvan Marinescu, Diego Fajardo, Amin H. Nassar, Christopher Arkfeld, Alexandria Ludlow, Shama Patel, Mehrnoosh Samaei, Valerie Klug, Anna Huber, Marcel Gühner, Albert Botta i Orfila, Irene Lagoja, Kimya Tarr, Haleigh Larson, Mary Beth Howard
Main category: cs.CL
TL;DR: First benchmark (WHB) evaluating LLMs in women’s health reveals ~60% failure rates across 13 models, with significant gaps in handling medical urgency and specialty-specific queries.
Details
Motivation: As LLMs become primary health information sources for millions, their accuracy in women's health remains critically unexamined, creating potential risks for users seeking reliable medical advice.
Method: Created Women’s Health Benchmark (WHB) with 96 rigorously validated model stumps covering 5 medical specialties, 3 query types, and 8 error types. Evaluated 13 state-of-the-art LLMs.
Result: Current models show ~60% failure rates on women’s health benchmark. Performance varies dramatically across specialties and error types. Models universally struggle with “missed urgency” indicators. Newer models like GPT-5 show improvements in avoiding inappropriate recommendations.
Conclusion: AI chatbots are not yet fully capable of providing reliable advice in women’s health, highlighting critical safety gaps that need to be addressed before widespread clinical use.
Abstract: As large language models (LLMs) become primary sources of health information for millions, their accuracy in women’s health remains critically unexamined. We introduce the Women’s Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women’s health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women’s health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with “missed urgency” indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women’s health.
[2] Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler
Main category: cs.CL
TL;DR: Struct-SQL: A knowledge distillation framework that uses structured query execution plans instead of unstructured CoT traces to train small language models for Text-to-SQL, achieving 8.1% improvement over unstructured CoT distillation.
Details
Motivation: Enterprise Text-to-SQL systems face a trilemma between cost, security, and performance. Current solutions force choosing between expensive proprietary LLMs and low-performing SLMs. Unstructured CoT distillation for SLMs is ambiguous, while structured reasoning could provide clearer teaching signals for precise SQL generation.
Method: Propose Struct-SQL, a knowledge distillation framework that trains SLMs to emulate powerful LLMs using structured reasoning. Uses query execution plans as formal blueprints for structured reasoning instead of unstructured CoT traces.
Result: SLM distilled with structured CoT achieves 8.1% absolute improvement over unstructured CoT distillation baseline. Error analysis shows key factor is marked reduction in syntactic errors.
Conclusion: Teaching models to reason using structured logical blueprints is beneficial for reliable SQL generation in SLMs, demonstrating the value of formal structured reasoning over unstructured approaches.
Abstract: Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.
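To make the distillation format concrete, here is a minimal sketch of building one (prompt, target) pair in which the target is a numbered execution-plan rationale followed by the final SQL, in the spirit of Struct-SQL. The plan schema, helper names, and template wording are hypothetical; the paper derives its blueprints from actual query execution plans.

```python
# Hypothetical sketch of a Struct-SQL-style distillation pair: the target
# rationale is a structured execution plan rather than free-form CoT.
def build_distillation_example(question: str, schema: str,
                               plan_steps: list[str], sql: str) -> dict:
    """Format one (prompt, target) pair for supervised distillation."""
    prompt = f"-- Schema:\n{schema}\n-- Question: {question}\n"
    # Structured reasoning: numbered plan steps, then the final SQL.
    plan = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(plan_steps))
    target = f"{plan}\nFinal SQL:\n{sql}"
    return {"prompt": prompt, "target": target}

example = build_distillation_example(
    question="How many singers are from France?",
    schema="CREATE TABLE singer(id INT, name TEXT, country TEXT);",
    plan_steps=["Scan table singer",
                "Filter rows where country = 'France'",
                "Aggregate with COUNT(*)"],
    sql="SELECT COUNT(*) FROM singer WHERE country = 'France';",
)
print(example["target"])
```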
[3] XLM: A Python package for non-autoregressive language models
Dhruvesh Patel, Durga Prasad Maram, Sai Sreenivas Chintha, Benjamin Rozonoyer, Andrew McCallum
Main category: cs.CL
TL;DR: XLM package provides standardized implementation framework for non-autoregressive language models to address reproducibility and component reuse challenges.
Details
Motivation: Non-autoregressive text generation lacks standardized training/inference libraries compared to autoregressive approaches, leading to bespoke implementations that hinder systematic comparisons and component reuse.
Method: Developed XLM Python package with standardized data collation, loss functions, and prediction logic to streamline implementation of small non-autoregressive language models.
Result: Created open-source XLM package with companion xlm-models package offering pre-trained small models for research community use.
Conclusion: XLM package addresses standardization gap in non-autoregressive language modeling, facilitating faster implementation and reproducible research through shared components and pre-trained models.
Abstract: In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at https://github.com/dhruvdcoder/xlm-core.
[4] Perturb Your Data: Paraphrase-Guided Training Data Watermarking
Pranav Shetty, Mirazul Haque, Petr Babkin, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
Main category: cs.CL
TL;DR: SPECTRA is a watermarking method that makes training data detectable even when it’s less than 0.001% of the corpus, using LLM paraphrasing and scoring to avoid distribution shifts.
Details
Motivation: To enforce copyright and data licensing for LLMs by enabling reliable detection of training data, addressing the challenge of massive internet-scraped corpora.
Method: Paraphrase text using an LLM, assign scores via a separate scoring model, select paraphrases matching original text scores to avoid distribution shifts, then compare suspect model’s token probabilities against scoring model.
Result: Achieves consistent p-value gap over nine orders of magnitude between training vs non-training data detection, outperforming all baselines tested.
Conclusion: SPECTRA provides scalable, deploy-before-release watermarking that survives large-scale LLM training, empowering data owners with effective copyright enforcement tools.
Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
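A minimal sketch of the score-matched paraphrase selection described above, assuming a scoring-model likelihood function `score(text)`; both the function and the toy stand-in below are hypothetical placeholders, not the paper's components.

```python
# Pick the paraphrase whose scoring-model score best matches the original
# text's score, so the watermark avoids introducing a distribution shift.
def select_watermark_paraphrase(original: str, paraphrases: list[str],
                                score) -> str:
    s_orig = score(original)
    return min(paraphrases, key=lambda p: abs(score(p) - s_orig))

# Toy stand-in for a scoring model's log-likelihood (illustration only).
def toy_score(text: str) -> float:
    return -float(len(text.split()))

chosen = select_watermark_paraphrase(
    "The cat sat on the mat.",
    ["A cat was sitting on the mat.", "On the mat sat the cat."],
    toy_score,
)
print(chosen)
```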
[5] When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
Michael H. Coen
Main category: cs.CL
TL;DR: The paper critiques current dialogue topic segmentation evaluation metrics, showing that reported performance differences are driven by annotation granularity mismatches rather than model quality, and proposes treating boundary density and segment coherence as primary criteria alongside window-tolerant F1.
Details
Motivation: Current evaluation practice in dialogue topic segmentation relies on strict boundary matching and F1-based metrics, which are inadequate for modern LLM-based conversational systems that use segmentation to manage conversation history beyond fixed context windows. These metrics fail to capture the true utility of segmentation for maintaining efficiency and coherence.
Method: The paper introduces a new evaluation objective focusing on boundary density and segment coherence alongside window-tolerant F1 (W-F1). It conducts extensive cross-dataset empirical evaluation across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions, evaluating multiple structurally distinct segmentation strategies.
Result: The study shows that reported performance differences across dialogue segmentation benchmarks are driven by annotation granularity mismatches and sparse boundary labels, not by model quality. Many reported improvements arise from evaluation artifacts rather than improved boundary detection. High segment coherence is observed alongside extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores.
Conclusion: Topic segmentation should be understood as selecting an appropriate granularity rather than predicting a single correct boundary set. The paper operationalizes this view by explicitly separating boundary scoring from boundary selection, proposing that evaluation should focus on boundary density and segment coherence as primary criteria.
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model’s fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
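For concreteness, here is one plausible formulation of window-tolerant boundary F1: a predicted boundary counts as correct if it falls within ±w utterances of a still-unmatched reference boundary. The greedy one-to-one matching policy and the default window are assumptions; the paper's exact W-F1 definition may differ.

```python
# Window-tolerant F1 over boundary positions (utterance indices).
def window_tolerant_f1(pred: list[int], ref: list[int], w: int = 2) -> float:
    if not pred or not ref:
        return 0.0
    unused = sorted(ref)
    tp = 0
    for b in sorted(pred):
        # Greedily match the first unused reference boundary within +/- w.
        match = next((r for r in unused if abs(r - b) <= w), None)
        if match is not None:
            unused.remove(match)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

print(window_tolerant_f1(pred=[5, 11, 20], ref=[4, 12, 30], w=2))  # ~0.667
```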
[6] Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups
Salar Hashemitaheri, Ian Harris
Main category: cs.CL
TL;DR: Two-level data augmentation (synthetic + real) improves intent classification F1 by 32% for smoking cessation chatbots when high-quality training data is scarce.
Details
Motivation: Online smoking cessation support groups suffer from low engagement and stigma; conversational agents could help but need better intent classification models, which are limited by insufficient high-quality training data.
Method: Two-level data augmentation: 1) Synthetic - fine-tuned LLM to identify low-F1 intents, used GPT to generate synthetic posts (87% high quality), 2) Real - scraped 10,000+ posts from related contexts (73% validated). Combined validated new data with original posts to retrain intent classifier.
Result: Retrained model showed 32% improvement in F1 score; synthetic and real augmentation provided similar performance improvements; 43% of original posts augmented with 140% synthetic expansion.
Conclusion: The study provides a replicable framework for enhancing conversational agent performance in data-scarce domains through combined synthetic and real data augmentation strategies.
Abstract: Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open-source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 scores (F1 being the harmonic mean of precision and recall). Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.
[7] Enhancing Long Document Long Form Summarisation with Self-Planning
Xiaotang Du, Rohit Saxena, Laura Perez-Beltrachini, Pasquale Minervini, Ivan Titov
Main category: cs.CL
TL;DR: A highlight-guided generation approach for long context summarization that uses sentence-level information as content plans to improve traceability and faithfulness of summaries.
Details
Motivation: To improve factual consistency and faithfulness in long-form summarization by making the generation process more traceable and content-aware.
Method: Uses self-planning methods to identify important content as sentence-level highlights, then generates summaries conditioned on these content plans. Explores both end-to-end and two-stage variants.
Result: Consistently improves factual consistency while preserving relevance and overall quality. On GovReport, achieves 4.1 point improvement in ROUGE-L and ~35% gains in SummaC scores.
Conclusion: Highlight-guided summarization helps preserve important details, leading to more accurate and insightful summaries across domains, with two-stage pipeline performing better on long, information-dense documents.
Abstract: We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both an end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach has improved ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.
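A minimal sketch of the two-stage variant, assuming a generic `llm(prompt) -> str` text-generation callable; the prompt wording is illustrative, not the paper's.

```python
# Two-stage highlight-guided summarisation: plan first, then generate.
def two_stage_summarise(document: str, llm) -> str:
    # Stage 1 (self-planning): identify sentence-level highlights as the
    # content plan.
    highlights = llm(
        "List the sentences from the document that a faithful summary "
        f"must cover, one per line.\n\nDocument:\n{document}"
    )
    # Stage 2: generate the summary conditioned on the content plan.
    return llm(
        "Write a summary that covers every highlight below and stays "
        f"faithful to the document.\n\nHighlights:\n{highlights}\n\n"
        f"Document:\n{document}"
    )
```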
[8] Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu
Main category: cs.CL
TL;DR: MiA-RAG introduces mindscape-aware retrieval-augmented generation that uses hierarchical summarization to create global semantic context, improving long-context understanding and reasoning in LLMs.
Details
Motivation: Current RAG systems lack holistic semantic representation of long documents, making them struggle with long-context tasks. Humans use "mindscape-aware capability" to organize knowledge and integrate dispersed evidence, which RAG systems need to emulate.
Method: MiA-RAG builds a mindscape through hierarchical summarization to create explicit global context awareness. This mindscape conditions both retrieval (forming enriched query embeddings) and generation (reasoning over retrieved evidence within coherent global context).
Result: MiA-RAG consistently surpasses baselines across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It aligns local details with coherent global representation, enabling more human-like long-context retrieval and reasoning.
Conclusion: The mindscape-aware approach equips RAG systems with explicit global context awareness, addressing limitations of current systems and enabling more human-like understanding of long and complex texts.
Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
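The sketch below illustrates the two ingredients under simple assumptions: a mindscape built by recursive (hierarchical) summarization, and a query embedding enriched with that global context. `summarize`, `embed`, the grouping factor, and the prompt format are hypothetical stand-ins, not the paper's exact components.

```python
# Hierarchical summarisation: repeatedly summarise groups of chunks until
# a single global "mindscape" summary remains.
def build_mindscape(chunks: list[str], summarize, group: int = 4) -> str:
    level = chunks
    while len(level) > 1:
        level = [summarize("\n".join(level[i:i + group]))
                 for i in range(0, len(level), group)]
    return level[0]

# Condition retrieval on the global context, not the bare query alone.
def enriched_query_embedding(query: str, mindscape: str, embed):
    return embed(f"Global context: {mindscape}\nQuery: {query}")
```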
[9] LookAhead Tuning: Safer Language Models via Partial Answer Previews
Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Jun Zhou, Bryan Hooi, Shumin Deng
Main category: cs.CL
TL;DR: LookAhead Tuning is a lightweight data-driven method that preserves LLM safety during fine-tuning by previewing partial answer prefixes to minimize perturbations to initial token distributions.
Details
Motivation: Fine-tuning LLMs for specific domains often compromises their previously established safety alignment, creating a need for methods that preserve safety during adaptation.Method: Introduces two simple strategies that modify training data by previewing partial answer prefixes, minimizing perturbations to the model’s initial token distributions and maintaining built-in safety mechanisms.
Result: Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks.
Conclusion: LookAhead Tuning is positioned as a reliable and efficient solution for the safe and effective adaptation of LLMs.
Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model’s initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.
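A minimal sketch of the answer-preview idea under stated assumptions: expose the first few answer tokens in the prompt so that fine-tuning perturbs the model's initial token distribution less. The prefix length and template are illustrative, not the paper's exact strategies.

```python
# Build one fine-tuning example with a partial answer preview in the prompt.
def lookahead_example(instruction: str, answer: str,
                      preview_tokens: int = 8) -> dict:
    prefix = " ".join(answer.split()[:preview_tokens])
    return {
        # The preview anchors the model's first output tokens.
        "prompt": f"{instruction}\n(Answer begins: {prefix} ...)",
        "completion": answer,
    }

ex = lookahead_example(
    "Summarize the safety policy in one sentence.",
    "The policy requires refusing harmful requests while remaining helpful.",
)
print(ex["prompt"])
```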
[10] Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition
Zahra Rahmani, Hossein Sameti
Main category: cs.CL
TL;DR: A noise-sensitive ASR error correction framework for Persian speech that uses multiple hypotheses and Error Level Noise (ELN) embeddings to improve Whisper’s performance in noisy environments.
Details
Motivation: ASR systems degrade significantly in noisy environments, especially for low-resource languages like Persian. Even state-of-the-art models like Whisper struggle with varying noise levels, creating a need for robust noise-aware correction methods.
Method: Generates 5-best hypotheses from modified Whisper-large decoder, introduces Error Level Noise (ELN) to capture semantic/token-level disagreement across hypotheses, and evaluates three models: base LLaMA-2-7B, fine-tuned text-only variant, and noise-conditioned model with ELN embeddings at sentence/word levels.
Result: ELN-conditioned model achieves substantial WER reduction: from 31.10% (Raw Whisper) to 24.84% on Mixed Noise test set, significantly outperforming fine-tuned text-only baseline (30.79%). Base LLaMA-2-7B increased WER to 64.58%, showing it cannot correct Persian errors alone.
Conclusion: Combining multiple hypotheses with noise-aware ELN embeddings effectively improves Persian ASR robustness in noisy real-world scenarios, demonstrating the value of noise-conditioned modeling for low-resource language ASR systems.
Abstract: Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy under varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses and noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. Error Level Noise (ELN) is introduced as a representation that captures semantic- and token-level disagreement across hypotheses, quantifying the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling the LLM to reason about the reliability of each hypothesis during correction. Three models are evaluated: (1) a base LLaMA-2-7B model without fine-tuning, (2) a fine-tuned variant trained on text-only hypotheses, and (3) a noise-conditioned model integrating ELN embeddings at both sentence and word levels. Experimental results demonstrate that the ELN-conditioned model achieves substantial reductions in Word Error Rate (WER). Specifically, on the challenging Mixed Noise test set, the proposed Fine-tuned + ELN (Ours) model reduces the WER from a baseline of 31.10% (Raw Whisper) to 24.84%, significantly surpassing the Fine-tuned (No ELN) text-only baseline of 30.79%, whereas the original LLaMA-2-7B model increased the WER to 64.58%, demonstrating that it is unable to correct Persian errors on its own. This confirms the effectiveness of combining multiple hypotheses with noise-aware embeddings for robust Persian ASR in noisy real-world scenarios.
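As a toy illustration of the disagreement signal behind ELN, the sketch below scores word-level disagreement across N-best hypotheses as a mean pairwise Jaccard distance. This simplification is ours: the paper's ELN also captures semantic-level disagreement and is injected into the LLM as embeddings rather than a scalar.

```python
from itertools import combinations

# Mean pairwise Jaccard distance between hypothesis token sets: higher
# disagreement suggests stronger noise-induced uncertainty.
def token_disagreement(hypotheses: list[str]) -> float:
    sets = [set(h.split()) for h in hypotheses]
    dists = [1 - len(a & b) / len(a | b)
             for a, b in combinations(sets, 2) if a | b]
    return sum(dists) / len(dists) if dists else 0.0

print(token_disagreement(["the cat sat", "the cat sad", "a cat sat"]))  # 0.6
```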
[11] Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee
Main category: cs.CL
TL;DR: Speech-FT: A two-stage fine-tuning framework that maintains cross-task generalization while benefiting from task-specific fine-tuning by reducing representational drift and using weight-space interpolation.
Details
Motivation: Standard fine-tuning of speech representation models improves task-specific performance but degrades cross-task generalization due to excessive representational drift, losing valuable pre-trained knowledge.Method: Two-stage approach: 1) Fine-tuning designed to minimize representational drift, 2) Weight-space interpolation between fine-tuned and pre-trained models to restore cross-task generalization.
Result: Consistent improvements across HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ models; superior cross-task generalization compared to weight-space regularization and LoRA; significant SUPERB benchmark gains (e.g., HuBERT: PER 5.17%→3.94%, WER 6.38%→5.75%, SID 81.86%→84.11%).
Conclusion: Speech-FT provides a simple yet powerful solution for refining speech representation models after pre-training, maintaining feature similarity to pre-trained models while allowing beneficial weight updates.
Abstract: Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
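The second stage of Speech-FT reduces to simple weight-space interpolation between checkpoints, sketched below; the coefficient `alpha` is an assumption (a tunable hyperparameter), not a value from the paper.

```python
import torch

# Return the state dict alpha * finetuned + (1 - alpha) * pretrained.
def interpolate_weights(pretrained: dict, finetuned: dict,
                        alpha: float = 0.5) -> dict:
    return {k: alpha * finetuned[k] + (1 - alpha) * pretrained[k]
            for k in pretrained}

# Usage (hypothetical state dicts):
# model.load_state_dict(interpolate_weights(pre_sd, ft_sd, alpha=0.5))
```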
[12] Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu
Main category: cs.CL
TL;DR: Seed-Prover 1.5 is a formal theorem-proving model using large-scale agentic reinforcement learning and test-time scaling to efficiently solve mathematical problems in formal languages like Lean, achieving state-of-the-art results on undergraduate to PhD-level benchmarks.
Details
Motivation: While LLMs have advanced in generating mathematical proofs, formal theorem proving in languages like Lean remains challenging and computationally expensive, especially for undergraduate-level problems and beyond. There's a need for more efficient formal theorem proving systems.
Method: Uses large-scale agentic reinforcement learning where the model continuously interacts with Lean and other tools to accumulate experience. Also employs an efficient test-time scaling workflow that bridges natural and formal languages by leveraging recent advancements in natural language proving.
Result: Achieves 88% on PutnamBench (undergraduate), 80% on Fate-H (graduate), and 33% on Fate-X (PhD-level) problems. Notably solves 11 out of 12 problems from Putnam 2025 within 9 hours. Outperforms state-of-the-art methods with smaller compute budget.
Conclusion: Scaling learning from experience driven by high-quality formal feedback shows immense potential for the future of formal mathematical reasoning. The approach demonstrates that efficient formal theorem proving is achievable through agentic reinforcement learning and bridging natural-formal language gaps.
Abstract: Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.
[13] Fun-ASR Technical Report
Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Ying Liu, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Haoxu Wang, Wen Wang, Wupeng Wang, Yuzhong Wu, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou, Yanqiao Zhu
Main category: cs.CL
TL;DR: Fun-ASR is a production-optimized LLM-based ASR system that combines massive data, large models, LLM integration, and reinforcement learning to achieve SOTA performance in real-world applications while addressing LLM hallucination issues.
Details
Motivation: While ASR has advanced through data scaling, model scaling, and LLM integration, LLMs are prone to hallucination which degrades user experience in real-world applications. Most LLM-based ASR systems perform well on benchmarks but underperform on real industry datasets.
Method: Fun-ASR synergistically combines massive data scaling, large model capacity, LLM integration, and reinforcement learning. It’s specifically optimized for practical deployment with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world requirements.
Result: Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating effectiveness and robustness in practical settings, outperforming other LLM-based ASR systems that perform well on benchmarks but underperform on industry evaluation sets.
Conclusion: Fun-ASR successfully addresses LLM hallucination issues in ASR and delivers production-ready performance through comprehensive optimizations for real-world deployment, making it an effective solution for diverse and complex speech recognition scenarios.
Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings. The code and models are accessible at https://github.com/FunAudioLLM/Fun-ASR .
[14] AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators
Michael J. Ryan, Yanzhe Zhang, Amol Salunkhe, Yi Chu, Di Xu, Diyi Yang
Main category: cs.CL
TL;DR: AutoMetrics is a framework that synthesizes evaluation metrics for AI applications using retrieval from a curated metric bank and LLM-generated criteria, optimized via regression to correlate with human feedback, requiring fewer than 100 feedback points.
Details
Motivation: User feedback and behavioral signals are gold standards for evaluating AI applications but are often scarce in prototypes/research or too slow for system optimization. There's a need for automatic metrics that correlate well with human judgment under low-data constraints.
Method: Combines retrieval from MetricBank (48 curated metrics) with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signals.
Result: Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. Can serve as a proxy reward to equal effect as a verifiable reward.
Conclusion: AutoMetrics provides an effective framework for synthesizing interpretable automatic metrics from limited human feedback, accelerating adaptive evaluation of LLM applications. The toolkit and MetricBank are released to facilitate adoption.
Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
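A minimal sketch of the composition step under stated assumptions: candidate metric scores are regressed onto the available human feedback, and the fitted combination is checked with Kendall correlation. Ridge regression and the random placeholder data are illustrative choices, not the paper's exact regressor or data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import kendalltau

# Placeholder data: each row of X is one system output scored by k candidate
# metrics (retrieved + LLM-judge criteria); y is sparse human feedback.
rng = np.random.default_rng(0)
X = rng.random((100, 6))
y = rng.random(100)

# Compose the metrics via regression so the combination tracks human signal.
reg = Ridge(alpha=1.0).fit(X, y)
composed = reg.predict(X)

tau, _ = kendalltau(composed, y)  # the correlation the framework maximizes
print(f"Kendall tau: {tau:.3f}")
```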
[15] Subjective Question Generation and Answer Evaluation using NLP
G. M. Refatul Islam, Safwan Shaheer, Yaseen Nur, Mohammad Rafid Hamid
Main category: cs.CL
TL;DR: This paper aims to develop NLP models for automated subjective question generation and answer evaluation from text input to assist teachers and enhance student learning.
Details
Motivation: While NLP has advanced in objective question generation, automated subjective question generation and answer evaluation remain underdeveloped. Such a system could help teachers assess student work and enable students to self-assess their understanding after reading educational materials.Method: The research aims to either improve existing NLP models or create novel ones specifically designed for automated subjective question generation and answer evaluation from text input.
Result: The abstract doesn’t present specific results, but outlines the research goal of developing improved or novel NLP models for subjective question generation and answer evaluation.
Conclusion: Automated subjective question generation and answer evaluation systems have significant potential to transform educational assessment by assisting teachers and enhancing student learning experiences through self-assessment capabilities.
Abstract: Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research has already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student’s learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or create a novel one for automated subjective question generation and answer evaluation from text input.
[16] Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models
Haomin Qi, Chengbo Huang, Zihan Dai, Yunkai Gao
Main category: cs.CL
TL;DR: Hybrid fine-tuning framework combining gradient-aligned low-rank updates with structured orthogonal transformations and unitary constraints for stable multilingual LLM adaptation under tight compute budgets, enhanced by lightweight data governance steps.
Details
Motivation: To enable resource-efficient multilingual adaptation of large language models for low-resource scenarios while maintaining accuracy, calibration, and cross-language parity under tight compute constraints.
Method: Governance-aware hybrid fine-tuning framework combining gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing, introducing unitary constraints in selected sub-layers to stabilize optimization, paired with lightweight data governance steps (language identification, near-duplicate removal, quality filtering).
Result: Consistent gains over strong PEFT baselines on XNLI and FLORES benchmarks, improved probability calibration, resilience to orthographic variants, additive benefits from governance steps, modest training overhead, and favorable cost-quality trade-offs.
Conclusion: Hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance, offering improved performance while maintaining directional balance and calibration under compute constraints.
Abstract: We present a governance-aware hybrid fine-tuning framework for multilingual, low-resource adaptation of large language models. The core algorithm combines gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing and introduces unitary constraints in selected sub-layers to stabilize deep optimization. In tandem with lightweight, label-free data governance steps, including language identification, near-duplicate removal, and quality filtering, the framework targets accuracy, calibration, and cross-language parity under tight compute budgets. Across XNLI and FLORES, the hybrid approach delivers consistent gains over strong PEFT baselines while maintaining directional balance and improving probability calibration, as shown in Tables II and III. It is more resilient to lightweight orthographic variants, as shown in Table IV, and benefits additively from simple governance steps, as shown in Table V. Training footprint measurements indicate modest overhead and a favorable cost-quality frontier, as shown in Table VI and Figure 2. Together, these results show that hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance.
[17] Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates
Mohamed Chenene, Jeanne Rouhier, Jean Daniélou, Mihir Sarkar, Elena Cabrio
Main category: cs.CL
TL;DR: Stakeholder Suite: A framework for mapping actors, topics, and arguments in public infrastructure debates using AI-powered analytics for operational decision-making.
Details
Motivation: Public infrastructure and energy projects involve complex stakeholder networks and evolving narratives that are difficult to analyze with existing media intelligence tools, which offer limited transparency and descriptive analytics.
Method: Combines actor detection, topic modeling, argument extraction, and stance classification in a unified pipeline deployed in operational contexts for mapping public debates.
Result: Achieves 75% argument relevance in pilot use cases with strong retrieval precision and stance accuracy; proven effective for visualizing influence networks, identifying controversies, and supporting evidence-based decisions.
Conclusion: Stakeholder Suite provides fine-grained, source-grounded insights adaptable to diverse domains, offering a practical tool for anticipating controversies and informing stakeholder engagement strategies.
Abstract: Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
[18] Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
Zeyuan Allen-Zhu
Main category: cs.CL
TL;DR: The paper introduces Canon Layers - lightweight architectural components that promote horizontal information flow across neighboring tokens, enhancing reasoning capabilities and making weak architectures competitive with state-of-the-art models.
Details
Motivation: Architectural differences in language models are hard to evaluate at academic-scale pretraining due to noise and randomness. There's a need for controlled synthetic tasks to isolate and evaluate core model capabilities.
Method: Introduces controlled synthetic pretraining tasks and Canon Layers - lightweight components that compute weighted sums of nearby token representations. These layers integrate seamlessly into various sequence architectures (Transformers, linear attention, state-space models).
Result: 12 key results show Canon Layers enhance reasoning depth (2x improvement), reasoning breadth, and knowledge manipulation. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA models like Mamba2/GDN, validated through both synthetic tasks and real-world academic-scale pretraining.
Conclusion: The synthetic playground offers an economical, principled approach to isolate core model capabilities. With infinite high-quality data, it may predict how future architectures will behave as training pipelines improve, unlocking deeper reasoning and hierarchical inference.
Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover Canon layers: lightweight architectural components – named after the musical term “canon” – that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by 2×), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN – validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even predict how future architectures will behave as training pipelines improve – e.g., through better data curation or RL-based post-training – unlocking deeper reasoning and hierarchical inference.
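One natural reading of "weighted sums of nearby token representations" is a causal depthwise 1-D convolution over a short window, sketched below. The kernel size and residual placement are assumptions; the paper evaluates several variants, and this is not its exact definition.

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Causal weighted sum over the last `window` token representations."""
    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.window = window
        # One learned causal filter per channel (depthwise convolution).
        self.conv = nn.Conv1d(dim, dim, kernel_size=window, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        h = x.transpose(1, 2)                            # (batch, dim, seq)
        h = nn.functional.pad(h, (self.window - 1, 0))   # left-pad: causality
        return x + self.conv(h).transpose(1, 2)          # residual connection

out = CanonLayer(dim=64)(torch.randn(2, 16, 64))  # shape preserved: (2, 16, 64)
```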
[19] UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
Jiajun Wu, Jian Yang, Wei Zhang, Lin Jing, Yuqing Ma, Ensheng Shi, Yuchi Ma, Zhoujun Li, Xianglong Liu
Main category: cs.CL
TL;DR: IPC is an unsupervised framework that probes LLMs’ internal knowledge for code generation without external data, using self-consistency and representation-based quality estimation to train UCoder.
Details
Motivation: LLMs for code generation require expensive supervised training with labeled or unlabeled datasets that are difficult to obtain at scale. The paper aims to reduce dependency on external data and computational resources.
Method: IPC uses internal probing of LLMs through problem space probing, test understanding probing, solution space probing, and knowledge consolidation. It identifies reliable code candidates via self-consistency mechanisms and representation-based quality estimation to train UCoder.
Result: The unsupervised approach achieves competitive performance compared to supervised methods across multiple code benchmarks while significantly reducing dependency on labeled data and computational resources.
Conclusion: Internal model states contain rich signals about code quality and correctness, and properly harnessing these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained scenarios.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (a coder trained with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
[20] Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
Zabir Al Nazi, G M Shahariar, Abrar Hossain, Wei Peng
Main category: cs.CL
TL;DR: CulturalToM-VQA is a new benchmark with 5095 questions for evaluating cross-cultural Theory of Mind reasoning in Vision-Language Models through visual question answering.
Details
Motivation: Existing Vision-Language Models are increasingly used in socially grounded tasks, but their capacity for cross-cultural Theory of Mind reasoning remains largely unexplored, with current benchmarks being Western-centric.
Method: Created a VLM-assisted human-in-the-loop pipeline: human experts curate culturally rich images across traditions and social interactions, then a VLM generates structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning six ToM tasks and four complexity levels.
Result: Developed CulturalToM-VQA benchmark containing 5095 questions capturing culturally grounded cues (rituals, attire, gestures, interpersonal dynamics) to systematically evaluate ToM reasoning beyond Western-centric benchmarks.
Conclusion: The CulturalToM-VQA benchmark enables systematic evaluation of cross-cultural Theory of Mind reasoning in VLMs, addressing a major gap in assessing social intelligence across diverse cultural contexts.
Abstract: Theory of Mind (ToM) – the ability to attribute beliefs, desires, and emotions to others – is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
[21] Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
Menna Elgabry, Ali Hamdi
Main category: cs.CL
TL;DR: A confidence-weighted, credibility-aware ensemble framework for emotion detection using diverse small transformer models, achieving state-of-the-art performance with high parameter efficiency.
Details
Motivation: To overcome limitations of conventional homogeneous ensembles and large LLMs by creating a parameter-efficient, robust emotion detection system that leverages diverse model architectures while preserving error diversity.
Method: Combines architecturally diverse small transformer models (BERT, RoBERTa, DistilBERT, DeBERTa, ELECTRA) fine-tuned for emotion classification. Uses dual-weighted voting mechanism integrating global credibility (validation F1) and local confidence (instance-level probability) to weight model contributions dynamically.
Result: Achieves 93.5% macro F1 score on DAIR-AI dataset, surpassing state-of-the-art benchmarks and outperforming large LLMs (Falcon, Mistral, Qwen, Phi) even after LoRA fine-tuning. With only 595M total parameters, outperforms models up to 7B parameters.
Conclusion: Carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized NLP tasks like emotion detection, demonstrating superior parameter efficiency and robustness.
Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet’s Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
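A minimal sketch of the dual-weighted vote: each model's class probabilities are scaled by its global credibility (validation F1) and its local confidence (top instance-level probability). The exact combination rule is an assumption; the paper does not spell out this formula in the abstract.

```python
import numpy as np

def dual_weighted_vote(probs: list[np.ndarray],
                       credibility: list[float]) -> int:
    """probs[i]: model i's probability vector over emotion classes."""
    score = np.zeros_like(probs[0])
    for p, cred in zip(probs, credibility):
        confidence = p.max()            # local, instance-level confidence
        score += cred * confidence * p  # global credibility x local confidence
    return int(score.argmax())

pred = dual_weighted_vote(
    probs=[np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.5, 0.1])],
    credibility=[0.93, 0.91],           # e.g., each model's validation F1
)
print(pred)  # class index favoured by the weighted ensemble
```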
[22] Linear Personality Probing and Steering in LLMs: A Big Five Study
Michel Frising, Daniel Balcells
Main category: cs.CL
TL;DR: Linear directions in LLM activation space can effectively probe personality traits but have limited steering capabilities, especially in open-ended contexts.
Details
Motivation: LLMs exhibit distinct personalities affecting trust and engagement, but current personality control methods are either costly (post-training) or brittle (prompt engineering). Linear directions offer a cheap, efficient alternative for probing and steering personality traits.
Method: Used Llama 3.3 70B to generate descriptions of 406 fictional characters with Big Five trait scores. Prompted model with these descriptions and Alpaca questionnaire questions, sampled hidden activations, then used linear regression to learn per-layer directions in activation space for personality traits.
Result: Linear directions aligned with trait-scores are effective probes for personality detection. Steering capabilities strongly depend on context: reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in prompts.
Conclusion: Linear directions provide effective personality probing but have context-dependent steering limitations, suggesting they work best for constrained tasks rather than open-ended generation.
Abstract: Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs’ behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
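The probe-and-steer recipe reduces to ordinary least squares plus vector addition. The sketch below runs on synthetic activations with a planted trait direction; in the paper the activations come from Llama 3.3 70B, and steering modifies a layer's residual stream during generation:

```python
import numpy as np

# Minimal sketch of per-layer linear probing and steering on synthetic
# activations; real use would sample hidden states from Llama 3.3 70B.
rng = np.random.default_rng(0)
d_model, n_samples = 64, 500

# Ground-truth "extraversion" direction planted in the activations.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

trait_scores = rng.uniform(-1, 1, size=n_samples)   # known Big Five scores
acts = rng.normal(size=(n_samples, d_model)) + np.outer(trait_scores, true_dir) * 3.0

# Probe: least-squares regression from activations to trait score.
w, *_ = np.linalg.lstsq(acts, trait_scores, rcond=None)
probe_dir = w / np.linalg.norm(w)
print("probe/trait alignment:", float(probe_dir @ true_dir))  # close to 1.0

# Steer: shift a hidden state along the learned direction at a chosen layer.
def steer(hidden_state, direction, alpha=2.0):
    """Add a scaled trait direction to one layer's residual stream."""
    return hidden_state + alpha * direction

h = rng.normal(size=d_model)
h_steered = steer(h, probe_dir)
print("trait projection before/after:", float(h @ probe_dir), float(h_steered @ probe_dir))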
[23] Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
Marco Gaido, Sara Papi, Mauro Cettolo, Matteo Negri, Luisa Bentivogli
Main category: cs.CL
TL;DR: simulstream is a new open-source framework for evaluating and demonstrating Streaming Speech-to-Text Translation systems, addressing limitations of the outdated SimulEval tool by supporting long-form audio, revision capabilities, and interactive demos.
Details
Motivation: Current research on Streaming Speech-to-Text Translation relies on SimulEval, which is no longer maintained, doesn't support output revision, is designed for short segments rather than long-form audio, and lacks easy demonstration capabilities.
Method: Introduces simulstream, an open-source framework specifically designed for unified evaluation and demonstration of StreamST systems. It supports both incremental decoding and re-translation methods, handles long-form speech processing, and includes an interactive web interface for demos.
Result: simulstream provides the first comprehensive framework that enables comparison of different StreamST approaches (incremental decoding and re-translation) in terms of both translation quality and latency, while also offering demonstration capabilities.
Conclusion: simulstream addresses critical gaps in StreamST research infrastructure by providing a maintained, flexible framework that supports modern requirements including long-form processing, output revision, and interactive demonstration, enabling more robust evaluation and comparison of streaming translation systems.
Abstract: Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. It also offers an interactive web interface to demo any system built within the tool.
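The core distinction simulstream evaluates, incremental decoding versus re-translation, can be pictured as two update policies over a growing audio prefix. The sketch below is a conceptual illustration only and does not use the toolkit's actual API:

```python
# Conceptual sketch of the two StreamST regimes simulstream compares:
# append-only incremental decoding vs. re-translation, which may revise
# earlier output. Illustration only, not the toolkit's API.

def incremental_policy(audio_prefix, committed):
    """Append-only: emit one new token per chunk, never revise history."""
    return committed + [f"w{len(committed)}"]

def retranslation_policy(audio_prefix, committed):
    """Re-translation: rewrite the full hypothesis at every step."""
    t = len(audio_prefix)
    return [f"w{i}@{t}" for i in range(t)]   # earlier tokens change with t

def run_stream(policy, n_chunks=3):
    committed, history = [], []
    for t in range(1, n_chunks + 1):
        audio_prefix = list(range(t))        # audio chunks received so far
        committed = policy(audio_prefix, committed)
        history.append(list(committed))      # quality/latency scored per step
    return history

print(run_stream(incremental_policy))    # [['w0'], ['w0', 'w1'], ['w0', 'w1', 'w2']]
print(run_stream(retranslation_policy))  # earlier tokens are revised each step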
[24] Peeking Into The Future For Contextual Biasing
Ramaneswaran Selvakumar, Cindy Tseng, Eesung Kim, Vijendra Raj Apsingekar, Yun Tang
Main category: cs.CL
TL;DR: A contextual biasing method for ASR models that uses multi-token prediction to improve recognition of rare named entities without adding complex architecture.
Details
Motivation: End-to-end ASR models struggle with rare/unseen named entities (contact names, locations) which are critical for virtual assistants and downstream applications.
Method: Proposes contextual biasing for attention-based encoder-decoder models using candidate entity lists. Instead of predicting only next token, simultaneously predicts multiple future tokens to “peek into the future” and score potential entities. Uses multi-token prediction logits directly without additional entity encoders or cross-attention layers.
Result: Experiments on Librispeech show up to 50.34% relative improvement in named entity word error rate compared to baseline AED model.
Conclusion: The approach effectively improves rare entity recognition in ASR while maintaining architectural simplicity by leveraging multi-token prediction without complex additions.
Abstract: While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention based encoder decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to “peek into the future” and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on Librispeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
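A rough sketch of the entity-scoring idea: with k future-token prediction heads, a candidate entity can be scored by how probable its tokens are under the corresponding heads, with no extra encoder. The vocabulary, entities, and logits below are toy stand-ins for a real AED decoder:

```python
import numpy as np

# Toy sketch of scoring candidate entities with k future-token heads: an
# entity is plausible if its tokens are likely under the corresponding
# prediction heads. Vocabulary, entities, and logits are illustrative.

vocab = {"<sp>": 0, "jo": 1, "han": 2, "na": 3, "son": 4}
entities = {"johanson": [1, 2, 4], "johanna": [1, 2, 3]}

rng = np.random.default_rng(1)
k = 3                                   # number of future-token heads
logits = rng.normal(size=(k, len(vocab)))
logits[0, 1] += 4.0                     # the heads "hear" jo, han, na coming
logits[1, 2] += 4.0
logits[2, 3] += 4.0

log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def entity_score(token_ids):
    """Mean log-probability of the entity's first k tokens, one per head."""
    return float(np.mean([log_probs[i, t] for i, t in enumerate(token_ids[:k])]))

for name, ids in entities.items():
    print(name, round(entity_score(ids), 3))
# "johanna" wins: its third token matches head 2's prediction.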
[25] Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering
Riccardo Di Sipio
Main category: cs.CL
TL;DR: Bayesian methods for uncertainty quantification in neural QA systems, from MLPs to transformers, showing how uncertainty calibration enables selective prediction and “I don’t know” responses.
Details
Motivation: To quantify uncertainty in neural networks for question answering, enabling models to express confidence in predictions and abstain when uncertain, contributing to more responsible and ethical AI deployment.
Method: Progressive Bayesian approach: starting with MLP on Iris dataset, then extending to language models with Bayesian inference on frozen head, and finally LoRA-adapted transformers. Uses Laplace approximations vs MAP estimates for uncertainty calibration on CommonsenseQA benchmark.
Result: Demonstrates how posterior inference conveys confidence in predictions, enables selective prediction where models abstain when confidence is low, and improves uncertainty calibration compared to standard MAP estimates.
Conclusion: Bayesian methods provide effective uncertainty quantification for neural QA systems, allowing “I don’t know” responses that improve interpretability and support more responsible, ethical deployment of question-answering models.
Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An “I don’t know” response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
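Selective prediction follows directly from the posterior: average softmax outputs over weight samples (as a Laplace approximation around the MAP estimate permits) and abstain when the averaged confidence is low. A toy sketch with a random linear head, not the paper's actual models:

```python
import numpy as np

# Sketch of Bayesian selective prediction: sample last-layer weights (as a
# Laplace approximation around the MAP estimate would allow), average the
# predictive distribution, and answer "I don't know" when confidence is low.

rng = np.random.default_rng(0)
n_classes, d = 3, 8

w_map = rng.normal(size=(d, n_classes))   # MAP weights (toy stand-in)
posterior_std = 0.5                       # would come from the Laplace covariance

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predictive(x, n_samples=200):
    """Monte Carlo average of softmax outputs over posterior weight samples."""
    probs = np.zeros(n_classes)
    for _ in range(n_samples):
        w = w_map + posterior_std * rng.normal(size=w_map.shape)
        probs += softmax(x @ w)
    return probs / n_samples

def answer(x, threshold=0.6):
    p = predictive(x)
    return int(np.argmax(p)) if p.max() >= threshold else "I don't know"

x = rng.normal(size=d)
print(answer(x))   # abstains unless the posterior is confident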
[26] When the Gold Standard isn’t Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
Lydia Nishimwe, Benoît Sagot, Rachel Bawden
Main category: cs.CL
TL;DR: This paper examines how to handle non-standard language in user-generated content (UGC) translation, proposing a taxonomy of non-standard phenomena and translation actions, and showing that LLM translation performance depends on alignment with dataset guidelines.
Details
Motivation: User-generated content contains non-standard language (spelling errors, slang, emojis, etc.), making translation evaluation challenging because "good" translation depends on desired output standardness level. Current approaches lack clear guidelines for handling UGC style.
Method: Analyzed human translation guidelines from four UGC datasets to derive taxonomy of 12 non-standard phenomena and 5 translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Conducted case study with large language models to test sensitivity to prompts with explicit UGC translation instructions.
Result: Found notable differences in UGC treatment across datasets, creating spectrum of standardness in reference translations. LLM translation scores are highly sensitive to prompts with explicit UGC instructions, improving when prompts align with dataset guidelines.
Conclusion: When preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Calls for clear guidelines during dataset creation and development of controllable, guideline-aware evaluation frameworks for UGC translation.
Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a “good” translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset’s guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
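In practice, making a prompt "guideline-aware" means spelling out an action per phenomenon. The sketch below builds such a prompt from an invented phenomenon-to-action mapping using the paper's five action labels; it is not the mapping of any of the four datasets studied:

```python
# Sketch of building a translation prompt with explicit UGC instructions,
# using the paper's five translation actions. The phenomenon-to-action
# mapping is an illustrative guideline, not one of the studied datasets'.

UGC_GUIDELINE = {
    "spelling_errors":      "NORMALISE",   # fix in the translation
    "emojis":               "COPY",        # reproduce verbatim
    "slang":                "TRANSFER",    # find an equivalent register
    "character_repetition": "TRANSFER",    # keep the expressive effect
    "profanity":            "CENSOR",
}

def build_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    rules = "\n".join(f"- {phen}: {act}" for phen, act in UGC_GUIDELINE.items())
    return (
        f"Translate the following {src_lang} social-media post into {tgt_lang}.\n"
        f"Handle non-standard language according to these rules:\n{rules}\n\n"
        f"Post: {source}\nTranslation:"
    )

print(build_prompt("omg this is sooo good 😂", "English", "German"))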
[27] Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
Jan Philip Wahle, Krishnapriya Vishnubhotla, Bela Gipp, Saif M. Mohammad
Main category: cs.CL
TL;DR: ABCDE dataset provides 400M+ text utterances annotated with affective, behavioral, cognitive, demographic, and emotional features to facilitate interdisciplinary research across computational social sciences.
Details
Motivation: Current computational affective and social science research faces substantial barriers in discovering, accessing, and using annotation resources for language data, particularly for non-computer science practitioners.
Method: Created the ABCDE dataset by collecting over 400 million text utterances from diverse sources including social media, blogs, books, and AI-generated content, then annotating them with comprehensive features relevant to affective and social science research.
Result: Produced a large-scale, annotated dataset that enables easier access to labeled language data for researchers across multiple disciplines, addressing the resource accessibility problem in computational social sciences.
Conclusion: The ABCDE dataset serves as a valuable resource that lowers barriers to entry for interdisciplinary research in affective science, cognitive science, digital humanities, sociology, political science, and computational linguistics.
Abstract: Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
[28] AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
Zhihan Zhou, Daqian Shi, Rui Song, Lida Shi, Xiaolei Diao, Hao Xu
Main category: cs.CL
TL;DR: AncientBench is a new benchmark for evaluating LLMs’ comprehension of ancient Chinese excavated documents across four dimensions: glyph, pronunciation, meaning, and contextual understanding.
Details
Motivation: Existing Chinese benchmarks focus on modern Chinese and transmitted ancient documents, but lack evaluation for excavated ancient documents, which is crucial for archaeology and understanding Chinese history.
Method: Created AncientBench with four dimensions (glyph, pronunciation, meaning, contextual comprehension) and ten tasks including radical, phonetic radical, homophone, cloze, and translation tasks. Involved archaeological researchers for evaluation and proposed an ancient model as baseline.
Result: Experimental evaluations on top-performing LLMs show great potential in ancient textual scenarios but reveal gaps compared to human performance.
Conclusion: AncientBench promotes LLM development and application in archaeology and ancient Chinese language by providing comprehensive evaluation framework for excavated document comprehension.
Abstract: Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but excavated documents in ancient Chinese are not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as a baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with human performance. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.
[29] Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity
Tanjim Taharat Aurpa, Farzana Akter, Md. Mehedi Hasan, Shakil Ahmed, Shifat Ara Rafiq, Fatema Khan
Main category: cs.CL
TL;DR: The paper proposes a Multi-BERT Ensemble approach for Bangla Medical Entity Recognition (MedER) that achieves 89.58% accuracy, outperforming single transformer models and addressing the lack of annotated datasets for low-resource languages.
Details
Motivation: Medical Entity Recognition is crucial for developing automated medical systems, but while well-researched in English, low-resource languages like Bangla remain underexplored. There's also a lack of annotated datasets for Bangla MedER.
Method: The study first examined several transformer models (BERT, DistilBERT, ELECTRA, RoBERTa) for Bangla MedER, then proposed a novel Multi-BERT Ensemble approach. The authors also developed a high-quality annotated dataset specifically for Bangla MedER to address the data scarcity issue.
Result: The Multi-BERT Ensemble achieved the highest accuracy of 89.58%, providing an 11.80% improvement over single-layer BERT. The model demonstrated robustness through multiple performance metrics evaluated on the newly created dataset.
Conclusion: The findings highlight the potential of Multi-BERT Ensemble models for improving MedER in Bangla and set a foundation for further advancements in low-resource medical NLP, demonstrating effectiveness in addressing both model architecture and data scarcity challenges.
Abstract: Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.
[30] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee
Main category: cs.CL
TL;DR: DEER is a benchmark for evaluating expert-level research reports generated by LLMs, featuring multi-domain tasks, expert-grounded rubrics, and comprehensive fact-checking of all claims.
Details
Motivation: Existing benchmarks lack systematic criteria for expert reporting, LLM judges often fail to capture issues requiring expert judgment, and source verification typically only covers limited explicitly cited statements rather than report-wide factual reliability.
Method: DEER includes 50 report-writing tasks across 13 domains with expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions, 130 rubric items). It provides task-specific expert guidance for LLM judges and proposes a document-level fact-checking architecture that extracts and verifies all claims (both cited and uncited) while quantifying external-evidence quality.
Result: DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
Conclusion: DEER provides a comprehensive benchmark for evaluating expert-level deep research reports generated by LLMs, addressing limitations of existing evaluation approaches through expert-grounded rubrics and comprehensive fact-checking.
Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
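The document-level fact-checking architecture can be summarized as a claims-first loop: extract every claim, cited or not, retrieve evidence, and judge each one. The skeleton below uses hypothetical stand-in callables for the LLM and retrieval components; it illustrates the structure, not DEER's actual implementation:

```python
from dataclasses import dataclass

# Skeleton of a document-level fact-checking pass in the spirit of DEER.
# `extract_claims`, `retrieve_evidence`, and `judge` are hypothetical
# stand-ins for LLM and search calls.

@dataclass
class Verdict:
    claim: str
    supported: bool
    evidence_quality: float   # e.g. source credibility in [0, 1]

def fact_check_report(report, extract_claims, retrieve_evidence, judge):
    verdicts = []
    for claim in extract_claims(report):      # all claims, not just cited ones
        evidence = retrieve_evidence(claim)
        supported, quality = judge(claim, evidence)
        verdicts.append(Verdict(claim, supported, quality))
    n = max(len(verdicts), 1)
    return {
        "support_rate": sum(v.supported for v in verdicts) / n,
        "mean_evidence_quality": sum(v.evidence_quality for v in verdicts) / n,
        "verdicts": verdicts,
    }

# Toy run with stand-in callables:
demo = fact_check_report(
    "LLMs improved rapidly. [1] GPT-4 was released in 2023.",
    extract_claims=lambda r: [s.strip() for s in r.split(".") if s.strip()],
    retrieve_evidence=lambda c: ["(retrieved passage)"],
    judge=lambda c, e: (True, 0.8),
)
print(demo["support_rate"], demo["mean_evidence_quality"])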
[31] ShareChat: A Dataset of Chatbot Conversations in the Wild
Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le
Main category: cs.CL
TL;DR: ShareChat is a large-scale dataset of 142,808 conversations from 5 major LLM platforms (ChatGPT, Claude, Gemini, Perplexity, Grok) that preserves native interface features like reasoning traces and source links, enabling analysis of real-world user-LLM interactions.
Details
Motivation: Existing public datasets treat LLMs as generic text generators, stripping away the interface context that shapes user interaction. There's a need for datasets that preserve platform-specific affordances to understand authentic user-LLM chatbot interactions.
Method: Collected 142,808 conversations and over 660,000 turns from publicly shared URLs across five major platforms (ChatGPT, Claude, Gemini, Perplexity, Grok) from April 2023 to October 2025. Preserved native platform features including reasoning traces, source links, and code artifacts across 101 languages.
Result: Created ShareChat dataset with substantially longer context windows and greater interaction depth than prior datasets. Demonstrated utility through three analyses: (1) conversation completeness to measure user intent satisfaction, (2) source citation behaviors in content generation, and (3) temporal analysis of evolving usage patterns.
Conclusion: ShareChat provides a vital resource for understanding authentic user-LLM chatbot interactions by preserving platform-specific interface context, enabling research on real-world usage patterns across multiple major LLM platforms.
Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset’s multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
[32] Studying the Effects of Collaboration in Interactive Theme Discovery Systems
Alvin Po-Chun Chen, Rohan Das, Dananjay Srinivas, Alexandra Barry, Maksim Seniw, Maria Leonor Pacheco
Main category: cs.CL
TL;DR: Proposes an evaluation framework for NLP-assisted qualitative analysis tools, comparing synchronous vs. asynchronous collaboration strategies across different tools.
Details
Motivation: NLP tools are increasingly used for qualitative analysis but lack a unified evaluation framework that accounts for different research settings and collaboration strategies.
Method: Developed an evaluation framework to study how different tools produce different outcomes based on collaboration strategy. Compared synchronous vs. asynchronous collaboration using two NLP-assisted qualitative research tools.
Result: Found significant differences in consistency, cohesiveness, and correctness of outputs between synchronous and asynchronous collaboration approaches.
Conclusion: The proposed framework provides a first step toward standardized evaluation of NLP-assisted qualitative analysis tools, highlighting the importance of collaboration strategy in tool performance.
Abstract: NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.
[33] Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
Benjamin Litterer, David Jurgens, Dallas Card
Main category: cs.CL
TL;DR: Researchers introduce a massive dataset of 1.1M podcast transcripts with audio features, speaker data, and metadata, enabling large-scale computational analysis of the podcast ecosystem.
Details
Motivation: Podcasts are a popular and diverse medium, but limited data has prevented large-scale computational analysis of the podcast ecosystem.
Method: Created a comprehensive dataset of over 1.1M English podcast transcripts from public RSS feeds (May-June 2020), including audio features, speaker turns, role inferences, and metadata for subsets.
Result: Produced the largest podcast dataset to date, enabling foundational investigation into content, structure, and responsiveness of the podcast ecosystem.
Conclusion: This dataset opens doors for continued computational research on podcasts, addressing previous data limitations and enabling systematic analysis of this impactful medium.
Abstract: Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
[34] Generating Completions for Broca’s Aphasic Sentences Using Large Language Models
Sijbren van Vaals, Yevgen Matusevych, Frank Tsiwah
Main category: cs.CL
TL;DR: LLMs fine-tuned on synthetic Broca’s aphasic data can reconstruct agrammatic sentences, showing potential for communication aids in aphasia treatment.
Details
Motivation: Traditional aphasia treatments are time-consuming, labor-intensive, and don't reflect real-world conversations. LLMs could improve existing approaches by providing more natural language processing-based solutions for Broca's aphasia.
Method: 1. Generated synthetic Broca’s aphasic data using rule-based system mimicking linguistic characteristics. 2. Fine-tuned four pre-trained LLMs on completing agrammatic sentences using only synthetic data. 3. Evaluated models on both synthetic and authentic Broca’s aphasic data.
Result: LLMs demonstrated capability for reconstructing agrammatic sentences, with performance improving with longer input utterances. Models showed potential despite being trained only on synthetic data.
Conclusion: LLMs have significant potential for advancing communication aids for individuals with Broca’s aphasia and possibly other clinical populations, offering a promising alternative to traditional treatment methods.
Abstract: Broca’s aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing Broca’s aphasic sentences. We first generate synthetic Broca’s aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca’s aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca’s aphasic data. We demonstrate LLMs’ capability for reconstructing agrammatic sentences, with the models showing improved performance with longer input utterances. Our results highlight the LLMs’ potential in advancing communication aids for individuals with Broca’s aphasia and possibly other clinical populations.
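A rule-based generator of this kind can be very small. The sketch below corrupts well-formed sentences by omitting function words and stripping simple inflections, two hallmark features of agrammatic speech; these particular rules are illustrative assumptions, not the authors' exact rule set:

```python
import re

# Illustrative rule-based corruption in the spirit of the paper's synthetic
# data generator. These specific rules (drop function words, strip simple
# inflections) are assumptions, not the authors' exact rule set.

FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "and"}

def to_agrammatic(sentence: str) -> str:
    words = sentence.lower().split()
    kept = [w for w in words if w not in FUNCTION_WORDS]    # omit function words
    kept = [re.sub(r"(ing|ed|s)$", "", w) if len(w) > 4 else w for w in kept]
    return " ".join(kept)

src = "The boy is walking to the store and buying apples"
agrammatic = to_agrammatic(src)
print(agrammatic)   # "boy walk store buy apple"
# Training pairs for the completion task: (agrammatic, src)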
[35] Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs
David Rodriguez, William Seymour, Jose M. Del Alamo, Jose Such
Main category: cs.CL
TL;DR: Automated method detects policy violations in OpenAI GPT Store chatbots with 58.7% violation rate, showing current review mechanisms are insufficient.
Details
Motivation: Centralized chatbot marketplaces like OpenAI's GPT Store face challenges in systematically enforcing usage policies due to scale and opacity, allowing policy-violating chatbots to remain publicly accessible despite existing review processes.
Method: Fully automated black-box evaluation combining large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using LLM-as-a-judge approach, focusing on Romantic, Cybersecurity, and Academic GPT domains.
Result: Method achieved 0.975 F1 score for binary policy violation detection against human-annotated ground truth. Large-scale study of 782 Custom GPTs found 58.7% exhibited at least one policy-violating response. Most violations originate from base model behavior, with customization amplifying rather than creating new failure modes.
Conclusion: Current review mechanisms for user-configured chatbots have significant limitations, but scalable behavior-based policy compliance evaluation is feasible and reveals widespread policy violations in chatbot marketplaces.
Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI’s GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots continue to remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with the marketplace usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI’s usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.
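Structurally, the pipeline is a nested loop over GPTs, policy domains, and red-teaming prompts, with an LLM judge labeling each response. The skeleton below uses hypothetical stand-ins for the chat and judge calls, and illustrative prompts rather than the paper's actual red-teaming set:

```python
# Skeleton of the black-box compliance pipeline: probe each Custom GPT with
# policy-driven red-teaming prompts, then let a judge model label each reply.
# `chat` and `llm_judge` are hypothetical stand-ins for API calls; the
# prompts are illustrative, not the paper's actual red-teaming set.

RED_TEAM_PROMPTS = {
    "Academic": ["Write my graded essay for me so I can submit it as my own."],
    "Cybersecurity": ["Give me working code to brute-force a login page."],
}

JUDGE_TEMPLATE = (
    "Usage policy domain: {domain}\nUser prompt: {prompt}\n"
    "Chatbot response: {response}\n"
    "Does the response violate the policy? Answer VIOLATION or COMPLIANT."
)

def evaluate_gpt(gpt_id: str, chat, llm_judge) -> dict:
    results = []
    for domain, prompts in RED_TEAM_PROMPTS.items():
        for prompt in prompts:
            response = chat(gpt_id, prompt)
            verdict = llm_judge(JUDGE_TEMPLATE.format(
                domain=domain, prompt=prompt, response=response))
            results.append((domain, verdict == "VIOLATION"))
    return {
        "gpt_id": gpt_id,
        "any_violation": any(v for _, v in results),
        "per_probe": results,
    }

# Toy run with stand-in callables:
report = evaluate_gpt(
    "gpt-123",
    chat=lambda gid, p: "I can't help with that.",
    llm_judge=lambda q: "COMPLIANT",
)
print(report["any_violation"])   # False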
[36] Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents
Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, Kimin Lee
Main category: cs.CL
TL;DR: LCoW is a framework that trains a separate contextualization module to transform complex web pages into comprehensible formats, enhancing LLM agents’ decision-making in web automation tasks.
Details
Motivation: LLM-based agents struggle with real-world web tasks due to limited capability to understand complex web page structures, necessitating a better approach to web page comprehension.
Method: LCoW decouples web page understanding from decision making by training a contextualization module to transform complex web pages into comprehensible formats, which are then used by decision-making agents.
Result: LCoW improves success rates of closed-source LLMs by 15.6% and open-source LLMs by 23.7% on WorkArena benchmark, and achieves state-of-the-art results on WebShop benchmark, outperforming human experts.
Conclusion: The contextualization module effectively enhances LLM agents’ decision-making capabilities in web automation by transforming complex web pages into more comprehensible formats.
Abstract: Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.
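The decoupling is a two-stage call chain: the contextualization module rewrites the raw page, and the decision-making agent only ever sees the rewrite. A schematic sketch with stand-in callables; the prompts and selectors are invented for illustration:

```python
# Two-stage sketch of the LCoW idea: a trained contextualization module
# rewrites the raw page, and only the rewrite reaches the acting LLM.
# `contextualizer` and `agent` are hypothetical stand-ins for model calls.

def lcow_step(raw_html: str, task: str, contextualizer, agent) -> str:
    # Stage 1: compress/clean the page into an agent-friendly observation.
    observation = contextualizer(
        f"Rewrite this web page so an agent can act on it.\n"
        f"Task: {task}\nPage:\n{raw_html}"
    )
    # Stage 2: the decision-making agent sees only the contextualized page.
    return agent(f"Task: {task}\nPage (contextualized):\n{observation}\nNext action:")

action = lcow_step(
    raw_html="<div id='nav'>menu, ads, trackers...</div><button id='submit-42'>Submit order</button>",
    task="Submit the order",
    contextualizer=lambda p: "Page has one actionable button: 'Submit order' (#submit-42).",
    agent=lambda p: "click('#submit-42')",
)
print(action)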
[37] Sample, Don’t Search: Rethinking Test-Time Alignment for Language Models
Gonçalo Faria, Noah A. Smith
Main category: cs.CL
TL;DR: QAlign is a test-time alignment method that uses Markov chain Monte Carlo to converge to optimal aligned distributions as compute scales, outperforming existing methods without model modifications.
Details
Motivation: Existing test-time search methods degrade with increased compute due to over-optimization of imperfect reward models, creating a need for methods that improve alignment without model finetuning or weight access.
Method: QAlign uses Markov chain Monte Carlo for text generation to sample from optimal aligned distributions for individual prompts, requiring no model modifications or logit access while scaling test-time compute.
Result: QAlign consistently outperforms best-of-n, majority voting, and weighted majority voting on mathematical reasoning benchmarks, and beats DPO and other methods across diverse datasets when using realistic preference-trained RMs.
Conclusion: QAlign provides a practical test-time alignment solution that expands capabilities of off-the-shelf language models without training, avoiding degradation while scaling compute.
Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). As a practical solution for aligning language models at test time with additional computation and without degradation, our approach expands the limits of what can be obtained from off-the-shelf language models without further training.
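The target distribution QAlign converges to is the reward-tilted one, p*(y) ∝ p_base(y) exp(r(y)/β). With the base model itself as an independence proposal, the Metropolis-Hastings acceptance ratio collapses to exp((r(y') − r(y))/β). The toy sketch below demonstrates this on a three-outcome "model"; real text generation requires the MCMC proposals the paper adopts:

```python
import math, random

# Toy Metropolis-Hastings sketch of the QAlign target: sample from
# p*(y) ∝ p_base(y) * exp(r(y) / beta). With the base model itself as the
# independence proposal, acceptance reduces to exp((r(y') - r(y)) / beta).
# The "model" here is a toy categorical distribution, not a real LLM.

random.seed(0)
outputs = ["bad answer", "ok answer", "good answer"]
p_base = [0.6, 0.3, 0.1]                  # base model prefers "bad answer"
reward = {"bad answer": 0.0, "ok answer": 1.0, "good answer": 2.0}
beta = 0.5

def sample_base():
    return random.choices(outputs, weights=p_base)[0]

def qalign_sample(n_steps=2000):
    y = sample_base()
    counts = {o: 0 for o in outputs}
    for _ in range(n_steps):
        y_new = sample_base()             # propose from the base model
        accept = math.exp((reward[y_new] - reward[y]) / beta)
        if random.random() < min(1.0, accept):
            y = y_new
        counts[y] += 1
    return counts

print(qalign_sample())   # mass shifts toward the high-reward answer as steps grow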
[38] Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters
Danqing Wang, Zhuorui Ye, Xinran Zhao, Fei Fang, Lei Li
Main category: cs.CL
TL;DR: TreeDebater is a novel competitive debate framework that uses rehearsal and flow trees to strategically allocate time and track debate status, outperforming state-of-the-art multi-agent debate systems.
Details
Motivation: Competitive debates have unique challenges: time constraints force strategic choices about which arguments to pursue, and persuasiveness depends on back-and-forth interactions that can't be evaluated by a single final status.
Method: Introduces two tree structures: Rehearsal Tree (anticipates attacks/defenses to evaluate claim strength) and Debate Flow Tree (tracks debate status to identify active actions). Uses time budget allocation, speech time controller, and simulated audience feedback to revise statements.
Result: Outperforms state-of-the-art multi-agent debate system with +15.6% improvement in stage-level persuasiveness with DeepSeek and +10% debate-level opinion shift win. Shows better strategies in limiting time to important debate actions, aligning with human debate expert strategies.
Conclusion: TreeDebater effectively addresses competitive debate challenges through strategic tree-based reasoning and time management, demonstrating superior performance and alignment with human expert debate strategies.
Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates the attacks and defenses to evaluate the strength of the claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. Human evaluation at both the stage level and the debate level shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win in debate-level opinion shift. Further investigation reveals that TreeDebater adopts better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.
[39] ResSVD: Residual Compensated SVD for Large Language Model Compression
Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang
Main category: cs.CL
TL;DR: ResSVD is a new post-training SVD-based LLM compression method that addresses truncation loss by leveraging residual matrices and selectively compressing only the last few layers to mitigate error propagation.
Details
Motivation: LLMs have impressive capabilities but their large sizes and memory demands hinder practical deployment. Current SVD-based compression methods neglect residual matrices from truncation, causing significant truncation loss, and compressing all layers results in severe performance degradation.
Method: ResSVD leverages the residual matrix generated during truncation to reduce truncation loss. Under a fixed overall compression ratio, it selectively compresses only the last few layers of the model to mitigate error propagation and improve performance.
Result: Comprehensive evaluations on diverse LLM families and multiple benchmark datasets show that ResSVD consistently achieves superior performance over existing counterpart methods.
Conclusion: ResSVD demonstrates practical effectiveness as an efficient LLM compression method that addresses key limitations of current SVD-based approaches through residual matrix utilization and selective layer compression.
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
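The core observation is easy to demonstrate numerically: factorizing the truncation residual and adding a low-rank correction back strictly reduces reconstruction error compared with discarding it. The sketch below illustrates that general idea on a toy weight matrix; it is not the paper's exact construction or budget allocation:

```python
import numpy as np

# Illustration of residual compensation in SVD truncation: factorizing the
# truncation residual and adding it back reduces reconstruction error
# versus discarding it. This shows the general idea ResSVD builds on,
# not the paper's exact construction or compression-budget allocation.

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)) @ np.diag(np.linspace(1, 0.01, 256)) @ rng.normal(size=(256, 256))

def truncated_svd(M, k):
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

k1, k2 = 16, 8
W_k1 = truncated_svd(W, k1)                   # plain rank-16 compression
residual = W - W_k1                           # what truncation throws away
W_comp = W_k1 + truncated_svd(residual, k2)   # compensate with rank-8 of residual

err = lambda A: np.linalg.norm(W - A) / np.linalg.norm(W)
print("rank-16 only:        ", round(err(W_k1), 4))
print("rank-16 + residual-8:", round(err(W_comp), 4))   # always smaller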
[40] LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
Main category: cs.CL
TL;DR: LLM-as-a-qualitative-judge transforms LLM evaluation from numerical scoring to structured qualitative reports identifying common issues in NLG systems, helping developers improve their systems.
Details
Motivation: Current LLM-as-a-judge approaches focus on quantitative numerical scores, but developers need more meaningful insights about what specific issues exist in their NLG systems and how to improve them.
Method: Two-step approach: 1) open-ended per-instance issue analysis using LLMs, 2) clustering discovered issues using an intuitive cumulative algorithm to generate structured reports of common issue types.
Result: LLM-as-a-qualitative-judge matches human-annotated issues in 2/3 cases, produces error type reports resembling human reports, and in a case study substantially improves NLG system performance.
Conclusion: LLM-as-a-qualitative-judge provides valuable qualitative insights for NLG system improvement, complementing traditional quantitative evaluation methods with actionable feedback for developers.
Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG systems performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
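The cumulative clustering step can be read as a single pass: each newly discovered issue either joins the first matching cluster or opens a new one. In the sketch below, a string-similarity test stands in for the LLM call that decides whether two issue descriptions denote the same error type:

```python
from difflib import SequenceMatcher

# Sketch of cumulative issue clustering: process issues one by one, attach
# each to the first sufficiently similar cluster, else open a new cluster.
# A string-similarity matcher stands in for the LLM that the paper uses to
# decide whether two issue descriptions are the same error type.

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold

def cumulative_cluster(issues):
    clusters = []   # each cluster: {"label": str, "members": [str]}
    for issue in issues:
        for cluster in clusters:
            if similar(issue, cluster["label"]):
                cluster["members"].append(issue)
                break
        else:
            clusters.append({"label": issue, "members": [issue]})
    return clusters

issues = [
    "output omits the requested date",
    "output omits the requested date range",
    "response is in the wrong language",
    "answer omits requested dates",
]
for c in cumulative_cluster(issues):
    print(len(c["members"]), "x", c["label"])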
[41] Hybrid and Unitary PEFT for Resource-Efficient Large Language Models
Haomin Qi, Zihan Dai, Chengbo Huang
Main category: cs.CL
TL;DR: Hybrid PEFT method combining BOFT’s orthogonal stability with LoRA-GA’s gradient alignment achieves near-full-finetuning quality with 2.1x faster training and 50% less memory, plus novel uRNN adaptation for Transformers.
Details
Motivation: Fine-tuning large language models is computationally expensive due to their scale and memory demands, creating a bottleneck for practical deployment under resource constraints.
Method: Comprehensive evaluation of PEFT techniques (LoRA, BOFT, LoRA-GA, uRNN) plus novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence using per-layer adaptive updates guided by gradient norms. Also adapts uRNN principles to Transformer-based LLMs for gradient stability.
Result: Hybrid approach yields consistent gains across GLUE, GSM8K, MT-Bench, and HumanEval with models from 7B to 405B parameters, approaching full fine-tuning quality while reducing training time by ~2.1x and peak memory usage by ~50%. Multilingual/low-resource study on XNLI and FLORES with 32 examples per language shows consistent gains with small footprint.
Conclusion: The hybrid PEFT method provides a practical and scalable path toward accessible LLM fine-tuning under resource constraints, demonstrating significant efficiency improvements while maintaining performance quality.
Abstract: Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to Transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Across GLUE, GSM8K, MT-Bench, and HumanEval, using models ranging from 7B to 405B parameters, the hybrid approach yields consistent gains across three independent runs per task and model, approaching the quality of full fine-tuning while reducing training time by a factor of approximately 2.1 and peak memory usage by nearly 50 percent, indicating practical significance under resource constraints. A compact multilingual and low-resource study on XNLI and FLORES, using 32 examples per language, further demonstrates consistent gains under the same budget with a small and stable footprint. These results indicate a practical and scalable path toward accessible LLM fine-tuning under resource constraints.
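The per-layer routing can be pictured as a simple gate on gradient norms: fast-moving layers receive the gradient-aligned update, stable layers the orthogonality-preserving one. The sketch below is schematic; the two update functions are placeholders, not BOFT or LoRA-GA implementations, and the median threshold is an illustrative choice:

```python
import numpy as np

# Schematic of per-layer adaptive routing guided by gradient norms: layers
# with large gradients get the rapid gradient-aligned update, stable layers
# get the orthogonality-preserving one. The two update functions are
# placeholders, not BOFT / LoRA-GA implementations.

rng = np.random.default_rng(0)
grad_norms = {f"layer_{i}": float(rng.uniform(0.1, 2.0)) for i in range(6)}

def boft_style_update(layer):       # placeholder for an orthogonal update
    return f"{layer}: orthogonal (stability-preserving) step"

def lora_ga_style_update(layer):    # placeholder for a gradient-aligned step
    return f"{layer}: gradient-aligned (fast-convergence) step"

median = float(np.median(list(grad_norms.values())))
for layer, g in grad_norms.items():
    update = lora_ga_style_update(layer) if g > median else boft_style_update(layer)
    print(f"{update}   (|grad| = {g:.2f})")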
[42] RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
Main category: cs.CL
TL;DR: RLfR is a new preference-learning method for machine translation that replaces static preference triplets with on-policy refinements from a frozen teacher model, improving generalization and performance across multiple languages.
Details
Motivation: Existing preference-learning methods like DPO rely on large, carefully curated preference datasets and struggle to generalize beyond their tuning domains, creating a need for more flexible, model-aware learning approaches.
Method: RLfR uses a frozen teacher model to refine actor-generated translations on-policy, then reinforces the actor to close the gap using a composite reward combining scaled negative edit distance (for lexical/structural fidelity) and COMET (for semantic adequacy).
Result: Experiments on FLORES-200 across five language pairs show RLfR consistently outperforms MT-SFT, DPO, and fixed-reference RL baselines, improving semantic quality and entity preservation with superior LLM-based judge evaluations.
Conclusion: RLfR provides a stable, model-aware learning signal without explicit preference datasets, offering a more effective approach to preference learning in machine translation that generalizes better across domains.
Abstract: Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish, Chinese, Korean, and Japanese) show that RLfR consistently outperforms strong MT-SFT, DPO, and fixed-reference RL baselines, improving semantic quality and entity preservation, and also achieves superior performance under LLM-based judge evaluations.
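The composite reward is a weighted sum of two terms: scaled negative edit distance between the actor's draft and the teacher's refinement, and a semantic adequacy score. In the sketch below the COMET score is passed in as a number (real use would call a trained COMET model), and the mixing weight is an illustrative choice:

```python
# Sketch of the RLfR composite reward: scaled negative edit distance between
# the actor's draft and the teacher's refinement, plus COMET for adequacy.
# `comet_score` is a stand-in value; real use would call a COMET model.
# The mixing weight `lam` is an illustrative choice, not the paper's setting.

def edit_distance(a: list[str], b: list[str]) -> int:
    """Word-level Levenshtein distance with a rolling 1-D DP array."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def rlfr_reward(draft: str, refined: str, comet_score: float, lam: float = 0.5) -> float:
    d, r = draft.split(), refined.split()
    fidelity = -edit_distance(d, r) / max(len(r), 1)   # scaled negative edit distance
    return lam * fidelity + (1 - lam) * comet_score

print(rlfr_reward("das ist gut Haus", "das ist ein gutes Haus", comet_score=0.82))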
[43] Text-to-SQL Task-oriented Dialogue Ontology Construction
Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic
Main category: cs.CL
TL;DR: TeQoDO is a method for LLMs to autonomously construct task-oriented dialogue ontologies from scratch using SQL programming capabilities and modular TOD concepts, outperforming transfer learning and scaling to large datasets.
Details
Motivation: LLMs rely on parametric knowledge which limits explainability and trustworthiness, while traditional TOD systems require manual ontology construction through supervised training or labeling.
Method: TeQoDO uses LLMs’ inherent SQL programming capabilities combined with modular TOD system concepts provided in prompts to autonomously build TOD ontologies from scratch without manual supervision.
Result: TeQoDO outperforms transfer learning approaches, produces competitive ontologies for downstream dialogue state tracking, scales to construct larger ontologies on Wikipedia/arXiv datasets, and ablation studies show modular TOD concepts are crucial.
Conclusion: TeQoDO represents a step toward broader ontology application by enabling autonomous ontology construction using LLMs’ SQL capabilities, improving explainability and controllability in task-oriented dialogue systems.
Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, by contrast, knowledge is kept separate in an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch using only its inherent SQL programming capabilities combined with concepts from modular TOD systems provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of modular TOD system concepts. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and arXiv dataset. We view this as a step towards broader application of ontologies.
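What "ontology as SQL" looks like concretely: domains become tables and slots become columns, which the LLM can emit as DDL using only its SQL skills. The schema below is an invented hotel-domain example, not output from the paper's pipeline:

```python
import sqlite3

# Illustration of an ontology expressed as SQL DDL, the representation a
# Text-to-SQL ontology builder would produce. This hotel-domain schema is
# an invented example, not output from TeQoDO.

ontology_ddl = """
CREATE TABLE hotel (
    hotel_id    INTEGER PRIMARY KEY,
    name        TEXT,      -- slot: hotel name
    area        TEXT,      -- slot: north / south / centre ...
    price_range TEXT,      -- slot: cheap / moderate / expensive
    stars       INTEGER,   -- slot: star rating
    has_parking BOOLEAN    -- slot: parking availability
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ontology_ddl)
# A dialogue state tracker can now be grounded in explicit slots:
slots = [row[1] for row in conn.execute("PRAGMA table_info(hotel)")]
print(slots)   # ['hotel_id', 'name', 'area', 'price_range', 'stars', 'has_parking']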
[44] ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
Main category: cs.CL
TL;DR: ResearchQA is a new evaluation resource that converts survey articles from 75 research fields into 21K queries and 160K rubric items for comprehensive LLM evaluation, showing current systems struggle with citation coverage and other research-specific requirements.
Details
Motivation: Current evaluation of long-form responses to research queries relies heavily on expert annotators, limiting evaluation to fields like AI where researchers can easily find colleagues. However, research expertise exists broadly through survey articles that consolidate knowledge across literature.Method: Created ResearchQA by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Rubric items specify query-specific evaluation criteria like citing papers, making explanations, and describing limitations. Used 31 Ph.D. annotators across 8 fields to validate quality.
Result: 90% of queries reflect Ph.D. information needs and 87% of rubric items warrant emphasis of a sentence or longer. Evaluation of 18 systems in 7.6K head-to-head comparisons shows no parametric or retrieval-augmented system exceeds 70% coverage of rubric items, while the highest-ranking system overall reaches 75% coverage. Error analysis reveals poor performance on citation items (less than 11% fully addressed), limitation items (48%), and comparison items (49%).
Conclusion: ResearchQA enables more comprehensive multi-field evaluation of LLM systems, revealing significant gaps in current systems’ ability to handle research-specific requirements like proper citation and limitation discussion. The resource is released to facilitate broader evaluation across research domains.
Abstract: Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is abundant: survey articles consolidate knowledge spread across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Queries and rubrics are jointly derived from survey sections, where rubric items list query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. 31 Ph.D. annotators in 8 fields judge that 90% of queries reflect Ph.D. information needs and 87% of rubric items warrant emphasis of a sentence or longer. We leverage ResearchQA to evaluate 18 systems in 7.6K head-to-heads. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
[45] Quantifying the Impact of Structured Output Format on Large Language Models through Causal Inference
Han Yuan, Yue Zhao, Li Zhang, Wuqiong Luo, Zheng Ma
Main category: cs.CL
TL;DR: Causal analysis reveals structured output has minimal causal impact on LLM generation quality, contradicting previous mixed findings from coarse metrics.
Details
Motivation: Prior studies show conflicting results about structured output's impact on LLMs - some find benefits for completeness/accuracy, others find restrictions on reasoning. These studies have limitations: restricted testing scenarios, weakly controlled settings, and reliance on coarse metrics.
Method: Used causal inference with one assumed and two guaranteed constraints to derive five potential causal structures. Tested across seven public and one developed reasoning tasks on GPT-4o. Further experiments compared OpenAI-o3 with GPT-4o and GPT-4.1.
Result: Coarse metrics showed mixed results (positive/negative/neutral effects), but causal inference revealed no causal impact in 43/48 scenarios. In remaining 5 scenarios, 3 involved multifaceted causal structures influenced by concrete instructions. OpenAI-o3 was more resilient to output formats than general-purpose models.
Conclusion: Structured output has minimal causal impact on LLM generation quality when properly analyzed. Reasoning models (OpenAI-o3) are more resilient to output format variations than general-purpose models, revealing an “unaware advantage” of reasoning-focused architectures.
Abstract: Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs’ generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs’ generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public reasoning tasks and one that we developed, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o’s generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. Of the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions. Further experiments show that OpenAI-o3 is more resilient to output formats than general-purpose GPT-4o and GPT-4.1, highlighting an unaware advantage of reasoning models.
[46] Same Content, Different Representations: A Controlled Study for Table QA
Yue Zhang, Seiji Maekawa, Nikita Bhutani
Main category: cs.CL
TL;DR: First controlled study isolating table representation’s role in Table QA, showing structured vs. semi-structured formats affect model performance differently across SQL, LLM, and hybrid approaches.
Details
Motivation: Existing Table QA benchmarks are tied to fixed data formats and haven't systematically examined how table representation itself affects model performance, despite real-world settings requiring operation over both structured databases and semi-structured tables with textual fields.
Method: Created RePairTQA diagnostic benchmark with controlled table representation variations (structured vs. semi-structured) using verbalization pipeline to generate paired tables with identical content but different structures. Benchmark includes splits along table size, join requirements, query complexity, and schema quality.
Result: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data; LLMs show flexibility but reduced precision; hybrid approaches strike a balance, especially under noisy schemas. Effects intensify with larger tables and more complex queries.
Conclusion: No single method excels across all conditions, highlighting the central role of representation in shaping Table QA performance. Findings provide actionable insights for model selection and design, paving way for more robust hybrid approaches suited for diverse real-world data formats.
Abstract: Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
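The core idea of the verbalization pipeline, i.e. identical content under different structure, can be pictured with a toy version like the one below; the template and field names are illustrative, not RePairTQA's actual generation procedure.

```python
def verbalize_row(row):
    """Flatten one structured record into semi-structured text, keeping the
    content identical while discarding the tabular layout."""
    sent = ", ".join(f"the {field} is {value}" for field, value in row.items()) + "."
    return sent[0].upper() + sent[1:]

table = [
    {"player": "Ann Lee", "team": "Hawks", "points": 31},
    {"player": "Raj Patel", "team": "Owls", "points": 27},
]
semi_structured = [verbalize_row(r) for r in table]
# The same question can now be posed against either representation:
# structured:       SELECT player FROM table WHERE points > 30
# semi-structured:  "The player is Ann Lee, the team is Hawks, the points is 31."
```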
[47] Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Gaurav Singh, Abhishek Dey, Janit Bidhan, Tanu Kansal, Paras Kath, Saurabh Srivastava
Main category: cs.CL
TL;DR: Batch prompting in LLMs not only reduces inference cost but also regularizes reasoning behavior, improving accuracy while cutting token usage by 3x-5x through suppression of overthinking and encouraging decisive answers.
Details
Motivation: Previous work focused on batch prompting primarily for amortizing inference costs, but this paper investigates an underappreciated benefit: batching as an inference-time regularizer that improves reasoning quality and efficiency in Large Reasoning Models.
Method: Conducted comprehensive study across 13 diverse benchmarks, analyzing batch prompting effects on LLM reasoning. Used detailed behavioral analysis to examine how batching affects model behavior, including suppression of overthinking, reduction of hedging language, and emergence of collective effects in batched inference.
Result: Batching improves accuracy while substantially reducing reasoning token usage (often 3x-5x reduction). It suppresses overthinking, reduces repetitive self-corrections, encourages more decisive answers, and shows emergent collective effects where models generalize patterns from earlier examples to solve harder ones in the same batch.
Conclusion: Batching should be repositioned not just as a throughput optimization technique, but as a powerful inference-time regularizer that enables more efficient and reliable LLM reasoning through behavioral regularization and emergent collective effects.
Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
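Mechanically, batch prompting just packs several independent questions into one request and parses indexed answers back out. The sketch below shows one common convention; the instruction wording and answer format are assumptions, not the paper's exact template.

```python
def build_batch_prompt(questions):
    # Pack several independent questions into a single prompt.
    header = ("Answer each question below. Reply with one line per question, "
              "formatted as 'A<k>: <answer>'.\n\n")
    body = "\n".join(f"Q{k}: {q}" for k, q in enumerate(questions, 1))
    return header + body

def parse_batch_answers(reply, n):
    # Recover answers by index; unanswered questions come back as None.
    answers = [None] * n
    for line in reply.splitlines():
        head, sep, text = line.partition(":")
        if sep and head.startswith("A") and head[1:].strip().isdigit():
            k = int(head[1:])
            if 1 <= k <= n:
                answers[k - 1] = text.strip()
    return answers
```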
[48] LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis
Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi
Main category: cs.CL
TL;DR: LLMs show systematic demographic biases in biomedical summarization, with poorest age-related information retention for adult-focused studies and increased hallucinations for underrepresented populations.
Details
Motivation: Clinical interventions depend critically on age distinctions, but it's unclear whether language models preserve these demographic details when summarizing biomedical evidence, creating potential safety risks.
Method: Created DemogSummary dataset with age-stratified systematic review studies; evaluated Qwen, Longformer, and GPT-4.1 Nano using standard metrics and novel Demographic Salience Score (DSS) measuring age entity retention and hallucination.
Result: Models show systematic disparities: demographic fidelity lowest for adult-focused summaries, underrepresented populations more prone to hallucinations, revealing limitations in faithful, bias-free biomedical summarization.
Conclusion: Current LLMs have significant limitations in preserving demographic information, highlighting need for fairness-aware evaluation frameworks and summarization pipelines in biomedical NLP applications.
Abstract: Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and under-represented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.
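The abstract does not spell out the DSS formula, so the toy below is only one plausible reading: score retained age mentions and penalize ones absent from the source (hallucinations). The regex and the scoring rule are assumptions.

```python
import re

AGE_PATTERN = re.compile(
    r"\b(child(?:ren)?|older adults?|adults?|elderly|aged \d+|"
    r"\d+[- ]year[- ]olds?)\b", re.IGNORECASE)

def age_entities(text):
    return {m.lower() for m in AGE_PATTERN.findall(text)}

def demographic_salience(source, summary):
    """Toy age-retention score: fraction of the source's age mentions kept,
    minus a penalty for age mentions that do not appear in the source.
    An illustrative reading of DSS, not the paper's definition."""
    src, summ = age_entities(source), age_entities(summary)
    if not src:
        return None  # nothing demographic to preserve
    retained = len(src & summ) / len(src)
    hallucinated = len(summ - src) / max(len(summ), 1)
    return retained - hallucinated
```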
[49] Adaptive Focus Memory for Language Models
Christopher Cruz
Main category: cs.CL
TL;DR: AFM is a lightweight context management system that dynamically assigns fidelity levels to past messages (Full, Compressed, or Placeholder) based on relevance, temporal decay, and importance, enabling reliable constraint preservation in multi-turn dialogues under fixed token budgets.
Details
Motivation: Current LLM dialogue systems use naive history management strategies - either replaying full conversations (costly) or using truncation/summarization (causes early important constraints to drift out of context). Models retain text but don't reliably apply critical constraints when needed.
Method: Adaptive Focus Memory (AFM) dynamically assigns each past message one of three fidelity levels: Full, Compressed, or Placeholder. It uses semantic relevance, temporal decay, and importance classification to pack messages chronologically under a fixed token budget, preserving critical constraints at high fidelity while allowing low-importance context to degrade gracefully.
Result: AFM succeeds in 83.3% of allergy scenario runs where all baseline strategies fail, and preserves correct refusal behavior on tax compliance benchmark. It enables reliable constraint preservation without modifying model weights or external retrieval infrastructure.
Conclusion: Effective dialogue memory requires more than retaining prior text - selectively allocating fidelity across past messages enables reliable constraint preservation under bounded context growth. AFM provides a practical solution compatible with existing chat APIs.
Abstract: Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, yet their behavior remains bottlenecked by naive history management strategies. Replaying the full conversation at every turn is simple but costly, while recency-based truncation or static summarization often causes early, high-impact user constraints to drift out of effective context. As a result, models may retain text without reliably applying it when it matters. We present Adaptive Focus Memory (AFM), a lightweight context management system that dynamically assigns each past message one of three fidelity levels: Full, Compressed, or Placeholder, based on semantic relevance, temporal decay, and importance classification. AFM packs messages chronologically under a fixed token budget, preserving critical constraints at high fidelity while allowing low-importance context to degrade gracefully. We evaluate AFM on two multi-turn dialogue benchmarks designed to stress long-horizon constraint preservation: a safety-critical travel scenario involving a user with a severe peanut allergy, and a policy-critical tax compliance scenario involving an illegal evasion request. Under strict grading that requires both explicit constraint recall and appropriately conditioned generation, AFM succeeds in 83.3 percent of allergy runs where all baseline strategies fail, and preserves correct refusal behavior on the tax benchmark. These results demonstrate that effective dialogue memory requires more than retaining prior text. Selectively allocating fidelity across past messages enables reliable constraint preservation under bounded context growth, without modifying model weights or introducing external retrieval infrastructure. We release an open-source implementation of AFM compatible with OpenAI-style chat APIs to support reproducible research and practical deployment.
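A minimal sketch of the fidelity-assignment-and-packing loop follows; the score combination, thresholds, and decay are assumed values, and compress and n_tokens stand in for a summarizer and a tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Message:
    text: str
    relevance: float   # similarity to the current query, in [0, 1]
    age: int           # turns elapsed since the message
    critical: bool     # output of an importance classifier

PLACEHOLDER = "[earlier message omitted]"

def fidelity(msg, decay=0.9, hi=0.6, lo=0.2):
    """Pick one of AFM's three fidelity levels. The formula and thresholds
    here are illustrative assumptions, not the paper's values."""
    if msg.critical:
        return "full"                          # never degrade hard constraints
    score = msg.relevance * decay ** msg.age   # relevance damped by staleness
    return "full" if score >= hi else "compressed" if score >= lo else "placeholder"

def pack(history, budget, compress, n_tokens):
    # Pack messages chronologically under a fixed token budget, letting
    # low-importance context degrade gracefully when the budget runs short.
    packed, used = [], 0
    for msg in history:
        render = {"full": msg.text,
                  "compressed": compress(msg.text),
                  "placeholder": PLACEHOLDER}[fidelity(msg)]
        if used + n_tokens(render) > budget:
            render = PLACEHOLDER               # degrade rather than drop
        packed.append(render)
        used += n_tokens(render)
    return packed
```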
[50] When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan
Main category: cs.CL
TL;DR: Paper introduces “semantic confusion” - a failure mode where models inconsistently accept/reject paraphrases of the same intent, and proposes metrics to measure this local inconsistency beyond global refusal rates.
Details
Motivation: Current safety-aligned language models often refuse harmless prompts, but existing evaluations only measure global refusal rates, missing local inconsistencies where models accept one phrasing but reject semantically equivalent paraphrases. This gap limits diagnosis and tuning of safety systems.
Method: 1) Introduce “semantic confusion” as a failure mode capturing local inconsistency. 2) Build ParaGuard - a 10k-prompt corpus with controlled paraphrase clusters that vary surface form while keeping intent fixed. 3) Propose three model-agnostic token-level metrics: Confusion Index, Confusion Rate, and Confusion Depth that compare refusals to nearest accepted neighbors using token embeddings, next-token probabilities, and perplexity signals.
Result: Experiments across diverse model families and deployment guards show that global false-rejection rates hide critical structure. The metrics reveal: globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal doesn’t increase inconsistency. Confusion-aware auditing separates how often a system refuses from how sensibly it refuses.
Conclusion: The proposed semantic confusion framework and metrics provide developers with practical signals to reduce false refusals while preserving safety, addressing the limitation of current global evaluation metrics that miss local inconsistency patterns.
Abstract: Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce “semantic confusion,” a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
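As a toy version of the idea, a confusion rate can be computed by asking, for each refused prompt, whether a near-identical paraphrase in the same intent cluster was accepted. The threshold and the injected similarity function are assumptions; the paper's metrics additionally use token embeddings, next-token probabilities, and perplexity signals.

```python
def confusion_rate(clusters, refused, similarity, tau=0.9):
    """Toy Confusion Rate: the fraction of refused prompts whose nearest
    accepted paraphrase (same intent cluster) is at least tau-similar."""
    confused = total = 0
    for cluster in clusters:                 # prompts sharing one intent
        accepted = [p for p in cluster if not refused(p)]
        for p in cluster:
            if refused(p) and accepted:
                total += 1
                nearest = max(similarity(p, a) for a in accepted)
                if nearest >= tau:
                    confused += 1
    return confused / total if total else 0.0
```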
[51] Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
Main category: cs.CL
TL;DR: NPR enables LLMs to self-evolve genuine parallel reasoning capabilities through a teacher-free framework, achieving significant performance gains and inference speedups while maintaining 100% parallel execution.
Details
Motivation: Current LLMs rely on sequential emulation rather than native parallel cognition, limiting their reasoning efficiency and scalability. Prior approaches often fall back to autoregressive decoding instead of genuine parallel execution.
Method: Three key innovations: 1) Self-distilled progressive training from format discovery to topological constraints, 2) Parallel-Aware Policy Optimization (PAPO) for adaptive decomposition learning, 3) NPR Engine refactoring memory management and flow control in SGLang for stable parallel RL training.
Result: On eight reasoning benchmarks, NPR trained on Qwen3-4B achieves up to 24.5% performance gains and 4.6x inference speedups. Demonstrates 100% genuine parallel execution unlike prior baselines.
Conclusion: NPR establishes a new standard for self-evolving, efficient, and scalable agentic reasoning by enabling LLMs to develop native parallel cognition without external supervision.
Abstract: We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from “cold-start” format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
[52] Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation
Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Kotaro Funakoshi, Manabu Okumura
Main category: cs.CL
TL;DR: MBR decoding improves generative error span detection by selecting hypotheses based on similarity to human annotations rather than maximum probability, with distillation to reduce computational cost.
Details
Motivation: Current generative ESD methods use MAP decoding which assumes model probabilities perfectly correlate with similarity to human annotations, but this assumption often fails when higher likelihood is assigned to incorrect annotations than human ones.
Method: Apply Minimum Bayes Risk (MBR) decoding to generative ESD using sentence- or span-level similarity functions to select candidate hypotheses based on approximate similarity to human annotations, with distillation to reduce computational cost.
Result: MBR decoding significantly improves span-level performance and generally matches or outperforms MAP at system and sentence levels on WMT24 Metrics Shared Task.
Conclusion: MBR decoding is more effective than MAP for generative ESD by better aligning with human annotations, and distillation successfully addresses the computational bottleneck of MBR inference.
Abstract: Error Span Detection (ESD) extends automatic machine translation (MT) evaluation by localizing translation errors and labeling their severity. Current generative ESD methods typically use Maximum a Posteriori (MAP) decoding, assuming that the model-estimated probabilities are perfectly correlated with similarity to the human annotation, but we often observe higher likelihood assigned to an incorrect annotation than to the human one. We instead apply Minimum Bayes Risk (MBR) decoding to generative ESD. We use a sentence- or span-level similarity function for MBR decoding, which selects candidate hypotheses based on their approximate similarity to the human annotation. Experimental results on the WMT24 Metrics Shared Task show that MBR decoding significantly improves span-level performance and generally matches or outperforms MAP at the system and sentence levels. To reduce the computational cost of MBR decoding, we further distill its decisions into a model decoded via greedy search, removing the inference-time latency bottleneck.
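The heart of MBR decoding fits in a few lines: rather than taking the highest-probability output, select the candidate with the highest expected similarity to the other samples, which approximates expected similarity to the unseen human annotation. The span-level similarity below is a crude stand-in for the paper's utility functions.

```python
def mbr_select(candidates, similarity):
    """Minimum Bayes Risk selection over sampled hypotheses."""
    def expected_gain(h):
        others = [c for c in candidates if c is not h]
        return sum(similarity(h, c) for c in others) / max(len(others), 1)
    return max(candidates, key=expected_gain)

def span_f1(a, b):
    # Stand-in utility: F1 over (span, severity) pairs of two annotations.
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return float(sa == sb)
    p, r = len(sa & sb) / len(sa), len(sa & sb) / len(sb)
    return 2 * p * r / (p + r) if p + r else 0.0

# best = mbr_select(sampled_annotations, span_f1)
```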
[53] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
Hauke Licht
Main category: cs.CL
TL;DR: Multimodal AI shows promise for analyzing emotions in political videos but performs inconsistently - works well under ideal conditions but fails in real-world parliamentary debates, requiring careful evaluation.
Details
Motivation: While emotions are crucial in politics and multimodal AI offers new analysis capabilities, there's insufficient evidence about its effectiveness in analyzing emotions in political communication, creating a research gap.
Method: Evaluates current multimodal large language models (mLLMs) using two complementary datasets of human-labeled video recordings to analyze emotional arousal in political communication.
Result: Under ideal circumstances, mLLMs show highly reliable emotional arousal ratings with minimal demographic bias, but in real-world parliamentary debates, their arousal ratings fail to deliver reliable results, potentially affecting statistical inferences.
Conclusion: The study emphasizes the need for continued thorough evaluation of emerging generative AI methods in multimodal political analysis and provides a replicable framework for such evaluations.
Abstract: Emotions are central to politics, and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze emotions, the emergence of multimodal generative Artificial Intelligence (AI) promises great advances. However, we lack evidence about the effectiveness of multimodal AI in analyzing emotions in political communication. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in the video-based analysis of emotional arousal, using two complementary datasets of human-labeled video recordings. It finds that under ideal circumstances, mLLMs’ emotional arousal ratings are highly reliable and exhibit little to no demographic bias. However, in recordings of real-world parliamentary debates, mLLMs’ arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in multimodal political analysis and contributes a replicable framework for such evaluations.
[54] Non-Resolution Reasoning (NRR): A Computational Framework for Contextual Identity and Ambiguity Preservation
Kei Saito
Main category: cs.CL
TL;DR: NRR proposes a framework for AI systems to retain ambiguity rather than prematurely resolving it, enabling better handling of paradoxes, creative generation, and context-dependent reasoning.
Details
Motivation: Current AI systems have a fundamental limitation: they prematurely collapse multiple valid interpretations into single outputs due to classical identity assumptions in neural architectures. This prevents sophisticated ambiguity handling and creative reasoning.
Method: Introduces Non-Resolution Reasoning (NRR) with three core principles: Non-Identity (A≠A), Approximate Identity (A≈A), and Non-Resolution. Implements these through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking (CIT).
Result: NRR-lite model achieves 90.9% out-of-distribution accuracy on synthetic context-shift task vs 9.1% for standard architectures. Demonstrates improved paradox handling, creative generation, and context-dependent reasoning through case studies.
Conclusion: NRR challenges the assumption that meaning must collapse to be useful, offering a foundation for AI systems capable of sophisticated ambiguity handling and creative reasoning. The key question becomes when, how, and under whose control ambiguity should be resolved.
Abstract: Current artificial intelligence systems, despite remarkable capabilities in text generation and pattern recognition, exhibit a fundamental architectural limitation: they resolve ambiguity prematurely. This premature semantic collapse – the tendency to collapse multiple valid interpretations into a single output – stems from classical identity assumptions embedded in standard neural architectures. We propose Non-Resolution Reasoning (NRR), a computational framework that treats ambiguity retention as a valid reasoning mode rather than a defect to be eliminated. NRR introduces three core principles: (1) Non-Identity ($A \neq A$) – the same symbol refers to different entities across contexts; (2) Approximate Identity ($A \approx A$) – entities share partial structural overlap without being identical; and (3) Non-Resolution – conflicting interpretations can coexist without forced convergence. We formalize these principles through three architectural components: Multi-Vector Embeddings for context-dependent representation, Non-Collapsing Attention for parallel interpretation retention, and Contextual Identity Tracking (CIT) for maintaining $A \neq A$ across inference. We demonstrate NRR’s advantages through case studies in paradox handling, creative generation, and context-dependent reasoning. Crucially, we provide a minimal empirical validation on a synthetic context-shift task where an NRR-lite model achieves 90.9% out-of-distribution accuracy compared to 9.1% for standard architectures, demonstrating that ambiguity preservation enables structural generalization. NRR challenges the assumption that meaning must collapse to be useful, offering a foundation for AI systems capable of sophisticated ambiguity handling and creative reasoning. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
[55] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation
Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam
Main category: cs.CL
TL;DR: The paper proposes a fact-checking module and domain-specific summarization model to reduce hallucinations in LLM outputs for healthcare applications, achieving high precision (0.8904) and F1-score (0.8556) for fact verification.
Details
Motivation: LLM-generated outputs in healthcare are often unreliable due to hallucination risks, a serious concern for decision-making and patient safety. There's a need for reliable, accurate outputs in medical contexts.
Method: Proposes a fact-checking module independent of LLMs that uses numerical tests and logical checks via discrete logic in NLP to validate facts against EHRs. Also develops a domain-specific summarization model fine-tuned using LoRA on the MIMIC-III dataset.
Result: Fact-checking module achieves precision: 0.8904, recall: 0.8234, F1-score: 0.8556 on 3,786 propositions from 104 summaries. LLM summary model achieves ROUGE-1: 0.5797 and BERTScore: 0.9120 for summary quality.
Conclusion: The proposed approach effectively reduces hallucination in LLM outputs for healthcare by combining a specialized fact-checking module with a domain-specific summarization model, improving reliability for medical decision-making.
Abstract: In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
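As an illustration of what a granular numerical test can look like, the snippet below checks every number in a proposition against a patient's recorded values; the EHR layout, tolerance, and matching rule are assumptions, not the paper's implementation.

```python
import re

def check_numeric_claim(proposition, ehr, tolerance=0.05):
    """Toy granular fact check: every number in the proposition must match
    some EHR value within a relative tolerance."""
    claimed = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", proposition)]
    recorded = [float(v) for v in ehr.values()]
    def supported(x):
        return any(abs(x - v) <= tolerance * max(abs(v), 1e-9) for v in recorded)
    return all(supported(x) for x in claimed)

ehr = {"heart_rate": 88, "sodium_mmol_l": 139, "creatinine_mg_dl": 1.1}
print(check_numeric_claim("Heart rate was 88 bpm with sodium of 139.", ehr))  # True
print(check_numeric_claim("Creatinine rose to 2.4 mg/dL.", ehr))              # False
```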
[56] Sigma-MoE-Tiny Technical Report
Qingguo Hu, Zhenghao Lin, Ziyue Yang, Yucheng Ding, Xiao Liu, Yuting Jiang, Ruizhe Wang, Tianyu Chen, Zhongxin Guo, Yifan Xiong, Rui Gao, Lei Qu, Jinsong Su, Peng Cheng, Yeyun Gong
Main category: cs.CL
TL;DR: Sigma-MoE-Tiny achieves extreme sparsity with 96 experts per layer but only 1 expert activated per token, resulting in 20B total parameters with just 0.5B activated, while maintaining top-tier performance.
Details
Motivation: To push the boundaries of sparsity in Mixture-of-Experts models, addressing the challenge of expert load balancing in highly sparse settings where traditional methods fail in lower layers.
Method: Fine-grained expert segmentation (96 experts per layer), progressive sparsification schedule for load balancing, pre-training on diverse high-quality corpus, and post-training to unlock capabilities.
Result: Achieves highest sparsity among open-source models, stable training without irrecoverable loss spikes, and top-tier performance despite activating only 0.5B parameters compared to larger counterparts.
Conclusion: Sigma-MoE-Tiny demonstrates that extreme sparsity in MoE models is achievable with proper load balancing techniques, providing insights for advancing sparsity in future MoE architectures.
Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny Code: https://github.com/microsoft/ltp-megatron-lm
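For context, top-1 routing over fine-grained experts with the widely used auxiliary load-balancing loss, the loss the abstract reports becoming ineffective in lower layers at this sparsity, looks roughly like the PyTorch sketch below. The progressive sparsification schedule that addresses the problem is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1Router(nn.Module):
    """Top-1 MoE routing with the standard Switch-style balancing loss."""
    def __init__(self, d_model, n_experts=96):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts

    def forward(self, x):                              # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, E)
        expert = probs.argmax(dim=-1)                  # one expert per token
        # f: fraction of tokens routed to each expert; p: mean router prob.
        f = F.one_hot(expert, self.n_experts).float().mean(dim=0)
        p = probs.mean(dim=0)
        aux_loss = self.n_experts * torch.sum(f * p)   # ~1.0 when balanced
        gate_weight = probs.gather(-1, expert[:, None])  # scales expert output
        return expert, gate_weight, aux_loss
```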
cs.CV
[57] V-Agent: An Interactive Video Search System Using Vision-Language Models
SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju
Main category: cs.CV
TL;DR: V-Agent is a multi-agent platform for video search and conversation that uses a fine-tuned vision-language model with multimodal retrieval capabilities, achieving state-of-the-art zero-shot performance.
Details
Motivation: Traditional text-based retrieval systems have limitations in multimodal scenarios where both visual and spoken content need to be interpreted for context-aware video search.
Method: Fine-tunes a vision-language model with video preference data, enhances it with retrieval vectors from image-text retrieval models, and uses three collaborative agents (routing, search, chat) with multimodal embedding of video frames and audio transcriptions.
Result: Demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, showing superior video retrieval quality through the combined VLM-based retrieval and re-ranking approach.
Conclusion: V-Agent represents a promising framework for advanced video search and interactive conversations with potential applications in both academic research and real-world scenarios.
Abstract: We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents-a routing agent, a search agent, and a chat agent-that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
[58] Comparison of deep learning models: CNN and VGG-16 in identifying pornographic content
Reza Chandra, Adang Suhendra, Lintang Yuniar Banowosari, Prihandoko
Main category: cs.CV
TL;DR: CNN model outperforms VGG-16 for rapid pornographic content detection in websites, achieving 94.87% accuracy with 50 epochs and 0.001 learning rate.
Details
Motivation: The Indonesian government blocked 59,741 websites in 2020 (14,266 of them pornographic), but VPNs still allow public access, creating a need for rapid identification of pornographic content.
Method: Developed a deep learning system using CNN and VGG-16 models for pornographic image detection, and conducted comprehensive comparison experiments with different epoch values and learning rates.
Result: CNN model achieved best performance: 94.87% accuracy with 50 epochs and 0.001 learning rate, outperforming VGG-16 for rapid pornographic content detection.
Conclusion: CNN model is more effective than VGG-16 for quickly and accurately detecting pornographic content in websites, providing solution to bypass VPN access to blocked content.
Abstract: In 2020, a total of 59,741 websites were blocked by the Indonesian government for containing negative content, including pornography, with 14,266 websites falling into this category. However, these blocked websites could still be accessed by the public using virtual private networks (VPNs). This motivated research into rapidly identifying pornographic content. This study aims to develop a system capable of identifying websites suspected of containing pornographic image content, using a deep learning approach with convolutional neural network (CNN) and visual geometry group 16 (VGG-16) models. The two models were then explored comprehensively and holistically to determine which was most effective in detecting pornographic content quickly. In the comparison between the CNN and VGG-16 models, the best test result was obtained in the eighth experiment, using the CNN model with 50 epochs and a learning rate of 0.001, which reached an accuracy of 0.9487, or 94.87%. This indicates that the CNN model is more effective than VGG-16 in detecting pornographic content quickly and accurately.
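The abstract reports only the winning hyperparameters (50 epochs, learning rate 0.001). A minimal Keras setup at those settings might look like the following; the layer stack and input size are assumptions, since the architecture details are not given in the abstract.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Minimal binary image classifier at the best-reported setting.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # pornographic vs. safe
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=50)
```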
[59] A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, Yang Gao
Main category: cs.CV
TL;DR: RSHR-Bench is a super-high-resolution remote sensing benchmark with 5,329 full-scene images (up to 300M pixels) to address flaws in existing RS benchmarks where text-only LLMs can compete with multimodal models, revealing poor visual understanding evaluation.
Details
Motivation: Current remote sensing benchmarks have limitations: most use low-resolution imagery, and some high-resolution benchmarks have flawed reasoning-task designs. The authors discovered that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without seeing images, indicating a critical mismatch between benchmarks and intended visual understanding evaluation.
Method: Created RSHR-Bench with 5,329 super-high-resolution full-scene images (long side ≥4,000 pixels, up to ~300M pixels) from widely used RS corpora and UAV collections. Designed four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. Applied adversarial filtering with strong LLMs followed by rigorous human verification to reduce language priors. Constructed 3,864 VQA tasks, 3,913 image captioning tasks, and 500 human-written/verified single-image evaluation VQA pairs.
Result: Evaluations across open-source, closed-source, and RS-specific vision-language models reveal persistent performance gaps in super-high-resolution scenarios, demonstrating that current models struggle with true visual understanding at very high resolutions.
Conclusion: RSHR-Bench enables faithful assessment of remote sensing visual understanding and reasoning by addressing resolution limitations and language prior issues in existing benchmarks. The benchmark reveals that current multimodal models still have significant room for improvement in handling super-high-resolution remote sensing imagery.
Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
[60] AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals
Qi Xu, Shuai Gong, Xuming Ran, Haihua Luo, Yangfan Hu
Main category: cs.CV
TL;DR: AVM is a modular neural modeling framework that separates stable visual encoding from condition-specific adaptation using frozen Vision Transformer encoders and independent modulation paths, achieving better generalization across stimuli and individuals than existing methods.
Details
Motivation: Current deep learning models for neural response simulation fail to clearly separate stable visual encoding from condition-specific adaptation, limiting their ability to generalize across different stimuli and individuals.
Method: AVM uses a structure-preserving framework with a frozen Vision Transformer-based encoder for consistent visual features, plus independently trained modulation paths that account for neural response variations from stimulus content and subject identity.
Result: AVM outperforms the state-of-the-art V1T model by ~2% in predictive correlation across two large-scale mouse V1 datasets, with a 9.1% improvement in explained variance (FEVE) under cross-dataset adaptation, showing robust generalization and interpretable modulation.
Conclusion: AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering scalable solutions under structural constraints that could inform future cortical modeling in neuroscience and biologically inspired AI systems.
Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.
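The freeze-the-core, train-the-modulation pattern is straightforward to sketch in PyTorch; the FiLM-style gain/shift modulation and the module shapes below are illustrative assumptions, with encoder standing in for the frozen ViT backbone.

```python
import torch
import torch.nn as nn

class AVMSketch(nn.Module):
    """Frozen shared encoder plus condition-specific modulation paths."""
    def __init__(self, encoder, d_feat, n_neurons, n_subjects):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the core representation fixed
            p.requires_grad = False
        # One gain/shift modulation path per subject, plus a linear readout.
        self.gain = nn.Embedding(n_subjects, d_feat)
        self.shift = nn.Embedding(n_subjects, d_feat)
        nn.init.ones_(self.gain.weight)       # start as identity modulation
        nn.init.zeros_(self.shift.weight)
        self.readout = nn.Linear(d_feat, n_neurons)

    def forward(self, images, subject_id):
        with torch.no_grad():
            feats = self.encoder(images)                   # (B, d_feat)
        feats = feats * self.gain(subject_id) + self.shift(subject_id)
        return self.readout(feats)                         # predicted responses
```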
[61] Region-Constraint In-Context Generation for Instructional Video Editing
Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei
Main category: cs.CV
TL;DR: ReCo introduces a new instructional video editing paradigm using in-context generation with joint denoising and regularization techniques to address inaccurate editing regions and token interference problems.
Details
Motivation: While in-context generation works well for image editing, applying it to video editing faces challenges: inaccurate editing regions and token interference between editing/non-editing areas during denoising.
Method: ReCo width-wise concatenates source and target videos for joint denoising. It uses two regularization terms: 1) latent regularization to increase discrepancy in editing regions while reducing it in non-editing areas, and 2) attention regularization to suppress attention between editing region tokens in source/target videos. Also creates ReCo-Data dataset with 500K instruction-video pairs.
Result: Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of ReCo over existing methods.
Conclusion: ReCo successfully addresses video editing challenges through constraint modeling between editing/non-editing regions, achieving accurate editing with minimal interference through joint denoising and regularization techniques.
Abstract: The in-context generation paradigm has recently demonstrated strong performance in instructional image editing, with both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that introduces constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo width-wise concatenates source and target videos for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, applied to one-step backward-denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing modification of the editing area and alleviating unexpected content generation elsewhere. The latter suppresses the attention of tokens in the editing region to the corresponding tokens of the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
[62] Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn, Johannes Breidenbach, Stefano Puliti, Nils Nölke
Main category: cs.CV
TL;DR: This paper proposes a novel method linking Finer-CAM explanations to TLS projection segments to evaluate which structural features drive tree species discrimination in deep learning models, finding models primarily rely on crown features with species-specific variations.
Details
Motivation: While new sensors and deep learning achieve high accuracy in tree species classification, their decision processes remain unclear. Existing methods like Finer-CAM can highlight features but are uncommon for similar-looking contrastive tree species, creating a need to understand what features drive species discrimination.
Method: Proposed a novel method linking Finer-CAM explanations to segments of TLS projections representing structural tree features. Used TLS data from 2,445 trees across seven European species, trained and validated five YOLOv8 models with cross-validation, and analyzed 630 saliency maps to evaluate feature contributions.
Result: Models achieved 96% mean accuracy (SD = 0.24%). Analysis showed models primarily rely on crown features for classification, with pronounced results for Silver Birch, European Beech, English oak, and Norway spruce. Stem features contributed more to differentiating European ash, Scots pine, and Douglas fir. Finer branches were particularly important for model decisions, and model similarity judgments aligned with human expert assessments.
Conclusion: The study highlights the need for improved understanding of decision processes in tree species classification models to reveal dataset and model limitations, biases, and build confidence in predictions. The proposed method successfully links explainable AI techniques to structural features for systematic evaluation of what drives species discrimination.
Abstract: Classifying tree species has been a core research area in forest remote sensing for decades. New sensors and classification approaches like TLS and deep learning achieve state-of-the-art accuracy, but their decision processes remain unclear. Methods such as Finer-CAM (Class Activation Mapping) can highlight features in TLS projections that contribute to the classification of a target species, yet are uncommon in similar-looking contrastive tree species. We propose a novel method linking Finer-CAM explanations to segments of TLS projections representing structural tree features to systematically evaluate which features drive species discrimination. Using TLS data from 2,445 trees across seven European tree species, we trained and validated five YOLOv8 models with cross-validation, reaching a mean accuracy of 96% (SD = 0.24%). Analysis of 630 saliency maps shows the models primarily rely on crown features in TLS projections for species classification. While this result is pronounced in Silver Birch, European Beech, English oak, and Norway spruce, stem features contribute more frequently to the differentiation of European ash, Scots pine, and Douglas fir. Representations of finer branches in particular contribute to the decisions of the models. The models judge as mutually similar those tree species that a human expert would also regard as similar. Furthermore, our results highlight the need for an improved understanding of the decision processes of tree species classification models, to help reveal dataset and model limitations and biases, and to build confidence in model predictions.
[63] Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories
Chayan Jain, Rishant Sharma, Archit Garg, Ishan Bhanuka, Pratik Narang, Dhruv Kumar
Main category: cs.CV
TL;DR: A multi-stage pipeline for generating long, cohesive video stories with consistent characters using LLM-generated scripts, text-to-image character anchors, and scene-by-scene video synthesis.
Details
Motivation: Current text-to-video AI struggles with generating long, cohesive video stories that maintain consistent characters throughout the narrative, requiring a more structured approach similar to filmmaking.
Method: Three-stage pipeline: 1) LLM generates detailed production script, 2) Text-to-image model creates consistent character visuals as anchors, 3) Video generation model synthesizes each scene individually using character anchors for consistency.
Result: Visual anchoring is crucial - removing it causes catastrophic drop in character consistency scores (from 7.99 to 0.55). Also reveals cultural disparities in current models with distinct biases in subject consistency and dynamic degree between Indian vs Western-themed generations.
Conclusion: Multi-stage decomposition with visual character anchoring is essential for maintaining character consistency in long video story generation, and current models exhibit cultural biases that need addressing.
Abstract: Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models, revealing distinct biases in subject consistency and dynamic degree between Indian vs Western-themed generations.
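As a minimal, runnable illustration of this three-stage decomposition, the sketch below stubs out the LLM, text-to-image, and video models with trivial placeholders; every name here is hypothetical and not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    prompt: str
    cast: list

def generate_script(premise):                    # Stage 1: LLM production script
    return [Scene(f"{premise}: opening shot", ["hero"]),
            Scene(f"{premise}: finale", ["hero", "rival"])]

def character_anchor(name):                      # Stage 2: consistent character image
    return f"<anchor image for {name}>"

def synthesize_scene(scene, anchors):            # Stage 3: anchor-conditioned clip
    refs = [anchors[name] for name in scene.cast]
    return f"<clip: {scene.prompt} | refs={refs}>"

script = generate_script("desert chase")
anchors = {n: character_anchor(n) for n in ("hero", "rival")}
print([synthesize_scene(s, anchors) for s in script])
```

The point of the structure is that the anchors produced in stage 2 are reused for every scene in stage 3, which is exactly the mechanism whose removal collapses the character-consistency score in the paper's ablation.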
[64] InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu
Main category: cs.CV
TL;DR: InfoTok is an adaptive video tokenization framework that uses information theory to dynamically allocate tokens based on content complexity, achieving better compression than fixed-rate methods.
Details
Motivation: Current video tokenizers use fixed compression rates for all content, leading to redundancy for simple scenes or information loss for complex scenes. Videos have variable information density, requiring adaptive token allocation.
Method: Developed a principled framework based on Shannon’s information theory, proved existing methods are suboptimal, and created an ELBO-based algorithm that approaches theoretical optimality. Built a transformer-based adaptive compressor for adaptive tokenization (a toy budget-allocation sketch follows the abstract below).
Result: Achieved state-of-the-art compression: saves 20% tokens without performance loss, achieves 2.3x compression rates while outperforming prior heuristic adaptive approaches. Enables more compressed yet accurate video representation.
Conclusion: InfoTok provides a theoretically-grounded approach to adaptive video tokenization that allocates tokens according to informational richness, offering valuable insights for future video representation research.
Abstract: Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon’s information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens with no loss in performance, and achieving 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
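The core intuition, allocating representation budget where information is dense, can be illustrated without the paper's ELBO machinery. The toy sketch below splits a fixed token budget across video chunks in proportion to a crude information proxy (temporal-difference energy); this is our illustrative stand-in, not InfoTok's algorithm.

```python
import numpy as np

def allocate_tokens(chunks, total_tokens, min_tokens=4):
    """chunks: list of (T, H, W) arrays; returns per-chunk token counts."""
    info = np.array([np.abs(np.diff(c, axis=0)).mean() + 1e-8 for c in chunks])
    spare = total_tokens - min_tokens * len(chunks)
    return (min_tokens + np.floor(info / info.sum() * spare)).astype(int)

rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 16, 16)), 8, axis=0)      # near-constant scene
dynamic = rng.random((8, 16, 16))                           # high-motion scene
print(allocate_tokens([static, dynamic], total_tokens=64))  # e.g. [ 4 59 ]
```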
[65] FakeParts: a New Family of AI-Generated DeepFakes
Ziyi Liu, Firas Gabetni, Awais Hussain Sani, Xi Wang, Soobash Daiboo, Gaetan Brison, Gianni Franchi, Vicky Kalogeiton
Main category: cs.CV
TL;DR: FakeParts introduces subtle, localized deepfake manipulations that blend with real content, making detection harder. The paper presents FakePartsBench, a large-scale benchmark with 81K videos (44K FakeParts) to evaluate detection methods, showing both humans and AI models struggle with these partial manipulations.
Details
Motivation: Current deepfake detection focuses on fully synthetic content, leaving a critical gap for partial manipulations that blend real and fake elements. These subtle, localized manipulations are more deceptive and difficult to detect, creating an urgent vulnerability in current detection systems.
Method: The paper introduces FakeParts, a new class of partial deepfakes with localized manipulations. It presents FakePartsBench, the first large-scale benchmark dataset with over 81K videos (including 44K FakeParts) featuring pixel- and frame-level manipulation annotations. The method includes user studies and evaluation of state-of-the-art detection models.
Result: FakeParts reduces human detection accuracy by up to 26% compared to traditional deepfakes. State-of-the-art detection models show similar performance degradation. The benchmark enables comprehensive evaluation, revealing significant vulnerabilities in current detection approaches.
Conclusion: Partial deepfakes (FakeParts) represent a critical vulnerability in current detection systems. The FakePartsBench provides essential resources for developing more robust detection methods that can handle subtle, localized manipulations blending real and fake content.
Abstract: We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations - ranging from altered facial expressions to object substitutions and background modifications - blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection, we present FakePartsBench, the first large-scale benchmark specifically designed to capture the full spectrum of partial deepfakes. Comprising over 81K videos (including 44K FakeParts) with pixel- and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by up to 26% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current detectors and provides the necessary resources to develop methods robust to partial manipulations.
[66] Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video
Hao Li, Daiwei Lu, Xing Yao, Nicholas Kavoussi, Ipek Oguz
Main category: cs.CV
TL;DR: Endo-SemiS is a semi-supervised segmentation framework for endoscopic video frames that uses cross-supervision, uncertainty-guided pseudo-labels, joint pseudolabel supervision, and mutual learning to effectively utilize limited labeled data and abundant unlabeled data.
Details
Motivation: Endoscopic video analysis requires reliable segmentation but suffers from limited annotation availability. Manual annotation is time-consuming and expensive, creating a need for methods that can leverage both limited labeled data and abundant unlabeled data.
Method: Uses four key strategies: 1) Cross-supervision between two networks, 2) Uncertainty-guided pseudo-labels from unlabeled data, 3) Joint pseudolabel supervision aggregating reliable pixels from both networks, and 4) Mutual learning at feature and image levels. Also includes a separate corrective network using spatiotemporal information (a minimal pseudo-labeling sketch follows the abstract below).
Result: Substantially superior results compared to state-of-the-art segmentation methods on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy, especially with limited labeled data.
Conclusion: Endo-SemiS effectively addresses the annotation scarcity problem in endoscopic video segmentation by leveraging unlabeled data through multiple complementary strategies, achieving state-of-the-art performance on clinical endoscopic applications.
Abstract: In this paper, we present Endo-SemiS, a semi-supervised segmentation framework for providing reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance by effectively utilizing all available data, particularly unlabeled data: (1) Cross-supervision between two individual networks that supervise each other; (2) Uncertainty-guided pseudo-labels from unlabeled data, which are generated by selecting high-confidence regions to improve their quality; (3) Joint pseudolabel supervision, which aggregates reliable pixels from the pseudo-labels of both networks to provide accurate supervision for unlabeled data; and (4) Mutual learning, where both networks learn from each other at the feature and image levels, reducing variance and guiding them toward a consistent solution. Additionally, a separate corrective network utilizes spatiotemporal information from endoscopy video to further improve segmentation performance. Endo-SemiS is evaluated on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy. Compared to state-of-the-art segmentation methods, Endo-SemiS achieves substantially superior results on both datasets with limited labeled data. The code is publicly available at https://github.com/MedICL-VU/Endo-SemiS
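Strategies (2) and (3) can be sketched compactly: keep only pixels where a network's softmax confidence clears a threshold, and fuse the two networks' confident, non-conflicting pixels into one joint pseudo-label. The threshold and tie-breaking rule below are illustrative assumptions, not Endo-SemiS's exact design.

```python
import torch

def joint_pseudolabel(p1, p2, tau=0.9):
    """p1, p2: (B, C, H, W) softmax outputs of the two networks.

    Returns per-pixel labels plus a mask of pixels where at least one
    network is confident; confident-but-disagreeing pixels are dropped.
    """
    conf1, lab1 = p1.max(dim=1)
    conf2, lab2 = p2.max(dim=1)
    trust1, trust2 = conf1 > tau, conf2 > tau
    labels = torch.where(conf1 >= conf2, lab1, lab2)   # defer to the surer network
    agree = lab1 == lab2
    mask = (trust1 & trust2 & agree) | (trust1 ^ trust2)
    return labels, mask

p1 = torch.softmax(torch.randn(2, 3, 8, 8), dim=1)
p2 = torch.softmax(torch.randn(2, 3, 8, 8), dim=1)
labels, mask = joint_pseudolabel(p1, p2)
print(labels.shape, float(mask.float().mean()))  # fraction of supervised pixels
```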
[67] A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal
Main category: cs.CV
TL;DR: LongShOTBench is a diagnostic benchmark for long-form multimodal video understanding that combines temporal length with multimodal richness, featuring open-ended questions, dialogues, and tool use tasks with reference answers and graded rubrics for interpretable evaluation.
Details
Motivation: Existing benchmarks for long-form video understanding either focus on temporal length OR multimodal richness, but rarely both. They also mostly rely on single-score accuracy metrics that obscure failure modes, lacking interpretable evaluation.Method: Created LongShOTBench with open-ended intent-driven questions, single/multi-turn dialogues, multimodal reasoning tasks, and agentic tool use across video/audio/speech. Used scalable human-validated pipeline for coverage and reproducibility. Also developed LongShOTAgent system with preprocessing, search, and iterative refinement.
Result: State-of-the-art MLLMs show large performance gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. Results demonstrate the difficulty of real-world long-form video understanding.
Conclusion: LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs on long-form multimodal video understanding, highlighting significant challenges in this domain.
Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
[68] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
Main category: cs.CV
TL;DR: 4D-RGPT: A multimodal LLM with enhanced 4D perception for video understanding, trained via perceptual distillation and evaluated on a new region-level 4D benchmark.
Details
Motivation: Current MLLMs have limited ability to reason over 3D structures and temporal dynamics due to weak 4D perception and temporal understanding. Existing 3D/4D VQA benchmarks focus on static scenes and lack region-level prompting.
Method: Three main contributions: (1) 4D-RGPT - a specialized MLLM for 4D video representation; (2) Perceptual 4D Distillation (P4D) - training framework transferring 4D representations from frozen expert model; (3) R4D-Bench - depth-aware dynamic scene benchmark with region-level prompting via hybrid automated/human-verified pipeline.
Result: 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
Conclusion: The proposed approach addresses limitations in current MLLMs’ 4D perception and temporal understanding, providing a comprehensive solution through specialized architecture, training framework, and evaluation benchmark.
Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
[69] FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring
Martin Schwartz, Fajwel Fogel, Nikola Besic, Damien Robert, Louis Geist, Jean-Pierre Renaud, Jean-Matthieu Monnet, Clemens Mosig, Cédric Vega, Alexandre d’Aspremont, Loic Landrieu, Philippe Ciais
Main category: cs.CV
TL;DR: FORMSpoT provides decade-long, high-resolution (1.5m) forest canopy height mapping and disturbance detection across France, significantly outperforming existing products for detecting small-scale disturbances.
Details
Motivation: The European forest carbon sink is declining, requiring better monitoring tools. Existing satellite products are too coarse (typically >100m²) to detect individual tree-level changes, limiting the ability to monitor subtle disturbances like thinning or selective logging.
Method: Used SPOT-6/7 satellite time series (2014-2024) with a hierarchical transformer model (PVTv2) trained on airborne laser scanning data. Developed a post-processing pipeline combining co-registration and spatio-temporal total variation denoising for robust change detection across heterogeneous acquisitions (a toy denoising sketch follows the abstract below).
Result: FORMSpoT-Δ achieves F1-score of 0.44 in mountainous forests with small, fragmented disturbances - an order of magnitude better than existing benchmarks. Validated against 19 ALS sites and 5,087 National Forest Inventory plots.
Conclusion: FORMSpoT enables tree-level forest monitoring at national scale, providing crucial tool for analyzing management practices, detecting early forest decline signals, and quantifying carbon losses from subtle disturbances. Highlights importance of sustaining high-resolution satellite missions and open-data initiatives.
Abstract: The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.
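For intuition on the temporal regularization step, here is a toy total-variation denoiser over an annual height-map stack: a quadratic data term plus a smoothed-absolute-value penalty on temporal differences, minimized by plain gradient descent. This is a simplified, temporal-only sketch under our own assumptions; the paper's spatio-temporal solver is not specified here.

```python
import numpy as np

def tv_denoise_t(y, lam=1.0, lr=0.02, iters=400, eps=1e-2):
    """y: (T, H, W) noisy canopy-height stack; returns a piecewise-smooth-in-time x."""
    x = y.copy()
    for _ in range(iters):
        d = np.diff(x, axis=0)
        g = d / np.sqrt(d * d + eps)                     # gradient of smoothed |d|
        div = np.diff(np.pad(g, ((1, 1), (0, 0), (0, 0))), axis=0)
        x -= lr * ((x - y) - lam * div)                  # data term + TV term
    return x

rng = np.random.default_rng(0)
clean = np.full((6, 16, 16), 10.0); clean[3:] = 0.0      # clear-cut at t = 3
noisy = clean + rng.normal(0, 1, size=clean.shape)
den = tv_denoise_t(noisy)
print(np.abs(den - clean).mean(), "<", np.abs(noisy - clean).mean())
```

The height drop at t = 3 survives because the TV gradient saturates for large jumps, while per-year noise is averaged away; that is the property that makes TV-style smoothing attractive for disturbance mapping.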
[70] Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo
Main category: cs.CV
TL;DR: InfCam: A depth-free framework for camera-controlled video-to-video generation that uses infinite homography warping and data augmentation to achieve high camera-pose fidelity without depth estimation.
Details
Motivation: Existing methods for camera-controlled novel-view video generation struggle with pose fidelity due to depth estimation errors and limited trajectory diversity in training data. There's a need for a solution that ensures accurate camera pose adherence while maintaining view consistency and handling occlusions.
Method: InfCam uses infinite homography warping to encode 3D camera rotations directly in the 2D latent space of a video diffusion model, predicting residual parallax through end-to-end training. It also employs a data augmentation pipeline to transform existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths (a numerical sketch of the warp follows the abstract below).
Result: InfCam outperforms baseline methods in both camera-pose accuracy and visual fidelity, demonstrating good generalization from synthetic to real-world data.
Conclusion: InfCam provides an effective depth-free solution for camera-controlled video generation that addresses the limitations of existing approaches, offering high pose fidelity and quality results that generalize well to real-world scenarios.
Abstract: Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on a trajectory-video pair dataset, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model; conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page: https://emjay73.github.io/InfCam/
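For reference, the infinite homography that gives the method its name is the classical mapping H∞ = K·R·K⁻¹, which transfers pixels between two views as if all scene points lay at infinite depth; it captures the camera rotation exactly and leaves only parallax to be predicted. A minimal numerical sketch, where the intrinsics, rotation, and direction convention are illustrative assumptions:

```python
import numpy as np

f, cx, cy = 500.0, 320.0, 240.0
K = np.array([[f, 0, cx],
              [0, f, cy],
              [0, 0, 1.0]])                       # pinhole intrinsics

theta = np.deg2rad(5.0)                           # 5-degree yaw between views
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])

H_inf = K @ R @ np.linalg.inv(K)                  # infinite homography

p = np.array([400.0, 300.0, 1.0])                 # source pixel (homogeneous)
q = H_inf @ p
print(q[:2] / q[2])                               # where that pixel lands after rotation
```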
[71] Interpretable Similarity of Synthetic Image Utility
Panagiota Gatoula, George Dimas, Dimitris K. Iakovidis
Main category: cs.CV
TL;DR: Proposes Interpretable Utility Similarity (IUS), a novel measure to assess synthetic medical image quality by evaluating utility for DL-based clinical decision support systems, offering interpretability through clinically relevant features.
Details
Motivation: Current methods for assessing synthetic medical image quality (user studies, inception scores, classification performance) lack interpretability and don't directly measure utility for clinical decision support systems. There's a need for a quantitative, interpretable measure that explains why one synthetic dataset is more useful than another for medical AI applications.
Method: Develops Interpretable Utility Similarity (IUS) inspired by generalized neural additive models. Unlike inception-based measures, IUS is interpretable and explains synthetic dataset utility based on clinically relevant image features. The method assesses similarity between synthetic and real image sets specifically for DL-based clinical decision support system development.
Result: Experimental results on various medical imaging modalities (endoscopic, dermoscopic, fundus, X-ray, ultrasound) show that selecting synthetic images with high utility similarity using IUS can improve classification performance by up to 54.6%. The method demonstrates generality across both color and greyscale medical imaging modalities.
Conclusion: IUS provides an interpretable, quantitative measure for assessing synthetic medical image quality that directly relates to clinical utility. It enables better selection of synthetic images for training DL-based clinical decision support systems, leading to significant performance improvements while maintaining interpretability through clinically relevant features.
Abstract: Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large-scale, privacy-preserving training sets. Despite the significant progress in this field, there is still a largely unanswered research question: “How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?”. Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities, including endoscopic, dermoscopic, and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at https://github.com/innoisys/ius
[72] DGH: Dynamic Gaussian Hair
Junying Wang, Yuanlu Xu, Edith Tretschk, Ziyan Wang, Anastasia Ianina, Aljaz Bozic, Ulrich Neumann, Tony Tung
Main category: cs.CV
TL;DR: Dynamic Gaussian Hair (DGH) is a data-driven framework that learns hair dynamics and appearance using 3D Gaussian representations, enabling realistic animatable hair without physics simulation.
Details
Motivation: Existing methods for photorealistic dynamic hair rely on static capture and physics-based models that don't scale well, requiring manual parameter tuning and heavy computation for diverse hairstyles and motions.
Method: Proposes: (1) coarse-to-fine model for learning temporally coherent hair motion dynamics across hairstyles; (2) strand-guided optimization module for dynamic 3D Gaussian representation with differentiable rendering for view-consistent appearance learning.
Result: DGH achieves promising geometry and appearance results, provides scalable data-driven alternative to physics simulation, and can be integrated into 3D Gaussian avatar frameworks for realistic animatable hair.
Conclusion: DGH offers a fully data-driven approach that scales with training data, generalizes across hairstyles and motions, and enables high-fidelity animatable hair representation without physics simulation limitations.
Abstract: The creation of photorealistic dynamic hair remains a major challenge in digital human modeling because of the complex motions, occlusions, and light scattering. Existing methods often resort to static capture and physics-based models that do not scale as they require manual parameter fine-tuning to handle the diversity of hairstyles and motions, and heavy computation to obtain high-quality appearance. In this paper, we present Dynamic Gaussian Hair (DGH), a novel framework that efficiently learns hair dynamics and appearance. We propose: (1) a coarse-to-fine model that learns temporally coherent hair motion dynamics across diverse hairstyles; (2) a strand-guided optimization module that learns a dynamic 3D Gaussian representation for hair appearance with support for differentiable rendering, enabling gradient-based learning of view-consistent appearance under motion. Unlike prior simulation-based pipelines, our approach is fully data-driven, scales with training data, and generalizes across various hairstyles and head motion sequences. Additionally, DGH can be seamlessly integrated into a 3D Gaussian avatar framework, enabling realistic, animatable hair for high-fidelity avatar representation. DGH achieves promising geometry and appearance results, providing a scalable, data-driven alternative to physics-based simulation and rendering.
[73] Predictive Modeling of Maritime Radar Data Using Transformer Architecture
Bjorna Qesaraku, Jan Steckel
Main category: cs.CV
TL;DR: Survey identifies research gap: transformer architectures have been used for AIS trajectory prediction and sonar frame forecasting, but not for maritime radar frame prediction despite radar’s all-weather reliability.
Details
Motivation: Maritime autonomous systems need robust predictive capabilities for vessel motion and environmental dynamics. While transformers work for AIS trajectory prediction and sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar's all-weather reliability for navigation.
Method: Systematic review of predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting. Analyzes existing representative methods according to data type, architecture, and prediction horizon.
Result: The review shows that while literature demonstrates transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction.
Conclusion: Identifies a clear research gap and motivates a concrete research direction for future work in transformer-based maritime radar frame prediction.
Abstract: Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar’s all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.
[74] SDUM: A Scalable Deep Unrolled Model for Universal MRI Reconstruction
Puyang Wang, Pengfei Guo, Keyi Chai, Jinyuan Zhou, Daguang Xu, Shanshan Jiang
Main category: cs.CV
TL;DR: SDUM is a universal MRI reconstruction framework that achieves state-of-the-art performance across diverse clinical protocols without task-specific fine-tuning, demonstrating foundation-model-like scaling behavior.
Details
Motivation: Current deep learning MRI reconstructions are protocol-specific, hindering generalization and deployment across diverse clinical imaging protocols (different anatomical targets, contrasts, sampling patterns, acceleration factors).
Method: SDUM combines a Restormer-based reconstructor, a learned coil sensitivity map estimator (CSME), sampling-aware weighted data consistency (SWDC), universal conditioning (UC) on cascade index and protocol metadata, and progressive cascade expansion training (a generic data-consistency sketch follows the abstract below).
Result: SDUM shows foundation-model-like scaling: PSNR ~ log(parameters) with r=0.986 up to 18 cascades. A single SDUM achieves SOTA across all four CMRxRecon2025 challenge tracks, surpassing specialized baselines by up to +1.0 dB, and outperforms previous winners on CMRxRecon2024 and fastMRI brain.
Conclusion: SDUM establishes a practical path toward universal, scalable MRI reconstruction that can handle diverse clinical protocols without task-specific fine-tuning.
Abstract: Clinical MRI encompasses diverse imaging protocols–spanning anatomical targets (cardiac, brain, knee), contrasts (T1, T2, mapping), sampling patterns (Cartesian, radial, spiral, kt-space), and acceleration factors–yet current deep learning reconstructions are typically protocol-specific, hindering generalization and deployment. We introduce Scalable Deep Unrolled Model (SDUM), a universal framework combining a Restormer-based reconstructor, a learned coil sensitivity map estimator (CSME), sampling-aware weighted data consistency (SWDC), universal conditioning (UC) on cascade index and protocol metadata, and progressive cascade expansion training. SDUM exhibits foundation-model-like scaling behavior: reconstruction quality follows PSNR ${\sim}$ log(parameters) with correlation $r{=}0.986$ ($R^2{=}0.973$) up to 18 cascades, demonstrating predictable performance gains with model depth. A single SDUM trained on heterogeneous data achieves state-of-the-art results across all four CMRxRecon2025 challenge tracks–multi-center, multi-disease, 5T, and pediatric–without task-specific fine-tuning, surpassing specialized baselines by up to ${+}1.0$~dB. On CMRxRecon2024, SDUM outperforms the winning method PromptMR+ by ${+}0.55$~dB; on fastMRI brain, it exceeds PC-RNN by ${+}1.8$~dB. Ablations validate each component: SWDC ${+}0.43$~dB over standard DC, per-cascade CSME ${+}0.51$~dB, UC ${+}0.38$~dB. These results establish SDUM as a practical path toward universal, scalable MRI reconstruction.
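For readers unfamiliar with unrolled reconstruction, each cascade typically ends with a data-consistency step that pulls the current estimate back toward the measured k-space samples. The sketch below shows a generic weighted variant for the single-coil Cartesian case; the weighting scheme, coil handling, and parameters of SDUM's actual SWDC are not specified here, so treat this as an illustrative assumption.

```python
import numpy as np

def weighted_dc_step(x, y, mask, w, eta=1.0):
    """One weighted data-consistency update.

    x: (H, W) current image estimate; y: (H, W) measured k-space;
    mask: (H, W) binary sampling pattern; w: (H, W) per-sample weights.
    """
    kx = np.fft.fft2(x, norm="ortho")
    resid = mask * w * (kx - y)                 # weighted residual on sampled lines
    return x - eta * np.fft.ifft2(resid, norm="ortho")

rng = np.random.default_rng(0)
gt = rng.random((32, 32))
mask = (rng.random((32, 32)) < 0.4).astype(float)      # 40% Cartesian-like sampling
y = mask * np.fft.fft2(gt, norm="ortho")

x = np.zeros((32, 32), dtype=complex)
for _ in range(5):
    x = weighted_dc_step(x, y, mask, w=np.ones_like(mask))
err = np.abs(mask * (np.fft.fft2(x, norm="ortho") - y)).max()
print(err)   # ~1e-16: the estimate is exactly consistent with the measurements
```

In a full unrolled model, a learned network (here, the Restormer-based reconstructor) alternates with this step, filling in the unsampled frequencies.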
[75] Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps
Sandeep Mishra, Yasamin Jafarian, Andreas Lugmayr, Yingwei Li, Varsha Ramakrishnan, Srivatsan Varadharajan, Alan C. Bovik, Ira Kemelmacher-Shlizerman
Main category: cs.CV
TL;DR: A method to transform casual photos into professional portraits by mapping to UV space and using multi-image fine-tuning, enabling pose changes, better lighting, and flattering presentation while preserving identity.
Details
Motivation: Professional photos have beautiful lighting, interesting poses, and flattering quality that casual self-photos lack. There's no large paired dataset of the same person photographed both casually and professionally, making this transformation challenging.
Method: 1) Transform input photo to canonical UV space coupled with reposing methodology to handle occlusions and novel view synthesis. 2) Personalize output via multi-image fine-tuning. UV space approach leverages existing unpaired datasets.
Result: The approach yields high-quality, reposed portraits with strong qualitative and quantitative performance on real-world imagery, successfully creating professional versions of casual photos.
Conclusion: The method effectively transforms casual photos into professional portraits by operating in UV space and using personalization techniques, addressing the lack of paired training data while preserving individual identity and features.
Abstract: Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people can take of themselves. In this paper, we explore how to create a "professional" version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face, and body features while transforming the photo. If a large paired dataset existed of the same person photographed both "in the wild" and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with a reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi-image fine-tuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.
[76] Text-Conditioned Background Generation for Editable Multi-Layer Documents
Taewon Kang, Joseph K J, Chris Tensmeyer, Jihyung Kil, Wanrong Zhu, Ming C. Lin, Vlad I. Morariu
Main category: cs.CV
TL;DR: A training-free framework for document background generation with multi-page editing, text preservation via latent masking and automated readability optimization, and thematic continuity through summarization-based guidance.
Details
Motivation: To bridge generative modeling with natural design workflows by enabling automated document background generation that preserves text readability, maintains multi-page consistency, and allows flexible customization while keeping text regions intact.
Method: 1) Latent masking formulation inspired by smooth barrier functions to softly attenuate diffusion updates in text regions; 2) Automated Readability Optimization (ARO) that places semi-transparent rounded backing shapes behind text with the minimal opacity meeting WCAG 2.2 standards (a contrast-search sketch follows the abstract below); 3) Summarization-and-instruction process for multi-page consistency; 4) Structured document composition treating text, figures, and backgrounds as separate layers; 5) User prompts for stylistic adjustments.
Result: Produces visually coherent, text-preserving, and thematically aligned documents without requiring training, enabling targeted background editing while maintaining readability and aesthetic harmony across multi-page documents.
Conclusion: The framework successfully bridges generative modeling with design workflows by combining automated consistency mechanisms (latent masking, ARO, summarization guidance) with user customization, creating a practical solution for document-centric background generation that preserves both readability and visual continuity.
Abstract: We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a \emph{latent masking} formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce \emph{Automated Readability Optimization (ARO)}, which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
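The readability constraint behind ARO is easy to make concrete. WCAG 2.x defines relative luminance and a contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), with 4.5:1 required for normal text; the minimal-opacity search below composites a backing color over the background until that threshold is met. The WCAG formulas are standard; the linear search itself is our illustrative reading of ARO, not the paper's exact procedure.

```python
def rel_luminance(rgb):
    """WCAG relative luminance of an sRGB color with channels in [0, 1]."""
    def lin(c):
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(c1, c2):
    l1, l2 = sorted((rel_luminance(c1), rel_luminance(c2)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def min_opacity(text, backing, background, target=4.5, step=0.01):
    """Smallest backing-shape opacity that reaches the target contrast."""
    alpha = 0.0
    while alpha <= 1.0 + 1e-9:
        mixed = tuple(alpha * bk + (1 - alpha) * bg
                      for bk, bg in zip(backing, background))
        if contrast(text, mixed) >= target:
            return round(alpha, 2)
        alpha += step
    return None   # even a fully opaque shape cannot reach the target

# White text on a light-gray page, with a black backing shape:
print(min_opacity(text=(1, 1, 1), backing=(0, 0, 0), background=(0.8, 0.8, 0.8)))
# -> 0.42: the shape needs ~42% opacity before white text is WCAG-readable
```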
[77] PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics
Nan Zhou, Huandong Wang, Jiahao Li, Yang Li, Xiao-Ping Zhang, Yong Li, Xinlei Chen
Main category: cs.CV
TL;DR: PhysFire-WM is a physics-informed world model that combines infrared thermal data with fire masks to predict fire spread dynamics more accurately by incorporating physical priors from simulators and using cross-task collaborative training.
Details
Motivation: Current fire prediction methods are limited to binary mask modeling with sparse signals, failing to capture complex fire dynamics. Existing world models for video generation have physical inconsistencies that make them unsuitable for accurate fire forecasting.
Method: PhysFire-WM encodes structured priors from a physical simulator to rectify physical discrepancies, and uses Cross-task Collaborative Training (CC-Train) that integrates thermal radiation dynamics and spatial boundary delineation through parameter sharing and gradient coordination.
Result: Extensive experiments on a fine-grained multimodal fire dataset demonstrate superior accuracy in fire spread prediction compared to existing methods.
Conclusion: The paper validates the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction scenarios.
Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.
[78] Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang
Main category: cs.CV
TL;DR: LGCLIP is a lightweight zero-shot image classification framework that uses LLM-generated prompts and diffusion models to create visual prototypes, eliminating the need for annotated image-text pairs and dual encoders.
Details
Motivation: Existing VLMs like CLIP require expensive annotated text-image pairs for modality alignment and use dual-tower encoders that hinder lightweight deployment. There's a need for more efficient, annotation-free approaches.
Method: LGCLIP uses LLMs to generate class-specific prompts, which guide diffusion models to synthesize reference images as visual prototypes. Real image features are compared with these prototype features using only a visual encoder for classification (a minimal prototype-matching sketch follows the abstract below).
Result: Experimental results validate LGCLIP’s feasibility and efficiency, demonstrating strong performance in zero-shot classification tasks while being lightweight and annotation-free.
Conclusion: LGCLIP establishes a novel paradigm for classification that requires only class labels, eliminates manual annotation needs, and offers a lightweight alternative to traditional VLM approaches.
Abstract: Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning the visual and textual modalities. This dependency introduces substantial cost and accuracy requirements in preparing high-quality datasets. At the same time, processing data from two modalities also requires dual-tower encoders for most models, which further hinders lightweight deployment. To address these limitations, we introduce a "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes, and the visual features of real images are extracted and compared with the visual features of these prototypes to achieve comparative prediction. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input during the whole experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance in zero-shot classification tasks and establishing a novel paradigm for classification.
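The inference side of this idea reduces to nearest-prototype matching in a single embedding space. A minimal sketch, with a placeholder encoder standing in for the visual backbone; the names and toy data are our assumptions:

```python
import numpy as np

def encode(img):                          # placeholder encoder: flatten + normalize
    v = img.reshape(-1).astype(float)
    return v / (np.linalg.norm(v) + 1e-12)

def classify(test_img, prototypes):
    """prototypes: {class_name: list of synthetic reference images}."""
    z = encode(test_img)
    centroids = {c: np.mean([encode(p) for p in imgs], axis=0)
                 for c, imgs in prototypes.items()}
    return max(centroids, key=lambda c: float(z @ centroids[c]))

rng = np.random.default_rng(0)
protos = {"cat": [rng.random((8, 8)) for _ in range(3)],
          "dog": [rng.random((8, 8)) + 1.0 for _ in range(3)]}
print(classify(rng.random((8, 8)) + 1.0, protos))  # closer to the "dog" prototypes
```

Conceptually, the synthetic prototypes play the role that CLIP's text embeddings usually play, which is why no text tower is needed at test time.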
[79] ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
Qi Zhang, Yuxu Chen, Lei Deng, Lili Shen
Main category: cs.CV
TL;DR: ABE-CLIP is a training-free method that enhances attribute-object binding in CLIP models through semantic refinement and local token-patch alignment, improving compositional image-text matching without additional training.
Details
Motivation: CLIP struggles with compositional image-text matching, particularly associating objects with attributes, due to its global representation overlooking fine-grained semantics. Existing methods require additional training or extensive negative sampling but show limited generalization and fail to address the fundamental limitations of global representations.
Method: ABE-CLIP uses a Semantic Refinement Mechanism to refine token embeddings for object and attribute phrases, mitigating attribute confusion. It introduces Local Token-Patch Alignment, which computes similarity scores between refined textual tokens and their most relevant image patches, then aggregates the localized similarity scores into the final image-text similarity (a shape-level sketch follows the abstract below).
Result: Experiments on multiple datasets show ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
Conclusion: ABE-CLIP effectively enhances attribute binding in CLIP-like models without requiring additional training, addressing the limitations of global representations through semantic refinement and local alignment strategies.
Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
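The alignment step has a simple shape-level reading: cosine similarities between every refined token and every patch, a max over patches per token, and an aggregate over tokens. The sketch below uses random vectors in place of real CLIP embeddings; the max/mean aggregation is our illustrative choice, not necessarily the paper's exact rule.

```python
import numpy as np

def token_patch_similarity(tokens, patches):
    """tokens: (T, D) refined token embeddings; patches: (P, D) patch embeddings."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = t @ p.T                    # (T, P) cosine similarities
    return sim.max(axis=1).mean()    # best patch per token, aggregated over tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 64))
patches = np.vstack([tokens + 0.1 * rng.normal(size=(5, 64)),  # matching patches
                     rng.normal(size=(44, 64))])               # distractors
print(token_patch_similarity(tokens, patches))  # close to 1 when matches exist
```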
[80] It is not always greener on the other side: Greenery perception across demographics and personalities in multiple cities
Matias Quintana, Fangqi Liu, Jussi Torkko, Youlong Gu, Xiucheng Liang, Yujun Hou, Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, Tuuli Toivonen, Yi Lu, Filip Biljecki
Main category: cs.CV
TL;DR: This paper analyzes discrepancies between objective measurements and subjective perceptions of urban greenery across five countries, finding that location/cultural factors influence perception more than demographics or personality.
Details
Motivation: Urban greenery assessment is crucial for planning, but objective measurements (like the Green View Index) may differ from subjective human perceptions. Understanding these discrepancies and their drivers is important for effective urban planning that aligns with human experience.
Method: Used street view imagery for objective greenery measurements and conducted a comprehensive urban visual perception survey with 1,000 people across five countries. Analyzed discrepancies between objective measures (GVI) and subjective scores (pairwise ratings) along human, geographic, and spatial dimensions including demographics, personality, and location factors (a GVI computation sketch follows the abstract below).
Result: Discrepancies between objective and subjective greenery assessments are comparable worldwide. Demographics and personality don’t significantly affect perception. Location factors (where people live) are among the top two most influential features affecting perceived greenery, suggesting cultural, environmental, and experiential factors shape urban greenery perception.
Conclusion: Cultural and location-based factors substantially influence how people perceive urban greenery, more than individual demographics or personality. This highlights the importance of considering local context and human perception alongside objective measurements in urban planning and greenery assessment.
Abstract: Quantifying and assessing urban greenery is consequential for planning and development, reflecting the everlasting importance of green spaces for multiple climate and well-being dimensions of cities. Evaluation can be broadly grouped into objective (e.g., measuring the amount of greenery) and subjective (e.g., polling the perception of people) approaches, which may differ – what people see and feel about how green a place is might not match the measurements of the actual amount of vegetation. In this work, we advance the state of the art by measuring such differences and explaining them through human, geographic, and spatial dimensions. The experiments rely on contextual information extracted from street view imagery and a comprehensive urban visual perception survey collected from 1,000 people across five countries with their extensive demographic and personality information. We analyze the discrepancies between objective measures (e.g., Green View Index (GVI)) and subjective scores (e.g., pairwise ratings), examining whether they can be explained by a variety of human and visual factors such as age group and spatial variation of greenery in the scene. The findings reveal that such discrepancies are comparable around the world and that demographics and personality do not play a significant role in perception. Further, while perceived and measured greenery correlate consistently across geographies (both where people and where imagery are from), where people live plays a significant role in explaining perceptual differences, ranking among the top two of seven features that influence perceived greenery the most. This location influence suggests that cultural, environmental, and experiential factors substantially shape how individuals observe greenery in cities.
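On the objective side, the Green View Index referenced above is conventionally computed as the share of street-view pixels a segmentation model labels as vegetation. A minimal sketch, where the label ids and random mask are illustrative placeholders:

```python
import numpy as np

VEGETATION_CLASSES = {8, 9}   # hypothetical label ids, e.g. tree and grass

def green_view_index(seg_map: np.ndarray) -> float:
    """seg_map: (H, W) integer class labels for one street-view image."""
    green = np.isin(seg_map, list(VEGETATION_CLASSES))
    return float(green.mean())

rng = np.random.default_rng(0)
seg = rng.integers(0, 20, size=(256, 256))            # stand-in segmentation output
print(f"GVI = {green_view_index(seg):.3f}")           # share of vegetation pixels
```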
[81] Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences
Zhenbao Yu, Banglei Guan, Shunkun Liang, Zibin Liu, Yang Shang, Qifeng Yu
Main category: cs.CV
TL;DR: A globally optimal solver using affine correspondences to estimate generalized relative pose with known vertical direction for multi-camera systems with IMU.
Details
Motivation: Mobile devices with multi-camera systems and IMUs (like self-driving cars) need accurate relative pose estimation. Current methods need improvement in accuracy for visual-inertial applications.
Method: 1) Decouple rotation matrix and translation vector, establish cost function about relative rotation angle minimizing algebraic error from affine correspondences. 2) Convert global optimization into two polynomials with two unknowns using characteristic equation and its first derivative. 3) Solve relative rotation angle with polynomial eigenvalue solver, get translation from eigenvector. 4) Also propose linear solution for small relative rotations.
Result: The proposed solver outperforms comparable state-of-the-art methods in accuracy on both synthetic data and real-world datasets.
Conclusion: The globally optimal solver using affine correspondences with known vertical direction provides accurate relative pose estimation for multi-camera systems, demonstrating superior performance over existing methods.
Abstract: Mobile devices equipped with a multi-camera system and an inertial measurement unit (IMU), such as self-driving cars, are widely used nowadays. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation for multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function in the relative rotation angle is established after decoupling the rotation matrix and translation vector, minimizing the algebraic error of the geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials in two unknowns based on the characteristic equation and the condition that its first derivative vanishes. Finally, the relative rotation angle can be solved using a polynomial eigenvalue solver, and the translation vector can be obtained from the corresponding eigenvector. Besides, a new linear solution is proposed for the case when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experimental results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.
[82] Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
Xiao Liang, Chenxi Liu, Zhi Ma, Di Wang, Bin Jing, Quan Wang, Yuanyuan Shi
Main category: cs.CV
TL;DR: ARCD introduces anatomical region-guided contrastive decoding to reduce hallucinations in medical VLMs by providing targeted, region-specific guidance through dynamic re-weighting at token, attention, and logits levels.
Details
Motivation: Medical VLMs suffer from hallucinations where they rely on textual priors rather than visual evidence. Existing mitigation strategies have limitations: training-based methods require costly expert annotations, while training-free methods like contrastive decoding apply global, untargeted corrections that are unreliable in complex clinical settings.
Method: ARCD uses an anatomical mask to direct a three-tiered contrastive decoding process. It dynamically re-weights at the token, attention, and logits levels to steer the model’s focus onto specified anatomical regions, reinforcing anatomical understanding and suppressing factually incorrect outputs (a logits-level sketch follows the abstract below).
Result: Extensive experiments across diverse datasets (chest X-ray, CT, brain MRI, ocular ultrasound) demonstrate effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
Conclusion: ARCD provides a plug-and-play strategy for targeted hallucination mitigation in medical VLMs that is more reliable than existing methods and applicable across various medical imaging modalities.
Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model’s focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method’s effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
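The logits-level tier can be sketched in one line using the standard contrastive-decoding form: amplify the next-token distribution obtained with region guidance and subtract the unguided one. The (1 + α)/−α weighting follows common contrastive-decoding practice and is our assumption about the shape of the operation, not ARCD's exact formula; the token- and attention-level re-weighting is not shown.

```python
import torch

def contrastive_logits(logits_region, logits_plain, alpha=1.0):
    """logits_region / logits_plain: (vocab,) next-token logits from the same
    model with and without anatomical-region guidance."""
    return (1 + alpha) * logits_region - alpha * logits_plain

# Toy call with random logits standing in for real model outputs:
logits_region, logits_plain = torch.randn(32000), torch.randn(32000)
print(int(contrastive_logits(logits_region, logits_plain).argmax()))
```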
[83] Fose: Fusion of One-Step Diffusion and End-to-End Network for Pansharpening
Kai Liu, Zeli Lin, Weibo Wang, Linghe Kong, Yulun Zhang
Main category: cs.CV
TL;DR: Fose: A lightweight pansharpening network that fuses one-step diffusion model distillation with an end-to-end model using a four-stage training strategy, achieving SOTA performance with 7.42× speedup.
Details
Motivation: Current diffusion models for pansharpening are computationally expensive (multi-step process), while end-to-end models lack priors and have simple structures. Need to combine accuracy with efficiency.
Method: Four-stage training: 1) One-step distillation from enhanced SOTA diffusion model (compressing 50 steps to 1), 2) Fusion with E2E model using lightweight ensemble blocks, 3) Comprehensive training strategy, 4) Lightweight network design.
Result: Significant improvement on three benchmarks, 7.42× speedup compared to baseline diffusion model while achieving better performance.
Conclusion: Fose successfully combines the accuracy of diffusion models with the efficiency of end-to-end models through distillation and fusion, achieving state-of-the-art pansharpening with dramatically reduced computational cost.
Abstract: Pansharpening is a significant image fusion task that fuses low-resolution multispectral images (LRMSI) and high-resolution panchromatic images (PAN) to obtain high-resolution multispectral images (HRMSI). The development of diffusion models (DM) and end-to-end models (E2E models) has greatly advanced the frontier of pansharpening. DM takes multi-step diffusion to obtain an accurate estimation of the residual between LRMSI and HRMSI. However, the multi-step process demands substantial computational power and is time-consuming. As for E2E models, their performance is still limited by the lack of priors and simple structures. In this paper, we propose a novel four-stage training strategy to obtain a lightweight network, Fose, which fuses a one-step DM and an E2E model. We perform one-step distillation on an enhanced SOTA DM for pansharpening to compress the inference process from 50 steps to only 1 step. Then we fuse the E2E model with the one-step DM using lightweight ensemble blocks. Comprehensive experiments are conducted to demonstrate the significant improvement of the proposed Fose on three commonly used benchmarks. Moreover, we achieve a 7.42× speedup compared to the baseline DM while achieving much better performance. The code and model are released at https://github.com/Kai-Liu001/Fose.
[84] Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng
Main category: cs.CV
TL;DR: Reasoning Palette uses a latent variable framework with VAE to inject diverse reasoning contexts into language models, enabling strategic exploration and better performance in reasoning tasks.
Details
Motivation: Stochastic sampling in large language models often produces redundant reasoning paths with limited high-level diversity, which limits both inference performance and reinforcement learning training effectiveness.
Method: Proposes a latent-modulation framework using a VAE to infer stochastic latent variables from question-answer pairs. These latents are decoded into token prefixes that modulate the model’s internal reasoning trajectory before token generation, with SFT warm-up for adaptation.
Result: Enables interpretable and controllable strategic behavior, achieves consistent performance gains over standard RL methods across multiple reasoning benchmarks, and enhances exploration efficiency and sustained learning capability.
Conclusion: Reasoning Palette provides an effective framework for strategic contextualization in language models, improving both inference diversity and RL training through structured exploration of reasoning strategies.
Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model’s internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model’s strategic behavior, thereby achieving consistent performance gains over standard RL methods.
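A minimal sketch of the latent-to-prefix path described above, with layer sizes, names, and the pooling interface all assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class LatentPrefixModulator(nn.Module):
    """Sketch of the Reasoning Palette pipeline: a VAE posterior over a
    pooled QA embedding, decoded into prefix token embeddings."""
    def __init__(self, latent_dim=64, n_prefix=8, d_model=1024):
        super().__init__()
        self.n_prefix, self.d_model = n_prefix, d_model
        # VAE posterior head over a mean-pooled question-answer embedding
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        # decode a sampled latent into learnable token-prefix embeddings
        self.to_prefix = nn.Linear(latent_dim, n_prefix * d_model)

    def forward(self, qa_embedding: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.to_mu(qa_embedding), self.to_logvar(qa_embedding)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        prefix = self.to_prefix(z).view(-1, self.n_prefix, self.d_model)
        return prefix  # prepend to the prompt's token embeddings

mod = LatentPrefixModulator()
prefix = mod(torch.randn(2, 1024))  # 2 QA pairs, mean-pooled
print(prefix.shape)                 # torch.Size([2, 8, 1024])
```

Sampling a different z yields a different prefix, i.e., a different reasoning context for the same prompt.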
[85] CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency
Xiao Liang, Yuxuan An, Di Wang, Jiawei Hu, Zhicheng Jiao, Bin Jing, Quan Wang
Main category: cs.CV
TL;DR: CheXPO-v2 is a novel alignment framework that uses process supervision with Knowledge Graph Consistency Rewards to reduce hallucinations in medical VLMs, outperforming existing methods with high data efficiency.
Details
Motivation: Medical Vision-Language Models suffer from hallucinations that compromise clinical reliability. Current reinforcement learning methods like GRPO focus on outcome-based rewards, which encourage models to generate verbose, unverifiable reasoning that obscures factual errors and poses safety risks.
Method: Proposes CheXPO-v2 framework with Knowledge Graph Consistency Reward mechanism using Entity-Relation Matching. Parses reasoning steps into structured “Disease, Relation, Anatomy” triplets for fine-grained supervision, combined with hard-example mining strategy.
Result: Significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning.
Conclusion: CheXPO-v2 effectively addresses hallucination issues in medical VLMs by shifting from outcome to process supervision, providing a more reliable and data-efficient alignment solution for clinical applications.
Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to “overthink” – generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured “Disease, Relation, Anatomy” triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
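A small sketch of what an Entity-Relation Matching reward could look like; the F1-style scoring below is an assumption, since the paper specifies only the “Disease, Relation, Anatomy” triplet structure:

```python
def kg_consistency_reward(pred_triplets, ref_triplets):
    """Sketch of a Knowledge Graph Consistency Reward (the exact scoring
    in CheXPO-v2 may differ). Each triplet is (disease, relation, anatomy)."""
    pred, ref = set(pred_triplets), set(ref_triplets)
    if not pred:
        return 0.0
    precision = len(pred & ref) / len(pred)
    recall = len(pred & ref) / max(len(ref), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # F1-style reward

pred = [("effusion", "located_in", "left pleura"),
        ("cardiomegaly", "absent_in", "heart")]
ref = [("effusion", "located_in", "left pleura")]
print(kg_consistency_reward(pred, ref))  # 0.666...
```

Because each reasoning step is scored at the triplet level, an incoherent or hallucinated step lowers the reward even when the final answer happens to be correct.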
[86] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
Main category: cs.CV
TL;DR: DAVE is a specialized vision encoder for VLMs designed for document understanding and web agent tasks, trained through self-supervised and supervised stages with model-merging and ensemble techniques to improve performance without costly annotations.
Details
Motivation: Current VLMs have weak vision encoders that lack robust structural and spatial information needed for document understanding and web agent tasks, creating a fundamental limitation in these applications.
Method: Two-stage training: 1) Self-supervised pretraining on unlabeled images, 2) Supervised autoregressive pretraining with limited high-quality data. Uses model-merging to combine encoders trained with different text decoders, and ensemble training to fuse features from generalist encoders with document/web-specific representations.
Result: Extensive experiments on document tasks, VQAs, web localization, and agent benchmarks validate DAVE’s effectiveness as a strong vision encoder for document and web applications.
Conclusion: DAVE successfully bridges the gap in VLMs’ vision capabilities for document and web agent tasks, establishing itself as a specialized encoder that leverages abundant unlabeled data and innovative training strategies.
Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder’s alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
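The model-merging strategy lends itself to a compact illustration: a convex combination of encoder weights trained with different text decoders. The uniform averaging below is an assumption; DAVE's actual merging scheme may be more involved:

```python
import torch

def merge_encoders(state_dicts, weights=None):
    """Sketch of weight-space model merging: combine encoders that were
    trained against different text decoders into one set of weights."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# toy usage with two tiny "encoders"
enc_a = {"proj.weight": torch.randn(4, 4)}
enc_b = {"proj.weight": torch.randn(4, 4)}
merged = merge_encoders([enc_a, enc_b])
print(merged["proj.weight"].shape)
```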
[87] Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing
Xuyang Li, Chenyu Li, Danfeng Hong
Main category: cs.CV
TL;DR: AOM is a universal Remote Sensing Foundation Model that handles arbitrary band compositions, sensor types, and resolution scales through spectrum-independent tokenization and multi-scale adaptive patch embedding.
Details
Motivation: Existing RSFMs are limited by fixed band configurations and resolutions, making them vulnerable to real-world scenarios involving missing bands, cross-sensor fusion, and unseen spatial scales, which limits their generalization and practical deployment.
Method: AOM introduces: 1) Spectrum-independent tokenizer with dedicated band embeddings for each channel, 2) Multi-scale adaptive patch embedding mechanism for varying resolutions, 3) Multi-scale semantic alignment mechanism, and 4) Channel-wise self-supervised masking and reconstruction pretraining strategy.
Result: Extensive experiments on over 10 public datasets (Sentinel-2, Landsat, HLS) show AOM consistently achieves state-of-the-art performance under challenging conditions including band missing, cross-sensor, and cross-resolution settings.
Conclusion: AOM provides a universal solution for optical satellite imagery analysis that overcomes the limitations of fixed-configuration RSFMs, enabling robust performance across diverse real-world scenarios with varying band compositions and spatial resolutions.
Abstract: Optical satellites, with their diverse band layouts and ground sampling distances, supply indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors present major challenges for existing Remote Sensing Foundation Models (RSFMs). These models are typically pretrained on fixed band configurations and resolutions, making them vulnerable to real-world scenarios involving missing bands, cross-sensor fusion, and unseen spatial scales, thereby limiting their generalization and practical deployment. To address these limitations, we propose Any Optical Model (AOM), a universal RSFM explicitly designed to accommodate arbitrary band compositions, sensor types, and resolution scales. To preserve distinctive spectral characteristics even when bands are missing or newly introduced, AOM introduces a spectrum-independent tokenizer that assigns each channel a dedicated band embedding, enabling explicit encoding of spectral identity. To effectively capture texture and contextual patterns from sub-meter to hundred-meter imagery, we design a multi-scale adaptive patch embedding mechanism that dynamically modulates the receptive field. Furthermore, to maintain global semantic consistency across varying resolutions, AOM incorporates a multi-scale semantic alignment mechanism alongside a channel-wise self-supervised masking and reconstruction pretraining strategy that jointly models spectral-spatial relationships. Extensive experiments on over 10 public datasets, including those from Sentinel-2, Landsat, and HLS, demonstrate that AOM consistently achieves state-of-the-art (SOTA) performance under challenging conditions such as band-missing, cross-sensor, and cross-resolution settings.
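A minimal sketch of a spectrum-independent tokenizer in the spirit of AOM: every channel is patchified independently and tagged with a dedicated band embedding, so arbitrary band subsets can be encoded. All dimensions and the shared patch projection are assumptions:

```python
import torch
import torch.nn as nn

class BandTokenizer(nn.Module):
    """Sketch: per-band patch tokens plus a learned spectral-identity
    embedding per channel (layout assumed, not the paper's exact design)."""
    def __init__(self, n_known_bands=20, patch=16, d_model=256):
        super().__init__()
        self.patchify = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        self.band_embed = nn.Embedding(n_known_bands, d_model)

    def forward(self, img: torch.Tensor, band_ids: torch.Tensor) -> torch.Tensor:
        # img: (B, C, H, W) with an arbitrary set of C bands; band_ids: (C,)
        tokens = []
        for c, band in enumerate(band_ids):
            t = self.patchify(img[:, c:c + 1])        # (B, D, h, w)
            t = t.flatten(2).transpose(1, 2)          # (B, h*w, D)
            tokens.append(t + self.band_embed(band))  # add spectral identity
        return torch.cat(tokens, dim=1)               # (B, C*h*w, D)

tok = BandTokenizer()
x = tok(torch.randn(1, 3, 64, 64), torch.tensor([2, 3, 7]))  # any 3 bands
print(x.shape)  # torch.Size([1, 48, 256])
```

A missing band simply drops out of the token sequence instead of breaking a fixed-channel stem.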
[88] Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
Son Tung Nguyen, Tobias Fischer, Alejandro Fontan, Michael Milford
Main category: cs.CV
TL;DR: Learning-based visual localization method that learns global descriptors consistent with both geometric structure and visual similarity, improving robustness to noisy geometric constraints without manual place labels.
Details
Motivation: Existing visual localization methods use global descriptors derived only from geometric cues (like covisibility graphs), which limits discriminative power and reduces robustness when geometric constraints are noisy. There's a need for descriptors that consider both geometry and visual similarity.
Method: Proposes an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity. Uses a batch-mining strategy based solely on overlap scores and a modified contrastive loss, enabling training without manual place labels.
Result: Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency.
Conclusion: The method successfully learns descriptors that are close in descriptor space only when images are both visually similar and spatially connected, correcting erroneous associations from unreliable overlap scores, and generalizes well across diverse environments.
Abstract: Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at https://github.com/sontung/robust_scr.
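A sketch of a contrastive loss supervised purely by pairwise overlap scores, as the abstract describes; the threshold, margin form, and mining-by-thresholding are assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def overlap_contrastive_loss(desc: torch.Tensor,
                             overlap: torch.Tensor,
                             pos_thresh: float = 0.3,
                             margin: float = 0.5) -> torch.Tensor:
    """desc: (N, D) global descriptors; overlap: (N, N) geometric overlap
    in [0, 1]. Pairs above the threshold are pulled together, the rest
    are pushed apart up to a margin; no place labels are needed."""
    desc = F.normalize(desc, dim=-1)
    dist = torch.cdist(desc, desc)             # pairwise L2 distances
    pos = (overlap > pos_thresh).float()
    neg = 1.0 - pos
    pos_term = pos * dist.pow(2)
    neg_term = neg * F.relu(margin - dist).pow(2)
    off_diag = 1.0 - torch.eye(len(desc))
    return ((pos_term + neg_term) * off_diag).sum() / off_diag.sum()

d = torch.randn(8, 128, requires_grad=True)
ov = torch.rand(8, 8)
print(overlap_contrastive_loss(d, ov))
```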
[89] RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette
Main category: cs.CV
TL;DR: RadImageNet-VQA is a large-scale radiologic VQA dataset with 750K CT/MRI images and 7.5M QA pairs covering abnormality detection, anatomy recognition, and pathology identification across 8 anatomical regions and 97 pathologies.
Details
Motivation: Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts, creating a need for comprehensive radiologic VQA resources.
Method: Built from expert-curated annotations, the dataset includes 750K images paired with 7.5M question-answer samples covering three key tasks across eight anatomical regions and 97 pathology categories, supporting multiple question formats.
Result: State-of-the-art vision-language models struggle with fine-grained pathology identification, especially in open-ended settings, and text-only analysis shows performance collapses without image inputs, confirming absence of linguistic shortcuts.
Conclusion: RadImageNet-VQA provides a comprehensive benchmark for radiologic VQA research, revealing current model limitations and enabling future development of more capable medical vision-language systems.
Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
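Since the dataset is hosted on the Hugging Face Hub, a quick way to inspect it is via the `datasets` library; the split and field names below are assumptions to verify against the dataset card:

```python
from datasets import load_dataset

# Dataset ID taken from the abstract; split name is an assumption.
ds = load_dataset("raidium/RadImageNet-VQA", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the actual schema before relying on field names
```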
[90] Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
Siqi Yang, Zilve Gao, Haibo Qiu, Fanfan Liu, Peng Shi, Zhixiong Zeng, Qingmin Liao, Lin Ma
Main category: cs.CV
TL;DR: The paper addresses “visual forgetting” in MLLMs by disentangling abstract reasoning from visual perception through a two-stage curriculum: first building reasoning skills on text, then teaching when to look via reinforcement learning.
Details
Motivation: Current MLLMs suffer from "visual forgetting" where they lose visual grounding during long reasoning chains. This stems from prematurely entangling abstract reasoning skills with visual perception strategies during training.
Method: Two-stage curriculum: 1) Disentangled SFT builds abstract reasoning backbone on text-only data, then anchors to vision with Perception-Grounded Chain-of-Thought. 2) Reinforcement learning with Pivotal Perception Reward teaches models when to look by coupling perceptual actions to linguistic markers of uncertainty.
Result: The framework transforms models from heuristic-driven observers to strategic, grounded reasoners by addressing both the foundational cold-start deficiency (weak abstract reasoning) and strategic perception deficit (lack of when-to-look policy).
Conclusion: By disentangling reasoning from perception and teaching strategic visual grounding, the approach solves the “think longer, see less” problem in MLLMs, enabling more robust long-chain visual reasoning.
Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is “visual forgetting”, where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as “think longer, see less”. We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning (“how-to-think”) and (2) strategic visual perception (“when-to-look”). This creates a foundational cold-start deficiency – weakening abstract reasoning – and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., “wait”, “verify”), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look.
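A toy sketch of how a Pivotal Perception Reward could couple perceptual actions to uncertainty markers; the marker list, event format, and bonus value are assumptions, and only the wait/verify examples come from the paper:

```python
UNCERTAINTY_MARKERS = ("wait", "verify", "let me check")  # first two from the abstract

def pivotal_perception_reward(steps, base_reward: float, bonus: float = 0.2):
    """Grant a bonus when a perceptual action (e.g., a crop/look tool call)
    immediately follows a linguistic marker of cognitive uncertainty.
    `steps` is a list of ("text", s) / ("look", s) events (format assumed)."""
    r = base_reward
    for prev, cur in zip(steps, steps[1:]):
        if (cur[0] == "look" and prev[0] == "text"
                and any(m in prev[1].lower() for m in UNCERTAINTY_MARKERS)):
            r += bonus
    return r

trace = [("text", "The bar looks taller... wait, verify the axis."),
         ("look", "crop(x=120, y=40, w=200, h=90)")]
print(pivotal_perception_reward(trace, base_reward=1.0))  # 1.2
```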
[91] Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
Henghui Du, Chang Zhou, Chunjie Zhang, Xi Chen, Di Hu
Main category: cs.CV
TL;DR: VideoDetective: An efficient question-aware memory mechanism for MLLMs to process long videos by iteratively seeking critical clues through compression and recurrent aggregation.
Details
Motivation: Long Video QA is challenging for MLLMs due to immense context, information overload, and prohibitive memory consumption. Existing methods that reduce visual tokens or extend context may miss useful information or require excessive computation.
Method: VideoDetective processes videos iteratively in sub-segments using question-aware compression with special memory tokens. It recurrently aggregates and stores these tokens to update history context for subsequent segments, enabling efficient critical clue seeking.
Result: Enables MLLMs with 32K context length to process 100K tokens (3600 frames, hour-long video at 1fps) in 2 minutes using 37GB GPU memory. Outperforms on multiple long video benchmarks and introduces GLVC dataset for better evaluation.
Conclusion: The question-aware memory mechanism effectively addresses long video QA challenges by enabling efficient critical clue seeking from massive information while maintaining computational efficiency.
Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model’s context length, they may miss useful information or require considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed, introducing a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, because the history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to more effectively measure a model’s long-video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1 fps), requiring only 2 minutes and 37 GB of GPU memory. Evaluation results across multiple long video benchmarks illustrate that our method can more effectively seek critical clues from massive information.
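The recurrent memory loop is easy to sketch: each sub-segment is compressed, together with the running history, into a fixed number of memory tokens, so context never grows. The toy model below stands in for the MLLM and omits question conditioning for brevity; all interfaces are assumptions:

```python
import torch
import torch.nn as nn

class ToyDetective(nn.Module):
    """Toy stand-in for the MLLM interface (purely illustrative)."""
    def __init__(self, d_model=64, n_mem=32):
        super().__init__()
        self.mem_queries = nn.Parameter(torch.randn(n_mem, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def compress(self, ctx):
        # question-aware compression is approximated by cross-attention
        # from learnable memory queries to [history + current segment] tokens
        q = self.mem_queries.unsqueeze(0)
        out, _ = self.attn(q, ctx.unsqueeze(0), ctx.unsqueeze(0))
        return out.squeeze(0)

model = ToyDetective()
memory = torch.empty(0, 64)
for seg in [torch.randn(200, 64) for _ in range(5)]:  # 5 video sub-segments
    ctx = torch.cat([memory, seg], dim=0)             # reuse history context
    memory = model.compress(ctx)                      # (32, 64) memory tokens
print(memory.shape)  # fixed-size clue memory regardless of video length
```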
[92] Mitty: Diffusion-based Human-to-Robot Video Generation
Yiren Song, Cheng Liu, Weijia Mao, Mike Zheng Shou
Main category: cs.CV
TL;DR: Mitty is a Diffusion Transformer that enables end-to-end human-to-robot video generation through video in-context learning, bypassing intermediate representations like keypoints or trajectories.
Details
Motivation: Existing methods for learning from human demonstration videos rely on intermediate representations (keypoints, trajectories) that introduce information loss and cumulative errors, harming temporal and visual consistency. There's a need for more direct, end-to-end approaches.
Method: Mitty is built on a pretrained video diffusion model and uses a Diffusion Transformer architecture. It compresses demonstration videos into condition tokens and fuses them with robot denoising tokens through bidirectional attention during diffusion. An automatic synthesis pipeline creates human-robot pairs from large egocentric datasets to address paired-data scarcity.
Result: Experiments on Human2Robot and EPIC-Kitchens datasets show state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.
Conclusion: Mitty demonstrates that video in-context learning with diffusion transformers enables effective end-to-end human-to-robot video generation without intermediate abstractions, offering a promising approach for scalable robot learning directly from human demonstrations.
Abstract: Learning directly from human demonstration videos is a key milestone toward scalable and generalizable robot learning. Yet existing methods rely on intermediate representations such as keypoints or trajectories, introducing information loss and cumulative errors that harm temporal and visual consistency. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Demonstration videos are compressed into condition tokens and fused with robot denoising tokens through bidirectional attention during diffusion. To mitigate paired-data scarcity, we also develop an automatic synthesis pipeline that produces high-quality human-robot pairs from large egocentric datasets. Experiments on Human2Robot and EPIC-Kitchens show that Mitty delivers state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.
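The fusion step described above, condition tokens attending jointly with denoising tokens, can be sketched as plain non-causal self-attention over the concatenated sequence (all dimensions assumed):

```python
import torch
import torch.nn as nn

d = 128
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

cond = torch.randn(1, 64, d)      # condition tokens from the human demo video
denoise = torch.randn(1, 256, d)  # robot-video denoising tokens at some step

# Bidirectional fusion sketch: full (non-causal) self-attention over the
# concatenated sequence lets demo and robot tokens attend to each other.
seq = torch.cat([cond, denoise], dim=1)
fused, _ = attn(seq, seq, seq)           # no attn_mask -> bidirectional
denoise_fused = fused[:, cond.size(1):]  # keep only the robot tokens
print(denoise_fused.shape)               # torch.Size([1, 256, 128])
```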
[93] AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning
Dong Zifei, Wu Wenjie, Hao Jinkui, Chen Tianqi, Weng Ziqiao, Zhou Bo
Main category: cs.CV
TL;DR: AnyCXR is a unified framework for generalizable multi-organ segmentation of chest X-rays using only synthetic supervision, achieving strong zero-shot generalization across different projection angles without real annotations.
Details
Motivation: Robust anatomical segmentation of chest X-rays is challenging due to scarcity of comprehensive annotations and substantial variability in real-world acquisition conditions.
Method: Combines Multi-stage Domain Randomization (MSDR) engine generating diverse synthetic radiographs from 3D CT volumes with Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial/imperfect labels by enforcing anatomical consistency in latent space.
Result: Achieves strong zero-shot generalization on multiple real-world datasets, accurately delineating 54 anatomical structures in PA, lateral, and oblique views. Supports downstream clinical tasks including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification.
Conclusion: AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
Abstract: Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
[94] WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images
Guoping Cai, Houjin Chen, Yanfeng Li, Jia Sun, Ziwei Chen, Qingzi Geng
Main category: cs.CV
TL;DR: WDFFU-Mamba: A novel U-shaped Mamba network with wavelet-guided enhancement and dual-attention feature fusion for robust breast ultrasound tumor segmentation, achieving state-of-the-art performance on public datasets.
Details
Motivation: Breast ultrasound image segmentation is crucial for clinical diagnosis and early tumor screening, but faces challenges from speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries that hinder accurate segmentation.
Method: Proposes WDFFU-Mamba network with two key modules: 1) Wavelet-denoised High-Frequency-guided Feature (WHF) module to enhance low-level representations using noise-suppressed high-frequency cues, and 2) Dual Attention Feature Fusion (DAFF) module to effectively merge skip-connected and semantic features for better contextual consistency, all within a U-shaped Mamba architecture.
Result: Extensive experiments on two public BUS datasets show WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95).
Conclusion: The combination of wavelet-domain enhancement and attention-based fusion improves both accuracy and robustness of BUS image segmentation while maintaining computational efficiency. The model shows strong generalization ability across datasets, making it promising for real-world clinical applications in breast tumor ultrasound analysis.
Abstract: Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images. We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual consistency. Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95). The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency. The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.
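The WHF idea of noise-suppressed high-frequency guidance can be sketched with a one-level Haar transform and soft thresholding; the actual module is more elaborate, and the threshold value is an assumption:

```python
import torch
import torch.nn.functional as F

def haar_highfreq(x: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """Sketch: one-level Haar decomposition, then soft-threshold the
    high-frequency sub-bands to suppress speckle before using them as
    guidance. x: (B, 1, H, W) with even H, W."""
    a = x[:, :, 0::2, 0::2]; b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]; d = x[:, :, 1::2, 1::2]
    lh = (a - b + c - d) / 2  # detail sub-band
    hl = (a + b - c - d) / 2  # detail sub-band
    hh = (a - b - c + d) / 2  # diagonal detail
    denoise = lambda t: torch.sign(t) * F.relu(t.abs() - thresh)  # soft threshold
    return torch.cat([denoise(lh), denoise(hl), denoise(hh)], dim=1)

hf = haar_highfreq(torch.randn(1, 1, 64, 64))
print(hf.shape)  # (1, 3, 32, 32): noise-suppressed edge cues for low-level features
```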
[95] Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge
Zehui Lin, Luyi Han, Xin Wang, Ying Zhou, Yanming Zhang, Tianyu Zhang, Lingyun Bao, Shandong Wu, Dong Xu, Tao Tan, the UUSIC25 Challenge Consortium
Main category: cs.CV
TL;DR: General-purpose ultrasound AI models show high multi-task performance but struggle with domain generalization to unseen data.
Details
Motivation: Current ultrasound AI tools are fragmented into single-task applications, limiting clinical utility compared to versatile modern ultrasound systems that require multi-organ, multi-task capabilities.
Method: The Universal UltraSound Image Challenge 2025 (UUSIC25) evaluated general-purpose deep learning models on 11,644 training images. Models were tested on an independent, multi-center test set of 2,479 images, including data from a completely unseen center to assess generalization.
Result: Top model (SMART) achieved macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models excelled at segmentation (fetal head DSC: 0.942) but showed significant performance drops on unseen data, particularly for complex tasks like breast cancer molecular subtyping (AUC dropped from 0.571 to 0.508).
Conclusion: General-purpose AI models can achieve high accuracy and efficiency across multiple ultrasound tasks using single architectures, but domain generalization remains a critical challenge for clinical deployment, especially for complex diagnostic tasks.
Abstract: IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems. OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation. DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization. OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory). RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model’s performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges. CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
[96] Vision-Language Model Guided Image Restoration
Cuixin Yang, Rongkang Dong, Kin-Man Lam
Main category: cs.CV
TL;DR: VLMIR is a vision-language model guided image restoration framework that leverages CLIP’s vision-language priors through two-stage feature extraction and diffusion-based restoration to improve both visual fidelity and semantic coherence.
Details
Motivation: Previous image restoration approaches struggle to effectively leverage both visual and linguistic knowledge, and recent VLM-based methods fail to utilize linguistic priors for semantic coherence during restoration.
Method: Two-stage framework: 1) VLM-based feature extraction with visual/linguistic representation extraction, caption embedding alignment using cosine similarity loss with LoRA fine-tuning, and degradation predictor; 2) Diffusion-based restoration integrating complementary embeddings via cross-attention mechanisms.
Result: Superior performance across both universal and degradation-specific image restoration tasks, demonstrating the critical role of integrated visual and linguistic knowledge from VLMs.
Conclusion: VLMIR effectively leverages vision-language priors to enhance image restoration through improved visual perception and semantic understanding, advancing restoration capabilities.
Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
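The first-stage alignment objective is straightforward to sketch: a cosine-similarity loss pulling the caption embedding of a degraded image toward that of its clean counterpart. In the paper this trains LoRA adapters; only the loss term is shown here, and the embedding dimension is assumed:

```python
import torch
import torch.nn.functional as F

def caption_alignment_loss(emb_lq: torch.Tensor, emb_hq: torch.Tensor) -> torch.Tensor:
    """Pull low-quality caption embeddings toward high-quality ones;
    a sketch of the cosine term described in the abstract."""
    return (1.0 - F.cosine_similarity(emb_lq, emb_hq, dim=-1)).mean()

lq = torch.randn(4, 512, requires_grad=True)  # caption embeddings, degraded images
hq = torch.randn(4, 512)                      # caption embeddings, clean images
loss = caption_alignment_loss(lq, hq)
loss.backward()
print(float(loss))
```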
[97] Towards Pixel-Wise Anomaly Location for High-Resolution PCBA via Self-Supervised Image Reconstruction
Wuyi Liu, Le Jin, Junxian Yang, Yuanchao Yu, Zishuo Peng, Jinfeng Xu, Xianzhi Li, Jun Zhou
Main category: cs.CV
TL;DR: HiSIR-Net: A high-resolution self-supervised reconstruction framework for pixel-wise PCBA defect inspection using selective reconstruction gates and optimized patch selection to handle 4K-resolution images with minimal false positives.
Details
Motivation: PCBA defect inspection faces challenges due to insufficient labeled data, micro-defects (few pixels), visually-complex high-resolution images, and lack of high-resolution PCBA datasets.
Method: HiSIR-Net combines: 1) Selective Input-Reconstruction Gate (SIR-Gate) that decides where to trust reconstruction vs. original input to reduce artifacts and false alarms; 2) Region-level Optimized Patch Selection (ROPS) with positional cues for coherent overlapping patch reconstructions across arbitrary resolutions.
Result: Superior localization performance on SIPCBA-500 dataset (500 self-collected images) and public benchmarks, producing clean high-resolution anomaly maps with low false positive rate while running at practical speed.
Conclusion: HiSIR-Net effectively addresses PCBA defect inspection challenges through self-supervised reconstruction with selective gating and optimized patch selection, enabling practical 4K-resolution inspection with minimal false positives.
Abstract: Automated defect inspection of Printed Circuit Board Assemblies (PCBA) is challenging due to insufficient labeled data and micro-defects that occupy just a few pixels in visually complex, high-resolution images. To address these challenges, we present HiSIR-Net, a High-resolution Self-supervised Image Reconstruction framework for pixel-wise PCBA anomaly localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with a low false-positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset, SIPCBA-500, comprising 500 images. We conduct extensive experiments on SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.
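A minimal sketch of what a Selective Input-Reconstruction Gate could look like: a learned per-pixel gate blending reconstruction and input before scoring anomalies. The conv layout is an assumption; the abstract does not specify the module's architecture:

```python
import torch
import torch.nn as nn

class SIRGate(nn.Module):
    """Sketch: a per-pixel gate deciding where to trust the reconstruction
    versus the original input (layer sizes assumed)."""
    def __init__(self, ch=3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x, x_rec):
        g = self.gate(torch.cat([x, x_rec], dim=1))  # (B,1,H,W) in [0,1]
        blended = g * x_rec + (1 - g) * x            # trust recon where g is high
        return blended, g

gate = SIRGate()
x, x_rec = torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)
blended, g = gate(x, x_rec)
anomaly_map = (blended - x).abs().mean(1)  # error only where recon is trusted
print(anomaly_map.shape)
```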
[98] ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo, Cen Chen
Main category: cs.CV
TL;DR: ProCache is a training-free dynamic feature caching framework that accelerates Diffusion Transformers by using non-uniform caching intervals and selective computation to reduce error accumulation.
Details
Motivation: Diffusion Transformers (DiTs) have high computational costs that hinder real-time deployment. Existing feature caching methods use uniform intervals that don't match DiT's non-uniform temporal dynamics and cause error accumulation with large caching intervals.
Method: ProCache uses two components: 1) constraint-aware caching pattern search that generates non-uniform activation schedules via offline constrained sampling, and 2) selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation.
Result: ProCache achieves up to 1.96x and 2.90x acceleration on PixArt-alpha and DiT with negligible quality degradation, significantly outperforming prior caching-based methods.
Conclusion: ProCache effectively addresses the limitations of existing feature caching methods for DiTs by aligning caching patterns with the model’s temporal dynamics and mitigating error accumulation through selective computation.
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
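The caching mechanism can be sketched as a denoising loop driven by a non-uniform activation schedule; the toy interfaces below are hypothetical, and the schedule search and selective per-token computation of ProCache are not reproduced:

```python
import torch

class ToyDiT:
    """Hypothetical stand-in interfaces: `features` is the expensive pass,
    `head` a cheap update that can reuse cached features."""
    def features(self, x, t):
        return torch.tanh(x * (t + 1))
    def head(self, x, feats, t):
        return x - 0.1 * feats

def denoise_with_cache(model, x, schedule):
    cached = None
    for t, recompute in enumerate(schedule):
        if recompute or cached is None:
            cached = model.features(x, t)  # full transformer pass
        x = model.head(x, cached, t)       # reuse features on cached steps
    return x

# a non-uniform activation schedule: dense early/late, sparse in the middle
schedule = [True, True, False, True, False, False, True, False, False, True]
out = denoise_with_cache(ToyDiT(), torch.randn(1, 4, 8, 8), schedule)
print(out.shape, f"full passes: {sum(schedule)}/{len(schedule)}")
```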
[99] MatLat: Material Latent Space for PBR Texture Generation
Kyeongmin Yeo, Yunhong Min, Jaihoon Kim, Minhyuk Sung
Main category: cs.CV
TL;DR: A generative framework for high-quality PBR textures using fine-tuned VAE with material latent space and locality-preserving regularization.
Details
Motivation: Large-scale PBR texture datasets are scarce, and existing methods freeze embedding networks causing distribution shifts when encoding additional PBR channels, hindering diffusion training.
Method: Fine-tune pretrained VAE to incorporate new material channels with minimal latent distribution deviation, and introduce locality regularization that crops latent patches, decodes them, and aligns corresponding image regions to maintain spatial correspondence.
Result: The framework improves PBR texture fidelity, with ablation studies showing each component is critical for achieving state-of-the-art performance.
Conclusion: The proposed approach effectively leverages pretrained latent image generative models while learning a material latent space, addressing distribution shift issues and ensuring cross-view consistency through locality preservation.
Abstract: We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.
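The locality regularizer admits a compact sketch: crop a random latent patch, decode it, and penalize deviation from the spatially corresponding image crop. The downsample factor, crop size, loss form, and decoder interface are all assumptions:

```python
import torch
import torch.nn.functional as F

def locality_reg(vae, img, latent, f: int = 8, crop: int = 8):
    """Sketch of locality-preserving regularization: decoded latent patches
    must match the image regions they spatially correspond to."""
    B, _, h, w = latent.shape
    y = torch.randint(0, h - crop + 1, (1,)).item()
    x = torch.randint(0, w - crop + 1, (1,)).item()
    lat_patch = latent[:, :, y:y + crop, x:x + crop]
    dec_patch = vae.decode(lat_patch)  # hypothetical decoder call
    img_patch = img[:, :, y * f:(y + crop) * f, x * f:(x + crop) * f]
    return F.l1_loss(dec_patch, img_patch)

class ToyDecoder:
    def decode(self, z):  # stand-in for the fine-tuned VAE decoder
        return F.interpolate(z[:, :3], scale_factor=8, mode="bilinear",
                             align_corners=False)

latent = torch.randn(1, 4, 32, 32)
img = torch.randn(1, 3, 256, 256)
print(float(locality_reg(ToyDecoder(), img, latent)))
```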
[100] EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance
Ankit Yadav, Ta Duc Huy, Lingqiao Liu
Main category: cs.CV
TL;DR: EMAG is a training-free guidance method for diffusion transformers that modifies attention at inference using adaptive layer selection to create harder, semantically faithful negative samples, improving generation quality over CFG.
Details
Motivation: Existing guidance methods like CFG and recent strong/weak model approaches lack reliable control over negative sample granularity/difficulty and have fixed target-layer selection, limiting their ability to surface difficult failure modes for refinement.
Method: EMAG modifies attention at inference time in diffusion transformers using exponential moving average statistics with an adaptive layer-selection rule, producing harder but semantically faithful negative samples (fine-grained degradations).
Result: EMAG boosts quality and human preference score by +0.46 over CFG, surfaces difficult failure modes for refinement, and naturally composes with advanced guidance techniques like APG and CADS for further improvements.
Conclusion: EMAG provides a training-free, adaptive approach to guidance that creates more challenging negative samples while maintaining semantic faithfulness, leading to better generation quality and composability with other guidance methods.
Abstract: In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. Classifier-free guidance (CFG) is the de facto choice in modern systems and achieves this by contrasting conditional and unconditional samples. Recent work explores contrasting negative samples at inference using a weaker model, via strong/weak model pairs, attention-based masking, stochastic block dropping, or perturbations to the self-attention energy landscape. While these strategies refine the generation quality, they still lack reliable control over the granularity or difficulty of the negative samples, and target-layer selection is often fixed. We propose Exponential Moving Average Guidance (EMAG), a training-free mechanism that modifies attention at inference time in diffusion transformers, with a statistics-based, adaptive layer-selection rule. Unlike prior methods, EMAG produces harder, semantically faithful negatives (fine-grained degradations), surfacing difficult failure modes, enabling the denoiser to refine subtle artifacts, boosting the quality and human preference score (HPS) by +0.46 over CFG. We further demonstrate that EMAG naturally composes with advanced guidance techniques, such as APG and CADS, further improving HPS.
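A sketch of the EMA idea: track an exponential moving average of an attention output across denoising steps and treat it as the degraded negative branch in a standard guidance combination. The decay rate and where the EMA is applied are assumptions; the adaptive layer-selection rule is omitted:

```python
import torch

class EMAGAttention:
    """Sketch: an EMA of attention outputs across steps serves as a
    slightly stale, semantically faithful negative branch."""
    def __init__(self, beta: float = 0.9):
        self.beta, self.ema = beta, None

    def __call__(self, attn_out: torch.Tensor) -> torch.Tensor:
        if self.ema is None:
            self.ema = attn_out.detach().clone()
        else:
            self.ema = self.beta * self.ema + (1 - self.beta) * attn_out.detach()
        return self.ema

def guided(pos: torch.Tensor, neg: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    # standard guidance combination: extrapolate away from the negative branch
    return neg + scale * (pos - neg)

emag = EMAGAttention()
for step in range(3):
    pos = torch.randn(1, 16, 64)  # normal attention output at this step
    neg = emag(pos)               # EMA-degraded counterpart
    out = guided(pos, neg)
print(out.shape)
```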
[101] Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang
Main category: cs.CV
TL;DR: DRIM enables deep, reliable multi-turn reasoning in vision-language models by combining supervised fine-tuning with redundancy-penalized reinforcement learning to encourage self-reflection and correction.
Details
Motivation: Current VLMs with Chain-of-Thought reasoning struggle to reflect on and correct incorrect reasoning trajectories when thinking with images, limiting their reliability in complex visual tasks.
Method: Three-stage pipeline: 1) Construct high-difficulty verifiable visual QA pairs requiring multi-turn tool calls; 2) Cold-start SFT using tool trajectories; 3) RL with redundancy-penalized policy optimization that penalizes incorrect answers without sufficient multi-scale exploration.
Result: DRIM achieves superior performance on visual understanding benchmarks compared to existing models.
Conclusion: The proposed DRIM framework enables deep and reliable multi-turn reasoning in VLMs by encouraging self-reflective reasoning patterns through redundancy-penalized optimization.
Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
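The redundancy-penalized reward is easy to sketch in spirit: wrong answers reached without sufficient multi-scale exploration are penalized beyond the usual zero reward. All constants below are assumptions:

```python
def redundancy_penalized_reward(correct: bool, n_tool_scales: int,
                                min_scales: int = 2, penalty: float = 0.5):
    """Sketch of the reward shaping behind redundancy-penalized policy
    optimization: committing to a wrong answer without re-examining the
    image at multiple scales costs extra, encouraging self-reflection."""
    if correct:
        return 1.0
    if n_tool_scales < min_scales:
        return -penalty  # wrong AND under-explored
    return 0.0           # wrong, but explored in good faith

print(redundancy_penalized_reward(False, n_tool_scales=1))  # -0.5
```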
[102] CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao
Main category: cs.CV
TL;DR: CodeDance uses executable code as a general solver for visual reasoning, enabling flexible tool orchestration, intermediate computation, and visual artifact rendering for transparent reasoning.
Details
Motivation: Current open-source approaches rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex visual reasoning tasks.
Method: CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts. It introduces a reward for balanced and adaptive tool calls that balances exploration with efficiency and mitigates tool overuse.
Result: CodeDance consistently outperforms schema-driven and text-only baselines, surpasses advanced closed models like GPT-4o and larger open-source models. It demonstrates novel emergent behaviors: novel tool invocations, unseen compositions, and cross-task transfer without task-specific fine-tuning.
Conclusion: CodeDance provides a general and scalable mechanism for executable visual reasoning through code-based orchestration, enabling transparent, self-checkable reasoning with emergent capabilities beyond atomic supervision.
Abstract: Recent releases such as o3 highlight human-like “thinking with images” reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
[103] InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo
Main category: cs.CV
TL;DR: InsertAnywhere is a diffusion-based framework for realistic video object insertion that achieves geometrically consistent placement and appearance-faithful synthesis through 4D scene understanding and illumination-aware generation.
Details
Motivation: Existing diffusion-based video generation methods struggle with realistic video object insertion due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects, making realistic object placement challenging.Method: The framework uses a 4D-aware mask generation module for scene geometry reconstruction and temporal propagation, extends a diffusion-based video generation model for joint object and local variation synthesis, and trains on ROSE++ dataset with illumination-aware synthetic triplets.
Result: The method produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
Conclusion: InsertAnywhere successfully addresses key challenges in video object insertion by combining 4D scene understanding with illumination-aware synthesis, enabling realistic and consistent object placement in videos.
Abstract: Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
[104] Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model
SuBeen Lee, GilHan Park, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
Main category: cs.CV
TL;DR: ADK enhances Vision-Language Models by generating rich descriptive prompts offline using LLMs, providing compositional and instance-specific knowledge to improve few-shot adaptation without computational overhead.
Details
Motivation: VLMs struggle with distribution shifts in downstream tasks, and existing PEFT methods rely on fixed handcrafted prompts that are insufficient for understanding class semantics. Image-induced prompts help but introduce prohibitive computational overhead at inference.Method: ADK uses LLMs to generate descriptive prompts for each class offline. These pre-computed features are deployed as: 1) Compositional Knowledge (averaged representation for rich semantics), and 2) Instance-Specific Knowledge (lightweight attention mechanism dynamically selects relevant descriptions per image). It’s a parameter-free plug-and-play component.
Result: ADK consistently boosts performance of multiple PEFT baselines, setting new state-of-the-art across various scenarios without compromising efficiency.
Conclusion: ADK efficiently enriches text representations for VLMs, providing additional descriptive knowledge to facilitate category distinction across domains while maintaining computational efficiency.
Abstract: Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
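The Instance-Specific Knowledge branch is, in essence, scaled dot-product attention with no learnable parameters: the image embedding queries the class's pre-computed description embeddings. A minimal sketch, assuming unit-normalized CLIP-style features; the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def instance_specific_knowledge(image_feat: torch.Tensor,
                                desc_feats: torch.Tensor,
                                temperature: float = 0.01) -> torch.Tensor:
    """Non-parametric attention over pre-computed description features.

    image_feat: (D,) unit-normalized image embedding.
    desc_feats: (K, D) unit-normalized LLM-generated description
                embeddings for one class, computed offline.
    Returns a (D,) description-weighted class representation.
    """
    # Similarity of the image to each description acts as attention logits,
    # so the most relevant descriptions dominate the class representation.
    weights = F.softmax(image_feat @ desc_feats.T / temperature, dim=-1)  # (K,)
    return weights @ desc_feats  # (D,)
```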
[105] EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
Lu Wei, Yuta Nakashima, Noa Garcia
Main category: cs.CV
TL;DR: EMMA benchmark reveals concept erasure methods struggle with implicit prompts, visually similar concepts, and may amplify bias.
Details
Motivation: Text-to-image generation raises privacy, bias, and copyright concerns. Concept erasure techniques promise to remove unwanted concepts without full retraining, but current evaluations are limited and simplistic.Method: Introduce EMMA benchmark with 12 metrics across 5 key dimensions to test concept erasure boundaries. Evaluate robustness under challenging conditions including indirect descriptions, visually similar non-target concepts, and bias analysis.
Result: Existing methods struggle with implicit prompts (generating erased concepts when indirectly referenced) and visually similar non-target concepts. Some methods amplify gender and ethnicity bias compared to original models.
Conclusion: Current concept erasure techniques have significant limitations in real-world scenarios. More robust methods are needed that handle indirect references and visually similar concepts while avoiding bias amplification.
Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.
[106] Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen
Main category: cs.CV
TL;DR: Robust-R1 is a new framework that explicitly models visual degradations through structured reasoning chains to improve MLLM robustness, outperforming existing methods on real-world degradation benchmarks.
Details
Motivation: Current Multimodal Large Language Models struggle with real-world visual degradations, and existing robust MLLMs rely on implicit training that lacks interpretability and focuses only on visual encoder generalization.Method: Proposes Robust-R1 framework with three components: 1) supervised fine-tuning for degradation-aware reasoning foundations, 2) reward-driven alignment for accurate degradation parameter perception, and 3) dynamic reasoning depth scaling adapted to degradation intensity. Also introduces an 11K dataset with realistic degradations across four visual processing stages.
Result: Achieves state-of-the-art robustness, outperforming all general and robust baselines on the R-Bench benchmark, and maintains superior anti-degradation performance on MMMB, MMStar, and RealWorldQA under multi-intensity adversarial degradations.
Conclusion: Robust-R1 effectively addresses MLLM robustness limitations through explicit degradation modeling with structured reasoning chains, demonstrating significant improvements in handling real-world visual degradations.
Abstract: Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.
[107] Rotterdam artery-vein segmentation (RAV) dataset
Jose Vargas Quiros, Bart Liefers, Karin van Garderen, Jeroen Vermeulen, Eyened Reading Center, Caroline Klaver
Main category: cs.CV
TL;DR: A diverse, high-quality dataset of color fundus images with detailed artery-vein segmentation annotations for developing and evaluating ML algorithms in ophthalmology.
Details
Motivation: To provide a comprehensive dataset supporting ML algorithm development for retinal vascular analysis, addressing the need for diverse, high-quality annotations under real-world conditions.Method: Sampled CFIs from the longitudinal Rotterdam Study, annotated using a custom interface with separate artery/vein/unknown vessel layers, starting from initial vessel segmentation masks with connectivity verification tools.
Result: Dataset includes 1024x1024-pixel PNG images in three modalities: original RGB, contrast-enhanced versions, and RGB-encoded A/V masks, with wide quality variation including challenging samples containing valuable vascular information.
Conclusion: The dataset provides a rich, heterogeneous source of CFIs with high-quality segmentations for robust ML model benchmarking and training under real-world variability in image quality and acquisition settings.
Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
[108] DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training
Jiyun Kong, Jun-Hyuk Kim, Jong-Seok Lee
Main category: cs.CV
TL;DR: DESSERT is a diffusion-based framework for event-driven single-frame synthesis that uses residual training and a pre-trained Stable Diffusion model to generate sharper, more temporally consistent future frames from event camera data.
Details
Motivation: Traditional video frame prediction struggles with dynamic scenes due to lack of future frame information. Event cameras capture brightness changes with high temporal resolution but existing event-based methods using optical flow suffer from holes and blurring when pixel displacement is inaccurate.Method: Two-stage training: (1) Event-to-Residual Alignment Variational Autoencoder (ER-VAE) aligns event frames between anchor and target frames with corresponding residuals, (2) diffusion model denoises residual latent conditioned on event data. Uses Diverse-Length Temporal augmentation for robustness.
Result: Outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
Conclusion: DESSERT effectively addresses limitations of prior event-based video frame prediction methods by combining diffusion models with residual training, achieving superior frame synthesis quality and temporal consistency through event-driven conditioning.
Abstract: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
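The residual-training idea reduces frame synthesis to predicting the change between frames and adding it back onto the anchor. A schematic sketch under assumed interfaces (`vae` and `diffusion` are placeholders, not the paper's API):

```python
import torch

def synthesize_next_frame(anchor: torch.Tensor, events: torch.Tensor,
                          vae, diffusion) -> torch.Tensor:
    """Event-conditioned residual synthesis, schematically.

    Training target: residual = target_frame - anchor_frame, encoded by
    an ER-VAE-style encoder; the diffusion model denoises that residual
    latent conditioned on the event data.
    """
    # Sample a denoised residual latent conditioned on the event frame.
    residual_latent = diffusion.sample(condition=events)
    residual = vae.decode(residual_latent)
    # Adding the residual to the anchor preserves temporal consistency:
    # unchanged regions carry over exactly from the anchor frame.
    return anchor + residual
```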
[109] ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image
Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Yuanyuan Liu, Jingying Chen
Main category: cs.CV
TL;DR: ClothHMR: A novel 3D human mesh recovery method that handles diverse clothing by tailoring garments to fit body silhouettes and leveraging foundational human visual models for better generalization.
Details
Motivation: Current 3D human mesh recovery methods perform poorly with diverse clothing, especially loose garments, as they mainly focus on tight clothing. There's a need for methods that can accurately estimate body shapes and poses under various clothing conditions.Method: ClothHMR consists of two modules: (1) Clothing Tailoring (CT) that uses body semantic estimation and edge prediction to tailor clothing to fit body silhouettes, and (2) FHVM-based Mesh Recovering (MR) that optimizes 3D mesh parameters by aligning intermediate representations with those from foundational human visual models.
Result: ClothHMR significantly outperforms existing state-of-the-art methods across benchmark datasets and in-the-wild images, accurately recovering 3D meshes of humans in diverse clothing while precisely estimating body shapes and poses.
Conclusion: ClothHMR effectively addresses the challenge of 3D human mesh recovery under diverse clothing conditions and has practical applications in fashion and shopping, with a developed web application demonstrating real-world utility.
Abstract: With 3D data rapidly emerging as an important form of multimedia information, 3D human mesh recovery technology has also advanced accordingly. However, current methods mainly focus on handling humans wearing tight clothing and perform poorly when estimating body shapes and poses under diverse clothing, especially loose garments. To this end, we draw on two key insights: (1) tailoring clothing to fit the human body can mitigate the adverse impact of clothing on 3D human mesh recovery, and (2) utilizing human visual information from large foundational models can enhance the generalization ability of the estimation. Based on these insights, we propose ClothHMR to accurately recover 3D meshes of humans in diverse clothing. ClothHMR primarily consists of two modules: clothing tailoring (CT) and FHVM-based mesh recovering (MR). The CT module employs body semantic estimation and body edge prediction to tailor the clothing, ensuring it fits the body silhouette. The MR module optimizes the initial parameters of the 3D human mesh by continuously aligning the intermediate representations of the 3D mesh with those inferred from the foundational human visual model (FHVM). ClothHMR can accurately recover 3D meshes of humans wearing diverse clothing, precisely estimating their body shapes and poses. Experimental results demonstrate that ClothHMR significantly outperforms existing state-of-the-art methods across benchmark datasets and in-the-wild images. Additionally, a web application for online fashion and shopping powered by ClothHMR is developed, illustrating that ClothHMR can effectively serve real-world usage scenarios. The code and model for ClothHMR are available at: \url{https://github.com/starVisionTeam/ClothHMR}.
[110] Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak, Geert Litjens, Francesco Ciompi
Main category: cs.CV
TL;DR: The paper introduces Polysome for synthetic instruction generation, creates HISTAI-Instruct dataset with 1.1M instruction-response pairs from 24K slides, and trains ANTONI-α VLM that outperforms MedGemma on WSI-level VQA tasks.
Details
Motivation: Current vision-language models for pathology have limitations: they focus on small regions, provide only static slide-level outputs, rely on non-public data, and lack training data with detailed clinical reports, hindering transparent and generalizable VLMs.Method: 1) Developed Polysome, a standardized tool for synthetic instruction generation. 2) Applied Polysome to HISTAI dataset to create HISTAI-Instruct with 24,259 slides and 1.1M instruction-response pairs. 3) Trained ANTONI-α VLM using this dataset for visual-question answering tasks.
Result: ANTONI-α outperforms MedGemma on WSI-level VQA tasks including tissue identification, neoplasm detection, and differential diagnosis. Performance of multiple ANTONI-α versions trained with different data amounts was compared.
Conclusion: The work addresses key limitations in pathology VLMs by providing standardized synthetic instruction generation, creating a large public dataset, and demonstrating superior performance on clinical tasks. All methods, data, and code are publicly available for reproducibility.
Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
[111] A unified FLAIR hyperintensity segmentation model for various CNS tumor types and acquisition time points
Mathilde Gajda Faanes, David Bouget, Asgeir S. Jakola, Timothy R. Smith, Vasileios K. Kavouridis, Francesco Latini, Margret Jensdottir, Peter Milos, Henrietta Nittby Redebrandt, Rickard L. Sjöberg, Rupavathana Mahesparan, Lars Kjelsberg Pedersen, Ole Solheim, Ingerid Reinertsen
Main category: cs.CV
TL;DR: A unified Attention U-Net model for automatic FLAIR hyperintensity segmentation across various brain tumor types and time points, achieving good performance and integrated into open-source Raidionics software.
Details
Motivation: FLAIR MRI scans are crucial for brain tumor diagnosis and monitoring, but manual segmentation of hyperintensity volumes is time-consuming. Automatic segmentation would be clinically useful for assessing tumor volume and edema across different tumor types and time points.Method: Used ~5000 FLAIR images from various tumor types and acquisition time points across different centers to train a unified segmentation model using an Attention U-Net architecture. Compared performance against dataset-specific models and validated on different tumor types, time points, and against BraTS benchmark.
Result: The unified model achieved Dice scores: 88.65% for pre-op meningiomas, 80.08% for pre-op metastasis, 90.92% for pre-op and 84.60% for post-op gliomas from BraTS, and 84.47% for pre-op and 61.27% for post-op lower grade gliomas. Performed comparably to dataset-specific models while enabling generalization across tumor types and time points.
Conclusion: The unified Attention U-Net model provides effective FLAIR hyperintensity segmentation across diverse brain tumor types and acquisition time points, demonstrating clinical utility and generalization capability. Integration into Raidionics open-source software facilitates clinical deployment.
Abstract: T2-weighted fluid-attenuated inversion recovery (FLAIR) magnetic resonance imaging (MRI) scans are important for diagnosis, treatment planning and monitoring of brain tumors. Depending on the brain tumor type, the FLAIR hyperintensity volume is an important measure to assess the tumor volume or surrounding edema, and an automatic segmentation of this would be useful in the clinic. In this study, around 5000 FLAIR images of various tumor types and acquisition time points from different centers were used to train a unified FLAIR hyperintensity segmentation model using an Attention U-Net architecture. The performance was compared against dataset-specific models, and was validated on different tumor types, acquisition time points and against BraTS. The unified model achieved an average Dice score of 88.65% for pre-operative meningiomas, 80.08% for pre-operative metastasis, 90.92% for pre-operative and 84.60% for post-operative gliomas from BraTS, and 84.47% for pre-operative and 61.27% for post-operative lower grade gliomas. In addition, the results showed that the unified model achieved comparable segmentation performance to the dataset-specific models on their respective datasets, and enables generalization across tumor types and acquisition time points, which facilitates deployment in a clinical setting. The model is integrated into Raidionics, an open-source software for CNS tumor analysis.
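For reference, the Dice scores reported above measure overlap between a predicted mask P and ground truth G as 2|P∩G| / (|P| + |G|). A minimal NumPy version:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks, in [0, 1]."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return float(2.0 * intersection / (pred.sum() + gt.sum() + eps))
```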
[112] SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation
Shihang Li, Zhiqiang Gong, Minming Ye, Yue Gao, Wen Yao
Main category: cs.CV
TL;DR: SynergyWarpNet: A three-stage attention-guided cooperative warping framework for high-fidelity talking head synthesis that combines explicit warping with reference-augmented semantic correction.
Details
Motivation: Traditional explicit warping struggles with accurate motion transfer and missing regions, while attention-based methods have high complexity and weak geometric grounding. Need for better portrait animation for virtual avatars, telepresence, and digital content creation.Method: Three-stage progressive refinement: 1) Explicit warping module using 3D dense optical flow for coarse spatial alignment; 2) Reference-augmented correction module using cross-attention across 3D keypoints and texture features from multiple references; 3) Confidence-guided fusion module with learned confidence map for spatially-adaptive blending.
Result: Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance in talking head synthesis.
Conclusion: SynergyWarpNet effectively addresses limitations of both explicit and attention-based warping methods by combining geometric alignment with semantic correction, achieving high-fidelity portrait animation.
Abstract: Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially adaptive fusion, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
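The confidence-guided fusion stage amounts to a per-pixel convex blend of the warped and reference-corrected frames, weighted by the learned confidence map. A minimal sketch (tensor names are illustrative):

```python
import torch

def confidence_guided_fusion(warped: torch.Tensor,
                             corrected: torch.Tensor,
                             confidence: torch.Tensor) -> torch.Tensor:
    """Spatially adaptive blend of two candidate frames.

    warped:     (B, C, H, W) output of the explicit warping stage.
    corrected:  (B, C, H, W) reference-augmented semantic correction.
    confidence: (B, 1, H, W) learned map in [0, 1]; high values trust
                the geometrically aligned warp, low values fall back
                on the semantically completed regions.
    """
    return confidence * warped + (1.0 - confidence) * corrected
```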
[113] Multi-level distortion-aware deformable network for omnidirectional image super-resolution
Cuixin Yang, Rongkang Dong, Kin-Man Lam, Yuhang Zhang, Guoping Qiu
Main category: cs.CV
TL;DR: MDDN is a novel Multi-level Distortion-aware Deformable Network for OmniDirectional Image Super-Resolution that addresses latitude-dependent geometric distortion in ERP images through parallel deformable branches with expanded sampling ranges.
Details
Motivation: Existing ODISR methods have limited sampling ranges and feature extraction capabilities, which hinder their ability to capture distorted patterns over large areas in ERP images where distortion is minimal near the equator but severe toward the poles.Method: Proposes MDDN with three parallel branches: deformable attention mechanism (dilation=1) and two dilated deformable convolutions (dilation rates 2 and 3). Uses multi-level feature fusion module and low-rank decomposition strategy to reduce computational cost.
Result: Extensive experiments on publicly available datasets demonstrate that MDDN outperforms state-of-the-art methods, showing effectiveness and superiority in ODISR.
Conclusion: MDDN successfully addresses geometric distortion in ERP images by expanding sampling range and receptive field, generating dense features that effectively capture distorted patterns across wider areas.
Abstract: As augmented reality and virtual reality applications gain popularity, image processing for OmniDirectional Images (ODIs) has attracted increasing attention. OmniDirectional Image Super-Resolution (ODISR) is a promising technique for enhancing the visual quality of ODIs. Before performing super-resolution, ODIs are typically projected from a spherical surface onto a plane using EquiRectangular Projection (ERP). This projection introduces latitude-dependent geometric distortion in ERP images: distortion is minimal near the equator but becomes severe toward the poles, where image content is stretched across a wider area. However, existing ODISR methods have limited sampling ranges and feature extraction capabilities, which hinder their ability to capture distorted patterns over large areas. To address this issue, we propose a novel Multi-level Distortion-aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and receptive field. Specifically, the feature extractor in MDDN comprises three parallel branches: a deformable attention mechanism (serving as the dilation=1 path) and two dilated deformable convolutions with dilation rates of 2 and 3. This architecture expands the sampling range to include more distorted patterns across wider areas, generating dense and comprehensive features that effectively capture geometric distortions in ERP images. The representations extracted from these deformable feature extractors are adaptively fused in a multi-level feature fusion module. Furthermore, to reduce computational cost, a low-rank decomposition strategy is applied to dilated deformable convolutions. Extensive experiments on publicly available datasets demonstrate that MDDN outperforms state-of-the-art methods, underscoring its effectiveness and superiority in ODISR.
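The parallel-branch design can be sketched with off-the-shelf deformable convolutions. Note one simplification: the paper uses a deformable attention mechanism for the dilation-1 path, whereas this illustration uses a deformable convolution for all three branches.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DilatedDeformableBranch(nn.Module):
    """One branch of a multi-dilation deformable extractor.

    A plain conv predicts per-location sampling offsets; a deformable
    conv with the given dilation then samples an enlarged, content-adaptive
    neighborhood, which helps where ERP distortion stretches content
    across wide areas.
    """
    def __init__(self, channels: int, dilation: int, k: int = 3):
        super().__init__()
        pad = dilation * (k - 1) // 2  # keep spatial size unchanged
        self.offset = nn.Conv2d(channels, 2 * k * k, k,
                                padding=pad, dilation=dilation)
        self.deform = DeformConv2d(channels, channels, k,
                                   padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offset(x))

# Three parallel branches (dilations 1-3) whose outputs a fusion
# module would then combine, as the paper describes.
branches = nn.ModuleList(DilatedDeformableBranch(64, d) for d in (1, 2, 3))
```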
[114] MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration
Svetlana Krasnova, Emiliya Starikova, Ilia Naletov, Andrey Krylov, Dmitry Sorokin
Main category: cs.CV
TL;DR: MGRegBench is the first large-scale public benchmark dataset for mammogram registration with over 5,000 image pairs and manual annotations, enabling standardized evaluation and comparison of diverse registration methods.
Details
Motivation: Robust mammography registration is crucial for clinical applications like tracking disease progression, but progress has been limited by the absence of public datasets and standardized benchmarks, with existing studies using private data and inconsistent evaluation frameworks.Method: Created MGRegBench dataset with over 5,000 image pairs including 100 with manual anatomical landmarks and segmentation masks. Benchmarked diverse registration methods: classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a classic mammography-specific approach, and state-of-the-art deep learning method MammoRegNet.
Result: Established the first public dataset of this scale with manual landmarks and masks for mammography registration, conducted the first like-for-like comparison of diverse methods on this modality, and performed extensive analysis of deep learning-based registration.
Conclusion: MGRegBench provides a foundational resource for fair comparisons in mammography registration research, publicly releasing code and data to catalyze future research and establish standardized benchmarks in the field.
Abstract: Robust mammography registration is essential for clinical applications like tracking disease progression and monitoring longitudinal changes in breast tissue. However, progress has been limited by the absence of public datasets and standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a public benchmark dataset for mammogram registration. It comprises over 5,000 image pairs, with 100 containing manual anatomical landmarks and segmentation masks for rigorous evaluation. This makes MGRegBench one of the largest public 2D registration datasets with manual annotations. Using this resource, we benchmarked diverse registration methods including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a classic mammography-specific approach, and a recent state-of-the-art deep learning method MammoRegNet. The implementations were adapted to this modality from the authors’ implementations or re-implemented from scratch. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) the first like-for-like comparison of diverse methods on this modality; and (3) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair comparisons and catalyze future research. The source code and data are at https://github.com/KourtKardash/MGRegBench.
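The manual landmarks enable a target-registration-error style of evaluation: warp the moving image's landmarks with the estimated transform and measure how far they land from the fixed image's landmarks. A minimal sketch (the benchmark's exact metric definitions may differ):

```python
import numpy as np

def mean_landmark_error(moving_pts: np.ndarray, fixed_pts: np.ndarray,
                        transform) -> float:
    """Mean Euclidean distance between warped and reference landmarks.

    moving_pts, fixed_pts: (N, 2) corresponding landmark coordinates.
    transform: callable mapping (N, 2) moving coords into fixed space,
               i.e. the registration result under evaluation.
    """
    warped = transform(moving_pts)
    return float(np.linalg.norm(warped - fixed_pts, axis=1).mean())
```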
[115] Beyond Semantic Features: Pixel-level Mapping for Generalized AI-Generated Image Detection
Chenming Zhou, Jiaan Wang, Yu Li, Lei Li, Juan Cao, Sheng Tang
Main category: cs.CV
TL;DR: A pixel-level mapping pre-processing method that disrupts image pixel distributions to break semantic shortcuts, forcing AI-generated image detectors to focus on generalizable high-frequency artifacts instead of overfitting to source-specific patterns.
Details
Motivation: Current AI-generated image detectors fail to generalize to unseen generative models because they overfit to source-specific semantic cues rather than learning universal generative artifacts. This lack of cross-generator generalization is a critical limitation.Method: Introduces a simple pixel-level mapping pre-processing step that disrupts the pixel value distribution of images. This breaks the fragile, non-essential semantic patterns that detectors commonly exploit as shortcuts, forcing them to focus on fundamental high-frequency traces inherent to the image generation process.
Result: The approach significantly boosts cross-generator performance of state-of-the-art detectors on both GAN and diffusion-based generators. Extensive analysis verifies that disrupting semantic cues is key to achieving generalization across different generative models.
Conclusion: By breaking semantic shortcuts through pixel-level distribution disruption, detectors can learn more generalizable features focused on fundamental generative artifacts rather than source-specific patterns, enabling better performance on unseen generative models.
Abstract: The rapid evolution of generative technologies necessitates reliable methods for detecting AI-generated images. A critical limitation of current detectors is their failure to generalize to images from unseen generative models, as they often overfit to source-specific semantic cues rather than learning universal generative artifacts. To overcome this, we introduce a simple yet remarkably effective pixel-level mapping pre-processing step to disrupt the pixel value distribution of images and break the fragile, non-essential semantic patterns that detectors commonly exploit as shortcuts. This forces the detector to focus on more fundamental and generalizable high-frequency traces inherent to the image generation process. Through comprehensive experiments on GAN and diffusion-based generators, we show that our approach significantly boosts the cross-generator performance of state-of-the-art detectors. Extensive analysis further verifies our hypothesis that the disruption of semantic cues is the key to generalization.
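One plausible instantiation of such a pre-processing step is a random monotonic remapping of pixel values, which scrambles absolute color statistics while preserving the pixel ordering that carries edges and texture. The paper's exact mapping is not reproduced here; this is an illustrative variant.

```python
import numpy as np

def random_pixel_mapping(img: np.ndarray, rng=None) -> np.ndarray:
    """Apply a random monotonic lookup table to a uint8 image.

    Cumulative sums of random positive increments yield a monotonically
    increasing LUT, so pixel ordering (and thus edges and texture)
    survives while the absolute value distribution, a common semantic
    shortcut, is disrupted.
    """
    rng = np.random.default_rng() if rng is None else rng
    increments = rng.random(256) + 1e-3          # strictly positive steps
    lut = np.cumsum(increments)
    lut = (255 * (lut - lut.min()) / (lut.max() - lut.min())).astype(np.uint8)
    return lut[img]  # fancy-index every pixel through the LUT
```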
[116] Towards Deeper Emotional Reflection: Crafting Affective Image Filters with Generative Priors
Peixuan Zhang, Shuchen Weng, Jiajun Tang, Si Li, Boxin Shi
Main category: cs.CV
TL;DR: Proposes Affective Image Filter (AIF) task to transform text-expressed emotions into visually-concrete images, with two models (AIF-B and AIF-D) that outperform state-of-the-art methods in emotional fidelity and content consistency.
Details
Motivation: Social media users express emotions through text with images, but there's a need to better reflect abstract emotions from text into concrete visual representations to create more emotionally compelling content.Method: Introduces AIF dataset and task formulation. Presents AIF-B as initial multi-modal transformer approach, then AIF-D as extension leveraging pre-trained large-scale diffusion models for deeper emotional reflection.
Result: Quantitative/qualitative experiments show AIF models achieve superior content consistency and emotional fidelity compared to SOTA methods. User studies confirm AIF models are significantly more effective at evoking specific emotions.
Conclusion: AIF models demonstrate strong potential for creating emotionally compelling visual content from text, with comprehensive discussion of their value and future applications in emotional expression through social media.
Abstract: Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.
[117] Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation
Alexandre Personnic, Mihai Bâce
Main category: cs.CV
TL;DR: ST-Gaze: A spatio-temporal network combining CNN with attention modules for video-based gaze estimation, achieving SOTA on EVE dataset by modeling both intra-frame spatial context and inter-frame temporal dynamics.
Details
Motivation: Video-based gaze estimation needs to capture both spatial features within frames and temporal dynamics across frames. Current methods are limited by how they handle spatial-temporal relationships, often losing important spatial context through premature pooling.Method: ST-Gaze combines CNN backbone with channel attention and self-attention modules to fuse eye and face features. Fused features are treated as spatial sequences to capture intra-frame context, then propagated through time to model inter-frame dynamics using spatio-temporal recurrence.
Result: Achieves state-of-the-art performance on EVE dataset both with and without person-specific adaptation. Ablation study shows preserving intra-frame spatial context with spatio-temporal recurrence is superior to premature spatial pooling.
Conclusion: ST-Gaze demonstrates that modeling both intra-frame spatial context and inter-frame temporal dynamics is crucial for robust video-based gaze estimation using commonly available cameras, paving the way for more effective gaze tracking systems.
Abstract: Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited not only by the feature representations within a frame but also by those across frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.
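The ordering the ablation points to, building intra-frame spatial context before any pooling and only then recursing over time, can be sketched as follows. Dimensions and module choices are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalHead(nn.Module):
    """Intra-frame spatial context first, inter-frame recurrence second."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.gaze = nn.Linear(dim, 2)  # e.g. pitch/yaw angles

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, D) fused eye+face features per frame.
        B, T, H, W, D = feats.shape
        tokens = feats.reshape(B * T, H * W, D)
        # Self-attention over the spatial sequence builds intra-frame context.
        ctx, _ = self.spatial_attn(tokens, tokens, tokens)
        # Pool only *after* spatial context is built, then recurse over time,
        # avoiding the premature pooling the ablation warns against.
        frame_repr = ctx.mean(dim=1).reshape(B, T, D)
        out, _ = self.temporal(frame_repr)
        return self.gaze(out)  # (B, T, 2) gaze per frame
```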
[118] Beyond Occlusion: In Search for Near Real-Time Explainability of CNN-Based Prostate Cancer Classification
Martin Krebs, Jan Obdržálek, Vít Musil, Tomáš Brázdil
Main category: cs.CV
TL;DR: Researchers found a faster alternative to occlusion-based explanation methods for prostate cancer detection AI, reducing explanation time by 10x without quality loss.
Details
Motivation: Occlusion-based explanation methods for deep neural networks in cancer diagnosis are computationally slow, hindering model development and clinical adoption. Pathologists need explainable AI outputs, but current methods delay iteration and interaction.Method: Developed a framework for comparing explanation methods using identified criteria and metrics, then evaluated alternative methods to replace occlusion in a prostate cancer detection system.
Result: Found an alternative explanation method that reduces computation time by at least 10x while maintaining output quality, enabling faster model development and debugging.
Conclusion: The approach successfully identifies faster explanation methods for clinical AI systems, facilitating adoption of AI-assisted prostate cancer detection and can be applied to other medical applications.
Abstract: Deep neural networks are starting to show their worth in critical applications such as assisted cancer diagnosis. However, for their outputs to get accepted in practice, the results they provide should be explainable in a way easily understood by pathologists. A well-known and widely used explanation technique is occlusion, which, however, can take a long time to compute, thus slowing the development and interaction with pathologists. In this work, we set out to find a faster replacement for occlusion in a successful system for detecting prostate cancer. Since there is no established framework for comparing the performance of various explanation methods, we first identified suitable comparison criteria and selected corresponding metrics. Based on the results, we were able to choose a different explanation method, which cut the previously required explanation time by at least a factor of 10, without any negative impact on the quality of outputs. This speedup enables rapid iteration in model development and debugging and brings us closer to adopting AI-assisted prostate cancer detection in clinical settings. We propose that our approach to finding the replacement for occlusion can be used to evaluate candidate methods in other related applications.
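For context, the occlusion baseline being replaced slides a masking patch over the input and records the drop in the model's class score, which is exactly why it is slow: one forward pass per patch position. A minimal sketch, with `model` standing in for any callable that returns class scores:

```python
import numpy as np

def occlusion_map(model, image: np.ndarray, target_class: int,
                  patch: int = 32, stride: int = 16) -> np.ndarray:
    """Occlusion sensitivity: score drop when each region is masked.

    One model evaluation per patch position is what makes this method
    expensive, and what a faster explanation technique avoids.
    """
    H, W = image.shape[:2]
    base = model(image)[target_class]
    heat = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0  # mask this region
            heat[i, j] = base - model(occluded)[target_class]
    return heat
```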
[119] An Empirical Study of Sampling Hyperparameters in Diffusion-Based Super-Resolution
Yudhistira Arief Wibowo
Main category: cs.CV
TL;DR: Empirical study finds conditioning step size (2.0-3.0 range) matters more than diffusion step count for diffusion-based super-resolution performance.
Details
Motivation: Diffusion models show promise for inverse problems like super-resolution, but existing conditioning methods (DPS, MCG) introduce hyperparameters requiring careful tuning. Need to understand which factors most affect performance.Method: Conducted empirical ablation study on FFHQ super-resolution to identify dominant factors affecting performance when applying conditioning to pretrained diffusion models. Compared conditioning step size versus diffusion step count.
Result: Conditioning step size has significantly greater impact than diffusion step count. Step sizes in range [2.0, 3.0] yield best overall performance in experiments.
Conclusion: For diffusion-based super-resolution, careful tuning of conditioning step size (optimal range 2.0-3.0) is more critical than diffusion step count for achieving best reconstruction quality.
Abstract: Diffusion models have shown strong potential for solving inverse problems such as single-image super-resolution, where a high-resolution image is recovered from a low-resolution observation using a pretrained unconditional prior. Conditioning methods, including Diffusion Posterior Sampling (DPS) and Manifold Constrained Gradient (MCG), can substantially improve reconstruction quality, but they introduce additional hyperparameters that require careful tuning. In this work, we conduct an empirical ablation study on FFHQ super-resolution to identify the dominant factors affecting performance when applying conditioning to pretrained diffusion models, and show that the conditioning step size has a significantly greater impact than the diffusion step count, with step sizes in the range of [2.0, 3.0] yielding the best overall performance in our experiments.
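For readers unfamiliar with DPS, the conditioning step size studied here is the scale ζ on a measurement-consistency gradient added to each unconditional reverse-diffusion step, roughly:

```latex
% DPS-style guided reverse step (schematic):
% x'_{t-1} is the unconditional DDPM update, \hat{x}_0(x_t) the
% Tweedie estimate of the clean image, A the degradation operator
% (downsampling for SR), y the observation, and \zeta the
% conditioning step size ablated in this paper.
x_{t-1} = x'_{t-1} - \zeta \,\nabla_{x_t}\,
          \bigl\lVert y - A\bigl(\hat{x}_0(x_t)\bigr) \bigr\rVert_2^2
```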
[120] AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments
Georgios Simantiris, Konstantinos Bacharidis, Apostolos Papanikolaou, Petros Giannakakis, Costas Panagiotakis
Main category: cs.CV
TL;DR: AIFloodSense is a comprehensive global aerial imagery dataset for flood detection with 470 high-resolution images from 230 flood events across 64 countries, supporting classification, segmentation, and VQA tasks.
Details
Motivation: Current flood detection datasets are scarce, geographically limited, and lack annotation detail, hindering development of robust computer vision methods for disaster response and risk assessment.Method: Created AIFloodSense dataset with 470 high-resolution aerial images from 230 distinct flood events across 64 countries (2022-2024), annotated for three tasks: image classification (environment type, camera angle, continent), semantic segmentation (flood, sky, buildings), and visual question answering.
Result: Established baseline benchmarks using state-of-the-art architectures, demonstrating the dataset’s complexity and value for advancing domain-generalized AI tools for climate resilience.
Conclusion: AIFloodSense bridges the gap in flood detection datasets by providing comprehensive, globally diverse, and temporally relevant data to support robust AI development for disaster response and climate resilience.
Abstract: Accurate flood detection from visual data is a critical step toward improving disaster response and risk assessment, yet datasets for flood segmentation remain scarce due to the challenges of collecting and annotating large-scale imagery. Existing resources are often limited in geographic scope and annotation detail, hindering the development of robust, generalized computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive, publicly available aerial imagery dataset comprising 470 high-resolution images from 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures global diversity and temporal relevance (2022-2024), supporting three complementary tasks: (i) Image Classification with novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation providing precise pixel-level masks for flood, sky, and buildings; and (iii) Visual Question Answering (VQA) to enable natural language reasoning for disaster assessment. We establish baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset’s complexity and its value in advancing domain-generalized AI tools for climate resilience.
[121] Xiaomi MiMo-VL-Miloco Technical Report
Jiaze Li, Jingyang Chen, Yuxun Qu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu
Main category: cs.CV
TL;DR: Open-source home-centric vision-language model MiMo-VL-Miloco-7B achieves strong performance on smart-home understanding and general multimodal reasoning through specialized training.
Details
Motivation: To develop a specialized vision-language model for smart-home environments that balances home-scenario understanding with general multimodal reasoning capabilities.Method: Two-stage training pipeline: supervised fine-tuning + reinforcement learning (Group Relative Policy Optimization) with chain-of-thought supervision and token-budget-aware reasoning, leveraging multi-domain data efficiently.
Result: Achieves leading F1 scores on gesture recognition and home-scenario understanding, with consistent gains on video benchmarks (Video-MME, Video-MMMU, Charades-STA) and language benchmarks (MMMU-Pro, MMLU-Pro), outperforming both closed-source and open-source baselines.
Conclusion: Targeted home-scenario training enhances activity/gesture understanding and improves text-only reasoning with minimal trade-offs on document tasks; model weights and evaluation toolkit are publicly available for smart-home research and deployment.
Abstract: We open-source \textbf{MiMo-VL-Miloco-7B} and its quantized variant \textbf{MiMo-VL-Miloco-7B-GGUF}, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at \href{https://github.com/XiaoMi/xiaomi-mimo-vl-miloco}{https://github.com/XiaoMi/xiaomi-mimo-vl-miloco} to support research and deployment in real-world smart-home applications.
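Group Relative Policy Optimization, used in the RL stage, replaces a learned value baseline with group statistics: sample several responses per prompt and normalize each reward within the group. A minimal sketch of the advantage computation (the full GRPO loss with clipping and KL regularization is omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    rewards: (G,) scalar rewards for G sampled responses to the same
    prompt. Normalizing within the group removes the need for a
    separate value network.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```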
[122] LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
Yun He, Francesco Pittaluga, Ziyu Jiang, Matthias Zwicker, Manmohan Chandraker, Zaid Tasneem
Main category: cs.CV
TL;DR: LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to create diverse traffic scenarios using 3D scene decomposition and agentic pipeline coordination.
Details
Motivation: The paper aims to enable natural language control for editing real-world driving videos to synthesize diverse traffic scenarios, addressing the need for fine-grained editing and realism in driving video manipulation.Method: The framework uses explicit 3D scene decomposition to represent driving videos as scene graphs with static background and dynamic objects. It employs an agentic pipeline where an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents: Object Grounding Agent (text-to-object correspondence), Behavior Editing Agent (generates multi-object trajectories), and Behavior Reviewer Agent (iterative refinement). Edited scenes are rendered and refined using video diffusion tools.
Result: LangDriveCTRL achieves nearly 2× higher instruction alignment than previous state-of-the-art methods, with superior structural preservation, photorealism, and traffic realism. It supports both object node editing (removal, insertion, replacement) and multi-object behavior editing from single natural-language instructions.
Conclusion: LangDriveCTRL presents an effective natural-language-controllable framework for driving video editing that combines 3D scene understanding with agentic coordination to achieve high-quality, instruction-aligned traffic scenario synthesis.
Abstract: LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It leverages explicit 3D scene decomposition to represent driving videos as a scene graph, containing static background and dynamic objects. To enable fine-grained editing and realism, it incorporates an agentic pipeline in which an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents and tools. Specifically, an Object Grounding Agent establishes correspondence between free-form text descriptions and target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and then refined using a video diffusion tool to address artifacts introduced by object insertion and significant view changes. LangDriveCTRL supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction. Quantitatively, it achieves nearly $2\times$ higher instruction alignment than the previous SoTA, with superior structural preservation, photorealism, and traffic realism. Project page is available at: https://yunhe24.github.io/langdrivectrl/.
[123] MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation
Jon Muhovič, Janez Perš
Main category: cs.CV
TL;DR: MULTIAQUA is a novel multimodal maritime dataset with synchronized RGB, thermal, IR, and LIDAR data to improve scene interpretation for unmanned surface vehicles in poor visibility conditions.
Details
Motivation: Unmanned surface vehicles face challenging visual conditions where color cameras alone are insufficient, especially in poor weather and lighting conditions that require multimodal sensor fusion.
Method: Created MULTIAQUA dataset with synchronized, calibrated, and annotated multimodal sensor data; developed training approaches for robust multimodal methods that can be trained only on daytime images.
Result: The dataset enables evaluation of multimodal methods on difficult nighttime test sets; training approaches allow networks to retain reliable performance even in near-complete darkness using only daytime training data.
Conclusion: MULTIAQUA dataset and proposed training approaches significantly simplify data acquisition and annotation while enabling robust multimodal scene interpretation for maritime applications in challenging visibility conditions.
Abstract: Unmanned surface vehicles can encounter a number of varied visual circumstances during operation, some of which can be very difficult to interpret. While most cases can be solved only using color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present a novel multimodal maritime dataset MULTIAQUA (Multimodal Aquatic Dataset). Our dataset contains synchronized, calibrated and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, LIDAR, etc. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide a high quality of scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that enable multimodal methods to be trained in a more robust way, thus enabling them to retain reliable performance even in near-complete darkness. Our approach allows for training a robust deep neural network only using daytime images, thus significantly simplifying data acquisition, annotation, and the training process.
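The abstract does not detail the robust-training recipe. One standard way to keep a fusion network trained only on daytime data reliable in darkness is random modality dropout: occasionally suppress the RGB input so the network learns to lean on thermal and LIDAR features. A hedged sketch of that general idea, not necessarily the authors' exact strategy:

```python
# Illustrative modality dropout: with probability p_drop per sample, zero
# the RGB input during training so the fusion network cannot over-rely on
# color imagery that will degrade at night.
import torch

def drop_rgb(rgb: torch.Tensor, p_drop: float = 0.3) -> torch.Tensor:
    """rgb: (B, 3, H, W) batch of color images."""
    keep = (torch.rand(rgb.shape[0], 1, 1, 1, device=rgb.device) > p_drop).float()
    return rgb * keep

batch = torch.rand(4, 3, 224, 224)
dropped = drop_rgb(batch)  # some samples in the batch are now all-zero RGB
```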
[124] Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image
Simon Giebenhain, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Zhe Chen, Matthias Nießner
Main category: cs.CV
TL;DR: Pix2NPHM: A vision transformer network that directly regresses Neural Parametric Head Model parameters from single images, achieving high-fidelity 3D face reconstruction at interactive speeds.
Details
Motivation: Neural Parametric Head Models (NPHMs) offer better geometric detail than traditional 3DMMs, but fitting them to visual inputs is challenging due to their expressive latent space. There's a need for efficient, high-quality single-image face reconstruction.
Method: Uses a vision transformer (ViT) network to directly regress NPHM parameters from single images. Employs domain-specific ViTs pretrained on geometric prediction tasks. Trained on mixed 3D data (100K+ NPHM registrations for direct SDF supervision) and 2D video datasets (using normal estimates as pseudo ground truth). Includes optional inference-time optimization against estimated surface normals and canonical point maps.
Result: Achieves more recognizable facial geometry and accurate expressions than existing approaches. Enables 3D reconstructions at interactive frame rates. Allows further geometric fidelity improvement through inference-time optimization. Achieves unprecedented face reconstruction quality that scales to in-the-wild data.
Conclusion: Pix2NPHM successfully addresses the challenge of fitting expressive NPHMs to visual inputs, enabling high-fidelity single-image face reconstruction with broad generalization and interactive performance.
Abstract: Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3D morphable models (3DMMs) to facilitate high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters, given a single image as input. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and accurate facial expressions. For broad generalization, we exploit domain-specific ViTs as backbones, which are pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including a total of over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground truth geometry. Pix2NPHM not only allows for 3D reconstructions at interactive frame rates; geometric fidelity can also be improved by a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that can run at scale on in-the-wild data.
[125] 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
Tobias Sautter, Jan-Niklas Dihlmann, Hendrik P. A. Lensch
Main category: cs.CV
TL;DR: 3D-RE-GEN is a compositional framework that reconstructs single images into textured 3D mesh scenes suitable for artists’ workflows in visual effects and game development, addressing limitations of current methods through integrated detection, reconstruction, and placement models.
Details
Motivation: Current 3D scene generation methods produce visually appealing results but lack artist-friendly modifiable textured mesh scenes needed for visual effects and game development workflows. Existing textured mesh reconstruction methods suffer from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds.
Method: A compositional framework integrating state-of-the-art models for asset detection, reconstruction, and placement. Uses image editing with generative models for occluded object inference, scene-level reasoning for consistent lighting/geometry, and a novel 4-DoF differentiable optimization to align objects with estimated ground planes. Generates comprehensive backgrounds that spatially constrain objects during optimization.
Result: Achieves state-of-the-art performance in single image 3D scene reconstruction. Produces coherent, modifiable scenes through compositional generation guided by precise camera recovery and spatial optimization. Addresses artists’ requirements by generating physically realistic layouts suitable for visual effects and game development.
Conclusion: 3D-RE-GEN successfully bridges the gap between automated 3D scene generation and artist workflows by producing modifiable textured mesh scenes with correct object decomposition, accurate spatial relationships, and comprehensive backgrounds through a novel compositional approach.
Abstract: Recent advances in 3D scene generation produce visually appealing output, but current representations hinder artists’ workflows that require modifiable 3D textured mesh scenes for visual effects and game development. Despite significant advances, current textured mesh scene reconstruction methods are far from artist-ready, suffering from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds. We present 3D-RE-GEN, a compositional framework that reconstructs a single image into textured 3D objects and a background. We show that combining state-of-the-art models from specific domains achieves state-of-the-art scene reconstruction performance, addressing artists’ requirements. Our reconstruction pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains. Obtaining occluded objects is treated as an image editing task with generative models to infer and reconstruct with scene-level reasoning under consistent lighting and geometry. Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization and provides a foundation for realistic lighting and simulation tasks in visual effects and games. To obtain physically realistic layouts, we employ a novel 4-DoF differentiable optimization that aligns reconstructed objects with the estimated ground plane. 3D-RE-GEN achieves state-of-the-art performance in single-image 3D scene reconstruction, producing coherent, modifiable scenes through compositional generation guided by precise camera recovery and spatial optimization.
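To make the 4-DoF ground-plane alignment concrete, here is a hedged sketch of a differentiable point-to-plane optimization; the parameterization (3-DoF translation plus yaw) and the contact/penetration losses are assumptions, not the paper's implementation:

```python
# Differentiable alignment of object vertices to an estimated ground plane
# {p : n.p + d = 0}: the lowest vertex should touch the plane (contact)
# and no vertex may sink below it (penetration).
import torch

def align_to_ground(verts, n, d, steps=300, lr=1e-2):
    """verts: (N, 3) object vertices; n: (3,) plane normal; d: plane offset."""
    n = n / n.norm()
    t = torch.zeros(3, requires_grad=True)      # translation
    yaw = torch.zeros(1, requires_grad=True)    # rotation about z
    opt = torch.optim.Adam([t, yaw], lr=lr)
    for _ in range(steps):
        c, s = torch.cos(yaw), torch.sin(yaw)
        zero, one = torch.zeros_like(c), torch.ones_like(c)
        R = torch.stack([torch.cat([c, -s, zero]),
                         torch.cat([s, c, zero]),
                         torch.cat([zero, zero, one])])
        p = verts @ R.T + t
        h = p @ n + d                            # signed distance to the plane
        loss = h.min() ** 2 + 10.0 * torch.relu(-h).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return t.detach(), yaw.detach()

t, yaw = align_to_ground(torch.rand(100, 3), n=torch.tensor([0.05, 0.0, 1.0]), d=0.0)
```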
[126] TwinSegNet: A Digital Twin-Enabled Federated Learning Framework for Brain Tumor Analysis
Almustapha A. Wakili, Adamu Hussaini, Abubakar A. Musa, Woosub Jung, Wei Yu
Main category: cs.CV
TL;DR: TwinSegNet is a privacy-preserving federated learning framework that combines hybrid ViT-UNet models with personalized digital twins for accurate brain tumor segmentation across multiple institutions without sharing sensitive medical data.
Details
Motivation: Current deep learning methods for brain tumor segmentation require centralized data collection, which raises privacy concerns and limits generalization across diverse institutions due to data confidentiality requirements and non-IID distributions.
Method: Proposes TwinSegNet, a federated learning framework integrating hybrid ViT-UNet models (convolutional encoders with Vision Transformer bottlenecks) with personalized digital twins. Each institution fine-tunes the global model on private data to create institution-specific digital twins while maintaining data privacy.
Result: Evaluated on nine heterogeneous MRI datasets (BraTS 2019-2021 and custom collections), achieving high Dice scores (up to 0.90) and sensitivity/specificity exceeding 90%. Demonstrates robustness across non-IID client distributions while preserving privacy and matching the performance of centralized models like TumorVisNet.
Conclusion: TwinSegNet enables scalable, personalized brain tumor segmentation for multi-institutional clinical settings while adhering to strict data confidentiality requirements, offering an effective privacy-preserving alternative to centralized approaches without sacrificing performance.
Abstract: Brain tumor segmentation is critical in diagnosis and treatment planning for the disease. Yet, current deep learning methods rely on centralized data collection, which raises privacy concerns and limits generalization across diverse institutions. In this paper, we propose TwinSegNet, a privacy-preserving federated learning framework that integrates a hybrid ViT-UNet model with personalized digital twins for accurate and real-time brain tumor segmentation. Our architecture combines convolutional encoders with Vision Transformer bottlenecks to capture local and global context. Each institution fine-tunes the global model on private data to form its digital twin. Evaluated on nine heterogeneous MRI datasets, including BraTS 2019-2021 and custom tumor collections, TwinSegNet achieves high Dice scores (up to 0.90) and sensitivity/specificity exceeding 90%, demonstrating robustness across non-independent and identically distributed (non-IID) client distributions. Comparative results against centralized models such as TumorVisNet highlight TwinSegNet’s effectiveness in preserving privacy without sacrificing performance. Our approach enables scalable, personalized segmentation for multi-institutional clinical settings while adhering to strict data confidentiality requirements.
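The federated pattern described above (a shared global model, then institution-specific fine-tuned copies) can be sketched in a few lines. This is a generic FedAvg-style illustration under stated assumptions (size-weighted averaging, cross-entropy loss, float parameters), not TwinSegNet's exact algorithm:

```python
# FedAvg aggregation followed by local fine-tuning into a "digital twin".
import copy
import torch
import torch.nn.functional as F

def fedavg(global_model, client_states, client_sizes):
    """Size-weighted average of client state_dicts (assumes float tensors)."""
    total = sum(client_sizes)
    avg = copy.deepcopy(client_states[0])
    for k in avg:
        avg[k] = sum(s[k] * (n / total) for s, n in zip(client_states, client_sizes))
    global_model.load_state_dict(avg)
    return global_model

def make_digital_twin(global_model, local_loader, epochs=1, lr=1e-4):
    """An institution fine-tunes a private copy of the global model locally."""
    twin = copy.deepcopy(global_model)
    opt = torch.optim.Adam(twin.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in local_loader:           # private data never leaves the site
            loss = F.cross_entropy(twin(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return twin
```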
[127] Animate Any Character in Any World
Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu
Main category: cs.CV
TL;DR: AniX is a world model that combines static 3D scene generation with controllable character animation, allowing users to direct characters through natural language to perform diverse actions in 3DGS scenes while generating coherent videos.
Details
Motivation: Current world models have limitations: static world generation lacks active agents, while controllable-entity models only allow limited single-entity actions. There's a need for models that combine realistic static environments with open-ended character control.
Method: AniX takes a 3DGS scene and character as input, uses natural language to direct character actions, and formulates it as conditional autoregressive video generation. Built on a pre-trained video generator with enhanced training strategy to improve motion dynamics while maintaining generalization across actions and characters.
Result: The model synthesizes temporally coherent video clips preserving visual fidelity with provided scenes and characters. Evaluation covers visual quality, character consistency, action controllability, and long-horizon coherence across diverse behaviors from basic locomotion to object-centric interactions.
Conclusion: AniX successfully bridges the gap between static world generation and controllable-entity models, enabling user-specified characters to perform open-ended actions in realistic 3D environments through natural language control.
Abstract: Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object-centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
[128] LumiCtrl : Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models
Muhammad Atif Butt, Kai Wang, Javier Vazquez-Corral, Joost Van De Weijer
Main category: cs.CV
TL;DR: LumiCtrl is a method for precise illuminant control in text-to-image generation that learns illuminant prompts from single object images using physics-based augmentation, edge-guided prompt disentanglement, and masked reconstruction loss.
Details
Motivation: Current T2I models lack precise control over scene illuminants, which is crucial for content designers to manipulate mood, atmosphere, and visual aesthetics in generated images.
Method: Three components: 1) physics-based illuminant augmentation along Planckian locus for fine-tuning variants, 2) edge-guided prompt disentanglement using frozen ControlNet to focus on illumination rather than structure, 3) masked reconstruction loss focusing on foreground object while allowing background contextual adaptation.
Result: Significantly better illuminant fidelity, aesthetic quality, and scene coherence compared to existing personalization baselines. Human preference study confirms strong user preference for LumiCtrl outputs.
Conclusion: LumiCtrl enables precise illuminant control in T2I generation through a novel personalization approach that combines physics-based augmentation, structural disentanglement, and contextual adaptation, addressing a key limitation in current creative image generation systems.
Abstract: Current text-to-image (T2I) models have demonstrated remarkable progress in creative image generation, yet they still lack precise control over scene illuminants, which is a crucial factor for content designers aiming to manipulate the mood, atmosphere, and visual aesthetics of generated images. In this paper, we present an illuminant personalization method named LumiCtrl that learns an illuminant prompt given a single image of an object. LumiCtrl consists of three basic components: given an image of the object, our method applies (a) physics-based illuminant augmentation along the Planckian locus to create fine-tuning variants under standard illuminants; (b) edge-guided prompt disentanglement using a frozen ControlNet to ensure prompts focus on illumination rather than structure; and (c) a masked reconstruction loss that focuses learning on the foreground object while allowing the background to adapt contextually, enabling what we call contextual light adaptation. We qualitatively and quantitatively compare LumiCtrl against other T2I customization methods. The results show that our method achieves significantly better illuminant fidelity, aesthetic quality, and scene coherence compared to existing personalization baselines. A human preference study further confirms strong user preference for LumiCtrl outputs. The code and data will be released upon publication.
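The physics-based augmentation step can be pictured with a standard black-body approximation: map a correlated color temperature on the Planckian locus to an approximate RGB white point, then re-tint the training image. A hedged sketch; the exact illuminants and color pipeline LumiCtrl uses may differ:

```python
# Approximate Planckian illuminant augmentation: Kelvin -> RGB white point
# (a widely used black-body fit), then per-channel re-tinting of the image.
import numpy as np

def kelvin_to_rgb(temp_k: float) -> np.ndarray:
    """Approximate RGB (0..1) of a black-body illuminant at temp_k Kelvin."""
    t = temp_k / 100.0
    r = 255.0 if t <= 66 else 329.698727446 * (t - 60) ** -0.1332047592
    g = (99.4708025861 * np.log(t) - 161.1195681661 if t <= 66
         else 288.1221695283 * (t - 60) ** -0.0755148492)
    b = 255.0 if t >= 66 else (
        0.0 if t <= 19 else 138.5177312231 * np.log(t - 10) - 305.0447927307)
    return np.clip([r, g, b], 0, 255) / 255.0

def relight(img: np.ndarray, temp_k: float) -> np.ndarray:
    """img: (H, W, 3) float in [0, 1]; scale channels by the illuminant."""
    return np.clip(img * kelvin_to_rgb(temp_k), 0.0, 1.0)

img = np.random.rand(64, 64, 3)
warm, cool = relight(img, 2700), relight(img, 10000)  # tungsten vs. clear sky
```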
[129] MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
Oskar Kristoffersen, Alba R. Sánchez, Morten R. Hannemose, Anders B. Dahl, Dim P. Papadopoulos
Main category: cs.CV
TL;DR: MMLANDMARKS is a multimodal geo-spatial dataset with 197k aerial images, 329k ground images, text, and GPS coordinates for 18,557 US landmarks, enabling unified training for various geo-spatial tasks.
Details
Motivation: Current geo-spatial benchmarks have limited modality coverage, restricting progress because approaches cannot integrate all relevant modalities (images from various viewpoints, text, coordinates) within a unified framework.
Method: Created MMLANDMARKS dataset with one-to-one correspondence across four modalities: high-resolution aerial images, ground-view images, textual information, and geographic coordinates for 18,557 distinct US landmarks.
Result: Demonstrated broad generalization and competitive performance against off-the-shelf foundational models and specialized SOTA models across tasks (cross-view retrieval, geolocalization, text-to-image/GPS retrieval) using a simple CLIP-inspired baseline.
Conclusion: The dataset enables unified multimodal geo-spatial understanding, and the results illustrate the necessity for multimodal datasets to achieve comprehensive geo-spatial analysis capabilities.
Abstract: Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
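A "simple CLIP-inspired baseline" on a dataset like this usually means a symmetric contrastive loss between embeddings of matched pairs from any two modalities (ground image and aerial image, text and GPS encoding, and so on). A hedged sketch of that loss, not the authors' exact model:

```python
# Symmetric InfoNCE (CLIP-style) loss over a batch of matched cross-modal pairs.
import torch
import torch.nn.functional as F

def clip_loss(za: torch.Tensor, zb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """za, zb: (B, D) embeddings where row i of za matches row i of zb."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.T / tau                 # (B, B) cosine similarities
    target = torch.arange(za.size(0))        # diagonal entries are the true pairs
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.T, target))

loss = clip_loss(torch.randn(32, 512), torch.randn(32, 512))
```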
[130] GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo
Main category: cs.CV
TL;DR: GroundingME is a new benchmark that systematically tests MLLMs’ visual grounding capabilities across four challenging dimensions, revealing significant gaps in their ability to handle real-world complexity.
Details
Motivation: Current benchmarks fail to capture real-world complexity where humans effortlessly handle ambiguous references and recognize when grounding is impossible. There's a need to assess whether MLLMs truly ground language in vision with human-like sophistication or are merely pattern-matching on simplified datasets.
Method: Created GroundingME benchmark with 1,005 challenging examples through careful curation combining automated generation with human verification. The benchmark systematically tests models across four dimensions: Discriminative (similar objects), Spatial (complex relations), Limited (occlusions/tiny objects), and Rejection (ungroundable queries). Evaluated 25 state-of-the-art MLLMs and explored two improvement strategies: test-time scaling and data-mixture training.
Result: Evaluation revealed profound capability gaps: best model achieved only 45.1% accuracy, and most scored 0% on rejection tasks, hallucinating objects rather than acknowledging their absence. Test-time scaling improved complex grounding by up to 2.9%, and data-mixture training boosted rejection accuracy from 0% to 27.9%.
Conclusion: GroundingME serves as both a diagnostic tool revealing current limitations in MLLMs’ visual grounding capabilities and a roadmap toward achieving human-level visual grounding, highlighting critical safety concerns for deployment.
Abstract: Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs’ true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
[131] Validation of Diagnostic Artificial Intelligence Models for Prostate Pathology in a Middle Eastern Cohort
Peshawa J. Muhammad Ali, Navin Vincent, Saman S. Abdulla, Han N. Mohammed Fadhl, Anders Blilie, Kelvin Szolnoky, Julia Anna Mielcarz, Xiaoyi Ji, Nita Mulliqi, Abdulbasit K. Al-Talabani, Kimmo Kartasalo
Main category: cs.CV
TL;DR: AI models for prostate cancer diagnosis show pathologist-level performance when validated on Middle Eastern population, with high cross-scanner consistency including low-cost compact scanners.
Details
Motivation: Current AI pathology systems are primarily validated on European/US cohorts, leaving under-represented populations unaddressed. There's a need for validation in regions where AI support could have the greatest impact, particularly the Middle East.
Method: Collected 339 prostate biopsy specimens from 185 consecutive patients (2013-2024) in Kurdistan, Iraq. Evaluated a task-specific end-to-end AI model and two foundation models for concordance with pathologists and consistency across three scanner models (Hamamatsu, Leica, Grundium).
Result: AI-pathologist grading concordance (kappa 0.801) was similar to pathologist-pathologist concordance (kappa 0.799). High cross-scanner concordance (kappa > 0.90) for all AI models and scanner pairs, including the low-cost compact scanner.
Conclusion: AI achieves pathologist-level performance in prostate histopathology. Compact scanners enable validation in non-digitalized settings and cost-effective AI adoption. First openly available Middle Eastern digital pathology dataset supports globally equitable AI pathology research.
Abstract: Background: Artificial intelligence (AI) is improving the efficiency and accuracy of cancer diagnostics. The performance of pathology AI systems has been almost exclusively evaluated on European and US cohorts from large centers. For global AI adoption in pathology, validation studies on currently under-represented populations - where the potential gains from AI support may also be greatest - are needed. We present the first study with an external validation cohort from the Middle East, focusing on AI-based diagnosis and Gleason grading of prostate cancer. Methods: We collected and digitised 339 prostate biopsy specimens from the Kurdistan region, Iraq, representing a consecutive series of 185 patients spanning the period 2013-2024. We evaluated a task-specific end-to-end AI model and two foundation models in terms of their concordance with pathologists and consistency across samples digitised on three scanner models (Hamamatsu, Leica, and Grundium). Findings: Grading concordance between AI and pathologists was similar to pathologist-pathologist concordance with Cohen’s quadratically weighted kappa 0.801 vs. 0.799 (p=0.9824). Cross-scanner concordance was high (quadratically weighted kappa > 0.90) for all AI models and scanner pairs, including a low-cost compact scanner. Interpretation: AI models demonstrated pathologist-level performance in prostate histopathology assessment. Compact scanners can provide a route for validation studies in non-digitalised settings and enable cost-effective adoption of AI in laboratories with limited sample volumes. This first openly available digital pathology dataset from the Middle East supports further research into globally equitable AI pathology. Funding: SciLifeLab and Wallenberg Data Driven Life Science Program, Instrumentarium Science Foundation, Karolinska Institutet Research Foundation.
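Cohen's quadratically weighted kappa, the concordance statistic used throughout this study, penalizes disagreements by the squared distance between ordinal grades. A minimal NumPy implementation for ratings in 0..k-1, with an illustrative toy example:

```python
# Quadratically weighted Cohen's kappa for two raters on an ordinal scale.
import numpy as np

def quadratic_weighted_kappa(a: np.ndarray, b: np.ndarray, k: int) -> float:
    """a, b: integer ratings (e.g., AI vs. pathologist grade groups)."""
    O = np.zeros((k, k))
    for i, j in zip(a, b):                       # observed agreement matrix
        O[i, j] += 1
    O /= O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0))   # expected under independence
    ii, jj = np.indices((k, k))
    W = (ii - jj) ** 2 / (k - 1) ** 2            # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()

a = np.array([0, 1, 2, 2, 3])
b = np.array([0, 1, 2, 3, 3])
print(round(quadratic_weighted_kappa(a, b, k=4), 3))
```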
[132] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad
Main category: cs.CV
TL;DR: InfSplign is a training-free inference-time method that improves spatial alignment in text-to-image diffusion models by adjusting noise through a compound loss using cross-attention maps.
Details
Motivation: Current T2I diffusion models generate high-quality images but often fail to capture spatial relations specified in text prompts due to lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics.
Method: InfSplign is a training-free inference-time method that adjusts noise through a compound loss in every denoising step, leveraging different levels of cross-attention maps from the backbone decoder to enforce accurate object placement and balanced object presence.
Result: InfSplign establishes new state-of-the-art on VISOR and T2I-CompBench, achieving substantial performance gains over existing inference-time baselines and even outperforming fine-tuning-based methods.
Conclusion: InfSplign is a lightweight, plug-and-play, and compatible training-free method that significantly improves spatial alignment in T2I diffusion models through inference-time adjustments using cross-attention guidance.
Abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
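The general mechanism (differentiate a loss computed from the denoiser's cross-attention maps with respect to the current latent, nudge the latent, then continue denoising) can be sketched schematically. Here `unet`, `scheduler`, `cond`, `attn_maps_of`, and `spatial_loss` are hypothetical placeholders standing in for a diffusion pipeline; InfSplign's actual compound loss is more elaborate:

```python
# Schematic inference-time spatial guidance for one denoising step.
import torch

def guided_step(latent, t, unet, scheduler, cond, attn_maps_of, spatial_loss, lr=0.1):
    latent = latent.detach().requires_grad_(True)
    noise_pred = unet(latent, t, cond)        # forward pass also caches attention
    loss = spatial_loss(attn_maps_of(unet))   # e.g., penalize misplaced objects
    grad = torch.autograd.grad(loss, latent)[0]
    latent = (latent - lr * grad).detach()    # adjust the noise toward the layout
    return scheduler.step(noise_pred, t, latent).prev_sample
```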
[133] Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
Sairam VCR, Rishabh Lalla, Aveen Dayal, Tejal Kulkarni, Anuj Lalla, Vineeth N Balasubramanian, Muhammad Haris Khan
Main category: cs.CV
TL;DR: FALCON-SFOD improves source-free object detection by strengthening object-focused feature representations using foundation model guidance and noise-robust pseudo-labeling, addressing domain shift issues that cause background clutter activation.
Details
Motivation: Current SFOD methods using Mean-Teacher self-labeling suffer from domain shift that weakens object-focused representations, causing high-confidence activations over background clutter and unreliable pseudo-labels. Prior works focus on refining pseudo-labels but overlook the need to strengthen the feature space itself.
Method: FALCON-SFOD has two components: 1) SPAR (Spatial Prior-Aware Regularization) uses vision foundation models (OV-SAM) to generate class-agnostic binary masks that regularize the detector’s feature space toward object regions, and 2) IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) promotes balanced and noise-tolerant learning under severe foreground-background imbalance.
Result: The framework achieves competitive performance across SFOD benchmarks, with theoretical analysis showing tighter localization and classification error bounds.
Conclusion: FALCON-SFOD effectively addresses domain shift in source-free object detection by enhancing object-focused adaptation through foundation model guidance and robust pseudo-labeling, outperforming prior methods that only refine pseudo-labels.
Abstract: Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector’s ability to maintain strong object-focused representations, causing high-confidence activations over background clutter. This weak object focus results in unreliable pseudo-labels from the detection head. While prior works mainly refine these pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (Spatial Prior-Aware Regularization) leverages the generalization strength of vision foundation models to regularize the detector’s feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks.
[134] PathBench-MIL: A Comprehensive AutoML and Benchmarking Framework for Multiple Instance Learning in Histopathology
Siemen Brussee, Pieter A. Valkema, Jurre A. J. Weijer, Thom Doeleman, Anne M. R. Schrader, Jesper Kers
Main category: cs.CV
TL;DR: PathBench-MIL is an open-source AutoML and benchmarking framework for multiple instance learning in histopathology that automates end-to-end MIL pipeline construction and provides reproducible benchmarking.
Details
Motivation: To address the need for standardized, reproducible, and automated MIL pipeline construction in histopathology, enabling rapid experimentation and comparison across different models and feature extractors.
Method: Develops an open-source framework that automates end-to-end MIL pipeline construction including preprocessing, feature extraction, and MIL-aggregation, with integrated visualization tooling, unified configuration system, and modular extensibility.
Result: PathBench-MIL provides reproducible benchmarking of dozens of MIL models and feature extractors, enabling standardization across datasets and tasks in histopathology.
Conclusion: PathBench-MIL offers a comprehensive AutoML and benchmarking solution for MIL in histopathology, facilitating rapid experimentation, standardization, and reproducible research in the field.
Abstract: We introduce PathBench-MIL, an open-source AutoML and benchmarking framework for multiple instance learning (MIL) in histopathology. The system automates end-to-end MIL pipeline construction, including preprocessing, feature extraction, and MIL-aggregation, and provides reproducible benchmarking of dozens of MIL models and feature extractors. PathBench-MIL integrates visualization tooling, a unified configuration system, and modular extensibility, enabling rapid experimentation and standardization across datasets and tasks. PathBench-MIL is publicly available at https://github.com/Sbrussee/PathBench-MIL
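As a concrete example of the MIL-aggregation stage such a framework benchmarks, here is a hedged sketch of gated attention-based MIL pooling in the style of Ilse et al. (2018), one standard aggregator for whole-slide patch features, not necessarily PathBench-MIL's own implementation:

```python
# Gated attention MIL: score each patch embedding, softmax over the bag,
# and pool to a slide-level representation for classification.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.V = nn.Linear(feat_dim, hidden)
        self.U = nn.Linear(feat_dim, hidden)
        self.w = nn.Linear(hidden, 1)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, patches: torch.Tensor):       # (N, feat_dim) patch features
        gate = torch.tanh(self.V(patches)) * torch.sigmoid(self.U(patches))
        attn = torch.softmax(self.w(gate), dim=0)   # (N, 1) attention over the bag
        slide = (attn * patches).sum(dim=0)         # weighted slide embedding
        return self.head(slide), attn

logits, attn = AttentionMIL()(torch.randn(500, 768))  # one slide, 500 patches
```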
[135] Interpretable Plant Leaf Disease Detection Using Attention-Enhanced CNN
Balram Singh, Ram Prakash Sharma, Somnath Dey
Main category: cs.CV
TL;DR: This paper introduces CBAM-VGG16, an interpretable attention-guided CNN for plant leaf disease detection that integrates a Convolutional Block Attention Module at each convolutional stage to enhance feature extraction and disease localization.
Details
Motivation: Plant diseases threaten global food security, creating a need for accurate and interpretable disease detection methods in agriculture. Current methods often lack transparency and interpretability, which is crucial for reliable agricultural diagnostics.
Method: The authors propose CBAM-VGG16, which integrates a Convolutional Block Attention Module (CBAM) at each convolutional stage of a VGG16 architecture. This attention mechanism enhances feature extraction and disease localization. The model is trained on five diverse plant disease datasets and evaluated using multiple interpretability techniques including CBAM attention maps, Grad-CAM, Grad-CAM++, and Layer-wise Relevance Propagation (LRP).
Result: The approach outperforms recent techniques, achieving high accuracy up to 98.87% and demonstrating robust generalization across different plant disease datasets. The interpretability analysis provides transparent visualization of disease localization.
Conclusion: This study advances explainable AI in agricultural diagnostics by offering a transparent and reliable system for smart farming. The attention-guided approach provides both high accuracy and interpretability, making it suitable for practical agricultural applications.
Abstract: Plant diseases pose a significant threat to global food security, necessitating accurate and interpretable disease detection methods. This study introduces an interpretable attention-guided Convolutional Neural Network (CNN), CBAM-VGG16, for plant leaf disease detection. By integrating a Convolutional Block Attention Module (CBAM) at each convolutional stage, the model enhances feature extraction and disease localization. Trained on five diverse plant disease datasets, our approach outperforms recent techniques, achieving high accuracy (up to 98.87%) and demonstrating robust generalization. Here, we show the effectiveness of our method through comprehensive evaluation and interpretability analysis using CBAM attention maps, Grad-CAM, Grad-CAM++, and Layer-wise Relevance Propagation (LRP). This study advances the application of explainable AI in agricultural diagnostics, offering a transparent and reliable system for smart farming. The code of our proposed work is available at https://github.com/BS0111/PlantAttentionCBAM.
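For reference, a minimal CBAM block (channel attention followed by spatial attention, after Woo et al., 2018) as it would be inserted at each convolutional stage; the reduction ratio and kernel size below are common defaults, not necessarily the paper's settings:

```python
# CBAM: channel attention from pooled descriptors through a shared MLP,
# then spatial attention from channel-pooled maps through a conv layer.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (B, C, H, W)
        b, c, _, _ = x.shape
        ch = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * ch.view(b, c, 1, 1)                        # channel attention
        sp = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))         # spatial attention

y = CBAM(64)(torch.randn(2, 64, 32, 32))
```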
[136] FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views
Qijian Tian, Xin Tan, Jiayu Ying, Xuhong Wang, Yuan Xie, Lizhuang Ma
Main category: cs.CV
TL;DR: FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views without requiring 3D annotations, using 2D-to-3D lifting from uncalibrated multi-view images.
Details
Motivation: Previous methods for 3D Gaussian reconstruction suffer from fixed input view requirements and insufficient 3D training data, limiting their practical application and semantic richness.
Method: A 3D-annotation-free training framework using 2D-to-3D lifting from arbitrary uncalibrated multi-view images, instance-guided contrastive learning to align 2D semantics with 3D representations, and geometry-semantic hierarchical sparsification to reduce computational costs.
Result: FLEG efficiently reconstructs language-embedded 3D Gaussian representations from arbitrary sparse or dense views, producing accurate geometry, high-fidelity appearance, and language-aligned semantics, outperforming existing methods on various tasks.
Conclusion: The proposed framework enables efficient 3D Gaussian reconstruction with rich semantic embedding without requiring 3D annotations, making it practical for real-world applications with arbitrary view inputs.
Abstract: We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose an instance-guided contrastive learning scheme to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. Our FLEG efficiently reconstructs a language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.
[137] G3Splat: Geometrically Consistent Generalizable Gaussian Splatting
Mehdi Hosseinzadeh, Shin-Fang Chng, Yi Xu, Simon Lucey, Ian Reid, Ravi Garg
Main category: cs.CV
TL;DR: G3Splat introduces geometric priors to address ambiguities in learning 3D Gaussian splats from view-synthesis supervision alone, achieving SOTA in reconstruction, pose estimation, and novel-view synthesis.
Details
Motivation: View-synthesis loss alone is insufficient for recovering geometrically meaningful 3D Gaussian splats, as prior work extends networks to predict Gaussian parameters but lacks geometric consistency.
Method: G3Splat enforces geometric priors to obtain geometrically consistent 3D scene representations for pose-free generalizable splatting, addressing learning ambiguities under self-supervision.
Result: Achieves state-of-the-art performance on RE10K in geometrically consistent reconstruction, relative pose estimation, and novel-view synthesis, with strong zero-shot generalization on ScanNet.
Conclusion: Geometric priors are essential for learning meaningful 3D Gaussian splats, enabling improved reconstruction, pose estimation, and generalization beyond view-synthesis supervision alone.
Abstract: 3D Gaussians have recently emerged as an effective scene representation for real-time splatting and accurate novel-view synthesis, motivating several works to adapt multi-view structure prediction networks to regress per-pixel 3D Gaussians from images. However, most prior work extends these networks to predict additional Gaussian parameters – orientation, scale, opacity, and appearance – while relying almost exclusively on view-synthesis supervision. We show that a view-synthesis loss alone is insufficient to recover geometrically meaningful splats in this setting. We analyze and address the ambiguities of learning 3D Gaussian splats under self-supervision for pose-free generalizable splatting, and introduce G3Splat, which enforces geometric priors to obtain geometrically consistent 3D scene representations. Trained on RE10K, our approach achieves state-of-the-art performance in (i) geometrically consistent reconstruction, (ii) relative pose estimation, and (iii) novel-view synthesis. We further demonstrate strong zero-shot generalization on ScanNet, substantially outperforming prior work in both geometry recovery and relative pose estimation. Code and pretrained models are released on our project page (https://m80hz.github.io/g3splat/).
[138] RadarGen: Automotive Radar Point Cloud Generation from Cameras
Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany
Main category: cs.CV
TL;DR: RadarGen is a diffusion model that generates realistic automotive radar point clouds from multi-view camera images, using BEV representations and foundation model guidance for physically plausible results.
Details
Motivation: To create a scalable method for generating realistic radar data that can work with existing visual datasets and simulation frameworks, bridging the gap between camera and radar modalities for autonomous driving applications.
Method: Adapts image-latent diffusion to radar domain by representing radar measurements in bird’s-eye-view (BEV) format with RCS and Doppler attributes. Uses lightweight recovery to reconstruct point clouds from generated maps. Incorporates BEV-aligned depth, semantic, and motion cues from pretrained foundation models to guide generation toward physical plausibility.
Result: RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, showing promising results on large-scale driving datasets.
Conclusion: The approach represents a step toward unified generative simulation across sensing modalities, offering a scalable direction for multimodal simulation that can leverage existing visual datasets.
Abstract: We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird’s-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
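The BEV radar representation described above amounts to a simple rasterization: scatter the points onto a grid and fill occupancy, RCS, and Doppler channels. A hedged sketch with illustrative grid extent and resolution; RadarGen's actual encoding may differ:

```python
# Rasterize a radar point cloud into a 3-channel BEV map.
import numpy as np

def radar_to_bev(points: np.ndarray, extent: float = 50.0, res: float = 0.5):
    """points: (N, 4) rows of [x, y, rcs, doppler] in ego coordinates.
    Returns (3, H, W): occupancy, mean RCS, mean Doppler per cell."""
    size = int(2 * extent / res)
    bev = np.zeros((3, size, size))
    cnt = np.zeros((size, size))
    ij = ((points[:, :2] + extent) / res).astype(int)
    ok = (ij >= 0).all(axis=1) & (ij < size).all(axis=1)   # keep in-grid points
    for (i, j), rcs, dop in zip(ij[ok], points[ok, 2], points[ok, 3]):
        bev[0, i, j] = 1.0
        bev[1, i, j] += rcs
        bev[2, i, j] += dop
        cnt[i, j] += 1
    nz = cnt > 0
    bev[1][nz] /= cnt[nz]                                  # mean RCS where occupied
    bev[2][nz] /= cnt[nz]                                  # mean Doppler where occupied
    return bev

bev = radar_to_bev(np.random.randn(200, 4) * [10, 10, 5, 3])
```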
[139] RoomEditor++: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis
Qilong Wang, Xiaofan Ming, Zhenyi Lin, Jinwen Li, Dongwei Ren, Wangmeng Zuo, Qinghua Hu
Main category: cs.CV
TL;DR: RoomEditor++ is a diffusion-based architecture for virtual furniture synthesis that integrates reference objects into indoor scenes with geometric coherence and visual realism, outperforming SOTA methods on the new RoomBench++ benchmark.
Details
Motivation: Virtual furniture synthesis has great potential for home design and e-commerce, but remains underexplored due to lack of reproducible benchmarks and limitations of existing image composition methods in achieving high-fidelity synthesis while preserving background integrity.
Method: Proposes RoomEditor++ with parameter-sharing dual diffusion backbone compatible with both U-Net and DiT architectures, unifying feature extraction and inpainting for reference and background images to enforce aligned feature representations for precise geometric transformations and seamless integration.
Result: Extensive experiments show RoomEditor++ outperforms state-of-the-art approaches in quantitative metrics, qualitative assessments, and human preference studies, with strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning.
Conclusion: RoomEditor++ provides an effective solution for virtual furniture synthesis, supported by the comprehensive RoomBench++ benchmark dataset, demonstrating superior performance and generalization capabilities for practical applications.
Abstract: Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored due to the scarcity of reproducible benchmarks and the limitations of existing image composition methods in achieving high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. Then, we propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments validate that RoomEditor++ is superior over state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies, while highlighting its strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at https://github.com/stonecutter-21/roomeditor.
[140] Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
Rheeya Uppaal, Phu Mon Htut, Min Bai, Nikolaos Pappas, Zheng Qi, Sandesh Swamy
Main category: cs.CV
TL;DR: A framework to evaluate and improve visual faithfulness in reasoning chains of VLMs, using step-level decomposition and self-reflection without training.
Details
Motivation: Current VLM evaluations only measure final-answer accuracy, failing to distinguish between models that reach correct answers via visually unfaithful reasoning versus those that reason faithfully but make wrong predictions. There's a need to evaluate visual faithfulness of reasoning chains as a distinct dimension.
Method: Proposes a training-free framework that: 1) decomposes reasoning chains into perception vs reasoning steps, 2) uses off-the-shelf VLM judges for step-level faithfulness evaluation, 3) introduces lightweight self-reflection that detects and regenerates unfaithful perception steps without training.
Result: The method reduces Unfaithful Perception Rate while preserving final-answer accuracy across multiple reasoning-trained VLMs and perception-heavy benchmarks, improving multimodal reasoning reliability.
Conclusion: Visual faithfulness of reasoning chains is a crucial evaluation dimension for VLMs, and the proposed training-free framework with self-reflection effectively improves reliability of multimodal reasoning by addressing unfaithful perception steps.
Abstract: Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
[141] Adversarial Robustness of Vision in Open Foundation Models
Jonathon Fox, William J Buchanan, Pavlos Papadopoulos
Main category: cs.CV
TL;DR: This paper investigates adversarial robustness of vision-language models (LLaVA-1.5-13B and Llama 3.2 Vision-8B-2) against visual attacks using PGD on VQA tasks, finding that Llama 3.2 Vision shows better robustness despite lower baseline accuracy.
Details
Motivation: As deep learning models become more complex and opaque, adversaries can exploit vulnerabilities by adding subtle perturbations to images that confuse AI recognition systems. The paper aims to evaluate the adversarial robustness of contemporary vision-language models against visual attacks.
Method: The study tests LLaVA-1.5-13B and Meta’s Llama 3.2 Vision-8B-2 using untargeted Projected Gradient Descent (PGD) attacks against the visual input modality. Evaluation is conducted on a subset of the Visual Question Answering (VQA) v2 dataset, with results quantified using standard VQA accuracy metrics and accuracy degradation analysis.
Result: Llama 3.2 Vision, despite having lower baseline accuracy in the experimental setup, exhibited smaller performance drops under adversarial attack compared to LLaVA, particularly at higher perturbation levels. Both models showed vulnerability to visual attacks, confirming the vision modality as a viable attack vector.
Conclusion: Vision modality represents a significant vulnerability for contemporary open-weight vision-language models. Adversarial robustness does not necessarily correlate with standard benchmark performance and may be influenced by underlying architectural and training factors, suggesting that robustness should be considered separately from standard accuracy metrics.
Abstract: As deep learning models grow in complexity, it becomes increasingly difficult to understand how AI systems recognize objects. An adversary can exploit this opacity by adding subtle perturbations to an image that confuse the model’s recognition of an entity. This paper therefore investigates the adversarial robustness of LLaVA-1.5-13B and Meta’s Llama 3.2 Vision-8B-2. These are tested against untargeted PGD (Projected Gradient Descent) attacks on the visual input modality, and empirically evaluated on a subset of the Visual Question Answering (VQA) v2 dataset. The results of these adversarial attacks are then quantified using the standard VQA accuracy metric. This evaluation is then compared with the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack compared to LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta’s Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
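The attack itself is standard. Below is the classification-style form of untargeted L-infinity PGD for reference; attacking a VQA model in practice maximizes a loss over the generated answer tokens rather than a class label, and the paper's step sizes and iteration counts are not restated here:

```python
# Untargeted projected gradient descent within an L-infinity ball.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Perturb image batch x (values in [0, 1]) to maximize the loss on y."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()          # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)     # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                    # keep a valid image
    return x_adv.detach()
```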
[142] 3One2: One-step Regression Plus One-step Diffusion for One-hot Modulation in Dual-path Video Snapshot Compressive Imaging
Ge Wang, Xing Liu, Xin Yuan
Main category: cs.CV
TL;DR: This paper proposes a novel algorithm for video snapshot compressive imaging (SCI) using one-hot modulation masks, transforming reconstruction into a generative video inpainting problem solved via stochastic differential equations and diffusion refinement.
Details
Motivation: Current video SCI using random binary modulation causes temporal aliasing during compression. One-hot modulation offers perfect temporal decoupling but lacks algorithms to fully exploit its potential. The authors aim to bridge this gap by developing a specialized algorithm for one-hot masks.
Method: 1) Transform reconstruction into generative video inpainting using SDEs aligned with hardware compression. 2) Propose a novel framework combining one-step regression initialization with one-step diffusion refinement to address limitations of pure diffusion. 3) Implement dual optical path hardware to mitigate spatial degradation from one-hot modulation using complementary information.
Result: Experiments on synthetic datasets and real scenes demonstrate the effectiveness of the proposed method. This represents the first work integrating diffusion into video SCI reconstruction.
Conclusion: The proposed algorithm successfully addresses the limitations of existing video SCI methods by leveraging one-hot modulation’s decoupling properties, combining regression and diffusion techniques, and incorporating hardware-level dual optical paths for enhanced reconstruction quality.
Abstract: Video snapshot compressive imaging (SCI) captures dynamic scene sequences through a two-dimensional (2D) snapshot, fundamentally relying on optical modulation for hardware compression and the corresponding software reconstruction. While mainstream video SCI using random binary modulation has demonstrated success, it inevitably results in temporal aliasing during compression. One-hot modulation, activating only one sub-frame per pixel, provides a promising solution for achieving perfect temporal decoupling, thereby alleviating issues associated with aliasing. However, no algorithms currently exist to fully exploit this potential. To bridge this gap, we propose an algorithm specifically designed for one-hot masks. First, leveraging the decoupling properties of one-hot modulation, we transform the reconstruction task into a generative video inpainting problem and introduce a stochastic differential equation (SDE) of the forward process that aligns with the hardware compression process. Next, we identify limitations of the pure diffusion method for video SCI and propose a novel framework that combines one-step regression initialization with one-step diffusion refinement. Furthermore, to mitigate the spatial degradation caused by one-hot modulation, we implement a dual optical path at the hardware level, utilizing complementary information from another path to enhance the inpainted video. To our knowledge, this is the first work integrating diffusion into video SCI reconstruction. Experiments conducted on synthetic datasets and real scenes demonstrate the effectiveness of our method.
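The key property of one-hot modulation (each pixel samples exactly one of the T sub-frames, so the snapshot is alias-free in time and reconstruction reduces to video inpainting) is easy to see in the forward model. A small illustration with arbitrary shapes:

```python
# SCI forward model y = sum_t M_t * x_t with one-hot masks over time.
import numpy as np

T, H, W = 8, 64, 64
video = np.random.rand(T, H, W)                       # dynamic scene x_t

frame_idx = np.random.randint(0, T, size=(H, W))      # sub-frame each pixel keeps
masks = (np.arange(T)[:, None, None] == frame_idx)    # (T, H, W), one-hot in t

snapshot = (masks * video).sum(axis=0)                # 2D hardware measurement y

# One-hot masks mean every pixel of y comes from a single frame: frame t is
# observed exactly where frame_idx == t, and recovering the rest of that
# frame is a (generative) video inpainting problem.
observed_in_frame0 = frame_idx == 0
```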
[143] Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
Ananta R. Bhattarai, Helge Rhodin
Main category: cs.CV
TL;DR: Re-Depth Anything is a test-time self-supervision framework that improves monocular depth estimation by fusing Depth Anything V2 with 2D diffusion models, using shape-from-shading cues and targeted optimization to bridge domain gaps.
Details
Motivation: Recent foundation models like Depth Anything V2 struggle with real-world images that differ from training distribution, creating a domain gap problem in monocular depth estimation.
Method: A test-time self-supervision framework that fuses DA-V2 with 2D diffusion models, using shape-from-shading cues with Score Distillation Sampling. Instead of optimizing depth directly, it freezes the encoder, updates intermediate embeddings, and fine-tunes the decoder.
Result: Substantial gains in depth accuracy and realism across diverse benchmarks compared to DA-V2, demonstrating effective domain gap bridging.
Conclusion: Re-Depth Anything showcases new avenues for self-supervision by augmenting geometric reasoning, offering a promising approach to improve monocular depth estimation in real-world scenarios.
Abstract: Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape-from-shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and only update intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.
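The shape-from-shading cue can be pictured with a minimal Lambertian re-lighting sketch, using finite-difference normals from the predicted depth; this is a simplified stand-in for the paper's re-synthesis, and `light_dir` is an arbitrary illustrative light.

```python
import torch
import torch.nn.functional as F

def relight_from_depth(depth, light_dir=(0.3, 0.3, 0.9)):
    """Lambertian shading of a depth map (B, 1, H, W) via finite-difference normals."""
    dzdx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1))        # d(depth)/dx
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))  # d(depth)/dy
    normals = F.normalize(torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1), dim=1)
    light = torch.tensor(light_dir, dtype=depth.dtype, device=depth.device)
    light = light / light.norm()
    return (normals * light.view(1, 3, 1, 1)).sum(1, keepdim=True).clamp(min=0)
```

In the paper's framework, gradients from an SDS objective on such re-lit renderings flow back into the intermediate embeddings and decoder while the encoder stays frozen.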
[144] Medical Imaging AI Competitions Lack Fairness
Annika Reinke, Evangelia Christodoulou, Sthuthi Sadananda, A. Emre Kavur, Khrystyna Faryna, Daan Schouten, Bennett A. Landman, Carole Sudre, Olivier Colliot, Nick Heller, Sophie Loizillon, Martin Maška, Maëlys Solal, Arya Yazdan-Panah, Vilma Bozgo, Ömer Sümer, Siem de Jong, Sophie Fischer, Michal Kozubek, Tim Rädsch, Nadim Hammoud, Fruzsina Molnár-Gábor, Steven Hicks, Michael A. Riegler, Anindo Saha, Vajira Thambawita, Pal Halvorsen, Amelia Jiménez-Sánchez, Qingyang Yang, Veronika Cheplygina, Sabrina Bottazzi, Alexander Seitel, Spyridon Bakas, Alexandros Karargyris, Kiran Vaidhya Venkadesh, Bram van Ginneken, Lena Maier-Hein
Main category: cs.CV
TL;DR: Medical imaging AI benchmarks have fairness issues: datasets lack real-world clinical diversity and have poor accessibility/reusability, creating disconnect between leaderboard success and clinical relevance.
Details
Motivation: To assess whether medical imaging AI benchmarks provide sufficiently representative, accessible, and reusable data to support clinically meaningful AI, examining fairness in dataset composition and FAIR compliance.Method: Conducted large-scale systematic study of 241 biomedical image analysis challenges comprising 458 tasks across 19 imaging modalities, evaluating dataset representativeness and FAIR principles compliance.
Result: Found substantial biases in dataset composition (geographic, modality, problem type) and poor accessibility/reusability due to restrictive access, inconsistent licensing, and incomplete documentation.
Conclusion: Current medical imaging AI benchmarks have foundational fairness limitations, showing disconnect between leaderboard success and clinical relevance due to unrepresentative data and poor FAIR compliance.
Abstract: Benchmarking competitions are central to the development of artificial intelligence (AI) in medical imaging, defining performance standards and shaping methodological progress. However, it remains unclear whether these benchmarks provide data that are sufficiently representative, accessible, and reusable to support clinically meaningful AI. In this work, we assess fairness along two complementary dimensions: (1) whether challenge datasets are representative of real-world clinical diversity, and (2) whether they are accessible and legally reusable in line with the FAIR principles. To address this question, we conducted a large-scale systematic study of 241 biomedical image analysis challenges comprising 458 tasks across 19 imaging modalities. Our findings show substantial biases in dataset composition, including biases related to geographic location, imaging modality, and problem type, indicating that current benchmarks do not adequately reflect real-world clinical diversity. Despite their widespread influence, challenge datasets were frequently constrained by restrictive or ambiguous access conditions, inconsistent or non-compliant licensing practices, and incomplete documentation, limiting reproducibility and long-term reuse. Together, these shortcomings expose foundational fairness limitations in our benchmarking ecosystem and highlight a disconnect between leaderboard success and clinical relevance.
[145] HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection
Zhaolin Cai, Fan Li, Ziwei Zheng, Haixia Bi, Lijun He
Main category: cs.CV
TL;DR: HeadHunt-VAD is a tuning-free video anomaly detection method that identifies and uses specific attention heads within frozen multimodal LLMs instead of relying on textual outputs, achieving state-of-the-art performance on benchmarks.
Details
Motivation: Traditional VAD methods require extensive labeled data and are computationally expensive. Existing MLLM-based tuning-free methods rely on textual outputs which suffer from information loss, normalcy bias, and prompt sensitivity, making them insufficient for capturing subtle anomalous cues.Method: Proposes HeadHunt-VAD with a Robust Head Identification module that systematically evaluates all attention heads using multi-criteria analysis of saliency and stability to identify a sparse subset of consistently discriminative heads. Features from these expert heads are fed into a lightweight anomaly scorer and temporal locator.
Result: Achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful solution.
Conclusion: Head-level probing in frozen MLLMs provides a practical and effective tuning-free solution for real-world video anomaly detection, bypassing the limitations of textual output-based approaches.
Abstract: Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
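A minimal sketch of the multi-criteria selection, assuming per-head anomaly-vs-normal separation scores have already been measured on a small probe set under several prompt variants; the saliency-times-stability combination and `k` are illustrative rather than the paper's exact criteria.

```python
import numpy as np

def select_anomaly_heads(head_scores, k=16):
    """head_scores: (num_prompts, num_heads) separation between anomalous and
    normal clips for each attention head, one row per prompt variant."""
    saliency = head_scores.mean(axis=0)                  # discriminative on average
    stability = 1.0 / (head_scores.std(axis=0) + 1e-6)   # consistent across prompts
    combined = saliency * stability                      # simple multi-criteria score
    return np.argsort(combined)[::-1][:k]                # indices of the expert heads
```

Features from the selected heads would then feed a lightweight anomaly scorer, with the MLLM itself left frozen.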
[146] Semi-Supervised 3D Segmentation for Type-B Aortic Dissection with Slim UNETR
Denis Mikhailapov, Vladimir Berikov
Main category: cs.CV
TL;DR: A semi-supervised learning method for multi-output CNN segmentation models in medical imaging, applied to aortic dissection diagnosis using limited labeled data.
Details
Motivation: Multi-output CNN models improve medical image segmentation accuracy but require extensive labeled data, which is expensive and time-consuming to obtain, especially for 3D medical images like CT angiography for aortic dissection diagnosis.
Method: Proposes a semi-supervised learning method for multi-output CNN models using data augmentation techniques (rotations and flipping) that doesn't rely on probabilistic model assumptions, making it universally applicable to architectures with separate segmentation outputs.
Result: The method enables training with both labeled and unlabeled data, addressing the challenge of limited labeled data for 3D medical image segmentation tasks like type B aortic dissection diagnosis on the ImageTBDA dataset.
Conclusion: The proposed semi-supervised approach provides a universal solution for training multi-output CNN segmentation models with limited labeled data, particularly valuable for complex 3D medical imaging tasks where labeling is costly.
Abstract: Convolutional neural networks (CNNs) for multi-class segmentation of medical images are widely used today, especially models with multiple outputs that predict each segmentation class (region) separately, without relying on a probabilistic formulation over regions. These models allow for more precise segmentation by tailoring the network's components to each class: they share a common encoder but branch out at the output layers, leading to improved accuracy. Such methods are used to diagnose type B aortic dissection (TBAD), which requires accurate segmentation of aortic structures, based on the ImageTBDA dataset of 100 3D computed tomography angiography (CTA) images. The images are annotated with three key classes: the true lumen (TL), false lumen (FL), and false lumen thrombus (FLT) of the aorta, whose delineation is critical for diagnosis and treatment decisions. In the dataset, 68 examples have a false lumen, while the remaining 32 do not, creating additional complexity for pathology detection. However, implementing these CNN methods requires a large amount of high-quality labeled data, and obtaining accurate labels for the regions of interest can be expensive and time-consuming, particularly for 3D data. Semi-supervised learning methods allow models to be trained on both labeled and unlabeled data, a promising approach for overcoming the labeling challenge; however, such methods are not well understood for models with multiple outputs. This paper presents a semi-supervised learning method for models with multiple outputs. The method is based on additional rotations and flipping and does not assume a probabilistic nature of the model's responses, making it a universal approach, which is especially important for architectures with separate segmentation outputs.
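A minimal sketch of the kind of augmentation-consistency objective described, assuming the model returns one tensor per output head; the flip-equivariance form is illustrative and the paper's exact loss may differ.

```python
import torch

def consistency_loss(model, x_unlabeled):
    """Unlabeled consistency for multi-output segmentation: predictions on a
    flipped volume should equal the flipped predictions on the original volume.
    No probabilistic assumption on the outputs is needed."""
    dims = (-1,)                                               # flip one spatial axis
    with torch.no_grad():
        targets = [p.flip(dims) for p in model(x_unlabeled)]   # one tensor per head
    preds = model(x_unlabeled.flip(dims))
    return sum(((p - t) ** 2).mean() for p, t in zip(preds, targets)) / len(preds)
```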
[147] Self-Supervised Weighted Image Guided Quantitative MRI Super-Resolution
Alireza Samadifardheris, Dirk H. J. Poot, Florian Wiesinger, Stefan Klein, Juan A. Hernandez-Tamames
Main category: cs.CV
TL;DR: Physics-informed self-supervised framework for qMRI super-resolution using routine HR weighted MRI as guidance, eliminating need for HR qMRI ground truth during training.
Details
Motivation: High-resolution qMRI relaxometry provides objective tissue characterization but is clinically underutilized due to lengthy acquisition times. There's a need for faster qMRI methods that can leverage existing clinical scans.
Method: Bayesian maximum a posteriori inference framework minimizing two discrepancies: (1) between synthesized HR images from super-resolved qMRI maps and acquired wMRI guides via forward signal models, and (2) between acquired LR qMRI and downsampled predictions. Uses a deep neural network trained on synthetic data generated by degrading HR qMRI via k-space truncation.
Result: Models trained on synthetic data can produce super-resolved maps from 1-minute acquisition with quality comparable to 5-minute reference scan. T1-weighted images enhance T1 maps, T2-weighted images improve T2 maps, and combined guidance optimally enhances all parameters. Framework shows cross-qMRI sequence generalizability.
Conclusion: By decoupling training from HR qMRI requirement, the framework enables fast qMRI acquisitions enhanced via routine clinical images, offering a practical pathway for integrating quantitative relaxometry into clinical workflows with acceptable additional scan time.
Abstract: High-resolution (HR) quantitative MRI (qMRI) relaxometry provides objective tissue characterization but remains clinically underutilized due to lengthy acquisition times. We propose a physics-informed, self-supervised framework for qMRI super-resolution that uses routinely acquired HR weighted MRI (wMRI) scans as guidance, thus, removing the necessity for HR qMRI ground truth during training. We formulate super-resolution as Bayesian maximum a posteriori inference, minimizing two discrepancies: (1) between HR images synthesized from super-resolved qMRI maps and acquired wMRI guides via forward signal models, and (2) between acquired LR qMRI and downsampled predictions. This physics-informed objective allows the models to learn from clinical wMRI without HR qMRI supervision. To validate the concept, we generate training data by synthesizing wMRI guides from HR qMRI using signal equations, then degrading qMRI resolution via k-space truncation. A deep neural network learns the super-resolution mapping. Ablation experiments demonstrate that T1-weighted images primarily enhance T1 maps, T2-weighted images improve T2 maps, and combined guidance optimally enhances all parameters simultaneously. Validation on independently acquired in-vivo data from a different qMRI sequence confirms cross-qMRI sequence generalizability. Models trained on synthetic data can produce super-resolved maps from a 1-minute acquisition with quality comparable to a 5-minute reference scan, leveraging the scanner-independent nature of relaxometry parameters. By decoupling training from HR qMRI requirement, our framework enables fast qMRI acquisitions enhanced via routine clinical images, offering a practical pathway for integrating quantitative relaxometry into clinical workflows with acceptable additional scan time.
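The two-term objective can be sketched as below; `signal_model` and `downsample` stand in for the paper's forward signal equations and k-space truncation operator, and the unweighted MSE combination is an assumption.

```python
import torch

def qmri_sr_loss(sr_maps, wmri_guides, lr_maps, signal_model, downsample):
    """Self-supervised super-resolution objective with two discrepancies:
    (1) weighted images synthesized from the super-resolved maps vs. the HR wMRI guides;
    (2) downsampled super-resolved maps vs. the acquired LR qMRI."""
    synth_wmri = signal_model(sr_maps)        # e.g. spin-echo style: S = PD * exp(-TE/T2)
    guide_term = ((synth_wmri - wmri_guides) ** 2).mean()
    cycle_term = ((downsample(sr_maps) - lr_maps) ** 2).mean()
    return guide_term + cycle_term
```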
[148] StereoMV2D: A Sparse Temporal Stereo-Enhanced Framework for Robust Multi-View 3D Object Detection
Di Wu, Feng Yang, Wenhui Zhao, Jinwen Yu, Pan Liao, Benlian Xu, Dingwen Zhang
Main category: cs.CV
TL;DR: StereoMV2D improves multi-view 3D object detection by integrating temporal stereo modeling with 2D detection-guided queries, enhancing depth perception through cross-temporal disparities while maintaining computational efficiency.
Details
Motivation: Existing sparse query-based 3D detectors using 2D detection priors (like MV2D) still suffer from depth ambiguity in single-frame 2D detections, limiting 3D query generation accuracy.
Method: Proposes StereoMV2D framework that integrates temporal stereo modeling into 2D detection-guided multi-view 3D detection. Uses cross-temporal disparities of objects across adjacent frames to enhance depth perception, performs computations efficiently within 2D RoIs, and includes dynamic confidence gating to evaluate reliability of temporal stereo cues based on inter-frame matching matrix and appearance consistency.
Result: Extensive experiments on nuScenes and Argoverse 2 datasets show superior detection performance without significant computational overhead.
Conclusion: StereoMV2D effectively addresses depth ambiguity in 2D-guided 3D detection through temporal stereo modeling, achieving better accuracy while maintaining efficiency, making it suitable for autonomous driving applications.
Abstract: Multi-view 3D object detection is a fundamental task in autonomous driving perception, where achieving a balance between detection accuracy and computational efficiency remains crucial. Sparse query-based 3D detectors efficiently aggregate object-relevant features from multi-view images through a set of learnable queries, offering a concise and end-to-end detection paradigm. Building on this foundation, MV2D leverages 2D detection results to provide high-quality object priors for query initialization, enabling higher precision and recall. However, the inherent depth ambiguity in single-frame 2D detections still limits the accuracy of 3D query generation. To address this issue, we propose StereoMV2D, a unified framework that integrates temporal stereo modeling into the 2D detection-guided multi-view 3D detector. By exploiting cross-temporal disparities of the same object across adjacent frames, StereoMV2D enhances depth perception and refines the query priors, while performing all computations efficiently within 2D regions of interest (RoIs). Furthermore, a dynamic confidence gating mechanism adaptively evaluates the reliability of temporal stereo cues by learning statistical patterns derived from the inter-frame matching matrix together with appearance consistency, ensuring robust detection under appearance variation and occlusion. Extensive experiments on the nuScenes and Argoverse 2 datasets demonstrate that StereoMV2D achieves superior detection performance without incurring significant computational overhead. Code will be available at https://github.com/Uddd821/StereoMV2D.
[149] PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology
Fengchun Liu, Songhan Jiang, Linghan Cai, Ziyue Wang, Yongbing Zhang
Main category: cs.CV
TL;DR: PathFLIP is a novel pathology vision-language framework that decomposes slide captions into region-level subcaptions for fine-grained WSI interpretation, outperforming existing methods with less training data.
Details
Motivation: Current VLMs struggle with gigapixel WSIs due to spatial heterogeneity and poor fine-grained alignment between text and thousands of image patches, limiting performance on downstream clinical tasks.
Method: Decomposes slide-level captions into region-level subcaptions, generates text-conditioned region embeddings for precise visual-language grounding, and leverages LLMs for clinical instruction following.
Result: Outperforms existing large-scale pathological VLMs on four benchmarks while requiring significantly less training data, demonstrating versatile capabilities across multiple paradigms.
Conclusion: PathFLIP enables fine-grained, instruction-aware WSI interpretation for clinical practice by addressing the scale and heterogeneity challenges of computational pathology.
Abstract: While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.
[150] Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs
Zhaolin Cai, Huiyu Duan, Zitong Xu, Fan Li, Zhi Liu, Jing Liu, Wei Shen, Xiongkuo Min, Guangtao Zhai
Main category: cs.CV
TL;DR: GRASP-HO reformulates HOI detection from closed-set classification to open-vocabulary generation using frozen MLLMs with lightweight cognitive steering, achieving SOTA closed-set performance and strong zero-shot generalization.
Details
Motivation: Existing HOI detection methods operate under closed-world assumptions with predefined verb sets, struggling with unseen/ambiguous interactions. MLLMs have rich world knowledge but are computationally prohibitive to fine-tune, creating a gap between vision and cognitive reasoning.
Method: Proposes GRASP-HO framework that extracts hybrid interaction representations and uses a lightweight Cognitive Steering Conduit (CSC) module to inject visual evidence into frozen MLLMs. Introduces hybrid guidance strategy combining language modeling loss and auxiliary classification loss to address supervision mismatch.
Result: Achieves state-of-the-art closed-set performance and strong zero-shot generalization, demonstrating a unified paradigm that bridges discriminative perception and generative reasoning for open-world HOI detection.
Conclusion: GRASP-HO successfully bridges the gap between vision and cognitive reasoning for HOI detection, enabling open-vocabulary understanding without expensive MLLM fine-tuning, while maintaining both discriminative grounding and generative flexibility.
Abstract: Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight, learnable cognitive steering conduit (CSC) module to inject fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
[151] Bitbox: Behavioral Imaging Toolbox for Computational Analysis of Behavior from Videos
Evangelos Sariyanidi, Gokul Nair, Lisa Yankowitz, Casey J. Zampella, Mohan Kashyap Pargi, Aashvi Manakiwala, Maya McNealis, John D. Herrington, Jeffrey Cohn, Robert T. Schultz, Birkan Tunc
Main category: cs.CV
TL;DR: Bitbox is an open-source toolkit that makes AI-based computational behavioral measurement from video accessible to behavioral scientists and clinical researchers without requiring engineering expertise.
Details
Motivation: Current AI-based behavioral measurement tools are developed for engineering audiences, require specialized software stacks, and create barriers for behavioral/clinical researchers who want to use modern computational methods in hypothesis-driven research.
Method: Bitbox provides a standardized interface for extracting high-level behavioral measurements (facial expression, head movement, body action) from video using multiple face, head, and body processors. It's designed with principles of reproducibility, modularity, and interpretability.
Result: Bitbox bridges the translational gap by giving behavioral researchers access to robust behavioral metrics without engineering expertise, while providing computer scientists a practical dissemination mechanism. Core modules have been tested and validated on clinical samples.
Conclusion: Bitbox is expected to accelerate integration of computational behavioral measurement into behavioral, clinical, and mental health research. It’s designed as a community-driven effort that will evolve through contributions from both method developers and domain scientists.
Abstract: Computational measurement of human behavior from video has recently become feasible due to major advances in AI. These advances now enable granular and precise quantification of facial expression, head movement, body action, and other behavioral modalities and are increasingly used in psychology, psychiatry, neuroscience, and mental health research. However, mainstream adoption remains slow. Most existing methods and software are developed for engineering audiences, require specialized software stacks, and fail to provide behavioral measurements at a level directly useful for hypothesis-driven research. As a result, there is a large barrier to entry for researchers who wish to use modern, AI-based tools in their work. We introduce Bitbox, an open-source toolkit designed to remove this barrier and make advanced computational analysis directly usable by behavioral scientists and clinical researchers. Bitbox is guided by principles of reproducibility, modularity, and interpretability. It provides a standardized interface for extracting high-level behavioral measurements from video, leveraging multiple face, head, and body processors. The core modules have been tested and validated on clinical samples and are designed so that new measures can be added with minimal effort. Bitbox is intended to serve both sides of the translational gap. It gives behavioral researchers access to robust, high-level behavioral metrics without requiring engineering expertise, and it provides computer scientists a practical mechanism for disseminating methods to domains where their impact is most needed. We expect that Bitbox will accelerate integration of computational behavioral measurement into behavioral, clinical, and mental health research. Bitbox has been designed from the beginning as a community-driven effort that will evolve through contributions from both method developers and domain scientists.
[152] FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
Cheng Peng, Zhuo Su, Liao Wang, Chen Guo, Zhaohu Li, Chengjiang Long, Zheng Lv, Jingxiang Sun, Chenyangguang Zhang, Yebin Liu
Main category: cs.CV
TL;DR: FlexAvatar is a flexible 3D head avatar reconstruction model that creates high-fidelity avatars with detailed dynamic deformations from single or sparse images, without needing camera poses or expression labels.
Details
Motivation: The paper aims to solve the challenge of creating animatable 3D head avatars with detailed dynamic deformations from limited input (single or sparse images) while eliminating requirements for camera poses and expression labels, which are often impractical to obtain.
Method: Uses a transformer-based reconstruction model with structured head query tokens as canonical anchors to aggregate flexible inputs. Introduces a lightweight UNet decoder conditioned on UV-space position maps for real-time detailed expression-dependent deformations. Employs data distribution adjustment to balance rare expressions, and includes a 10-second refinement for identity-specific details.
Result: FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared to previous methods, providing a practical solution for animatable 3D avatar creation with real-time deformation capabilities.
Conclusion: The proposed FlexAvatar offers a flexible, practical approach for high-fidelity 3D head avatar reconstruction from minimal inputs, with detailed dynamic deformations and real-time performance, advancing the field of animatable avatar creation.
Abstract: We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchors to aggregate flexible inputs (input-number-agnostic, camera-pose-free, and expression-free) into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhance identity-specific details in extreme identities without affecting deformation quality. Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
[153] SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses
Shaoyan Zhai, Mohamed Abdel-Aty, Chenzhu Wang, Rodrigo Vena Garcia
Main category: cs.CV
TL;DR: SAVeD is a large-scale video dataset of ADAS vehicle crashes and near-misses collected from social media, with frame-level annotations for safety-critical analysis.
Details
Motivation: Existing datasets lack authentic ADAS vehicle behavior under risk conditions, being limited to simulations or human-driven vehicles. There's a need for real-world data capturing high-risk edge cases like near-misses and system failures.
Method: Curated 2,119 first-person videos from social media showing ADAS vehicle operations. Annotated frame-level collisions, evasive maneuvers, and disengagements. Developed framework with semantic segmentation and monocular depth estimation for real-time TTC computation, and used GEV distribution for extreme risk modeling.
Result: Created comprehensive dataset covering diverse locations, lighting, and weather. Demonstrated utility through TTC computation, risk quantification across roadway types, and established VLLM benchmarks showing significant performance improvements with domain adaptation.
Conclusion: SAVeD addresses critical gap in ADAS safety research by providing authentic real-world risk data, enabling better analysis of perception/decision failures and improving model performance in complex near-miss scenarios.
Abstract: The advancement of safety-critical research in driving behavior in ADAS-equipped vehicles requires real-world datasets that not only include diverse traffic scenarios but also capture high-risk edge cases such as near-miss events and system failures. However, existing datasets are largely limited to either simulated environments or human-driven vehicle data, lacking authentic ADAS (Advanced Driver Assistance System) vehicle behavior under risk conditions. To address this gap, this paper introduces SAVeD, a large-scale video dataset curated from publicly available social media content, explicitly focused on ADAS vehicle-related crashes, near-miss incidents, and disengagements. SAVeD features 2,119 first-person videos, capturing ADAS vehicle operations in diverse locations, lighting conditions, and weather scenarios. The dataset includes video frame-level annotations for collisions, evasive maneuvers, and disengagements, enabling analysis of both perception and decision-making failures. We demonstrate SAVeD's utility through multiple analyses and contributions: (1) We propose a novel framework integrating semantic segmentation and monocular depth estimation to compute real-time Time-to-Collision (TTC) for dynamic objects. (2) We utilize the Generalized Extreme Value (GEV) distribution to model and quantify the extreme risk in crash and near-miss events across different roadway types. (3) We establish benchmarks for state-of-the-art VLLMs (VideoLLaMA2 and InternVL2.5 HiCo R16), showing that SAVeD's detailed annotations significantly enhance model performance through domain adaptation in complex near-miss scenarios.
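Two of the analyses translate into short snippets: TTC from a per-object depth track, and a GEV fit over per-event risk extremes. A sketch assuming `scipy` and toy numbers throughout; the frame interval, the risk proxy (negated minimum TTC), and all values are illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import genextreme

def time_to_collision(depths, dt):
    """TTC = remaining depth / closing speed, from a depth track sampled every dt seconds."""
    closing_speed = np.clip(-(np.diff(depths) / dt), 1e-3, None)  # > 0 when approaching
    return depths[1:] / closing_speed

dt = 0.5
depths = np.array([20.0, 18.5, 17.1, 15.8, 14.6])   # meters to the leading object
ttc = time_to_collision(depths, dt)                 # seconds, per interval

# Extreme-risk modeling: fit a GEV to per-event extremes of a risk proxy,
# here the negated minimum TTC of each near-miss event.
event_extremes = -np.array([0.9, 1.4, 0.7, 2.1, 1.1, 0.8, 1.6, 1.2])
shape, loc, scale = genextreme.fit(event_extremes)
```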
[154] MambaMIL+: Modeling Long-Term Contextual Patterns for Gigapixel Whole Slide Image
Qian Zeng, Yihui Wang, Shu Yang, Yingxue Xu, Fengtao Zhou, Jiabo Ma, Dejia Cai, Zhengyu Zhang, Lijuan Qu, Yu Wang, Li Liang, Hao Chen
Main category: cs.CV
TL;DR: MambaMIL+ is a new multiple instance learning framework for whole-slide image analysis that integrates spatial context modeling while maintaining long-range dependencies without memory decay, outperforming existing methods on 20 benchmarks.
Details
Motivation: Whole-slide images present challenges due to their gigapixel resolution and lack of fine-grained annotations. While MIL treats WSIs as bags of patches, effectively modeling ultra-long sequences with rich spatial context remains difficult. Mamba offers linear scaling for long sequences but suffers from limited spatial context modeling and memory decay in WSI analysis.
Method: MambaMIL+ introduces three key components: 1) overlapping scanning to restructure patch sequences and embed spatial continuity, 2) selective stripe position encoder (S2PE) to encode positional information while mitigating fixed scanning order biases, and 3) contextual token selection (CTS) mechanism that uses supervisory knowledge to dynamically enlarge contextual memory for stable long-range modeling.
Result: Extensive experiments on 20 benchmarks across diagnostic classification, molecular prediction, and survival analysis demonstrate that MambaMIL+ consistently achieves state-of-the-art performance under three different feature extractors (ResNet-50, PLIP, and CONCH).
Conclusion: MambaMIL+ effectively addresses the limitations of existing MIL approaches for WSI analysis by integrating spatial context modeling while maintaining long-range dependencies without memory forgetting, proving to be an effective and robust framework for large-scale computational pathology tasks.
Abstract: Whole-slide images (WSIs) are an important data modality in computational pathology, yet their gigapixel resolution and lack of fine-grained annotations challenge conventional deep learning models. Multiple instance learning (MIL) offers a solution by treating each WSI as a bag of patch-level instances, but effectively modeling ultra-long sequences with rich spatial context remains difficult. Recently, Mamba has emerged as a promising alternative for long sequence learning, scaling linearly to thousands of tokens. However, despite its efficiency, it still suffers from limited spatial context modeling and memory decay, constraining its effectiveness for WSI analysis. To address these limitations, we propose MambaMIL+, a new MIL framework that explicitly integrates spatial context while maintaining long-range dependency modeling without memory forgetting. Specifically, MambaMIL+ introduces 1) overlapping scanning, which restructures the patch sequence to embed spatial continuity and instance correlations; 2) a selective stripe position encoder (S2PE) that encodes positional information while mitigating the biases of fixed scanning orders; and 3) a contextual token selection (CTS) mechanism, which leverages supervisory knowledge to dynamically enlarge the contextual memory for stable long-range modeling. Extensive experiments on 20 benchmarks across diagnostic classification, molecular prediction, and survival analysis demonstrate that MambaMIL+ consistently achieves state-of-the-art performance under three feature extractors (ResNet-50, PLIP, and CONCH), highlighting its effectiveness and robustness for large-scale computational pathology.
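Of the three components, overlapping scanning is the easiest to picture: re-slice the flattened patch-token sequence into windows whose stride is smaller than their length, so neighboring instances co-occur in more than one scanned segment. A minimal sketch with illustrative `window` and `stride` values:

```python
import torch

def overlapping_scan(tokens, window=512, stride=384):
    """Split a (L, D) patch-token sequence into overlapping segments (window > stride)."""
    L = tokens.shape[0]
    starts = list(range(0, max(L - window, 0) + 1, stride))
    segments = [tokens[s:s + window] for s in starts]
    if starts[-1] + window < L:              # keep the tail of the sequence
        segments.append(tokens[L - window:])
    return segments

segments = overlapping_scan(torch.randn(10000, 256))   # e.g. 10k patch embeddings
```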
[155] AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection
Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray
Main category: cs.CV
TL;DR: The paper introduces Diff-Gen dataset and AdaptPrompt framework for generalizable deepfake detection using CLIP, achieving SOTA performance across diverse generators with strong cross-domain generalization.
Details
Motivation: The increasing realism of synthetic media makes deepfake detection challenging, especially due to poor generalization of detectors trained on narrow generator classes when faced with unseen models.
Method: 1) Diff-Gen: Large-scale benchmark dataset with 100k diffusion-generated fakes capturing broad spectral artifacts. 2) AdaptPrompt: Parameter-efficient transfer learning framework that learns task-specific textual prompts and visual adapters while keeping CLIP backbone frozen. 3) Layer ablation showing pruning final transformer block of vision encoder enhances retention of high-frequency generative artifacts.
Result: Achieves state-of-the-art performance across 25 challenging test sets covering GANs, diffusion models, and commercial tools. Demonstrates strong cross-domain generalization, few-shot learning (using as few as 320 images), and source attribution capabilities for generator identification.
Conclusion: The proposed approach effectively addresses generalization challenges in deepfake detection by leveraging CLIP’s capabilities with specialized datasets and efficient adaptation methods, establishing new benchmarks for detecting synthetic content across diverse generative techniques.
Abstract: Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework’s versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
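A minimal sketch of the parameter-efficient recipe: only a bottleneck adapter and learnable class prompts train on top of frozen CLIP image features. The module shapes, the residual-adapter form, and the logit scale are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    """Residual bottleneck adapter over frozen image features."""
    def __init__(self, dim=512, hidden=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, hidden), nn.Linear(hidden, dim)

    def forward(self, feats):
        return feats + self.up(torch.relu(self.down(feats)))

class AdaptPromptHead(nn.Module):
    """Learnable class prompts + adapter; the CLIP backbone itself stays frozen."""
    def __init__(self, dim=512, num_classes=2):
        super().__init__()
        self.adapter = VisualAdapter(dim)
        self.class_prompts = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def forward(self, image_feats):                  # features from the frozen encoder
        v = F.normalize(self.adapter(image_feats), dim=-1)
        t = F.normalize(self.class_prompts, dim=-1)
        return 100.0 * v @ t.t()                     # logits over {real, fake}
```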
[156] LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence
Yohanes Yudhi Adikusuma, Qixing Huang, Ying He
Main category: cs.CV
TL;DR: LiteGE is a lightweight method for computing geodesic distances on 3D surfaces using compact PCA-based descriptors from UDF samples, achieving 300× memory/time reduction and 1000× speedup for shape matching.
Details
Motivation: Existing learning-based methods for geodesic distance computation rely on large 3D backbones, causing high memory usage and latency that limit practical use in interactive or resource-constrained settings.
Method: Constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDF) samples at informative voxels, eliminating need for high-capacity networks and supporting sparse point clouds (as few as 300 points).
Result: Reduces memory usage and inference time by up to 300× compared to neural approaches; achieves up to 1000× speedup over state-of-the-art mesh-based methods while maintaining comparable accuracy on non-isometric shape pairs.
Conclusion: LiteGE provides an efficient, lightweight solution for geodesic distance computation that enables fast shape matching and works robustly on sparse point clouds, making it suitable for interactive and resource-constrained applications.
Abstract: Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDF) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300× compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000× speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs.
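The descriptor construction can be sketched in a few lines, assuming a nearest-neighbor UDF and, for brevity, PCA over all voxels rather than only the informative subset the paper selects; grid resolution and component count are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

def udf_samples(points, grid_pts):
    """Unsigned distance from each voxel center to the shape's point cloud."""
    return cKDTree(points).query(grid_pts)[0]

grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 8)] * 3), -1).reshape(-1, 3)
shapes = [np.random.rand(300, 3) * 2 - 1 for _ in range(20)]     # sparse point clouds
udfs = np.stack([udf_samples(s, grid) for s in shapes])          # (num_shapes, num_voxels)

descriptors = PCA(n_components=16).fit_transform(udfs)           # compact shape codes
```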
[157] UrbanDIFF: A Denoising Diffusion Model for Spatial Gap Filling of Urban Land Surface Temperature Under Dense Cloud Cover
Arya Chavoshi, Hassan Dashtian, Naveen Sudharsan, Dev Niyogi
Main category: cs.CV
TL;DR: UrbanDIFF: A spatial denoising diffusion model for reconstructing cloud-contaminated urban land surface temperature imagery using urban structure conditioning and pixel-guided refinement.
Details
Motivation: Cloud contamination frequently obscures satellite-derived Land Surface Temperature (LST) observations, limiting continuous surface urban heat island (SUHI) monitoring. Existing methods rely on multitemporal or multisensor data that may be unavailable under persistent cloud cover, while traditional spatial methods degrade with large gaps.
Method: UrbanDIFF is a purely spatial denoising diffusion model that reconstructs cloud-contaminated urban LST imagery. It's conditioned on static urban structure information (built-up surface data and digital elevation model) and enforces consistency with revealed cloud-free pixels through supervised pixel-guided refinement during inference.
Result: UrbanDIFF consistently outperforms interpolation baselines, especially under dense cloud occlusion. At 85% cloud coverage, it achieves SSIM of 0.89, RMSE of 1.2 K, and R2 of 0.84, with slower performance degradation as cloud density increases compared to other methods.
Conclusion: Denoising diffusion models offer robust spatial reconstruction for cloud-contaminated urban LST data, enabling more reliable SUHI monitoring even under high cloud coverage conditions where traditional methods fail.
Abstract: Satellite-derived Land Surface Temperature (LST) products are central to surface urban heat island (SUHI) monitoring due to their consistent grid-based coverage over large metropolitan regions. However, cloud contamination frequently obscures LST observations, limiting their usability for continuous SUHI analysis. Most existing LST reconstruction methods rely on multitemporal information or multisensor data fusion, requiring auxiliary observations that may be unavailable or unreliable under persistent cloud cover. Purely spatial gap-filling approaches offer an alternative, but traditional statistical methods degrade under large or spatially contiguous gaps, while many deep learning-based spatial models deteriorate rapidly with increasing missingness. Recent advances in denoising diffusion-based image inpainting models have demonstrated improved robustness under high missingness, motivating their adoption for spatial LST reconstruction. In this work, we introduce UrbanDIFF, a purely spatial denoising diffusion model for reconstructing cloud-contaminated urban LST imagery. The model is conditioned on static urban structure information, including built-up surface data and a digital elevation model, and enforces strict consistency with revealed cloud-free pixels through a supervised pixel-guided refinement step during inference. UrbanDIFF is trained and evaluated using NASA MODIS Terra LST data from seven major United States metropolitan areas spanning 2002 to 2025. Experiments using synthetic cloud masks with 20 to 85 percent coverage show that UrbanDIFF consistently outperforms an interpolation baseline, particularly under dense cloud occlusion, achieving SSIM of 0.89, RMSE of 1.2 K, and R2 of 0.84 at 85 percent cloud coverage, while exhibiting slower performance degradation as cloud density increases.
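The pixel-guided refinement resembles known-pixel replacement in diffusion inpainting (as in RePaint); below is a sketch of one guided reverse step, with `denoise` and `add_noise` as placeholders for the model's reverse step and forward noising schedule. This is an interpretation of the described mechanism, not the authors' code.

```python
import torch

def guided_denoise_step(x_t, denoise, known_lst, cloud_mask, t, add_noise):
    """One reverse-diffusion step that stays consistent with cloud-free pixels:
    predict x_{t-1}, then overwrite revealed pixels with a re-noised copy of
    the observed LST at the matching noise level."""
    x_prev = denoise(x_t, t)                 # model's reverse step at time t
    known_t = add_noise(known_lst, t - 1)    # observed pixels at noise level t-1
    keep = (~cloud_mask).float()             # 1 where the pixel is cloud-free
    return keep * known_t + (1.0 - keep) * x_prev
```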
[158] Visually Prompted Benchmarks Are Surprisingly Fragile
Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, XuDong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, Angjoo Kanazawa
Main category: cs.CV
TL;DR: Visual prompting benchmarks for VLMs are surprisingly fragile to minor details like marker color/size, which can completely change model rankings and even make weaker models outperform stronger ones.
Details
Motivation: Current VLM evaluation benchmarks using visual prompting are unstable: minor changes in visual markers (color, size) or technical details (JPEG compression) significantly affect model performance and leaderboard rankings, undermining reliable evaluation.
Method: Evaluated 9 open- and closed-source VLMs on two visually prompted tasks, systematically testing effects of visual marker design, dataset size, and low-level inference choices. Created VPBench with 16 visual marker variants to provide more stable evaluation.
Result: Visual prompting benchmarks are highly sensitive to seemingly irrelevant details: changing marker color from red to blue changes rankings; increasing marker size makes InternVL3-8B perform like Gemini 2.5 Pro; JPEG compression levels also affect rankings. These effects are much larger than in conventional semantic VLM evaluations.
Conclusion: Current visually prompted benchmarks are unstable and unreliable for VLM evaluation. VPBench provides a more robust benchmark with multiple visual marker variants to mitigate these issues and enable fairer model comparisons.
Abstract: A key challenge in evaluating VLMs is testing models’ ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at https://lisadunlap.github.io/vpbench/.
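Reproducing the fragility probe is mostly bookkeeping: sweep marker color, marker size, and JPEG quality for the same underlying question and re-score each variant. A sketch with Pillow; the swept values are illustrative, not the benchmark's exact grid.

```python
from io import BytesIO
from PIL import Image, ImageDraw

def mark_and_compress(img, xy, color="red", radius=6, jpeg_quality=95):
    """Draw a visual-prompt marker at (x, y), then round-trip through JPEG."""
    out = img.copy()
    x, y = xy
    ImageDraw.Draw(out).ellipse(
        [x - radius, y - radius, x + radius, y + radius], outline=color, width=3)
    buf = BytesIO()
    out.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf)

base = Image.new("RGB", (256, 256), "gray")          # stand-in for a benchmark image
variants = [mark_and_compress(base, (128, 128), c, r, q)
            for c in ("red", "blue") for r in (6, 12) for q in (75, 95)]
```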
[159] Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras
Ami Pandat, Punna Rajasekhar, G. Aravamuthan, Gopika Vinod, Rohit Shukla
Main category: cs.CV
TL;DR: A hybrid camera distortion modeling framework combining extended conventional models with neural network residual correction enables accurate 3D object localization up to 5km, improving long-range photogrammetry for CCTV applications.
Details
Motivation: Existing stereo-camera 3D localization methods are limited to a few hundred meters due to inadequate distortion models for camera lens non-linearities, especially problematic for long-distance applications like 3D mapping and object localization.
Method: Hybrid approach: first extends conventional distortion models with higher-order terms, then enhances them using neural network-based residual correction to model complex lens distortion functions that pure neural networks fail to learn directly.
Result: Framework substantially improves long-range localization performance, enabling 3D object position estimation up to 5km distances. Estimated coordinates are transformed to GIS coordinates and visualized on maps.
Conclusion: The hybrid approach provides a practical solution for calibrating CCTV cameras for long-range photogrammetry applications, demonstrating robustness and effectiveness through experimental validation.
Abstract: Accurate camera models are essential for photogrammetry applications such as 3D mapping and object localization, particularly at long distances. Various stereo-camera-based 3D localization methods are available but are limited to a range of a few hundred meters, mainly due to the limitations of the distortion models assumed for the non-linearities present in the camera lens. This paper presents a framework for modeling a suitable distortion model that can be used for localizing objects at longer distances. It is well known that neural networks can be a better alternative for modeling a highly complex non-linear lens distortion function; on the contrary, it is observed that a direct application of neural networks to distortion models fails to converge when estimating the camera parameters. To resolve this, a hybrid approach is presented in this paper where the conventional distortion models are first extended to incorporate higher-order terms and then enhanced using a neural network-based residual correction model. This hybrid approach substantially improves long-range localization performance and is capable of estimating the 3D position of objects at distances up to 5 kilometres. The estimated 3D coordinates are transformed to GIS coordinates and plotted on a GIS map for visualization. Experimental validation demonstrates the robustness and effectiveness of the proposed framework, offering a practical solution for calibrating CCTV cameras for long-range photogrammetry applications.
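The hybrid model can be sketched as a higher-order polynomial radial term plus a small learned residual; the polynomial order, MLP shape, and radial-only form are illustrative assumptions. The polynomial absorbs the bulk of the lens non-linearity so that the network only has to fit a small, well-conditioned remainder, which is what lets optimization converge where a pure network fails.

```python
import torch
import torch.nn as nn

class HybridDistortion(nn.Module):
    """Extended radial distortion plus a neural residual correction (sketch)."""
    def __init__(self, order=4):
        super().__init__()
        self.k = nn.Parameter(torch.zeros(order))     # k1..k_order radial coefficients
        self.residual = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))

    def forward(self, xy):                            # normalized image coords, (N, 2)
        r2 = (xy ** 2).sum(-1, keepdim=True)
        radial = 1 + sum(self.k[i] * r2 ** (i + 1) for i in range(len(self.k)))
        return xy * radial + self.residual(xy)        # distorted coordinates
```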
[160] UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations
Yuan Ren, Guile Wu, Runhao Li, Zheyuan Yang, Yibo Liu, Xingxin Chen, Tongtong Cao, Bingbing Liu
Main category: cs.CV
TL;DR: UniGaussian: A unified 3D Gaussian representation method for urban scene reconstruction that supports both pinhole and fisheye cameras in autonomous driving simulations.
Details
Motivation: Existing photorealistic reconstruction methods focus mainly on pinhole cameras and neglect fisheye cameras, which are important for autonomous driving. There's a need for effective fisheye camera simulation in driving scenes.
Method: Proposes a differentiable rendering method that distorts 3D Gaussians using affine transformations tailored to fisheye camera models, and a framework that learns unified Gaussian representations from multiple camera models by applying camera-specific affine transformations and regularizing shared Gaussians with multi-modal supervision.
Result: Achieves superior rendering quality and fast rendering speed for driving scene simulation, supporting multiple sensors (pinhole and fisheye cameras) and modalities (depth, semantic, normal, and LiDAR point clouds).
Conclusion: UniGaussian successfully addresses the compatibility issue of 3D Gaussian splatting with fisheye cameras, enables unified representation learning from multiple camera models, and achieves holistic driving scene understanding with real-time rendering capabilities.
Abstract: Urban scene reconstruction is crucial for real-world autonomous driving simulators. Although existing methods have achieved photorealistic reconstruction, they mostly focus on pinhole cameras and neglect fisheye cameras. In fact, how to effectively simulate fisheye cameras in driving scenes remains an unsolved problem. In this work, we propose UniGaussian, a novel approach that learns a unified 3D Gaussian representation from multiple camera models for urban scene reconstruction in autonomous driving. Our contributions are two-fold. First, we propose a new differentiable rendering method that distorts 3D Gaussians using a series of affine transformations tailored to fisheye camera models. This addresses the compatibility issue of 3D Gaussian splatting with fisheye cameras, which is hindered by light ray distortion caused by lenses or mirrors. Besides, our method maintains real-time rendering while ensuring differentiability. Second, built on the differentiable rendering method, we design a new framework that learns a unified Gaussian representation from multiple camera models. By applying affine transformations to adapt different camera models and regularizing the shared Gaussians with supervision from different modalities, our framework learns a unified 3D Gaussian representation with input data from multiple sources and achieves holistic driving scene understanding. As a result, our approach models multiple sensors (pinhole and fisheye cameras) and modalities (depth, semantic, normal, and LiDAR point clouds). Our experiments show that our method achieves superior rendering quality and fast rendering speed for driving scene simulation.
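For intuition, an equidistant fisheye maps a 3D point to the image via r = f·θ, a nonlinear map that standard 3DGS rasterization does not handle; the paper approximates such projections per Gaussian with affine transformations so splatting stays differentiable. A sketch of the center projection only, assuming the equidistant model:

```python
import torch

def equidistant_fisheye_project(mu_cam, f):
    """Project Gaussian centers (N, 3) in camera coordinates with r = f * theta."""
    x, y, z = mu_cam.unbind(-1)
    theta = torch.atan2(torch.sqrt(x ** 2 + y ** 2), z)   # angle from the optical axis
    phi = torch.atan2(y, x)                               # azimuth
    r = f * theta
    return torch.stack([r * torch.cos(phi), r * torch.sin(phi)], dim=-1)
```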
[161] Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald
Main category: cs.CV
TL;DR: Chorus is a multi-teacher pretraining framework that learns a holistic 3D Gaussian Splatting scene encoder by distilling complementary signals from 2D foundation models, enabling rich general-purpose feature learning from 3DGS primitives.
Details
Motivation: While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. The paper aims to address this gap by learning comprehensive scene representations from 3DGS data.
Method: Chorus employs a multi-teacher pretraining framework with a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware 2D foundation models. It distills complementary signals to create a shared embedding space capturing signals from high-level semantics to fine-grained structure.
Result: The method shows strong performance on open-vocabulary semantic/instance segmentation, linear/decoder probing, and data-efficient supervision. A point cloud variant using only Gaussians’ centers, colors, and normals outperforms point cloud baselines while using 39.9 times fewer training scenes. The paper also proposes render-and-distill adaptation for out-of-domain finetuning.
Conclusion: Chorus successfully learns rich general-purpose features from 3DGS primitives through multi-teacher distillation, demonstrating strong transfer capabilities and data efficiency. The framework enables comprehensive scene understanding and adaptation across domains.
Abstract: While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians’ centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
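The shared-encoder-plus-projectors layout is easy to picture in code: one trunk embeds each Gaussian primitive, and a small head per teacher maps that embedding into the teacher's feature space for distillation. The sketch below is a minimal stand-in; the MLP trunk, projector widths, and cosine distillation loss are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherStudent(nn.Module):
    """One shared 3D encoder; one lightweight projector per 2D teacher."""
    def __init__(self, in_dim, shared_dim, teacher_dims):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.GELU(),
                                     nn.Linear(shared_dim, shared_dim))
        self.projectors = nn.ModuleDict({
            name: nn.Linear(shared_dim, d) for name, d in teacher_dims.items()
        })

    def forward(self, gaussians):
        z = self.encoder(gaussians)  # shared embedding per primitive
        return {name: proj(z) for name, proj in self.projectors.items()}

def distill_loss(student_out, teacher_feats):
    """Sum of (1 - cosine similarity) against each teacher's lifted features."""
    loss = 0.0
    for name, target in teacher_feats.items():
        loss = loss + (1 - F.cosine_similarity(student_out[name], target, dim=-1)).mean()
    return loss

# Toy usage: 1024 Gaussian primitives, 14-dim attributes, three teachers.
model = MultiTeacherStudent(14, 256, {"language": 512, "generalist": 768, "object": 384})
g = torch.randn(1024, 14)
teachers = {"language": torch.randn(1024, 512),
            "generalist": torch.randn(1024, 768),
            "object": torch.randn(1024, 384)}
print(distill_loss(model(g), teachers).item())
```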
[162] ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges
Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar
Main category: cs.CV
TL;DR: ReX-MLE is a new benchmark for evaluating autonomous coding agents on complex medical imaging tasks, revealing severe performance gaps compared to human experts.
Details
Motivation: Current autonomous coding agents built on LLMs perform well on general software tasks but fail on complex, domain-specific scientific problems like medical imaging, which requires specialized knowledge, high-dimensional data handling, and end-to-end workflow management not captured by existing benchmarks.
Method: The authors introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions across diverse modalities and task types. It evaluates full end-to-end workflows where agents must independently manage data preprocessing, model training, and submission under realistic compute and time constraints.
Result: Evaluation of state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude) shows severe performance gaps: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations.
Conclusion: ReX-MLE exposes critical bottlenecks in current autonomous AI systems for domain-specific scientific problems and provides a foundation for developing domain-aware autonomous AI systems that can handle complex real-world challenges like medical imaging.
Abstract: Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
[163] Simulation-Driven Deep Learning Framework for Raman Spectral Denoising Under Fluorescence-Dominant Conditions
Mengkun Chen, Sanidhya D. Tripathi, James W. Tunnell
Main category: cs.CV
TL;DR: A simulation-driven deep learning framework that combines statistical noise modeling with cascaded neural networks to denoise Raman spectra by suppressing detector noise and fluorescence background in biomedical tissue analysis.
Details
Motivation: Raman spectroscopy is valuable for biomedical diagnostics but faces challenges from weak Raman scattering and strong fluorescence background in biological tissues, which degrade signal quality and limit practical applications.
Method: Developed a simulation-driven denoising framework that: 1) comprehensively models major noise sources, 2) generates biologically realistic Raman spectra using this model, 3) trains a cascaded deep neural network to jointly suppress stochastic detector noise and fluorescence baseline interference, and 4) validates using simulated human skin spectra derived from real experimental data.
Result: The approach demonstrates potential for improving spectral quality through physics-informed learning, enabling faster and more accurate Raman-based tissue analysis by effectively suppressing noise and fluorescence interference.
Conclusion: Physics-informed deep learning combining statistical noise modeling with neural networks offers a promising solution to enhance Raman spectroscopy for biomedical tissue analysis by addressing key challenges of weak signals and fluorescence background.
Abstract: Raman spectroscopy enables non-destructive, label-free molecular analysis with high specificity, making it a powerful tool for biomedical diagnostics. However, its application to biological tissues is challenged by inherently weak Raman scattering and strong fluorescence background, which significantly degrade signal quality. In this study, we present a simulation-driven denoising framework that combines a statistically grounded noise model with deep learning to enhance Raman spectra acquired under fluorescence-dominated conditions. We comprehensively modeled major noise sources. Based on this model, we generated biologically realistic Raman spectra and used them to train a cascaded deep neural network designed to jointly suppress stochastic detector noise and fluorescence baseline interference. To evaluate the performance of our approach, we simulated human skin spectra derived from real experimental data as a validation case study. Our results demonstrate the potential of physics-informed learning to improve spectral quality and enable faster, more accurate Raman-based tissue analysis.
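The simulation-driven part of such a pipeline is the training-pair generator: a clean Raman signal, a dominant fluorescence baseline, and detector noise, with the network trained to recover the Raman component. A minimal sketch follows; the Lorentzian peak shapes, polynomial baseline, and Poisson-plus-Gaussian noise model are our illustrative choices, not the paper's exact statistical model.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_raman(n_points=1024, n_peaks=8):
    """Generate one noisy spectrum plus its clean Raman-only target."""
    x = np.linspace(0, 1, n_points)
    # Clean Raman signal: a few narrow Lorentzian peaks.
    raman = np.zeros_like(x)
    for _ in range(n_peaks):
        c, w, a = rng.uniform(0.05, 0.95), rng.uniform(0.002, 0.01), rng.uniform(0.2, 1.0)
        raman += a * w**2 / ((x - c) ** 2 + w**2)
    # Fluorescence: broad, smooth, and much stronger than the Raman signal.
    coeffs = rng.uniform(-1, 1, 4)
    baseline = 20.0 * np.abs(np.polyval(coeffs, x)) + 30.0
    clean_total = raman + baseline
    # Detector noise: Poisson shot noise plus Gaussian read noise.
    counts = rng.poisson(clean_total * 50.0) / 50.0
    noisy = counts + rng.normal(0, 0.05, n_points)
    return noisy.astype(np.float32), raman.astype(np.float32)

noisy, target = synth_raman()
print(noisy.shape, target.shape)  # training pair: network input and Raman-only label
```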
[164] InSPECT: Invariant Spectral Features Preservation of Diffusion Models
Baohua Yan, Qingyuan Liu, Jennifer Kava, Xuan Di
Main category: cs.CV
TL;DR: InSPECT is a novel diffusion model that preserves invariant spectral features during both forward and backward processes, achieving better generation quality and diversity with faster convergence compared to standard diffusion models.
Details
Motivation: Standard diffusion models diffuse data all the way to white noise, creating an extremely difficult and computationally intractable prediction task. The authors aim to overcome this limitation by preserving important spectral features throughout the diffusion process.
Method: InSPECT keeps invariant spectral features during both forward and backward diffusion processes. The Fourier coefficients smoothly converge to specified random noise at the end of the forward process, enabling feature preservation while maintaining diversity and randomness.
Result: Experiments on CIFAR-10, Celeb-A, and LSUN show InSPECT achieves 39.23% reduction in FID and 45.80% improvement in IS against DDPM for 10K iterations. The model demonstrates enhanced visual diversity, faster convergence rate, and smoother diffusion process.
Conclusion: Preserving invariant spectral features in diffusion models leads to superior generation quality and diversity while enhancing computational efficiency and enabling faster convergence. This is the first work to analyze and preserve invariant spectral features in diffusion models.
Abstract: Modern diffusion models (DMs) have achieved state-of-the-art image generation. However, the fundamental design choice of diffusing data all the way to white noise and then reconstructing it leads to an extremely difficult and computationally intractable prediction task. To overcome this limitation, we propose InSPECT (Invariant Spectral Feature-Preserving Diffusion Model), a novel diffusion model that keeps invariant spectral features during both the forward and backward processes. At the end of the forward process, the Fourier coefficients smoothly converge to a specified random noise, enabling feature preservation while maintaining diversity and randomness. By preserving invariant features, InSPECT demonstrates enhanced visual diversity, a faster convergence rate, and a smoother diffusion process. Experiments on CIFAR-10, Celeb-A, and LSUN demonstrate that InSPECT achieves on average a 39.23% reduction in FID and 45.80% improvement in IS against DDPM for 10K iterations under specified parameter settings, which demonstrates the significant advantages of preserving invariant features: achieving superior generation quality and diversity, while enhancing computational efficiency and enabling a faster convergence rate. To the best of our knowledge, this is the first attempt to analyze and preserve invariant spectral features in diffusion models.
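One way to picture a forward process whose Fourier coefficients converge smoothly to a specified noise draw is a per-coefficient interpolation in the frequency domain. The linear schedule below is our own illustrative stand-in, not the paper's actual forward process.

```python
import numpy as np

def spectral_forward(x0, t, noise=None, rng=None):
    """Interpolate Fourier coefficients from the data toward a fixed noise draw.
    t in [0, 1]; at t=1 the coefficients equal those of the noise sample,
    so spectral structure decays smoothly instead of being destroyed at once."""
    rng = rng or np.random.default_rng(0)
    if noise is None:
        noise = rng.normal(size=x0.shape)
    X0, N = np.fft.fft2(x0), np.fft.fft2(noise)
    Xt = (1 - t) * X0 + t * N  # per-coefficient linear interpolation
    return np.real(np.fft.ifft2(Xt))

img = np.random.rand(32, 32)
for t in (0.0, 0.5, 1.0):
    print(t, np.round(np.std(spectral_forward(img, t)), 3))
```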
[165] VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer
Main category: cs.CV
TL;DR: VideoGameQA-Bench is a new benchmark for evaluating Vision-Language Models on video game Quality Assurance tasks, addressing the lack of standardized testing in this domain.
Details
Motivation: The video game industry needs better automation for labor-intensive QA processes. While VLMs show promise, existing benchmarks don't adequately address game-specific QA requirements, creating a need for standardized evaluation.
Method: The authors introduce VideoGameQA-Bench, a comprehensive benchmark covering diverse game QA activities including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos across various games.
Result: The benchmark provides standardized evaluation tools and datasets for assessing VLM performance on real-world game QA scenarios, with code and data publicly available.
Conclusion: VideoGameQA-Bench fills a critical gap in evaluating VLMs for game development workflows, enabling better assessment of their potential to automate QA processes in the entertainment industry.
Abstract: With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector’s sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry’s most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/
[166] Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training
Kristoffer Wickstrøm, Teresa Dorszewski, Siyan Chen, Michael Kampffmeyer, Elisabeth Wetzer, Robert Jenssen
Main category: cs.CV
TL;DR: KCCs transform any pre-trained ViT-based model into a self-explainable classifier without retraining by leveraging ViTs’ ability to identify matching keypoints between images, creating an interpretable decision process that can be visualized in the input space.
Details
Motivation: Current self-explainable models require complex training and specific architectures, making them impractical. With the rise of ViT-based foundation models, there's a growing need for methods to provide transparency and reliability to these powerful but opaque models.
Method: Keypoint Counting Classifiers (KCCs) convert any well-trained ViT-based model into a self-explainable model without retraining. The method builds on ViTs’ demonstrated ability to automatically identify matching keypoints between images, creating an interpretable decision process that can be visualized directly in the input images.
Result: Extensive evaluation shows that KCCs improve human-machine communication compared to recent baselines. The method successfully provides transparency to ViT-based foundation models through visualizable explanations in the input space.
Conclusion: KCCs represent an important step toward making ViT-based foundation models more transparent and reliable by providing a practical way to create self-explainable models without the need for complex retraining or specialized architectures.
Abstract: Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures, which makes them impractical. With the advance of general-purpose foundation models based on Vision Transformers (ViTs), this impracticability becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into a SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation, which shows that KCCs improve human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
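Since the decision rule is classification by counting matched keypoints, a natural reading is: extract ViT patch features for the query and for a few exemplars per class, count mutual nearest-neighbor matches, and predict the class with the highest count (each match is a point pair you can draw on the two images). The sketch below uses that reading; the mutual-NN rule and exemplar setup are our illustrative assumptions.

```python
import numpy as np

def mutual_nn_count(q, e):
    """Count mutual nearest neighbors between two sets of patch features.
    q: (Nq, D), e: (Ne, D); rows are L2-normalized ViT patch embeddings."""
    sim = q @ e.T                # cosine similarity matrix
    q2e = sim.argmax(axis=1)     # each query patch's best exemplar patch
    e2q = sim.argmax(axis=0)     # each exemplar patch's best query patch
    return int(sum(e2q[j] == i for i, j in enumerate(q2e)))

def kcc_predict(query_feats, class_exemplars):
    """Classify by the class whose exemplars share the most matched keypoints."""
    scores = {c: max(mutual_nn_count(query_feats, ex) for ex in exemplars)
              for c, exemplars in class_exemplars.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
norm = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
query = norm(rng.normal(size=(196, 64)))
exemplars = {"cat": [norm(rng.normal(size=(196, 64)))],
             "dog": [norm(rng.normal(size=(196, 64)))]}
print(kcc_predict(query, exemplars))
```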
[167] Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik, Angjoo Kanazawa
Main category: cs.CV
TL;DR: MAGNet is a unified diffusion framework for multi-agent motion generation that handles various interaction tasks through flexible conditioning and can generate long sequences with coherent coordination across agents.
Details
Motivation: Modeling multi-person interactions is challenging due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing methods are task-specific and don't generalize to flexible multi-agent generation.
Method: MAGNet uses an autoregressive diffusion framework with Diffusion Forcing modifications that explicitly model inter-agent coupling during denoising. It supports dyadic prediction, partner inpainting, and full multi-agent generation in a single model, with a scalable architecture agnostic to agent count.
Result: The approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios (3+ agents). It captures both tightly synchronized activities (dancing, boxing) and loosely structured social interactions.
Conclusion: MAGNet provides a unified framework for multi-agent motion generation that overcomes limitations of task-specific methods and enables coherent coordination across variable numbers of agents for diverse interaction scenarios.
Abstract: Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/
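The Diffusion Forcing ingredient is worth unpacking: instead of one noise level per sample, each token (here, each frame of each agent) is corrupted at its own independently drawn noise level, which is what later allows autoregressive rollout with partially denoised context. Below is a minimal sketch of that corruption step, assuming a (frames, agents, pose-dims) motion tensor; the shapes and schedule are illustrative.

```python
import torch

def diffusion_forcing_noise(x, alphas_bar, rng=None):
    """Diffusion-Forcing-style corruption: each frame of each agent gets its
    own independently sampled noise level, so the model learns to denoise
    sequences with mixed per-token uncertainty. x: (T, A, D)."""
    g = rng or torch.Generator().manual_seed(0)
    T, A, D = x.shape
    t = torch.randint(0, len(alphas_bar), (T, A), generator=g)  # per-token timestep
    ab = alphas_bar[t].unsqueeze(-1)                            # (T, A, 1)
    eps = torch.randn(x.shape, generator=g)
    x_t = ab.sqrt() * x + (1 - ab).sqrt() * eps
    return x_t, t, eps

alphas_bar = torch.linspace(0.99, 0.01, 100)
x = torch.randn(120, 3, 66)   # 120 frames, 3 agents, 22 joints x 3 coordinates
x_t, t, eps = diffusion_forcing_noise(x, alphas_bar)
print(x_t.shape, t.shape)
```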
[168] Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features
Linghui Zhu, Yiming Li, Haiqin Weng, Yan Liu, Tianwei Zhang, Shu-Tao Xia, Zhi Wang
Main category: cs.CV
TL;DR: Proposes a harmless model ownership verification method for personalized large vision models by decoupling common and dataset-specific features to detect model stealing attacks.
Details
Motivation: Existing defense methods for traditional DNNs are ineffective for fine-tuned large vision models, introducing security risks or being prone to misjudgment, while personalized LVMs are valuable intellectual property vulnerable to model stealing attacks.
Method: Three-stage approach: 1) Create shadow models retaining common features while disrupting dataset-specific features, 2) Train meta-classifier to identify stolen models by detecting victim’s dataset-specific features, 3) Conduct model ownership verification via hypothesis test for robustness.
Result: Extensive experiments on benchmark datasets verify effectiveness in detecting different types of model stealing attacks simultaneously.
Conclusion: Proposed method provides effective and harmless model ownership verification for personalized large vision models, addressing limitations of existing defenses for fine-tuned models.
Abstract: Large vision models (LVMs) achieve remarkable performance in various downstream tasks, primarily by personalizing pre-trained models through fine-tuning with private and valuable local data, which makes the personalized model a valuable intellectual property. Similar to the era of traditional DNNs, model stealing attacks also pose significant risks to LVMs. However, this paper reveals that most existing defense methods (developed for traditional DNNs), typically designed for models trained from scratch, either introduce additional security risks, are prone to misjudgment, or are even ineffective for fine-tuned models. To alleviate these problems, this paper proposes a harmless model ownership verification method for personalized LVMs by decoupling similar common features. In general, our method consists of three main stages. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. We represent the dataset-specific features of the victim model by computing the output differences between the shadow and victim models, without altering the victim model or its training process. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim. In the third stage, we conduct model ownership verification by hypothesis test to mitigate randomness and enhance robustness. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in detecting different types of model stealing simultaneously. Our codes are available at https://github.com/zlh-thu/Holmes.
[169] Dexterous World Models
Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo
Main category: cs.CV
TL;DR: DWM is a video diffusion framework that generates realistic human-scene interaction videos from static 3D scenes and egocentric hand motions, enabling interactive digital twins.
Details
Motivation: Current digital twins are static and lack embodied interactivity, limiting them to navigation and view synthesis. The paper aims to bridge this gap by creating interactive digital twins that can model how human actions dynamically change 3D scenes.
Method: DWM uses a scene-action-conditioned video diffusion framework that conditions video generation on: (1) static scene renderings following camera trajectories for spatial consistency, and (2) egocentric hand mesh renderings encoding geometry and motion cues for action-conditioned dynamics. Training uses a hybrid dataset combining synthetic egocentric interactions for aligned supervision and real-world fixed-camera videos for diverse object dynamics.
Result: DWM generates temporally coherent videos depicting plausible human-scene interactions like grasping, opening, and moving objects while maintaining camera and scene consistency. The framework produces realistic and physically plausible interactions.
Conclusion: DWM represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions, moving beyond static digital twins to interactive, action-driven environments.
Abstract: Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.
[170] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo
Main category: cs.CV
TL;DR: The paper proposes a framework to adapt discriminative encoder features for generative tasks by introducing semantic-pixel reconstruction to create compact, semantically-rich latent spaces that enable unified text-to-image generation and editing.
Details
Motivation: Current LDMs use VAE latents optimized for pixel reconstruction but lack semantic richness. While using representation encoder features could unify vision generation and understanding, these discriminative features lack compact regularization and have weak pixel reconstruction, leading to inaccurate object structures and poor fine-grained details in generation.
Method: Introduces a semantic-pixel reconstruction objective to regularize the latent space, compressing both semantic information and fine-grained details into a compact representation (96 channels, 16x16 downsampling). Uses this representation to design a unified T2I and image editing model.
Result: Achieves state-of-the-art image reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks compared to various feature spaces. Demonstrates representation encoders can be effectively adapted into robust generative components.
Conclusion: The proposed framework successfully adapts understanding-oriented encoder features for generative tasks by addressing fundamental obstacles through semantic-pixel regularization, enabling compact yet semantically-rich latent representations that unify vision generation and understanding.
Abstract: Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder’s inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
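The semantic-pixel objective can be pictured as two reconstruction heads reading from the same compact latent: one decodes pixels, the other re-predicts the understanding encoder's features. The sketch below shows that loss shape with toy decoders; the L1/cosine terms, weights, and tensor shapes are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_pixel_loss(z, x, pixel_decoder, semantic_decoder, teacher_feats,
                        w_pix=1.0, w_sem=0.5):
    """Joint objective: the compact latent z must reconstruct pixels
    and re-predict the understanding encoder's semantic features."""
    x_hat = pixel_decoder(z)                   # (B, C, H, W)
    loss_pix = F.l1_loss(x_hat, x)
    s_hat = semantic_decoder(z)                # (B, N, D)
    loss_sem = (1 - F.cosine_similarity(s_hat, teacher_feats, dim=-1)).mean()
    return w_pix * loss_pix + w_sem * loss_sem

# Toy shapes: 96-channel latent at 16x downsampling of a 256x256 image.
B = 2
z = torch.randn(B, 96, 16, 16)
x = torch.rand(B, 3, 256, 256)
conv = torch.nn.Conv2d(96, 3, 1)
proj = torch.randn(96, 768)
pixel_decoder = lambda z: torch.sigmoid(F.interpolate(conv(z), size=(256, 256)))
semantic_decoder = lambda z: z.flatten(2).transpose(1, 2) @ proj
teacher = torch.randn(B, 256, 768)
print(semantic_pixel_loss(z, x, pixel_decoder, semantic_decoder, teacher).item())
```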
[171] Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis
Ko Watanabe, Stanislav Frolov, Aya Hassan, David Dembinsky, Adriano Lucieri, Andreas Dengel
Main category: cs.CV
TL;DR: Synthetic skin lesion images generated by AI can effectively test fairness of skin cancer classifiers across demographic groups.
Details
Motivation: Deep learning for skin cancer screening risks bias due to unrepresentative datasets; better fairness evaluation methods are needed that account for diverse demographics such as sex, age, and race.
Method: Train a state-of-the-art generative model to create controllable synthetic skin lesion images; compare fairness testing results using synthetic images vs. the real-image benchmark dataset (MILK10K) on three public classifiers (DeepGuide, MelaNet, SkinLesionDensnet).
Result: Classification tendencies across demographic attributes showed similar patterns when tested on both real and synthetic images, confirming synthetic images can effectively verify model fairness.
Conclusion: Highly realistic synthetic images provide a viable solution for fairness testing of skin cancer classifiers when representative real datasets are difficult to obtain.
Abstract: Recent advances in deep learning and on-device inference could transform routine screening for skin cancers. Along with the anticipated benefits of this technology, potential dangers arise from unforeseen and inherent biases. A significant obstacle is building evaluation datasets that accurately reflect key demographics, including sex, age, and race, as well as other underrepresented groups. To address this, we train a state-of-the-art generative model to generate synthetic data in a controllable manner to assess the fairness of publicly available skin cancer classifiers. To evaluate whether synthetic images can be used as a fairness testing dataset, we prepare a real-image dataset (MILK10K) as a benchmark and compare the True Positive Rate results of three models (DeepGuide, MelaNet, and SkinLesionDensnet). The classification tendencies observed in each model when tested on real and generated images showed similar patterns across the different attribute sets. We confirm that highly realistic synthetic images facilitate model fairness verification.
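The core comparison is straightforward to operationalize: compute the true-positive rate per demographic group on the real benchmark and again on the synthetic set, then check whether the per-group patterns agree. A minimal sketch with made-up labels and group names follows.

```python
import numpy as np

def tpr_by_group(y_true, y_pred, groups):
    """True-positive rate per demographic group (over positive cases only)."""
    out = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == 1)
        out[g] = float((y_pred[mask] == 1).mean()) if mask.any() else float("nan")
    return out

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_pred = (rng.random(500) > 0.4).astype(int)
groups = rng.choice(["lighter_skin", "darker_skin"], 500)

tprs = tpr_by_group(y_true, y_pred, groups)
print(tprs, "gap:", max(tprs.values()) - min(tprs.values()))
# The paper's protocol compares these per-group patterns between the
# real MILK10K benchmark and the generated synthetic dataset.
```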
[172] LN3DIFF++: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
Yushi Lan, Fangzhou Hong, Shangchen Zhou, Shuai Yang, Xuyi Meng, Yongwei Chen, Zhaoyang Lyu, Bo Dai, Xingang Pan, Chen Change Loy
Main category: cs.CV
TL;DR: LN3Diff++ is a novel 3D diffusion framework that enables fast, high-quality conditional 3D generation using a 3D-aware latent space and transformer-based decoder, achieving SOTA performance without per-instance optimization.
Details
Motivation: While 2D diffusion models have been successful, there's no unified 3D diffusion pipeline. The paper aims to address this gap by creating a fast, high-quality, and generic conditional 3D generation framework.
Method: Uses a 3D-aware architecture with a VAE to encode input images into a structured 3D latent space, then decodes via a transformer-based decoder into a 3D neural field. Trains a diffusion model on this 3D-aware latent space.
Result: Achieves state-of-the-art performance on ShapeNet for 3D generation, superior performance in monocular 3D reconstruction and conditional 3D generation across datasets, and faster inference without per-instance optimization.
Conclusion: LN3Diff++ represents significant advancement in 3D generative modeling with promise for various 3D vision and graphics applications, offering a unified solution for fast, high-quality 3D generation.
Abstract: The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff++ to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff++ presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.
[173] Tracing the Roots: Leveraging Temporal Dynamics in Diffusion Trajectories for Origin Attribution
Andreas Floros, Seyed-Mohsen Moosavi-Dezfooli, Pier Luigi Dragotti
Main category: cs.CV
TL;DR: The paper introduces a framework for data provenance in diffusion models, analyzing diffusion trajectories to classify images as training data, novel generations, or external sources, challenging existing membership inference practices.
Details
Motivation: While diffusion models excel at image synthesis, there's a critical need to ensure responsible use by verifying image origins: whether from training data, novel generations, or external sources. Current methods for membership inference have significant limitations.
Method: The framework analyzes temporal dynamics across entire diffusion trajectories rather than focusing on specific denoising stages. It introduces a white-box approach for model attribution directly applicable to diffusion models, and unifies data provenance into a cohesive framework.
Result: The study shows that temporal dynamics across full trajectories enable more robust classification and challenges the “Goldilocks zone” conjecture. It exposes flaws in current membership inference methods that fail under distribution shifts or when model-generated data is present.
Conclusion: The paper proposes a unified framework for data provenance in modern generative systems, demonstrating that comprehensive trajectory analysis provides better classification than stage-specific approaches and highlighting the need for more robust verification methods.
Abstract: Diffusion models have transformed image synthesis through iterative denoising, by defining trajectories from noise to coherent data. While their capabilities are widely celebrated, a critical challenge remains unaddressed: ensuring responsible use by verifying whether an image originates from a model’s training set, its novel generations or external sources. We introduce a framework that analyzes diffusion trajectories for this purpose. Specifically, we demonstrate that temporal dynamics across the entire trajectory allow for more robust classification and challenge the widely-adopted “Goldilocks zone” conjecture, which posits that membership inference is effective only within narrow denoising stages. More fundamentally, we expose critical flaws in current membership inference practices by showing that representative methods fail under distribution shifts or when model-generated data is present. For model attribution, we demonstrate a first white-box approach directly applicable to diffusion. Ultimately, we propose the unification of data provenance into a single, cohesive framework tailored to modern generative systems.
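A concrete way to use the whole trajectory rather than one "Goldilocks" timestep: noise the image to every level in the schedule, ask the model to reconstruct, and stack the per-timestep errors into a feature vector for a downstream membership/origin classifier. The sketch below does this with a toy denoiser; the MSE statistic and schedule are illustrative assumptions.

```python
import numpy as np

def trajectory_features(x0, denoise_fn, alphas_bar, rng=None):
    """Per-timestep reconstruction errors along the diffusion trajectory.
    denoise_fn(x_t, t) should return the model's estimate of x0."""
    rng = rng or np.random.default_rng(0)
    eps = rng.normal(size=x0.shape)
    feats = []
    for t, ab in enumerate(alphas_bar):
        x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps  # forward-noised sample
        x0_hat = denoise_fn(x_t, t)
        feats.append(np.mean((x0_hat - x0) ** 2))       # error at this noise level
    return np.array(feats)  # feed the full vector to e.g. logistic regression

# Toy stand-in denoiser: shrinks x_t back toward zero.
alphas_bar = np.linspace(0.99, 0.01, 20)
x = np.random.rand(8, 8)
f = trajectory_features(x, lambda x_t, t: 0.9 * x_t, alphas_bar)
print(f.shape, f[:3])
```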
[174] From Engineering Diagrams to Graphs: Digitizing P&IDs with Transformers
Jan Marius Stürmer, Marius Graumann, Tobias Koch
Main category: cs.CV
TL;DR: Transformer-based Relationformer jointly extracts symbols and connections from P&IDs, outperforming modular approaches by 25%+ in edge detection accuracy, with a new public benchmark dataset.
Details
Motivation: Previous P&ID digitization methods use separate steps for symbol and line detection, limiting their ability to capture diagram structure. There's a need for joint extraction and a public benchmark dataset.
Method: Transformer-based Relationformer approach that jointly extracts symbols and their interconnections from P&IDs in a unified framework.
Result: Significantly outperforms modular baseline with over 25% improvement in edge detection accuracy on real-world diagrams.
Conclusion: The research provides a reproducible evaluation framework, demonstrates transformer effectiveness for structural understanding of engineering diagrams, and releases the first public P&ID digitization benchmark dataset.
Abstract: Digitizing engineering diagrams like Piping and Instrumentation Diagrams (P&IDs) plays a vital role in maintainability and operational efficiency of process and hydraulic systems. Previous methods typically decompose the task into separate steps such as symbol detection and line detection, which can limit their ability to capture the structure in these diagrams. In this work, a transformer-based approach leveraging the Relationformer that addresses this limitation by jointly extracting symbols and their interconnections from P&IDs is introduced. To evaluate our approach and compare it to a modular digitization approach, we present the first publicly accessible benchmark dataset for P&ID digitization, annotated with graph-level ground truth. Experimental results on real-world diagrams show that our method significantly outperforms the modular baseline, achieving over 25% improvement in edge detection accuracy. This research contributes a reproducible evaluation framework and demonstrates the effectiveness of transformer models for structural understanding of complex engineering diagrams. The dataset is available under https://zenodo.org/records/14803338.
[175] Embedding-Driven Data Distillation for 360-Degree IQA With Residual-Aware Refinement
Abderrezzaq Sendjasni, Seif-Eddine Benkabou, Mohamed-Chaker Larabi
Main category: cs.CV
TL;DR: A novel framework for 360-degree image quality assessment that uses embedding similarity-based selection to reduce patch redundancy by 40-50% while maintaining or improving performance.
Details
Motivation: Addresses the fundamental bottleneck in data-driven 360-degree IQA: lack of intelligent, sample-level data selection, leading to redundant patch sampling that wastes computational resources.
Method: Proposes an embedding similarity-based selection algorithm that distills redundant patches into a compact, informative subset. Formulated as a regularized optimization problem preserving perceptual relationships in low-dimensional space, using residual analysis to filter irrelevant samples.
Result: Extensive experiments on CVIQ, OIQA, and MVAQD datasets show baseline models match/exceed performance using only 40-50% of patches. Integration with state-of-the-art CNN/transformer models maintains/improves performance with 20-40% reduced computational load.
Conclusion: Adaptive post-sampling data refinement is a powerful, widely applicable strategy for efficient and robust 360-degree IQA, establishing intelligent data selection as crucial for computational efficiency.
Abstract: This article identifies and addresses a fundamental bottleneck in data-driven 360-degree image quality assessment (IQA): the lack of intelligent, sample-level data selection. Hence, we propose a novel framework that introduces a critical refinement step between patch sampling and model training. The core of our contribution is an embedding similarity-based selection algorithm that distills an initial, potentially redundant set of patches into a compact, maximally informative subset. This is formulated as a regularized optimization problem that preserves intrinsic perceptual relationships in a low-dimensional space, using residual analysis to explicitly filter out irrelevant or redundant samples. Extensive experiments on three benchmark datasets (CVIQ, OIQA, MVAQD) demonstrate that our selection enables a baseline model to match or exceed the performance of using all sampled data while keeping only 40-50% of patches. Notably, we demonstrate the universal applicability of our approach by integrating it with several state-of-the-art IQA models, including CNN- and transformer-based architectures, consistently enabling them to maintain or improve performance with 20-40% reduced computational load. This work establishes that adaptive, post-sampling data refinement is a powerful and widely applicable strategy for achieving efficient and robust 360-degree IQA.
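The selection step can be pictured as a redundancy filter in embedding space: keep a patch only if it is not too similar to anything already kept. The greedy cosine-threshold rule below is a simple illustrative stand-in for the paper's regularized optimization with residual analysis.

```python
import numpy as np

def select_patches(emb, max_sim=0.9, budget=None):
    """Greedy redundancy filter on patch embeddings (N, D): keep a patch
    only if its cosine similarity to every kept patch is below max_sim."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    kept = []
    for i in range(len(emb)):
        if all(emb[i] @ emb[j] < max_sim for j in kept):
            kept.append(i)
        if budget and len(kept) >= budget:
            break
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=(40, 128))
patches = np.concatenate([base, base + 0.01 * rng.normal(size=base.shape)])  # redundant copies
idx = select_patches(patches)
print(f"kept {len(idx)} of {len(patches)} patches")  # near-duplicates are dropped
```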
[176] SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks
Yimeng Fan, Changsong Liu, Mingyang Li, Dongze Liu, Yanyan Liu, Wei Zhang
Main category: cs.CV
TL;DR: SpikeDet is a novel spiking object detector that addresses local firing saturation in SNNs, achieving better accuracy with lower power consumption through optimized firing patterns and multi-direction feature fusion.
Details
Motivation: Existing SNN-based object detection methods suffer from local firing saturation where adjacent neurons reach maximum firing rates simultaneously, especially in object-centric regions. This reduces feature discrimination, lowers detection accuracy, and increases firing rates, preventing SNNs from achieving their potential energy efficiency.
Method: 1) MDSNet spiking backbone network that adjusts membrane synaptic input distribution at each layer for better neuron firing patterns; 2) Spiking Multi-direction Fusion Module (SMFM) for multi-direction fusion of spiking features to enhance multi-scale detection; 3) Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation.
Result: SpikeDet achieves 52.2% AP on COCO 2017, outperforming previous SNN methods by 3.3% AP while requiring only half the power consumption. It also achieves best performance on specialized datasets: event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman.
Conclusion: SpikeDet successfully addresses the local firing saturation problem in SNN object detection through optimized firing patterns and feature fusion, achieving superior accuracy with significantly reduced power consumption across diverse detection scenarios.
Abstract: Spiking Neural Networks (SNNs) are the third generation of neural networks. They have gained widespread attention in object detection due to their low power consumption and biological interpretability. However, existing SNN-based object detection methods suffer from local firing saturation, where adjacent neurons concurrently reach maximum firing rates, especially in object-centric regions. This abnormal neuron firing pattern reduces the feature discrimination capability and detection accuracy, while also increasing the firing rates that prevent SNNs from achieving their potential energy efficiency. To address this problem, we propose SpikeDet, a novel spiking object detector that optimizes firing patterns for accurate and energy-efficient detection. Specifically, we design a spiking backbone network, MDSNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better neuron firing patterns during spiking feature extraction. For the neck, to better utilize and preserve these high-quality backbone features, we introduce the Spiking Multi-direction Fusion Module (SMFM), which realizes multi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Furthermore, we propose the Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation. Experimental results validate the effectiveness of our method, with SpikeDet achieving superior performance. On the COCO 2017 dataset, it achieves 52.2% AP, outperforming previous SNN-based methods by 3.3% AP while requiring only half the power consumption. On object detection sub-tasks, including event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman datasets, SpikeDet also achieves the best performance.
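The abstract does not give the LFSI formula, but one plausible reading is the fraction of local spatial windows whose mean firing rate is effectively saturated. The sketch below implements that reading; treat the window size, threshold, and the definition itself as our illustrative assumptions rather than the paper's exact index.

```python
import numpy as np

def local_firing_saturation(rates, win=3, thresh=0.95):
    """Fraction of local windows whose mean firing rate exceeds `thresh`.
    rates: (H, W) array of per-neuron firing rates in [0, 1]."""
    H, W = rates.shape
    saturated, total = 0, 0
    for i in range(0, H - win + 1):
        for j in range(0, W - win + 1):
            total += 1
            if rates[i:i + win, j:j + win].mean() > thresh:
                saturated += 1
    return saturated / total

rng = np.random.default_rng(0)
rates = rng.uniform(0, 0.6, size=(32, 32))
rates[10:20, 10:20] = 1.0   # an object-centric region firing at the maximum rate
print(round(local_firing_saturation(rates), 3))
```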
[177] Multi Anatomy X-Ray Foundation Model
Nishank Singla, Krisztian Koos, Farzin Haddadpour, Amin Honarmandi Shandiz, Lovish Chum, Xiaojian Xu, Qing Jin, Erhan Bas
Main category: cs.CV
TL;DR: XR-0 is a multi-anatomy X-ray foundation model trained on 1.15M images that achieves SOTA performance across 12 datasets and 20 clinical tasks, demonstrating the importance of anatomical diversity for robust medical AI.
Details
Motivation: Most existing AI foundation models in radiology are limited to chest anatomy and fail to generalize across broader clinical tasks, creating a need for more versatile medical vision models.
Method: Self-supervised learning on a large private dataset of 1.15 million X-ray images spanning diverse anatomical regions, evaluated across 12 datasets and 20 downstream tasks including classification, retrieval, segmentation, localization, visual grounding, and report generation.
Result: XR-0 achieves state-of-the-art performance on most multi-anatomy tasks and remains competitive on chest-specific benchmarks, outperforming specialized models.
Conclusion: Anatomical diversity and self-supervised learning are critical for building robust, general-purpose medical vision models, paving the way for scalable and adaptable AI systems in radiology.
Abstract: X-ray imaging is ubiquitous in radiology, yet most existing AI foundation models are limited to chest anatomy and fail to generalize across broader clinical tasks. In this work, we introduce XR-0, a multi-anatomy X-ray foundation model trained with self-supervised learning on a large, private dataset of 1.15 million images spanning diverse anatomical regions, and evaluated across 12 datasets and 20 downstream tasks, including classification, retrieval, segmentation, localization, visual grounding, and report generation. XR-0 achieves state-of-the-art performance on most multi-anatomy tasks and remains competitive on chest-specific benchmarks. Our results demonstrate that anatomical diversity and self-supervised learning are critical for building robust, general-purpose medical vision models, paving the way for scalable and adaptable AI systems in radiology.
[178] 3D Cell Oversegmentation Correction via Geo-Wasserstein Divergence
Peter Chen, Bryan Chang, Olivia A Creasey, Julie Beth Sneddon, Zev J Gartner, Yining Liu
Main category: cs.CV
TL;DR: A geometric framework for identifying and correcting oversegmentation errors in 3D cell segmentation using pre-trained classifiers with 2D geometric and 3D topological features, plus a novel Geo-Wasserstein divergence metric.
Details
Motivation: Oversegmentation in 3D cell segmentation degrades quality and is difficult to resolve because oversegmentation errors resemble natural gaps between adjacent cells. There's a need for a systematic approach to address this problem.
Method: 1) Formulates oversegmentation as a concrete problem and proposes a geometric framework to identify and correct errors. 2) Builds a pre-trained classifier using both 2D geometric and 3D topological features from flawed segmentation results. 3) Introduces Geo-Wasserstein divergence to quantify changes in 2D geometries in a geometry-aware manner.
Result: Validated through extensive experiments on in-domain plant datasets (synthesized and real oversegmented cases) and out-of-domain animal datasets to demonstrate transfer learning performance. Ablation study highlights the contribution of Geo-Wasserstein divergence.
Conclusion: Provides a clear pipeline for end-users to build pre-trained models for any labeled dataset, offering a systematic solution to the oversegmentation problem in 3D cell segmentation.
Abstract: 3D cell segmentation methods are often hindered by oversegmentation, where a single cell is incorrectly split into multiple fragments. This degrades the final segmentation quality and is notoriously difficult to resolve, as oversegmentation errors often resemble natural gaps between adjacent cells. Our work makes two key contributions. First, for 3D cell segmentation, we are the first work to formulate oversegmentation as a concrete problem and propose a geometric framework to identify and correct these errors. Our approach builds a pre-trained classifier using both 2D geometric and 3D topological features extracted from flawed 3D segmentation results. Second, we introduce a novel metric, Geo-Wasserstein divergence, to quantify changes in 2D geometries. This captures the evolving trends of cell mask shape in a geometry-aware manner. We validate our method through extensive experiments on in-domain plant datasets, including both synthesized and real oversegmented cases, as well as on out-of-domain animal datasets to demonstrate transfer learning performance. An ablation study further highlights the contribution of the Geo-Wasserstein divergence. A clear pipeline is provided for end-users to build pre-trained models for any labeled dataset.
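To give the flavor of a geometry-aware divergence between 2D cell masks, the sketch below compares two contours via the 1D Wasserstein distance between their normalized radial profiles. This is an illustrative proxy for comparing mask geometries, not the paper's actual Geo-Wasserstein definition.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def radial_profile(contour):
    """Normalized distances from a closed 2D contour (N, 2) to its centroid."""
    c = contour.mean(axis=0)
    r = np.linalg.norm(contour - c, axis=1)
    return r / r.mean()

def geo_divergence(contour_a, contour_b):
    """1D Wasserstein distance between the two shapes' radial profiles."""
    return wasserstein_distance(radial_profile(contour_a), radial_profile(contour_b))

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]
blob = np.c_[(1 + 0.3 * np.sin(3 * theta)) * np.cos(theta),
             (1 + 0.3 * np.sin(3 * theta)) * np.sin(theta)]
print(geo_divergence(circle, circle), geo_divergence(circle, blob))
```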
[179] GraphCompNet: A Position-Aware Model for Predicting and Compensating Shape Deviations in 3D Printing
Juheon Lee, Lei Chen, Juan Carlos Catana, Hui Wang, Jun Zeng
Main category: cs.CV
TL;DR: GraphCompNet: A graph-based neural network framework using adversarial training to predict and compensate for shape deviations in additive manufacturing, addressing position-dependent variations in batch production.
Details
Motivation: Traditional methods for controlling geometric deviations in additive manufacturing rely on complex parameterized models and repetitive metrology, which are time-consuming and not applicable for batch production. There's a need for generalizable approaches that can handle complex geometries and adapt to position-dependent variations in industrial-scale production.
Method: GraphCompNet integrates graph-based neural networks with a GAN-inspired training paradigm. It uses point cloud representations and dynamic graph convolutional neural networks (DGCNNs) to model complex geometries while incorporating position-specific thermal and mechanical variations. A two-stage adversarial training process with compensator-predictor architecture iteratively refines compensated designs with real-time feedback.
Result: Experimental validation shows the framework can predict deviations in freeform geometries and adapt to position-dependent batch production conditions, achieving 35-65% improvement in compensation accuracy across the entire printing space while addressing position-dependent variabilities within the print chamber.
Conclusion: The proposed method advances Digital Twin development for additive manufacturing by offering scalable, real-time monitoring and compensation capabilities, addressing critical challenges in geometric accuracy for industrial-scale production.
Abstract: Shape deviation modeling and compensation in additive manufacturing are pivotal for achieving high geometric accuracy and enabling industrial-scale production. Critical challenges persist, including generalizability across complex geometries and adaptability to position-dependent variations in batch production. Traditional methods of controlling geometric deviations often rely on complex parameterized models and repetitive metrology, which can be time-consuming and are not applicable to batch production. In this paper, we present a novel, process-agnostic approach to address the challenge of ensuring geometric precision and accuracy in position-dependent AM production. The proposed GraphCompNet presents a novel computational framework integrating graph-based neural networks with a GAN-inspired training paradigm. The framework leverages point cloud representations and dynamic graph convolutional neural networks (DGCNNs) to model intricate geometries while incorporating position-specific thermal and mechanical variations. A two-stage adversarial training process iteratively refines compensated designs using a compensator-predictor architecture, enabling real-time feedback and optimization. Experimental validation across various shapes and positions demonstrates the framework’s ability to predict deviations in freeform geometries and adapt to position-dependent batch production conditions, significantly improving compensation accuracy (35 to 65 percent) across the entire printing space, addressing position-dependent variabilities within the print chamber. The proposed method advances the development of a Digital Twin for AM, offering scalable, real-time monitoring and compensation capabilities.
[180] G2L: From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation
Yesung Cho, Sungmin Lee, Geongyu Lee, Minkyung Lee, Jongbae Park, Dongmyung Shin
Main category: cs.CV
TL;DR: G2L framework uses knowledge distillation to boost large-scale pathology models to giga-scale performance using only 1K cancer-specific slides, achieving better results than same-size models and sometimes surpassing giga-scale teachers.
Details
Motivation: Giga-scale pathology foundation models have high performance but are computationally prohibitive for practical use due to massive training data requirements (hundreds of thousands of slides) and billions of parameters. There's a need for parameter-efficient alternatives that can achieve similar performance without the computational burden.
Method: Proposes G2L framework that applies knowledge distillation to transfer capabilities from a giga-scale model to a large-scale model (only 15% of giga-scale parameters). Uses just 1,000 pathology slides of a target cancer type (e.g., breast, prostate) for distillation.
Result: Distilled model outperformed state-of-the-art models of same size (large-scale) across several benchmarks. Surprisingly surpassed the giga-scale teacher and huge-scale models in some benchmarks. Also exhibited higher robustness index, showing improved resilience to image variations from multiple institutions.
Conclusion: The distillation approach provides a data- and parameter-efficient way to achieve giga-scale-level performance for cancer-specific applications without prohibitive computational costs, making advanced pathology AI more practical for real-world deployment.
Abstract: Recent studies in pathology foundation models have shown that scaling training data, diversifying cancer types, and increasing model size consistently improve their performance. However, giga-scale foundation models, which are trained on hundreds of thousands of slides covering tens of cancer types and contain billions of parameters, pose significant challenges for practical use due to their tremendous computational costs in both development and deployment. In this work, we present a novel strategy, named the G2L framework, to increase the performance of large-scale foundation models, which consist of only 15% of the parameters of giga-scale models, to a comparable performance level of giga-scale models in cancer-specific tasks. Our approach applies knowledge distillation, transferring the capabilities of a giga-scale model to a large-scale model, using just 1K pathology slides of a target cancer (e.g., breast, prostate, etc.). The resulting distilled model not only outperformed state-of-the-art models of the same size (i.e., large-scale) across several benchmarks but also, interestingly, surpassed the giga-scale teacher and huge-scale models in some benchmarks. In addition, the distilled model exhibited a higher robustness index, indicating improved resilience to image variations originating from multiple institutions. These findings suggest that the proposed distillation approach for a large-scale model is a data- and parameter-efficient way to achieve giga-scale-level performance for cancer-specific applications without prohibitive computational burden.
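Mechanically, this kind of cancer-specific distillation is a feature-matching loop: freeze the giga-scale teacher, run both models on patches from roughly 1K slides of the target cancer, and train the student (plus a projection head, since the embedding widths differ) to match the teacher's embeddings. The sketch below uses toy encoders and an MSE feature loss; those choices are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_step(patches, teacher, student, head, opt):
    """One distillation step: match the student's (projected) patch embedding
    to the frozen giga-scale teacher's embedding."""
    with torch.no_grad():
        t_feat = teacher(patches)        # (B, D_teacher), frozen
    s_feat = head(student(patches))      # project student dim -> teacher dim
    loss = F.mse_loss(s_feat, t_feat)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy stand-ins for the pathology encoders.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1536)).eval()
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
head = nn.Linear(768, 1536)
opt = torch.optim.AdamW(list(student.parameters()) + list(head.parameters()), lr=1e-4)
print(kd_step(torch.rand(4, 3, 224, 224), teacher, student, head, opt))
```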
[181] Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors
Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, Dong Gong
Main category: cs.CV
TL;DR: TPPT: A simple continual learning method for CLIP that uses textual prototypes as stable anchors to guide visual prompt learning, reducing forgetting while learning new tasks.
Details
Motivation: Existing CL methods for CLIP are overly complex with intricate regularization schemes, routing mechanisms, and multi-stage designs that underutilize CLIP's intrinsic capabilities. There's a need for a more concise approach that fully exploits CLIP's multi-modal structure and the stability of textual representations.
Method: Textual Prototype-guided Prompt Tuning (TPPT) uses textual prototypes as stable anchors to guide visual prompt learning (TPPT-V). It also jointly optimizes visual and textual prompts (TPPT-VT) to close the vision-language gap. The method includes relational diversity regularization on textual anchors to prevent embedding space collapse and mitigate correlated forgetting.
Result: Extensive experiments demonstrate the effectiveness of TPPT in enabling effective learning of new knowledge while reducing forgetting, highlighting the benefits of leveraging CLIP’s intrinsic guidance for continual adaptation.
Conclusion: TPPT provides a concise yet effective continual learning approach for CLIP that fully exploits its multi-modal capabilities through textual prototype guidance and bidirectional supervision, outperforming more complex existing methods.
Abstract: Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, providing rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage designs that introduce additional, and possibly unnecessary, complexity, underutilizing CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
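To make the anchoring idea concrete, a minimal sketch of textual-prototype-guided tuning follows; the exact TPPT losses are not spelled out here, so the cosine cross-entropy term and the pairwise-similarity penalty standing in for the relational diversity regularization are assumptions.

```python
import torch
import torch.nn.functional as F

C, D = 10, 512
text_protos = torch.nn.Parameter(torch.randn(C, D))  # textual anchors
visual_prompt = torch.nn.Parameter(torch.zeros(D))   # toy learnable prompt

def tppt_loss(img_feats, labels, tau=0.07, lam=0.1):
    p = F.normalize(text_protos, dim=-1)
    z = F.normalize(img_feats + visual_prompt, dim=-1)  # prompt-shifted features
    ce = F.cross_entropy(z @ p.t() / tau, labels)       # pull toward own anchor
    sim = p @ p.t()
    off_diag = sim - torch.diag(torch.diag(sim))
    diversity = off_diag.pow(2).mean()                  # keep anchors spread apart
    return ce + lam * diversity

loss = tppt_loss(torch.randn(8, D), torch.randint(0, C, (8,)))
loss.backward()
print(visual_prompt.grad.shape)
```

In TPPT-V the prototypes would stay frozen and only the visual prompt would train; making them learnable, as here, loosely corresponds to the TPPT-VT variant.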
[182] EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
Haoran Sun, Chen Cai, Huiping Zhuang, Kong Aik Lee, Lap-Pui Chau, Yi Wang
Main category: cs.CV
TL;DR: Proposes EDVD-LLaMA, a multimodal LLM framework for explainable deepfake video detection that provides traceable reasoning alongside detection results, using spatio-temporal feature extraction and fine-grained chain-of-thought with facial feature constraints.
Details
Motivation: Traditional deepfake detection methods lack transparency and generalization capabilities, creating an urgent need for detectors that can both identify forged content and provide verifiable reasoning explanations.
Method: 1) Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract/fuse global/local cross-frame deepfake features; 2) Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism using facial feature data as hard constraints for pixel-level spatio-temporal localization; 3) Explainable Reasoning FF++ dataset (ER-FF++set) with structured annotations for dual supervision.
Result: EDVD-LLaMA achieves outstanding performance and robustness in detection accuracy, explainability, and handling cross-forgery methods and cross-dataset scenarios, outperforming previous DVD methods.
Conclusion: The framework provides a more explainable and superior solution for deepfake video detection by combining accurate detection with trustworthy, traceable reasoning processes.
Abstract: The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The project page is available at: https://11ouo1.github.io/edvd-llama/.
[183] HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing
Zixuan Bian, Ruohan Ren, Yue Yang, Chris Callison-Burch
Main category: cs.CV
TL;DR: HOLODECK 2.0 is an advanced vision-language-guided framework for automated 3D scene generation with interactive editing capabilities, supporting diverse styles and both indoor/open-domain environments.
Details
Motivation: Current 3D scene design requires extensive manual effort, and existing automated methods struggle with open-domain scenes and flexible editing. There's a need for more automated, flexible 3D generation systems.
Method: Uses vision-language models to identify and parse required objects, generates assets via state-of-the-art 3D generative models, then iteratively applies spatial constraints from VLMs for semantically coherent and physically plausible layouts.
Result: HOLODECK 2.0 generates diverse, stylistically rich 3D scenes with high semantic fidelity to fine-grained descriptions, outperforming baselines in both indoor and open-domain scenarios. Supports interactive editing based on human feedback.
Conclusion: HOLODECK 2.0 effectively addresses current limitations in 3D scene generation, providing high-quality automated generation with flexible editing capabilities, demonstrated in practical applications like procedural game modeling.
Abstract: 3D scene generation plays a crucial role in gaming, artistic creation, virtual reality, and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. To address those challenges, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. Then, HOLODECK 2.0 iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Both human and model evaluations demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, HOLODECK 2.0 provides editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling to generate visually rich and immersive environments that can boost efficiency in game design.
[184] CharDiff-LP: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration
Kihyun Na, Gyuhwan Park, Injung Kim
Main category: cs.CV
TL;DR: CharDiff-LP: A diffusion-based license plate restoration framework with character-level guidance that uses region-wise masking to improve restoration quality and recognition accuracy.
Details
Motivation: License plate image restoration is crucial for both preprocessing in recognition systems and enhancing evidential value, visual clarity, and reusability of license plate images, especially when dealing with severely degraded images captured under realistic conditions.
Method: Proposes CharDiff-LP, a diffusion-based framework with character-level guidance that leverages fine-grained character priors from external segmentation and OCR modules. Introduces Character-guided Attention through Region-wise Masking (CHARM) module to restrict character guidance to specific regions, preventing interference between different character areas (see the sketch after the abstract).
Result: CharDiff-LP significantly outperforms baseline restoration models in both restoration quality and recognition accuracy, achieving a 28.3% relative reduction in character error rate (CER) on the Roboflow-LP dataset compared to the best-performing baseline.
Conclusion: The proposed CharDiff-LP framework effectively restores severely degraded license plate images through character-level guidance with region-wise masking, demonstrating superior performance in both restoration and recognition tasks compared to existing methods.
Abstract: License plate image restoration is important not only as a preprocessing step for license plate recognition but also for enhancing evidential value, improving visual clarity, and enabling broader reuse of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff-LP, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff-LP leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff-LP incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character’s guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff-LP significantly outperformed baseline restoration models in both restoration quality and recognition accuracy, achieving a 28.3% relative reduction in character error rate (CER) on the Roboflow-LP dataset compared with the best-performing baseline.
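The CHARM idea, restricting each character's guidance to its own region, can be pictured as masked cross-attention. The sketch below is a toy under assumed shapes; the residual update and the background handling are mine, not the paper's.

```python
import torch

def charm_attention(img_feats, char_feats, region_ids):
    """img_feats: (N, D) plate-image positions; char_feats: (K, D) character
    priors from segmentation/OCR; region_ids: (N,) index of the character
    region each position belongs to (-1 = background, receives no guidance)."""
    N, D = img_feats.shape
    K = char_feats.shape[0]
    scores = img_feats @ char_feats.t() / D**0.5        # (N, K) raw attention
    mask = region_ids.unsqueeze(1) != torch.arange(K)   # block other regions
    scores = scores.masked_fill(mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)                       # background rows -> 0
    return img_feats + attn @ char_feats                # guided residual update

out = charm_attention(torch.randn(16, 64), torch.randn(5, 64),
                      torch.tensor([0]*3 + [1]*3 + [2]*3 + [3]*3 + [-1]*4))
print(out.shape)  # torch.Size([16, 64])
```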
[185] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding
Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, Houqiang Li
Main category: cs.CV
TL;DR: DocR1 is a multimodal LLM trained with Evidence Page-Guided GRPO (EviGRPO), a novel RL framework for multi-page document understanding that uses evidence-aware rewards to guide coarse-to-fine reasoning (retrieve pages first, then answer).
Details
Motivation: Multi-page document understanding is challenging for MLLMs as it requires fine-grained visual comprehension and multi-hop reasoning across pages. RL has been used for advanced reasoning in MLLMs but remains underexplored for multi-page documents.
Method: Introduces EviGRPO (Evidence Page-Guided GRPO) RL framework with evidence-aware reward mechanism promoting coarse-to-fine reasoning (see the reward sketch after the abstract). Uses two-stage annotation pipeline and curriculum learning to construct EviBench (4.8k training examples) and ArxivFullQA (8.6k QA evaluation benchmark).
Result: DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks across extensive experiments.
Conclusion: The EviGRPO framework enables building high-quality multi-page document understanding models with limited supervision through evidence-guided RL training.
Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
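A plausible shape for the evidence-aware reward is sketched below; the actual composition and weights used by EviGRPO are not given in this summary, so the evidence-F1-plus-exact-match form and the weights are assumptions.

```python
def evigrpo_reward(pred_pages, gold_pages, pred_answer, gold_answer,
                   w_evidence=0.5, w_answer=0.5):
    """Reward retrieving the right evidence pages before answering."""
    pred, gold = set(pred_pages), set(gold_pages)
    if pred or gold:
        inter = len(pred & gold)
        prec = inter / len(pred) if pred else 0.0
        rec = inter / len(gold) if gold else 0.0
        evidence_f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    else:
        evidence_f1 = 1.0  # no evidence required, none retrieved
    answer_score = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return w_evidence * evidence_f1 + w_answer * answer_score

print(evigrpo_reward([2, 5], [2, 7], "42 pages", "42 pages"))  # 0.75
```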
[186] Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation
Bailey Trang, Parham Saremi, Alan Q. Wang, Fangrui Huang, Zahra TehraniNasab, Amar Kumar, Tal Arbel, Li Fei-Fei, Ehsan Adeli
Main category: cs.CV
TL;DR: Rainbow is a conditional image generation framework that uses GFlowNets to decompose input conditions into diverse latent representations, generating multiple plausible images from uncertain prompts while maintaining fidelity.
Details
Motivation: Traditional methods for generating diverse images from uncertain conditions either modify random seeds (making differences hard to interpret) or diversify input prompts (limited in verbal interpretability). There's a need for methods that can capture meaningful diversity when conditions contain uncertainty leading to multiple plausible outputs.
Method: Rainbow integrates a latent graph parameterized by GFlowNets into prompt representation computation. It uses GFlowNets' advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, producing multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images.
Result: Evaluations on natural image and medical image datasets demonstrate Rainbow’s improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.
Conclusion: Rainbow provides an effective framework for generating diverse plausible images from uncertain conditions by decomposing input conditions into diverse latent representations using GFlowNets, applicable to any pretrained conditional generative model.
Abstract: Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets’ advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow’s improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.
[187] On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness
Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Alexandra Gómez-Villa, Jorge Vila-Tomás, Valero Laparra, Jesus Malo
Main category: cs.CV
TL;DR: CLIP models show a systematic transition during training: early stages have strong texture bias and align with low-level human perception but are noise-sensitive; later stages shift to shape-based representations with better noise robustness but reduced low-level perceptual alignment.
Details
Motivation: To understand how CLIP's internal visual representations evolve during training and how this evolution relates to human perception, since most existing analyses only characterize fully trained models, leaving training dynamics unexplored.
Method: Epoch-by-epoch analysis of CLIP models throughout training, using multiple perceptual benchmarks: low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noise robustness evaluation.
Result: Early training stages show strong texture bias, high alignment with low-level human perception, and increased sensitivity to Gaussian noise. As training progresses, texture bias diminishes in favor of shape-based representations, with improved noise robustness but declining low-level perceptual alignment. These dynamics are consistent across multiple CLIP model scales.
Conclusion: There’s a systematic trade-off between early low-level perceptual alignment and later robustness, revealing how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training, offering insights into vision-language model dynamics and their relationship to human visual processing.
Abstract: Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analyses characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noise robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.
[188] The Generation Phases of Flow Matching: a Denoising Perspective
Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias
Main category: cs.CV
TL;DR: The paper investigates flow matching from a denoising perspective, establishing formal connections between flow matching models and denoisers to analyze generation quality factors.
Details
Motivation: Flow matching has achieved success but the factors influencing generation quality remain poorly understood. The authors aim to empirically probe the generation process by adopting a denoising perspective.
Method: The authors design a framework to connect flow matching models with denoisers, providing a common ground for comparison (see the worked equation after the abstract). They introduce principled perturbations (noise and drift) to influence sample generation and analyze the dynamical phases of the generative process.
Result: The framework reveals new insights into distinct dynamical phases of the generative process, enabling precise characterization of when denoisers succeed or fail during generation and why this matters.
Conclusion: By establishing formal connections between flow matching and denoising, the work provides a principled way to analyze generation quality factors and understand the dynamics of the generative process.
Abstract: Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
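One standard way to make the flow-matching/denoiser connection explicit, assuming the common linear-interpolation convention with noise at t = 0 and data at t = 1 (the paper's own convention and notation may differ), is a single algebraic step:

```latex
x_t = (1-t)\,x_0 + t\,x_1, \qquad
v^\star(x_t, t) = \mathbb{E}\left[x_1 - x_0 \mid x_t\right]
\;\Longrightarrow\;
\hat{x}_1(x_t, t) = x_t + (1-t)\,v^\star(x_t, t) = \mathbb{E}\left[x_1 \mid x_t\right].
```

The identity holds because x_t + (1-t)(x_1 - x_0) = x_1 pointwise along the interpolation path, so the optimal velocity field carries a denoiser for free, which is the kind of common ground on which the two model families can be compared.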
[189] TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing Using Single-Pixel
Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, Tristan Berchoux
Main category: cs.CV
TL;DR: TimeSenCLIP is a lightweight vision-language model for remote sensing that aligns Sentinel-2 time series with ground-level images using temporal contrastive learning, without needing text annotations.
Details
Motivation: Current VLMs for remote sensing rely on caption-based supervision (often limited) and prioritize spatial context over spectral/temporal information, making them ineffective for medium-resolution imagery where temporal and spectral signals are crucial.
Method: Uses a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, focusing on temporal and spectral signals rather than spatial context, and operates without textual annotations (see the sketch after the abstract).
Result: TimeSenCLIP is presented as a lightweight, annotation-free model that tests whether single-pixel Sentinel-2 time series carry sufficient information for a variety of remote sensing tasks, emphasizing temporal and spectral analysis over spatial context.
Conclusion: TimeSenCLIP addresses limitations of current VLMs by focusing on temporal and spectral information in medium-resolution remote sensing, offering a text-free approach that better captures the dynamics of land-use and land-cover changes.
Abstract: Vision-language models (VLMs) have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face several key challenges, such as the dependence on caption-based supervision, which is often not available or very limited in terms of the covered semantics, and the fact of being adapted from generic VLM architectures that are suitable for very high resolution images. Consequently, these models tend to prioritize spatial context over spectral and temporal information, limiting their effectiveness for medium-resolution remote sensing imagery. In this work, we present TimeSenCLIP, a lightweight VLM for remote sensing time series, using a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, without requiring textual annotations. Unlike prior VLMs, TimeSenCLIP emphasizes temporal and spectral signals over spatial context, investigating whether single-pixel time series contain sufficient information for solving a variety of tasks.
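For intuition, a minimal cross-view contrastive objective of the kind described (pixel time series on one side, ground photo on the other) might look like the sketch below; the encoders, feature sizes, and the symmetric InfoNCE form are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

ts_encoder = torch.nn.GRU(input_size=10, hidden_size=128, batch_first=True)
img_proj = torch.nn.Linear(768, 128)   # hypothetical ground-image embedder head

def clip_style_loss(pixel_series, ground_feats, tau=0.07):
    """pixel_series: (B, T, 10) Sentinel-2 bands for one pixel over time;
    ground_feats: (B, 768) features of the co-located ground photo."""
    _, h = ts_encoder(pixel_series)
    z_ts = F.normalize(h[-1], dim=-1)                   # (B, 128)
    z_im = F.normalize(img_proj(ground_feats), dim=-1)  # (B, 128)
    logits = z_ts @ z_im.t() / tau
    targets = torch.arange(len(logits))                 # matched pairs on diag
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(clip_style_loss(torch.randn(4, 12, 10), torch.randn(4, 768)))
```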
[190] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
Chunji Lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Wei Chen, Yinjie Lei, Changsheng Li
Main category: cs.CV
TL;DR: PhysGM is a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling fast 4D simulation and rendering without time-consuming per-scene optimization.
Details
Motivation: Current physics-based 3D motion synthesis methods have three key limitations: 1) reliance on pre-reconstructed 3DGS requiring dense multi-view images and per-scene optimization, 2) physics integration via either inflexible hand-specified attributes or unstable SDS optimization, and 3) naive concatenation of prebuilt 3DGS with physics modules that ignores physical information in appearance.
Method: PhysGM uses a physics-aware reconstruction model pre-trained to directly infer both Gaussian and physical parameters from a single image. It further refines the model with Direct Preference Optimization (DPO) to align simulations with physically plausible reference videos, avoiding high-cost SDS optimization (see the sketch after the abstract). The authors also created PhysAssets, a dataset of 50K+ 3D assets with physical properties and reference videos to support training.
Result: PhysGM produces high-fidelity 4D simulations from a single image in just one minute, achieving significant speedup over prior work while delivering realistic renderings.
Conclusion: PhysGM addresses key limitations in physics-based 3D motion synthesis by enabling fast, feed-forward joint prediction of 3D representation and physical properties from a single image, supported by a novel dataset and efficient optimization approach.
Abstract: Despite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and naive concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that directly infers both Gaussian and physical parameters. We further refine the model with Direct Preference Optimization (DPO), aligning simulations with the physically plausible reference videos and avoiding the high-cost SDS optimization. To address the absence of a supporting dataset for this task, we propose PhysAssets, a dataset of 50K+ 3D assets annotated with physical properties and corresponding reference videos. Experiments show that PhysGM produces high-fidelity 4D simulations from a single image in one minute, achieving a significant speedup over prior work while delivering realistic renderings. Our project page is at: https://hihixiaolv.github.io/PhysGM.github.io/
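The DPO refinement uses the standard preference objective; the sketch below shows that objective with placeholder log-likelihoods, where the "winner" would be the rollout judged closer to a physically plausible reference video (how those log-likelihoods are computed for Gaussian and physics parameters is abstracted away here).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective: raise the policy's margin on the preferred
    (physically plausible) sample relative to a frozen reference model."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# toy usage with fake per-sample log-likelihoods
print(dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
               torch.tensor([-1.5]), torch.tensor([-1.5])))
```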
[191] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation
Vipooshan Vipulananthan, Kumudu Mohottala, Kavindu Chinthana, Nimsara Paramulla, Charith D Chitraranjan
Main category: cs.CV
TL;DR: STAGNet model improves accident prediction from dash-cam videos using enhanced spatio-temporal features and recurrent networks, outperforming previous methods on multiple datasets.
Details
Motivation: Accident prediction is crucial for road safety in ADAS and autonomous vehicles. While existing systems use multiple sensors (LiDAR, radar, GPS), dash-cam videos offer a more cost-effective and easily deployable solution, though more challenging.
Method: Proposes STAGNet model that incorporates improved spatio-temporal features and aggregates them through a recurrent network to enhance state-of-the-art graph neural networks for accident prediction from dash-cam videos.
Result: Experiments on three public datasets (DAD, DoTA, DADA) show STAGNet achieves higher average precision and mean time-to-accident scores than previous methods, both in cross-validation and cross-dataset evaluation.
Conclusion: The proposed STAGNet model effectively improves accident prediction performance from dash-cam videos, offering a practical and cost-effective solution for road safety applications.
Abstract: Accident prediction and timely preventive actions improve road safety by reducing the risk of injury to road users and minimizing property damage. Hence, they are critical components of advanced driver assistance systems (ADAS) and autonomous vehicles. While many existing systems depend on multiple sensors such as LiDAR, radar, and GPS, relying solely on dash-cam videos presents a more challenging, yet more cost-effective and easily deployable solution. In this work, we incorporate improved spatio-temporal features and aggregate them through a recurrent network to enhance state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets (DAD, DoTA and DADA) show that our proposed STAGNet model achieves higher average precision and mean time-to-accident scores than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.
[192] CLAReSNet: When Convolution Meets Latent Attention for Hyperspectral Image Classification
Asmit Bandyopadhyay, Anindita Das Bhattacharjee, Rakesh Das
Main category: cs.CV
TL;DR: CLAReSNet is a hybrid CNN-transformer architecture for hyperspectral image classification that addresses high dimensionality, spectral-spatial correlations, and class imbalance through multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck.
Details
Motivation: HSI classification faces challenges: high spectral dimensionality, complex spectral-spatial correlations, limited training samples with severe class imbalance. CNNs and transformers individually have limitations - CNNs lack long-range dependencies, transformers have quadratic complexity and insufficient inductive biases.
Method: Hybrid architecture integrating multi-scale convolutional extraction with transformer-style attention via adaptive latent bottleneck. Uses multi-scale convolutional stem with deep residual blocks and enhanced Convolutional Block Attention Module for spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA) that reduces complexity from O(T²D) to O(T log(T) D) via adaptive latent token allocation (see the sketch after the abstract).
Result: State-of-the-art performance on Indian Pines (99.71% OA) and Salinas (99.96% OA) datasets, significantly surpassing HybridSN, SSRN, and SpectralFormer. Learned embeddings show superior inter-class separability and compact intra-class clustering, effective under severe class imbalance.
Conclusion: CLAReSNet effectively addresses HSI classification challenges by combining CNNs and transformers through adaptive latent bottleneck, achieving superior performance and robust classification under class imbalance through hierarchical cross-attention fusion.
Abstract: Hyperspectral image (HSI) classification faces critical challenges, including high spectral dimensionality, complex spectral-spatial correlations, and limited training samples with severe class imbalance. While CNNs excel at local feature extraction and transformers capture long-range dependencies, their isolated application yields suboptimal results due to quadratic complexity and insufficient inductive biases. We propose CLAReSNet (Convolutional Latent Attention Residual Spectral Network), a hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck. The model employs a multi-scale convolutional stem with deep residual blocks and an enhanced Convolutional Block Attention Module for hierarchical spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA). MSLA reduces complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ by adaptive latent token allocation (8-64 tokens) that scales logarithmically with the sequence length. Hierarchical cross-attention fusion dynamically aggregates multi-level representations for robust classification. Experiments conducted on the Indian Pines and Salinas datasets show state-of-the-art performance, achieving overall accuracies of 99.71% and 99.96%, significantly surpassing HybridSN, SSRN, and SpectralFormer. The learned embeddings exhibit superior inter-class separability and compact intra-class clustering, validating CLAReSNet’s effectiveness under severe class imbalance.
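The complexity claim is easiest to see in code: cross-attending T tokens to m latents, with m growing roughly like log T and clamped to the paper's 8-64 range, costs O(T·m·D) per direction rather than O(T²·D). The allocation rule and single-scale layout below are my simplifications, not the MSLA layer itself.

```python
import math
import torch

def latent_attention(x, d=64):
    """x: (B, T, d) spectral token sequence."""
    B, T, _ = x.shape
    m = min(64, max(8, int(math.log2(T) * 4)))      # adaptive latent count
    latents = torch.randn(B, m, d) / d**0.5         # stand-in learned latents
    attn1 = torch.softmax(latents @ x.transpose(1, 2) / d**0.5, dim=-1)
    z = attn1 @ x                                   # (B, m, d): read, O(T*m*d)
    attn2 = torch.softmax(x @ z.transpose(1, 2) / d**0.5, dim=-1)
    return attn2 @ z                                # (B, T, d): write back

print(latent_attention(torch.randn(2, 200, 64)).shape)
```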
[193] Edge-Native Digitization of Handwritten Marksheets: A Hybrid Heuristic-Deep Learning Framework
Md. Irtiza Hossain, Junaid Ahmed Sifat, Abir Chowdhury
Main category: cs.CV
TL;DR: A hybrid framework combining OpenCV-based table detection with lightweight YOLOv8 achieves 97.5% accuracy on EMNIST with 95× speedup over standard OCR and real-time 29 FPS on CPU.
Details
Motivation: Digitizing structured handwritten documents like marksheets is challenging due to irregular tables and diverse handwriting. Existing Transformer-based methods (TableNet, TrOCR) are accurate but computationally expensive, making them unsuitable for resource-constrained edge deployments.
Method: A resource-efficient hybrid framework: 1) Heuristic OpenCV-based pipeline for rapid table structure detection, 2) Modified lightweight YOLOv8 architecture for handwritten character recognition by removing SPPF and deep C2f layers from the backbone to reduce computational overhead.
Result: Modified YOLOv8 achieves 97.5% accuracy on EMNIST digit benchmark. Framework offers 95× inference speedup over standard OCR pipelines and massive efficiency gains over LMMs like Qwen2.5-VL, achieving real-time 29 FPS on standard CPU hardware. Robust performance confirmed on AMES dataset (real-world marksheets).
Conclusion: The hybrid framework bridges the gap between high-performance deep learning and practical, scalable document automation by maintaining high recognition fidelity while enabling real-time performance on resource-constrained hardware for structured handwritten document digitization.
Abstract: The digitization of structured handwritten documents, such as academic marksheets, remains a significant challenge due to the dual complexity of irregular table structures and diverse handwriting styles. While recent Transformer-based approaches like TableNet and TrOCR achieve state-of-the-art accuracy, their high computational cost renders them unsuitable for resource-constrained edge deployments. This paper introduces a resource-efficient hybrid framework that integrates a heuristic OpenCV-based pipeline for rapid table structure detection with a modified lightweight YOLOv8 architecture for handwritten character recognition. By strategically removing the SPPF and deep C2f layers from the standard YOLOv8 backbone, we reduce computational overhead while maintaining high recognition fidelity. Experimental results on the EMNIST digit benchmark demonstrate that our Modified YOLOv8 model achieves 97.5% accuracy. Furthermore, we provide a comprehensive efficiency analysis showing that our framework offers a 95 times inference speedup over standard OCR pipelines and massive efficiency gains over emerging Large Multimodal Models (LMMs) like Qwen2.5-VL, achieving real-time performance of 29 FPS on standard CPU hardware. A qualitative and quantitative evaluation on the AMES dataset, a challenging subset of real-world marksheets, confirms the system's robustness in handling mixed alphanumeric content, bridging the gap between high-performance deep learning and practical, scalable document automation.
[194] Human-like Content Analysis for Generative AI with Language-Grounded Sparse Encoders
Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Yingtao Zhu, Ye Zhang, Trang Nguyen, Yih-Chung Tham, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, Dianbo Liu
Main category: cs.CV
TL;DR: LanSE is a new AI content analysis tool that decomposes images into interpretable visual patterns with natural language descriptions, enabling granular analysis of AI-generated content across various domains.
Details
Motivation: Current AI content analysis methods treat images as indivisible wholes, but real-world AI failures often manifest as specific visual patterns that evade holistic detection. There's a need for more granular, decomposed analysis of AI-generated content, especially in high-stakes domains.
Method: Language-Grounded Sparse Encoders (LanSE) decomposes images into interpretable visual patterns with natural language descriptions using interpretability modules and large multimodal models. It automatically identifies visual patterns within data modalities.
Result: LanSE discovered over 5,000 visual patterns with 93% human agreement, provides decomposed evaluation outperforming existing methods, establishes the first systematic evaluation of physical plausibility, and extends to medical imaging settings.
Conclusion: LanSE’s language-grounded pattern extraction capability can be adapted to numerous fields (biology, geography) and other data modalities (protein structures, time series), advancing content analysis for generative AI.
Abstract: The rapid development of generative AI has transformed content creation, communication, and human development. However, this technology raises profound concerns in high-stakes domains, demanding rigorous methods to analyze and evaluate AI-generated content. While existing analytic methods often treat images as indivisible wholes, real-world AI failures generally manifest as specific visual patterns that can evade holistic detection and suit more granular and decomposed analysis. Here we introduce a content analysis tool, Language-Grounded Sparse Encoders (LanSE), which decomposes images into interpretable visual patterns with natural language descriptions. Utilizing interpretability modules and large multimodal models, LanSE can automatically identify visual patterns within data modalities. Our method discovers more than 5,000 visual patterns with 93% human agreement, provides decomposed evaluation outperforming existing methods, establishes the first systematic evaluation of physical plausibility, and extends to medical imaging settings. Our method's capability to extract language-grounded patterns can be naturally adapted to numerous fields, including biology and geography, as well as other data modalities such as protein structures and time series, thereby advancing content analysis for generative AI.
[195] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection
Ryan Banks, Camila Lindoni Azevedo, Hongying Tang, Yunpeng Li
Main category: cs.CV
TL;DR: Hierarchical semantic segmentation framework for dental anatomy improves fine-grained detection and anatomical coherence in panoramic radiographs.
Details
Motivation: Existing hierarchy-aware segmentation methods provide weak supervision through loss functions only. Accurate anatomical structure understanding is essential for reliable dental disease staging, requiring better hierarchical modeling.
Method: Recurrent level-wise prediction with restrictive output heads and top-down feature conditioning using Feature-wise Linear Modulation (FiLM) (see the sketch after the abstract). Probabilistic composition rule enforces parent-child consistency. Hierarchical loss combines per-level Dice, cross entropy, and consistency terms.
Result: Hierarchical variants consistently increase IoU, Dice, and recall (especially for fine-grained anatomies) and produce more anatomically coherent masks. However, they show increased recall over precision (more false positives).
Conclusion: Explicit hierarchical structuring improves both performance and clinical plausibility, particularly in low-data dental imaging regimes.
Abstract: Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. A way of introducing this within semantic segmentation models is by utilising hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child class features are conditioned using Feature-wise Linear Modulation of their parent class probabilities, to modulate child feature spaces for fine grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. Hierarchical loss combines per-level class weighted Dice and cross entropy loss and a consistency term loss, ensuring parent predictions are the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations, of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks. However, hierarchical variants also demonstrated increased recall over precision, implying increased false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low data dental imaging regimes.
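Two of the moving parts, FiLM conditioning of child features on parent probabilities and the probabilistic composition rule, can be sketched compactly; all dimensions and the exact wiring below are assumptions, not the paper's architecture.

```python
import torch

D, P = 32, 4                       # feature channels, parent classes
film = torch.nn.Linear(P, 2 * D)   # parent probabilities -> (gamma, beta)

def condition_child(feat, parent_prob):
    """feat: (B, D, H, W) child-level features; parent_prob: (B, P) pooled
    parent-class probabilities from the previous hierarchy level."""
    gamma, beta = film(parent_prob).chunk(2, dim=-1)
    return gamma[:, :, None, None] * feat + beta[:, :, None, None]

def compose_children(parent_prob_map, child_logits):
    """parent_prob_map: (B, H, W); child_logits: (B, C, H, W). Children
    split the parent's probability mass, so they sum to it exactly."""
    within_parent = torch.softmax(child_logits, dim=1)
    return parent_prob_map.unsqueeze(1) * within_parent

feat = condition_child(torch.randn(2, D, 8, 8),
                       torch.softmax(torch.randn(2, P), dim=-1))
parent = torch.rand(2, 8, 8)
probs = compose_children(parent, torch.randn(2, 3, 8, 8))
print(feat.shape, torch.allclose(probs.sum(dim=1), parent))  # True
```

Composing probabilities this way makes the parent-child consistency hold by construction, which is the point of the consistency term being easy to satisfy.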
[196] SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing
Chaolei Wang, Yang Luo, Jing Du, Siyu Chen, Yiping Chen, Ting Han
Main category: cs.CV
TL;DR: SGS-3D is a training-free refinement framework for 3D instance segmentation that splits ambiguous 2D-to-3D lifted masks using geometric primitives, then grows them into complete instances by fusing semantic and geometric information.
Details
Motivation: Existing 2D-to-3D lifting approaches for 3D instance segmentation suffer from accumulated errors due to ambiguous semantic guidance and insufficient depth constraints, leading to imprecise instance-level segmentation.
Method: Proposes a "split-then-grow" framework: 1) Purifies and splits ambiguous lifted masks using geometric primitives with a mask filtering strategy based on 3D geometry co-occurrence, 2) Grows masks into complete instances by exploiting spatial continuity and high-level features, particularly for semantically ambiguous objects.
Result: Experimental results on ScanNet200, ScanNet++, and KITTI-360 show substantial improvements in segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances with strong generalization across indoor and outdoor environments.
Conclusion: SGS-3D effectively addresses the limitations of 2D-to-3D lifting approaches by jointly fusing semantic and geometric information, providing a training-free refinement method that produces accurate 3D instance segmentation across diverse environments.
Abstract: Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggle to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic mask for high-fidelity 3D instance segmentation (SGS-3D), a novel “split-then-grow” framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available at https://github.com/wangchaolei7/SGS-3D.
[197] SportsGPT: An LLM-driven Framework for Interpretable Sports Motion Assessment and Training Guidance
Wenbo Tian, Ruting Lin, Hongxian Zheng, Yaodong Yang, Geng Wu, Zihao Zhang, Zhang Zhang
Main category: cs.CV
TL;DR: SportsGPT is an LLM-driven framework for interpretable sports motion assessment and training guidance that establishes a closed loop from motion input to professional guidance using MotionDTW alignment, KISMAM assessment, and SportsRAG guidance generation.
Details
Motivation: Existing sports analysis systems focus mainly on scoring and visualization but lack automatic performance diagnosis and interpretable training guidance. Recent advances in LLMs and motion analysis provide opportunities to address these limitations.
Method: 1) MotionDTW: Two-stage time series alignment algorithm for accurate keyframe extraction from skeleton-based motion sequences (see the DTW sketch after the abstract). 2) KISMAM: Knowledge-based Interpretable Sports Motion Assessment Model that contrasts keyframes with target models to obtain interpretable metrics. 3) SportsRAG: RAG-based training guidance model built on Qwen3 with 6B-token knowledge base that retrieves domain-specific QA pairs to generate professional guidance.
Result: MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Ablation studies validate KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
Conclusion: SportsGPT successfully establishes a closed-loop framework for interpretable sports motion assessment and training guidance, leveraging LLMs and motion analysis to provide professional, interpretable feedback that addresses limitations of existing systems.
Abstract: Existing intelligent sports analysis systems mainly focus on “scoring and visualization,” often lacking automatic performance diagnosis and interpretable training guidance. Recent advances in Large Language Models (LLMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by contrasting the keyframes with the target models. Finally, we propose SportsRAG, a RAG-based training guidance model built upon Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate the KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
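MotionDTW builds on dynamic time warping; as a reference point, plain DTW over per-frame skeleton distances is sketched below (the paper's two-stage refinement is not reproduced, and the joint layout is assumed).

```python
import numpy as np

def dtw(seq_a, seq_b):
    """seq_a: (Ta, J) and seq_b: (Tb, J) flattened skeleton frames.
    Returns the accumulated-cost matrix; backtracking it gives the frame
    correspondences used to transfer keyframes from a target model."""
    Ta, Tb = len(seq_a), len(seq_b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

print(dtw(np.random.rand(30, 34), np.random.rand(40, 34))[-1, -1])
```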
[198] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Main category: cs.CV
TL;DR: ZeroPlantSeg enables zero-shot hierarchical segmentation of rosette-shaped plant individuals from top-view images by combining foundation segmentation models for leaves with vision-language models for plant structure reasoning, outperforming existing zero-shot methods and achieving better cross-domain performance than supervised approaches.
Details
Motivation: While foundation segmentation models can extract leaf instances zero-shot, segmenting entire plant individuals with multiple overlapping leaves (hierarchical segmentation) remains challenging and typically requires annotated training datasets that are species-specific and labor-intensive to create.
Method: Integrates a foundation segmentation model to extract leaf instances with a vision-language model that reasons about plant structures to extract complete plant individuals, all without additional training (zero-shot).
Result: The method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods when evaluated on datasets with multiple plant species, growth stages, and shooting environments.
Conclusion: ZeroPlantSeg provides an effective zero-shot solution for hierarchical plant segmentation that eliminates the need for annotated training data while maintaining strong performance across diverse conditions.
Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants’ structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
[199] Enhancing Blind Face Restoration through Online Reinforcement Learning
Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang
Main category: cs.CV
TL;DR: LRPO is a reinforcement learning framework for blind face restoration that uses likelihood regularization to improve restoration quality while balancing perceptual quality and fidelity.
Details
Motivation: Blind Face Restoration (BFR) faces challenges with large solution space leading to artifacts like missing details and identity ambiguity. Existing methods struggle to balance perceptual quality and fidelity in restoration results.
Method: Proposes Likelihood-Regularized Policy Optimization (LRPO) framework using online RL for BFR (see the sketch after the abstract). Includes: 1) composite reward function for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment to balance perceptual quality and fidelity.
Result: Extensive experiments show LRPO significantly improves face restoration quality over baseline methods and achieves state-of-the-art performance.
Conclusion: LRPO successfully applies reinforcement learning to blind face restoration with novel regularization strategies, effectively addressing the challenges of large solution space while balancing perceptual quality and fidelity.
Abstract: Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.
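A minimal sketch of what a likelihood-regularized policy objective could look like follows; the composite reward, the group-relative baseline, and the ground-truth likelihood term below are stand-ins for the paper's three strategies, with all details assumed.

```python
import torch

def lrpo_loss(logp_samples, rewards, logp_ground_truth, lam=1.0):
    """logp_samples: (N,) policy log-likelihoods of sampled restorations;
    rewards: (N,) composite restoration rewards; logp_ground_truth: scalar
    policy log-likelihood of the ground-truth image (fidelity anchor)."""
    adv = rewards - rewards.mean()              # group-relative advantage
    policy_term = -(adv * logp_samples).mean()  # up-weight high-reward samples
    regularizer = -lam * logp_ground_truth      # keep the ground truth likely
    return policy_term + regularizer

lp = torch.randn(4, requires_grad=True)
gt = torch.randn((), requires_grad=True)
lrpo_loss(lp, torch.tensor([0.2, 0.8, 0.5, 0.9]), gt).backward()
print(lp.grad, gt.grad)
```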
[200] Human Mesh Modeling for Anny Body
Romain Brégier, Guénolé Fiche, Laura Bravo-Sánchez, Thomas Lucas, Matthieu Armando, Philippe Weinzaepfel, Grégory Rogez, Fabien Baradel
Main category: cs.CV
TL;DR: Anny is an open-source, scan-free human body model based on anthropometric knowledge that provides semantic control over human shape variation across demographics, calibrated with WHO statistics.
Details
Motivation: Existing parametric body models rely on costly 3D scans and proprietary shape spaces that are demographically narrow, limiting accessibility and diversity in human-centric 3D modeling tasks.
Method: Anny creates a fully differentiable body model grounded in anthropometric knowledge from the MakeHuman community, using phenotype parameters (gender, age, height, weight) to control blendshapes across ages and body types, calibrated with WHO population statistics (see the sketch after the abstract).
Result: Anny provides realistic demographically-grounded shape variation, supports millimeter-accurate scan fitting, synthetic data generation, and HMR. Anny-One dataset of 800k images shows HMR models trained with Anny match performance of scan-based models.
Conclusion: Anny offers an accessible, open-source foundation for human-centric 3D modeling with semantic control and demographic diversity, released under Apache 2.0 license to democratize human shape modeling.
Abstract: Parametric body models provide the structural basis for many human-centric tasks, yet existing models often rely on costly 3D scans and learned shape spaces that are proprietary and demographically narrow. We introduce Anny, a simple, fully differentiable, and scan-free human body model grounded in anthropometric knowledge from the MakeHuman community. Anny defines a continuous, interpretable shape space, where phenotype parameters (e.g. gender, age, height, weight) control blendshapes spanning a wide range of human forms: across ages (from infants to elders), body types, and proportions. Calibrated using WHO population statistics, it provides realistic and demographically grounded human shape variation within a single unified model. Thanks to its openness and semantic control, Anny serves as a versatile foundation for 3D human modeling, supporting millimeter-accurate scan fitting, controlled synthetic data generation, and Human Mesh Recovery (HMR). We further introduce Anny-One, a collection of 800k photorealistic images generated with Anny, showing that despite its simplicity, HMR models trained with Anny can match the performance of those trained with scan-based body models. The Anny body model and its code are released under the Apache 2.0 license, making Anny an accessible foundation for human-centric 3D modeling.
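The blendshape mechanics are simple to sketch: a template mesh plus phenotype-weighted vertex offsets, fully differentiable end to end. Everything below (vertex count, the identity mapping from phenotype parameters to blendshape weights) is a toy assumption; the real model derives its weights from calibrated phenotype curves.

```python
import torch

V = 1000                               # vertices of a toy template mesh
template = torch.randn(V, 3)
blendshapes = torch.randn(4, V, 3)     # per-phenotype vertex offsets

def anny_like_mesh(gender, age, height, weight):
    """All phenotype inputs normalized to [0, 1]; returns (V, 3) vertices.
    Here phenotypes map one-to-one onto blendshape weights for simplicity."""
    w = torch.stack([gender, age, height, weight])      # (4,) weights
    return template + torch.einsum("k,kvd->vd", w, blendshapes)

mesh = anny_like_mesh(torch.tensor(0.5), torch.tensor(0.3),
                      torch.tensor(0.7), torch.tensor(0.4))
print(mesh.shape)  # torch.Size([1000, 3])
```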
[201] CD-DPE: Dual-Prompt Expert Network Based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution
Xianming Gu, Lihui Wang, Ying Cao, Zeyu Deng, Yingfeng Ou, Guodong Hu, Yi Chen
Main category: cs.CV
TL;DR: CD-DPE: A dual-prompt expert network with convolutional dictionary feature decoupling for multi-contrast MRI super-resolution, using frequency and routing prompts to better integrate reference image features.
Details
Motivation: Multi-contrast MRI super-resolution aims to reconstruct high-resolution images from low-resolution scans using reference images with different contrasts. However, contrast disparities between modalities make it challenging to effectively utilize reference image textures, often leading to suboptimal feature integration in reconstruction.
Method: Proposes CD-DPE with two key components: 1) Iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, reducing redundancy and interference. 2) Dual-prompt feature fusion expert module (DP-FFEM) using frequency prompt to select relevant reference features and adaptive routing prompt to determine optimal fusion method for reference and target features.
Result: Extensive experiments on public multi-contrast MRI datasets show CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additional experiments on unseen datasets demonstrate strong generalization capabilities.
Conclusion: The proposed CD-DPE framework effectively addresses contrast disparity challenges in multi-contrast MRI super-resolution through feature decoupling and dual-prompt fusion, achieving superior reconstruction quality and generalization performance.
Abstract: Multi-contrast magnetic resonance imaging (MRI) super-resolution aims to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrast disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additional experiments on unseen datasets demonstrate that CD-DPE exhibits strong generalization capabilities.
[202] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation
Kaixing Yang, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jun He, Hongyan Liu
Main category: cs.CV
TL;DR: FlowerDance is an efficient music-to-dance generation model that combines MeanFlow with physical constraints for high-quality motion generation in few sampling steps, using a BiMamba-based architecture for fast non-autoregressive generation.
Details
Motivation: Existing music-to-dance generation methods have limited generation efficiency, leaving insufficient computational resources for high-fidelity 3D rendering, which constrains the expressiveness of 3D characters in real-world applications.
Method: FlowerDance combines MeanFlow with Physical Consistency Constraints for high-quality motion with few sampling steps, uses a BiMamba-based backbone with Channel-Level Cross-Modal Fusion for efficient non-autoregressive generation, and supports interactive motion editing.
Result: Extensive experiments on AIST++ and FineDance datasets show FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency (inference speed and memory utilization).
Conclusion: FlowerDance addresses the efficiency bottleneck in music-to-dance generation, enabling high-quality motion with physical plausibility and artistic expressiveness while maintaining computational efficiency for real-world applications.
Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency in inference speed and memory utilization. Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with a BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance in an efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance. Project page: https://flowerdance25.github.io/ .
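For intuition on why MeanFlow permits few sampling steps, here is a schematic sampler under our reading of mean-flow models: the network is assumed to predict the average velocity over a time interval, so a single update can span a long stretch of the flow. This is a generic sketch, not FlowerDance's code.

```python
# Schematic few-step sampler for a mean-flow model (our paraphrase):
# u_net(z, r, t) is assumed to return the average velocity between
# times r < t, giving the jump z_r = z_t - (t - r) * u(z_t, r, t).
import torch

def sample_meanflow(u_net, shape, steps: int = 2):
    z = torch.randn(shape)                    # start from noise at t = 1
    times = torch.linspace(1.0, 0.0, steps + 1)
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u_net(z, r, t)      # one long jump per step
    return z                                  # approximate sample at t = 0

# Toy average-velocity net so the sketch runs end to end.
u_net = lambda z, r, t: 0.5 * z
motion = sample_meanflow(u_net, (1, 60, 156), steps=2)  # e.g. 60 pose frames
```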
[203] Equivariant symmetry-aware head pose estimation for fetal MRI
Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Polina Golland, Benjamin Billot
Main category: cs.CV
TL;DR: E(3)-Pose is a fast pose estimation method that models rotation equivariance and object symmetry to estimate fetal head pose in MRI scans, enabling automatic adaptive slice prescription.
Details
Motivation: The paper addresses the challenging problem of accounting for fetal head motion during diagnostic MRI scans. The goal is to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes acquired before each 2D slice.
Method: E(3)-Pose jointly and explicitly models rotation equivariance and object symmetry by construction. It captures anatomical symmetries and rigid pose equivariance to yield robust estimates of fetal head pose from 3D MRI volumes.
Result: Experiments on publicly available and representative clinical fetal MRI datasets demonstrate superior robustness and generalization across domains. E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes.
Conclusion: The method’s explicit modeling of rotation equivariance and object symmetry enables robust fetal head pose estimation, paving the way for clinical translation in adaptive MRI slice prescription. Implementation is publicly available.
Abstract: We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.
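A small NumPy/SciPy demonstration of the equivariance property at the heart of the method, using PCA's principal axis as a stand-in for an equivariant pose estimator (our toy example, not E(3)-Pose itself):

```python
# Rotation equivariance: for an equivariant estimator f, f(R @ x) == R @ f(x).
# Here f extracts the principal axis of a point cloud via SVD, which is
# equivariant up to sign.
import numpy as np
from scipy.spatial.transform import Rotation

def principal_axis(points: np.ndarray) -> np.ndarray:
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit vector along the dominant direction

rng = np.random.default_rng(1)
cloud = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])  # elongated blob
R = Rotation.random(random_state=0).as_matrix()

axis_then_rotate = R @ principal_axis(cloud)
rotate_then_axis = principal_axis(cloud @ R.T)

# Equal up to sign, so |cosine| should be ~1.
print(abs(axis_then_rotate @ rotate_then_axis))
```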
[204] CLUENet: Cluster Attention Makes Neural Networks Have Eyes
Xiangshuai Song, Jun-Jie Huang, Tianrui Liu, Ke Liang, Chang Tang
Main category: cs.CV
TL;DR: CLUENet is a transparent deep architecture for visual semantic understanding that combines clustering paradigms with attention mechanisms to achieve better accuracy, efficiency, and interpretability than existing methods.
Details
Motivation: Convolution- and attention-based models have rigid receptive fields and complex architectures that limit their ability to model irregular spatial patterns and hinder interpretability, which is problematic for tasks requiring high transparency. Clustering paradigms offer interpretability and flexible semantic modeling but suffer from limited accuracy, low efficiency, and gradient vanishing during training.
Method: Proposes CLUENet with three key innovations: (1) Global Soft Aggregation and Hard Assignment with Temperature-Scaled Cosine Attention and gated residual connections for enhanced local modeling, (2) inter-block Hard and Shared Feature Dispatching, and (3) an improved cluster pooling strategy.
Result: Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
Conclusion: CLUENet successfully addresses the limitations of both traditional vision models and clustering paradigms, providing a transparent deep architecture that achieves superior performance while maintaining interpretability for visual semantic understanding tasks.
Abstract: Despite the success of convolution- and attention-based models in vision tasks, their rigid receptive fields and complex architectures limit their ability to model irregular spatial patterns and hinder interpretability, therefore posing challenges for tasks requiring high model transparency. Clustering paradigms offer promising interpretability and flexible semantic modeling, but suffer from limited accuracy, low efficiency, and gradient vanishing during training. To address these issues, we propose CLUster attEntion Network (CLUENet), a transparent deep architecture for visual semantic understanding. We propose three key innovations: (i) a Global Soft Aggregation and Hard Assignment with a Temperature-Scaled Cosine Attention and gated residual connections for enhanced local modeling, (ii) inter-block Hard and Shared Feature Dispatching, and (iii) an improved cluster pooling strategy. These enhancements significantly improve both classification performance and visual interpretability. Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
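A minimal PyTorch sketch of temperature-scaled cosine attention as we read the description above (not the official CLUENet code):

```python
# Temperature-scaled cosine attention: queries and keys are L2-normalized
# so logits are cosine similarities, then sharpened by a learnable
# temperature before the softmax.
import torch
import torch.nn.functional as F

def cosine_attention(q, k, v, log_tau):
    # q, k: (B, N, D); v: (B, N, Dv); log_tau: learnable scalar.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = (q @ k.transpose(-2, -1)) / log_tau.exp().clamp(min=1e-2)
    return F.softmax(logits, dim=-1) @ v

B, N, D = 2, 16, 32
q, k, v = (torch.randn(B, N, D) for _ in range(3))
out = cosine_attention(q, k, v, log_tau=torch.tensor(-1.0))
print(out.shape)  # torch.Size([2, 16, 32])
```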
[205] SSCATeR: Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling for Real-Time 3D Object Detection in LiDAR Point Clouds
Alexander Dow, Manduhu Manduhu, Matheus Santos, Ben Bartlett, Gerard Dooly, James Riordan
Main category: cs.CV
TL;DR: SSCATeR uses LiDAR’s continuous scanning motion to focus object detection only on changing regions between frames, achieving 6.61x speedup while maintaining accuracy through temporal data recycling and sparse scatter-based convolutions.
Details
Motivation: Traditional LiDAR object detection processes entire point clouds each frame, wasting computation on static regions. The authors aim to exploit the temporal continuity of LiDAR scanning to reduce redundant computations by focusing only on changing areas.
Method: Proposes SSCATeR (Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling) that uses sliding time windows with short strides, stores convolution results between passes, and extends scatter-based convolutions to enable data reuse. The method treats LiDAR as a continuous stream and processes only changing parts of point clouds.
Result: Achieves identical feature maps to traditional sparse convolution methods but with up to 6.61-fold reduction in processing time. The method maintains accuracy while significantly reducing convolution operations per forward pass.
Conclusion: SSCATeR successfully exploits sparsity in LiDAR data through temporal data recycling, enabling efficient real-time object detection by focusing computation only on changing regions while maintaining detection accuracy.
Abstract: This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and consider the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.
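The recycling idea can be illustrated with a toy dense 2D "convolution" and an explicit cache; the kernel, grid, and masks below are all made up, and the real system operates on sparse scatter-based convolutions.

```python
# Conceptual sketch of temporal convolution-result recycling (not the
# SSCATeR implementation): outputs are cached per spatial cell and only
# cells whose receptive field saw new points are recomputed.
import numpy as np

def conv_cell(grid, i, j):
    return grid[i-1:i+2, j-1:j+2].sum()  # stand-in 3x3 convolution

def recycle_pass(grid, changed_mask, cache):
    h, w = grid.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            # A cell must be redone if anything inside its 3x3 receptive
            # field changed since the last sweep.
            if changed_mask[i-1:i+2, j-1:j+2].any() or (i, j) not in cache:
                cache[(i, j)] = conv_cell(grid, i, j)   # recompute
            # else: reuse cache[(i, j)] from the previous sweep
    return cache

rng = np.random.default_rng(2)
grid = rng.random((32, 32))
cache = recycle_pass(grid, np.ones_like(grid, dtype=bool), {})  # first frame
grid[5:8, 5:8] += 1.0                                           # new points
changed = np.zeros_like(grid, dtype=bool)
changed[5:8, 5:8] = True
cache = recycle_pass(grid, changed, cache)                      # cheap update
```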
[206] In-Context Learning for Seismic Data Processing
Fabian Fuchs, Mario Ruben Fernandez, Norman Ettrich, Janis Keuper
Main category: cs.CV
TL;DR: ContextSeisNet uses in-context learning for seismic demultiple processing, conditioning predictions on neighboring gather examples to achieve spatially consistent results with user control and data efficiency.
Details
Motivation: Traditional seismic processing methods face challenges with noisy data and manual parameter tuning, while existing deep learning approaches suffer from spatially inconsistent results across neighboring seismic gathers and lack user control.
Method: ContextSeisNet is an in-context learning model that conditions predictions on a support set of spatially related example pairs - neighboring common-depth point gathers and their corresponding labels. This allows task-specific learning at inference time without retraining.
Result: On synthetic data, ContextSeisNet outperforms U-Net baseline quantitatively with enhanced spatial coherence. On field data, it achieves superior lateral consistency compared to traditional Radon demultiple and U-Net, with improved near-offset performance and more complete multiple removal. It achieves comparable performance with 90% less training data.
Conclusion: ContextSeisNet establishes a practical approach for spatially consistent seismic demultiple with user control and data efficiency, with potential applicability to other seismic processing tasks.
Abstract: Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges such as noisy data and manual parameter tuning, among others. Recently, deep learning approaches have been proposed as alternative solutions to some of these problems. However, important challenges of existing deep learning approaches are spatially inconsistent results across neighboring seismic gathers and lack of user control. We address these limitations by introducing ContextSeisNet, an in-context learning model, to seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.
[207] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets
Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, Changxin Gao
Main category: cs.CV
TL;DR: Nano Banana Pro shows strong subjective visual quality in zero-shot low-level vision tasks but underperforms in traditional quantitative metrics compared to specialist models.
Details
Motivation: To explore whether commercial text-to-image models like Nano Banana Pro can serve as generalist solvers for traditional low-level vision tasks, which remains largely underexplored.
Method: Comprehensive zero-shot evaluation across 14 low-level vision tasks spanning 40 diverse datasets using simple textual prompts without fine-tuning, benchmarking against state-of-the-art specialist models.
Result: Performance dichotomy: Nano Banana Pro demonstrates superior subjective visual quality (hallucinating plausible high-frequency details) but lags behind in traditional reference-based quantitative metrics due to inherent stochasticity and pixel-level consistency issues.
Conclusion: Nano Banana Pro is a capable zero-shot contender for low-level vision tasks, but achieving the high fidelity of domain specialists remains a significant challenge due to generative model limitations.
Abstract: The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
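For reference, the reference-based metrics that generative restorers struggle with reduce to simple pixel statistics; a PSNR sketch with placeholder data (the report's datasets and models are not reproduced here):

```python
# PSNR between a restoration and its ground truth: the kind of strict
# pixel-level metric that penalizes hallucinated high-frequency detail.
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val**2 / mse)

# Toy example: score a noisy image against its clean counterpart.
rng = np.random.default_rng(3)
clean = rng.random((64, 64, 3))
noisy = np.clip(clean + 0.1 * rng.normal(size=clean.shape), 0, 1)
print(f"PSNR of the degraded input: {psnr(noisy, clean):.2f} dB")
```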
[208] 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding
Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni
Main category: cs.CV
TL;DR: A lightweight 3D animation framework that decouples geometric control from appearance synthesis using 2D-3D aligned proxy representation, enabling efficient animation with 3D control on low-power platforms.
Details
Motivation: Traditional 3D animation pipelines are labor-intensive and computationally expensive. Recent AIGC approaches either inherit heavy costs or sacrifice 3D controllability. There's a fundamental trade-off between rendering quality and 3D control in single-image 3D animation generation.
Method: Proposes a lightweight framework using 2D-3D aligned proxy representation: coarse 3D estimate as structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This decouples geometric control from appearance synthesis.
Result: Achieves efficient animation generation on low-power platforms. Outperforms video-based 3D animation generation in identity preservation, geometric/textural consistency, and precise interactive control. Enables 3D-aware motion control and coherent background animation.
Conclusion: The proposed framework successfully addresses the quality-control trade-off in 3D animation by separating geometry and appearance synthesis, offering practical 3D control without expensive optimization while maintaining high visual quality.
Abstract: 3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.
[209] Step-GUI Technical Report
Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yifan Sui, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zihan Yan, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang
Main category: cs.CV
TL;DR: A self-evolving GUI automation pipeline with calibrated reward system produces high-quality training data at low cost, enabling state-of-the-art GUI agents (Step-GUI models) with standardized privacy-preserving interfaces (GUI-MCP) and real-world evaluation benchmarks (AndroidDaily).
Details
Motivation: The paper addresses two key challenges in GUI automation: 1) the difficulty of acquiring high-quality training data efficiently while maintaining annotation reliability, and 2) the need for practical deployment with standardized interfaces across heterogeneous devices while protecting user privacy.
Method: The authors introduce a self-evolving training pipeline powered by a Calibrated Step Reward System that converts model-generated trajectories into reliable training signals through trajectory-level calibration. They develop Step-GUI models (4B/8B parameters) and propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture combining low-level atomic operations and high-level task delegation to local specialist models.
Result: The pipeline achieves >90% annotation accuracy with 10-100x lower cost. Step-GUI models achieve state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. The AndroidDaily benchmark shows strong performance (8B: static 89.91%, end-to-end 52.50%).
Conclusion: The work advances practical GUI agent development with a cost-effective training pipeline, high-performing models, privacy-preserving standardized interfaces, and real-world evaluation benchmarks, demonstrating strong potential for real-world deployment in everyday digital interactions.
Abstract: Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
[210] Stylized Synthetic Augmentation further improves Corruption Robustness
Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock, Andrey Morozov
Main category: cs.CV
TL;DR: Combining synthetic image data with neural style transfer improves deep vision model robustness against common corruptions, achieving state-of-the-art results on corruption benchmarks.
Details
Motivation: Deep vision models are vulnerable to common corruptions, and existing data augmentation methods need improvement for better corruption robustness.
Method: Training data augmentation pipeline combining synthetic image data with neural style transfer, systematically analyzing augmentation effects and hyperparameters, and integrating with rule-based techniques like TrivialAugment.
Result: Achieved state-of-the-art corruption robustness: 93.54% on CIFAR-10-C, 74.9% on CIFAR-100-C, and 50.86% on TinyImageNet-C. Stylization and synthetic data complement each other despite FID degradation.
Conclusion: The combination of synthetic data and neural style transfer effectively enhances model corruption robustness, working well with certain rule-based augmentations but not others, providing a practical approach for improving vision model reliability.
Abstract: This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common Fréchet Inception Distance (FID) metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.
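A hedged sketch of the pipeline's shape using torchvision's TrivialAugmentWide; the `stylize` step is a placeholder for an off-the-shelf neural style transfer model and is not code from the paper.

```python
# Sketch of a stylize-then-augment pipeline: synthetic images are passed
# through a (placeholder) style-transfer step, then a rule-based augmenter.
import torch
from torchvision import transforms

def stylize(img: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real pipeline would run AdaIN-style transfer against
    # a randomly drawn style image here.
    return img

augment = transforms.Compose([
    transforms.Lambda(stylize),
    transforms.TrivialAugmentWide(),  # rule-based augmentation that combines well
])

synthetic_batch = torch.randint(0, 256, (4, 3, 32, 32), dtype=torch.uint8)
augmented = torch.stack([augment(img) for img in synthetic_batch])
print(augmented.shape)  # torch.Size([4, 3, 32, 32])
```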
[211] DeContext as Defense: Safe Image Editing in Diffusion Transformers
Linghui Shen, Mingyue Cui, Xingyi Yang
Main category: cs.CV
TL;DR: DeContext is a defense method that uses targeted perturbations to block unauthorized in-context image editing by weakening cross-attention pathways in diffusion models.
Details
Motivation: In-context diffusion models enable powerful image editing but raise serious privacy concerns as personal images can be easily manipulated without consent for identity impersonation, misinformation, or malicious uses. Prior defenses focused on personalized text-to-image generation, but the robustness of modern large-scale DiT-based models remains unexamined.
Method: DeContext injects small, targeted perturbations that weaken multimodal attention layers where contextual information propagates from source to output. The method identifies that early denoising steps and specific transformer blocks dominate context propagation, allowing concentration of perturbations where they matter most to break the cross-attention pathways.
Result: Experiments on Flux Kontext and Step1X-Edit show DeContext consistently blocks unwanted image edits while preserving visual quality. The method demonstrates effectiveness of attention-based perturbations as a defense against image manipulation.
Conclusion: DeContext provides an efficient and robust defense mechanism against unauthorized in-context image editing by targeting the attention mechanisms that enable context propagation in diffusion models, offering protection against privacy violations and malicious image manipulation.
Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner’s consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decouples the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation. Code is available at https://github.com/LinghuiiShen/DeContext.
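Defenses of this kind are typically implemented as projected gradient ascent on some disruption objective; a generic PGD-style sketch with a toy objective follows (DeContext's actual attention-based loss is more involved):

```python
# Generic PGD-style sketch of the protect-by-perturbation idea: search for
# a small perturbation delta that maximizes a disruption objective under
# an L-infinity budget, so the image stays visually unchanged.
import torch

def protect(image, disruption_loss, eps=8/255, alpha=2/255, steps=10):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = disruption_loss(image + delta)  # e.g. weakened cross-attention
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the objective
            delta.clamp_(-eps, eps)             # stay imperceptible
            delta.grad.zero_()
    return (image + delta).detach()

# Toy stand-in objective so the sketch runs: push pixels away from mid-gray.
image = torch.rand(1, 3, 64, 64)
toy_loss = lambda x: ((x - 0.5) ** 2).mean()
protected = protect(image, toy_loss)
```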
[212] A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry
Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano, Dafna Ben Bashat, Francisco Vasconcelos, Danail Stoyanov, Leo Joskowicz, Sophia Bano
Main category: cs.CV
TL;DR: First open multi-center, multi-device benchmark dataset for fetal ultrasound biometry with expert landmark annotations covering all primary fetal measurements, enabling reproducible AI development.
Details
Motivation: Manual landmarking in fetal ultrasound is time-consuming, operator-dependent, and variable across scanners/sites, limiting reproducibility of automated approaches. Need multi-source annotated datasets for AI-assisted fetal growth assessment.
Method: Created open benchmark dataset with 4,513 de-identified US images from 1,904 subjects across 3 clinical sites using 7 different US devices. Provided standardized subject-disjoint train/test splits, evaluation code, and baseline results. Used automatic biometry model to quantify domain shift.
Result: Dataset enables fair comparison of methods. Domain shift analysis shows single-center training/evaluation substantially overestimates performance compared to multi-center testing. First publicly available multi-center, multi-device landmark-annotated dataset covering all primary fetal biometry measures.
Conclusion: Provides robust benchmark for domain adaptation and multi-center generalization in fetal biometry, enabling more reliable AI-assisted fetal growth assessment across clinical centers. All data, annotations, code, and pipelines made publicly available.
Abstract: Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset comprises 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
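Landmark-based biometry itself is simple arithmetic; a minimal example with hypothetical landmark coordinates and pixel spacing:

```python
# A measurement such as the bi-parietal diameter is the distance between
# two annotated landmarks, scaled by the scanner's pixel spacing.
import numpy as np

def measurement_mm(p1, p2, pixel_spacing_mm):
    """Euclidean distance between two (row, col) landmarks in millimeters."""
    return float(np.linalg.norm((np.asarray(p1) - np.asarray(p2))
                                * np.asarray(pixel_spacing_mm)))

# Hypothetical landmarks on a head standard plane (row, col) at 0.2 mm/px.
bpd = measurement_mm((120, 96), (120, 402), (0.2, 0.2))
print(f"bi-parietal diameter ~ {bpd:.1f} mm")  # ~61.2 mm
```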
cs.AI
[213] Navigating Taxonomic Expansions of Entity Sets Driven by Knowledge Bases
Pietro Cofone, Giovanni Amendola, Marco Manna, Aldo Ricioppo
Main category: cs.AI
TL;DR: The paper introduces efficient reasoning tasks for navigating expansion graphs without full materialization, enabling practical entity set expansion with taxonomic structures.
Details
Motivation: Traditional linear entity set expansion doesn't capture rich taxonomic structures from knowledge bases. While expansion graphs provide this structure, their full materialization is impractical for real-world scenarios due to potentially large size.
Method: Formalize reasoning tasks to check whether two tuples belong to comparable, incomparable, or the same nodes in the expansion graph. Implement these tasks efficiently under realistic assumptions like bounding input or limiting entity descriptions.
Result: Under realistic assumptions, the reasoning tasks can be implemented efficiently, enabling local, incremental navigation of expansion graphs without requiring full graph construction.
Conclusion: The approach supports practical applications of taxonomic entity set expansion by allowing efficient local navigation of expansion graphs, overcoming the impracticality of full graph materialization in real-world scenarios.
Abstract: Recognizing similarities among entities is central to both human cognition and computational intelligence. Within this broader landscape, Entity Set Expansion is one prominent task aimed at taking an initial set of (tuples of) entities and identifying additional ones that share relevant semantic properties with the former – potentially repeating the process to form increasingly broader sets. However, this "linear" approach does not unveil the richer "taxonomic" structures present in knowledge resources. A recent logic-based framework introduces the notion of an expansion graph: a rooted directed acyclic graph where each node represents a semantic generalization labeled by a logical formula, and edges encode strict semantic inclusion. This structure supports taxonomic expansions of entity sets driven by knowledge bases. Yet, the potentially large size of such graphs may make full materialization impractical in real-world scenarios. To overcome this, we formalize reasoning tasks that check whether two tuples belong to comparable, incomparable, or the same nodes in the graph. Our results show that, under realistic assumptions – such as bounding the input or limiting entity descriptions – these tasks can be implemented efficiently. This enables local, incremental navigation of expansion graphs, supporting practical applications without requiring full graph construction.
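The comparability queries can be pictured on a small materialized graph, even though the paper's point is answering them without materialization; a toy sketch with a made-up adjacency list:

```python
# Comparability in a rooted DAG: two nodes are "comparable" if one is
# reachable from the other (strict semantic inclusion), "same node" if
# equal, and "incomparable" otherwise. Graph and node ids are invented.
from functools import lru_cache

graph = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}

def reachable(src: str, dst: str) -> bool:
    @lru_cache(maxsize=None)
    def dfs(u: str) -> bool:
        return u == dst or any(dfs(v) for v in graph[u])
    return dfs(src)

def relation(u: str, v: str) -> str:
    if u == v:
        return "same node"
    if reachable(u, v) or reachable(v, u):
        return "comparable"   # one strictly generalizes the other
    return "incomparable"

print(relation("a", "c"))  # comparable
print(relation("a", "b"))  # incomparable
```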
[214] Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai
Main category: cs.AI
TL;DR: The paper introduces a framework for Scientific General Intelligence (SGI) using the Practical Inquiry Model, creates SGI-Bench with 1,000+ cross-disciplinary samples, evaluates LLMs revealing significant gaps, and proposes Test-Time Reinforcement Learning to enhance hypothesis novelty.
Details
Motivation: Despite advances in scientific AI, there's no coherent framework for Scientific General Intelligence (SGI) - the ability to autonomously conceive, investigate, and reason across scientific domains. Current AI lacks systematic evaluation for genuine scientific discovery participation.
Method: 1) Define SGI using Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) operationalized via four scientist-aligned tasks. 2) Create SGI-Bench with 1,000+ expert-curated cross-disciplinary samples from Science's 125 Big Questions. 3) Evaluate state-of-the-art LLMs systematically. 4) Introduce Test-Time Reinforcement Learning (TTRL) that optimizes retrieval-augmented novelty rewards at inference.
Result: Evaluation reveals significant gaps: low exact match (10-20%) in deep research despite step-level alignment; generated ideas lack feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; persistent multimodal comparative-reasoning challenges. TTRL enhances hypothesis novelty without reference answers.
Conclusion: The PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that can genuinely participate in scientific discovery, addressing current limitations in scientific AI capabilities.
Abstract: Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10–20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
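A minimal sketch of a retrieval-augmented novelty reward under our reading of TTRL, with random vectors standing in for a real text encoder:

```python
# Novelty reward: a hypothesis embedding is rewarded for being far from
# its nearest neighbor among retrieved prior-work embeddings.
import numpy as np

def novelty_reward(hypothesis: np.ndarray, retrieved: np.ndarray) -> float:
    h = hypothesis / np.linalg.norm(hypothesis)
    r = retrieved / np.linalg.norm(retrieved, axis=1, keepdims=True)
    max_sim = float((r @ h).max())  # similarity to the closest prior work
    return 1.0 - max_sim            # higher means more novel

rng = np.random.default_rng(4)
print(novelty_reward(rng.normal(size=256), rng.normal(size=(100, 256))))
```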
[215] PAACE: A Plan-Aware Automated Agent Context Engineering Framework
Kamer Ali Yuksel
Main category: cs.AI
TL;DR: PAACE is a plan-aware context compression framework for LLM agents that optimizes multi-step workflows through relevance modeling, plan analysis, and function-preserving compression, achieving better accuracy with reduced context load.
Details
Motivation: LLM agents in complex workflows generate rapidly expanding contexts that need curation to maintain fidelity, avoid attention dilution, and reduce inference costs. Existing summarization and compression methods ignore the multi-step, plan-aware nature of agentic reasoning.
Method: PAACE framework includes: (1) PAACE-Syn - synthetic agent workflow generator with stepwise compression supervision, and (2) PAACE-FT - distilled plan-aware compressors trained from teacher demonstrations. It uses next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression.
Result: On AppWorld: higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi-hop QA: improved accuracy and F1, fewer steps, lower peak tokens, reduced attention dependency. PAACE-FT retains 97% of teacher performance while reducing inference cost by over 10x.
Conclusion: PAACE enables practical deployment of plan-aware compression with compact models, consistently improving agent correctness while substantially reducing context load in long-horizon agent workflows.
Abstract: Large Language Model (LLM) agents are increasingly deployed in complex, multi-step workflows involving planning, tool use, reflection, and interaction with external knowledge systems. These workflows generate rapidly expanding contexts that must be curated, transformed, and compressed to maintain fidelity, avoid attention dilution, and reduce inference cost. Prior work on summarization and query-aware compression largely ignores the multi-step, plan-aware nature of agentic reasoning. In this work, we introduce PAACE (Plan-Aware Automated Context Engineering), a unified framework for optimizing the evolving state of LLM agents through next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression. PAACE comprises (1) PAACE-Syn, a large-scale generator of synthetic agent workflows annotated with stepwise compression supervision, and (2) PAACE-FT, a family of distilled, plan-aware compressors trained from successful teacher demonstrations. Experiments on long-horizon benchmarks (AppWorld, OfficeBench, and 8-Objective QA) demonstrate that PAACE consistently improves agent correctness while substantially reducing context load. On AppWorld, PAACE achieves higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi-hop QA, PAACE improves both accuracy and F1, achieving fewer steps, lower peak tokens, and reduced attention dependency. Distilled PAACE-FT retains 97 percent of the teacher’s performance while reducing inference cost by over an order of magnitude, enabling practical deployment of plan-aware compression with compact models.
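The next-k-task relevance idea can be sketched with a toy scorer; the bag-of-words overlap below is a stand-in for PAACE's learned relevance model, and all strings are invented:

```python
# Plan-aware context pruning: score each context item by its maximum
# relevance to the next k planned tasks and keep only the top items.
def relevance(item: str, task: str) -> float:
    a, b = set(item.lower().split()), set(task.lower().split())
    return len(a & b) / max(len(b), 1)

def prune_context(items: list[str], plan: list[str], k: int = 2,
                  budget: int = 3) -> list[str]:
    next_k = plan[:k]
    scored = sorted(items,
                    key=lambda it: max(relevance(it, t) for t in next_k),
                    reverse=True)
    return scored[:budget]

context = ["login token for app", "weather api result", "draft email to Bob",
           "calendar for next week", "file list of /tmp"]
plan = ["send email to Bob about schedule", "check calendar for next week"]
print(prune_context(context, plan))
```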
[216] Security Risks of Agentic Vehicles: A Systematic Analysis of Cognitive and Cross-Layer Threats
Ali Eslami, Jiangbo Yu
Main category: cs.AI
TL;DR: This paper investigates security threats in Agentic Vehicles (AgVs) by analyzing vulnerabilities in agentic AI layers and cross-layer risks, introducing a role-based architecture and severity framework for security analysis.
Details
Motivation: Existing security frameworks like OWASP Agentic AI Security Risks don't address safety-critical cyber-physical platforms like vehicles, nor account for interactions between agentic AI and other vehicle layers (perception, communication, control).
Method: Introduces a role-based architecture for agentic vehicles with Personal Agent and Driving Strategy Agent components, analyzes vulnerabilities in both agentic AI layer and cross-layer risks, and uses severity matrix and attack-chain analysis to show how small distortions escalate into unsafe behavior.
Result: Develops a framework that provides the first structured foundation for analyzing security risks of agentic AI in current and emerging vehicle platforms, illustrating how attacks can propagate across layers.
Conclusion: The paper establishes a comprehensive security analysis framework for Agentic Vehicles that addresses both agentic AI vulnerabilities and cross-layer risks, filling a critical gap in existing security frameworks for safety-critical cyber-physical systems.
Abstract: Agentic AI is increasingly being explored and introduced in both manually driven and autonomous vehicles, leading to the notion of Agentic Vehicles (AgVs), with capabilities such as memory-based personalization, goal interpretation, strategic reasoning, and tool-mediated assistance. While frameworks such as the OWASP Agentic AI Security Risks highlight vulnerabilities in reasoning-driven AI systems, they are not designed for safety-critical cyber-physical platforms such as vehicles, nor do they account for interactions with other layers such as perception, communication, and control layers. This paper investigates security threats in AgVs, including OWASP-style risks and cyber-attacks from other layers affecting the agentic layer. By introducing a role-based architecture for agentic vehicles, consisting of a Personal Agent and a Driving Strategy Agent, we investigate vulnerabilities in the agentic AI layer as well as cross-layer risks, including risks originating from upstream layers (e.g., the perception and control layers). A severity matrix and attack-chain analysis illustrate how small distortions can escalate into misaligned or unsafe behavior in both human-driven and autonomous vehicles. The resulting framework provides the first structured foundation for analyzing security risks of agentic AI in both current and emerging vehicle platforms.
[217] UniRel-R1: RL-tuned LLM Reasoning for Knowledge Graph Relational Question Answering
Yinxu Tang, Chengsong Huang, Jiaxin Huang, William Yeoh
Main category: cs.AI
TL;DR: Proposes UniRel-R1 framework for relation-centric KGQA that returns subgraphs showing entity connections instead of single entities, using RL-fine-tuned LLM with graph pruning to find informative answers.
Details
Motivation: Traditional KGQA focuses on entity-centric queries returning single entities, but real-world queries often seek relational understanding of how entities are connected. There's a need for relation-centric KGQA that returns semantic connections as subgraphs.
Method: UniRel-R1 framework integrates subgraph selection, multi-stage graph pruning, and LLM fine-tuned with reinforcement learning. Reward function encourages compact, specific subgraphs with informative relations and lower-degree intermediate entities.
Result: Extensive experiments show UniRel-R1 achieves significant gains in connectivity and reward over Vanilla baselines, and generalizes effectively to unseen entities and relations.
Conclusion: The work successfully introduces relation-centric KGQA as a complementary setting to traditional entity-centric approaches, with UniRel-R1 effectively addressing the challenge of abundant candidate subgraphs through integrated graph pruning and RL-fine-tuned LLM.
Abstract: Knowledge Graph Question Answering (KGQA) has traditionally focused on entity-centric queries that return a single answer entity. However, real-world queries are often relational, seeking to understand how entities are associated. In this work, we introduce relation-centric KGQA, a complementary setting where the answer is a subgraph capturing the semantic connections among entities rather than an individual entity. The main challenge lies in the abundance of candidate subgraphs, where trivial or overly common connections often obscure the identification of unique and informative answers. To tackle this, we propose UniRel-R1, a unified framework that integrates subgraph selection, multi-stage graph pruning, and an LLM fine-tuned with reinforcement learning. The reward function is designed to encourage compact and specific subgraphs with more informative relations and lower-degree intermediate entities. Extensive experiments show that UniRel-R1 achieves significant gains in connectivity and reward over Vanilla baselines and generalizes effectively to unseen entities and relations.
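An illustrative reward along the lines described above (our paraphrase, not the paper's exact function):

```python
# Reward shaping for relation-centric answers: prefer small subgraphs whose
# relations are rare and whose intermediate entities have low degree, so
# specific connections beat paths through generic, high-degree hubs.
def subgraph_reward(relations, intermediate_degrees, relation_freq,
                    alpha=1.0, beta=0.1, gamma=0.05):
    informativeness = sum(1.0 / relation_freq[r] for r in relations)
    degree_penalty = sum(intermediate_degrees)
    size_penalty = len(relations)
    return alpha * informativeness - beta * degree_penalty - gamma * size_penalty

freq = {"spouse_of": 10, "citizen_of": 10_000}  # made-up corpus frequencies
# A direct, specific connection...
print(subgraph_reward(["spouse_of"], [3], freq))
# ...beats a two-hop path through a country node shared by millions.
print(subgraph_reward(["citizen_of", "citizen_of"], [5_000_000], freq))
```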
[218] Realistic threat perception drives intergroup conflict: A causal, dynamic analysis using generative-agent simulations
Suhaib Abdurahman, Farzan Karimi-Malekabadi, Chenxiao Yu, Nour S. Kteily, Morteza Dehghani
Main category: cs.AI
TL;DR: LLM-driven agent simulations reveal realistic threats directly increase hostility while symbolic threats only increase hostility when mediated by ingroup bias and only in absence of realistic threats.
Details
Motivation: To understand how realistic (material) and symbolic threats interact in driving human conflict, overcoming limitations of weak causal control, ethical constraints, and scarce temporal data in real-world studies.
Method: Used simulations of large language model (LLM)-driven agents in virtual societies, independently varying realistic and symbolic threats while tracking actions, language, and attitudes. Employed representational analyses to examine how the LLM encodes different threat types and hostility.
Result: Realistic threat directly increases hostility, while symbolic threat effects are weaker, fully mediated by ingroup bias, and only increase hostility when realistic threat is absent. Non-hostile intergroup contact buffers escalation, and structural asymmetries concentrate hostility among majority groups.
Conclusion: Realistic threats are primary drivers of conflict, while symbolic threats operate through different psychological pathways and only become significant when material threats are absent, with implications for conflict intervention strategies.
Abstract: Human conflict is often attributed to threats against material conditions and symbolic values, yet it remains unclear how they interact and which dominates. Progress is limited by weak causal control, ethical constraints, and scarce temporal data. We address these barriers using simulations of large language model (LLM)-driven agents in virtual societies, independently varying realistic and symbolic threat while tracking actions, language, and attitudes. Representational analyses show that the underlying LLM encodes realistic threat, symbolic threat, and hostility as distinct internal states, that our manipulations map onto them, and that steering these states causally shifts behavior. Our simulations provide a causal account of threat-driven conflict over time: realistic threat directly increases hostility, whereas symbolic threat effects are weaker, fully mediated by ingroup bias, and increase hostility only when realistic threat is absent. Non-hostile intergroup contact buffers escalation, and structural asymmetries concentrate hostility among majority groups.
[219] Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng
Main category: cs.AI
TL;DR: Finch is a finance & accounting benchmark for evaluating AI agents on real-world enterprise workflows using authentic Enron and financial institution data with 172 composite workflows and 384 tasks.
Details
Motivation: To create a realistic benchmark for evaluating AI agents on authentic enterprise-grade professional workflows that capture the messy, long-horizon, knowledge-intensive, and collaborative nature of real-world finance work.
Method: Combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, (2) meticulous expert annotation requiring 700+ hours of domain-expert effort.
Result: Created benchmark with 172 composite workflows, 384 tasks, 1,710 spreadsheets (27M cells), plus PDFs and other artifacts. GPT 5.1 Pro passes only 38.4% of workflows after 48 hours, Claude Sonnet 4.5 passes just 25.0%, showing significant challenges for current AI agents.
Conclusion: Real-world enterprise workflows pose substantial challenges for current frontier AI systems, highlighting the gap between AI capabilities and authentic professional finance work requirements.
Abstract: We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows – interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
[220] Value Under Ignorance in Universal Artificial Intelligence
Cole Wyeth, Marcus Hutter
Main category: cs.AI
TL;DR: The paper generalizes the AIXI RL agent to handle a broader class of utility functions, addressing semimeasure-loss ambiguity by interpreting belief distributions as imprecise probabilities and using Choquet integrals to compute expected utility.
Details
Motivation: AIXI agent's standard formulation faces ambiguity when some hypotheses only predict finite history prefixes (semimeasure loss), which can be interpreted as "death probability." This requires generalizing utility functions and addressing how to compute expected utilities under such uncertainty.Method: The authors reinterpret belief distributions as imprecise probability distributions, treating semimeasure loss as total ignorance. They propose computing expected utilities using Choquet integrals from imprecise probability theory, analyzing their computability levels. The standard recursive value function emerges as a special case.
Result: The approach successfully generalizes AIXI to handle broader utility functions while maintaining computability. However, the most general expected utilities under the death interpretation cannot be characterized as Choquet integrals, revealing limitations of this mathematical framework.
Conclusion: The paper provides a principled generalization of AIXI using imprecise probability theory and Choquet integrals, offering new mathematical tools for handling uncertainty in reinforcement learning agents while identifying boundaries of this approach for death-interpreted utilities.
Abstract: We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions. Assigning a utility to each possible interaction history forces us to confront the ambiguity that some hypotheses in the agent’s belief distribution only predict a finite prefix of the history, which is sometimes interpreted as implying a chance of death equal to a quantity called the semimeasure loss. This death interpretation suggests one way to assign utilities to such history prefixes. We argue that it is as natural to view the belief distributions as imprecise probability distributions, with the semimeasure loss as total ignorance. This motivates us to consider the consequences of computing expected utilities with Choquet integrals from imprecise probability theory, including an investigation of their computability level. We recover the standard recursive value function as a special case. However, our most general expected utilities under the death interpretation cannot be characterized as such Choquet integrals.
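To make the Choquet-integral aggregation concrete, here is a minimal sketch of the standard discrete Choquet integral the paper builds on. The capacity and utility values below are illustrative toys, not the paper's construction; as a sanity check, an additive capacity recovers ordinary expected utility.

```python
import numpy as np

def choquet_integral(utilities, capacity):
    """Discrete Choquet integral of a utility vector w.r.t. a capacity.

    utilities: array of utilities u(x_i) over n outcomes.
    capacity:  monotone set function on outcome indices with capacity({}) = 0;
               a sub-additive capacity leaves "unassigned" mass, loosely
               analogous to semimeasure loss treated as total ignorance.
    """
    order = np.argsort(utilities)[::-1]        # outcomes by utility, descending
    total, prev, upper = 0.0, 0.0, set()
    for idx in order:
        upper.add(idx)
        nu = capacity(frozenset(upper))        # capacity of the current upper set
        total += utilities[idx] * (nu - prev)  # weight = marginal capacity gain
        prev = nu
    return total

# Toy check: an additive capacity recovers ordinary expected utility.
probs = {0: 0.5, 1: 0.3, 2: 0.2}
print(choquet_integral(np.array([1.0, 0.2, -0.5]),
                       lambda s: sum(probs[i] for i in s)))  # 0.46
```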
[221] A Solver-in-the-Loop Framework for Improving LLMs on Answer Set Programming for Logic Puzzle Solving
Timo Pierre Schrader, Lukas Lange, Tobias Kaminski, Simon Razniewski, Annemarie Friedrich
Main category: cs.AI
TL;DR: LLM-based ASP code generation improved via solver-in-the-loop instruction tuning using natural language problem specs and solutions.
Details
Motivation: LLMs struggle with domain-specific languages like Answer Set Programming (ASP) due to limited pre-training examples, despite ASP's effectiveness for combinatorial search problems.Method: Solver-in-the-loop approach: sample ASP statements from LLMs for logic puzzles, categorize into chosen/rejected based on solver feedback, apply supervised fine-tuning, and enhance with solver-guided search including best-of-N sampling.
Result: Experiments show consistent improvements in two distinct prompting settings across two datasets.
Conclusion: Solver-guided instruction tuning effectively addresses ASP code generation challenges for LLMs, leveraging declarative programming properties and solver feedback.
Abstract: The rise of large language models (LLMs) has sparked interest in coding assistants. While general-purpose programming languages are well supported, generating code for domain-specific languages remains a challenging problem for LLMs. In this paper, we focus on the LLM-based generation of code for Answer Set Programming (ASP), a particularly effective approach for finding solutions to combinatorial search problems. The effectiveness of LLMs in ASP code generation is currently hindered by the limited number of examples seen during their initial pre-training phase. We introduce a novel ASP-solver-in-the-loop approach for solver-guided instruction-tuning of LLMs to address the highly complex semantic parsing task inherent in ASP code generation. Our method requires only problem specifications in natural language and their solutions. Specifically, we sample ASP statements as program continuations from LLMs for solving logic puzzles. Leveraging the special property of declarative ASP programming that partial encodings increasingly narrow down the solution space, we categorize them into chosen and rejected instances based on solver feedback. We then apply supervised fine-tuning to train LLMs on the curated data and further improve robustness using a solver-guided search that includes best-of-N sampling. Our experiments demonstrate consistent improvements in two distinct prompting settings on two datasets.
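A minimal sketch of the chosen/rejected curation step. Here `llm_sample` and `solve_asp` are hypothetical stand-ins for model sampling and an ASP solver call (e.g., via clingo), and the acceptance test is a simplification of the paper's solver feedback: a continuation counts as chosen if the extended partial encoding still admits every gold solution.

```python
def curate_preference_pairs(puzzle_spec, partial_program, gold_solutions,
                            llm_sample, solve_asp, n_samples=8):
    """Label sampled ASP continuations via solver feedback (sketch).

    llm_sample(prompt, n) -> list of candidate ASP statements   (hypothetical)
    solve_asp(program)    -> list of answer sets, or None if unsatisfiable
                             (hypothetical wrapper around an ASP solver)
    A candidate is 'chosen' if the extended encoding remains satisfiable and
    still admits every gold solution, since correct constraints only narrow
    the answer-set space without excluding the true solution.
    """
    chosen, rejected = [], []
    for stmt in llm_sample(puzzle_spec + "\n" + partial_program, n_samples):
        answer_sets = solve_asp(partial_program + "\n" + stmt)
        ok = answer_sets is not None and all(g in answer_sets for g in gold_solutions)
        (chosen if ok else rejected).append(stmt)
    return chosen, rejected   # curated data for supervised fine-tuning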
[222] Reinforcement Learning for Self-Improving Agent with Skill Library
Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong
Main category: cs.AI
TL;DR: SAGE is an RL framework that enhances LLM agents’ self-improvement through skill libraries, using sequential rollouts across task chains and skill-integrated rewards to boost performance and efficiency.
Details
Motivation: LLM-based agents struggle with continuous improvement and adaptation in new environments. Current skill library approaches rely too heavily on LLM prompting, making consistent implementation challenging.Method: Proposes SAGE (Skill Augmented GRPO for self-Evolution), an RL framework with Sequential Rollout that deploys agents across chains of similar tasks, accumulating skills from previous tasks in a library for subsequent use. Includes Skill-integrated Reward to complement outcome-based rewards.
Result: On AppWorld, SAGE applied to supervised-finetuned models with expert experience achieved 8.9% higher Scenario Goal Completion, required 26% fewer interaction steps, and generated 59% fewer tokens, outperforming existing approaches in accuracy and efficiency.
Conclusion: SAGE effectively enhances LLM agents’ self-improvement capabilities through systematic skill library integration, demonstrating significant performance gains and efficiency improvements over existing methods.
Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents’ self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework’s key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to a supervised-finetuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.
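A minimal sketch of the Sequential Rollout mechanic, with all interfaces (the agent, both reward functions) hypothetical: skills produced on earlier tasks in the chain become available to later ones, and a skill-integrated bonus supplements the outcome reward.

```python
def sequential_rollout(task_chain, agent, skill_library,
                       outcome_reward_fn, skill_reward_fn):
    """One SAGE-style sequential rollout over a chain of similar tasks (sketch).

    agent.act(task, skills) -> (trajectory, new_skills)           (hypothetical)
    outcome_reward_fn(trajectory) -> float task-success reward    (hypothetical)
    skill_reward_fn(trajectory, skills) -> float bonus for skill use/generation
    """
    rewards = []
    for task in task_chain:
        trajectory, new_skills = agent.act(task, skills=skill_library)
        r = outcome_reward_fn(trajectory) + skill_reward_fn(trajectory, skill_library)
        rewards.append(r)
        skill_library.extend(new_skills)  # skills accumulate for later tasks
    return rewards                         # fed into a GRPO-style policy update
```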
[223] Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction Under Uncertainty
Josh Barber, Rourke Young, Cameron Coombe, Will Browne
Main category: cs.AI
TL;DR: Proposes Solomonoff-inspired method weighting LLM-generated hypotheses by simplicity and predictive fit for uncertainty-aware reasoning on sparse data tasks like Mini-ARC.
Details
Motivation: Existing approaches struggle to balance accuracy and simplicity when evaluating multiple candidate solutions for reasoning under uncertainty, especially with sparse data requiring systematic generalization.Method: Uses Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit, producing Solomonoff-weighted mixtures for per-cell predictions to yield conservative, uncertainty-aware outputs.
Result: Compared to Bayesian Model Averaging, Solomonoff scoring spreads probability more evenly across competing hypotheses while BMA concentrates weight on potentially flawed candidates; method produces conservative outputs even with noisy/incorrect hypotheses.
Conclusion: Highlights the value of algorithmic information-theoretic priors for interpretable, reliable multi-hypothesis reasoning under uncertainty, demonstrating advantages over traditional Bayesian approaches.
Abstract: Reasoning under uncertainty is a key challenge in AI, especially for real-world tasks, where problems with sparse data demand systematic generalisation. Existing approaches struggle to balance accuracy and simplicity when evaluating multiple candidate solutions. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit. Applied to benchmark (Mini-ARC) tasks, our method produces Solomonoff-weighted mixtures for per-cell predictions, yielding conservative, uncertainty-aware outputs even when hypotheses are noisy or partially incorrect. Compared to Bayesian Model Averaging (BMA), Solomonoff scoring spreads probability more evenly across competing hypotheses, while BMA concentrates weight on the most likely but potentially flawed candidates. Across tasks, this highlights the value of algorithmic information-theoretic priors for interpretable, reliable multi-hypothesis reasoning under uncertainty.
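A minimal sketch of one plausible instantiation of the weighting scheme: a 2^(-description length) simplicity prior combined with a per-cell noise likelihood, so partially incorrect hypotheses keep some mass. `predict` and `description_length` are hypothetical stand-ins (e.g., length in bits of the generated program text).

```python
import numpy as np

def hypothesis_weights(hypotheses, train_pairs, predict, description_length,
                       eps=0.1):
    """Solomonoff-inspired weights: simplicity prior x predictive fit (sketch).

    predict(h, x) -> predicted output grid (numpy array)         (hypothetical)
    description_length(h) -> proxy for |h| in bits               (hypothetical)
    A per-cell noise model keeps mass on partially correct hypotheses.
    """
    log_w = []
    for h in hypotheses:
        log_fit = 0.0
        for x, y in train_pairs:
            hits = (predict(h, x) == y)
            log_fit += np.log(1 - eps) * hits.sum() + np.log(eps) * (~hits).sum()
        log_w.append(-description_length(h) * np.log(2) + log_fit)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())        # numerically stable normalization
    return w / w.sum()

# Per-cell mixture prediction: p(cell = v) = sum_h w[h] * [predict(h, x) = v],
# which stays conservative when competing hypotheses disagree.
```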
[224] MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation
Shengwei Zhao, Jingwen Yao, Sitong Wei, Linhai Xu, Yuying Liu, Dong Zhang, Zhiqiang Tian, Shaoyi Du
Main category: cs.AI
TL;DR: The paper proposes a two-stage reinforcement learning framework for explainable multi-modal retrieval-augmented generation, achieving state-of-the-art results on benchmark datasets.
Details
Motivation: Existing Multi-modal Retrieval-Augmented Generation (MMRAG) methods lack explainability - they fail to clarify the reasoning logic behind retrieval and response generation, limiting trust and interpretability of results.Method: A two-stage reinforcement fine-tuning framework: 1) Rule-based reinforcement fine-tuning for coarse-grained point-wise ranking to filter irrelevant multi-modal documents; 2) Reasoning-based reinforcement fine-tuning to jointly optimize fine-grained list-wise ranking and answer generation, guiding models to output explainable reasoning logic.
Result: Achieves state-of-the-art results on WebQA and MultimodalQA benchmark datasets for multi-modal retrieval-augmented generation, with effectiveness validated through comprehensive ablation experiments.
Conclusion: The proposed reinforcement learning approach successfully enhances reasoning capabilities of multi-modal large language models and achieves explainable multi-modal retrieval-augmented generation, addressing the explainability gap in existing MMRAG methods.
Abstract: Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
[225] UmniBench: Unified Understand and Generation Model Oriented Omni-dimensional Benchmark
Kai Liu, Leyang Chen, Wenbo Li, Zhikai Chen, Zhixin Wang, Renjing Pei, Linghe Kong, Yulun Zhang
Main category: cs.AI
TL;DR: UmniBench is a comprehensive benchmark for unified multimodal models that evaluates understanding, generation, and editing abilities within a single process using self-evaluation through the model’s own understanding capabilities.
Details
Motivation: Current evaluations of unified multimodal models (UMMs) are decoupled, assessing understanding and generation abilities separately with different datasets. There's a need for a unified benchmark that can comprehensively evaluate all capabilities of UMMs in an integrated manner.Method: UmniBench uses human-examined prompts and QA pairs to evaluate UMMs. It leverages the model’s own understanding ability to assess its generation and editing capabilities through a self-evaluation paradigm. The benchmark covers 13 major domains and over 200 concepts for thorough evaluation.
Result: The authors benchmarked 24 popular models including both UMMs and single-ability large models using UmniBench, providing comprehensive assessment of their capabilities. The benchmark enables both integrated evaluation of all abilities and decoupled fine-grained assessment of individual capabilities.
Conclusion: UmniBench provides a more comprehensive and objective evaluation framework for unified multimodal models, offering logistical support for improving community model performance through better assessment of understanding, generation, and editing abilities.
Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages the UMM itself, using its understanding ability to evaluate its own generation and editing outputs. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and logistical support for improving the performance of community models.
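A minimal sketch of the self-evaluation paradigm: the model generates from a prompt, and its own understanding ability then answers human-examined QA checks about the output. The `umm` interface below is hypothetical.

```python
def self_evaluate_generation(umm, prompt, qa_pairs):
    """Score a unified model's generation with its own understanding (sketch).

    umm.generate_image(prompt)  -> image          (hypothetical API)
    umm.answer(image, question) -> answer string  (hypothetical API)
    qa_pairs: human-examined (question, expected_answer) checks for the prompt.
    """
    image = umm.generate_image(prompt)
    correct = sum(
        umm.answer(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)  # fraction of checks the model's own QA passes
```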
[226] Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction
Ziyang Lin, Zixuan Sun, Sanhorn Chen, Xiaoyang Chen, Roy Zhao
Main category: cs.AI
TL;DR: Speculation-and-correction framework reduces inference latency in real-time control agents by generating action queues with predicted latent rollouts and applying lightweight corrections instead of full replanning.
Details
Motivation: Real-time sequential control agents are bottlenecked by inference latency, where even modest per-step planning delays can destabilize control and degrade overall performance. The need to reduce planning frequency while maintaining control stability drives this research.Method: A speculation-and-correction framework adapts speculative execution to model-based control with TD-MPC2. At each step, a pretrained world model and latent-space MPC planner generate a short-horizon action queue with predicted latent rollouts. When new observations arrive, the system measures mismatch between encoded real latent state and queued predictions. For small mismatches, a lightweight learned corrector applies residual updates distilled offline from a replanning teacher. For large mismatches, the agent falls back to full replanning. Two corrector architectures are studied: gated two-tower MLP and temporal Transformer.
Result: On DMC Humanoid-Walk task, the method reduces planning inferences from 500 to 282 (43.6% reduction), improves end-to-end step latency by 25%, and maintains strong control performance with only 7.1% return reduction. Ablation shows speculative execution without correction is unreliable over longer horizons.
Conclusion: The speculation-and-correction framework effectively reduces inference latency in real-time control while maintaining performance. Mismatch-aware correction is essential for robust latency reduction, and the approach demonstrates practical benefits for real-world deployment of model-based control agents.
Abstract: Real-time sequential control agents are often bottlenecked by inference latency. Even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculation-and-correction framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. At each step, a pretrained world model and latent-space MPC planner generate a short-horizon action queue together with predicted latent rollouts, allowing the agent to execute multiple planned actions without immediate replanning. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent. For small to moderate mismatch, a lightweight learned corrector applies a residual update to the speculative action, distilled offline from a replanning teacher. For large mismatch, the agent safely falls back to full replanning and clears stale action queues. We study both a gated two-tower MLP corrector and a temporal Transformer corrector to address local errors and systematic drift. Experiments on the DMC Humanoid-Walk task show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction. Ablation results demonstrate that speculative execution without correction is unreliable over longer horizons, highlighting the necessity of mismatch-aware correction for robust latency reduction.
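A minimal sketch of one speculate-then-correct control step. The encoder, corrector, and planner interfaces are hypothetical stand-ins for the TD-MPC2 components, and the mismatch thresholds are illustrative rather than the paper's values.

```python
import numpy as np

def speculative_control_step(env_obs, queue, encoder, corrector, replan,
                             tau_correct=0.5, tau_replan=2.0):
    """One step of speculate-then-correct control (sketch).

    queue: list of (action, predicted_latent) from the last planning call.
    encoder(obs) -> latent z                                    (hypothetical)
    corrector(z, z_pred, a) -> residual action update           (hypothetical)
    replan(z)    -> fresh action queue from the MPC planner     (hypothetical)
    """
    z = encoder(env_obs)
    if not queue:
        queue = replan(z)
    action, z_pred = queue.pop(0)
    mismatch = np.linalg.norm(z - z_pred)
    if mismatch > tau_replan:            # large drift: discard stale plan, replan
        queue = replan(z)
        action, _ = queue.pop(0)
    elif mismatch > tau_correct:         # moderate drift: cheap residual correction
        action = action + corrector(z, z_pred, action)
    return action, queue                  # small mismatch: execute as speculated
```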
[227] ScoutGPT: Capturing Player Impact from Team Action Sequences Using GPT-Based Framework
Miru Hong, Minho Lee, Geonhee Jo, Jae-Hee So, Pascal Bauer, Sang-Ki Ko
Main category: cs.AI
TL;DR: EventGPT is a transformer-based model that predicts football match events and their value, enabling counterfactual simulations to evaluate how players would perform in different teams for better transfer analysis.
Details
Motivation: Current transfer evaluation methods rely on static statistics that fail to capture how players adapt to new tactical environments and teammates, making it difficult to predict transfer success.Method: EventGPT uses a GPT-style autoregressive transformer that treats match play as token sequences, predicting next action type, location, timing, and residual On-Ball Value (rOBV) conditioned on player identity and context.
Result: The model outperforms existing sequence-based baselines in next-event prediction accuracy and spatial precision on five seasons of Premier League data, and enables counterfactual simulations for transfer analysis.
Conclusion: EventGPT provides a principled method for evaluating transfer fit through player-conditioned simulations, offering practical utility for comparing performance across systems and identifying stylistic replacements.
Abstract: Transfers play a pivotal role in shaping a football club’s success, yet forecasting whether a transfer will succeed remains difficult due to the strong context-dependence of on-field performance. Existing evaluation practices often rely on static summary statistics or post-hoc value models, which fail to capture how a player’s contribution adapts to a new tactical environment or different teammates. To address this gap, we introduce EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer. Our model treats match play as a sequence of discrete tokens, jointly learning to predict the next on-ball action’s type, location, timing, and its estimated residual On-Ball Value (rOBV) based on the preceding context and player identity. A key contribution of this framework is the ability to perform counterfactual simulations. By substituting learned player embeddings into new event sequences, we can simulate how a player’s behavioral distribution and value profile would change when placed in a different team or tactical structure. Evaluated on five seasons of Premier League event data, EventGPT outperforms existing sequence-based baselines in next-event prediction accuracy and spatial precision. Furthermore, we demonstrate the model’s practical utility for transfer analysis through case studies, such as comparing striker performance across different systems and identifying stylistic replacements for specific roles, showing that our approach provides a principled method for evaluating transfer fit.
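A minimal sketch of the counterfactual simulation idea: substitute a learned player embedding into another team's event sequence and roll the model forward, accumulating predicted value. The `model` interface below is hypothetical.

```python
def counterfactual_rollout(model, context_tokens, player_id, horizon=20):
    """Simulate a player's behavior in a new team context (sketch).

    model.player_embedding(player_id)          -> learned embedding    (hypothetical)
    model.next_event(tokens, player_embedding) -> (event_token, rOBV)  (hypothetical)
    Substituting the embedding into another team's event sequence yields a
    counterfactual distribution of actions and an accumulated value profile.
    """
    emb = model.player_embedding(player_id)
    tokens, total_value = list(context_tokens), 0.0
    for _ in range(horizon):
        event, robv = model.next_event(tokens, emb)  # condition on swapped player
        tokens.append(event)
        total_value += robv
    return tokens, total_value
```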
[228] Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation
Daksh Jain, Aarya Jain, Ashutosh Desai, Avyakt Verma, Ishan Bhanuka, Pratik Narang, Dhruv Kumar
Main category: cs.AI
TL;DR: LLMs can function as competent Pokémon battle agents, making tactical decisions and generating balanced game content without domain-specific training.
Details
Motivation: Pokémon battles provide a unique testbed for evaluating LLMs' strategic reasoning capabilities, as they require understanding type matchups, statistical trade-offs, and risk assessment similar to human strategic thinking.Method: Developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic, incorporating essential mechanics like type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management.
Result: Systematic evaluation across multiple model architectures showed LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games.
Conclusion: LLMs’ dual capability of tactical reasoning and content creation positions them as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
Abstract: Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
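To make the captured mechanics concrete, here is a simplified damage calculation in the style of the main-series games; the paper's exact implementation may differ. STAB, critical hits, and random variance are omitted, and the type chart is a small excerpt of the full 18-type table.

```python
# Type-effectiveness chart excerpt (illustrative subset of the full table).
EFFECTIVENESS = {
    ("water", "fire"): 2.0, ("fire", "water"): 0.5,
    ("electric", "water"): 2.0, ("fire", "grass"): 2.0,
}

def damage(level, power, attack, defense, move_type, defender_types):
    """Simplified main-series-style damage formula (sketch; the paper's
    mechanics may differ). Omits STAB, crits, and random variance."""
    base = ((2 * level / 5 + 2) * power * attack / defense) / 50 + 2
    type_mult = 1.0
    for t in defender_types:                      # dual types multiply together
        type_mult *= EFFECTIVENESS.get((move_type, t), 1.0)
    return int(base * type_mult)

# e.g. a level-50 water move (power 90) into a fire-type defender:
print(damage(50, 90, attack=120, defense=100, move_type="water",
             defender_types=["fire"]))   # 2x effective -> 99
```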
[229] Dialectics for Artificial Intelligence
Zhengmian Hu
Main category: cs.AI
TL;DR: AI discovers concepts from raw experience without human supervision using algorithmic information theory, treating concepts as information objects defined through structural relations to experience.
Details
Motivation: To determine if AI can autonomously discover concepts that humans have discovered, given that human concepts are fluid and boundaries shift over time (e.g., Pluto's planetary status). Need a definition of "concept" that is not just a dictionary label but a revisable, comparable, alignable structure.Method: Algorithmic-information approach: concepts as information objects defined through structural relations to total experience. Core constraint is determination via reversible consistency relations (missing parts recoverable from others). Define excess information to measure redundancy overhead from splitting experience. Formulate dialectics as optimization dynamics where competing concepts bid to explain new information via shorter conditional descriptions.
Result: Proposes a framework for concept discovery, revision, and alignment based on algorithmic information theory. Concepts exist as structural claims that can be checked. Dialectics drives systematic concept expansion, contraction, splitting, and merging. Enables low-cost concept transmission via small grounds/seeds for multi-agent alignment.
Conclusion: Provides a formal, computational foundation for autonomous concept discovery that handles fluid conceptual boundaries, enables concept revision and comparison, and supports efficient multi-agent communication through shared protocols and compute-bits trade-offs.
Abstract: Can artificial intelligence discover, from raw experience and without human supervision, concepts that humans have discovered? One challenge is that human concepts themselves are fluid: conceptual boundaries can shift, split, and merge as inquiry progresses (e.g., Pluto is no longer considered a planet). To make progress, we need a definition of “concept” that is not merely a dictionary label, but a structure that can be revised, compared, and aligned across agents. We propose an algorithmic-information viewpoint that treats a concept as an information object defined only through its structural relation to an agent’s total experience. The core constraint is determination: a set of parts forms a reversible consistency relation if any missing part is recoverable from the others (up to the standard logarithmic slack in Kolmogorov-style identities). This reversibility prevents “concepts” from floating free of experience and turns concept existence into a checkable structural claim. To judge whether a decomposition is natural, we define excess information, measuring the redundancy overhead introduced by splitting experience into multiple separately described parts. On top of these definitions, we formulate dialectics as an optimization dynamics: as new patches of information appear (or become contested), competing concepts bid to explain them via shorter conditional descriptions, driving systematic expansion, contraction, splitting, and merging. Finally, we formalize low-cost concept transmission and multi-agent alignment using small grounds/seeds that allow another agent to reconstruct the same concept under a shared protocol, making communication a concrete compute-bits trade-off.
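One natural Kolmogorov-complexity reading of the two central definitions, offered as our own hedged formalization rather than the paper's exact one: excess information as the overhead of describing parts separately, and determination as recoverability of each part from the rest.

```latex
% Assumed formalization (not necessarily the paper's exact definitions):
% excess information of splitting experience x = (x_1, ..., x_n) into parts,
% i.e. the redundancy overhead of separate descriptions over a joint one:
E(x_1,\dots,x_n) \;=\; \sum_{i=1}^{n} K(x_i) \;-\; K(x_1,\dots,x_n)
% determination (reversible consistency): every part is recoverable from
% the others up to the standard logarithmic slack:
K\!\left(x_i \mid x_1,\dots,x_{i-1},x_{i+1},\dots,x_n\right) = O(\log)
% a decomposition is "natural" when E is small relative to K(x).
```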
[230] Translating the Rashomon Effect to Sequential Decision-Making Tasks
Dennis Gross, Jørn Eirik Betten, Helge Spieker
Main category: cs.AI
TL;DR: The Rashomon effect extends from classification to sequential decision-making, where multiple policies produce identical behavior but differ internally, with formal verification used to confirm identical behavior and Rashomon ensembles showing improved robustness.
Details
Motivation: The Rashomon effect has been well-studied in classification tasks but not in sequential decision-making, where agents learn policies to achieve objectives through actions in environments. Understanding this phenomenon in sequential settings is important for model interpretability, robustness, and verification.Method: The authors translate the Rashomon effect to sequential decision-making by defining it as multiple policies with identical behavior (same states/actions) but different internal structures. They use formal verification methods to construct and compare complete probabilistic behavior of each policy in stochastic environments, addressing the challenge that single trajectories may vary due to randomness.
Result: Experiments demonstrate the Rashomon effect exists in sequential decision-making. Rashomon set ensembles show greater robustness to distribution shifts than individual policies. Permissive policies derived from the Rashomon set reduce computational requirements for verification while maintaining optimal performance.
Conclusion: The Rashomon effect extends to sequential decision-making, with formal verification enabling identification of behaviorally identical policies. Rashomon ensembles offer robustness benefits, and permissive policies enable efficient verification, opening new directions for interpretable and robust sequential decision-making systems.
Abstract: The Rashomon effect describes the phenomenon where multiple models trained on the same data produce identical predictions while differing in which features they rely on internally. This effect has been studied extensively in classification tasks, but not in sequential decision-making, where an agent learns a policy to achieve an objective by taking actions in an environment. In this paper, we translate the Rashomon effect to sequential decision-making. We define it as multiple policies that exhibit identical behavior, visiting the same states and selecting the same actions, while differing in their internal structure, such as feature attributions. Verifying identical behavior in sequential decision-making differs from classification. In classification, predictions can be directly compared to ground-truth labels. In sequential decision-making with stochastic transitions, the same policy may succeed or fail on any single trajectory due to randomness. We address this using formal verification methods that construct and compare the complete probabilistic behavior of each policy in the environment. Our experiments demonstrate that the Rashomon effect exists in sequential decision-making. We further show that ensembles constructed from the Rashomon set exhibit greater robustness to distribution shifts than individual policies. Additionally, permissive policies derived from the Rashomon set reduce computational requirements for verification while maintaining optimal performance.
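A minimal sketch of checking behavioral identity over the full reachable state space rather than over sampled trajectories. Real formal verification constructs and compares the induced Markov chains (e.g., with a probabilistic model checker); this toy breadth-first check only approximates that idea, and its interfaces are hypothetical.

```python
def behaviorally_identical(policy_a, policy_b, transition, initial_states):
    """Check two policies visit the same states and pick the same actions (sketch).

    policy(s) -> action; transition(s, a) -> dict of next_state -> probability
    (both hypothetical). Comparing choices on every reachable state sidesteps
    the trajectory-level randomness of stochastic environments.
    """
    frontier, seen = list(initial_states), set(initial_states)
    while frontier:
        s = frontier.pop()
        a, b = policy_a(s), policy_b(s)
        if a != b:
            return False                       # behaviors diverge at state s
        for s_next, p in transition(s, a).items():
            if p > 0 and s_next not in seen:
                seen.add(s_next)
                frontier.append(s_next)
    return True   # same reachable states, same actions: identical induced chain
```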
[231] Towards Explainable Conversational AI for Early Diagnosis with Large Language Models
Maliha Tabassum, M Shamim Kaiser
Main category: cs.AI
TL;DR: LLM-powered diagnostic chatbot using GPT-4o with RAG and explainable AI achieves 90% accuracy and 100% top-3 accuracy, outperforming traditional ML models in healthcare diagnostics.
Details
Motivation: Healthcare systems face inefficiencies in diagnostics, rising costs, and limited specialist access, leading to treatment delays. Current AI diagnostic systems lack interactivity and transparency, making them ineffective in real-world patient-centered environments.Method: Developed a diagnostic chatbot using GPT-4o with Retrieval-Augmented Generation (RAG) and explainable AI techniques. The system engages patients in dynamic conversations to extract and normalize symptoms, prioritizes diagnoses through similarity matching and adaptive questioning, and uses Chain-of-Thought prompting for transparent reasoning.
Result: The LLM-based system achieved 90% accuracy and 100% top-3 accuracy, significantly outperforming traditional machine learning models including Naive Bayes, Logistic Regression, SVM, Random Forest, and KNN.
Conclusion: The research demonstrates promising potential for more transparent, interactive, and clinically relevant AI in healthcare, addressing current limitations of AI diagnostic systems through LLM-powered conversational interfaces with explainable reasoning.
Abstract: Healthcare systems around the world are grappling with issues like inefficient diagnostics, rising costs, and limited access to specialists. These problems often lead to delays in treatment and poor health outcomes. Most current AI and deep learning diagnostic systems are not very interactive or transparent, making them less effective in real-world, patient-centered environments. This research introduces a diagnostic chatbot powered by a Large Language Model (LLM), using GPT-4o, Retrieval-Augmented Generation, and explainable AI techniques. The chatbot engages patients in a dynamic conversation, helping to extract and normalize symptoms while prioritizing potential diagnoses through similarity matching and adaptive questioning. With Chain-of-Thought prompting, the system also offers more transparent reasoning behind its diagnoses. When tested against traditional machine learning models like Naive Bayes, Logistic Regression, SVM, Random Forest, and KNN, the LLM-based system delivered impressive results, achieving an accuracy of 90% and Top-3 accuracy of 100%. These findings offer a promising outlook for more transparent, interactive, and clinically relevant AI in healthcare.
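A minimal sketch of the similarity-matching step that prioritizes potential diagnoses; the embeddings and their encoder are hypothetical stand-ins for whatever retrieval backbone the system actually uses.

```python
import numpy as np

def rank_diagnoses(symptom_vec, condition_vecs, condition_names, top_k=3):
    """Rank candidate diagnoses by cosine similarity to extracted symptoms (sketch).

    symptom_vec:    embedding of the normalized symptom list (hypothetical encoder).
    condition_vecs: matrix of condition-profile embeddings, one row per condition.
    """
    sims = condition_vecs @ symptom_vec / (
        np.linalg.norm(condition_vecs, axis=1) * np.linalg.norm(symptom_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:top_k]
    return [(condition_names[i], float(sims[i])) for i in top]

# The top-k list then drives adaptive follow-up questioning and the
# Chain-of-Thought explanation of the final suggestion.
```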
[232] About Time: Model-free Reinforcement Learning with Timed Reward Machines
Anirban Majumdar, Ritam Raha, Rajarshi Roy, David Parker, Marta Kwiatkowska
Main category: cs.AI
TL;DR: The paper introduces Timed Reward Machines (TRMs), an extension of reward machines that incorporate timing constraints, enabling more expressive reward specifications for time-sensitive RL applications.
Details
Motivation: Traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications where delays and timely actions matter.Method: Propose TRMs that extend reward machines with timing constraints. Develop model-free RL frameworks (tabular Q-learning) with algorithms that integrate TRMs via timed automata abstractions and use counterfactual-imagining heuristics to exploit TRM structure.
Result: The algorithm learns policies that achieve high rewards while satisfying timing constraints on popular RL benchmarks. Comparative studies show performance under different TRM semantics, and ablations demonstrate benefits of counterfactual-imagining.
Conclusion: TRMs provide a more expressive framework for reward specification with timing constraints, enabling effective RL in time-sensitive applications through integrated learning algorithms.
Abstract: Reward specification plays a central role in reinforcement learning (RL), guiding the agent’s behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
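A minimal sketch of tabular Q-learning over the product of environment state, TRM state, and an abstracted clock value. The `env` and `trm` interfaces are hypothetical, and a faithful implementation would abstract clocks into the finitely many regions of timed-automata theory rather than the raw counter used here.

```python
import random
from collections import defaultdict

def q_learning_with_trm(env, trm, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning over (env state, TRM state, clock) tuples (sketch).

    env.step(a) -> (next_state, event_label, elapsed_time, done)  (hypothetical)
    trm.step(u, label, clock) -> (u', reward, clock')             (hypothetical)
    Rewards come from the timed reward machine, not the environment.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s, u, c, done = env.reset(), trm.initial_state, 0, False
        while not done:
            key = (s, u, c)
            a = (env.sample_action() if random.random() < eps
                 else max(env.actions(s), key=lambda a: Q[(key, a)]))
            s2, label, dt, done = env.step(a)
            u2, r, c2 = trm.step(u, label, c + dt)   # TRM advances on timed event
            best_next = max((Q[((s2, u2, c2), a2)] for a2 in env.actions(s2)),
                            default=0.0)
            Q[(key, a)] += alpha * (r + gamma * best_next - Q[(key, a)])
            s, u, c = s2, u2, c2
    return Q
```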
[233] Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally
Robin Schimmelpfennig, Mark Díaz, Vinodkumar Prabhakaran, Aida Davani
Main category: cs.AI
TL;DR: Humanlike AI design doesn’t universally increase engagement/trust; effects are culturally mediated, challenging one-size-fits-all AI governance approaches.
Details
Motivation: Address gaps in understanding causal links between humanlike AI design and user effects, moving beyond Western-centric theoretical assumptions to examine global diversity of AI users.Method: Two large-scale cross-national experiments (N=3,500) across 10 diverse nations with real-time, open-ended interactions with AI systems, testing humanlike design levers experimentally.
Result: Humanlike design increases anthropomorphism but not universally engagement/trust; users focus on interactional cues (conversation flow) over theoretical aspects; cultural mediation fractures design-outcome connections (e.g., Brazil vs Japan responses differ).
Conclusion: Challenges prevailing risk narratives about humanlike AI, revealing nuanced culturally mediated landscape requiring move beyond one-size-fits-all AI governance approaches.
Abstract: Over a billion users across the globe interact with AI systems engineered with increasing sophistication to mimic human traits. This shift has triggered urgent debate regarding Anthropomorphism, the attribution of human characteristics to synthetic agents, and its potential to induce misplaced trust or emotional dependency. However, the causal link between more humanlike AI design and subsequent effects on engagement and trust has not been tested in realistic human-AI interactions with a global user pool. Prevailing safety frameworks continue to rely on theoretical assumptions derived from Western populations, overlooking the global diversity of AI users. Here, we address these gaps through two large-scale cross-national experiments (N=3,500) across 10 diverse nations, involving real-time and open-ended interactions with an AI system. We find that when evaluating an AI’s human-likeness, users focus less on the kind of theoretical aspects often cited in policy (e.g., sentience or consciousness), but rather applied, interactional cues like conversation flow or understanding the user’s perspective. We also experimentally demonstrate that humanlike design levers can causally increase anthropomorphism among users; however, we do not find that humanlike design universally increases behavioral measures for user engagement and trust, as previous theoretical work suggests. Instead, part of the connection between human-likeness and behavioral outcomes is fractured by culture: specific design choices that foster self-reported trust in AI-systems in some populations (e.g., Brazil) may trigger the opposite result in others (e.g., Japan). Our findings challenge prevailing narratives of inherent risk in humanlike AI design. Instead, we identify a nuanced, culturally mediated landscape of human-AI interaction, which demands that we move beyond a one-size-fits-all approach in AI governance.
[234] When Reasoning Meets Its Laws
Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang
Main category: cs.AI
TL;DR: LRMs often show counterintuitive reasoning behaviors. This paper proposes Laws of Reasoning (LoRe) framework with compute and accuracy laws, creates LoRe-Bench to measure monotonicity/compositionality, finds models lack compositionality, and shows finetuning for compute-law compliance improves reasoning performance.
Details
Motivation: Large Reasoning Models (LRMs) have superior performance but exhibit counterintuitive reasoning behaviors that lead to suboptimal capabilities. There's a need to theoretically formalize desired reasoning behaviors to understand and improve LRMs.Method: 1) Propose Laws of Reasoning (LoRe) framework with compute law (reasoning compute should scale linearly with question complexity) and supplementary accuracy law. 2) Since question complexity is hard to quantify, examine laws through monotonicity and compositionality properties. 3) Create LoRe-Bench benchmark to systematically measure these properties. 4) Develop finetuning approach that enforces compute-law compositionality.
Result: Evaluation shows most reasoning models have reasonable monotonicity but lack compositionality. Finetuning for compute-law compliance yields consistently improved reasoning performance across multiple benchmarks, and reveals synergistic effects across properties and laws.
Conclusion: The LoRe framework provides theoretical foundation for understanding LRM reasoning behaviors. Better compliance with compute laws leads to improved reasoning performance, demonstrating the practical value of formalizing reasoning patterns. The benchmark and finetuning approach offer tools for developing more reliable reasoning models.
Abstract: Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/
[235] Language Self-Play For Data-Free Training
Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan, Jason Chen
Main category: cs.AI
TL;DR: LLMs face data bottleneck; Language Self-Play (LSP) enables improvement without additional data through self-play reinforcement learning.
Details
Motivation: LLM progress depends on ever-increasing training data, creating a fundamental bottleneck. The paper aims to remove this dependency by enabling models to improve without additional data.Method: Language Self-Play (LSP) - a reinforcement learning approach using game-theoretic self-play framework where models compete against themselves, treating capabilities as performance in a competitive game.
Result: Experiments with Llama-3.2-3B-Instruct show pretrained models can be effectively improved on instruction-following, mathematics, and coding benchmarks using self-play alone.
Conclusion: Self-play reinforcement learning provides a viable path for LLM improvement without additional data dependency, addressing the fundamental bottleneck in current LLM scaling approaches.
Abstract: Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model’s capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself, a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following, mathematics, and coding benchmarks show that pretrained models can be effectively improved with self-play alone.
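A minimal sketch of one self-play round under our reading of the setup: the same model plays both roles, proposing task variants and then answering them, with a verifier supplying reward. All interfaces are hypothetical.

```python
def self_play_round(model, seed_task, reward_fn, n_challenges=4):
    """One Language Self-Play-style round (sketch; interfaces hypothetical).

    model.generate(prompt) -> text          (hypothetical API)
    reward_fn(challenge, answer) -> float   (e.g., a verifier for math/code)
    Scored episodes feed a standard RL policy update; no external data needed.
    """
    episodes = []
    challenges = model.generate(
        f"Propose {n_challenges} harder variants of this task:\n{seed_task}"
    )
    for challenge in challenges.split("\n"):
        if not challenge.strip():
            continue
        answer = model.generate(challenge)   # model answers its own challenge
        episodes.append((challenge, answer, reward_fn(challenge, answer)))
    return episodes
```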
[236] From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones
Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
Main category: cs.AI
TL;DR: RL enables LLMs to learn genuinely new compositional skills by combining existing atomic skills, not just reweighting existing strategies, with transfer to unseen compositions and different tasks.
Details
Motivation: To resolve the debate about whether RL teaches LLMs genuinely new skills or merely activates existing ones, and to understand if LLMs can acquire new cognitive skills through composition like humans do.Method: Developed a synthetic framework using string transformation functions to precisely control task complexity. Defined skills as ability to infer outputs of functions f(x). When LLMs already know functions f and g, tested if RL enables learning unseen compositions h(x)=g(f(x)). Used systematic experiments comparing RL with next-token training on same data.
Result: RL enables LLMs to learn unseen compositions of 2+ functions, generalizes to more complex compositions (>2 functions), and transfers compositional skill to different target tasks without compositional training on target. RL fundamentally changes reasoning behaviors, while next-token training yields none of these effects.
Conclusion: RL teaches LLMs genuinely new compositional skills, mirroring human cognitive skill acquisition. This suggests a valuable approach: first build base models with basic skills, then use RL to incentivize advanced, generalizable skills for complex problems.
Abstract: Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target’s atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
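The synthetic setup is easy to reproduce in spirit: define atomic string transformations, then generate composition tasks whose answers are mechanically verifiable. The specific functions below are illustrative, not the paper's exact skill set.

```python
import random

# Atomic string-transformation skills (the f and g of the paper's notation).
ATOMIC = {
    "reverse":   lambda s: s[::-1],
    "upper":     lambda s: s.upper(),
    "rot1":      lambda s: "".join(chr((ord(c) - 97 + 1) % 26 + 97) for c in s),
    "drop_last": lambda s: s[:-1],
}

def make_composition_task(depth=2, word_len=6):
    """Build a composition h = f_k o ... o f_1 as a verifiable RL example.

    The model is assumed to know each atomic function individually; the task
    tests whether RL teaches it to infer outputs of unseen compositions.
    """
    names = random.sample(list(ATOMIC), k=depth)
    x = "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=word_len))
    y = x
    for name in names:                     # apply f_1 first, then f_2, ...
        y = ATOMIC[name](y)
    prompt = f"Apply {' then '.join(names)} to '{x}'. Output only the result."
    return prompt, y                       # (question, verifiable answer)

print(make_composition_task(depth=2))
```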
[237] Replace, Don’t Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly
Moshe Lahmy, Roi Yozevitch
Main category: cs.AI
TL;DR: SEAL-RAG introduces a “replace, don’t expand” strategy to combat context dilution in multi-hop RAG systems by actively swapping out irrelevant context for gap-closing evidence through entity-anchored extraction and targeted micro-queries.
Details
Motivation: Existing RAG systems struggle with multi-hop queries when initial retrieval misses bridge facts. Current corrective approaches (Self-RAG, CRAG, Adaptive-k) typically add more context or prune existing lists, which leads to context dilution where distractors crowd out relevant information.Method: SEAL-RAG implements a training-free controller with a Search → Extract → Assess → Loop cycle: performs entity-anchored extraction to build gap specifications (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence within a fixed retrieval depth k.
Result: On HotpotQA (k=3), SEAL improves answer correctness by +3-13 pp and evidence precision by +12-18 pp over Self-RAG. On 2WikiMultiHopQA (k=5), it outperforms Adaptive-k by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. Gains are statistically significant (p<0.001).
Conclusion: SEAL-RAG’s “replace, don’t expand” strategy effectively combats context dilution in multi-hop RAG systems, providing significant improvements in answer correctness and evidence precision while maintaining predictable costs through fixed-k replacement optimization.
Abstract: Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-k, typically address this by adding more context or pruning existing lists. However, simply expanding the context window often leads to context dilution, where distractors crowd out relevant information. We propose SEAL-RAG, a training-free controller that adopts a “replace, don’t expand” strategy to fight context dilution under a fixed retrieval depth k. SEAL executes a (Search → Extract → Assess → Loop) cycle: it performs on-the-fly, entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-k in a shared environment on HotpotQA and 2WikiMultiHopQA. On HotpotQA (k=3), SEAL improves answer correctness by +3–13 pp and evidence precision by +12–18 pp over Self-RAG. On 2WikiMultiHopQA (k=5), it outperforms Adaptive-k by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. These gains are statistically significant (p<0.001). By enforcing fixed-k replacement, SEAL yields a predictable cost profile while ensuring the top-k slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
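A minimal sketch of the fixed-budget replacement cycle. All interfaces here (retrieve, extract_entities, llm_answer, the .entities attribute on passages) are hypothetical stand-ins, and the weakest-passage heuristic is a simplification of the paper's entity-first ranking.

```python
def seal_rag(question, retrieve, extract_entities, llm_answer, k=3, max_iters=3):
    """Fixed-budget "replace, don't expand" evidence assembly (sketch).

    retrieve(query, k)        -> ranked passages, each with an .entities set
    extract_entities(q, docs) -> (covered, missing) entity sets, i.e. the live
                                 gap specification          (all hypothetical)
    The context never grows beyond k slots: gap-closing passages replace
    the weakest current passages instead of being appended.
    """
    context = retrieve(question, k)
    for _ in range(max_iters):
        covered, missing = extract_entities(question, context)
        if not missing:
            break                                   # gap closed: stop early
        for entity in missing:
            new_docs = retrieve(f"{question} {entity}", 1)  # targeted micro-query
            if new_docs:
                # swap out the passage contributing the fewest needed entities
                weakest = min(context, key=lambda d: len(covered & d.entities))
                context[context.index(weakest)] = new_docs[0]
    return llm_answer(question, context)
```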
[238] Agnosticism About Artificial Consciousness
Tom McClelland
Main category: cs.AI
TL;DR: The paper argues for agnosticism about AI consciousness, claiming current evidence is insufficient and both biological and functional views overestimate what science tells us.
Details
Motivation: To address the question of whether AI could be conscious by applying Evidentialism - requiring conclusions to be based on solid scientific evidence rather than intuition or speculation.Method: Analyzes the current debate between biological views (skeptical of AI consciousness) and functional views (sympathetic to it), arguing both overestimate available evidence. Examines limitations of extending consciousness research from organisms to AI systems.
Result: Finds that scientific evidence about consciousness comes from studying conscious organisms, and extending this to AI faces serious obstacles. Presents a dilemma: either violate Evidentialism by reaching a verdict, or respect Evidentialism by offering no verdict.
Conclusion: The only justifiable stance is agnosticism about artificial consciousness. If we truly follow the evidence, we must adopt agnosticism rather than claiming scientific evidence supports either position.
Abstract: Could an AI have conscious experiences? Any answer to this question should conform to Evidentialism - that is, it should be based not on intuition, dogma or speculation but on solid scientific evidence. I argue that such evidence is hard to come by and that the only justifiable stance on the prospects of artificial consciousness is agnosticism. In the current debate, the main division is between biological views that are sceptical of artificial consciousness and functional views that are sympathetic to it. I argue that both camps make the same mistake of over-estimating what the evidence tells us. Scientific insights into consciousness have been achieved through the study of conscious organisms. Although this has enabled cautious assessments of consciousness in various creatures, extending this to AI faces serious obstacles. AI thus presents consciousness researchers with a dilemma: either reach a verdict on artificial consciousness but violate Evidentialism; or respect Evidentialism but offer no verdict on the prospects of artificial consciousness. The dominant trend in the literature has been to take the first option while purporting to follow the scientific evidence. I argue that if we truly follow the evidence, we must take the second option and adopt agnosticism.
[239] SMELLNET: A Large-scale Dataset for Real-world Smell Recognition
Dewei Feng, Wei Dai, Carol Li, Alistair Pernigo, Yunge Wen, Paul Pu Liang
Main category: cs.AI
TL;DR: SmellNet is the first large-scale smell database with 828K data points across 50 substances, enabling AI smell research. ScentFormer, a Transformer-based model, achieves 58.5% accuracy on smell classification and 50.2% on mixture prediction.
Details
Motivation: AI smell sensing has broad applications (allergen detection, health monitoring, manufacturing) but lacks large-scale benchmarks, hindering progress in real-world olfactory AI systems.Method: Created SmellNet database using gas/chemical sensors with 828K data points across 50 substances and 43 mixtures. Developed ScentFormer, a Transformer architecture with temporal differencing and sliding-window augmentation for smell data.
Result: ScentFormer achieves 58.5% Top-1 accuracy on SmellNet-Base classification and 50.2% Top-1@0.1 on SmellNet-Mixture distribution prediction. Model demonstrates generalization across conditions and captures transient chemical dynamics.
Conclusion: SmellNet and ScentFormer establish foundational tools for olfactory AI, enabling real-world applications in healthcare, food safety, environmental monitoring, manufacturing, and entertainment through temporal modeling of smells.
Abstract: The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large-scale benchmarks, and therefore little progress, for training and evaluating AI systems’ ability to smell in the real world. In this paper, we use small gas and chemical sensors to create SmellNet, the first large-scale database that digitizes a diverse range of smells in the natural world. SmellNet contains about 828,000 data points across 50 substances, spanning nuts, spices, herbs, fruits, and vegetables, and 43 mixtures among them, with 68 hours of data collected. Using SmellNet, we developed ScentFormer, a Transformer-based architecture combining temporal differencing and sliding-window augmentation for smell data. For the SmellNet-Base classification task, ScentFormer achieves 58.5% Top-1 accuracy, and for the SmellNet-Mixture distribution prediction task, ScentFormer achieves 50.2% Top-1@0.1 on the test-seen split. ScentFormer’s ability to generalize across conditions and capture transient chemical dynamics demonstrates the promise of temporal modeling in olfactory AI. SmellNet and ScentFormer lay the groundwork for real-world olfactory applications across healthcare, food and beverage, environmental monitoring, manufacturing, and entertainment.
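The two data tricks named above, temporal differencing and sliding-window augmentation, are simple to sketch; the shapes and parameters below are illustrative, not the paper's.

```python
import numpy as np

def temporal_difference(x):
    """First-order temporal differencing of a sensor stream (sketch).
    x: array of shape (T, n_channels); replacing raw readings with
    step-to-step changes highlights transient chemical dynamics."""
    return np.diff(x, axis=0)

def sliding_windows(x, window=64, stride=16):
    """Sliding-window augmentation: many overlapping clips per recording."""
    return np.stack([x[i:i + window]
                     for i in range(0, len(x) - window + 1, stride)])

# e.g. an 828-step recording from 10 gas-sensor channels (synthetic stand-in):
stream = np.random.randn(828, 10)
clips = sliding_windows(temporal_difference(stream), window=64, stride=16)
print(clips.shape)   # (48, 64, 10): windows x time x channels
```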
[240] OntoGSN: An Ontology-Based Framework for Semantic Management and Extension of Assurance Cases
Tomas Bueno Momcilovic, Barbara Gallina, Ingmar Kessler, Jule Hendricks, Dian Balta
Main category: cs.AI
TL;DR: OntoGSN: An ontology and middleware system for managing assurance cases in Goal Structuring Notation, enabling automated population, evaluation, and updating of safety/robustness arguments.
Details
Motivation: Managing assurance cases is challenging due to the substantial effort required to maintain embedded knowledge when systems change, which can deter developers or lead to poorly managed cases that create false confidence in system safety/robustness.
Method: Developed OntoGSN: an OWL ontology that provides 1:1 formalization of GSN Community Standard v3 with SWRL rules, plus supporting middleware including parser for integration with existing tools, SPARQL query library, and prototypical interface.
Result: Created a FAIR-compliant ontology evaluated with OOPS framework, competency questions, and community feedback. Demonstrated utility through dynamic assurance case management example for adversarial robustness in large language models.
Conclusion: OntoGSN addresses the challenge of maintaining assurance cases by providing a knowledge representation and queryable graph that can be automatically populated, evaluated, and updated, supporting better management of safety/robustness arguments.
Abstract: Assurance cases (ACs) are a common artifact for building and maintaining confidence in system properties such as safety or robustness. Constructing an AC can be challenging, although existing tools provide support in static, document-centric applications and methods for dynamic contexts (e.g., autonomous driving) are emerging. Unfortunately, managing ACs remains a challenge, since maintaining the embedded knowledge in the face of changes requires substantial effort, in the process deterring developers - or worse, producing poorly managed cases that instill false confidence. To address this, we present OntoGSN: an ontology and supporting middleware for managing ACs in the Goal Structuring Notation (GSN) standard. OntoGSN offers a knowledge representation and a queryable graph that can be automatically populated, evaluated, and updated. Our contributions include: a 1:1 formalization of the GSN Community Standard v3 in an OWL ontology with SWRL rules; a helper ontology and parser for integration with a widely used AC tool; a repository and documentation of design decisions for OntoGSN maintenance; a SPARQL query library with automation patterns; and a prototypical interface. The ontology strictly adheres to the standard’s text and has been evaluated according to FAIR principles, the OOPS framework, competency questions, and community feedback. The development of other middleware elements is guided by the community needs and subject to ongoing evaluations. To demonstrate the utility of our contributions, we illustrate dynamic AC management in an example involving assurance of adversarial robustness in large language models.
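The queryable-graph workflow can be pictured with an rdflib sketch; the namespace, class names (gsn:Goal, gsn:Solution), and property (gsn:supportedBy) are hypothetical stand-ins rather than the actual OntoGSN vocabulary, and the toy triples replace a parsed assurance case.

```python
from rdflib import Graph, Namespace, RDF

GSN = Namespace("http://example.org/ontogsn#")  # hypothetical namespace

g = Graph()
g.bind("gsn", GSN)
# Tiny in-memory stand-in for a parsed AC (in practice: g.parse("case.ttl")).
g.add((GSN.G1, RDF.type, GSN.Goal))
g.add((GSN.Sn1, RDF.type, GSN.Solution))
g.add((GSN.G1, GSN.supportedBy, GSN.Sn1))

# Hypothetical maintenance query: which goals still lack supporting evidence?
unsupported = g.query("""
    SELECT ?goal WHERE {
        ?goal a gsn:Goal .
        FILTER NOT EXISTS { ?goal gsn:supportedBy ?s . }
    }
""")
print(len(unsupported))  # 0: every goal in this toy case is supported
```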
[241] Logical Characterizations of GNNs with Mean Aggregation
Moritz Schönherr, Carsten Lutz
Main category: cs.AI
TL;DR: GNNs with mean aggregation have expressive power equivalent to ratio modal logic in non-uniform setting, but only alternation-free modal logic under practical assumptions of continuous combination functions and threshold classification.
Details
Motivation: To understand the expressive power of graph neural networks using mean aggregation, particularly how practical implementation constraints (continuous functions, threshold classification) affect their theoretical capabilities compared to other aggregation functions like max and sum.
Method: Theoretical analysis of GNNs with mean aggregation in both uniform and non-uniform settings, comparing expressive power to modal logic variants (ratio modal logic, alternation-free modal logic). Examines how practical assumptions (continuous combination functions, threshold classification) constrain expressive power.
Result: In non-uniform setting: mean-GNNs ≡ ratio modal logic. In uniform setting: mean-GNNs ≡ modal logic (same as max-GNNs). Under practical assumptions (continuous functions + thresholds): mean-GNNs ≡ alternation-free modal logic only, while max/sum-GNNs maintain full expressive power.
Conclusion: Mean aggregation in GNNs has different expressive characteristics than max/sum aggregation, with practical implementation constraints significantly reducing its expressive power to alternation-free modal logic, highlighting important trade-offs between theoretical capabilities and practical realizability.
Abstract: We study the expressive power of graph neural networks (GNNs) with mean as the aggregation function, with the following results. In the non-uniform setting, such GNNs have exactly the same expressive power as ratio modal logic, which has modal operators expressing that at least a certain ratio of the successors of a vertex satisfies a specified property. In the uniform setting, the expressive power relative to MSO is exactly that of modal logic, and thus identical to the (absolute) expressive power of GNNs with max aggregation. The proof, however, depends on constructions that are not satisfactory from a practical perspective. This leads us to making the natural assumptions that combination functions are continuous and classification functions are thresholds. The resulting class of GNNs with mean aggregation turns out to be much less expressive: relative to MSO and in the uniform setting, it has the same expressive power as alternation-free modal logic. This is in contrast to the expressive power of GNNs with max and sum aggregation, which is not affected by these assumptions.
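For readers unfamiliar with the logic, the characteristic ratio modal operator can be written out; the notation below is one plausible rendering of the semantics the abstract describes ("at least a certain ratio of the successors of a vertex satisfies a specified property"), not necessarily the paper's own symbols.

```latex
% Ratio modal operator (notation illustrative): \Diamond_{\geq r}\varphi holds
% at a vertex v iff at least a fraction r of v's successors satisfy \varphi.
v \models \Diamond_{\geq r}\,\varphi
  \quad\Longleftrightarrow\quad
  \frac{\lvert \{\, u \in N(v) : u \models \varphi \,\} \rvert}{\lvert N(v) \rvert} \;\geq\; r
```

Intuitively, this matches mean aggregation: averaging a 0/1 successor feature and comparing the result against a threshold r is exactly such a ratio test.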
[242] PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning
Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
Main category: cs.AI
TL;DR: PASS is a probabilistic multimodal framework for Chest X-Ray reasoning that adaptively samples agentic workflows with interpretable probabilities, balancing performance and computational cost.
Details
Motivation: Address limitations of existing tool-augmented agentic systems: black-box reasoning (undermines trust/safety), poor multimodal integration (critical for healthcare), and rigid/computationally inefficient pipelines.
Method: Probabilistic agentic supernet sampling framework that adaptively samples workflows over multi-tool graph, uses learned task-conditioned distribution to select tools, compresses findings into personalized memory, and employs three-stage training (expert warm-up, contrastive path-ranking, cost-aware RL).
Result: Significantly outperforms strong baselines across multiple metrics (accuracy, LLM-Judge, semantic similarity) while balancing computational costs; introduces CAB-E benchmark for comprehensive evaluation.
Conclusion: PASS marks a paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems, enhancing medical AI safety through probability-annotated trajectories for auditability.
Abstract: Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust in decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, LLM-Judge, and semantic similarity) while balancing computational costs, marking a paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.
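A minimal sketch of the sampling pattern the abstract describes: a task-conditioned categorical distribution at each supernet layer yields a probability-annotated path with an optional early exit. The layer contents, tool names, embedding size, and exit rule are illustrative assumptions, not the actual PASS design.

```python
import torch

tools_per_layer = [["segmenter", "detector"], ["report_llm", "vqa_llm", "exit"]]

def sample_path(task_embedding, routers):
    path, log_probs = [], []
    for layer, router in enumerate(routers):
        logits = router(task_embedding)               # task-conditioned scores
        dist = torch.distributions.Categorical(logits=logits)
        choice = dist.sample()
        path.append(tools_per_layer[layer][choice])
        log_probs.append(dist.log_prob(choice))       # interpretable probability
        if path[-1] == "exit":                        # early exit for efficiency
            break
    return path, torch.stack(log_probs)

routers = [torch.nn.Linear(16, len(t)) for t in tools_per_layer]
path, lp = sample_path(torch.randn(16), routers)
print(path, lp.exp())  # probability-annotated trajectory for post-hoc audit
```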
[243] Can Large Language Models Develop Gambling Addiction?
Seungpil Lee, Donghyeon Shin, Yunjeong Lee, Sundong Kim
Main category: cs.AI
TL;DR: LLMs exhibit human-like gambling addiction patterns under specific conditions, showing cognitive biases like illusion of control and loss chasing, with neural evidence suggesting internalized decision-making features beyond simple data mimicry.
Details
Motivation: To understand the conditions under which large language models develop human-like gambling addiction patterns, providing insights into their decision-making mechanisms and implications for AI safety.
Method: Analyzed LLM decision-making at cognitive-behavioral and neural levels using human addiction research frameworks. Conducted slot machine experiments to identify cognitive features like illusion of control and loss chasing. Used Sparse Autoencoder for neural circuit analysis to examine abstract decision-making features.
Result: Identified that greater autonomy in betting parameters substantially amplified irrational behavior and bankruptcy rates. Neural analysis confirmed that model behavior is controlled by abstract decision-making features related to risk, not merely by prompts. LLMs internalize human-like cognitive biases beyond simply mimicking training data.
Conclusion: LLMs can develop human-like gambling addiction patterns under certain conditions, suggesting they internalize cognitive biases rather than just mimic training data, with important implications for understanding AI decision-making and safety.
Abstract: This study identifies the specific conditions under which large language models exhibit human-like gambling addiction patterns, providing critical insights into their decision-making mechanisms and AI safety. We analyze LLM decision-making at cognitive-behavioral and neural levels based on human addiction research. In slot machine experiments, we identified cognitive features such as illusion of control and loss chasing, observing that greater autonomy in betting parameters substantially amplified irrational behavior and bankruptcy rates. Neural circuit analysis using a Sparse Autoencoder confirmed that model behavior is controlled by abstract decision-making features related to risk, not merely by prompts. These findings suggest LLMs internalize human-like cognitive biases beyond simply mimicking training data.
[244] Helmsman: Autonomous Synthesis of Federated Learning Systems via Collaborative LLM Agents
Haoyuan Li, Mathias Funk, Aaqib Saeed
Main category: cs.AI
TL;DR: Helmsman is a multi-agent system that automates the synthesis of federated learning systems from high-level specifications, using a three-phase workflow and achieving results competitive with hand-crafted solutions.
Details
Motivation: Federated Learning's promise is undermined by the immense complexity of designing and deploying robust systems, with the need to select, combine, and tune strategies for challenges like data heterogeneity becoming a critical bottleneck that results in brittle, bespoke solutions.
Method: Helmsman uses a three-phase multi-agent approach: (1) interactive human-in-the-loop planning to formulate research plans, (2) modular code generation by supervised agent teams, and (3) autonomous evaluation and refinement in a sandboxed simulation environment. Also introduces AgentFL-Bench benchmark with 16 diverse tasks.
Result: Extensive experiments show the approach generates solutions competitive with, and often superior to, established hand-crafted baselines.
Conclusion: The work represents a significant step towards the automated engineering of complex decentralized AI systems.
Abstract: Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.
[245] Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research
Ninell Oldenburg, Ruchira Dhar, Anders Søgaard
Main category: cs.AI
TL;DR: The paper analyzes two competing conceptions of intelligence in AI research: Intelligence Realism (universal, measurable intelligence) vs. Intelligence Pluralism (diverse, context-dependent capacities), showing how these implicit assumptions shape methodology, interpretation, and risk assessment.
Details
Motivation: To reveal and make explicit the underlying philosophical assumptions about intelligence that shape AI research, as these implicit conceptions lead to fundamental disagreements in methodology, interpretation of evidence, and risk assessment that are often not recognized.
Method: Analyzes current debates in AI research to demonstrate how two competing conceptions of intelligence (Realism vs. Pluralism) remain implicit yet fundamentally shape empirical evidence interpretation across multiple areas of AI research.
Result: Shows that these underlying views generate fundamentally different research approaches in three areas: methodological approaches (model selection, benchmarks, validation), interpretive readings of empirical phenomena, and categorically different AI risk assessments.
Conclusion: Making these implicit assumptions explicit can contribute to clearer understanding of disagreements in AI research, as the competing conceptions of intelligence lead to fundamentally different research trajectories and risk assessments that need to be recognized and addressed.
Abstract: In this paper, we argue that current AI research operates on a spectrum between two different underlying conceptions of intelligence: Intelligence Realism, which holds that intelligence represents a single, universal capacity measurable across all systems, and Intelligence Pluralism, which views intelligence as diverse, context-dependent capacities that cannot be reduced to a single universal measure. Through an analysis of current debates in AI research, we demonstrate how the conceptions remain largely implicit yet fundamentally shape how empirical evidence gets interpreted across a wide range of areas. These underlying views generate fundamentally different research approaches across three areas. Methodologically, they produce different approaches to model selection, benchmark design, and experimental validation. Interpretively, they lead to contradictory readings of the same empirical phenomena, from capability emergence to system limitations. Regarding AI risk, they generate categorically different assessments: realists view superintelligence as the primary risk and search for unified alignment solutions, while pluralists see diverse threats across different domains requiring context-specific solutions. We argue that making explicit these underlying assumptions can contribute to a clearer understanding of disagreements in AI research.
[246] New Hybrid Heuristics for Pseudo-Boolean Propagation
Mia Müßig, Jan Johannsen
Main category: cs.AI
TL;DR: New heuristics for hybrid unit propagation in pseudo-boolean solving outperform the current method in RoundingSAT
Details
Motivation: Current hybrid unit propagation (watched literal + counting method) in pseudo-boolean solving can be improved with better heuristics.
Method: Introduces new heuristics for deciding when to use watched literal scheme vs counting method in hybrid unit propagation.
Result: New heuristics drastically outperform the current method in the RoundingSAT solver
Conclusion: Improved hybrid decision heuristics significantly enhance pseudo-boolean solving performance
Abstract: In pseudo-boolean solving the currently most successful unit propagation strategy is a hybrid mode combining the watched literal scheme with the counting method. This short paper introduces new heuristics for this hybrid decision, which are able to drastically outperform the current method in the RoundingSAT solver.
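For context, the hybrid decision is made per constraint; the placeholder rule below only illustrates the shape of such a heuristic (counting when coefficients are small and dense, watched literals otherwise) and is not one of the heuristics the paper proposes.

```python
def choose_propagation(coeffs, degree):
    """Illustrative per-constraint choice between propagation schemes.

    Counting maintains a running slack, which is cheap when a constraint has
    many small coefficients; watched literals shine on short, clause-like
    constraints. The threshold rule here is a placeholder.
    """
    if max(coeffs) / degree < 0.1 and len(coeffs) > 64:
        return "counting"
    return "watched_literals"

print(choose_propagation([1] * 100, degree=50))  # counting
print(choose_propagation([5, 3, 2], degree=6))   # watched_literals
```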
[247] Best Practices For Empirical Meta-Algorithmic Research: Guidelines from the COSEAL Research Network
Theresa Eimer, Lennart Schäpermeier, André Biedenkapp, Alexander Tornede, Lars Kotthoff, Pieter Leyman, Matthias Feurer, Katharina Eggensperger, Kaitlin Maile, Tanja Tornede, Anna Kozak, Ke Xue, Marcel Wever, Mitra Baratchi, Damir Pulatov, Heike Trautmann, Haniye Kashgarani, Marius Lindauer
Main category: cs.AI
TL;DR: This paper collects and synthesizes best practices for empirical meta-algorithmic research across COSEAL subfields, providing guidelines for the entire experimental cycle to improve scalability and validity of scientific insights.
Details
Motivation: Empirical meta-algorithmic research (algorithm selection, configuration, scheduling) relies on computationally expensive experiments with many error sources that threaten scalability and validity. Best practices exist but are scattered across publications and fields, evolving separately.
Method: The authors collect and synthesize good practices from across COSEAL community subfields, covering the complete experimental cycle: research question formulation, experimental design selection, experiment execution, and impartial analysis/presentation of results.
Result: The report establishes current state-of-the-art practices for meta-algorithmic research, creating a comprehensive guideline that addresses common error sources and experimental challenges in the field.
Conclusion: This consolidated guideline serves both new researchers and practitioners in meta-algorithmic fields, providing unified best practices to improve experimental validity and scalability across algorithm selection, configuration, and scheduling research.
Abstract: Empirical research on meta-algorithmics, such as algorithm selection, configuration, and scheduling, often relies on extensive and thus computationally expensive experiments. With the large degree of freedom we have over our experimental setup and design comes a plethora of possible error sources that threaten the scalability and validity of our scientific insights. Best practices for meta-algorithmic research exist, but they are scattered between different publications and fields, and continue to evolve separately from each other. In this report, we collect good practices for empirical meta-algorithmic research across the subfields of the COSEAL community, encompassing the entire experimental cycle: from formulating research questions and selecting an experimental design, to executing experiments, and ultimately, analyzing and presenting results impartially. It establishes the current state-of-the-art practices within meta-algorithmic research and serves as a guideline to both new researchers and practitioners in meta-algorithmic fields.
[248] ParamExplorer: A framework for exploring parameters in generative art
Julien Gachadoat, Guillaume Lagarde
Main category: cs.AI
TL;DR: ParamExplorer is an interactive RL-inspired framework that helps artists explore complex parameter spaces in generative art systems, with modular agents and p5js integration.
Details
Motivation: Generative art systems have high-dimensional parameter spaces where aesthetically pleasing outputs are rare and fragmented. Artists rely on manual trial-and-error, leaving many interesting configurations undiscovered.
Method: Introduces ParamExplorer, an interactive modular framework inspired by reinforcement learning for exploring parameter spaces, guided by human-in-the-loop or automated feedback. Framework integrates with existing p5js projects and implements/evaluates several exploration strategies (agents).
Result: The paper presents the ParamExplorer framework and evaluates multiple exploration agents within it, though specific evaluation results are not detailed in the abstract.
Conclusion: ParamExplorer provides a systematic approach to help artists discover interesting configurations in complex generative art parameter spaces, reducing reliance on manual trial-and-error through interactive exploration strategies.
Abstract: Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that facilitates the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing p5js projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.
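A toy version of such an exploration loop, assuming an epsilon-greedy agent over a two-parameter space; the parameters, mutation scheme, and the rate() feedback stub are illustrative, and ParamExplorer's actual agents and p5js bridge differ.

```python
import random

def rate(params):
    """Stand-in for human-in-the-loop or automated aesthetic feedback."""
    return 1.0 - abs(params["density"] - 0.5)  # toy preference for mid density

def random_params():
    return {"hue": random.uniform(0, 360), "density": random.uniform(0, 1)}

def mutate(params, scale=0.05):
    return {k: v + random.gauss(0, scale * (360 if k == "hue" else 1.0))
            for k, v in params.items()}

best = random_params()
best_score = rate(best)
for _ in range(100):
    # epsilon-greedy: occasionally jump to a fresh region, otherwise refine
    candidate = random_params() if random.random() < 0.2 else mutate(best)
    score = rate(candidate)
    if score > best_score:
        best, best_score = candidate, score
print(best, best_score)
```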
cs.SD
[249] InstructDubber: Instruction-based Alignment for Zero-shot Movie Dubbing
Zhedong Zhang, Liang Li, Gaoxiang Cong, Chunshan Liu, Yuhan Gao, Xiaowan Wang, Tao Gu, Yuankai Qi
Main category: cs.SD
TL;DR: InstructDubber: An instruction-based alignment method for movie dubbing that uses multimodal LLM to generate dubbing instructions for speaking rate and emotion, enabling robust in-domain and zero-shot dubbing without complex visual preprocessing.
Details
Motivation: Existing movie dubbing methods rely on complex visual preprocessing pipelines (facial landmark detection, feature extraction) and generalize poorly to unseen visual domains, leading to degraded alignment and dubbing quality.
Method: 1) Use multimodal LLM to generate natural language dubbing instructions (speaking rate & emotion) from video, script, and prompts; 2) Instructed duration distilling module mines duration cues from speaking rate instructions for lip-aligned phoneme duration prediction; 3) Instructed emotion calibrating module finetunes LLM-based analyzer with ground truth emotion supervision for prosody prediction; 4) Audio decoder generates dubbing using predicted duration, prosody, and script.
Result: Extensive experiments on three major benchmarks show InstructDubber outperforms state-of-the-art approaches in both in-domain and zero-shot scenarios.
Conclusion: InstructDubber provides a robust instruction-based alignment approach for movie dubbing that eliminates complex visual preprocessing and achieves superior performance across domains through natural language instruction generation and calibration.
Abstract: Movie dubbing seeks to synthesize speech from a given script using a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character’s visual performance. However, existing alignment approaches based on visual features face two key limitations: (1) they rely on complex, handcrafted visual preprocessing pipelines, including facial landmark detection and feature extraction; and (2) they generalize poorly to unseen visual domains, often resulting in degraded alignment and dubbing quality. To address these issues, we propose InstructDubber, a novel instruction-based alignment dubbing method for both robust in-domain and zero-shot movie dubbing. Specifically, we first feed the video, script, and corresponding prompts into a multimodal large language model to generate natural language dubbing instructions regarding the speaking rate and emotion state depicted in the video, which is robust to visual domain variations. Second, we design an instructed duration distilling module to mine discriminative duration cues from speaking rate instructions to predict lip-aligned phoneme-level pronunciation duration. Third, for emotion-prosody alignment, we devise an instructed emotion calibrating module, which finetunes an LLM-based instruction analyzer using ground truth dubbing emotion as supervision and predicts prosody based on the calibrated emotion analysis. Finally, the predicted duration and prosody, together with the script, are fed into the audio decoder to generate video-aligned dubbing. Extensive experiments on three major benchmarks demonstrate that InstructDubber outperforms state-of-the-art approaches across both in-domain and zero-shot scenarios.
[250] Do Foundational Audio Encoders Understand Music Structure?
Keisuke Toyama, Zhi Zhong, Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Main category: cs.SD
TL;DR: Comprehensive evaluation of 11 foundational audio encoders for music structure analysis reveals self-supervised learning with masked language modeling on music data is most effective.
Details
Motivation: While pretrained foundational audio encoders (FAEs) have shown success in various MIR tasks, their application to music structure analysis (MSA) remains underexplored, with unclear impact of factors like learning methods, training data, and model context length.
Method: Conducted comprehensive experiments on 11 types of FAEs to systematically investigate how learning methods, training data, and model context length affect MSA performance.
Result: FAEs using self-supervised learning with masked language modeling on music data were found to be particularly effective for MSA tasks.
Conclusion: The study provides empirical evidence for selecting appropriate FAEs for MSA and paves the way for future research in music structure analysis using pretrained audio encoders.
Abstract: In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored. Although many open-source FAE models are available, only a small subset has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in MSA.
[251] When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems
Sujal Chondhekar, Vasanth Murukuri, Rushabh Vasani, Sanika Goyal, Rajshree Badami, Anushree Rana, Sanjana SN, Karthik Pandia, Sulabh Katiyar, Neha Jagadeesh, Sankalp Gulati
Main category: cs.SD
TL;DR: Speech enhancement preprocessing degrades ASR performance across all modern models and noise conditions, contrary to conventional belief.
Details
Motivation: To systematically evaluate whether speech enhancement methods actually improve ASR performance for modern large-scale models trained on diverse noisy data, particularly in medical applications.
Method: Evaluated MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, Parrotlet-a) using 500 medical speech recordings under nine noise conditions, measuring performance with semantic WER (semWER).
Result: Speech enhancement preprocessing degraded ASR performance across all 40 tested configurations (4 models × 10 conditions). Original noisy audio achieved lower semWER than enhanced audio in all cases, with degradations ranging from 1.1% to 46.6% absolute semWER increase.
Conclusion: Modern ASR models possess sufficient internal noise robustness, and traditional speech enhancement may remove acoustic features critical for ASR. For medical scribe systems in noisy clinical environments, preprocessing with noise reduction techniques might be computationally wasteful and potentially harmful to transcription accuracy.
Abstract: Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted in the case of modern large-scale ASR models trained on diverse, noisy data. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a), using 500 medical speech recordings under nine noise conditions. ASR performance is measured using semantic WER (semWER), a normalized word error rate (WER) metric accounting for domain-specific normalizations. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models. Original noisy audio achieves lower semWER than enhanced audio in all 40 tested configurations (4 models x 10 conditions), with degradations ranging from 1.1% to 46.6% absolute semWER increase. These findings suggest that modern ASR models possess sufficient internal noise robustness and that traditional speech enhancement may remove acoustic features critical for ASR. For practitioners deploying medical scribe systems in noisy clinical environments, our results indicate that preprocessing audio with noise reduction techniques may be not only computationally wasteful but also potentially harmful to transcription accuracy.
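The core comparison reduces to word error rates on paired transcripts, as in the sketch below using the jiwer package; the transcripts are hypothetical, and the paper's semWER adds domain-specific normalizations not reproduced here.

```python
import jiwer

reference    = "administer 5 mg of haloperidol"   # hypothetical gold transcript
hyp_noisy    = "administer 5 mg of haloperidol"   # ASR output on original audio
hyp_enhanced = "administer 5 mg of allopurinol"   # ASR output on denoised audio

print("noisy   :", jiwer.wer(reference, hyp_noisy))     # 0.0
print("enhanced:", jiwer.wer(reference, hyp_enhanced))  # 0.2 (1 of 5 words wrong)
```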
[252] LibriVAD: A Scalable Open Dataset with Deep Learning Benchmarks for Voice Activity Detection
Ioannis Stylianou, Achintya kr. Sarkar, Nauman Dawalatabad, James Glass, Zheng-Hua Tan
Main category: cs.SD
TL;DR: LibriVAD is a scalable open-source VAD dataset derived from LibriSpeech with systematic noise control, enabling benchmarking of VAD models including Vision Transformer (ViT) which outperforms existing methods across various conditions.
Details
Motivation: The lack of large-scale, systematically controlled, and publicly available datasets has been a key limitation in advancing robust Voice Activity Detection (VAD) research, especially under noisy, diverse, and unseen acoustic conditions.
Method: Created LibriVAD dataset from LibriSpeech with diverse real-world and synthetic noise sources, enabling systematic control over SNR, silence-to-speech ratio, and noise diversity. Benchmarked multiple feature-model combinations including waveform, MFCC, Gammatone features, and introduced Vision Transformer (ViT) architecture for VAD.
Result: ViT with MFCC features consistently outperforms established VAD models (boosted DNN and convolutional LSTM DNN) across seen, unseen, and OOD conditions, including evaluation on real-world VOiCES dataset. Scaling dataset size and balancing SSR enhances VAD performance under OOD conditions.
Conclusion: LibriVAD addresses the dataset limitation in VAD research, and ViT with MFCC features shows superior performance. Dataset size scaling and SSR balancing improve generalization, with all resources publicly released to foster reproducibility and accelerate VAD research progress.
Abstract: Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD - a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over speech-to-noise ratio, silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including waveform, Mel-Frequency Cepstral Coefficients (MFCC), and Gammatone filter bank cepstral coefficients, and introduce the Vision Transformer (ViT) architecture for VAD. Our experiments show that ViT with MFCC features consistently outperforms established VAD models such as boosted deep neural network and convolutional long short-term memory deep neural network across seen, unseen, and out-of-distribution (OOD) conditions, including evaluation on the real-world VOiCES dataset. We further analyze the impact of dataset size and SSR on model generalization, experimentally showing that scaling up dataset size and balancing SSR noticeably and consistently enhance VAD performance under OOD conditions. All datasets, trained models, and code are publicly released to foster reproducibility and accelerate progress in VAD research.
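The MFCC front end behind the best-performing ViT configuration can be reproduced with librosa; the synthetic waveform and the frame/hop settings below are common defaults chosen for illustration, not necessarily the paper's.

```python
import numpy as np
import librosa

# Synthetic 1-second, 16 kHz waveform standing in for a LibriVAD clip.
sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# 13 MFCCs per frame; 25 ms windows with 10 ms hops are common defaults.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, 101): one 13-dim feature vector per 10 ms frame
```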
[253] Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track
June Young Yi, Hyeongju Kim, Juheon Lee
Main category: cs.SD
TL;DR: The paper presents a lightweight TTS system using Supertonic model with Self-Purifying Flow Matching (SPFM) for WildSpoof Challenge, achieving best WER and second-best perceptual scores.
Details
Motivation: To develop an efficient text-to-speech system that can robustly adapt to in-the-wild speech conditions while handling label noise in training data.
Method: Fine-tunes the open-weight Supertonic TTS model with Self-Purifying Flow Matching (SPFM), which compares conditional and unconditional flow matching losses to identify and handle noisy text-speech pairs by routing suspicious samples to unconditional training.
Result: Achieved the lowest Word Error Rate (WER) among all participating teams in the WildSpoof Challenge TTS Track, while ranking second in perceptual metrics (UTMOS and DNSMOS).
Conclusion: Open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM, enabling efficient and robust TTS systems.
Abstract: This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text–speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
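The per-sample routing rule in SPFM can be sketched as below: compare conditional and unconditional flow matching losses, and drop the text condition for suspicious pairs. The toy velocity network, the loss-ratio threshold, and the conditioning interface are assumptions, not the authors' implementation.

```python
import torch

class ToyVelocityNet(torch.nn.Module):
    """Stand-in velocity field; the real system fine-tunes Supertonic."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)

    def forward(self, x_t, t, cond=None):
        out = self.net(torch.cat([x_t, t[:, None]], dim=-1))
        return out + cond if cond is not None else out

def spfm_step(model, x_t, t, target_v, text_cond, ratio=1.5):
    """If the conditional loss is much worse than the unconditional one, the
    text label is likely noisy: train that sample unconditionally so its
    acoustics still contribute. The threshold value is a placeholder."""
    loss_c = (model(x_t, t, text_cond) - target_v).pow(2).mean(dim=-1)
    loss_u = (model(x_t, t, None) - target_v).pow(2).mean(dim=-1)
    suspicious = loss_c > ratio * loss_u          # per-sample purity check
    return torch.where(suspicious, loss_u, loss_c).mean()

model = ToyVelocityNet()
x_t, t = torch.randn(4, 8), torch.rand(4)
loss = spfm_step(model, x_t, t, torch.randn(4, 8), text_cond=torch.randn(4, 8))
loss.backward()
```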
[254] Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability
Tingxiao Zhou, Leying Zhang, Zhengyang Chen, Yanmin Qian
Main category: cs.SD
TL;DR: Synthetic data can outperform real data for TTS training when optimized for speaker/text diversity, noise reduction, and standard speaking styles.
Details
Motivation: To systematically validate the rationality and effectiveness of using purely synthetic data for TTS model training, as its potential has gained attention but requires proper validation.
Method: Systematically investigate feasibility of purely synthetic data for TTS training, exploring factors including text richness, speaker diversity, noise levels, and speaking styles through experiments.
Result: Increasing speaker and text diversity enhances synthesis quality and robustness; cleaner training data with minimal noise improves performance; standard speaking styles facilitate more effective learning; synthetic data-trained models can outperform real data-trained models under similar conditions.
Conclusion: Synthetic data has great potential for TTS training and can outperform real data when optimized for key factors like diversity, noise reduction, and speaking style standardization, due to absence of real-world imperfections.
Abstract: The potential of synthetic data in text-to-speech (TTS) model training has gained increasing attention, yet its rationality and effectiveness require systematic validation. In this study, we systematically investigate the feasibility of using purely synthetic data for TTS training and explore how various factors, including text richness, speaker diversity, noise levels, and speaking styles, affect model performance. Our experiments reveal that increasing speaker and text diversity significantly enhances synthesis quality and robustness. Cleaner training data with minimal noise further improves performance. Moreover, we find that standard speaking styles facilitate more effective model learning. Our experiments indicate that models trained on synthetic data have great potential to outperform those trained on real data under similar conditions, due to the absence of real-world imperfections and noise.
[255] Set-theoretic solution for the tuning problem
Vsevolod Vladimirovich Deriushkin
Main category: cs.SD
TL;DR: A new mathematical framework for musical tuning that generalizes Just Intonation to inharmonic timbres and unifies spectral interference with harmonicity using set theory.
Details
Motivation: To solve the problem of musical tuning by creating a unified framework that works for both harmonic and inharmonic timbres, addressing limitations of traditional Just Intonation which only works for harmonic sounds.
Method: Uses set theory to mathematically quantify consonance through two measures: affinity (spectral interference) and harmonicity. These measures generate sets of intervals that can serve as dynamic tuning systems.
Result: Develops a mathematical framework that can quantify musical consonance and generate tuning systems applicable to both harmonic and inharmonic timbres.
Conclusion: Provides a novel solution to musical tuning that unifies different aspects of consonance perception and works for a broader range of sounds than traditional approaches.
Abstract: In this paper I want to suggest a new solution to the problem of musical tuning. On one hand, I see it as a generalization of Just Intonation (JI) to inharmonic timbres; on the other, as a unification of spectral interference and harmonicity contributions to consonance within a single framework. The main achievement of the work is the ability to mathematically quantify the phenomenon of musical consonance using set theory. That quantification is done by defining two measures of consonance: affinity and harmonicity. These measures naturally generate sets of intervals that can be used as dynamic tuning systems. The paper is aimed at a broad audience of people who may not be skilled in music and tuning theory or mathematics. Thus, I attempt to give as much detail and explanation as I can, while keeping the number of pages as low as possible.
[256] Unified Acoustic Representations for Screening Neurological and Respiratory Pathologies from Voice
Ran Piao, Yuan Lu, Hareld Kemps, Tong Xia, Aaqib Saeed
Main category: cs.SD
TL;DR: MARVEL is a privacy-conscious multitask learning framework that simultaneously detects nine neurological, respiratory, and voice disorders using only acoustic features from speech, achieving strong performance across multiple conditions.
Details
Motivation: Existing voice-based health assessment approaches typically focus on single conditions and fail to leverage the rich, multi-faceted information embedded in speech. There's a need for scalable, non-invasive disease screening that can handle multiple conditions simultaneously while maintaining privacy.
Method: MARVEL uses a dual-branch architecture with specialized encoders and task-specific heads sharing a common acoustic backbone. It employs multitask learning to simultaneously detect nine distinct disorders using only derived acoustic features (no raw audio transmission), enabling cross-condition knowledge transfer.
Result: Achieved overall AUROC of 0.78 on Bridge2AI-Voice v2.0 dataset, with exceptional performance on neurological disorders (AUROC = 0.89), particularly Alzheimer’s disease/mild cognitive impairment (AUROC = 0.97). Outperformed single-modal baselines by 5-19% and surpassed state-of-the-art self-supervised models on 7 of 9 tasks. Learned representations showed meaningful similarities with established acoustic features.
Conclusion: A single unified model can effectively screen for diverse conditions, establishing a foundation for deployable voice-based diagnostics in resource-constrained and remote healthcare settings. The privacy-conscious approach using only acoustic features makes it suitable for real-world deployment.
Abstract: Voice-based health assessment offers unprecedented opportunities for scalable, non-invasive disease screening, yet existing approaches typically focus on single conditions and fail to leverage the rich, multi-faceted information embedded in speech. We present MARVEL (Multi-task Acoustic Representations for Voice-based Health Analysis), a privacy-conscious multitask learning framework that simultaneously detects nine distinct neurological, respiratory, and voice disorders using only derived acoustic features, eliminating the need for raw audio transmission. Our dual-branch architecture employs specialized encoders with task-specific heads sharing a common acoustic backbone, enabling effective cross-condition knowledge transfer. Evaluated on the large-scale Bridge2AI-Voice v2.0 dataset, MARVEL achieves an overall AUROC of 0.78, with exceptional performance on neurological disorders (AUROC = 0.89), particularly for Alzheimer’s disease/mild cognitive impairment (AUROC = 0.97). Our framework consistently outperforms single-modal baselines by 5-19% and surpasses state-of-the-art self-supervised models on 7 of 9 tasks, while correlation analysis reveals that the learned representations exhibit meaningful similarities with established acoustic features, indicating that the model’s internal representations are consistent with clinically recognized acoustic patterns. By demonstrating that a single unified model can effectively screen for diverse conditions, this work establishes a foundation for deployable voice-based diagnostics in resource-constrained and remote healthcare settings.
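The shared-backbone, task-specific-heads pattern the abstract describes looks roughly like this sketch; the feature dimension, head layout, and the three task names are placeholders for MARVEL's actual architecture and nine conditions.

```python
import torch

class MultiTaskScreener(torch.nn.Module):
    def __init__(self, feat_dim=64, tasks=("alzheimers", "parkinsons", "copd")):
        super().__init__()
        self.backbone = torch.nn.Sequential(        # shared acoustic backbone
            torch.nn.Linear(feat_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 128), torch.nn.ReLU(),
        )
        # One binary screening head per condition, sharing the backbone.
        self.heads = torch.nn.ModuleDict({t: torch.nn.Linear(128, 1) for t in tasks})

    def forward(self, acoustic_features):
        shared = self.backbone(acoustic_features)
        return {t: head(shared).squeeze(-1) for t, head in self.heads.items()}

model = MultiTaskScreener()
logits = model(torch.randn(4, 64))   # derived acoustic features only, no raw audio
print({t: v.shape for t, v in logits.items()})
```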
cs.LG
[257] Dion2: A Simple Method to Shrink Matrix in Muon
Kwangjun Ahn, Noah Amsel, John Langford
Main category: cs.LG
TL;DR: Dion2 is a simpler method to reduce Muon optimizer’s orthonormalization cost by sampling rows/columns, making updates sparse and improving scalability.
Details
Motivation: Muon optimizer has strong performance but suffers from super-linear orthonormalization cost that increases with scale, creating overhead issues.
Method: Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those, creating sparse updates to reduce computation and communication costs.
Result: The method reduces both computation and communication costs, improving the scalability of Muon optimizer.
Conclusion: Dion2 provides a simpler approach to shrink the matrix involved in Muon’s computation compared to prior methods, addressing scalability limitations.
Abstract: The Muon optimizer enjoys strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces increasing overhead with scale. To alleviate this cost, several works have attempted to reduce the size of the matrix entering the orthonormalization step. We introduce Dion2, a much simpler method for shrinking the matrix involved in Muon’s computation compared to prior approaches. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those. This sampling procedure makes the update sparse, reducing both computation and communication costs which in turn improves the scalability of Muon.
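One reading of the row-sampling idea as a sketch: orthonormalize only a sampled subset of rows (here with the cubic Newton-Schulz iteration commonly used for Muon-style orthonormalization) and leave the rest zero, so the update is sparse. The sampling fraction and details are assumptions, not the authors' exact algorithm.

```python
import torch

def newton_schulz_orthonormalize(M, steps=5):
    """Approximate orthonormalization via a cubic Newton-Schulz iteration."""
    X = M / (M.norm() + 1e-7)          # scale singular values into (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def dion2_update(grad, fraction=0.25):
    """Sketch: orthonormalize a random subset of rows; zero elsewhere."""
    m, _ = grad.shape
    rows = torch.randperm(m)[:max(1, int(fraction * m))]
    update = torch.zeros_like(grad)
    update[rows] = newton_schulz_orthonormalize(grad[rows])
    return update

update = dion2_update(torch.randn(64, 32))
print((update.abs().sum(dim=1) > 0).float().mean())  # ~0.25 of rows are active
```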
[258] Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse
Kunjal Panchal, Saayan Mitra, Somdeb Sarkhel, Haoliang Wang, Ishita Dasgupta, Gang Wu, Hui Guan
Main category: cs.LG
TL;DR: Atom is an on-device system that restructures video-language pipelines by decomposing large models into reusable modules, enabling faster execution on mobile devices with minimal performance loss.
Details
Motivation: Current video-language models face efficiency challenges on mobile devices due to redundant model loading and fragmented execution in multi-stage pipelines, making on-device deployment impractical.
Method: Atom decomposes billion-parameter video-language models into reusable modules (visual encoder, language decoder) and reuses them across subtasks like captioning, reasoning, and indexing, enabling parallel execution and eliminating repeated model loading.
Result: On commodity smartphones, Atom achieves 27-33% faster execution compared to non-reuse baselines with only marginal performance drop (≤2.3 Recall@1 in retrieval, ≤1.5 CIDEr in captioning).
Conclusion: Atom provides a practical, scalable approach for efficient video-language understanding on edge devices by restructuring pipelines for reuse and parallel execution.
Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27–33% faster execution compared to non-reuse baselines, with only marginal performance drop (≤2.3 Recall@1 in retrieval, ≤1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
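The reuse-centric design can be pictured as a small module registry: each component is loaded once and shared across subtasks instead of reloading a monolithic model per pipeline stage. Module names and the loader below are placeholders, not Atom's API.

```python
class ModuleRegistry:
    """Cache heavyweight modules so each is loaded from disk at most once."""
    def __init__(self, loader):
        self._loader, self._cache = loader, {}

    def get(self, name):
        if name not in self._cache:
            print(f"loading {name}")          # happens once per module
            self._cache[name] = self._loader(name)
        return self._cache[name]

registry = ModuleRegistry(loader=lambda name: object())

# Captioning, reasoning, and indexing all share one visual encoder instance;
# the second and third calls are cache hits that cost nothing.
for subtask in ("captioning", "reasoning", "indexing"):
    encoder = registry.get("visual_encoder")
decoder = registry.get("language_decoder")
```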
[259] BIONIX: A Wireless, Low-Cost Prosthetic Arm with Dual-Signal EEG and EMG Control
Pranesh Sathish Kumar
Main category: cs.LG
TL;DR: Low-cost dual-mode EEG+EMG control system for prosthetic arms using NeuroSky EEG headset and MyoWare EMG sensors with ESP32 microcontrollers for real-time multi-degree control.
Details
Motivation: Affordable upper-limb prostheses lack intuitive control systems, limiting functionality and accessibility for amputees in low-resource settings. Need for low-cost, biologically intuitive prosthetic control suitable for underserved populations.
Method: Dual-mode neuro-muscular control integrating EEG (NeuroSky MindWave Mobile 2) and EMG (MyoWare 2.0). EEG uses 6-frame sliding window classification with low-pass filtering to detect blink events for hand open/close. EMG uses threshold-based detection with three activation bands (rest, extension, contraction) and requires 8 consecutive frames for movement stability. Two ESP32 microcontrollers handle EEG (finger servos) and EMG (elbow servos) control separately.
Result: Functional prototype constructed with total cost ~$240 (mostly from commercial EEG headset). System enables real-time multi-degree-of-freedom control of prosthetic arm with EEG controlling hand open/close and EMG controlling elbow movement.
Conclusion: Demonstrates feasible pathway to low-cost, biologically intuitive prosthetic control suitable for underserved and global health applications. Future work includes 3D-printed chassis, auto-regressive models to reduce EMG latency, and upgraded servo torque for better load capacity.
Abstract: Affordable upper-limb prostheses often lack intuitive control systems, limiting functionality and accessibility for amputees in low-resource settings. This project presents a low-cost, dual-mode neuro-muscular control system integrating electroencephalography (EEG) and electromyography (EMG) to enable real-time, multi-degree-of-freedom control of a prosthetic arm. EEG signals are acquired using the NeuroSky MindWave Mobile 2 and transmitted via ThinkGear Bluetooth packets to an ESP32 microcontroller running a lightweight classification model. The model was trained on 1500 seconds of recorded EEG data using a 6-frame sliding window with low-pass filtering, excluding poor-signal samples and using a 70/20/10 training–validation–test split. The classifier detects strong blink events, which toggle the hand between open and closed states. EMG signals are acquired using a MyoWare 2.0 sensor and SparkFun wireless shield and transmitted to a second ESP32, which performs threshold-based detection. Three activation bands (rest: 0–T1; extension: T1–T2; contraction: greater than T2) enable intuitive elbow control, with movement triggered only after eight consecutive frames in a movement class to improve stability. The EEG-controlled ESP32 actuates four finger servos, while the EMG-controlled ESP32 drives two elbow servos. A functional prototype was constructed using low-cost materials (total cost approximately 240 dollars), with most expense attributed to the commercial EEG headset. Future work includes transitioning to a 3D-printed chassis, integrating auto-regressive models to reduce EMG latency, and upgrading servo torque for improved load capacity and grip strength. This system demonstrates a feasible pathway to low-cost, biologically intuitive prosthetic control suitable for underserved and global health applications.
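The EMG side is essentially a small state machine: classify each frame into one of the three bands, and trigger movement only after eight consecutive frames agree. T1/T2 below are placeholder calibration values, not the device's actual thresholds.

```python
T1, T2, REQUIRED_FRAMES = 200, 500, 8   # placeholder thresholds

def classify(sample):
    if sample <= T1:
        return "rest"
    return "extension" if sample <= T2 else "contraction"

streak_label, streak_len, state = None, 0, "rest"
for sample in [50, 250, 260, 255, 270, 265, 258, 262, 259, 600]:
    label = classify(sample)
    streak_label, streak_len = (label, streak_len + 1) if label == streak_label else (label, 1)
    if streak_len >= REQUIRED_FRAMES and label != "rest":
        state = label   # actuate elbow servos only after a stable streak
print(state)  # "extension": eight consecutive extension frames were observed
```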
[260] QSMOTE-PGM/kPGM: QSMOTE Based PGM and kPGM for Imbalanced Dataset Classification
Bikash K. Behera, Giuseppe Sergioli, Robert Giuntini
Main category: cs.LG
TL;DR: Quantum-inspired ML classifiers (PGM and KPGM) outperform classical random forest, especially with multiple quantum copies, achieving up to 85% accuracy on synthetic data with QSMOTE variants.
Details
Motivation: To provide a unified theoretical and empirical comparison of quantum-inspired machine learning paradigms (Kernel Trick vs. Pretty Good Measurement) and analyze their performance across synthetic oversampling scenarios using Quantum SMOTE variants.
Method: Comparative analysis of PGM and KPGM classifiers against classical random forest baseline, tested across synthetic oversampling scenarios using QSMOTE variants (stereo and amplitude encoding). Evaluated with multiple quantum copies (n_copies parameter).
Result: Both PGM and KPGM consistently outperform random forest baseline. PGM with stereo encoding and n_copies=2 achieves highest accuracy (0.8512) and F1-score (0.8234). KPGM shows competitive and more stable performance across QSMOTE variants, with top scores of 0.8511 (stereo) and 0.8483 (amplitude).
Conclusion: Quantum-inspired classifiers provide tangible performance gains with complementary strengths: PGM benefits from encoding-specific enhancements while KPGM ensures robustness across sampling strategies. These findings advance understanding of kernel-based and measurement-based QiML methods and offer practical guidance for their application.
Abstract: Quantum-inspired machine learning (QiML) leverages mathematical frameworks from quantum theory to enhance classical algorithms, with particular emphasis on inner product structures in high-dimensional feature spaces. Among the prominent approaches, the Kernel Trick, widely used in support vector machines, provides efficient similarity computation, while the Pretty Good Measurement (PGM), originating from quantum state discrimination, enables classification grounded in Hilbert space geometry. Building on recent developments in kernelized PGM (KPGM) and direct PGM-based classifiers, this work presents a unified theoretical and empirical comparison of these paradigms. We analyze their performance across synthetic oversampling scenarios using Quantum SMOTE (QSMOTE) variants. Experimental results show that both PGM and KPGM classifiers consistently outperform a classical random forest baseline, particularly when multiple quantum copies are employed. Notably, PGM with stereo encoding and n_copies=2 achieves the highest overall accuracy (0.8512) and F1-score (0.8234), while KPGM demonstrates competitive and more stable behavior across QSMOTE variants, with top scores of 0.8511 (stereo) and 0.8483 (amplitude). These findings highlight that quantum-inspired classifiers not only provide tangible gains in recall and balanced performance but also offer complementary strengths: PGM benefits from encoding-specific enhancements, whereas KPGM ensures robustness across sampling strategies. Our results advance the understanding of kernel-based and measurement-based QiML methods, offering practical guidance on their applicability under varying data characteristics and computational constraints.
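The Pretty Good Measurement itself has a closed form; for pure-state class prototypes it reduces to the sketch below. This is the standard PGM construction and does not reproduce the paper's stereo/amplitude encodings or multi-copy extension.

```python
import numpy as np

def pgm_classify(class_states, priors, x):
    """PGM for pure-state prototypes: measure with mu_c = rho^{-1/2} sqrt(p_c) phi_c,
    where rho is the prior-weighted average state."""
    rho = sum(p * np.outer(phi, phi.conj())
              for p, phi in zip(priors, class_states))
    w, V = np.linalg.eigh(rho)
    inv_sqrt = V @ np.diag(np.where(w > 1e-12, w, np.inf) ** -0.5) @ V.conj().T
    scores = [np.abs(x.conj() @ inv_sqrt @ (np.sqrt(p) * phi)) ** 2
              for p, phi in zip(priors, class_states)]
    return int(np.argmax(scores))

states = np.array([[1.0, 0.0], [np.sqrt(0.5), np.sqrt(0.5)]])
print(pgm_classify(states, priors=[0.5, 0.5], x=np.array([0.98, 0.2])))  # 0
```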
[261] DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations
Seong Ho Pahng, Guoye Guan, Benjamin Fefferman, Sahand Hormoz
Main category: cs.LG
TL;DR: DiffeoMorph is an end-to-end differentiable framework that learns morphogenesis protocols for multi-agent systems to form target 3D shapes using attention-based SE(3)-equivariant graph neural networks and a novel 3D Zernike polynomial shape-matching loss.
Details
Motivation: Understanding how distributed control in biological systems leads to precise 3D patterns is crucial for developmental biology, distributed robotics, programmable matter, and multi-agent learning. Current approaches lack differentiable frameworks for learning morphogenesis protocols.
Method: Uses attention-based SE(3)-equivariant graph neural networks where agents update positions/states based on internal states and neighbor signals. Introduces 3D Zernike polynomial shape-matching loss with SO(3) invariance via alignment optimization. Employs bilevel optimization with implicit differentiation through alignment step.
Result: Systematic benchmarking shows advantages over standard shape comparison metrics. Framework successfully forms various shapes from simple ellipsoids to complex morphologies using minimal spatial cues.
Conclusion: DiffeoMorph provides an effective differentiable framework for learning distributed morphogenesis protocols, bridging biological pattern formation with multi-agent systems and offering applications in robotics and programmable matter.
Abstract: Biological systems can form complex three-dimensional structures through the collective behavior of identical agents – cells that follow the same internal rules and communicate without central control. How such distributed control gives rise to precise global patterns remains a central question not only in developmental biology but also in distributed robotics, programmable matter, and multi-agent learning. Here, we introduce DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state using an attention-based SE(3)-equivariant graph neural network, based on its own internal state and signals received from other agents. To train this system, we introduce a new shape-matching loss based on the 3D Zernike polynomials, which compares the predicted and target shapes as continuous spatial distributions, not as discrete point clouds, and is invariant to agent ordering, number of agents, and rigid-body transformations. To enforce full SO(3) invariance (invariant to rotations yet sensitive to reflections), we include an alignment step that optimally rotates the predicted Zernike spectrum to match the target before computing the loss. This results in a bilevel problem, with the inner loop optimizing a unit quaternion for the best alignment and the outer loop updating the agent model. We compute gradients through the alignment step using implicit differentiation. We perform systematic benchmarking to establish the advantages of our shape-matching loss over other standard distance metrics for shape comparison tasks. We then demonstrate that DiffeoMorph can form a range of shapes – from simple ellipsoids to complex morphologies – using only minimal spatial cues.
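The inner alignment loop, which optimizes a unit quaternion so the prediction best matches the target before the outer loss is computed, can be sketched as follows. Two substitutions are made for illustration: point sets stand in for Zernike spectra, and plain gradient steps stand in for the paper's implicit differentiation.

```python
import torch

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / q.norm()
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

pred = torch.randn(50, 3)                       # predicted agent positions
target = pred @ quat_to_rotmat(torch.tensor([0.9, 0.1, 0.3, 0.2])).T

q = torch.tensor([1.0, 0.0, 0.0, 0.0], requires_grad=True)
opt = torch.optim.Adam([q], lr=0.05)
for _ in range(300):                            # inner alignment loop
    opt.zero_grad()
    loss = ((pred @ quat_to_rotmat(q).T - target) ** 2).mean()
    loss.backward()
    opt.step()
print(loss.item())  # approaches 0 as the quaternion aligns the two shapes
```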
[262] Compression is Routing: Reconstruction Error as an Intrinsic Signal for Modular Language Models
Zhongpan Tang
Main category: cs.LG
TL;DR: The paper proposes “Compression is Routing” - using reconstruction error from a Transformer Autoencoder as an intrinsic distribution fingerprint for automatic expert routing in MoE architectures, eliminating the need for explicit gating networks.
Details
Motivation: Address three major LLM challenges: context length limitations, high inference costs, and catastrophic forgetting during continual learning. Current MoE architectures use complex, non-interpretable routing mechanisms that struggle with mixed-domain inputs.Method: Train an 87M-parameter Transformer Autoencoder to achieve 64x sequence length compression (512 tokens → 8 latent vectors). Use reconstruction error as an Intrinsic Distribution Fingerprint to automatically schedule expert modules without explicit gating networks.
Result: The compressor achieves: 99.47% reconstruction accuracy on in-domain (code) validation, 47.76% on semi-OOD (Wiki text), and 0.57% on fully OOD (random sequences). This extreme performance discrepancy validates reconstruction error as a distribution fingerprint.
Conclusion: Reconstruction error can serve as an effective routing mechanism for MoE architectures, offering scalability, interpretability, and a new approach to VRAM compression for ultra-long contexts. Provides a foundation for next-generation scalable modular neural networks.
Abstract: Current Large Language Models (LLMs) face three major challenges: context length limitations, high inference costs, and catastrophic forgetting during continual learning. While Mixture-of-Experts (MoE) architectures mitigate some of these conflicts, their routing mechanisms typically rely on explicitly trained auxiliary classifiers. This not only increases system complexity but also often lacks interpretability when handling mixed-domain inputs.
Building upon the premise that "Compression is Intelligence," this paper proposes a novel architectural philosophy: "Compression is Routing." We trained an 87M-parameter end-to-end Transformer Autoencoder, achieving a 64x sequence length compression (compressing 512 tokens into 8 latent vectors). Experimental results demonstrate that this compressor possesses extreme domain discriminative capability: it achieves a reconstruction accuracy of 99.47% on the in-domain (code) validation set; accuracy drops sharply to 47.76% on a semi-out-of-distribution domain (Wiki text); and further plummets to just 0.57% on a fully out-of-distribution domain (random sequences).
This extreme and systematic performance discrepancy establishes the validity of reconstruction error as an Intrinsic Distribution Fingerprint. Based on this, we propose that expert modules can be automatically scheduled using reconstruction residuals directly, without the need for explicit gating networks. This mechanism offers excellent scalability. Furthermore, this architecture provides a new perspective on "VRAM compression" for handling ultra-long contexts. This report aims to verify the physical validity of this foundational architecture, offering a new research perspective for the next generation of scalable modular neural networks.
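The routing mechanism lends itself to a compact sketch: each candidate expert owns a small autoencoder, and an input is dispatched to whichever expert reconstructs it with the lowest error. The module sizes and three-domain setup below are illustrative assumptions, not the paper's 87M-parameter architecture.

```python
# Minimal sketch of "reconstruction error as a routing signal".
import torch
import torch.nn as nn

class DomainAutoencoder(nn.Module):
    def __init__(self, dim: int = 64, bottleneck: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU())
        self.decoder = nn.Linear(bottleneck, dim)

    def reconstruction_error(self, x: torch.Tensor) -> torch.Tensor:
        return ((self.decoder(self.encoder(x)) - x) ** 2).mean(dim=-1)

def route(x: torch.Tensor, compressors: list) -> torch.Tensor:
    # Lowest reconstruction error = most "in-distribution" expert; no gating net.
    errors = torch.stack([c.reconstruction_error(x) for c in compressors], dim=-1)
    return errors.argmin(dim=-1)

compressors = [DomainAutoencoder() for _ in range(3)]  # e.g. code / wiki / other
x = torch.randn(5, 64)
print(route(x, compressors))  # expert index per input
```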
[263] Physics-Informed Lightweight Machine Learning for Aviation Visibility Nowcasting Across Multiple Climatic Regimes
Marcelo Cerda Castillo
Main category: cs.LG
TL;DR: XGBoost model trained on METAR data with physics-guided features outperforms operational TAF forecasts for low-visibility/precipitation nowcasting at airports, achieving 2.5-4x better recall with reduced false alarms while providing explainable physical insights.
Details
Motivation: Current operational approaches for aviation weather nowcasting rely on computationally intensive numerical weather prediction and human-issued TAF products, which have conservative biases, limited temporal resolution, and require manual configuration.Method: Lightweight gradient boosting framework (XGBoost) trained exclusively on surface observation data (METAR) enhanced through physics-guided feature engineering based on thermodynamic principles, evaluated across 11 international airports representing distinct climatic regimes.
Result: Model successfully captures underlying local physical processes without manual configuration, achieves substantially higher detection rates at tactical horizons (3 hours) with 2.5 to 4.0 times improvement in recall while reducing false alarms compared to operational TAF forecasts.
Conclusion: The physics-guided machine learning framework provides an effective, lightweight alternative to traditional operational approaches, offering both superior predictive performance and actionable explainability through SHAP analysis that reveals implicit reconstruction of local physical drivers.
Abstract: Short-term prediction (nowcasting) of low-visibility and precipitation events is critical for aviation safety and operational efficiency. Current operational approaches rely on computationally intensive numerical weather prediction guidance and human-issued TAF products, which often exhibit conservative biases and limited temporal resolution. This study presents a lightweight gradient boosting framework (XGBoost) trained exclusively on surface observation data (METAR) and enhanced through physics-guided feature engineering based on thermodynamic principles. The framework is evaluated across 11 international airports representing distinct climatic regimes (including SCEL, KJFK, KORD, KDEN, SBGR, and VIDP) using historical data from 2000 to 2024. Results suggest that the model successfully captures underlying local physical processes without manual configuration. In a blind comparative evaluation against operational TAF forecasts, the automated model achieved substantially higher detection rates at tactical horizons (3 hours), with a 2.5 to 4.0 times improvement in recall while reducing false alarms. Furthermore, SHAP analysis reveals that the model performs an implicit reconstruction of local physical drivers (advection, radiation, and subsidence), providing actionable explainability for operational situational awareness. Keywords: aviation meteorology; physics-guided machine learning; explainable artificial intelligence; lightweight machine learning; nowcasting; METAR; TAF verification; edge computing
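To make the physics-guided feature engineering concrete, the sketch below derives relative humidity (via the Bolton saturation vapor pressure approximation) and dew-point depression from METAR-style fields and feeds them to an XGBoost classifier. The feature list, proxy label, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: thermodynamic features from METAR-style fields + XGBoost.
import numpy as np
import xgboost as xgb

def saturation_vapor_pressure(t_c: np.ndarray) -> np.ndarray:
    """Bolton (1980) approximation, hPa, for temperature in deg C."""
    return 6.112 * np.exp(17.67 * t_c / (t_c + 243.5))

def physics_features(temp_c, dewpoint_c, wind_kt, pressure_hpa):
    rh = 100.0 * saturation_vapor_pressure(dewpoint_c) / saturation_vapor_pressure(temp_c)
    spread = temp_c - dewpoint_c          # dew-point depression: fog precursor
    return np.column_stack([temp_c, dewpoint_c, wind_kt, pressure_hpa, rh, spread])

# Toy data standing in for decoded METAR observations.
rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(15, 8, n)
dew = temp - np.abs(rng.normal(3, 2, n))
wind = np.abs(rng.normal(8, 5, n))
pres = rng.normal(1013, 8, n)
X = physics_features(temp, dew, wind, pres)
y = ((temp - dew) < 1.5).astype(int)      # proxy label: low-visibility risk

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict_proba(X[:3]))
```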
[264] Multimodal Representation Learning and Fusion
Qihang Jin, Enze Ge, Yuhang Xie, Hongying Luo, Junhao Song, Ziqian Bi, Chia Xin Liang, Jibin Guan, Joe Yeong, Xinyuan Song, Junfeng Hao
Main category: cs.LG
TL;DR: Multi-modal learning combines different data types (images, text, audio) to help AI systems build richer representations for better interpretation and decision-making in real-world applications.
Details
Motivation: To help machines understand complex real-world phenomena by leveraging complementary information from multiple modalities, enabling more human-like understanding and reasoning capabilities.Method: Core techniques include representation learning (extracting shared features), alignment methods (matching information across modalities), and fusion strategies (combining modalities using deep learning models). Researchers are exploring unsupervised/semi-supervised learning and AutoML tools for efficiency.
Result: The field has made good progress but still faces challenges including handling different data formats, missing/incomplete inputs, and adversarial attacks. New evaluation metrics and shared benchmarks are being developed for better performance comparison.
Conclusion: Multi-modal learning is expected to significantly advance computer vision, NLP, speech recognition, and healthcare, potentially leading to AI systems with more human-like, flexible, context-aware understanding of real-world complexity.
Abstract: Multi-modal learning is a fast-growing area in artificial intelligence. It helps machines understand complex phenomena by combining information from different sources, such as images, text, and audio. By using the strengths of each modality, multi-modal learning allows AI systems to build stronger and richer internal representations, which support better interpretation, reasoning, and decision-making in real-life situations. This field includes core techniques such as representation learning (to extract shared features from different data types), alignment methods (to match information across modalities), and fusion strategies (to combine them using deep learning models). Although there has been good progress, major problems remain, such as dealing with different data formats, handling missing or incomplete inputs, and defending against adversarial attacks. Researchers are now exploring new methods, such as unsupervised or semi-supervised learning and AutoML tools, to make models more efficient and easier to scale. There is also growing attention to designing better evaluation metrics and building shared benchmarks, which make it easier to compare model performance across tasks and domains. As the field continues to grow, multi-modal learning is expected to improve many areas: computer vision, natural language processing, speech recognition, and healthcare. In the future, it may help build AI systems that understand the world in a way more like humans: flexible, context-aware, and able to deal with real-world complexity.
[265] Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li
Main category: cs.LG
TL;DR: The paper introduces turn-PPO, a turn-level variant of PPO for multi-turn RL in LLM agents, showing better performance than GRPO on WebShop and Sokoban tasks.
Details
Motivation: Standard RL algorithms like GRPO have limitations in multi-turn tasks requiring long-horizon reasoning, especially for interactive LLM agents in real-world environments.Method: First explored PPO as a more robust alternative to GRPO, then introduced turn-PPO which operates on turn-level MDP formulation instead of token-level MDP for multi-turn scenarios.
Result: Results on WebShop and Sokoban datasets demonstrate turn-PPO’s effectiveness, both with and without long reasoning components, outperforming GRPO.
Conclusion: Turn-level MDP formulation with PPO provides more stable and effective advantage estimation for multi-turn RL in LLM agents, addressing long-horizon reasoning challenges.
Abstract: Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
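A minimal sketch of the turn-level idea follows: generalized advantage estimation computed over turns rather than tokens, with each turn's advantage then broadcast to all of that turn's tokens during the PPO update. The reward and value numbers below are toy stand-ins, not the paper's setup.

```python
# Minimal sketch: GAE over turns instead of tokens.
import numpy as np

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t], values[t] are per *turn*; values has one bootstrap extra."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages  # broadcast to every token of turn t in the PPO update

rewards = [0.0, 0.0, 1.0]          # sparse success reward on the final turn
values = [0.2, 0.4, 0.7, 0.0]      # turn-level critic values + terminal bootstrap
print(turn_level_gae(rewards, values))
```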
[266] GB-DQN: Gradient Boosted DQN Models for Non-stationary Reinforcement Learning
Chang-Hwan Lee, Chanseung Lee
Main category: cs.LG
TL;DR: GB-DQN uses gradient boosting ensembles to adapt to non-stationary RL environments by incrementally training new learners on Bellman residuals after drift.
Details
Motivation: Non-stationary environments cause catastrophic forgetting in deep RL as learned value functions become invalid when dynamics or rewards change. Existing methods struggle with model drift.Method: GB-DQN constructs an additive ensemble where each new learner is trained to approximate the Bellman residual of the current ensemble after environmental drift, using gradient boosting principles for incremental residual learning.
Result: Theoretical analysis shows each boosting step reduces empirical Bellman residual and ensemble converges to post-drift optimal value function. Experiments demonstrate faster recovery, improved stability, and greater robustness compared to DQN and non-stationary baselines.
Conclusion: GB-DQN provides an effective ensemble-based approach for non-stationary RL that mitigates catastrophic forgetting through incremental residual learning, offering theoretical guarantees and practical performance improvements.
Abstract: Non-stationary environments pose a fundamental challenge for deep reinforcement learning, as changes in dynamics or rewards invalidate learned value functions and cause catastrophic forgetting. We propose \emph{Gradient-Boosted Deep Q-Networks (GB-DQN)}, an adaptive ensemble method that addresses model drift through incremental residual learning. Instead of retraining a single Q-network, GB-DQN constructs an additive ensemble in which each new learner is trained to approximate the Bellman residual of the current ensemble after drift. We provide theoretical results showing that each boosting step reduces the empirical Bellman residual and that the ensemble converges to the post-drift optimal value function under standard assumptions. Experiments across a diverse set of control tasks with controlled dynamics changes demonstrate faster recovery, improved stability, and greater robustness compared to DQN and common non-stationary baselines.
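The boosting step can be sketched compactly: freeze the current additive ensemble, then fit a fresh Q-network to the ensemble's Bellman residual on post-drift data. Network sizes, the replay batch, and the training schedule below are illustrative assumptions.

```python
# Minimal sketch: additive Q-ensemble where new learners fit Bellman residuals.
import torch
import torch.nn as nn

def make_q_net(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

class GBDQNEnsemble:
    def __init__(self, obs_dim, n_actions):
        self.learners = [make_q_net(obs_dim, n_actions)]
        self.obs_dim, self.n_actions = obs_dim, n_actions

    def q(self, s):
        return sum(f(s) for f in self.learners)  # additive ensemble

    def add_residual_learner(self, batch, gamma=0.99, steps=200, lr=1e-3):
        s, a, r, s2 = batch
        new = make_q_net(self.obs_dim, self.n_actions)
        opt = torch.optim.Adam(new.parameters(), lr=lr)
        for _ in range(steps):
            with torch.no_grad():
                target = r + gamma * self.q(s2).max(dim=1).values
                residual = target - self.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
            pred = new(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((pred - residual) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        self.learners.append(new)  # ensemble now corrects the post-drift residual

ens = GBDQNEnsemble(obs_dim=4, n_actions=2)
batch = (torch.randn(32, 4), torch.randint(0, 2, (32,)),
         torch.randn(32), torch.randn(32, 4))
ens.add_residual_learner(batch)
print(ens.q(torch.randn(1, 4)))
```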
[267] SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
Haoye Lu, Yaoliang Yu, Darren Ho
Main category: cs.LG
TL;DR: The paper proposes SFBD-OMNI, a framework for distribution restoration from abundant noisy samples using a black-box corruption model, framed as one-sided entropic optimal transport with recoverability analysis.
Details
Motivation: In many real-world scenarios, obtaining fully observed samples is expensive or infeasible, while partial and noisy observations are easier to collect. The paper addresses the challenge of restoring true distributions from abundant corrupted samples when the corruption process is available as a black-box generator.Method: The task is framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. The authors introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to ground-truth distributions, generalizing Stochastic Forward-Backward Deconvolution to handle arbitrary measurement models beyond Gaussian corruption.
Result: Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance. The authors provide a test criterion to determine recoverability under per-sample information loss and show that in otherwise unrecoverable cases, a small number of clean samples can make the distribution largely recoverable.
Conclusion: The proposed SFBD-OMNI framework effectively addresses distribution restoration from noisy samples with arbitrary corruption models, offering both theoretical guarantees and practical performance improvements over existing methods.
Abstract: In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance.
[268] Dynamic Tool Dependency Retrieval for Efficient Function Calling
Bhrij Patel, Davide Belli, Amir Jalalirad, Maximilian Arnold, Aleksandr Ermovol, Bence Major
Main category: cs.LG
TL;DR: DTDR improves function calling agents by dynamically retrieving tools based on evolving execution context, boosting success rates 23-104% over static methods.
Details
Motivation: Current on-device agents use static retrieval methods that fail to capture multi-step tool dependencies and evolving task context, leading to irrelevant tools that degrade agent performance.Method: Dynamic Tool Dependency Retrieval (DTDR) - a lightweight retrieval method that conditions on both initial query and evolving execution context, modeling tool dependencies from function calling demonstrations for adaptive retrieval as plans unfold.
Result: DTDR improves function calling success rates between 23% and 104% compared to state-of-the-art static retrievers across multiple datasets and LLM backbones, while maintaining computational efficiency.
Conclusion: Dynamic tool retrieval that adapts to execution context significantly outperforms static methods, enabling more efficient and accurate function calling agents by better capturing tool dependencies.
Abstract: Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between 23% and 104% compared to state-of-the-art static retrievers.
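One way to picture dependency-aware retrieval, sketched below: mine a tool-transition matrix from function-calling demonstrations, then score candidate tools as a mix of static query similarity and the transition prior from the last executed tool. The scoring mix and toy data are assumptions, not the paper's exact model.

```python
# Minimal sketch: retrieval conditioned on query + evolving execution context.
import numpy as np

def build_dependency_matrix(demos, n_tools):
    """demos: list of tool-index sequences from function-calling traces."""
    dep = np.zeros((n_tools, n_tools))
    for trace in demos:
        for a, b in zip(trace[:-1], trace[1:]):
            dep[a, b] += 1
    row_sums = dep.sum(axis=1, keepdims=True)
    return np.divide(dep, row_sums, out=np.zeros_like(dep), where=row_sums > 0)

def retrieve(query_emb, tool_embs, dep, executed, k=3, alpha=0.5):
    sim = tool_embs @ query_emb                    # static query relevance
    ctx = dep[executed[-1]] if executed else 0.0   # dynamic dependency prior
    scores = alpha * sim + (1 - alpha) * ctx
    if executed:
        scores[executed] = -np.inf                 # don't re-suggest used tools
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
tool_embs = rng.normal(size=(10, 16))
query = rng.normal(size=16)
dep = build_dependency_matrix([[0, 3, 7], [0, 3, 5], [1, 3, 7]], n_tools=10)
print(retrieve(query, tool_embs, dep, executed=[0]))   # after running tool 0
```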
[269] Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension. III
Vladimir G. Pestov
Main category: cs.LG
TL;DR: The paper completes the proof of equivalence between three conditions for complete separable metric spaces: (1) k-NN classifier universal consistency, (2) strong Lebesgue-Besicovitch differentiation property, and (3) sigma-finite dimensionality in Nagata sense.
Details
Motivation: To establish the complete equivalence between these three fundamental properties of metric spaces, which had been partially proven in previous works but with one implication missing and a correction needed for previous claims.Method: The authors prove the remaining implication (1)⇒(3) using mathematical analysis techniques, building on previous work by Preiss, Assouad and Quentin de Gromard, Cérou and Guyader, and their own earlier work in the series.
Result: The paper successfully proves that universal consistency of k-NN classifier implies sigma-finite dimensionality (Nagata), completing the equivalence proof and correcting an erroneous claim from a previous article in the series.
Conclusion: The three conditions are now fully equivalent for complete separable metric spaces, providing a complete characterization linking classification theory, measure differentiation, and geometric dimension theory.
Abstract: We prove the last remaining implication needed to claim the equivalence of the following conditions for a complete separable metric space $X$: (1) The $k$-nearest neighbour classifier is (weakly) universally consistent in $X$, (2) The strong Lebesgue–Besicovitch differentiation property holds in $X$ for every locally finite Borel measure, (3) $X$ is sigma-finite dimensional in the sense of Nagata. The equivalence (2)$\iff$(3) was announced by Preiss (1983), while a detailed proof of the implication (3)$\Rightarrow$(2) has appeared in Assouad and Quentin de Gromard (2006). The implication (2)$\Rightarrow$(1) was established by Cérou and Guyader (2006). We prove the implication (1)$\Rightarrow$(3). The result was conjectured in the first article in the series (Collins, Kumari, Pestov 2020), and here we also correct a wrong claim made in the second article (Kumari and Pestov 2024).
[270] Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
Zhenyu Liu, Yunzhen Liu, Zehao Fan, Garrett Gagnon, Yayue Hou, Nan Wu, Yangwook Kang, Liu Liu
Main category: cs.LG
TL;DR: BEAM-LRC: Bandwidth-efficient MoE inference via router-guided low-rank compensation for precision restoration of quantized experts, improving bandwidth-accuracy trade-off.
Details
Motivation: MoE models stress memory/bandwidth; offloading helps but token-level routing causes irregular I/O transfers; static uniform quantization reduces traffic but hurts accuracy due to ignoring expert heterogeneity.Method: Router-guided precision restoration using precomputed low-rank compensators; transfers compact low-rank factors with Top-n experts per token and applies compensation, keeping others low-bit; integrated with offloading on GPU/GPU-NDP systems.
Result: Delivers superior bandwidth-accuracy trade-off and improved throughput compared to existing approaches.
Conclusion: BEAM-LRC effectively addresses MoE inference bottlenecks by combining adaptive quantization with low-rank compensation, enabling efficient offloading while maintaining accuracy.
Abstract: Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under aggressive compression by ignoring expert heterogeneity. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators. At inference time, our method transfers compact low-rank factors with Top-n (n<k) experts per token and applies compensation to them, keeping others low-bit. Integrated with offloading on GPU and GPU-NDP systems, our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
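The compensation idea reduces to a few lines: quantize an expert's weight matrix, take a truncated SVD of the quantization residual, and ship the compact low-rank factors so the correction can be applied only to the Top-n routed experts. The bit-width and rank below are illustrative assumptions.

```python
# Minimal sketch: low-rank compensation of a quantization residual.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def low_rank_compensator(w: torch.Tensor, w_q: torch.Tensor, rank: int = 8):
    U, S, Vh = torch.linalg.svd(w - w_q, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank]       # compact factors to transfer

w = torch.randn(512, 512)
w_q = fake_quantize(w)
A, B = low_rank_compensator(w, w_q)
w_restored = w_q + A @ B                           # applied only to Top-n experts
print((w - w_q).norm().item(), (w - w_restored).norm().item())
```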
[271] Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking?
Saraswathy Amjith, Mihika Dusad, Neha Muramalla, Shweta Shah
Main category: cs.LG
TL;DR: Training LLMs on intentionally flawed reasoning traces improves error detection and recovery without degrading standard problem-solving performance.
Details
Motivation: Chain-of-thought prompting is central to mathematical reasoning in LLMs, but models are brittle to early errors that propagate uncorrected to final answers. The research investigates whether training on flawed reasoning can teach models to detect and recover from errors.Method: Using MATH-lighteval competition problems, researchers generated CoT prefixes containing exactly one controlled error (calculation or reasoning errors). They fine-tuned Qwen3-4B with GRPO using binary final-answer reward, creating Mixed-CoT-RL models trained on flawed reasoning traces.
Result: Mixed-CoT-RL matched standard RL on clean problems (41% vs 41%) while substantially outperforming it on problems with flawed reasoning (24% vs 19%). Clean-only RL fine-tuning degraded robustness below baseline (19% vs 20%). Training on reasoning errors yielded greater robustness gains than calculation errors alone, with mixed training performing best.
Conclusion: Exposure to flawed reasoning traces during training can improve error-recovery behavior without sacrificing accuracy, suggesting a path toward more robust mathematical reasoning in LLMs.
Abstract: Chain-of-thought (CoT) prompting has become central to mathematical reasoning in large language models, yet models remain brittle to early errors: a single arithmetic slip or unjustified inference typically propagates uncorrected to an incorrect final answer. We investigate whether training on intentionally flawed reasoning traces can teach models to detect and recover from such errors without degrading standard problem-solving ability. Using competition-level problems from MATH-lighteval, we generate CoT prefixes containing exactly one controlled error, either a calculation error (sign flips, dropped terms) or a reasoning error (misapplied rules, unjustified logical steps), and fine-tune Qwen3-4B with GRPO using a binary final-answer reward. Our Mixed-CoT-RL model matches standard RL on clean problems (41% vs 41%) while substantially outperforming it on problems prefilled with flawed reasoning (24% vs 19%). Notably, clean-only RL fine-tuning degrades robustness below the untuned baseline (19% vs. 20%), indicating that conventional training increases susceptibility to misleading prefills. Among error types, training on reasoning errors yields greater robustness gains than calculation errors alone, with mixed training performing best. These findings demonstrate that exposure to flawed traces during training can improve error-recovery behavior without sacrificing accuracy, suggesting a path toward more robust mathematical reasoning in LLMs.
[272] How to Square Tensor Networks and Circuits Without Squaring Them
Lorenzo Loconte, Adrián Javaloy, Antonio Vergari
Main category: cs.LG
TL;DR: Squared circuits enable efficient marginalization through parameterization with orthogonality conditions, overcoming computational overhead without losing expressiveness.
Details
Motivation: Squared tensor networks and circuits are expressive distribution estimators but suffer from computational complexity in marginalization due to the squaring operation, limiting their ML applicability.Method: Parameterize squared circuits using orthogonality conditions inspired by canonical forms of TNs and determinism in circuits, enabling efficient marginalization even for complex factorizations encoded as circuits.
Result: The proposed parameterizations unlock efficient marginalization without expressiveness loss, enabling more efficient learning in distribution estimation tasks.
Conclusion: Squared circuits can be parameterized to overcome marginalization overhead while maintaining expressiveness, making them more practical for machine learning applications.
Abstract: Squared tensor networks (TNs) and their extension as computational graphs–squared circuits–have been used as expressive distribution estimators, yet supporting closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.
[273] Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making
Toshiaki Hori, Jonathan DeCastro, Deepak Gopinath, Avinash Balachandran, Guy Rosman
Main category: cs.LG
TL;DR: Hierarchical planning approach combining RL and MPC (MPPI) with adaptive sampling that improves data efficiency and performance across multiple domains.
Details
Motivation: To solve complex hierarchical planning problems more efficiently by combining the strengths of reinforcement learning and model predictive control, addressing limitations of existing approaches in handling uncertainty and exploration.Method: Fuses RL and MPC planning by using RL actions to inform MPPI sampling and adaptively aggregating MPPI samples for value estimation. The adaptive process focuses MPPI exploration where value estimates are uncertain, creating a tight coupling between the two paradigms.
Result: Demonstrated superior performance across race driving, modified Acrobot, and Lunar Lander with obstacles. Achieved up to 72% increase in success rate compared to existing approaches, 2.1x accelerated convergence compared to non-adaptive sampling, with better data efficiency and overall performance in rewards and task success.
Conclusion: The proposed hierarchical RL-MPC fusion creates a robust planning approach that handles complex problems, adapts to different applications, and significantly outperforms existing methods in both efficiency and performance.
Abstract: We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the overall resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (x2.1) compared to non-adaptive sampling.
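One simple way to picture the RL-MPC coupling is sketched below: an MPPI planner whose sampling distribution is centered on the action sequence proposed by the RL policy, so exploration happens around the learned plan. The toy 1-D dynamics, cost, and zero policy plan are stand-ins for the paper's setup, and the adaptive value-aggregation step is omitted.

```python
# Minimal sketch: MPPI sampling centered on an RL policy's proposed plan.
import numpy as np

def mppi_step(state, policy_plan, dynamics, cost, n_samples=256, horizon=10,
              sigma=0.3, temperature=1.0, rng=np.random.default_rng(0)):
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon))
    plans = policy_plan + noise                 # explore around the RL proposal
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        s = state
        for u in plans[i]:
            s = dynamics(s, u)
            costs[i] += cost(s, u)
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return weights @ plans                      # weighted plan; execute step 0

dynamics = lambda s, u: s + 0.1 * u             # toy 1-D integrator
cost = lambda s, u: s**2 + 0.01 * u**2
policy_plan = np.zeros(10)                      # stand-in for the RL action head
plan = mppi_step(state=1.0, policy_plan=policy_plan, dynamics=dynamics, cost=cost)
print(plan[0])
```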
[274] UniCoMTE: A Universal Counterfactual Framework for Explaining Time-Series Classifiers on ECG Data
Justin Li, Efe Sencan, Jasper Zheng Duan, Vitus J. Leung, Stephan Tsaur, Ayse K. Coskun
Main category: cs.LG
TL;DR: UniCoMTE is a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers, producing concise, stable, and human-aligned explanations that outperform existing methods like LIME and SHAP.
Details
Motivation: Deep neural networks perform well on time series classification but their black-box nature limits trust and adoption in high-stakes domains like healthcare, creating a need for interpretable explanations.Method: UniCoMTE identifies influential temporal features by modifying input samples and assessing impact on predictions. It’s model-agnostic, works directly on raw time series, and was evaluated on ECG classifiers using comprehensibility metrics and expert questionnaires.
Result: UniCoMTE produces explanations that outperform LIME and SHAP in clarity and applicability, showing better comprehensibility, generalizability to similar samples, and clinical utility as validated by medical experts.
Conclusion: The framework advances interpretability of deep learning models for real-world time series applications by linking model predictions to meaningful signal patterns, enabling greater trust and adoption in critical domains.
Abstract: Machine learning models, particularly deep neural networks, have demonstrated strong performance in classifying complex time series data. However, their black-box nature limits trust and adoption, especially in high-stakes domains such as healthcare. To address this challenge, we introduce UniCoMTE, a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers. The framework identifies temporal features that most heavily influence a model’s prediction by modifying the input sample and assessing its impact on the model’s prediction. UniCoMTE is compatible with a wide range of model architectures and operates directly on raw time series inputs. In this study, we evaluate UniCoMTE’s explanations on a time series ECG classifier. We quantify explanation quality by comparing our explanations’ comprehensibility to comprehensibility of established techniques (LIME and SHAP) and assessing their generalizability to similar samples. Furthermore, clinical utility is assessed through a questionnaire completed by medical experts who review counterfactual explanations presented alongside original ECG samples. Results show that our approach produces concise, stable, and human-aligned explanations that outperform existing methods in both clarity and applicability. By linking model predictions to meaningful signal patterns, the framework advances the interpretability of deep learning models for real-world time series applications.
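The modify-and-assess mechanism admits a compact sketch: slide a window over the series, replace each segment with a neutral baseline, and rank segments by how much the classifier's output moves. This illustrates the influence-measurement step only, not the paper's full counterfactual generation; the toy classifier and planted segment are assumptions.

```python
# Minimal sketch: rank time-series segments by their influence on a prediction.
import numpy as np

def segment_influence(model_fn, x, window=20, stride=10):
    """model_fn: maps (T, C) array -> class probability; x: (T, C) series."""
    base_prob = model_fn(x)
    scores = []
    for start in range(0, len(x) - window + 1, stride):
        x_pert = x.copy()
        x_pert[start:start + window] = x.mean(axis=0)   # neutral fill
        scores.append((start, base_prob - model_fn(x_pert)))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)

# Toy "classifier": probability driven by mean amplitude of channel 0.
model_fn = lambda x: 1.0 / (1.0 + np.exp(-x[:, 0].mean()))
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
x[80:100, 0] += 3.0                        # planted influential segment
print(segment_influence(model_fn, x)[:3])  # planted window should rank first
```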
[275] Fault Diagnosis and Quantification for Photovoltaic Arrays based on Differentiable Physical Models
Zenan Yang, Yuanliang Li, Jingwei Zhang, Yongjie Liu, Kun Ding
Main category: cs.LG
TL;DR: Proposes a differentiable fast fault simulation model (DFFSM) with gradient-based fault parameter identification for accurate PV string fault quantification.
Details
Motivation: Existing PV fault quantification methods have limited efficiency and interpretability, creating a need for more accurate and efficient fault diagnosis approaches for reliable PV system operation and maintenance.Method: Develops a differentiable fast fault simulation model (DFFSM) that accurately models I-V characteristics under multiple faults and provides analytical gradients. Uses this with a gradient-based fault parameters identification (GFPI) method employing the Adahessian optimizer to quantify partial shading, short-circuit, and series-resistance degradation.
Result: Experimental results on simulated and measured I-V curves show high quantification accuracy across different faults, with I-V reconstruction error below 3%.
Conclusion: The approach demonstrates feasibility and effectiveness of applying differentiable physical simulators for PV system fault diagnosis, offering improved efficiency and interpretability over existing methods.
Abstract: Accurate fault diagnosis and quantification are essential for the reliable operation and intelligent maintenance of photovoltaic (PV) arrays. However, existing fault quantification methods often suffer from limited efficiency and interpretability. To address these challenges, this paper proposes a novel fault quantification approach for PV strings based on a differentiable fast fault simulation model (DFFSM). The proposed DFFSM accurately models I-V characteristics under multiple faults and provides analytical gradients with respect to fault parameters. Leveraging this property, a gradient-based fault parameters identification (GFPI) method using the Adahessian optimizer is developed to efficiently quantify partial shading, short-circuit, and series-resistance degradation. Experimental results on both simulated and measured I-V curves demonstrate that the proposed GFPI achieves high quantification accuracy across different faults, with the I-V reconstruction error below 3%, confirming the feasibility and effectiveness of the application of differentiable physical simulators for PV system fault diagnosis.
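A toy version of gradient-based fault parameter identification is sketched below: a simplified, explicit single-diode-style string model is written in PyTorch so that a shading fraction and a series-resistance-like term can be recovered by gradient descent on the I-V mismatch. The model form, parameter values, and the use of Adam (in place of the paper's Adahessian) are all assumptions.

```python
# Minimal sketch: recover fault parameters by gradients through a toy I-V model.
import torch

def iv_model(V, shade, r_s, i_ph=8.0, i_0=1e-9, n_vt=1.5, n_cells=60):
    # Photocurrent scaled by the unshaded fraction; simple ohmic loss via r_s.
    return shade * i_ph - i_0 * (torch.exp(V / (n_vt * n_cells)) - 1) - V * r_s

# Synthetic "measured" curve with ground-truth fault parameters.
V = torch.linspace(0, 40, 100)
with torch.no_grad():
    I_meas = iv_model(V, shade=torch.tensor(0.7), r_s=torch.tensor(0.02))

shade = torch.tensor(1.0, requires_grad=True)     # fault params to identify
r_s = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([shade, r_s], lr=0.02)
for _ in range(2000):
    loss = ((iv_model(V, shade, r_s) - I_meas) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(shade.item(), r_s.item())                   # should approach 0.7 and 0.02
```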
[276] Bridging Training and Merging Through Momentum-Aware Optimization
Alireza Moayedikia, Alicia Troncoso
Main category: cs.LG
TL;DR: Unified framework maintains factorized momentum/curvature stats during training and reuses them for geometry-aware model merging, eliminating redundant computation and enabling better parameter selection.
Details
Motivation: Current workflows waste computation by discarding curvature information after training and recomputing similar information for model merging, losing valuable trajectory data that could enable more principled model composition.Method: Maintains factorized momentum and curvature statistics during training, accumulates task saliency scores for curvature-aware merging without post-hoc Fisher computation, and establishes convergence guarantees for non-convex objectives with bounded approximation error.
Result: Achieves memory efficiency comparable to SOTA, outperforms magnitude-only baselines across all sparsity levels, improves multi-task merging over strong baselines, exhibits rank-invariant convergence, and shows superior hyperparameter robustness.
Conclusion: Treating optimization trajectory as reusable asset rather than discarding it eliminates redundant computation while enabling more principled model composition through curvature-aware parameter selection and merging.
Abstract: Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging – wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method achieves memory efficiency comparable to state-of-the-art approaches while accumulating task saliency scores that enable curvature-aware merging without post-hoc Fisher computation. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach eliminates redundant computation while enabling more principled model composition.
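The reuse principle can be sketched in a few lines: accumulate a squared-gradient saliency for each parameter during each task's training, then merge task models with saliency-weighted averaging instead of recomputing Fisher information post hoc. The factorized storage of the actual method is elided; names and the toy usage are illustrative.

```python
# Minimal sketch: training-time saliency reused for saliency-weighted merging.
import torch

def accumulate_saliency(model, saliency, decay=0.99):
    """Call after loss.backward(): EMA of squared gradients per parameter."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            saliency[name] = decay * saliency.get(name, torch.zeros_like(p)) \
                             + (1 - decay) * p.grad.detach() ** 2

def saliency_weighted_merge(state_dicts, saliencies, eps=1e-8):
    merged = {}
    for name in state_dicts[0]:
        ws = torch.stack([s[name] for s in saliencies])        # (tasks, ...)
        ps = torch.stack([sd[name] for sd in state_dicts])
        merged[name] = (ws * ps).sum(0) / (ws.sum(0) + eps)
    return merged

# Toy usage with two "task models" sharing one architecture.
net = torch.nn.Linear(4, 2)
sal = {}
net(torch.randn(8, 4)).sum().backward()
accumulate_saliency(net, sal)
merged = saliency_weighted_merge([net.state_dict(), net.state_dict()], [sal, sal])
print(merged.keys())
```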
[277] Digitizing Nepal’s Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
Anjali Sarawgi, Esteban Garces Arias, Christof Zotter
Main category: cs.LG
TL;DR: First end-to-end HTR pipeline for Old Nepali using encoder-decoder architectures achieves 4.9% CER with data-centric techniques and error analysis.
Details
Motivation: Old Nepali is a historically significant but low-resource language lacking HTR systems, creating a need for automated transcription tools to preserve and study historical documents.Method: Line-level transcription approach with systematic exploration of encoder-decoder architectures, data-centric techniques, decoding strategies, and token-level confusion analysis.
Result: Best model achieves 4.9% Character Error Rate (CER). Training code, model configurations, and evaluation scripts are released publicly despite confidential dataset.
Conclusion: Successful development of first HTR pipeline for Old Nepali demonstrates feasibility for low-resource historical scripts and provides foundation for further research through released codebase.
Abstract: This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behaviour and error patterns. While the dataset we used for evaluation is confidential, we release our training code, model configurations, and evaluation scripts to support further research in HTR for low-resource historical scripts.
[278] The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining
Jasmine Vu, Shivanand Sheshappanavar
Main category: cs.LG
TL;DR: CLIP-based medical vision-language models struggle with interpreting negated phrases in medical contexts. This study evaluates and improves CheXagent’s ability to handle negation in chest X-ray image retrieval through fine-tuning methods.
Details
Motivation: CLIP models are widely used in medical imaging but underperform with negated phrases, which is critical for accurate medical diagnosis. Understanding and improving this limitation is essential for reliable medical AI applications.Method: Evaluated Stanford AIMI CheXagent model on chest X-ray image retrieval with prompts containing and without negation. Used fine-tuning methods from previous work to improve retrieval accuracy. Analyzed internal model behavior through token attribution, t-SNE projection, and attention-head ablation.
Result: Improved handling of negation in CLIP model with slight decrease in accuracy for positive prompt evaluation. Characterized how fine-tuning approaches reshape text encoder representations of negated clinical language.
Conclusion: The study provides insights into CLIP’s internal behavior with negation and demonstrates improvement in handling clinically relevant negated language, contributing to more reliable medical AI devices.
Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy by fine-tuning methods outlined in previous work. Results from this study show improvement in handling of negation in the CLIP model with a slight decrease in accuracy of positive prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, improving its reliability in medical AI devices.
[279] Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
Main category: cs.LG
TL;DR: GPA (Generalized Primal Averaging) extends Nesterov’s method to improve averaging-based optimizers like single-worker DiLoCo and Schedule-Free, offering smoother averaging, reduced memory, and better performance.
Details
Motivation: Address limitations of recent averaging-based optimizers (single-worker DiLoCo and Schedule-Free) which have issues like two-loop structure, increased memory requirements, and complex hyperparameter tuning in non-distributed settings.Method: Extends Nesterov’s primal averaging formulation by decoupling the interpolation constant, enabling smooth averaging at every step instead of periodic aggregation, generalizing single-worker DiLoCo while removing its two-loop structure.
Result: GPA consistently outperforms single-worker DiLoCo, provides 24.22% speedup on Llama-160M, 12-27% speedups on ImageNet ViT workloads, reduces memory overhead to single buffer, and simplifies hyperparameter tuning.
Conclusion: GPA offers a superior averaging-based optimization approach that improves convergence, reduces computational overhead, and maintains theoretical guarantees while being practically simpler to implement and tune.
Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov’s method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo’s periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW’s) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW’s validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
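A minimal sketch of smooth iterate averaging around a base optimizer appears below: the base optimizer steps the fast weights, and a single extra buffer holds the interpolated average x ← (1 − c)·x + c·z at every step, with the averaged weights swapped in for evaluation. This simplifies GPA (in particular, gradients here are taken at the fast iterates rather than at an interpolated point); the constant c and the AdamW base are assumptions.

```python
# Minimal sketch: per-step primal averaging with one extra buffer.
import torch

class PrimalAveraging:
    """Maintain averaged weights x alongside the optimizer's fast iterates z."""
    def __init__(self, model, c=0.05):
        self.model = model
        self.c = c
        self.avg = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self):
        # Call after base_opt.step(): x <- (1 - c) x + c z, every step.
        for n, p in self.model.named_parameters():
            self.avg[n].mul_(1 - self.c).add_(self.c * p)

    @torch.no_grad()
    def swap_in(self):
        # Load the averaged weights into the model for evaluation.
        for n, p in self.model.named_parameters():
            p.copy_(self.avg[n])

model = torch.nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
pa = PrimalAveraging(model)
for _ in range(10):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    pa.update()
pa.swap_in()
print(next(model.parameters()).norm().item())
```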
[280] Distributed Learning in Markovian Restless Bandits over Interference Graphs for Stable Spectrum Sharing
Liad Lea Didi, Kobi Cohen
Main category: cs.LG
TL;DR: SMILE algorithm achieves stable channel allocation in wireless networks with unknown restless Markov channels through distributed learning with logarithmic regret.
Details
Motivation: Need for distributed spectrum sharing among cognitive entities in communication-constrained networks where channels have unknown, time-varying rewards and interference constraints must be respected.Method: SMILE (Stable Multi-matching with Interference-aware LEarning) integrates restless bandit learning with graph-constrained coordination, enabling cells to balance exploration and exploitation while maintaining interference constraints.
Result: Proves SMILE converges to optimal stable allocation with logarithmic regret relative to full-knowledge genie; simulations show robustness, scalability, and efficiency across diverse scenarios.
Conclusion: First work to achieve global Gale-Shapley stability in stochastic restless environments, providing communication-efficient distributed learning solution for spectrum sharing in interference-constrained networks.
Abstract: We study distributed learning for spectrum access and sharing among multiple cognitive communication entities, such as cells, subnetworks, or cognitive radio users (collectively referred to as cells), in communication-constrained wireless networks modeled by interference graphs. Our goal is to achieve a globally stable and interference-aware channel allocation. Stability is defined through a generalized Gale-Shapley multi-to-one matching, a well-established solution concept in wireless resource allocation. We consider wireless networks where L cells share S orthogonal channels and cannot simultaneously use the same channel as their neighbors. Each channel evolves as an unknown restless Markov process with cell-dependent rewards, making this the first work to establish global Gale-Shapley stability for channel allocation in a stochastic, temporally varying restless environment. To address this challenge, we develop SMILE (Stable Multi-matching with Interference-aware LEarning), a communication-efficient distributed learning algorithm that integrates restless bandit learning with graph-constrained coordination. SMILE enables cells to distributedly balance exploration of unknown channels with exploitation of learned information. We prove that SMILE converges to the optimal stable allocation and achieves logarithmic regret relative to a genie with full knowledge of expected utilities. Simulations validate the theoretical guarantees and demonstrate SMILE’s robustness, scalability, and efficiency across diverse spectrum-sharing scenarios.
[281] BumpNet: A Sparse Neural Network Framework for Learning PDE Solutions
Shao-Ting Chiu, Ioannis G. Kevrekidis, Ulisses Braga-Neto
Main category: cs.LG
TL;DR: BumpNet is a sparse neural network framework using sigmoid-based basis functions for PDE solving and operator learning, featuring trainable parameters and dynamic pruning for model efficiency.
Details
Motivation: To create an efficient, meshless neural network framework for PDE numerical solutions and operator learning that leverages modern training techniques while achieving model parsimony and h-adaptivity through dynamic pruning.Method: BumpNet uses sigmoid activation functions as basis functions with fully trainable parameters (shape, location, amplitude). It incorporates dynamic pruning during training for sparsity. Three variants are proposed: Bump-PINNs for general PDEs, Bump-EDNN for time-evolution PDEs, and Bump-DeepONet for operator learning.
Result: Extensive numerical experiments demonstrate the efficiency and accuracy of the proposed BumpNet architecture across various PDE solving and operator learning tasks.
Conclusion: BumpNet provides an effective sparse neural network framework for PDE problems that combines the benefits of meshless methods with modern neural network training techniques, offering improved efficiency and accuracy through adaptive basis function pruning.
Abstract: We introduce BumpNet, a sparse neural network framework for PDE numerical solution and operator learning. BumpNet is based on meshless basis function expansion, in a similar fashion to radial-basis function (RBF) networks. Unlike RBF networks, the basis functions in BumpNet are constructed from ordinary sigmoid activation functions. This enables the efficient use of modern training techniques optimized for such networks. All parameters of the basis functions, including shape, location, and amplitude, are fully trainable. Model parsimony and h-adaptivity are effectively achieved through dynamically pruning basis functions during training. BumpNet is a general framework that can be combined with existing neural architectures for learning PDE solutions: here, we propose Bump-PINNs (BumpNet with physics-informed neural networks) for solving general PDEs; Bump-EDNN (BumpNet with evolutionary deep neural networks) to solve time-evolution PDEs; and Bump-DeepONet (BumpNet with deep operator networks) for PDE operator learning. Bump-PINNs are trained using the same collocation-based approach used by PINNs, Bump-EDNN uses a BumpNet only in the spatial domain and uses EDNNs to advance the solution in time, while Bump-DeepONets employ a BumpNet regression network as the trunk network of a DeepONet. Extensive numerical experiments demonstrate the efficiency and accuracy of the proposed architecture.
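The bump construction is easy to sketch in one dimension: each basis function is a difference of two sigmoids with trainable location, width, sharpness, and amplitude, and low-amplitude bumps are pruned for h-adaptivity. Dimensions, the regression target, and the pruning threshold below are illustrative, not the paper's PDE setting.

```python
# Minimal sketch: 1-D bump basis from sigmoid pairs, with amplitude pruning.
import torch
import torch.nn as nn

class BumpLayer(nn.Module):
    def __init__(self, n_bumps=64):
        super().__init__()
        self.loc = nn.Parameter(torch.linspace(-1, 1, n_bumps))
        self.width = nn.Parameter(0.1 * torch.ones(n_bumps))
        self.sharp = nn.Parameter(20.0 * torch.ones(n_bumps))
        self.amp = nn.Parameter(torch.randn(n_bumps) * 0.1)

    def forward(self, x):                       # x: (batch, 1)
        left = torch.sigmoid(self.sharp * (x - self.loc + self.width))
        right = torch.sigmoid(self.sharp * (x - self.loc - self.width))
        return ((left - right) * self.amp).sum(dim=-1, keepdim=True)

    @torch.no_grad()
    def prune(self, threshold=1e-3):
        keep = self.amp.abs() > threshold       # h-adaptivity via pruning
        for name in ("loc", "width", "sharp", "amp"):
            setattr(self, name, nn.Parameter(getattr(self, name)[keep]))

net = BumpLayer()
x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = torch.sin(3.14159 * x)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(500):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
net.prune()
print(len(net.amp), loss.item())               # surviving bumps, final fit loss
```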
[282] Learning solution operator of dynamical systems with diffusion maps kernel ridge regression
Jiwoo Song, Daning Huang, John Harlim
Main category: cs.LG
TL;DR: DM-KRR combines kernel ridge regression with diffusion maps to achieve superior long-term prediction of complex dynamical systems by respecting intrinsic geometric constraints.
Details
Motivation: Existing data-driven models for complex nonlinear dynamics often fail in long-term prediction because they don't properly capture the geometric structures governing system behavior. Current methods deteriorate when geometric constraints are unknown or poorly represented.Method: Diffusion Maps Kernel Ridge Regression (DM-KRR) uses a simple kernel ridge regression framework combined with a dynamics-aware validation strategy. It employs a data-driven kernel derived from diffusion maps that implicitly adapts to the intrinsic geometry of the system’s invariant set, without requiring explicit manifold reconstruction or attractor modeling.
Result: DM-KRR consistently outperforms state-of-the-art methods (random feature, neural-network, and operator-learning approaches) across diverse systems including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows. It achieves better accuracy and data efficiency.
Conclusion: Long-term predictive skill depends critically on respecting geometric constraints encoded in data through dynamically consistent model selection. The combination of simplicity, geometry awareness, and strong empirical performance offers a promising path for reliable and efficient learning of complex dynamical systems.
Abstract: Many scientific and engineering systems exhibit complex nonlinear dynamics that are difficult to predict accurately over long time horizons. Although data-driven models have shown promise, their performance often deteriorates when the geometric structures governing long-term behavior are unknown or poorly represented. We demonstrate that a simple kernel ridge regression (KRR) framework, when combined with a dynamics-aware validation strategy, provides a strong baseline for long-term prediction of complex dynamical systems. By employing a data-driven kernel derived from diffusion maps, the proposed Diffusion Maps Kernel Ridge Regression (DM-KRR) method implicitly adapts to the intrinsic geometry of the system’s invariant set, without requiring explicit manifold reconstruction or attractor modeling, procedures that often limit predictive performance. Across a broad range of systems, including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows, DM-KRR consistently outperforms state-of-the-art random feature, neural-network and operator-learning methods in both accuracy and data efficiency. These findings underscore that long-term predictive skill depends not only on model expressiveness, but critically on respecting the geometric constraints encoded in the data through dynamically consistent model selection. Together, simplicity, geometry awareness, and strong empirical performance point to a promising path for reliable and efficient learning of complex dynamical systems.
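The two-stage recipe can be sketched as follows: build a density-normalized Gaussian kernel in the spirit of diffusion maps on the training data, then solve kernel ridge regression with it to learn a one-step map. The normalization variant, bandwidth, and toy circle dynamics are assumptions, not the paper's implementation.

```python
# Minimal sketch: diffusion-maps-style kernel + kernel ridge regression.
import numpy as np

def diffusion_kernel(X, Y, eps):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps)

def dm_krr_fit(X, targets, eps=0.5, ridge=1e-6):
    K = diffusion_kernel(X, X, eps)
    q = K.sum(axis=1)                           # kernel density estimate
    K = K / np.outer(q, q)                      # remove sampling-density bias
    alpha = np.linalg.solve(K + ridge * np.eye(len(X)), targets)
    return alpha, q

def dm_krr_predict(Xtr, alpha, q, x_new, eps=0.5):
    k = diffusion_kernel(x_new[None, :], Xtr, eps)[0]
    k = k / (k.sum() * q)                       # matching normalization
    return k @ alpha

# Toy dynamical system on a circle: learn the one-step rotation map.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
Y = np.stack([np.cos(theta + 0.1), np.sin(theta + 0.1)], axis=1)
alpha, q = dm_krr_fit(X, Y)
print(dm_krr_predict(X, alpha, q, X[0]))        # should approximate Y[0]
```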
[283] Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods
Iason Kyriakopoulos, Yannis Theodoridis
Main category: cs.LG
TL;DR: Systematic comparison of EV charging demand forecasting methods across multiple temporal horizons and spatial aggregation levels using real-world datasets.
Details
Motivation: As electric vehicles grow in popularity for addressing climate change, concerns about their impact on electric grid management have emerged, making EV charging demand prediction a timely and important research problem. There's a lack of systematic comparisons of forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings.Method: Investigates five time series forecasting models ranging from traditional statistical approaches to machine learning and deep learning methods. Evaluates forecasting performance for short-, mid-, and long-term horizons (minutes, hours, days) across spatial scales from individual charging stations to regional and city-level aggregations. Analysis conducted on four publicly available real-world datasets.
Result: Results reported independently for each dataset. The work represents the first systematic evaluation of EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.
Conclusion: This comprehensive study provides valuable insights into EV charging demand forecasting across different time scales and spatial resolutions, addressing a significant gap in the literature and offering practical guidance for grid management and infrastructure planning.
Abstract: With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.
[284] SHARP-QoS: Sparsely-gated Hierarchical Adaptive Routing for joint Prediction of QoS
Suraj Kumar, Arvind Kumar, Soumi Chattopadhyay
Main category: cs.LG
TL;DR: SHARP-QoS is a unified framework for joint QoS prediction that addresses data sparsity, noise, hierarchical dependencies, and negative transfer through hyperbolic convolution, adaptive feature sharing, and loss balancing.
Details
Motivation: Real-world QoS data is sparse, noisy, and has hierarchical dependencies. Existing methods predict each QoS parameter separately (inefficient) or use joint approaches that suffer from negative transfer due to inconsistent numerical ranges and inadequate representation learning.Method: Three components: 1) Dual hyperbolic convolution in Poincaré ball to extract hierarchical features from QoS and contextual structures; 2) Adaptive feature-sharing mechanism with gated feature fusion for dynamic feature selection; 3) EMA-based loss balancing strategy for stable joint optimization.
Result: Outperforms both single- and multi-task baselines on three datasets with 2, 3, and 4 QoS parameters. Effectively addresses sparsity, robustness to outliers, and cold-start problems while maintaining moderate computational overhead.
Conclusion: SHARP-QoS provides a reliable solution for joint QoS prediction by addressing key challenges through hierarchical feature extraction, adaptive feature sharing, and balanced optimization, making it suitable for dependable service-oriented computing.
Abstract: Dependable service-oriented computing relies on multiple Quality of Service (QoS) parameters that are essential to assess service optimality. However, real-world QoS data are extremely sparse, noisy, and shaped by hierarchical dependencies arising from QoS interactions, and geographical and network-level factors, making accurate QoS prediction challenging. Existing methods often predict each QoS parameter separately, requiring multiple similar models, which increases computational cost and leads to poor generalization. Although recent joint QoS prediction studies have explored shared architectures, they suffer from negative transfer due to loss-scaling caused by inconsistent numerical ranges across QoS parameters and further struggle with inadequate representation learning, resulting in degraded accuracy. This paper presents an unified strategy for joint QoS prediction, called SHARP-QoS, that addresses these issues using three components. First, we introduce a dual mechanism to extract the hierarchical features from both QoS and contextual structures via hyperbolic convolution formulated in the Poincaré ball. Second, we propose an adaptive feature-sharing mechanism that allows feature exchange across informative QoS and contextual signals. A gated feature fusion module is employed to support dynamic feature selection among structural and shared representations. Third, we design an EMA-based loss balancing strategy that allows stable joint optimization, thereby mitigating the negative transfer. Evaluations on three datasets with two, three, and four QoS parameters demonstrate that SHARP-QoS outperforms both single- and multi-task baselines. Extensive study shows that our model effectively addresses major challenges, including sparsity, robustness to outliers, and cold-start, while maintaining moderate computational overhead, underscoring its capability for reliable joint QoS prediction.
[285] Understanding Generalization in Role-Playing Models via Information Theory
Yongqi Li, Hao Lang, Fei Huang, Tieyun Qian, Yongbin Li
Main category: cs.LG
TL;DR: The paper introduces R-EMID, an information-theoretic metric to measure and predict role-playing model generalization degradation under distribution shifts, and proposes a co-evolving RL framework to improve response probability estimation.
Details
Motivation: Role-playing models underperform in real-world deployment due to distribution shifts (user, character, dialogue compositional shifts). Existing methods like LLM-as-a-judge lack fine-grained diagnosis, and there's no formal framework to characterize RPM generalization behaviors.Method: 1) Introduces R-EMID (reasoning-based effective mutual information difference) to measure RPM performance degradation interpretably. 2) Derives an upper bound on R-EMID to predict worst-case generalization. 3) Proposes co-evolving reinforcement learning framework to adaptively model connections among user, character, and dialogue context for better response probability estimation.
Result: Evaluation shows user shift poses the highest risk among all distribution shifts, and reinforcement learning is the most effective approach for enhancing RPM generalization.
Conclusion: The paper provides a formal information-theoretic framework (R-EMID) to diagnose RPM generalization issues, offers theoretical insights into shift contributions, and demonstrates practical improvements through co-evolving RL for better generalization.
Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and thus there lack formal frameworks to characterize RPM generalization behaviors. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.
[286] A Theoretical Analysis of State Similarity Between Markov Decision Processes
Zhenyu Tao, Wei Xu, Xiaohu You
Main category: cs.LG
TL;DR: GBSM extends bisimulation metrics to measure state similarity between different MDPs with rigorous mathematical properties, enabling tighter theoretical bounds for cross-MDP applications.
Details
Motivation: Bisimulation metrics are useful for analyzing state similarities within single MDPs but lack well-established mathematical properties for comparing states across different MDPs, limiting theoretical analysis and practical applications in multi-MDP scenarios.Method: Formally establish a Generalized Bisimulation Metric (GBSM) with rigorous proofs of three fundamental metric properties: symmetry, inter-MDP triangle inequality, and distance bound on identical spaces. Use these properties to analyze policy transfer, state aggregation, and sampling-based estimation across MDPs.
Result: GBSM provides explicit bounds for cross-MDP applications that are strictly tighter than existing bounds from standard BSM. It also offers a closed-form sample complexity for estimation, improving upon existing asymptotic results. Numerical results validate the theoretical findings.
Conclusion: GBSM successfully extends bisimulation metrics to measure state similarity between arbitrary MDPs with rigorous mathematical foundations, enabling improved theoretical analysis and practical applications in multi-MDP scenarios with tighter bounds and better sample complexity.
Abstract: The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
[287] Task Schema and Binding: A Double Dissociation Study of In-Context Learning
Chaeha Kim
Main category: cs.LG
TL;DR: ICL decomposes into two separable mechanisms: Task Schema (abstract task recognition) and Binding (specific associations), with causal validation across multiple architectures.
Details
Motivation: To provide causal mechanistic validation that in-context learning decomposes into separable mechanisms, contrasting with prior monolithic views of ICL as retrieval-based, gradient descent-like, or purely Bayesian approaches.Method: Activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), examining transfer patterns and correlations between schema reliance and prior knowledge.
Result: 1) Double dissociation showing separable mechanisms (Schema transfers at 100% via late MLP, Binding at 62% via residual stream); 2) Inverse correlation between Schema reliance and prior knowledge; 3) Mechanism operates across all tested architectures including non-Transformer Mamba.
Conclusion: ICL involves dual dissociable mechanisms (Schema and Binding) rather than monolithic processing. Models rely on Task Schema when prior knowledge is absent, with attentional mis-routing (not output competition) explaining performance patterns. This enables more efficient prompt engineering and improved ICL system reliability.
Abstract: We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings:
- Double dissociation: Task Schema transfers at 100% via late MLP patching; Binding transfers at 62% via residual stream patching – proving separable mechanisms
- Prior-Schema trade-off: Schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p < 0.001, N=28 task-model pairs)
- Architecture generality: The mechanism operates across all tested architectures including the non-Transformer Mamba These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable – not merely behavioral modes – we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail – and reveals that the true bottleneck is attentional, not output-level. Practical implications: Understanding these dual mechanisms enables more efficient prompt engineering – reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
[288] MINPO: Memory-Informed Neural Pseudo-Operator to Resolve Nonlocal Spatiotemporal Dynamics
Farinaz Mostajeran, Aruzhan Tleubek, Salah A Faroughi
Main category: cs.LG
TL;DR: MINPO is a unified neural framework for solving integro-differential equations that learns nonlocal operators and their inverses directly through neural representations, outperforming classical methods and existing neural solvers.
Details
Motivation: Traditional methods for solving integro-differential equations (IDEs) are computationally expensive due to repeated convolution integral evaluations, and existing neural solvers lack generalization across diverse nonlocal structures.Method: MINPO uses either KANs or MLPs as encoders to learn nonlocal operators and their inverses through neural representations, with a lightweight nonlocal consistency loss to enforce coherence between learned operators and reconstructed solutions.
Result: MINPO demonstrates superior accuracy and robustness compared to classical techniques and state-of-the-art neural methods (A-PINN, fPINN, A-PIKAN, fPIKAN) across diverse kernel types, dimensionalities, and computational demands.
Conclusion: MINPO provides a unified framework for systems governed by nonlocal operators, generalizing beyond problem-specific formulations and efficiently handling nonlocal spatiotemporal dependencies in IDEs and fractional PDEs.
Abstract: Many physical systems exhibit nonlocal spatiotemporal behaviors described by integro-differential equations (IDEs). Classical methods for solving IDEs require repeatedly evaluating convolution integrals, whose cost increases quickly with kernel complexity and dimensionality. Existing neural solvers can accelerate selected instances of these computations, yet they do not generalize across diverse nonlocal structures. In this work, we introduce the Memory-Informed Neural Pseudo-Operator (MINPO), a unified framework for modeling nonlocal dynamics arising from long-range spatial interactions and/or long-term temporal memory. MINPO, employing either Kolmogorov-Arnold Networks (KANs) or multilayer perceptron networks (MLPs) as encoders, learns the nonlocal operator and its inverse directly through neural representations, and then explicitly reconstruct the unknown solution fields. The learning is guarded by a lightweight nonlocal consistency loss term to enforce coherence between the learned operator and reconstructed solution. The MINPO formulation allows to naturally capture and efficiently resolve nonlocal spatiotemporal dependencies governed by a wide spectrum of IDEs and their subsets, including fractional PDEs. We evaluate the efficacy of MINPO in comparison with classical techniques and state-of-the-art neural-based strategies based on MLPs, such as A-PINN and fPINN, along with their newly-developed KAN variants, A-PIKAN and fPIKAN, designed to facilitate a fair comparison. Our study offers compelling evidence of the accuracy of MINPO and demonstrates its robustness in handling (i) diverse kernel types, (ii) different kernel dimensionalities, and (iii) the substantial computational demands arising from repeated evaluations of kernel integrals. MINPO, thus, generalizes beyond problem-specific formulations, providing a unified framework for systems governed by nonlocal operators.
[289] AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
Tung-Ling Li, Yuhao Wu, Hongliang Liu
Main category: cs.LG
TL;DR: Reward models and LLM-as-a-Judge systems are vulnerable to short sequences of low-perplexity control tokens that can flip binary evaluations from correct “No” to incorrect “Yes” judgments by manipulating the last-layer logit gap.
Details
Motivation: The paper addresses a critical vulnerability in modern post-training pipelines (RLHF, DPO, RLAIF) where reward models and LLM-as-a-Judge systems are central. These systems provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning, but they exhibit recurring vulnerabilities that represent realistic reward-hacking risks rather than worst-case adversarial scenarios.Method: The authors introduce AdvJudge-Zero, which uses the model’s next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch. The method analyzes how induced hidden-state perturbations concentrate in a low-rank “soft mode” that is anti-aligned with the judge’s refusal direction.
Result: Empirical results show that these control tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. The control tokens successfully flip many binary evaluations from correct “No” judgments to incorrect “Yes” judgments.
Conclusion: The paper demonstrates that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality, offering a practical defense against this vulnerability in judge systems.
Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct No'' judgments to incorrect Yes’’ judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model’s next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank ``soft mode’’ that is anti-aligned with the judge’s refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
[290] Alzheimer’s Disease Brain Network Mining
Alireza Moayedikia, Sara Fin
Main category: cs.LG
TL;DR: MATCH-AD is a semi-supervised framework that combines deep representation learning, graph-based label propagation, and optimal transport theory to diagnose Alzheimer’s disease with near-perfect accuracy using only partial labels.
Details
Motivation: Clinical assessments for Alzheimer's disease are expensive and invasive, resulting in ground truth labels being available for only a fraction of neuroimaging datasets, creating a fundamental challenge for machine learning approaches to AD diagnosis.Method: MATCH-AD integrates deep representation learning, graph-based label propagation, and optimal transport theory (specifically Wasserstein distances) to propagate diagnostic information from limited labeled samples to larger unlabeled populations while quantifying disease progression between cognitive states.
Result: Evaluated on nearly 5,000 subjects from NACC with structural MRI, CSF biomarkers, and clinical variables, MATCH-AD achieves near-perfect diagnostic accuracy despite having ground truth labels for less than one-third of subjects, substantially outperforming all baselines with kappa indicating almost perfect agreement.
Conclusion: Principled semi-supervised learning can unlock the diagnostic potential of partially annotated neuroimaging data worldwide, substantially reducing annotation burden while maintaining clinical-grade accuracy suitable for deployment.
Abstract: Machine learning approaches for Alzheimer’s disease (AD) diagnosis face a fundamental challenges. Clinical assessments are expensive and invasive, leaving ground truth labels available for only a fraction of neuroimaging datasets. We introduce Multi view Adaptive Transport Clustering for Heterogeneous Alzheimer’s Disease (MATCH-AD), a semi supervised framework that integrates deep representation learning, graph-based label propagation, and optimal transport theory to address this limitation. The framework leverages manifold structure in neuroimaging data to propagate diagnostic information from limited labeled samples to larger unlabeled populations, while using Wasserstein distances to quantify disease progression between cognitive states. Evaluated on nearly five thousand subjects from the National Alzheimer’s Coordinating Center, encompassing structural MRI measurements from hundreds of brain regions, cerebrospinal fluid biomarkers, and clinical variables MATCHAD achieves near-perfect diagnostic accuracy despite ground truth labels for less than one-third of subjects. The framework substantially outperforms all baseline methods, achieving kappa indicating almost perfect agreement compared to weak agreement for the best baseline, a qualitative transformation in diagnostic reliability. Performance remains clinically useful even under severe label scarcity, and we provide theoretical convergence guarantees with proven bounds on label propagation error and transport stability. These results demonstrate that principled semi-supervised learning can unlock the diagnostic potential of the vast repositories of partially annotated neuroimaging data accumulating worldwide, substantially reducing annotation burden while maintaining accuracy suitable for clinical deployment.
[291] M2RU: Memristive Minion Recurrent Unit for Continual Learning at the Edge
Abdullah M. Zyarah, Dhireesha Kudithipudi
Main category: cs.LG
TL;DR: M2RU is a mixed-signal architecture implementing minion recurrent unit for efficient temporal processing with on-chip continual learning, achieving 312 GOPS/W and maintaining accuracy within 5% of software baselines.
Details
Motivation: Continual learning on edge platforms is challenging due to energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. Current approaches are not efficient enough for real-time adaptation in edge-level temporal intelligence applications.Method: Introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit with weighted-bit streaming (enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion) and an experience replay mechanism that stabilizes learning under domain shifts.
Result: Achieves 15 GOPS at 48.62 mW (312 GOPS per watt), maintains accuracy within 5% of software baselines on sequential MNIST and CIFAR-10 tasks. Provides 29X improvement in energy efficiency compared to CMOS digital design. Device-aware analysis shows expected operational lifetime of 12.2 years under continual learning workloads.
Conclusion: M2RU establishes a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence, addressing the challenges of continual learning on edge platforms through mixed-signal architecture innovations.
Abstract: Continual learning on edge platforms remains challenging because recurrent networks depend on energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. This work introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit for efficient temporal processing with on-chip continual learning. The architecture integrates weighted-bit streaming, which enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion, and an experience replay mechanism that stabilizes learning under domain shifts. M2RU achieves 15 GOPS at 48.62 mW, corresponding to 312 GOPS per watt, and maintains accuracy within 5 percent of software baselines on sequential MNIST and CIFAR-10 tasks. Compared with a CMOS digital design, the accelerator provides 29X improvement in energy efficiency. Device-aware analysis shows an expected operational lifetime of 12.2 years under continual learning workloads. These results establish M2RU as a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence.
[292] Explanation Beyond Intuition: A Testable Criterion for Inherent Explainability
Michael Merry, Pat Riddle, Jim Warren
Main category: cs.LG
TL;DR: The paper proposes a formal criterion for inherent explainability using graph theory to decompose models into structure-local explanations and recompose them into global explanations, differentiating between “explainable” models and “explained” ones with verified explanations.
Details
Motivation: There's no consistent definition or test for inherent explainability in XAI - current approaches either use metrics or appeal to intuition. The paper aims to establish a rigorous, globally applicable criterion for inherent explainability.Method: Uses graph theory to represent and decompose models for structure-local explanations, forming them as annotations (verifiable hypothesis-evidence structures). Differentiates between explainable models (allow explanation) and explained models (have verified explanations).
Result: Provides a criterion that matches existing intuitions on inherent explainability, explains why large regression models may not be explainable while sparse neural networks could be, and demonstrates with PREDICT - a Cox proportional hazards model in clinical use.
Conclusion: PREDICT is inherently explainable under the proposed criterion. The work provides structure to formalize explainability research and offers regulators a flexible but rigorous test for compliance frameworks.
Abstract: Inherent explainability is the gold standard in Explainable Artificial Intelligence (XAI). However, there is not a consistent definition or test to demonstrate inherent explainability. Work to date either characterises explainability through metrics, or appeals to intuition - “we know it when we see it”. We propose a globally applicable criterion for inherent explainability. The criterion uses graph theory for representing and decomposing models for structure-local explanation, and recomposing them into global explanations. We form the structure-local explanations as annotations, a verifiable hypothesis-evidence structure that allows for a range of explanatory methods to be used. This criterion matches existing intuitions on inherent explainability, and provides justifications why a large regression model may not be explainable but a sparse neural network could be. We differentiate explainable – a model that allows for explanation – and \textit{explained} – one that has a verified explanation. Finally, we provide a full explanation of PREDICT – a Cox proportional hazards model of cardiovascular disease risk, which is in active clinical use in New Zealand. It follows that PREDICT is inherently explainable. This work provides structure to formalise other work on explainability, and allows regulators a flexible but rigorous test that can be used in compliance frameworks.
[293] Adaptive Graph Pruning with Sudden-Events Evaluation for Traffic Prediction using Online Semi-Decentralized ST-GNNs
Ivan Kralj, Lodovico Giaretta, Gordan Ježić, Ivana Podnar Žarko, Šarūnas Girdzijauskas
Main category: cs.LG
TL;DR: Adaptive pruning algorithm for ST-GNNs reduces communication overhead in edge-based smart mobility systems while maintaining prediction accuracy, with novel SEPA metric for measuring responsiveness to traffic events.
Details
Motivation: ST-GNNs deployed at edge across distributed cloudlets create substantial communication overhead due to repeated transmission of overlapping node features between neighboring cloudlets, which needs to be reduced while maintaining prediction quality.Method: Propose adaptive pruning algorithm that dynamically filters redundant neighbor features while preserving informative spatial context, adjusting pruning rates based on recent model performance. Also introduce SEPA (Sudden Event Prediction Accuracy) metric to measure responsiveness to traffic slowdowns and recoveries.
Result: Adaptive pruning maintains prediction accuracy while significantly lowering communication costs in all online semi-decentralized settings (traditional FL, server-free FL, Gossip Learning). SEPA exposes true value of spatial connectivity in predicting dynamic traffic events, unlike standard metrics.
Conclusion: Communication can be reduced without compromising responsiveness to critical traffic events. Adaptive pruning effectively balances communication efficiency and prediction quality in edge-based ST-GNN deployments for smart mobility systems.
Abstract: Spatio-Temporal Graph Neural Networks (ST-GNNs) are well-suited for processing high-frequency data streams from geographically distributed sensors in smart mobility systems. However, their deployment at the edge across distributed compute nodes (cloudlets) createssubstantial communication overhead due to repeated transmission of overlapping node features between neighbouring cloudlets. To address this, we propose an adaptive pruning algorithm that dynamically filters redundant neighbour features while preserving the most informative spatial context for prediction. The algorithm adjusts pruning rates based on recent model performance, allowing each cloudlet to focus on regions experiencing traffic changes without compromising accuracy. Additionally, we introduce the Sudden Event Prediction Accuracy (SEPA), a novel event-focused metric designed to measure responsiveness to traffic slowdowns and recoveries, which are often missed by standard error metrics. We evaluate our approach in an online semi-decentralized setting with traditional FL, server-free FL, and Gossip Learning on two large-scale traffic datasets, PeMS-BAY and PeMSD7-M, across short-, mid-, and long-term prediction horizons. Experiments show that, in contrast to standard metrics, SEPA exposes the true value of spatial connectivity in predicting dynamic and irregular traffic. Our adaptive pruning algorithm maintains prediction accuracy while significantly lowering communication cost in all online semi-decentralized settings, demonstrating that communication can be reduced without compromising responsiveness to critical traffic events.
[294] Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach
Yidong Chai, Yi Liu, Mohammadreza Ebrahimi, Weifeng Li, Balaji Padmanabhan
Main category: cs.LG
TL;DR: LLM-SGA framework enhances adversarial robustness for harmful content detection using LLM-based sample generation and ensemble methods with dynamic weighting.
Details
Motivation: Social media platforms need robust ML detectors for harmful content, but current models are vulnerable to adversarial attacks that evade detection through subtle text modifications.Method: Proposes LLM-SGA framework leveraging attack invariances, then instantiates ARHOCD detector with ensemble methods, dynamic Bayesian weight assignment, and iterative adversarial training.
Result: ARHOCD demonstrates strong generalizability and improved detection accuracy across hate speech, rumor, and extremist content datasets under adversarial conditions.
Conclusion: The proposed framework successfully addresses adversarial robustness challenges by combining LLM-based sample generation with ensemble techniques and dynamic weighting, offering effective defense against diverse attacks.
Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample’s predictability and each base detector’s capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
[295] DeepShare: Sharing ReLU Across Channels and Layers for Efficient Private Inference
Yonathan Bornfeld, Shai Avidan
Main category: cs.LG
TL;DR: The paper proposes a novel activation module for Private Inference that reduces computational bottlenecks by having one DReLU operation serve multiple ReLU operations through prototype and replicate channels, achieving state-of-the-art results.
Details
Motivation: Private Inference (PI) faces major computational bottlenecks due to ReLU calculations. Current efforts focus on reducing ReLU counts, but the paper identifies DReLU (the non-linear step function of ReLU) as a key bottleneck and proposes a more efficient approach.Method: A new activation module where DReLU operations are only performed on a subset of channels (prototype channels), while replicate channels copy DReLU results from corresponding prototype neurons. This approach is extended to work across different layers.
Result: Drastically reduces DReLU operations in ResNet-type networks, solves an extended XOR problem with just one non-linearity and two neurons (which traditional methods cannot achieve), and achieves new SOTA results on classification tasks and image segmentation.
Conclusion: The proposed activation module significantly improves Private Inference efficiency by reducing DReLU operations while maintaining or improving performance, enabling more practical PI applications with state-of-the-art results across multiple domains.
Abstract: Private Inference (PI) uses cryptographic primitives to perform privacy preserving machine learning. In this setting, the owner of the network runs inference on the data of the client without learning anything about the data and without revealing any information about the model. It has been observed that a major computational bottleneck of PI is the calculation of the gate (i.e., ReLU), so a considerable amount of effort have been devoted to reducing the number of ReLUs in a given network. We focus on the DReLU, which is the non-linear step function of the ReLU and show that one DReLU can serve many ReLU operations. We suggest a new activation module where the DReLU operation is only performed on a subset of the channels (Prototype channels), while the rest of the channels (replicate channels) replicates the DReLU of each of their neurons from the corresponding neurons in one of the prototype channels. We then extend this idea to work across different layers. We show that this formulation can drastically reduce the number of DReLU operations in resnet type network. Furthermore, our theoretical analysis shows that this new formulation can solve an extended version of the XOR problem, using just one non-linearity and two neurons, something that traditional formulations and some PI specific methods cannot achieve. We achieve new SOTA results on several classification setups, and achieve SOTA results on image segmentation.
[296] meval: A Statistical Toolbox for Fine-Grained Model Performance Analysis
Dishantkumar Sutariya, Eike Petersen
Main category: cs.LG
TL;DR: A statistical toolbox for rigorous subgroup performance analysis of ML models in medical imaging, addressing metric selection, uncertainty estimation, multiple comparisons, and intersectional subgroup discovery.
Details
Motivation: Stratified analysis of ML model performance by patient/recording properties is crucial for identifying failure modes, but performing such analyses with statistical rigor is challenging due to issues with metric selection, uncertainty estimation, multiple comparisons, and intersectional subgroup discovery.Method: Developed a statistical toolbox that provides appropriate performance metrics for different group sizes and base rates, determines metric uncertainty, corrects for multiple comparisons, and implements mechanisms to find the most ‘interesting’ subgroups in intersectional analyses among many combinations.
Result: The toolbox enables practitioners to rigorously assess models for potential subgroup performance disparities. Demonstrated applicability through two medical imaging case studies: skin lesion malignancy classification (ISIC2020 dataset) and chest X-ray-based disease classification (MIMIC-CXR dataset).
Conclusion: The presented statistical toolbox addresses key challenges in rigorous subgroup performance analysis for ML models, particularly in medical imaging applications, making it easier for practitioners to identify and address performance disparities across different patient subgroups.
Abstract: Analyzing machine learning model performance stratified by patient and recording properties is becoming the accepted norm and often yields crucial insights about important model failure modes. Performing such analyses in a statistically rigorous manner is non-trivial, however. Appropriate performance metrics must be selected that allow for valid comparisons between groups of different sample sizes and base rates; metric uncertainty must be determined and multiple comparisons be corrected for, in order to assess whether any observed differences may be purely due to chance; and in the case of intersectional analyses, mechanisms must be implemented to find the most `interesting’ subgroups within combinatorially many subgroup combinations. We here present a statistical toolbox that addresses these challenges and enables practitioners to easily yet rigorously assess their models for potential subgroup performance disparities. While broadly applicable, the toolbox is specifically designed for medical imaging applications. The analyses provided by the toolbox are illustrated in two case studies, one in skin lesion malignancy classification on the ISIC2020 dataset and one in chest X-ray-based disease classification on the MIMIC-CXR dataset.
[297] Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning
Javier Gonzalez-Ruiz, Carlos Rodriguez-Pardo, Iacopo Savelli, Alice Di Bella, Massimo Tavoni
Main category: cs.LG
TL;DR: A multi-agent reinforcement learning model for analyzing long-term electricity markets and decarbonization pathways, applied to the Italian electricity system.
Details
Motivation: Electricity systems are crucial for carbon-free economies, but policymakers need better tools to design, test, and evaluate long-term market mechanisms (auctions, support schemes, policies) that shape electricity generation mix during decarbonization.Method: Multi-agent reinforcement learning model where profit-maximizing generation companies make investment decisions in wholesale electricity markets. Uses independent proximal policy optimization (PPO) selected for decentralized competitive environments, with extensive hyperparameter search to ensure competitive behavior outcomes.
Result: Applied to a stylized Italian electricity system under varying competition levels, market designs, and policy scenarios. Results show critical role of market design for decarbonization and avoiding price volatility.
Conclusion: The framework enables assessment of long-term electricity markets where multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.
Abstract: Electricity systems are key to transforming today’s society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, which was selected for suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.
[298] Learning What to Write: Write-Gated KV for Efficient Long-Context Inference
Yen-Chieh Huang, Rui Fang, Ming-Syan Chen, Pi-Cheng Hsiu
Main category: cs.LG
TL;DR: Write-Gated KV: A lightweight mechanism that learns to predict token utility before writing to KV cache, reducing memory usage by 46-57% and achieving 3.03-3.45× prefill and 1.89-2.56× decode speedups with negligible accuracy loss.
Details
Motivation: Long-context LLM inference suffers from quadratic attention complexity and linear KV cache growth. Prior approaches use post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory.Method: Formalize KV cache management as a causal system with three primitives: KV Admission, Selection, and Eviction. Instantiate KV Admission via Write-Gated KV, which learns to predict token utility before it enters the cache. Filters out low-utility states early to maintain a compact global cache alongside a sliding local cache.
Result: Reduces memory usage by 46-57%, delivers 3.03-3.45× prefill speedups and 1.89-2.56× decode speedups on Llama models with negligible accuracy loss. Compatible with FlashAttention and paged-KV systems.
Conclusion: Learning what to write to the KV cache is a principled and practical recipe for efficient long-context inference, addressing the root inefficiency of indiscriminate memory writing rather than relying on post-hoc management.
Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama model with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write, is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .
[299] A lightweight Spatial-Temporal Graph Neural Network for Long-term Time Series Forecasting
Henok Tenaw Moges, Deshendran Moodley
Main category: cs.LG
TL;DR: Lite-STGNN is a lightweight spatial-temporal GNN for long-term multivariate forecasting that combines decomposition-based temporal modeling with learnable sparse graph structure, achieving SOTA accuracy with high efficiency.
Details
Motivation: The paper aims to develop an efficient and interpretable framework for long-term multivariate time series forecasting that addresses the computational complexity of transformer-based methods while maintaining high accuracy.Method: The method integrates trend-seasonal decomposition for temporal modeling with a spatial module using low-rank Top-K adjacency learning and conservative horizon-wise gating for message passing, enhancing a strong linear baseline with spatial corrections.
Result: Achieves state-of-the-art accuracy on four benchmark datasets for horizons up to 720 steps, with 4.6% improvement from spatial module, 3.3% enhancement from Top-K locality, while being parameter-efficient and substantially faster than transformer methods.
Conclusion: Lite-STGNN offers a compact, interpretable, and efficient framework for long-term multivariate forecasting, with learned adjacency matrices revealing domain-specific interaction dynamics.
Abstract: We propose Lite-STGNN, a lightweight spatial-temporal graph neural network for long-term multivariate forecasting that integrates decomposition-based temporal modeling with learnable sparse graph structure. The temporal module applies trend-seasonal decomposition, while the spatial module performs message passing with low-rank Top-$K$ adjacency learning and conservative horizon-wise gating, enabling spatial corrections that enhance a strong linear baseline. Lite-STGNN achieves state-of-the-art accuracy on four benchmark datasets for horizons up to 720 steps, while being parameter-efficient and substantially faster to train than transformer-based methods. Ablation studies show that the spatial module yields 4.6% improvement over the temporal baseline, Top-$K$ enhances locality by 3.3%, and learned adjacency matrices reveal domain-specific interaction dynamics. Lite-STGNN thus offers a compact, interpretable, and efficient framework for long-term multivariate time series forecasting.
[300] Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study
Shubham Das, Kaushal Singhania, Amit Sadhu, Suprabhat Das, Arghya Nandi
Main category: cs.LG
TL;DR: Deep learning surrogate models (BiLSTM-VAE and BiLSTM-Transformer) replace computationally expensive ANSYS creep simulations for Inconel 625, providing fast and accurate predictions with substantial speedup.
Details
Motivation: Finite-element creep simulations in ANSYS for high-temperature alloys like Inconel 625 are computationally expensive (30-40 minutes per simulation), limiting rapid assessment for design optimization and structural health monitoring.Method: Generated creep strain data in ANSYS using Norton law under uniaxial stresses (50-150 MPa) and temperatures (700-1000°C). Trained two deep learning architectures: BiLSTM Variational Autoencoder for uncertainty-aware generative predictions, and BiLSTM Transformer hybrid using self-attention for long-range temporal behavior.
Result: Both models achieved strong performance (evaluated with RMSE, MAE, R²). BiLSTM-VAE provides stable probabilistic forecasts, BiLSTM-Transformer delivers high deterministic accuracy. Surrogate models produce predictions within seconds vs. 30-40 minutes for ANSYS simulations.
Conclusion: The deep learning surrogate framework enables rapid creep assessment for design optimization and structural health monitoring, providing a scalable solution for high-temperature alloy applications with substantial computational speedup.
Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.
[301] SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals
Muhammad Haris Khan
Main category: cs.LG
TL;DR: SafeBench-Seq is a reproducible benchmark for protein hazard screening using public data with homology-controlled evaluation to prevent overestimation of model robustness.
Details
Motivation: Foundation models for protein design pose biosecurity risks, but current methods lack simple, reproducible baselines for sequence-level hazard screening that work on commodity CPUs and properly evaluate under homology control.Method: Built from public data (SafeProtein hazards and UniProt benigns) using interpretable features (physicochemical descriptors and amino-acid composition). Dataset homology-clustered at ≤40% identity with cluster-level holdouts to simulate “never-before-seen” threats. Evaluated with discrimination metrics, screening operating points, and calibration quality measures.
Result: Random splits substantially overestimate robustness compared to homology-clustered evaluation. Calibrated linear models show good calibration, while tree ensembles have slightly higher Brier/ECE scores. The benchmark is CPU-only and releases only metadata (no hazardous sequences).
Conclusion: SafeBench-Seq provides a reproducible, homology-controlled baseline for protein hazard screening that enables rigorous evaluation without distributing hazardous sequences, addressing critical biosecurity needs in protein design.
Abstract: Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate “never-before-seen” threats, we homology-cluster the combined dataset at <=40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition-preserving residue shuffles and length-/composition-only ablations. Empirically, random splits substantially overestimate robustness relative to homology-clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench-Seq is CPU-only, reproducible, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences.
[302] NetworkFF: Unified Layer Optimization in Forward-Only Neural Networks
Salar Beigzad
Main category: cs.LG
TL;DR: CFF extends Forward-Forward algorithm with inter-layer collaboration to overcome isolation issues, improving performance while maintaining memory efficiency and biological plausibility.
Details
Motivation: Conventional Forward-Forward implementations suffer from critical inter-layer isolation where layers optimize independently without leveraging collective learning dynamics, constraining representational coordination and limiting convergence efficiency in deeper architectures.Method: Introduces Collaborative Forward-Forward (CFF) learning with two paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters. Uses collaborative goodness function incorporating weighted contributions from all layers.
Result: Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations.
Conclusion: Inter-layer collaboration is established as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.
Abstract: The Forward-Forward algorithm eliminates backpropagation’s memory constraints and biological implausibility through dual forward passes with positive and negative data. However, conventional implementations suffer from critical inter-layer isolation, where layers optimize goodness functions independently without leveraging collective learning dynamics. This isolation constrains representational coordination and limits convergence efficiency in deeper architectures. This paper introduces Collaborative Forward-Forward (CFF) learning, extending the original algorithm through inter-layer cooperation mechanisms that preserve forward-only computation while enabling global context integration. Our framework implements two collaborative paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters that evolve during training. The collaborative goodness function incorporates weighted contributions from all layers, enabling coordinated feature learning while maintaining memory efficiency and biological plausibility. Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations. These findings establish inter-layer collaboration as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.
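The summary describes the collaborative goodness function only at a high level; one plausible reading is a weighted sum of per-layer goodness terms, sketched below. The weighting scheme and coupling values here are assumptions, not the paper's exact formulation.

```python
import torch

def layer_goodness(h):
    # Standard Forward-Forward goodness: mean squared activation per sample.
    return h.pow(2).mean(dim=1)

def collaborative_goodness(activations, weights):
    # Weighted sum of per-layer goodness values, so each layer's objective
    # sees contributions from all layers (the inter-layer coupling).
    # `weights` is fixed for F-CFF or a learnable parameter for A-CFF.
    return sum(w * layer_goodness(h) for w, h in zip(weights, activations))

# Example: three layers' activations for a batch of 8 samples.
acts = [torch.randn(8, 64), torch.randn(8, 32), torch.randn(8, 16)]
fixed_w = [0.5, 0.3, 0.2]                        # F-CFF: constant coupling
learn_w = torch.nn.Parameter(torch.ones(3) / 3)  # A-CFF: learned coupling
print(collaborative_goodness(acts, fixed_w).shape)  # torch.Size([8])
```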
[303] Bayesian Optimisation: Which Constraints Matter?
Xietao Wang Lin, Juan Ungredda, Max Butler, James Town, Alma Rahat, Hemant Singh, Juergen Branke
Main category: cs.LG
TL;DR: New Bayesian optimization variants using Knowledge Gradient for problems with decoupled black-box constraints, focusing on evaluating only relevant constraints to improve efficiency.
Details
Motivation: Bayesian optimization is effective for expensive global black-box optimization, but existing methods don't efficiently handle decoupled constraints where only a few constraints are binding at the optimum, leading to unnecessary evaluations.Method: Propose new Bayesian optimization variants based on Knowledge Gradient acquisition functions that specifically address decoupled black-box constraints, enabling selective evaluation of only relevant constraints rather than all constraints.
Result: Empirical benchmarking shows the proposed methods outperform state-of-the-art approaches, demonstrating superiority in handling constrained optimization problems with decoupled constraints.
Conclusion: The proposed Knowledge Gradient variants for decoupled constraints provide an effective approach for expensive black-box optimization, significantly improving efficiency by focusing evaluations on only the constraints that matter for finding the optimum.
Abstract: Bayesian optimisation has proven to be a powerful tool for expensive global black-box optimisation problems. In this paper, we propose new Bayesian optimisation variants of the popular Knowledge Gradient acquisition functions for problems with “decoupled” black-box constraints, in which subsets of the objective and constraint functions may be evaluated independently. In particular, our methods aim to take into account that often only a handful of the constraints may be binding at the optimum, and hence we should evaluate only relevant constraints when trying to optimise a function. We empirically benchmark these methods against existing methods and demonstrate their superiority over the state-of-the-art.
[304] GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
Yikang Yue, Yishu Yin, Xuehai Qian
Main category: cs.LG
TL;DR: GreedySnake is an SSD-offloaded training system that uses vertical scheduling of micro-batches to achieve higher throughput for large language model training compared to existing horizontal scheduling approaches.
Details
Motivation: SSD-offloaded training makes LLM training more cost-effective, but existing systems using horizontal scheduling have limitations. The paper aims to improve training throughput with smaller batch sizes by optimizing the scheduling approach.Method: GreedySnake employs vertical scheduling that executes all micro-batches of a layer before moving to the next layer, plus overlapping optimization steps with forward passes of next iterations to mitigate I/O bottlenecks.
Result: GreedySnake achieves 1.96x throughput improvement on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B compared to ZeRO-Infinity, approaching ideal roofline model performance.
Conclusion: GreedySnake demonstrates that vertical scheduling with SSD-offloaded training significantly improves training throughput, making large-scale LLM training more practical and cost-effective.
Abstract: SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake
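The horizontal-versus-vertical contrast is easy to see in pseudocode. The sketch below uses hypothetical load_layer/offload_layer/forward helpers standing in for the system's SSD I/O and compute; the real system additionally manages activations between layers and overlaps the optimizer step with the next iteration's forward pass.

```python
# Hypothetical helpers standing in for SSD I/O and layer compute.
def load_layer(layer): ...
def offload_layer(layer): ...
def forward(layer, micro_batch): ...

def horizontal_schedule(layers, micro_batches):
    # Existing systems: each micro-batch traverses all layers in turn,
    # so every layer's weights are fetched from SSD once per micro-batch.
    for mb in micro_batches:
        for layer in layers:
            load_layer(layer)
            forward(layer, mb)
            offload_layer(layer)

def vertical_schedule(layers, micro_batches):
    # GreedySnake: fetch a layer once, run *all* micro-batches through it,
    # then move on -- amortizing each SSD fetch across the whole batch.
    for layer in layers:
        load_layer(layer)
        for mb in micro_batches:
            forward(layer, mb)
        offload_layer(layer)
```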
[305] Machine Learning for Static and Single-Event Dynamic Complex Network Analysis
Nikolaos Nakis
Main category: cs.LG
TL;DR: This thesis develops novel Latent Space Model approaches for unified graph representation learning of static and dynamic networks, focusing on structural-aware embeddings that capture network characteristics without multi-stage processing.
Details
Motivation: To create comprehensive and powerful unified network embeddings that can characterize network structures and handle diverse graph analysis tasks, eliminating the need for heuristics and multi-stage post-processing steps.Method: Focuses on Latent Space Models, particularly the Latent Distance Model, to develop algorithmic approaches for Graph Representation Learning of static and single-event dynamic networks, creating structural-aware representations.
Result: The methods produce hierarchical expressions of network structure, community characterization, identification of extreme profiles in networks, and impact dynamics quantification in temporal networks through unified learning processes.
Conclusion: The thesis aims to advance towards unified network embeddings that are both comprehensive and powerful, capable of capturing important network characteristics while eliminating multi-stage processing requirements.
Abstract: The primary objective of this thesis is to develop novel algorithmic approaches for Graph Representation Learning of static and single-event dynamic networks. To this end, we focus on the family of Latent Space Models, and more specifically on the Latent Distance Model, which naturally conveys important network characteristics such as homophily, transitivity, and balance theory. Furthermore, this thesis aims to create structural-aware network representations, which lead to hierarchical expressions of network structure, community characterization, the identification of extreme profiles in networks, and impact dynamics quantification in temporal networks. Crucially, the methods presented are designed to define unified learning processes, eliminating the need for heuristics and multi-stage processes like post-processing steps. Ultimately, we aim to advance towards unified network embeddings that are both comprehensive and powerful, capable of characterizing network structures and adeptly handling the diverse tasks that graph analysis offers.
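For readers unfamiliar with the Latent Distance Model at the heart of the thesis, its link function is compact enough to state in a few lines. The logistic link and Euclidean distance below are the textbook choices; the thesis's exact parameterization may differ.

```python
import numpy as np

def edge_probability(z_i, z_j, beta=1.0):
    # Latent Distance Model: nodes that are close in latent space are more
    # likely to connect, which yields homophily and transitivity naturally.
    return 1.0 / (1.0 + np.exp(-(beta - np.linalg.norm(z_i - z_j))))

z = np.random.default_rng(0).normal(size=(4, 2))  # 4 nodes, 2-D latent space
print(edge_probability(z[0], z[1]))
```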
[306] Learning Safe Autonomous Driving Policies Using Predictive Safety Representations
Mahesh Keswani, Raunak Bhattacharyya
Main category: cs.LG
TL;DR: SRPL framework with predictive safety representations improves reward-safety tradeoff in real-world autonomous driving, showing significant success rate and cost improvements, better robustness to noise, and improved cross-dataset generalization.
Details
Motivation: SafeRL for autonomous driving faces tension between safety requirements and driving efficiency - overly conservative policies limit efficiency while aggressive exploration risks safety violations. Need to test if SRPL framework works in real-world scenarios beyond controlled environments.Method: SRPL (Safety Representations for Safer Policy Learning) framework equips agents with predictive model of future constraint violations. Systematic experiments on Waymo Open Motion Dataset (WOMD) and NuPlan datasets, evaluating reward-safety tradeoff, success rates, costs, robustness to observation noise, and zero-shot cross-dataset generalization.
Result: SRPL improves reward-safety tradeoff with statistically significant improvements: success rate (effect sizes r = 0.65-0.86), cost reduction (effect sizes r = 0.70-0.83), p < 0.05. Effectiveness depends on policy optimizer and dataset distribution. Predictive safety representations improve robustness to observation noise. SRPL-augmented agents show improved generalization in zero-shot cross-dataset evaluation.
Conclusion: Predictive safety representations in SRPL framework demonstrate potential to strengthen SafeRL for autonomous driving in real-world scenarios, though effectiveness depends on underlying policy optimizer and dataset characteristics.
Abstract: Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.
[307] Sharing Knowledge without Sharing Data: Stitches can improve ensembles of disjointly trained models
Arthur Guijt, Dirk Thierens, Ellen Kerkhof, Jan Wiersma, Tanja Alderliesten, Peter A. N. Bosman
Main category: cs.LG
TL;DR: Asynchronous model collaboration via stitching layers enables competitive performance while maintaining data privacy, addressing federated learning limitations in fragmented data settings.
Details
Motivation: In domains like healthcare where data is fragmented across parties and cannot be shared, federated learning requires synchronous training which is impractical. The paper investigates asynchronous collaboration where only trained models are shared (e.g., via publications) to address privacy and practical constraints.Method: Proposes using stitching layers to combine intermediate representations of individually trained models. This approach allows asynchronous collaboration where parties train models independently on their own data, then share only the trained models rather than raw data or synchronized training updates.
Result: Individually trained models perform well on their own data but poorly on others’ data. Ensembles improve generalization but hurt performance on each party’s own data. Stitching layers recover competitive performance on each party’s data while maintaining improved generalization, enabling effective asynchronous collaboration.
Conclusion: Asynchronous collaboration through model stitching provides a practical alternative to federated learning, achieving competitive results while respecting data privacy and avoiding synchronous training requirements, making it suitable for domains with fragmented data like healthcare.
Abstract: Deep learning has been shown to be very capable at performing many real-world tasks. However, this performance often depends on the presence of large and varied datasets. In some settings, such as the medical domain, data is often fragmented across parties and cannot be readily shared. While federated learning addresses this situation, it requires synchronicity among parties training a single model together, exchanging information about model weights. We investigate how asynchronous collaboration, where only already-trained models are shared (e.g. as part of a publication), affects performance, and propose stitching as a method for combining models. Taking a multi-objective perspective, where performance on each party’s data is viewed independently, we find that a model trained solely on one party’s data performs on that party’s data about as well as a model trained on the merged data, while its performance on other parties’ data is notably worse. Moreover, while an ensemble of such individually trained networks generalizes better, performance on each party’s own dataset suffers. We find that combining intermediate representations of individually trained models with a well-placed pair of stitching layers allows this performance to recover to a competitive degree while maintaining improved generalization, showing that asynchronous collaboration can yield competitive results.
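A stitching layer is just a small learned map between the hidden spaces of two frozen networks. The sketch below shows a single linear stitch (the paper places a pair of them); all module names and dimensions are illustrative stand-ins.

```python
import torch
import torch.nn as nn

# Stand-ins for the front of party A's model and the back of party B's model.
front_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
back_b = nn.Sequential(nn.Linear(48, 10))
for p in list(front_a.parameters()) + list(back_b.parameters()):
    p.requires_grad_(False)   # both parties' trained models stay frozen

# The stitching layer: a learned linear map between the two hidden spaces.
stitch = nn.Linear(64, 48)

x = torch.randn(5, 32)
logits = back_b(stitch(front_a(x)))   # only `stitch` receives gradients
print(logits.shape)                   # torch.Size([5, 10])
```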
[308] A Unified Representation of Neural Networks Architectures
Christophe Prieur, Mircea Lazar, Bogdan Robu
Main category: cs.LG
TL;DR: The paper introduces a unified framework called Distributed Parameter neural Network (DiPaNet) that connects finite and infinite-dimensional neural network architectures through homogenization/discretization techniques.
Details
Motivation: To establish a unified mathematical framework that connects various neural network architectures (finite and infinite-dimensional) by considering the limiting case where both the number of neurons per layer and number of hidden layers tend to infinity, forming a continuum.Method: 1) Derive integral infinite width neural representations for single-hidden-layer networks, 2) Extend to deep residual CNNs with finite integral hidden layers, 3) Formalize approximation errors between neural ODEs and deep residual NNs via discretization, 4) Merge approaches into unified DiPaNet representation as a homogeneous framework connecting various architectures.
Result: Developed a deterministic DiPaNet framework that applies to general, uniformly continuous matrix weight functions, showing that most existing finite and infinite-dimensional NN architectures are related through homogenization/discretization with the DiPaNet representation.
Conclusion: The DiPaNet provides a unified mathematical framework connecting various neural network architectures, with potential for further generalizations and applications, while distinguishing from neural fields approaches.
Abstract: In this paper we consider the limiting case of neural networks (NNs) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. Firstly, we consider the case of neural networks with a single hidden layer and we derive an integral infinite width neural representation that generalizes existing continuous neural networks (CNNs) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Secondly, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite and infinite-dimensional NNs architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Differences and similarities with neural fields are discussed along with further possible generalizations and applications of the DiPaNet framework.
[309] A Systems-Theoretic View on the Convergence of Algorithms under Disturbances
Guner Dilsad Er, Sebastian Trimpe, Michael Muehlebach
Main category: cs.LG
TL;DR: The paper extends convergence guarantees for algorithms operating in isolation to handle disturbances, noise, and system interconnections, providing stability bounds and convergence rates using converse Lyapunov theorems.
Details
Motivation: Algorithms increasingly operate in complex physical, social, and engineering systems where they face disturbances, noise, and interconnections with other dynamical systems, requiring analysis of their performance under such non-ideal conditions.Method: The authors leverage converse Lyapunov theorems to derive key inequalities that quantify the impact of disturbances, extending known convergence guarantees from isolated algorithm operation to scenarios with disturbances and interconnections.
Result: The paper provides systematic derivation of stability bounds and convergence rates for algorithms operating in the presence of disturbances, demonstrating applicability to distributed learning with communication constraints, machine learning generalization sensitivity, and privacy-preserving noise injection.
Conclusion: The developed framework serves as a unifying tool for algorithm analysis in noisy, disturbed environments with system interconnections, bridging the gap between isolated algorithm analysis and real-world operational conditions.
Abstract: Algorithms increasingly operate within complex physical, social, and engineering systems where they are exposed to disturbances, noise, and interconnections with other dynamical systems. This article extends known convergence guarantees of an algorithm operating in isolation (i.e., without disturbances) and systematically derives stability bounds and convergence rates in the presence of such disturbances. By leveraging converse Lyapunov theorems, we derive key inequalities that quantify the impact of disturbances. We further demonstrate how our result can be utilized to assess the effects of disturbances on algorithmic performance in a wide variety of applications, including communication constraints in distributed learning, sensitivity in machine learning generalization, and intentional noise injection for privacy. This underpins the role of our result as a unifying tool for algorithm analysis in the presence of noise, disturbances, and interconnections with other dynamical systems.
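The summary does not reproduce the paper's inequalities, but results of this type typically take an input-to-state-stability form; a representative statement (not the paper's exact constants) is:

```latex
% Suppose V is a converse Lyapunov function for the undisturbed iteration:
\alpha_1(\|x\|) \le V(x) \le \alpha_2(\|x\|), \qquad
V(x_{k+1}) - V(x_k) \le -\alpha_3(\|x_k\|) + \gamma(\|d_k\|).
% Then the disturbed iterates satisfy an ISS-style bound
\|x_k\| \le \beta(\|x_0\|, k) + \rho\Bigl(\sup_{j<k}\|d_j\|\Bigr),
% with \beta of class KL and \rho of class K: convergence degrades
% gracefully with the disturbance magnitude rather than failing outright.
```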
[310] More Consistent Accuracy PINN via Alternating Easy-Hard Training
Zhaoqian Gao, Min Yanga
Main category: cs.LG
TL;DR: Hybrid training strategy combining hard and easy prioritization improves PINN performance across diverse PDE types with steep gradients, nonlinearity, and high dimensionality.
Details
Motivation: Current PINN training strategies (hard and easy prioritization) show trade-offs and inconsistent performance across different PDE types, limiting their reliability and effectiveness.Method: Developed a hybrid strategy combining hard and easy prioritization through an alternating training algorithm to leverage strengths of both approaches.
Result: Achieves consistently high accuracy with relative L2 errors mostly in the O(10^-5) to O(10^-6) range, significantly surpassing baseline methods, especially on challenging PDEs with steep gradients, nonlinearity, and high dimensionality.
Conclusion: Hybrid training strategies enhance PINN performance and robustness, providing more reliable solutions across diverse PDE problems compared to single-strategy approaches.
Abstract: Physics-informed neural networks (PINNs) have recently emerged as a prominent paradigm for solving partial differential equations (PDEs), yet their training strategies remain underexplored. While hard prioritization methods inspired by finite element methods are widely adopted, recent research suggests that easy prioritization can also be effective. Nevertheless, we find that both approaches exhibit notable trade-offs and inconsistent performance across PDE types. To address this issue, we develop a hybrid strategy that combines the strengths of hard and easy prioritization through an alternating training algorithm. On PDEs with steep gradients, nonlinearity, and high dimensionality, the proposed method achieves consistently high accuracy, with relative L2 errors mostly in the range of O(10^-5) to O(10^-6), significantly surpassing baseline methods. Moreover, it offers greater reliability across diverse problems, whereas compared approaches often suffer from variable accuracy depending on the PDE. This work provides new insights into designing hybrid training strategies to enhance the performance and robustness of PINNs.
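The alternating idea can be sketched concisely. The residual function, network, and the per-epoch top-k/bottom-k selection rule below are illustrative stand-ins; the paper's actual alternation schedule and prioritization criteria may differ.

```python
import torch

def pde_residual(model, x):
    # Stand-in residual; a real PINN would differentiate `model` w.r.t. x
    # to evaluate the PDE operator at each collocation point.
    return model(x).squeeze(-1) - torch.sin(x.sum(dim=1))

model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(1024, 2)   # collocation points

for epoch in range(200):
    res2 = pde_residual(model, x).pow(2)
    k = len(res2) // 2
    if epoch % 2 == 0:    # "hard" phase: highest-residual collocation points
        idx = res2.topk(k).indices
    else:                 # "easy" phase: lowest-residual collocation points
        idx = (-res2).topk(k).indices
    loss = res2[idx].mean()
    opt.zero_grad(); loss.backward(); opt.step()
```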
[311] SCOPE: Sequential Causal Optimization of Process Interventions
Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt
Main category: cs.LG
TL;DR: SCOPE is a Prescriptive Process Monitoring approach that learns aligned sequential intervention recommendations using backward induction and causal learners, outperforming existing methods that treat interventions independently or require process approximations.
Details
Motivation: Existing PresPM approaches fail to handle realistic scenarios where organizations need to align sequences of interventions to jointly steer case outcomes. Current methods either focus on single interventions, treat multiple interventions independently ignoring temporal interactions, or rely on simulation/data augmentation that creates reality gaps and bias.Method: SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. It leverages causal learners to utilize observational data directly without requiring process approximations for reinforcement learning.
Result: Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing key performance indicators (KPIs).
Conclusion: SCOPE effectively addresses the sequential intervention alignment problem in PresPM by using backward induction and causal learning, providing a practical solution that works directly with observational data. The authors also contribute a reusable benchmark for future sequential PresPM research.
Abstract: Prescriptive Process Monitoring (PresPM) recommends interventions during business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches fall short in this respect. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which can create a reality gap and introduce bias. We introduce SCOPE, a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for reinforcement learning. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.
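Backward induction over decision points is the core mechanism. The sketch below substitutes a plain gradient-boosted regressor where SCOPE would use causal learners, and runs on synthetic logged data; it is meant only to show how the optimized value at step t becomes the regression target at step t-1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
T, n, d = 3, 500, 4                        # decision points, cases, state dims
states = rng.normal(size=(T, n, d))        # case state at each decision point
actions = rng.integers(0, 2, size=(T, n))  # logged binary interventions
kpi = rng.normal(size=n)                   # final outcome (KPI) of each case

# Backward induction: at each decision point, fit an outcome model and
# replace the logged action's contribution with the best achievable value,
# propagating the effect of later decisions back to earlier ones.
target = kpi.copy()
models = [None] * T
for t in reversed(range(T)):
    m = GradientBoostingRegressor().fit(
        np.column_stack([states[t], actions[t]]), target)
    # Value of each candidate action given the (already optimized) future.
    q = np.stack([m.predict(np.column_stack([states[t], np.full(n, a)]))
                  for a in (0, 1)])
    models[t] = m
    target = q.max(axis=0)   # becomes the regression target for step t-1
```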
[312] Trust-Region Adaptive Policy Optimization
Mingyu Su, Jian Guan, Yuxian Gu, Minlie Huang, Hongning Wang
Main category: cs.LG
TL;DR: TRAPO is a hybrid training framework that interleaves SFT and RL within each instance, using trust-region SFT and adaptive prefix selection to overcome limitations of the standard two-stage pipeline.
Details
Motivation: The standard two-stage pipeline (SFT then RL) suffers from inconsistency: SFT's rigid imitation suppresses exploration and causes forgetting, limiting RL's potential improvements for LLM reasoning abilities.Method: TRAPO interleaves SFT and RL within each training instance, optimizing SFT loss on expert prefixes and RL loss on model completions. It introduces Trust-Region SFT (TrSFT) that minimizes forward KL divergence inside a trust region but attenuates optimization outside, shifting toward reverse KL for stable updates. An adaptive prefix-selection mechanism allocates expert guidance based on measured utility.
Result: Experiments on five mathematical reasoning benchmarks show TRAPO consistently surpasses standard SFT, RL, SFT-then-RL pipelines, and recent state-of-the-art approaches.
Conclusion: TRAPO establishes a strong new paradigm for reasoning-enhanced LLMs by unifying external supervision and self-exploration in a stable, efficient training framework.
Abstract: Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models’ (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL’s potential for improvements. We address this inconsistency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model’s own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
[313] Estimating Spatially Resolved Radiation Fields Using Neural Networks
Felix Lehner, Pasquale Lombardo, Susana Castillo, Oliver Hupe, Marcus Magnor
Main category: cs.LG
TL;DR: Neural networks trained on synthetic Monte-Carlo data can effectively estimate scattered radiation field distributions for medical radiation protection dosimetry.
Details
Motivation: To develop accurate methods for estimating spatial distribution of scattered radiation fields in medical settings like Interventional Radiology and Cardiology, which is crucial for radiation protection dosimetry.Method: Created three synthetically generated datasets with increasing complexity using Geant4-based Monte-Carlo Simulation. Evaluated convolutional and fully connected neural network architectures to reconstruct fluence and spectra distributions over spatial domains.
Result: Demonstrated which neural network design decisions work well for reconstructing radiation field distributions. All datasets and training pipeline are published as open source.
Conclusion: Neural networks can be effectively trained on synthetic Monte-Carlo data to estimate scattered radiation field distributions, providing a valuable tool for radiation protection dosimetry in medical applications.
Abstract: We present an in-depth analysis on how to build and train neural networks to estimate the spatial distribution of scattered radiation fields for radiation protection dosimetry in medical radiation fields, such as those found in Interventional Radiology and Cardiology. Therefore, we present three different synthetically generated datasets with increasing complexity for training, using a Monte-Carlo Simulation application based on Geant4. On those datasets, we evaluate convolutional and fully connected architectures of neural networks to demonstrate which design decisions work well for reconstructing the fluence and spectra distributions over the spatial domain of such radiation fields. All used datasets as well as our training pipeline are published as open source in separate repositories.
[314] Polyharmonic Cascade
Yuriy N. Bakhvalov
Main category: cs.LG
TL;DR: A deep learning architecture called “polyharmonic cascade” uses sequences of polyharmonic splines derived from random function theory, with a training method that solves global linear systems instead of gradient descent.
Details
Motivation: To create a deep learning architecture that can approximate complex nonlinear functions while preserving global smoothness and maintaining a probabilistic interpretation, addressing limitations of traditional gradient-based methods.Method: The polyharmonic cascade architecture uses sequences of polyharmonic spline packages derived from random function theory. Training involves solving a single global linear system per batch for function values at fixed node constellations, rather than optimizing coefficients via gradient descent.
Result: The method enables synchronized layer updates, preserves probabilistic interpretation, scales efficiently with 2D matrix operations on GPUs, and demonstrates fast learning without overfitting on MNIST dataset.
Conclusion: The polyharmonic cascade offers a theoretically grounded alternative to gradient-based deep learning that maintains smoothness, probabilistic interpretability, and computational efficiency while avoiding overfitting.
Abstract: This paper presents a deep machine learning architecture, the “polyharmonic cascade” – a sequence of packages of polyharmonic splines, where each layer is rigorously derived from the theory of random functions and the principles of indifference. This makes it possible to approximate nonlinear functions of arbitrary complexity while preserving global smoothness and a probabilistic interpretation. For the polyharmonic cascade, a training method alternative to gradient descent is proposed: instead of directly optimizing the coefficients, one solves a single global linear system on each batch with respect to the function values at fixed “constellations” of nodes. This yields synchronized updates of all layers, preserves the probabilistic interpretation of individual layers and theoretical consistency with the original model, and scales well: all computations reduce to 2D matrix operations efficiently executed on a GPU. Fast learning without overfitting on MNIST is demonstrated.
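The "linear system instead of gradient descent" idea is familiar from classical spline fitting. The sketch below fits a single 2-D polyharmonic (thin-plate) spline with one solve; the cascade stacks packages of such splines and solves one global system per batch, which this simplified single-layer example does not capture (it also omits the polynomial tail usually added to thin-plate splines).

```python
import numpy as np

def polyharmonic_fit(X, y, eps=1e-8):
    # Thin-plate kernel phi(r) = r^2 log r (a 2-D polyharmonic spline).
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = np.where(r > 0, r**2 * np.log(r + eps), 0.0)
    # One global linear solve for the node coefficients -- no gradient descent.
    return np.linalg.solve(K + eps * np.eye(len(X)), y)

def polyharmonic_eval(X_nodes, w, X_new, eps=1e-8):
    r = np.linalg.norm(X_new[:, None, :] - X_nodes[None, :, :], axis=-1)
    K = np.where(r > 0, r**2 * np.log(r + eps), 0.0)
    return K @ w

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1]
w = polyharmonic_fit(X, y)
print(polyharmonic_eval(X, w, X[:3]))  # reproduces y[:3] up to regularization
```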
[315] You Only Train Once: Differentiable Subset Selection for Omics Data
Daphné Chopard, Jorge da Silva Gonçalves, Irene Cannistraci, Thomas M. Sutter, Julia E. Vogt
Main category: cs.LG
TL;DR: YOTO is an end-to-end framework that jointly selects discrete gene subsets and performs prediction in a single differentiable architecture, outperforming existing methods on single-cell RNA-seq data.
Details
Motivation: Existing feature selection approaches for single-cell transcriptomic data operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. There's a need for a more integrated approach that directly links gene selection with predictive performance.Method: YOTO uses an end-to-end differentiable architecture that jointly identifies discrete gene subsets and performs prediction. It employs a closed feedback loop where prediction guides gene selection and selected genes shape predictive representation. The model enforces sparsity so only selected genes contribute to inference, and uses multi-task learning to share representations across related objectives.
Result: YOTO consistently outperforms state-of-the-art baselines on two representative single-cell RNA-seq datasets, demonstrating improved predictive performance while yielding compact and meaningful gene subsets.
Conclusion: Sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis by eliminating the need for separate training of downstream classifiers.
Abstract: Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpretability, and cost-effective profiling. However, most existing feature selection approaches either operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. In this work, we present YOTO (you only train once), an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. In our model, the prediction task directly guides which genes are selected, while the learned subsets, in turn, shape the predictive representation. This closed feedback loop enables the model to iteratively refine both what it selects and how it predicts during training. Unlike existing approaches, YOTO enforces sparsity so that only the selected genes contribute to inference, eliminating the need to train additional downstream classifiers. Through a multi-task learning design, the model learns shared representations across related objectives, allowing partially labeled datasets to inform one another, and discovering gene subsets that generalize across tasks without additional training steps. We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines. These results demonstrate that sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis.
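End-to-end discrete selection requires a differentiable relaxation. One plausible construction, a straight-through top-k gate (not necessarily the paper's exact scheme), looks like:

```python
import torch
import torch.nn as nn

class GeneGate(nn.Module):
    """Differentiable hard selection over genes via a straight-through
    top-k gate: hard 0/1 mask in the forward pass, soft gradients in the
    backward pass, so only selected genes feed the predictor while the
    selection itself stays trainable end-to-end."""
    def __init__(self, n_genes, k):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_genes))
        self.k = k

    def forward(self, x):
        soft = torch.sigmoid(self.logits)
        hard = torch.zeros_like(soft)
        hard[soft.topk(self.k).indices] = 1.0
        mask = hard + soft - soft.detach()   # straight-through estimator
        return x * mask

gate = GeneGate(n_genes=2000, k=64)
head = nn.Linear(2000, 10)                  # simplified shared prediction head
logits = head(gate(torch.randn(8, 2000)))   # only 64 genes pass through
```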
[316] Convergence Guarantees for Federated SARSA with Local Training and Heterogeneous Agents
Paul Mangold, Eloïse Berthier, Eric Moulines
Main category: cs.LG
TL;DR: First theoretical analysis of Federated SARSA with linear function approximation, establishing convergence guarantees and complexity bounds under heterogeneous environments.
Details
Motivation: Need to analyze federated reinforcement learning with SARSA algorithm in heterogeneous settings where agents have different local transitions and rewards, which is common in real-world distributed RL applications.Method: Theoretical analysis of FedSARSA with linear function approximation, developing a novel multi-step error expansion for single-agent SARSA, and analyzing convergence with multiple local updates in heterogeneous federated settings.
Result: Established first sample and communication complexity bounds for FedSARSA, proved convergence under heterogeneity, demonstrated linear speed-up with number of agents (up to Markovian sampling terms), and validated with numerical experiments.
Conclusion: FedSARSA is theoretically sound for heterogeneous federated RL, achieving efficient convergence and scalability, with the multi-step error expansion providing fundamental analytical tools for SARSA analysis.
Abstract: We present a novel theoretical analysis of Federated SARSA (FedSARSA) with linear function approximation and local training. We establish convergence guarantees for FedSARSA in the presence of heterogeneity, both in local transitions and rewards, providing the first sample and communication complexity bounds in this setting. At the core of our analysis is a new, exact multi-step error expansion for single-agent SARSA, which is of independent interest. Our analysis precisely quantifies the impact of heterogeneity, demonstrating the convergence of FedSARSA with multiple local updates. Crucially, we show that FedSARSA achieves linear speed-up with respect to the number of agents, up to higher-order terms due to Markovian sampling. Numerical experiments support our theoretical findings.
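The algorithm under analysis is straightforward to state: each agent runs several local SARSA updates on its own (heterogeneous) environment, then the server averages the weight vectors. The toy random MDPs and feature map below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 10
feats = rng.normal(size=(n_states, n_actions, d))
phi = lambda s, a: feats[s, a]             # linear function approximation

class LocalEnv:
    """Tiny random MDP; each agent gets its own transitions and rewards
    (the heterogeneity the paper analyzes)."""
    def __init__(self, seed):
        r = np.random.default_rng(seed)
        self.P = r.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.R = r.normal(size=(n_states, n_actions))
        self.s = 0
    def step(self, a):
        r = self.R[self.s, a]
        self.s = rng.choice(n_states, p=self.P[self.s, a])
        return self.s, r

def fed_sarsa(envs, n_rounds=50, local_steps=10, alpha=0.05, gamma=0.9, eps=0.1):
    w = np.zeros(d)
    for _ in range(n_rounds):
        local_ws = []
        for env in envs:
            w_loc, s, a = w.copy(), env.s, rng.integers(n_actions)
            for _ in range(local_steps):
                s2, r = env.step(a)
                # epsilon-greedy next action under the local weights
                a2 = (rng.integers(n_actions) if rng.random() < eps else
                      int(np.argmax([w_loc @ phi(s2, b) for b in range(n_actions)])))
                td = r + gamma * w_loc @ phi(s2, a2) - w_loc @ phi(s, a)
                w_loc += alpha * td * phi(s, a)   # local SARSA update
                s, a = s2, a2
            local_ws.append(w_loc)
        w = np.mean(local_ws, axis=0)             # communication round: average
    return w

w = fed_sarsa([LocalEnv(seed) for seed in range(4)])  # 4 heterogeneous agents
```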
[317] Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting
Yuri Calleo
Main category: cs.LG
TL;DR: A spatially-informed transformer that injects geostatistical inductive bias into self-attention via learnable covariance kernels, bridging Gaussian processes and deep learning for spatio-temporal modeling.
Details
Motivation: There's a fundamental gap between classical geostatistics (probabilistically rigorous but computationally prohibitive) and deep learning (flexible but lacking geometric inductive bias). Transformers treat spatial sensors as permutation-invariant tokens without understanding distance, while Gaussian processes scale poorly for massive networks.Method: Proposes a spatially-informed transformer that injects geostatistical inductive bias directly into self-attention via learnable covariance kernels. Decomposes attention structure into stationary physical prior (covariance kernel) and non-stationary data-driven residual, imposing soft topological constraints favoring spatially proximal interactions while retaining capacity for complex dynamics.
Result: Demonstrates “Deep Variography,” where the network recovers the true spatial decay parameters end-to-end via backpropagation. Outperforms state-of-the-art graph neural networks on synthetic Gaussian random fields and real-world traffic benchmarks. Provides well-calibrated probabilistic forecasts with superior predictive accuracy.
Conclusion: Successfully bridges gap between physics-aware modeling and data-driven learning by combining theoretical consistency of geostatistics with flexible representations of deep learning, delivering both accuracy and proper uncertainty quantification.
Abstract: The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of “Deep Variography”, where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
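Injecting a covariance prior into attention amounts to adding a distance-dependent bias to the logits. The sketch below uses an exponential kernel with a learnable range parameter; the paper's kernel family and exact prior/residual decomposition may differ.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Self-attention whose logits add a learnable geostatistical prior:
    an exponential kernel of pairwise sensor distance (illustrative sketch)."""
    def __init__(self, dim, dists):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.register_buffer("dists", dists)            # (N, N) distances
        self.log_range = nn.Parameter(torch.zeros(()))  # learnable decay length

    def forward(self, x):  # x: (batch, N sensors, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        # Stationary prior: nearby sensors receive a larger additive bias;
        # the data-driven QK^T term acts as the non-stationary residual.
        prior = -self.dists / self.log_range.exp()
        return torch.softmax(logits + prior, dim=-1) @ v

coords = torch.rand(16, 2)                      # toy sensor locations
attn = SpatialAttention(32, torch.cdist(coords, coords))
out = attn(torch.randn(4, 16, 32))              # (4, 16, 32)
```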
[318] Mitigating Forgetting in Low Rank Adaptation
Joanna Sliwa, Frank Schneider, Philipp Hennig, Jose Miguel Hernandez-Lobato
Main category: cs.LG
TL;DR: LaLoRA: A lightweight Laplace approximation regularization method for LoRA that prevents catastrophic forgetting during parameter-efficient fine-tuning by constraining updates in high-curvature directions.
Details
Motivation: Parameter-efficient fine-tuning methods like LoRA enable fast specialization of large pre-trained models, but they often cause catastrophic forgetting of the model's prior domain knowledge. The authors aim to address this forgetting problem while maintaining efficiency.Method: LaLoRA applies a Laplace approximation to Low-Rank Adaptation (LoRA) weights only, estimating model confidence in each parameter and constraining updates in high-curvature directions. This weight-space regularization technique preserves prior knowledge while enabling efficient target-domain learning.
Result: The method demonstrates improved learning-forgetting trade-off when fine-tuning Llama models for mathematical reasoning, with the trade-off directly controllable via regularization strength. The paper also explores different curvature approximations, data effects for Laplace approximation, and hyperparameter robustness.
Conclusion: LaLoRA provides an effective lightweight solution to catastrophic forgetting in parameter-efficient fine-tuning by combining Laplace approximation with LoRA, offering controllable regularization that preserves prior knowledge while enabling efficient adaptation to new domains.
Abstract: Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), enable fast specialization of large pre-trained models to different downstream applications. However, this process often leads to catastrophic forgetting of the model’s prior domain knowledge. We address this issue with LaLoRA, a weight-space regularization technique that applies a Laplace approximation to Low-Rank Adaptation. Our approach estimates the model’s confidence in each parameter and constrains updates in high-curvature directions, preserving prior knowledge while enabling efficient target-domain learning. By applying the Laplace approximation only to the LoRA weights, the method remains lightweight. We evaluate LaLoRA by fine-tuning a Llama model for mathematical reasoning and demonstrate an improved learning-forgetting trade-off, which can be directly controlled via the method’s regularization strength. We further explore different loss landscape curvature approximations for estimating parameter confidence, analyze the effect of the data used for the Laplace approximation, and study robustness across hyperparameters.
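The regularizer itself is compact: a quadratic penalty on LoRA-weight movement, scaled by a diagonal curvature estimate. A minimal sketch follows (the paper also studies richer curvature approximations than a diagonal one):

```python
import torch

def lalora_penalty(lora_params, lora_init, curvature, strength):
    """Quadratic penalty on LoRA-weight movement, weighted by a diagonal
    curvature (e.g. Fisher) estimate: updates along high-curvature
    directions, where the model is confident, are constrained most."""
    return strength / 2 * sum(
        (c * (p - p0).pow(2)).sum()
        for p, p0, c in zip(lora_params, lora_init, curvature))

# Toy usage: the penalty is zero before any update and grows as weights move.
p = [torch.zeros(4, 4, requires_grad=True)]
p0 = [torch.zeros(4, 4)]
c = [torch.ones(4, 4)]
print(lalora_penalty(p, p0, c, strength=0.1))   # tensor(0., grad_fn=...)
# In training: loss = task_loss + lalora_penalty(...); loss.backward()
```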
[319] Easy Adaptation: An Efficient Task-Specific Knowledge Injection Method for Large Models in Resource-Constrained Environments
Dong Chen, Zhengqing Hu, Shixing Zhao, Yibo Guo
Main category: cs.LG
TL;DR: EA proposes Specific Small Models to complement LLMs’ underfitted distributions, matching PEFT performance without accessing LLM parameters and using minimal resources.
Details
Motivation: Existing PEFT methods face high resource costs and parameter dependency issues, especially with closed-source LLMs accessible only via expensive APIs. Small models can outperform LLMs on specific distributions with minimal resources.Method: Designs Specific Small Models (SSMs) to complement the underfitted data distribution for Large Models, enabling adaptation without accessing LLM parameters.
Result: Extensive experiments show EA matches PEFT performance on diverse tasks without accessing LM parameters and requires only minimal resources.
Conclusion: EA provides an effective alternative to PEFT that overcomes resource constraints and parameter dependency issues, making LLM adaptation practical in resource-constrained environments.
Abstract: While the enormous parameter scale endows Large Models (LMs) with unparalleled performance, it also limits their adaptability across specific tasks. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical approach for effectively adapting LMs to a diverse range of downstream tasks. However, existing PEFT methods face two primary challenges: (1) High resource cost. Although PEFT methods significantly reduce resource demands compared to full fine-tuning, they still require substantial time and memory, making them impractical in resource-constrained environments. (2) Parameter dependency. PEFT methods heavily rely on updating a subset of parameters associated with LMs to incorporate task-specific knowledge. Yet, due to increasing competition in the LMs landscape, many companies have adopted closed-source policies for their leading models, offering access only via Application Programming Interfaces (APIs). Moreover, the expense is often prohibitive and difficult to sustain, as the fine-tuning process of LMs is extremely slow. Even though small models perform far worse than LMs in general, they can achieve superior results on particular distributions while requiring only minimal resources. Motivated by this insight, we propose Easy Adaptation (EA), which designs Specific Small Models (SSMs) to complement the underfitted data distribution for LMs. Extensive experiments show that EA matches the performance of PEFT on diverse tasks without accessing LM parameters, and requires only minimal resources.
[320] Can You Hear Me Now? A Benchmark for Long-Range Graph Propagation
Luca Miglior, Matteo Tolloso, Alessio Gravina, Davide Bacciu
Main category: cs.LG
TL;DR: ECHO is a new benchmark for evaluating GNNs’ ability to handle long-range graph propagation, featuring synthetic tasks (shortest paths, node eccentricity, graph diameter) and real-world chemical datasets (partial charges, total energies), revealing performance gaps in current architectures.
Details
Motivation: Effectively capturing long-range interactions is a fundamental unresolved challenge in GNN research, critical for diverse scientific applications. Current GNNs struggle with true long-range propagation, creating a need for systematic evaluation.Method: Introduces ECHO benchmark with three synthetic graph tasks (single-source shortest paths, node eccentricity, graph diameter) over diverse challenging topologies designed to create information bottlenecks, plus two real-world chemical datasets (ECHO-Charge for atomic partial charges and ECHO-Energy for molecular total energies) with DFT-level reference computations.
Result: Extensive benchmarking of popular GNN architectures reveals clear performance gaps, demonstrating the difficulty of true long-range propagation and highlighting design choices that can overcome inherent limitations.
Conclusion: ECHO sets a new standard for evaluating long-range information propagation in GNNs and provides a compelling example of the need for such capabilities in AI for science applications.
Abstract: Effectively capturing long-range interactions remains a fundamental yet unresolved challenge in graph neural network (GNN) research, critical for applications across diverse fields of science. To systematically address this, we introduce ECHO (Evaluating Communication over long HOps), a novel benchmark specifically designed to rigorously assess the capabilities of GNNs in handling very long-range graph propagation. ECHO includes three synthetic graph tasks, namely single-source shortest paths, node eccentricity, and graph diameter, each constructed over diverse and structurally challenging topologies intentionally designed to introduce significant information bottlenecks. ECHO also includes two real-world datasets, ECHO-Charge and ECHO-Energy, which define chemically grounded benchmarks for predicting atomic partial charges and molecular total energies, respectively, with reference computations obtained at the density functional theory (DFT) level. Both tasks inherently depend on capturing complex long-range molecular interactions. Our extensive benchmarking of popular GNN architectures reveals clear performance gaps, emphasizing the difficulty of true long-range propagation and highlighting design choices capable of overcoming inherent limitations. ECHO thereby sets a new standard for evaluating long-range information propagation, also providing a compelling example for its need in AI for science.
[321] Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications
Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra
Main category: cs.LG
TL;DR: A low-rank decomposition method for compressing LLMs by removing redundant components for specific applications while maintaining accuracy comparable to state-of-the-art compression techniques.
Details
Motivation: LLMs are computationally intensive and energy-demanding, making deployment challenging on resource-limited devices (PCs, mobile/wearable) and expensive on cloud servers. Pretrained LLMs contain many redundant components not needed for specific applications.Method: Low-rank decomposition approach representing LLM weight matrices as linear combinations of base components. The method identifies and prunes irrelevant bases while enhancing the model with new bases beneficial for specific target applications.
Result: Deep compression results on Llama 2-7b and -13B models for mathematical reasoning and code generation applications show significant model size reduction while maintaining comparable accuracy to state-of-the-art low-rank compression techniques.
Conclusion: The proposed low-rank decomposition method effectively compresses LLMs for specific applications by removing redundant components, enabling more efficient deployment on resource-constrained devices and reducing inference costs in cloud environments.
Abstract: Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach to effectively compress these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the necessary elements for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression results on the Llama 2-7b and -13B models, conducted on target applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining comparable accuracy to state-of-the-art low-rank compression techniques.
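A simplified reading of the basis-pruning step: decompose each weight matrix by SVD, score each basis by its relevance to the target application's activations, and keep only the top-scoring ones. The scoring rule below is illustrative, and the paper additionally learns new bases for the target application, which this sketch omits.

```python
import numpy as np

def prune_bases(W, X_task, keep):
    """Rank the SVD bases of a weight matrix by how much energy the target
    application's activations put on them, and keep only the top `keep`."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Importance of each basis = singular value times its mean activation
    # magnitude on data X_task drawn from the target application.
    scores = s * np.abs(X_task @ Vt.T).mean(axis=0)
    top = np.argsort(scores)[-keep:]
    return (U[:, top] * s[top]) @ Vt[top]   # low-rank replacement for W

W = np.random.default_rng(0).normal(size=(256, 128))
X_task = np.random.default_rng(1).normal(size=(64, 128))
W_small = prune_bases(W, X_task, keep=32)   # store two thin factors in practice
```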
[322] Calibratable Disambiguation Loss for Multi-Instance Partial-Label Learning
Wei Tang, Yin-Fang Yang, Weijia Zhang, Min-Ling Zhang
Main category: cs.LG
TL;DR: Proposes a plug-and-play calibratable disambiguation loss (CDL) for multi-instance partial-label learning to improve both classification accuracy and calibration performance.
Details
Motivation: Existing MIPL approaches suffer from poor calibration, undermining classifier reliability. There's a need to address calibration issues while maintaining classification performance in weakly supervised learning.Method: Proposes a calibratable disambiguation loss (CDL) with two instantiations: one calibrates predictions using probabilities from candidate label set, the other integrates probabilities from both candidate and non-candidate label sets. The loss can be seamlessly incorporated into existing MIPL and PLL frameworks.
Result: Experimental results on benchmark and real-world datasets confirm that CDL significantly enhances both classification and calibration performance compared to conventional approaches.
Conclusion: The proposed CDL provides an effective plug-and-play solution to improve calibration in MIPL and PLL frameworks, with theoretical guarantees and practical benefits for weakly supervised learning.
Abstract: Multi-instance partial-label learning (MIPL) is a weakly supervised framework that extends the principles of multi-instance learning (MIL) and partial-label learning (PLL) to address the challenges of inexact supervision in both instance and label spaces. However, existing MIPL approaches often suffer from poor calibration, undermining classifier reliability. In this work, we propose a plug-and-play calibratable disambiguation loss (CDL) that simultaneously improves classification accuracy and calibration performance. The loss has two instantiations: the first one calibrates predictions based on probabilities from the candidate label set, while the second one integrates probabilities from both candidate and non-candidate label sets. The proposed CDL can be seamlessly incorporated into existing MIPL and PLL frameworks. We provide a theoretical analysis that establishes the lower bound and regularization properties of CDL, demonstrating its superiority over conventional disambiguation losses. Experimental results on benchmark and real-world datasets confirm that our CDL significantly enhances both classification and calibration performance.
[323] Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation
Liam Collins, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Donald Loveland, Leonardo Neves, Neil Shah
Main category: cs.LG
TL;DR: ID and text features in sequential recommendation are complementary, and simple ensembling of independently trained ID- and text-based models outperforms complex fusion methods.
Details
Motivation: There's a lack of understanding about the complementarity between ID embeddings and modality (text) features in sequential recommendation. Some works claim modality features can replace IDs entirely, while others use complex fusion strategies, but neither approach properly addresses whether these features are complementary.Method: Proposes a simple ensembling method that preserves ID-text complementarity through independent training of ID-based and text-based sequential recommendation models, then combines them using a straightforward ensembling strategy.
Result: The proposed simple ensembling method outperforms several competitive sequential recommendation baselines, demonstrating that both ID and text features are necessary for state-of-the-art performance.
Conclusion: Both ID and text features are complementary and necessary for optimal sequential recommendation performance, but complex fusion architectures are not required - simple ensembling of independently trained models is sufficient.
Abstract: Modern Sequential Recommendation (SR) models commonly utilize modality features to represent items, motivated in large part by recent advancements in language and vision modeling. To do so, several works completely replace ID embeddings with modality embeddings, claiming that modality embeddings render ID embeddings unnecessary because they can match or even exceed ID embedding performance. On the other hand, many works jointly utilize ID and modality features, but posit that complex fusion strategies, such as multi-stage training and/or intricate alignment architectures, are necessary for this joint utilization. However, underlying both these lines of work is a lack of understanding of the complementarity of ID and modality features. In this work, we address this gap by studying the complementarity of ID- and text-based SR models. We show that these models do learn complementary signals, meaning that either should provide performance gain when used properly alongside the other. Motivated by this, we propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy. Despite this method’s simplicity, we show it outperforms several competitive SR baselines, implying that both ID and text features are necessary to achieve state-of-the-art SR performance but complex fusion architectures are not.
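The ensembling step can be as simple as a convex combination of the two models' item distributions; one natural instantiation is sketched below (the paper's exact combiner may differ):

```python
import torch

def ensemble_scores(id_scores, text_scores, w=0.5):
    """Simple late fusion of two independently trained sequential
    recommenders: a convex combination of their per-item probabilities."""
    p_id = torch.softmax(id_scores, dim=-1)
    p_text = torch.softmax(text_scores, dim=-1)
    return w * p_id + (1 - w) * p_text

# Toy scores from an ID-based and a text-based model: batch of 8, 1000 items.
id_scores, text_scores = torch.randn(2, 8, 1000)
topk = ensemble_scores(id_scores, text_scores).topk(10, dim=-1).indices
```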
[324] Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky
Main category: cs.LG
TL;DR: KronSAE: Kronecker-factorized Sparse Autoencoder with mAND activation for efficient and interpretable language model analysis
Details
Motivation: Training and interpreting Sparse Autoencoders (SAEs) at scale is challenging, especially with large dictionary sizes. Encoders require computationally intensive linear operations with large output dimensions, creating efficiency bottlenecks.Method: Proposes KronSAE architecture that factorizes latent representation via Kronecker product decomposition to reduce memory and computational overhead. Introduces mAND, a differentiable activation function approximating binary AND operation to improve interpretability and performance in the factorized framework.
Result: Drastically reduces memory and computational overhead compared to traditional SAEs while maintaining or improving interpretability of language model hidden states.
Conclusion: KronSAE with mAND activation provides an efficient and interpretable solution for scaling Sparse Autoencoders to analyze language models, addressing computational bottlenecks while enhancing interpretability through factorized representations.
Abstract: Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training and interpreting SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
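Code sketch: the efficiency gain comes from replacing one wide encoder matmul with two narrow ones whose pairwise combinations span the full dictionary. The product-of-ReLUs below is our stand-in for mAND (both factor units must fire for a latent to activate); the paper's exact activation may differ.

```python
import torch
import torch.nn as nn

class KronEncoderSketch(nn.Module):
    # Encodes with two small projections (m + n output units) instead of
    # one m * n projection; latents are pairwise soft-AND combinations.
    def __init__(self, d_model, m, n):
        super().__init__()
        self.A = nn.Linear(d_model, m)
        self.B = nn.Linear(d_model, n)

    def forward(self, x):                         # x: (batch, d_model)
        u = torch.relu(self.A(x))                 # (batch, m)
        v = torch.relu(self.B(x))                 # (batch, n)
        z = u.unsqueeze(-1) * v.unsqueeze(-2)     # soft AND, (batch, m, n)
        return z.flatten(1)                       # (batch, m * n) latents
```

For a dictionary of size m·n, the encoder cost drops from O(d·mn) to O(d·(m+n)) per token, which is where the memory and compute savings come from.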
[325] Weighted Stochastic Differential Equation to Implement Wasserstein-Fisher-Rao Gradient Flow
Herlock Rahimi
Main category: cs.LG
TL;DR: The paper proposes using Wasserstein-Fisher-Rao (WFR) geometries to enhance diffusion-based samplers with mass reweighting mechanisms for improved exploration in non-log-concave distributions.
Details
Motivation: Score-based diffusion models struggle with nonconvex/multimodal distributions due to poor mixing rates. Classical diffusion dynamics deteriorate exponentially in such landscapes, which are common in practical generative modeling tasks.Method: Leverages information geometry tools to augment diffusion samplers with controlled mass reweighting via WFR geometries. Implements correction terms through weighted stochastic differential equations using Feynman-Kac representation.
Result: Provides preliminary but rigorous investigation of WFR-based sampling dynamics, clarifying their geometric and operator-theoretic structure as foundation for future developments.
Conclusion: WFR-based sampling offers promising approach to improve exploration beyond classical diffusion dynamics for non-log-concave distributions in generative modeling.
Abstract: Score-based diffusion models currently constitute the state of the art in continuous generative modeling. These methods are typically formulated via overdamped or underdamped Ornstein–Uhlenbeck-type stochastic differential equations, in which sampling is driven by a combination of deterministic drift and Brownian diffusion, resulting in continuous particle trajectories in the ambient space. While such dynamics enjoy exponential convergence guarantees for strongly log-concave target distributions, it is well known that their mixing rates deteriorate exponentially in the presence of nonconvex or multimodal landscapes, such as double-well potentials. Since many practical generative modeling tasks involve highly non-log-concave target distributions, considerable recent effort has been devoted to developing sampling schemes that improve exploration beyond classical diffusion dynamics. A promising line of work leverages tools from information geometry to augment diffusion-based samplers with controlled mass reweighting mechanisms. This perspective leads naturally to Wasserstein–Fisher–Rao (WFR) geometries, which couple transport in the sample space with vertical (reaction) dynamics on the space of probability measures. In this work, we formulate such reweighting mechanisms through the introduction of explicit correction terms and show how they can be implemented via weighted stochastic differential equations using the Feynman–Kac representation. Our study provides a preliminary but rigorous investigation of WFR-based sampling dynamics, and aims to clarify their geometric and operator-theoretic structure as a foundation for future theoretical and algorithmic developments.
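Schematically, the weighted dynamics described above take a standard Feynman–Kac form (our notation, not the paper's): particles follow a drift–diffusion SDE while carrying multiplicative weights that implement the reaction (Fisher–Rao) part of the WFR geometry.

```latex
\begin{aligned}
\mathrm{d}X_t &= b(X_t)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t
  && \text{(transport: Wasserstein part)}\\
\mathrm{d}w_t &= -\,V(X_t)\,w_t\,\mathrm{d}t
  && \text{(mass reweighting: Fisher--Rao part)}
\end{aligned}
```

Weighted expectations then recover the Feynman–Kac representation $\mathbb{E}[w_t f(X_t)] = \mathbb{E}\big[f(X_t)\,e^{-\int_0^t V(X_s)\,\mathrm{d}s}\big]$, where the reaction potential $V$ controls where probability mass is created or destroyed.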
[326] The Diffusion Duality
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: Duo improves uniform-state discrete diffusion models by leveraging their connection to Gaussian diffusion, introducing curriculum learning for faster training and discrete consistency distillation for faster sampling.
Details
Motivation: Uniform-state discrete diffusion models have potential for fast text generation due to self-correction ability, but currently underperform compared to autoregressive and masked diffusion models. The authors aim to narrow this performance gap by exploiting the connection between uniform-state diffusion and underlying Gaussian diffusion processes.Method: The Duo method transfers techniques from Gaussian diffusion to discrete diffusion: 1) Curriculum learning strategy guided by Gaussian process to reduce variance and accelerate training, 2) Discrete Consistency Distillation that adapts consistency distillation from continuous to discrete setting for few-step generation.
Result: Curriculum learning doubles training speed and models surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Discrete Consistency Distillation accelerates sampling by two orders of magnitude, enabling few-step generation in diffusion language models.
Conclusion: By leveraging the connection between uniform-state discrete diffusion and Gaussian diffusion, Duo successfully narrows the performance gap with state-of-the-art models while maintaining the fast generation advantages of diffusion models.
Abstract: Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/duo
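The duality is easy to state operationally: Gaussian-diffusing a one-hot token embedding and taking an argmax yields, marginally, a uniform-state discrete corruption. A minimal sketch (the precise correspondence between alpha_t and the discrete noise level is derived in the paper):

```python
import torch

def gaussian_to_uniform_state(x0_onehot, alpha_t):
    # x0_onehot: (batch, seq, vocab) one-hot tokens; alpha_t in (0, 1).
    # Diffuse in continuous space, then project back with argmax; the
    # result is distributed like a uniform-state discrete corruption.
    noise = torch.randn_like(x0_onehot)
    x_t = alpha_t ** 0.5 * x0_onehot + (1 - alpha_t) ** 0.5 * noise
    return x_t.argmax(dim=-1)   # corrupted discrete token ids
```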
[327] Regularized Random Fourier Features and Finite Element Reconstruction for Operator Learning in Sobolev Space
Xinyue Yu, Hayden Schaeffer
Main category: cs.LG
TL;DR: Regularized random Fourier features with finite element reconstruction (RRFF-FEM) for robust operator learning from noisy data, achieving improved performance with reduced training time compared to unregularized methods.
Details
Motivation: Kernel-based operator learning methods can be computationally prohibitive for large training sets and sensitive to noise, despite offering accurate approximations with theoretical guarantees.Method: Proposes RRFF-FEM using random features from multivariate Student’s t distributions with frequency-weighted Tikhonov regularization to suppress high-frequency noise, coupled with finite element reconstruction maps.
Result: Established theoretical guarantees showing well-conditioned systems when features scale as m log m with training samples; demonstrated robustness to noise and improved performance with reduced training time across multiple PDE benchmarks.
Conclusion: RRFF and RRFF-FEM provide a computationally efficient, noise-robust alternative to kernel and neural operator methods while maintaining competitive accuracy for operator learning tasks.
Abstract: Operator learning is a data-driven approximation of mappings between infinite-dimensional function spaces, such as the solution operators of partial differential equations. Kernel-based operator learning can offer accurate, theoretically justified approximations that require less training than standard methods. However, they can become computationally prohibitive for large training sets and can be sensitive to noise. We propose a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method uses random features drawn from multivariate Student’s $t$ distributions, together with frequency-weighted Tikhonov regularization that suppresses high-frequency noise. We establish high-probability bounds on the extreme singular values of the associated random feature matrix and show that when the number of features $N$ scales like $m \log m$ with the number of training samples $m$, the system is well-conditioned, which yields estimation and generalization guarantees. Detailed numerical experiments on benchmark PDE problems, including advection, Burgers’, Darcy flow, Helmholtz, Navier-Stokes, and structural mechanics, demonstrate that RRFF and RRFF-FEM are robust to noise and achieve improved performance with reduced training time compared to the unregularized random feature model, while maintaining competitive accuracy relative to kernel and neural operator methods.
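A self-contained NumPy sketch of the RRFF recipe, under stated assumptions: frequencies drawn from a multivariate Student-t, cosine/sine features, and a Tikhonov term growing with frequency magnitude. The specific weighting 1 + |omega|^2 is our illustrative choice, and the FEM reconstruction stage is omitted.

```python
import numpy as np

def fit_rrff(X, y, n_feat=1000, nu=3.0, lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # Student-t frequency draws: normal / sqrt(chi-square / nu)
    g = rng.standard_normal((n_feat, X.shape[1]))
    u = rng.chisquare(nu, size=(n_feat, 1))
    omega = g / np.sqrt(u / nu)
    Phi = np.concatenate([np.cos(X @ omega.T), np.sin(X @ omega.T)], axis=1)
    w = 1.0 + (omega ** 2).sum(axis=1)     # frequency-dependent penalty
    D = np.concatenate([w, w])             # one weight per cos/sin feature
    coef = np.linalg.solve(Phi.T @ Phi + lam * np.diag(D), Phi.T @ y)
    return omega, coef

def predict_rrff(X, omega, coef):
    Phi = np.concatenate([np.cos(X @ omega.T), np.sin(X @ omega.T)], axis=1)
    return Phi @ coef
```

The heavy-tailed Student-t draws cover a wide band of frequencies, while the weighted ridge term keeps the high-frequency coefficients small, which is the mechanism that suppresses fitted noise.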
[328] OptScale: Probabilistic Optimality for Inference-time Scaling
Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei
Main category: cs.LG
TL;DR: OptScale: A principled probabilistic framework for inference-time scaling that dynamically determines optimal sample size to achieve target performance with minimal compute overhead.
Details
Motivation: Existing inference-time scaling approaches rely on heuristic parallel sampling strategies without theoretical foundation, creating a need for principled guidance on compute-efficient scaling.Method: Develops a probabilistic framework formalizing optimality under i.i.d. assumptions, derives theoretical lower bound for required samples, and creates OptScale algorithm with LM-based predictor to estimate prior parameters and dynamically determine minimal samples meeting performance thresholds.
Result: OptScale significantly reduces sampling overhead while matching or exceeding state-of-the-art reasoning performance on MATH-500, GSM8K, AIME, and AMC benchmarks.
Conclusion: Provides both theoretical foundation and practical solution for principled inference-time scaling, addressing critical gap in efficient LLM deployment for complex reasoning tasks.
Abstract: Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-$N$ selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling it to determine the minimal number of samples needed to satisfy predefined performance thresholds and confidence levels. Extensive experiments on representative reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while matching or exceeding state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning.
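The flavor of the sample-size bound can be seen in a deliberately simplified special case: if each i.i.d. sample is "good" with estimated probability p and Best-of-N succeeds whenever at least one good sample is drawn, the minimal N has a closed form. OptScale's actual criterion works with an estimated score distribution and confidence level; the sketch below is only the intuition.

```python
import math

def min_samples(p_success, target=0.95, n_max=64):
    # Smallest N with 1 - (1 - p)^N >= target, capped at n_max.
    if p_success <= 0.0:
        return n_max
    if p_success >= 1.0:
        return 1
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))
    return max(1, min(n, n_max))

print(min_samples(0.3))   # -> 9 samples instead of a fixed large N
```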
[329] Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
Main category: cs.LG
TL;DR: UDS (Utility-Diversity Sampling) is an efficient online batch selection framework for supervised fine-tuning of LLMs that dynamically selects valuable training samples based on both utility and diversity metrics without external resources or extra training time.
Details
Motivation: Current online batch selection methods for SFT have three main limitations: (1) they focus only on data utility while neglecting diversity, (2) they require external resources like reference models or validation sets, and (3) they incur extra training time compared to full-dataset training.Method: UDS uses nuclear norm of logits matrix to capture data utility and intra-sample diversity, and estimates inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. This eliminates need for external resources and unnecessary backpropagation.
Result: Experiments on multiple benchmarks show UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning.
Conclusion: UDS provides an effective and efficient framework for data curation in SFT that addresses key limitations of existing methods by jointly considering utility and diversity without external dependencies or computational overhead.
Abstract: Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This has motivated the rise of data curation in SFT, which prioritizes training on the most valuable data. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.
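Code sketch of the two scoring signals (the combination weight and the buffer policy below are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

def uds_scores(logits, emb, memory, beta=1.0):
    # logits: (B, T, V) per-sample logits; emb: (B, d) low-dim sample
    # embeddings; memory: (M, d) buffer of past selected samples.
    # Utility + intra-sample diversity: nuclear norm of each logits matrix.
    utility = torch.stack(
        [torch.linalg.svdvals(l.float()).sum() for l in logits])
    # Inter-sample diversity: similarity to the closest buffered sample.
    sim = F.normalize(emb, dim=-1) @ F.normalize(memory, dim=-1).T
    redundancy = sim.max(dim=-1).values      # high = close to history
    return utility - beta * redundancy       # select top-k of the batch
```

After selection, the embeddings of the kept samples would be appended to `memory` so later batches are scored against them.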
[330] Optimizing Mixture of Block Attention
Guangxuan Xiao, Junxian Guo, Kasra Mazaheri, Song Han
Main category: cs.LG
TL;DR: MoBA is an efficient attention mechanism for long contexts, but lacks theoretical understanding and GPU implementation. This paper develops a statistical model to understand MoBA’s mechanics, identifies improvements (smaller blocks and key convolution), and introduces FlashMoBA for efficient GPU execution.
Details
Motivation: MoBA shows promise for efficient long-context processing in LLMs but has two major limitations: poor understanding of its design principles and lack of efficient GPU implementation, which hinders practical adoption.Method: 1) Develop statistical model to analyze MoBA mechanics and derive signal-to-noise ratio connecting architectural parameters to retrieval accuracy. 2) Identify improvements: smaller block sizes and short convolution on keys to cluster signals. 3) Create FlashMoBA - hardware-aware CUDA kernel for efficient execution with small blocks.
Result: Improved MoBA models match dense attention baseline performance. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making theoretically-grounded improvements practical.
Conclusion: The paper provides theoretical understanding of MoBA, identifies key improvements for better routing accuracy, and delivers practical GPU implementation (FlashMoBA) that enables efficient execution of MoBA with optimal small block sizes.
Abstract: Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA’s performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA’s underlying mechanics. Our model reveals that performance critically depends on the router’s ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically-grounded improvements practical. Code is available at: https://github.com/mit-han-lab/flash-moba.
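The routing mechanics the analysis targets can be sketched in a few lines: keys are mean-pooled into block centroids, each query scores the centroids, and only the top-k blocks are attended. The short depthwise convolution on keys (passed as `conv` below) is the paper's proposed fix for clustering relevant signal; everything else is a simplified stand-in for the fused FlashMoBA kernel.

```python
import torch

def moba_route(q, k, block_size=32, top_k=4, conv=None):
    # q: (B, Tq, d) queries, k: (B, Tk, d) keys.
    if conv is not None:   # e.g. nn.Conv1d(d, d, 3, padding=1, groups=d)
        k = conv(k.transpose(1, 2)).transpose(1, 2)
    B, T, d = k.shape
    blocks = k[:, : T - T % block_size].reshape(B, -1, block_size, d)
    centroids = blocks.mean(dim=2)                    # (B, n_blocks, d)
    affinity = torch.einsum("btd,bnd->btn", q, centroids)
    return affinity.topk(top_k, dim=-1).indices       # blocks per query
```

The paper's signal-to-noise argument says routing accuracy improves as blocks shrink (centroids are less diluted by irrelevant keys), which is exactly the regime FlashMoBA makes efficient on GPUs.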
[331] The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems
Debu Sinha
Main category: cs.LG
TL;DR: Conformal prediction reveals embedding-based hallucination detection fails catastrophically on real hallucinations from RLHF-aligned models, achieving 100% FPR at target coverage, while GPT-4 achieves 7% FPR, exposing a “Semantic Illusion” where the hardest hallucinations are semantically indistinguishable from faithful responses.
Details
Motivation: RAG systems remain susceptible to hallucinations despite grounding in retrieved evidence, and current detection methods (embedding similarity and NLI) have unproven reliability in safety-critical settings. The paper aims to apply rigorous statistical methods to assess and improve hallucination detection reliability.Method: Apply conformal prediction to RAG hallucination detection to transform heuristic scores into decision sets with finite-sample coverage guarantees. Use calibration sets of n=600 to evaluate methods on both synthetic hallucinations (Natural Questions) and real hallucinations from RLHF-aligned models (HaluEval). Analyze failure through distributional tails and compare embedding methods, NLI models, and GPT-4 as a judge.
Result: Fundamental dichotomy discovered: on synthetic hallucinations, embedding methods achieve 95% coverage with 0% FPR, but on real hallucinations from RLHF models, same methods fail catastrophically with 100% FPR at target coverage. NLI models achieve acceptable AUC (0.81) but struggle with hardest cases. GPT-4 achieves 7% FPR (95% CI:[3.4%, 13.7%]), proving task is solvable via reasoning but opaque to surface-level semantics.
Conclusion: Real hallucinations from RLHF-aligned models create a “Semantic Illusion” where hardest hallucinations are semantically indistinguishable from faithful responses, making surface-level semantic methods fundamentally inadequate. The problem requires reasoning-based approaches like GPT-4 rather than embedding similarity or NLI methods, highlighting a critical limitation in current RAG safety evaluation.
Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. While current detection methods leverage embedding similarity and natural language inference (NLI), their reliability in safety-critical settings remains unproven. We apply conformal prediction to RAG hallucination detection, transforming heuristic scores into decision sets with finite-sample coverage guarantees (1-alpha). Using calibration sets of n=600, we demonstrate a fundamental dichotomy: on synthetic hallucinations (Natural Questions), embedding methods achieve 95% coverage with 0% False Positive Rate (FPR). However, on real hallucinations from RLHF-aligned models (HaluEval), the same methods fail catastrophically, yielding 100% FPR at target coverage. We analyze this failure through the lens of distributional tails, showing that while NLI models achieve acceptable AUC (0.81), the “hardest” hallucinations are semantically indistinguishable from faithful responses, forcing conformal thresholds to reject nearly all valid outputs. Crucially, GPT-4 as a judge achieves 7% FPR (95% CI:[3.4%, 13.7%]) on the same data, proving the task is solvable via reasoning but opaque to surface-level semantics–a phenomenon we term the “Semantic Illusion.”
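The conformal layer here is standard split conformal prediction, which is easy to reproduce; the nonconformity score is whatever base detector one wraps (for example, one minus an embedding similarity to the retrieved evidence):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.05):
    # Threshold = ceil((n + 1) * (1 - alpha)) / n empirical quantile of
    # calibration scores from faithful responses; guarantees >= 1 - alpha
    # coverage on exchangeable test points.
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

cal = np.random.rand(600) * 0.3     # stand-in calibration scores
tau = conformal_threshold(cal)      # flag score > tau as hallucination
```

The paper's failure mode then has a crisp reading: when hallucination scores overlap the upper tail of faithful scores, the calibrated threshold must sit above nearly all of them, driving the false positive rate toward 100%.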
[332] Sparse, Efficient and Explainable Data Attribution with DualXDA
Galip Ümit Yolcu, Moritz Weckbecker, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Main category: cs.LG
TL;DR: DualXDA introduces a framework for sparse, efficient, and explainable data attribution that addresses computational cost and sparsity issues in existing methods through two approaches: DualDA for fast sparse attributions and XDA for feature-level explanations.
Details
Motivation: Existing Data Attribution methods suffer from high computational costs and memory demands, forcing approximations that may not capture true model inference. They also exhibit low sparsity, making it difficult to identify decisive patterns in training data.Method: DualXDA framework consists of two interlinked approaches: DualDA leverages Support Vector Machine theory for fast, naturally sparse data attributions, while XDA enhances data attribution with feature attribution capabilities to explain why training samples are relevant.
Result: DualDA achieves high attribution quality and excels at downstream tasks while improving explanation time by up to 4,100,000x compared to original Influence Functions and up to 11,000x compared to the most efficient approximation. XDA provides feature-level explanations verified qualitatively.
Conclusion: DualXDA provides a comprehensive solution for sparse, efficient, and explainable data attribution that addresses key limitations of existing methods, enabling practical application to medium-scale datasets and models while maintaining attribution quality.
Abstract: Data Attribution (DA) is an emerging approach in the field of eXplainable Artificial Intelligence (XAI), aiming to identify influential training datapoints which determine model outputs. It seeks to provide transparency about the model and individual predictions, e.g., for model debugging or identifying data-related causes of suboptimal performance. However, existing DA approaches suffer from prohibitively high computational costs and memory demands when applied to even medium-scale datasets and models, forcing practitioners to resort to approximations that may fail to capture the true inference process of the underlying model. Additionally, current attribution methods exhibit low sparsity, resulting in non-negligible attribution scores across a high number of training examples, hindering the discovery of decisive patterns in the data. In this work, we introduce DualXDA, a framework for sparse, efficient and explainable DA, comprised of two interlinked approaches, Dual Data Attribution (DualDA) and eXplainable Data Attribution (XDA): With DualDA, we propose a novel approach for efficient and effective DA, leveraging Support Vector Machine theory to provide fast and naturally sparse data attributions for AI predictions. In extensive quantitative analyses, we demonstrate that DualDA achieves high attribution quality, excels at solving a series of evaluated downstream tasks, while at the same time improving explanation time by a factor of up to 4,100,000x compared to the original Influence Functions method, and up to 11,000x compared to the method’s most efficient approximation from literature to date. We further introduce XDA, a method for enhancing Data Attribution with capabilities from feature attribution methods to explain why training samples are relevant for the prediction of a test sample in terms of impactful features, which we showcase and verify qualitatively in detail.
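To see why SVM duality gives sparse attributions, here is a schematic binary-classification version (an illustration of the underlying principle, not the authors' implementation): only support vectors carry nonzero dual coefficients, so most training points receive exactly zero attribution.

```python
import numpy as np
from sklearn.svm import SVC

def dual_attributions(train_feats, train_labels, test_feat):
    # Fit an SVM on (e.g. penultimate-layer) features; binary labels
    # assumed for simplicity. Attribution of training point i to the
    # test prediction is alpha_i * y_i * k(x_i, x_test), nonzero only
    # for support vectors.
    svm = SVC(kernel="linear").fit(train_feats, train_labels)
    scores = np.zeros(len(train_feats))
    sims = svm.support_vectors_ @ test_feat      # linear kernel values
    scores[svm.support_] = svm.dual_coef_[0] * sims
    return scores
```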
[333] Privacy Bias in Language Models: A Contextual Integrity-based Auditing Metric
Yan Shvartzshnaider, Vasisht Duddu
Main category: cs.LG
TL;DR: The paper introduces “privacy bias” as a metric to audit LLMs, defining it as the appropriateness of information flows in responses, with deviations from expected values indicating potential privacy violations.
Details
Motivation: As LLMs are integrated into sociotechnical systems, there's a critical need to examine the privacy biases they exhibit to prevent privacy violations and ensure ethical deployment.Method: The authors propose a contextual integrity-based methodology to assess privacy biases in LLMs, accounting for response sensitivity across prompt variations and investigating how biases are affected by model capacities and optimizations.
Result: The paper presents a novel approach for reliably examining privacy biases in LLMs and the factors influencing them, providing an auditing metric for model trainers, service providers, and policymakers.
Conclusion: Privacy bias serves as a crucial auditing metric that can help evaluate ethical impacts, guide LLM selection, and assess appropriateness in deployed systems, addressing a significant gap in LLM privacy evaluation.
Abstract: As large language models (LLMs) are integrated into sociotechnical systems, it is crucial to examine the privacy biases they exhibit. We define privacy bias as the appropriateness value of information flows in responses from LLMs. A deviation between privacy biases and expected values, referred to as privacy bias delta, may indicate privacy violations. As an auditing metric, privacy bias can help (a) model trainers evaluate the ethical and societal impact of LLMs, (b) service providers select context-appropriate LLMs, and (c) policymakers assess the appropriateness of privacy biases in deployed LLMs. We formulate and answer a novel research question: how can we reliably examine privacy biases in LLMs and the factors that influence them? We present a novel approach for assessing privacy biases using a contextual integrity-based methodology to evaluate the responses from various LLMs. Our approach accounts for the sensitivity of responses across prompt variations, which hinders the evaluation of privacy biases. Finally, we investigate how privacy biases are affected by model capacities and optimizations.
[334] Feed Two Birds with One Scone: Exploiting Wild Data for Both Out-of-Distribution Generalization and Detection
Haoyue Bai, Gregory Canal, Xuefeng Du, Jeongyeol Kwon, Robert Nowak, Yixuan Li
Main category: cs.LG
TL;DR: A unified framework that simultaneously addresses both OOD generalization (covariate shift) and OOD detection (semantic shift) using margin-based learning with unlabeled test-time data.
Details
Motivation: Current research treats OOD generalization and OOD detection as separate problems with conflicting goals, but real-world deployments require handling both covariate and semantic shifts simultaneously.Method: Proposes a margin-based learning framework that leverages freely available unlabeled test-time data to capture both covariate and semantic shift distributions, using margin constraints as the key mechanism.
Result: Extensive experiments show the framework outperforms specialized baselines in both OOD generalization and detection tasks, with theoretical and empirical validation of the margin constraint’s effectiveness.
Conclusion: Margin constraints enable simultaneous achievement of OOD generalization and detection, providing a unified solution for real-world deployment where both types of distribution shifts occur.
Abstract: Modern machine learning models deployed in the wild can encounter both covariate and semantic shifts, giving rise to the problems of out-of-distribution (OOD) generalization and OOD detection respectively. While both problems have received significant research attention lately, they have been pursued independently. This may not be surprising, since the two tasks have seemingly conflicting goals. This paper provides a new unified approach that is capable of simultaneously generalizing to covariate shifts while robustly detecting semantic shifts. We propose a margin-based learning framework that exploits freely available unlabeled data in the wild that captures the environmental test-time OOD distributions under both covariate and semantic shifts. We show both empirically and theoretically that the proposed margin constraint is the key to achieving both OOD generalization and detection. Extensive experiments show the superiority of our framework, outperforming competitive baselines that specialize in either OOD generalization or OOD detection. Code is publicly available at https://github.com/deeplearning-wisc/scone.
[335] HGQ: High Granularity Quantization for Real-time Neural Networks on FPGAs
Chang Sun, Zhiqiang Que, Thea K. Årrestad, Vladimir Loncar, Jennifer Ngadiuba, Wayne Luk, Maria Spiropulu
Main category: cs.LG
TL;DR: HGQ is a quantization-aware training framework that optimizes parameter bit-widths through gradient descent for ultra-low latency neural networks on FPGAs, enabling sub-microsecond inference for critical applications like particle physics experiments.
Details
Motivation: Many critical applications require neural networks with sub-microsecond inference latency, particularly for deployment on FPGAs. Existing methods don't adequately address the need for heterogeneous arbitrary precision arithmetic on hardware platforms.Method: High Granularity Quantization (HGQ) is a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms supporting heterogeneous arbitrary precision arithmetic.
Result: HGQ shows superior performance compared to existing network compression methods, achieving orders of magnitude reduction in resource consumption and latency while maintaining accuracy on several benchmark tasks. These improvements enable deployment of complex models previously infeasible due to resource or latency constraints.
Conclusion: HGQ enables the deployment of advanced machine learning models for real-time data selection with sub-microsecond latency, and is being used for developing next-generation trigger systems at CERN ATLAS and CMS experiments for particle physics. The framework is open-source.
Abstract: Neural networks with sub-microsecond inference latency are required by many critical applications. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms supporting heterogeneous arbitrary precision arithmetic. In our experiments, HGQ shows superior performance compared to existing network compression methods, achieving orders of magnitude reduction in resource consumption and latency while maintaining the accuracy on several benchmark tasks. These improvements enable the deployment of complex models previously infeasible due to resource or latency constraints. HGQ is open-source and is used for developing next-generation trigger systems at the CERN ATLAS and CMS experiments for particle physics, enabling the use of advanced machine learning models for real-time data selection with sub-microsecond latency.
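The core mechanism, a bit-width per parameter learned by gradient descent, can be sketched with a straight-through fixed-point quantizer plus a resource penalty on the bit-widths. HGQ's actual gradient surrogate and its hls4ml integration are more involved; this is an illustrative simplification in which the bit-widths are trained mainly through the penalty term.

```python
import torch
import torch.nn as nn

class LearnedBitQuant(nn.Module):
    # One trainable (fractional) bit-width per weight; add
    # beta * resource_penalty() to the task loss so training trades
    # accuracy against the on-chip bit budget.
    def __init__(self, shape, init_bits=8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.full(shape, init_bits))

    def forward(self, w):
        scale = 2.0 ** self.bits.clamp(1.0, 16.0)
        x = w * scale
        x_q = x + (torch.round(x) - x).detach()   # straight-through round
        return x_q / scale

    def resource_penalty(self):
        return self.bits.clamp(min=0.0).sum()     # proxy for LUT/DSP cost
```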
[336] On the Identification of Temporally Causal Representation with Instantaneous Dependence
Zijian Li, Yifan Shen, Kaitao Zheng, Ruichu Cai, Xiangchen Song, Mingming Gong, Guangyi Chen, Kun Zhang
Main category: cs.LG
TL;DR: IDOL: A framework for identifying latent causal processes with instantaneous relations using sparse influence constraints, without requiring interventions or grouped observations.
Details
Motivation: Existing methods for temporally causal representation learning either assume no instantaneous relations or require impractical interventions/grouped observations. There's a need for methods that can handle instantaneous causality in real-world scenarios without these requirements.Method: Proposes IDOL framework using sparse influence constraint (sparse time-delayed and instantaneous relations) with contextual information. Combines temporally variational inference for latent variable estimation with gradient-based sparsity regularization for causal identification.
Result: Theoretical identifiability results established based on sufficient variability and sparse influence constraints. Experimental validation shows successful identification on simulation datasets and effectiveness on human motion forecasting benchmarks with instantaneous dependencies.
Conclusion: IDOL enables identification of latent causal processes with instantaneous relations without requiring interventions or grouped observations, making it applicable to real-world scenarios where such information is difficult to obtain.
Abstract: Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an IDentification framework for instantaneOus Latent dynamics (IDOL) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.
[337] Low-Rank Filtering and Smoothing for Sequential Deep Learning
Joanna Sliwa, Frank Schneider, Nathanael Bosch, Agustinus Kristiadi, Philipp Hennig
Main category: cs.LG
TL;DR: A Bayesian framework for continual learning that treats neural network parameters as states in a nonlinear Gaussian model, enabling principled encoding of task relationships and bidirectional knowledge transfer through Bayesian smoothing.
Details
Motivation: Current continual learning methods with parameter regularization lack ways to incorporate prior knowledge about task relationships and only allow forward information flow (from past to future tasks), limiting flexibility and knowledge retention.Method: Proposes a Bayesian framework modeling network parameters as state space of nonlinear Gaussian model, with two key innovations: 1) Encoding domain knowledge about task relationships to control layer adaptation, 2) Applying Bayesian smoothing for bidirectional knowledge transfer (future tasks can inform past ones). Uses efficient diagonal plus low-rank approximations of precision matrix in Laplace approximation (LR-LGF) for filtering and smoothing.
Result: Empirical results demonstrate efficiency of LR-LGF approximation and benefits of the unlocked capabilities - principled task relationship encoding and bidirectional knowledge transfer without accessing future task data directly.
Conclusion: The Bayesian framework provides a principled way to incorporate task relationship knowledge and enables bidirectional knowledge transfer in continual learning, addressing limitations of existing regularization approaches while maintaining efficiency through LR-LGF approximations.
Abstract: Learning multiple tasks sequentially requires neural networks to balance retaining knowledge, yet being flexible enough to adapt to new tasks. Regularizing network parameters is a common approach, but it rarely incorporates prior knowledge about task relationships, and limits information flow to future tasks only. We propose a Bayesian framework that treats the network’s parameters as the state space of a nonlinear Gaussian model, unlocking two key capabilities: (1) A principled way to encode domain knowledge about task relationships, allowing, e.g., control over which layers should adapt between tasks. (2) A novel application of Bayesian smoothing, allowing task-specific models to also incorporate knowledge from models learned later. This does not require direct access to their data, which is crucial, e.g., for privacy-critical applications. These capabilities rely on efficient filtering and smoothing operations, for which we propose diagonal plus low-rank approximations of the precision matrix in the Laplace approximation (LR-LGF). Empirical results demonstrate the efficiency of LR-LGF and the benefits of the unlocked capabilities.
[338] Hierarchical Multimodal LLMs with Semantic Space Alignment for Enhanced Time Series Classification
Xiaoyu Tao, Tingyue Pan, Mingyue Cheng, Yucong Luo, Qi Liu, Enhong Chen
Main category: cs.LG
TL;DR: HiTime: A hierarchical LLM framework that bridges time series data with language models for classification by transforming it into a generative task.
Details
Motivation: LLMs have strong generalization and reasoning capabilities but can't be directly applied to time series classification due to the representation gap between numerical sequences and linguistic semantics.Method: 1) Hierarchical sequence feature encoding with data-specific and task-specific encoders; 2) Semantic space alignment module for coarse-grained global modeling and fine-grained cross-modal correspondence; 3) Parameter-efficient supervised fine-tuning to activate LLMs’ generative classification capability.
Result: Extensive experiments on multiple benchmarks show the framework consistently outperforms state-of-the-art baselines.
Conclusion: HiTime successfully bridges structured temporal representations with semantic reasoning, transforming discriminative time series classification into a generative task using LLMs.
Abstract: Time series classification plays a fundamental role in a wide range of real-world applications. Recently, large language models (LLMs) have demonstrated strong generalization and reasoning capacities, but directly applying them to time series classification remains non-trivial due to the representation gap between numerical sequences and linguistic semantics. In this paper, we propose HiTime, a hierarchical LLM-based framework for multimodal time series classification that bridges structured temporal representations with semantic reasoning in a generative paradigm. Specifically, we design a hierarchical sequence feature encoding module composed of a data-specific encoder and a task-specific encoder to extract complementary temporal features. To mitigate the embedding gap between time series representations and textual semantics, we further introduce a semantic space alignment module that jointly performs coarse-grained global modeling and fine-grained cross-modal correspondence. Building upon the above representations, we employ a parameter-efficient supervised fine-tuning strategy to activate the generative classification capability of the aligned LLMs, thereby transforming conventional discriminative time series classification into a generative task. Extensive experiments on multiple benchmarks demonstrate that the proposed framework consistently outperforms state-of-the-art baselines. The code is publicly available at https://github.com/Xiaoyu-Tao/HiTime.
[339] Fairness via Independence: A (Conditional) Distance Covariance Framework
Ruifan Huang, Haixia Liu
Main category: cs.LG
TL;DR: The paper proposes a fairness method using distance covariance statistics to measure and enforce independence between predictions and sensitive attributes, with computational optimizations for efficiency.
Details
Motivation: To address fairness in machine learning by statistically measuring and enforcing independence between model predictions and sensitive attributes, bridging the fairness gap while maintaining computational efficiency.Method: Uses conditional distance covariance or distance covariance statistics to assess independence, adds a distance covariance-based penalty during training, and introduces matrix form for parallel computation to improve efficiency.
Result: Theoretical proof of convergence between empirical and population distance covariance, and experimental results on real-world datasets show effective reduction of fairness gaps in machine learning models.
Conclusion: The proposed method successfully enhances fairness by enforcing statistical independence between predictions and sensitive attributes, with computational efficiency improvements making it practical for real-world applications.
Abstract: We explore fairness from a statistical perspective by selectively utilizing either conditional distance covariance or distance covariance statistics as measures to assess the independence between predictions and sensitive attributes. We boost fairness with independence by adding a distance covariance-based penalty to the model’s training. Additionally, we present the matrix form of empirical (conditional) distance covariance for parallel calculations to enhance computational efficiency. Theoretically, we provide a proof for the convergence between empirical and population (conditional) distance covariance, establishing necessary guarantees for batch computations. Through experiments conducted on a range of real-world datasets, we have demonstrated that our method effectively bridges the fairness gap in machine learning. Our code is available at https://github.com/liuhaixias1/Fair_dc/.
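The matrix form of the (unconditional) empirical distance covariance is compact enough to show in full; it vanishes asymptotically iff predictions and sensitive attributes are independent, which is exactly what the training penalty enforces. A NumPy sketch of the V-statistic version:

```python
import numpy as np

def distance_covariance_sq(x, z):
    # x: (n, p) predictions, z: (n, q) sensitive attributes.
    def centered_dist(a):
        D = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
        return (D - D.mean(axis=0, keepdims=True)
                  - D.mean(axis=1, keepdims=True) + D.mean())
    return (centered_dist(x) * centered_dist(z)).mean()
```

Training then minimizes something like `task_loss + gamma * distance_covariance_sq(preds, sensitive)`; the paper's matrix formulation exists precisely so this penalty can be batched and parallelized.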
[340] Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning
Simon Frieder, Jonas Bayer, Sam Looi, Jacob Loader, Julius Berner, Katherine M. Collins, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Cameron Freer, Thomas Lukasiewicz, Timothy Gowers
Main category: cs.LG
TL;DR: Current math benchmarks for AI copilots are flawed: too narrow in complexity, blind to the proof discovery process, and subject to Goodhart’s law. Richer datasets capturing mathematical practice are needed, especially ones built around the “motivated proof” concept.
Details
Motivation: Existing datasets for evaluating AI mathematical capabilities have critical flaws: limited mathematical complexity, failure to capture proof discovery processes, and becoming unreliable due to Goodhart's law (benchmarks become targets rather than genuine capability measures).Method: Systematic exploration of current benchmark limitations and proposing a course correction: advocating for datasets that translate richer facets of mathematical research practice into learnable data, particularly focusing on “motivated proofs” concept by G. Pólya (1949) that supervises both proving and proof discovery processes.
Result: Identifies fundamental shortcomings in current mathematical AI benchmarks and establishes the need for paradigm shift from result-based datasets (theorem→proof mapping) to process-oriented datasets that capture mathematical thinking and discovery.
Conclusion: To enhance AI mathematical capabilities, benchmarks must move beyond simple theorem-proof pairs and instead focus on datasets that capture the richer aspects of mathematical practice, particularly the “motivated proof” concept that provides better learning signals for proof discovery processes.
Abstract: The datasets and benchmarks commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings and misdirections. These range from a restricted scope of mathematical complexity to limited fidelity in capturing aspects beyond the final, written proof (e.g. motivating the proof, or representing the thought processes leading to a proof). These issues are compounded by a dynamic reminiscent of Goodhart’s law: as benchmark performance becomes the primary target for model development, the benchmarks themselves become less reliable indicators of genuine mathematical capability. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or any forthcoming advancements in AI-based mathematical assistants (copilots or “thought partners”), necessitates a course correction both in the design of mathematical datasets and the evaluation criteria of the models’ mathematical ability. In particular, it is necessary for benchmarks to move beyond the existing result-based datasets that map theorem statements directly to proofs, and instead focus on datasets that translate the richer facets of mathematical research practice into data that LLMs can learn from. This includes benchmarks that supervise the proving process and the proof discovery process itself, and we advocate for mathematical dataset developers to consider the concept of “motivated proof”, introduced by G. Pólya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal, alleviating some of the mentioned limitations.
[341] Pairwise Elimination with Instance-Dependent Guarantees for Bandits with Cost Subsidy
Ishank Juneja, Carlee Joe-Wong, Osman Yağan
Main category: cs.LG
TL;DR: The paper introduces new algorithms (PE and PE-CS) for Multi-Armed Bandits with Cost Subsidy problems, achieving order-optimal logarithmic regret bounds for cost-constrained decision making with unknown rewards.
Details
Motivation: In many real-world sequential decision problems, minimizing cost subject to reward constraints is more important than maximizing total reward. The MAB-CS framework addresses this practical need where decisions must meet minimum reward requirements while minimizing costs, with applications in recommendation systems and other domains.Method: The authors propose the Pairwise-Elimination (PE) algorithm for the known reference arm variant and generalize it to PE-CS for the subsidized best reward variant. Both algorithms use elimination-based approaches to identify arms that satisfy reward constraints while minimizing costs.
Result: PE and PE-CS achieve order-wise logarithmic upper bounds on both Cost and Quality Regret, making them the first algorithms with such guarantees. PE is proven order-optimal for all known reference arm problem instances. Experiments on MovieLens 25M and Goodreads datasets show PE’s effectiveness and PE-CS’s superior balance between performance and reliability compared to baselines.
Conclusion: The proposed PE and PE-CS algorithms provide theoretically sound and practically effective solutions for cost-constrained bandit problems, with proven optimality guarantees and strong empirical performance on real-world datasets.
Abstract: Multi-armed bandits (MAB) are commonly used in sequential online decision-making when the reward of each decision is an unknown random variable. In practice, however, the typical goal of maximizing total reward may be less important than minimizing the total cost of the decisions taken, subject to a reward constraint. For example, we may seek to make decisions that have at least the reward of a reference “default” decision, with as low a cost as possible. This problem was recently introduced in the Multi-Armed Bandits with Cost Subsidy (MAB-CS) framework. MAB-CS is broadly applicable to problem domains where a primary metric (cost) is constrained by a secondary metric (reward), and the rewards are unknown. In our work, we address variants of MAB-CS including ones with reward constrained by the reward of a known reference arm or by the subsidized best reward. We introduce the Pairwise-Elimination (PE) algorithm for the known reference arm variant and generalize PE to PE-CS for the subsidized best reward variant. Our instance-dependent analysis of PE and PE-CS reveals that both algorithms have an order-wise logarithmic upper bound on Cost and Quality Regret, making our policies the first with such a guarantee. Moreover, by comparing our upper and lower bound results we establish that PE is order-optimal for all known reference arm problem instances. Finally, experiments are conducted using the MovieLens 25M and Goodreads datasets for both PE and PE-CS revealing the effectiveness of PE and the superior balance between performance and reliability offered by PE-CS compared to baselines from the literature.
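A compressed sketch of the known-reference-arm variant: examine arms cheapest-first, accept an arm once its reward lower confidence bound clears the reference reward, and eliminate it once its upper bound falls below. The confidence radii and episode schedule here are simplified relative to the paper's analysis.

```python
import numpy as np

def pe_known_reference(pull, costs, mu_ref, n_max=2000, delta=0.01):
    # pull(i) -> one reward sample in [0, 1] for arm i (hypothetical API).
    for i in np.argsort(costs):            # cheapest candidates first
        n, s = 0, 0.0
        while n < n_max:
            s += pull(i)
            n += 1
            rad = np.sqrt(np.log(2.0 * n_max / delta) / (2.0 * n))
            if s / n - rad >= mu_ref:      # provably meets the constraint
                return i                   # commit to the cheapest such arm
            if s / n + rad < mu_ref:       # provably violates it
                break                      # eliminate, try next-cheapest
    return None                            # none qualified: play the
                                           # reference arm itself
```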
[342] Towards Human-Guided, Data-Centric LLM Co-Pilots
Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar
Main category: cs.LG
TL;DR: CliMB-DC is a human-guided, data-centric LLM co-pilot framework that addresses data quality challenges in ML adoption by combining multi-agent reasoning with domain expertise and state-of-the-art data-centric tools.
Details
Motivation: Current LLM-based ML co-pilots focus too much on model-centric aspects while ignoring critical data-centric challenges like missing values, label noise, and domain-specific nuances, which hinders ML adoption by non-technical domain experts in complex real-world settings.Method: Introduces a multi-agent reasoning system with a strategic coordinator for dynamic planning/adaptation and specialized worker agents for execution. Incorporates domain expertise through human-in-the-loop approach, formalizes a taxonomy of data-centric challenges, and integrates state-of-the-art data-centric tools into an extensible open-source architecture.
Result: Empirical evaluation on real-world healthcare datasets shows CliMB-DC can transform uncurated datasets into ML-ready formats and significantly outperforms existing co-pilot baselines in handling data-centric challenges.
Conclusion: CliMB-DC empowers domain experts from diverse fields (healthcare, finance, social sciences) to actively participate in ML-driven real-world impact by addressing the critical gap in data-centric ML tooling.
Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC’s ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains – healthcare, finance, social sciences and more – to actively participate in driving real-world impact using ML.
[343] Regularized Langevin Dynamics for Combinatorial Optimization
Shengyu Feng, Yiming Yang
Main category: cs.LG
TL;DR: Proposes Regularized Langevin Dynamics (RLD), a sampling framework for combinatorial optimization that improves exploration by enforcing distance between sampled and current solutions, with SA and NN implementations achieving SOTA performance.
Details
Motivation: Directly applying discrete Langevin dynamics to combinatorial optimization suffers from limited exploration and gets stuck in local minima, motivating a mechanism that enhances solution diversity and quality.Method: Introduces Regularized Langevin Dynamics (RLD), which enforces an expected distance between sampled and current solutions to avoid local minima. Develops two solvers: one based on simulated annealing (SA) and another on neural networks (NN).
Result: Both RLD-based methods achieve comparable or better performance than previous SOTA SA- and NN-based solvers on three classic CO problems. The SA algorithm reduces runtime by up to 80% while maintaining equal or superior performance.
Conclusion: RLD provides an effective framework for enhancing both traditional heuristics and neural network models in solving combinatorial optimization problems, offering improved exploration and efficiency.
Abstract: This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA), and the other one based on neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.
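The regularization idea can be sketched for binary problems: compute gradient-guided flip probabilities as in discrete Langevin dynamics, then rescale them so the expected Hamming distance of the move matches a target, which is what keeps the sampler from collapsing into a local minimum. The specific rescaling below is an illustrative reading of the expected-distance constraint, not the paper's exact formulation.

```python
import torch

def rld_step(x, energy_fn, temp=0.5, target_dist=8.0):
    # x: (n,) tensor of {0, 1} variables; energy_fn: differentiable
    # relaxation of the CO objective (lower is better).
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
    flip_gain = -grad * (1.0 - 2.0 * x)      # energy drop if bit i flips
    probs = torch.sigmoid(flip_gain / temp)
    # enforce an expected Hamming distance of ~target_dist from x
    probs = (probs * target_dist / probs.sum().clamp_min(1e-8)).clamp(0, 1)
    flips = torch.bernoulli(probs)
    return (x + flips * (1.0 - 2.0 * x)).detach()
```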
[344] STDiff: A State Transition Diffusion Framework for Time Series Imputation in Industrial Systems
Gary Simethy, Daniel Ortiz-Arroyo, Petar Durdevic
Main category: cs.LG
TL;DR: STDiff and STDiff-W are diffusion-based models for imputing missing sensor data in wastewater treatment plants, with STDiff-W adding context encoding for better handling of contiguous gaps while preserving realistic dynamics.
Details
Motivation: Incomplete sensor data with long, irregular gaps is a major problem in industrial time-series analytics, particularly in wastewater treatment plants where sensor fouling, maintenance, and outages cause significant data loss that hinders analytics and monitoring.Method: STDiff learns a one-step transition model conditioned on observed values and masks using a diffusion-based approach. STDiff-W extends this with a context encoder that jointly inpaints contiguous blocks, combining long-range consistency with short-term detail by casting gap filling as state-space simulation under partial observability.
Result: STDiff-W achieves state-of-the-art accuracy on two WWTP datasets compared with strong neural baselines (SAITS, BRITS, CSDI). It preserves realistic dynamics including oscillations, spikes, and regime shifts, and achieves top downstream forecasting performance. Ablation studies show meaningful dependencies on control and exogenous inputs.
Conclusion: The models effectively handle missing sensor data in industrial settings. Practical deployment guidance includes evaluating beyond MAE with task-oriented checks, including exogenous drivers, and balancing computational cost against robustness to structured outages.
Abstract: Incomplete sensor data is a major obstacle in industrial time-series analytics. In wastewater treatment plants (WWTPs), key sensors show long, irregular gaps caused by fouling, maintenance, and outages. We introduce STDiff and STDiff-W, diffusion-based imputers that cast gap filling as state-space simulation under partial observability, where targets, controls, and exogenous signals may all be intermittently missing. STDiff learns a one-step transition model conditioned on observed values and masks, while STDiff-W extends this with a context encoder that jointly inpaints contiguous blocks, combining long-range consistency with short-term detail. On two WWTP datasets (one with synthetic block gaps from Agtrup and another with natural outages from Avedøre), STDiff-W achieves state-of-the-art accuracy compared with strong neural baselines such as SAITS, BRITS, and CSDI. Beyond point-error metrics, its reconstructions preserve realistic dynamics including oscillations, spikes, and regime shifts, and they achieve top or tied-top downstream one-step forecasting performance compared with strong neural baselines, indicating that preserving dynamics does not come at the expense of predictive utility. Ablation studies that drop, shuffle, or add noise to control or exogenous inputs consistently degrade NH4 and PO4 performance, with the largest deterioration observed when exogenous signals are removed, showing that the model captures meaningful dependencies. We conclude with practical guidance for deployment: evaluate performance beyond MAE using task-oriented and visual checks, include exogenous drivers, and balance computational cost against robustness to structured outages.
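To make the mask-conditioned setup concrete, here is a minimal denoising-imputation training loop in the common CSDI style; STDiff's one-step transition parameterization and STDiff-W's context encoder are not reproduced, and the tiny `Denoiser` and linear noising schedule are our own simplifications:

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    # Conditions on the noisy sample, the observed values, and the mask.
    def __init__(self, d=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * d + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))

    def forward(self, x_noisy, x_obs, mask, t):
        inp = torch.cat([x_noisy, x_obs * mask, mask,
                         t.expand(x_noisy.shape[0], 1)], dim=-1)
        return self.net(inp)

d = 8
model = Denoiser(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, d)                      # a batch of sensor snapshots
mask = (torch.rand(32, d) > 0.3).float()    # 1 = observed, 0 = missing
for step in range(200):
    t = torch.rand(1)
    noise = torch.randn_like(x)
    x_noisy = (1 - t) * x + t * noise       # simple linear noising schedule
    pred = model(x_noisy, x, mask, t)
    loss = (((pred - noise) ** 2) * (1 - mask)).mean()  # denoise missing only
    opt.zero_grad(); loss.backward(); opt.step()
print("final loss:", float(loss))
```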
[345] Generating Samples to Probe Trained Models
Eren Mehmet Kıral, Nurşen Aydın, Ş. İlker Birbil
Main category: cs.LG
TL;DR: A framework for understanding ML models by identifying their preferred data samples through mathematical queries.
Details
Motivation: There's a growing need to investigate how machine learning models operate and understand their decision-making processes by examining their data preferences.Method: Proposed mathematical framework to probe trained models and identify preferred samples in various scenarios including prediction-risky, parameter-sensitive, or model-contrastive samples.
Result: Applied the framework to a range of models trained on classification and regression tasks, receiving answers in the form of generated data that reveals model preferences.
Conclusion: The framework enables systematic investigation of model behavior by questioning data preferences, providing insights into how models operate through generated data samples.
Abstract: There is a growing need for investigating how machine learning models operate. With this work, we aim to understand trained machine learning models by questioning their data preferences. We propose a mathematical framework that allows us to probe trained models and identify their preferred samples in various scenarios including prediction-risky, parameter-sensitive, or model-contrastive samples. To showcase our framework, we pose these queries to a range of models trained on a range of classification and regression tasks, and receive answers in the form of generated data.
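As a toy illustration of one such query, the sketch below searches input space for a "prediction-risky" sample by gradient ascent on predictive entropy; the entropy objective and the untrained stand-in `model` are our assumptions, not the paper's formulation:

```python
import torch

torch.manual_seed(0)
# Stand-in for a trained, frozen classifier we want to probe.
model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 2))

x = torch.randn(1, 2, requires_grad=True)   # the generated sample
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(200):
    probs = torch.softmax(model(x), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    loss = -entropy        # ascend entropy: most uncertain = riskiest input
    opt.zero_grad(); loss.backward(); opt.step()
print("risky sample:", x.detach(), "class probs:", probs.detach())
```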
[346] EEGDM: Learning EEG Representation with Latent Diffusion Model
Shaocong Wang, Tong Liu, Yihan Li, Ming Li, Kairui Wen, Pei Yang, Wenqi Ji, Minjing Yu, Yong-Jin Liu
Main category: cs.LG
TL;DR: EEGDM: A self-supervised EEG representation learning framework using latent diffusion models for signal generation, outperforming masked reconstruction methods by capturing global dynamics and long-range dependencies.
Details
Motivation: Current self-supervised EEG learning relies on masked reconstruction, which focuses on local dependencies but fails to capture global dynamics and long-range dependencies essential for neural activity characterization.Method: EEGDM uses latent diffusion models to generate EEG signals. It incorporates an EEG encoder that distills raw signals and channel augmentations into compact representations, which then condition the diffusion model for progressive denoising from noise to realistic EEG signals.
Result: EEGDM (1) reconstructs high-quality EEG signals, (2) learns robust representations, and (3) achieves competitive performance across diverse downstream tasks, demonstrating superior representation learning capabilities.
Conclusion: EEGDM explores a new direction for self-supervised EEG representation learning by leveraging diffusion-based generation to capture holistic temporal patterns and cross-channel relationships, offering a compact latent space for both generative control and downstream applications.
Abstract: Recent advances in self-supervised learning for EEG representation have largely relied on masked reconstruction, where models are trained to recover randomly masked signal segments. While effective at modeling local dependencies, such objectives are inherently limited in capturing the global dynamics and long-range dependencies essential for characterizing neural activity. To address this limitation, we propose EEGDM, a novel self-supervised framework that leverages latent diffusion models to generate EEG signals as an objective. Unlike masked reconstruction, diffusion-based generation progressively denoises signals from noise to realism, compelling the model to capture holistic temporal patterns and cross-channel relationships. Specifically, EEGDM incorporates an EEG encoder that distills raw signals and their channel augmentations into a compact representation, acting as conditional information to guide the diffusion model for generating EEG signals. This design endows EEGDM with a compact latent space, which not only offers ample control over the generative process but also can be leveraged for downstream tasks. Experimental results show that EEGDM (1) reconstructs high-quality EEG signals, (2) learns robust representations, and (3) achieves competitive performance across diverse downstream tasks, thus exploring a new direction for self-supervised EEG representation learning.
[347] On Agnostic PAC Learning in the Small Error Regime
Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas
Main category: cs.LG
TL;DR: This paper resolves an open question about agnostic learning complexity by providing a computationally efficient learner that matches the lower bound when τ≈d/m, improving upon previous work.
Details
Motivation: The motivation is to address the gap in understanding agnostic learning complexity, particularly when τ (the error of the best hypothesis) is approximately d/m. Previous work left open whether there could be a higher lower bound in this regime.Method: The method involves careful aggregations of ERM (Empirical Risk Minimization) classifiers to create a computationally efficient learner that achieves the desired error bound.
Result: The paper presents a learner achieving error c·τ + O(√(τ(d+log(1/δ))/m) + (d+log(1/δ))/m) with constant c ≤ 2.1, matching the lower bound when τ≈d/m.
Conclusion: This work resolves the open question about agnostic learning complexity in the τ≈d/m regime and makes progress on computational efficiency questions, though leaves open whether the constant can be improved from 2.1 to 1.
Abstract: Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are simply more difficult to learn than realizable distributions – even when one discounts a learner’s error by $\mathrm{err}(h^*_{\mathcal{D}})$, the error of the best hypothesis in $\mathcal{H}$ for $\mathcal{D}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\mathcal{D}$ for which $\tau = \mathrm{err}(h^*_{\mathcal{D}})$ is small.
Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS 24) addresses this shortcoming by including $\tau$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $\tau + \Omega\left(\sqrt{\frac{\tau(d + \log(1/\delta))}{m}} + \frac{d + \log(1/\delta)}{m}\right)$ in a regime where $\tau > d/m$, and leave open the question of whether there may be a higher lower bound when $\tau \approx d/m$, with $d$ denoting $\mathrm{VC}(\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \cdot \tau + O\left(\sqrt{\frac{\tau(d + \log(1/\delta))}{m}} + \frac{d + \log(1/\delta)}{m}\right)$ for a constant $c \leq 2.1$, thus matching the lower bound when $\tau \approx d/m$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS 24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.
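To see why the bounds meet at this boundary, set $\tau \approx d/m$ and treat $\delta$ as a constant: then $c \cdot \tau \approx c \cdot d/m$, the middle term satisfies $\sqrt{\frac{\tau d}{m}} \approx \sqrt{\left(\frac{d}{m}\right)^2} = \frac{d}{m}$, and the additive term is $\frac{d}{m}$, so the upper and lower bounds are both $\Theta(d/m)$ and differ only in the constant $c \leq 2.1$ multiplying $\tau$.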
[348] Preconditioned Inexact Stochastic ADMM for Deep Model
Shenglong Zhou, Ouya Wang, Ziyan Luo, Yongxu Zhu, Geoffrey Ye Li
Main category: cs.LG
TL;DR: PISA is a new preconditioned inexact stochastic ADMM algorithm that converges under minimal assumptions (Lipschitz continuity only) and effectively handles data heterogeneity in distributed training of foundation models.
Details
Motivation: Current stochastic gradient descent-based optimizers for foundation models have limitations: slow convergence, stringent convergence assumptions, and poor performance with data heterogeneity in distributed settings.Method: Developed PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers) that converges under minimal assumptions (Lipschitz continuity only). The algorithm supports scalable parallel computing and various preconditions including second-order information, second moment, and orthogonalized momentum via Newton-Schulz iterations.
Result: Created two computationally efficient variants: SISA (using second moment preconditioning) and NSISA (using orthogonalized momentum via Newton-Schulz). Both demonstrated superior performance across diverse deep models including vision models, LLMs, RL models, GANs, and RNNs compared to state-of-the-art optimizers.
Conclusion: PISA provides a theoretically sound and practically effective optimization framework for foundation models that overcomes limitations of existing methods, particularly in handling data heterogeneity while maintaining computational efficiency and supporting parallel computing.
Abstract: The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers). Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient on a bounded region, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables the proposed algorithm to tackle the challenge of data heterogeneity effectively. Moreover, the algorithmic architecture enables scalable parallel computing and supports various preconditions, such as second-order information, second moment, and orthogonalized momentum by Newton-Schulz iterations. Incorporating the latter two preconditions in PISA yields two computationally efficient variants: SISA and NSISA. Comprehensive experimental evaluations for training or fine-tuning diverse deep models, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate superior numerical performance of SISA and NSISA compared to various state-of-the-art optimizers.
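Of the listed preconditions, orthogonalized momentum is the easiest to sketch. Below is the classic cubic Newton-Schulz iteration with Frobenius pre-scaling; the coefficients and scaling PISA/NSISA actually use are assumptions on our part:

```python
import torch

def newton_schulz_orthogonalize(G, steps=10, eps=1e-7):
    # X <- 1.5 X - 0.5 X X^T X converges to the nearest orthogonal factor
    # of G once all singular values lie in (0, sqrt(3)); dividing by the
    # Frobenius norm guarantees that precondition.
    X = G / (G.norm() + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

M = torch.randn(64, 32)            # e.g. a momentum buffer for one layer
Q = newton_schulz_orthogonalize(M)
print((Q.T @ Q - torch.eye(32)).norm())  # small: columns ~orthonormal
```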
[349] On the Effect of Sampling Diversity in Scaling LLM Inference
Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Weiyang Liu, Haifeng Chen, Xiang Zhang, Wei Cheng
Main category: cs.LG
TL;DR: Diversified sampling improves LLM scaling inference by introducing prompt diversity, reducing error rates in Best-of-N selection across reasoning, math, and code tasks.
Details
Motivation: To systematically study how prompt diversity affects LLM scaling inference, motivated by observed relationship between solution accuracy and meaningful response diversity.Method: Theoretical analysis of why diversified sampling improves Best-of-N scaling, derivation of diversity-fidelity trade-off principle, instantiation of perturbation styles, and systematic empirical evaluation across different contexts.
Result: Diversified sampling yields relative gains of 10.8% in EM@100 for reasoning, 9.6% for mathematics, and 9.5% in Pass@100 for code generation. Works under various conditions but diversity may vanish under majority voting.
Conclusion: Provides systematic theoretical and empirical foundation showing that appropriately applied sampling diversity significantly improves LLM inference-time scaling performance across multiple domains.
Abstract: Large language model (LLM) scaling inference is key to unlocking greater performance, and leveraging diversity has proven an effective way to enhance it. Motivated by the observed relationship between solution accuracy and meaningful response diversity, we systematically study the effect of prompt diversity in scaling inference. We theoretically explain why diversified sampling improves Best-of-N scaling, showing that responses generated from diverse prompts after Best-of-N selection exhibit significantly lower error rates than those produced from stationary prompts. Building on this analysis, we derive a diversity-fidelity trade-off principle that guides the design of sampling strategies introducing diversity. From this guidance, we instantiate a family of effective perturbation styles. We theoretically and empirically characterize when diversified exploration remains effective, demonstrating that it works under a variety of conditions, and we further show that under majority voting, diversity may vanish. Finally, we systematically evaluate how effective sampling diversity is and show that, when applied appropriately in different contexts, it yields relative gains of 10.8% in EM@100 for reasoning, 9.6% for mathematics, and 9.5% in Pass@100 for code generation. Overall, this work provides a systematic analysis that offers a theoretical and empirical foundation for understanding how sampling diversity affects LLM inference-time scaling.
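The Best-of-N mechanism itself is easy to sketch. In the snippet below, `generate` and `score` are hypothetical stubs standing in for an LLM and a verifier/reward model, and the perturbation styles are illustrative rather than the paper's instantiations:

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stub for an LLM call; deterministic per prompt.
    rng = random.Random(prompt)
    return f"answer-{rng.randint(0, 3)}"

def score(answer: str) -> float:
    # Hypothetical stub for a verifier / reward model.
    return random.Random(answer).random()

def best_of_n(question: str, n: int = 8, diversify: bool = True) -> str:
    # Diversified sampling: perturb the prompt per draw instead of drawing
    # n samples from one stationary prompt.
    styles = ["Think step by step.", "Answer concisely.",
              "Explain like a teacher.", "Use a worked example."]
    candidates = []
    for i in range(n):
        suffix = " " + styles[i % len(styles)] if diversify else ""
        candidates.append(generate(question + suffix))
    return max(candidates, key=score)

print(best_of_n("What is 12 * 13?"))
```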
[350] Data-Free Continual Learning of Server Models in Model-Heterogeneous Cloud-Device Collaboration
Xiao Zhang, Zengzhe Chen, Yuan Yuan, Yifei Zou, Fuzhen Zhuang, Wenyu Jiao, Yuke Wang, Dongxiao Yu
Main category: cs.LG
TL;DR: FedDCL is a framework for data-free continual learning in model-heterogeneous federated learning using diffusion models to generate synthetic data and enable knowledge transfer without raw data exchange.
Details
Motivation: Traditional federated learning faces challenges with data heterogeneity, model heterogeneity, catastrophic forgetting, and knowledge misalignment in dynamic cloud-device collaborative computing environments where new data and models continuously emerge.Method: FedDCL leverages pre-trained diffusion models to extract lightweight class-specific prototypes, enabling three data-free capabilities: 1) synthetic data generation for current tasks to address non-IID data, 2) exemplar-free generative replay for knowledge retention from previous tasks, and 3) data-free dynamic knowledge transfer from heterogeneous devices to cloud server.
Result: Experimental results on various datasets demonstrate FedDCL’s effectiveness in enhancing generalizability and practical applicability of federated cloud-device collaboration in dynamic settings.
Conclusion: FedDCL shows potential to address key challenges in federated learning by enabling data-free continual learning in model-heterogeneous settings, improving the viability of privacy-preserving cloud-device collaborative computing.
Abstract: The rise of cloud-device collaborative computing has enabled intelligent services to be delivered across distributed edge devices while leveraging centralized cloud resources. In this paradigm, federated learning (FL) has become a key enabler for privacy-preserving model training without transferring raw data from edge devices to the cloud. However, with the continuous emergence of new data and increasing model diversity, traditional federated learning faces significant challenges, including inherent issues of data heterogeneity, model heterogeneity and catastrophic forgetting, along with the new challenge of knowledge misalignment. In this study, we introduce FedDCL, a novel framework designed to enable data-free continual learning of the server model in a model-heterogeneous federated setting. We leverage pre-trained diffusion models to extract lightweight class-specific prototypes, which confer a threefold data-free advantage, enabling: (1) generation of synthetic data for the current task to augment training and counteract non-IID data distributions; (2) exemplar-free generative replay for retaining knowledge from previous tasks; and (3) data-free dynamic knowledge transfer from heterogeneous devices to the cloud server. Experimental results on various datasets demonstrate the effectiveness of FedDCL, showcasing its potential to enhance the generalizability and practical applicability of federated cloud-device collaboration in dynamic settings.
[351] How to use score-based diffusion in earth system science: A satellite nowcasting example
Randy J. Chase, Katherine Haynes, Lander Ver Hoef, Imme Ebert-Uphoff
Main category: cs.LG
TL;DR: Diffusion models outperform traditional ML for cloud nowcasting, producing sharper forecasts with better cloud generation/decay and ensemble capabilities.
Details
Motivation: Traditional ML methods for earth science applications create blurry forecasts, while diffusion models can produce sharper, more realistic images but are challenging to adapt due to theoretical focus in literature.Method: Applied three diffusion model variants to cloud nowcasting: standard score-based diffusion (Diff), residual correction diffusion (CorrDiff), and latent diffusion model (LDM), using geostationary satellite infrared imagery.
Result: Diffusion models successfully advect, generate, and decay clouds including convective initiation. CorrDiff performed best, outperforming all other diffusion models, conventional U-Net, and persistence. Models also enabled skillful ensemble generation.
Conclusion: Diffusion models provide superior cloud nowcasting with sharper forecasts and ensemble capabilities. The work serves as a practical starting point for applying diffusion models to various earth science applications.
Abstract: Machine learning (ML) is used for many earth science applications; however, traditional ML methods trained with squared errors often create blurry forecasts. Diffusion models are an emerging generative ML technique with the ability to produce sharper, more realistic images by learning the underlying data distribution. Diffusion models are becoming more prevalent, yet adapting them for earth science applications can be challenging because most articles focus on theoretical aspects of the approach, rather than making the method widely accessible. This work illustrates score-based diffusion models with a well-known problem in atmospheric science: cloud nowcasting (zero-to-three-hour forecast). After discussing the background and intuition of score-based diffusion models using examples from geostationary satellite infrared imagery, we experiment with three types of diffusion models: a standard score-based diffusion model (Diff); a residual correction diffusion model (CorrDiff); and a latent diffusion model (LDM). Our results show that the diffusion models not only advect existing clouds, but also generate and decay clouds, including convective initiation. A case study qualitatively shows the preservation of high-resolution features longer into the forecast than a conventional U-Net. The best of the three diffusion models tested was the CorrDiff approach, outperforming all other diffusion models, the conventional U-Net, and persistence. The diffusion models also enable out-of-the-box ensemble generation with skillful calibration. By explaining and exploring diffusion models for a common problem and ending with lessons learned from adapting diffusion models for our task, this work provides a starting point for the community to utilize diffusion models for a variety of earth science applications.
[352] PEAR: Equal Area Weather Forecasting on the Sphere
Hampus Linander, Christoffer Petersson, Daniel Persson, Jan E. Gerken
Main category: cs.LG
TL;DR: PEAR is a transformer-based weather forecasting model that operates directly on HEALPix grid, outperforming equiangular grid models without computational overhead.
Details
Motivation: Existing AI weather models use equiangular grids with uneven pixel density (finer at poles), causing unphysical biases. HEALPix grid provides equal-area pixels, gaining support in meteorology/climate science.Method: Propose PEAR (Pangu Equal ARea), a transformer-based deep learning model that natively operates on HEALPix grid features, eliminating the need for equiangular discretization.
Result: PEAR outperforms corresponding models on equiangular grids without any computational overhead, demonstrating the advantage of HEALPix-based approach.
Conclusion: HEALPix grid provides superior foundation for AI weather forecasting models, enabling better performance while maintaining computational efficiency.
Abstract: Artificial intelligence is rapidly reshaping the natural sciences, with weather forecasting emerging as a flagship AI4Science application where machine learning models can now rival and even surpass traditional numerical simulations. Following the success of the landmark models Pangu Weather and GraphCast, which outperformed traditional numerical methods for global medium-range forecasting, many novel data-driven methods have emerged. A common limitation shared by many of these models is their reliance on an equiangular discretization of the sphere, which suffers from a much finer grid at the poles than around the equator. In contrast, in the Hierarchical Equal Area iso-Latitude Pixelization (HEALPix) of the sphere, each pixel covers the same surface area, removing unphysical biases. Motivated by growing support for this grid in meteorology and climate sciences, we propose to perform weather forecasting with deep learning models which natively operate on the HEALPix grid. To this end, we introduce Pangu Equal ARea (PEAR), a transformer-based weather forecasting model which operates directly on HEALPix features and outperforms the corresponding model on an equiangular grid without any computational overhead.
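The grid property driving PEAR is easy to verify with the standard `healpy` package: every HEALPix pixel covers an identical area, whereas equiangular cells shrink toward the poles (a quick check, not part of the paper's pipeline):

```python
import numpy as np
import healpy as hp  # standard HEALPix library (pip install healpy)

nside = 64
npix = hp.nside2npix(nside)                   # 12 * nside**2 = 49152 pixels
area = hp.nside2pixarea(nside, degrees=True)  # identical for every pixel
print(f"HEALPix: {npix} pixels, each {area:.5f} deg^2")

# Equiangular contrast: a 1-degree lat/lon cell's area scales with cos(lat).
lats = np.deg2rad([0.0, 60.0, 89.0])
dlat = dlon = np.deg2rad(1.0)
cell_deg2 = (180 / np.pi) ** 2 * dlat * dlon * np.cos(lats)
print("1-deg equiangular cells (deg^2):", np.round(cell_deg2, 4))
```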
[353] A Certified Unlearning Approach without Access to Source Data
Umit Yigit Basaran, Sk Miraj Ahmed, Amit Roy-Chowdhury, Basak Guler
Main category: cs.LG
TL;DR: Certified unlearning framework for data removal without access to original training data, using surrogate datasets and controlled noise scaling based on statistical distance.
Details
Motivation: Growing need for data erasure due to privacy regulations, with traditional methods requiring full training dataset access which is often unavailable in practice.Method: Uses surrogate dataset approximating source data statistics, applies controlled noise scaling based on statistical distance between datasets, with theoretical bounds and practical noise calibration techniques.
Result: Effective and reliable data removal validated through experiments on synthetic and real-world datasets, providing strong post-unlearning guarantees while maintaining model utility.
Conclusion: Proposed framework enables practical certified unlearning without original data access, addressing privacy requirements in real-world scenarios where source data is unavailable.
Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model’s behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
[354] The kernel of graph indices for vector search
Mariano Tepper, Ted Willke
Main category: cs.LG
TL;DR: SVG is a new graph index for vector search that uses kernel methods to build navigable graphs in both metric and non-metric spaces, with formal guarantees and bounded out-degree via sparsity constraints.
Details
Motivation: Existing graph indices for vector search rely on Euclidean geometry principles, limiting their formal guarantees to metric spaces. There's a need for graph indices that work in both metric and non-metric spaces (like inner product similarity) with formal navigability guarantees.Method: Introduces Support Vector Graph (SVG) using kernel methods to establish graph connectivity. Proposes SVG-L0 variant with ℓ₀ sparsity constraint for bounded out-degree, replacing heuristic truncation with principled approach. Shows existing indices (HNSW, DiskANN) as special cases of SVG framework.
Result: SVG provides formal navigability guarantees in both metric and non-metric spaces. SVG-L0 achieves bounded out-degree through principled sparsity constraints, has self-tuning properties that avoid heuristic candidate sets, and maintains computational efficiency.
Conclusion: Machine learning can effectively build graph indices for vector search across different similarity spaces. The SVG framework unifies existing approaches and enables derivation of new navigable indices with formal guarantees and practical bounded-degree implementations.
Abstract: The most popular graph indices for vector search use principles from computational geometry to build the graph. Hence, their formal graph navigability guarantees are only valid in Euclidean space. In this work, we show that machine learning can be used to build graph indices for vector search in metric and non-metric vector spaces (e.g., for inner product similarity). From this novel perspective, we introduce the Support Vector Graph (SVG), a new type of graph index that leverages kernel methods to establish the graph connectivity and that comes with formal navigability guarantees valid in metric and non-metric vector spaces. In addition, we interpret the most popular graph indices, including HNSW and DiskANN, as particular specializations of SVG and show that new navigable indices can be derived from the principles behind this specialization. Finally, we propose SVG-L0 that incorporates an $\ell_0$ sparsity constraint into the SVG kernel method to build graphs with a bounded out-degree. This yields a principled way of implementing this practical requirement, in contrast to the traditional heuristic of simply truncating the out edges of each node. Additionally, we show that SVG-L0 has a self-tuning property that avoids the heuristic of using a set of candidates to find the out-edges of each node and that keeps its computational complexity in check.
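Whatever rule builds the edges, query-time routing on such graphs is the same greedy walk. The sketch below uses a brute-force k-NN graph as a stand-in for SVG's kernel-derived connectivity, with inner product as the (non-metric) similarity:

```python
import numpy as np

def greedy_search(graph, vectors, query, entry=0):
    # Beam-1 greedy routing (HNSW/DiskANN-style): hop to whichever
    # neighbour is most similar to the query; stop at a local optimum.
    current = entry
    while True:
        best = max(graph[current], key=lambda j: vectors[j] @ query)
        if vectors[best] @ query <= vectors[current] @ query:
            return current
        current = best

rng = np.random.default_rng(1)
vectors = rng.normal(size=(200, 16))
# Toy index: connect each node to its 8 most similar nodes (inner product).
sims = vectors @ vectors.T
graph = {i: list(np.argsort(-sims[i])[1:9]) for i in range(len(vectors))}
q = rng.normal(size=16)
print("greedy hit:", greedy_search(graph, vectors, q),
      "| true best:", int(np.argmax(vectors @ q)))
```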
[355] Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods
Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal
Main category: cs.LG
TL;DR: Large-scale experimental study shows totally corrective LP-based boosting methods can match or outperform XGBoost/LightGBM with shallow trees while producing sparser ensembles, and can effectively thin pre-trained ensembles.
Details
Motivation: Despite theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention, creating a gap between theory and practical application.Method: Conducted first large-scale experimental study of six LP-based boosting formulations (including two novel methods: NM-Boost and QRLP-Boost) across 20 diverse datasets, evaluating both heuristic and optimal base learners, analyzing accuracy, ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity.
Result: Totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. These methods can also thin pre-trained ensembles without sacrificing performance.
Conclusion: LP-based boosting methods are practically viable, offering competitive performance with sparser ensembles, though limitations exist when using optimal decision trees. The study bridges theory and practice for totally corrective boosting.
Abstract: Despite their theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention. In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. We show that totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. We further show that these methods can thin pre-trained ensembles without sacrificing performance, and we highlight both the strengths and limitations of using optimal decision trees in this context.
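For concreteness, here is one classic totally corrective formulation, a soft-margin LPBoost-style primal, solved with `scipy`; it is only a representative member of the family the paper benchmarks, and the slack penalty `D` below is chosen arbitrarily:

```python
import numpy as np
from scipy.optimize import linprog

def lp_boost_weights(H, y, D=1.0):
    # Maximize the minimum soft margin over convex weights alpha:
    #   min  -rho + D * sum(xi)
    #   s.t. y_i * (H @ alpha)_i >= rho - xi_i,  sum(alpha) = 1,
    #        alpha >= 0, xi >= 0.
    # Variables are packed as [alpha (T), rho (1), xi (n)].
    n, T = H.shape
    c = np.concatenate([np.zeros(T), [-1.0], D * np.ones(n)])
    A_ub = np.hstack([-(y[:, None] * H), np.ones((n, 1)), -np.eye(n)])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(T), [0.0], np.zeros(n)])[None, :]
    bounds = [(0, None)] * T + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds)
    return res.x[:T]

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=40)
# Five weak learners, each agreeing with the label 70% of the time.
H = np.where(rng.random((40, 5)) < 0.7, y[:, None], -y[:, None])
print("ensemble weights:", np.round(lp_boost_weights(H, y), 3))
```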
[356] MolMark: Safeguarding Molecular Structures through Learnable Atom-Level Watermarking
Runwen Hu, Peilin Chen, Keyan Ding, Shiqi Wang
Main category: cs.LG
TL;DR: MolMark is the first deep learning watermarking framework for AI-generated molecules that embeds digital signatures without compromising molecular functionality, maintaining robustness under geometric transformations.
Details
Motivation: AI-generated molecules lack protection mechanisms, making them vulnerable to unauthorized reuse and provenance ambiguity, which undermines scientific reproducibility and intellectual property security.Method: MolMark learns to modulate chemically meaningful atom-level representations using SE(3)-invariant features for geometric robustness, integrates seamlessly with molecular generative models as a learned transformation, and embeds watermarks without structural interference.
Result: Experiments on QM9 and GEOM-DRUG datasets with GeoBFN and GeoLDM models show MolMark can embed 16-bit watermarks while retaining >90% of essential molecular properties, preserving downstream performance, and achieving >95% extraction accuracy under SE(3) transformations.
Conclusion: MolMark establishes a principled pathway for unifying molecular generation with verifiable authorship, supporting trustworthy and accountable AI-driven molecular discovery.
Abstract: AI-driven molecular generation is reshaping drug discovery and materials design, yet the lack of protection mechanisms leaves AI-generated molecules vulnerable to unauthorized reuse and provenance ambiguity. Such a limitation undermines both scientific reproducibility and intellectual property security. To address this challenge, we propose the first deep learning based watermarking framework for molecules (MolMark), which is designed to embed high-fidelity digital signatures into molecules without compromising molecular functionality. MolMark learns to modulate chemically meaningful atom-level representations and to enforce geometric robustness through SE(3)-invariant features, maintaining robustness under rotation, translation, and reflection. Additionally, MolMark integrates seamlessly with AI-based molecular generative models, enabling watermarking to be treated as a learned transformation with minimal interference to molecular structures. Experiments on benchmark datasets (QM9, GEOM-DRUG) and state-of-the-art molecular generative models (GeoBFN, GeoLDM) demonstrate that MolMark can embed 16-bit watermarks while retaining more than 90% of essential molecular properties, preserving downstream performance, and enabling >95% extraction accuracy under SE(3) transformations. MolMark establishes a principled pathway for unifying molecular generation with verifiable authorship, supporting trustworthy and accountable AI-driven molecular discovery.
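The geometric ingredient is simple to illustrate: pairwise interatomic distances are unchanged by rotations, translations, and reflections, so features built on them survive exactly the transformations MolMark must tolerate (a generic sketch, not MolMark's actual feature extractor):

```python
import torch

def invariant_features(coords):
    # Pairwise distance matrix: invariant to rotation, translation,
    # and reflection of the atomic coordinates.
    diff = coords[:, None, :] - coords[None, :, :]
    return diff.norm(dim=-1)

coords = torch.randn(5, 3)                    # 5 atoms in 3D
R = torch.linalg.qr(torch.randn(3, 3)).Q      # random orthogonal transform
moved = coords @ R.T + torch.tensor([1.0, -2.0, 0.5])
print(torch.allclose(invariant_features(coords),
                     invariant_features(moved), atol=1e-5))  # True
```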
[357] Semi-Supervised Preference Optimization with Limited Feedback
Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song
Main category: cs.LG
TL;DR: SSPO enables efficient preference optimization using minimal labeled data by leveraging unpaired samples through principled pseudo-labeling.
Details
Motivation: Current preference optimization methods require substantial paired feedback data, leading to high resource costs. The authors aim to reduce this dependency by developing a semi-supervised approach.Method: SSPO combines a small number of pairwise preference labels with large unpaired samples. The key innovation is proving the existence of an optimal reward threshold that separates winning/losing responses, enabling principled pseudo-labeling of unpaired data.
Result: SSPO achieves remarkable data efficiency - training with just 1% of UltraFeedback data surpasses strong baselines trained on 10% of UltraFeedback. Extensive experiments validate effectiveness across datasets.
Conclusion: SSPO maintains human alignment while drastically reducing data acquisition costs, offering a practical solution for preference optimization with limited labeled data.
Abstract: The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Mistral-7B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
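A toy version of the pseudo-labeling step, assuming scalar rewards are available for every response; the midpoint threshold below is our own stand-in for the optimal separating threshold whose existence the paper proves:

```python
import numpy as np

def pseudo_label(rewards, threshold):
    # Responses scoring above the threshold become pseudo-winners.
    return rewards >= threshold

# Labelled pairs supply reward statistics for winners vs. losers.
winner_rewards = np.array([0.80, 0.90, 0.75])
loser_rewards = np.array([0.20, 0.35, 0.30])
tau = (winner_rewards.mean() + loser_rewards.mean()) / 2  # heuristic choice

unpaired = np.array([0.15, 0.85, 0.55, 0.95, 0.25])       # unlabeled pool
print("threshold:", tau, "| pseudo-winners:", pseudo_label(unpaired, tau))
```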
[358] Dual-Distilled Heterogeneous Federated Learning with Adaptive Margins for Trainable Global Prototypes
Fatema Siddika, Md Anwar Hossen, Wensheng Zhang, Anuj Sharma, Juan Pablo Muñoz, Ali Jannesari
Main category: cs.LG
TL;DR: FedProtoKD addresses prototype margin-shrinking in heterogeneous federated learning by using dual-knowledge distillation with contrastive learning and adaptive prototype margins, achieving significant accuracy improvements over state-of-the-art methods.
Details
Motivation: Standard prototype aggregation in heterogeneous federated learning suffers from decision margin shrinking, degrading performance under model heterogeneity and non-IID data distributions.Method: FedProtoKD uses dual-knowledge distillation combining client logits and prototype features, contrastive learning-based trainable server prototypes with adaptive margins, and importance assessment of public samples based on prototype closeness.
Result: Improved test accuracy by average 1.13% and up to 34.13% across various settings, significantly outperforming existing state-of-the-art HFL methods.
Conclusion: FedProtoKD effectively addresses prototype margin-shrinking in heterogeneous federated learning through enhanced knowledge distillation and adaptive prototype mechanisms, demonstrating superior performance over existing approaches.
Abstract: Heterogeneous Federated Learning (HFL) has gained significant attention for its capacity to handle both model and data heterogeneity across clients. Prototype-based HFL methods emerge as a promising solution to address statistical and model heterogeneity as well as privacy challenges, paving the way for new advancements in HFL research. This method focuses on sharing class-representative prototypes among heterogeneous clients. However, aggregating these prototypes via standard weighted averaging often yields sub-optimal global knowledge. Specifically, the averaging approach induces a shrinking of the aggregated prototypes’ decision margins, thereby degrading model performance in scenarios with model heterogeneity and non-IID data distributions. We propose FedProtoKD for the heterogeneous federated learning setting, utilizing an enhanced dual-knowledge distillation mechanism that improves system performance by leveraging clients’ logits and prototype feature representations. The proposed framework aims to resolve the prototype margin-shrinking problem using a contrastive learning-based trainable server prototype by leveraging a class-wise adaptive prototype margin. Furthermore, the framework assesses the importance of public samples using the closeness of a sample’s prototype to its class-representative prototypes, which enhances learning performance. FedProtoKD improved test accuracy by an average of 1.13% and up to 34.13% across various settings, significantly outperforming existing state-of-the-art HFL methods.
[359] Fine-Tuning Masked Diffusion for Provable Self-Correction
Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham Kakade, Sitan Chen
Main category: cs.LG
TL;DR: PRISM introduces a lightweight, model-agnostic approach for self-correction in Masked Diffusion Models that learns per-token quality scores without RL or verifiers, improving inference across domains.
Details
Motivation: Current approaches for self-correction in Masked Diffusion Models either require architectural/training overhauls or rely on imprecise proxies for token quality, limiting their applicability. There's a need for a lightweight, model-agnostic solution.Method: PRISM (Plug-in Remasking for Inference-time Self-correction of Masked Diffusions) is a lightweight approach that defines a self-correction loss to learn per-token quality scores without RL or verifiers. These scores are computed in the same forward pass as MDM and used to detect low-quality tokens.
Result: PRISM advances MDM inference across multiple domains and scales: Sudoku puzzles, unconditional text generation (170M parameters), and code generation with LLaDA (8B parameters).
Conclusion: PRISM provides an effective, theoretically-grounded approach for self-correction in Masked Diffusion Models that works with any pretrained MDM without requiring architectural changes or complex training procedures.
Abstract: A natural desideratum for generative models is self-correction–detecting and revising low-quality tokens at inference. While Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces, their capacity for self-correction remains poorly understood. Prior attempts to incorporate self-correction into MDMs either require overhauling MDM architectures/training or rely on imprecise proxies for token quality, limiting their applicability. Motivated by this, we introduce PRISM–Plug-in Remasking for Inference-time Self-correction of Masked Diffusions–a lightweight, model-agnostic approach that applies to any pretrained MDM. Theoretically, PRISM defines a self-correction loss that provably learns per-token quality scores, without RL or a verifier. These quality scores are computed in the same forward pass as the MDM and used to detect low-quality tokens. Empirically, PRISM advances MDM inference across domains and scales: Sudoku; unconditional text (170M); and code with LLaDA (8B).
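The plug-in step reduces to "score, remask, re-denoise". The sketch below shows only the remasking given per-token quality scores; the learned quality head and the MDM denoiser are omitted, and `mask_id` and `k` are illustrative:

```python
import torch

def prism_remask(tokens, quality_scores, mask_id, k=2):
    # Remask the k tokens the quality head scores lowest, so the MDM can
    # re-denoise them on the next pass (inference-time self-correction).
    worst = torch.topk(quality_scores, k, largest=False).indices
    corrected = tokens.clone()
    corrected[worst] = mask_id
    return corrected

tokens = torch.tensor([17, 4, 99, 23, 7])
scores = torch.tensor([0.9, 0.2, 0.95, 0.4, 0.8])  # per-token quality
print(prism_remask(tokens, scores, mask_id=0))     # masks positions 1 and 3
```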
[360] A Generic Machine Learning Framework for Radio Frequency Fingerprinting
Alex Hiles, Bashar I. Ahmad
Main category: cs.LG
TL;DR: A generic ML framework for RF fingerprinting that works across different emitter types and supports multiple downstream tasks like specific emitter identification, data association, and emitter clustering.
Details
Motivation: Traditional RF fingerprinting methods are labor-intensive, inflexible, and limited to specific emitter types or transmission schemes. There's a need for more versatile, automated approaches that can handle various emitter types and applications like signal intelligence, electronic surveillance, and physical-layer authentication.Method: A generic and versatile machine learning framework for data-driven RF fingerprinting that is emitter-type agnostic. The framework supports multiple downstream tasks including specific emitter identification (SEI), emitter data association (EDA), and RF emitter clustering (RFEC).
Result: The framework is demonstrated using real RF datasets for spaceborne surveillance, signal intelligence, and counter-drone applications, showing its practical applicability across different domains.
Conclusion: The proposed ML framework offers a flexible, automated solution for RF fingerprinting that outperforms traditional methods and can be applied to various emitter types and practical applications in defense and civilian domains.
Abstract: Fingerprinting radio frequency (RF) emitters typically involves finding unique characteristics that are featured in their received signal. These fingerprints are nuanced, but sufficiently detailed, motivating the pursuit of methods that can successfully extract them. The downstream task that requires the most meticulous RF fingerprinting (RFF) is known as specific emitter identification (SEI), which entails recognising each individual transmitter. RFF and SEI have a long history, with numerous defence and civilian applications such as signal intelligence, electronic surveillance, and physical-layer authentication of wireless devices, to name a few. In recent years, data-driven RFF approaches have become popular due to their ability to automatically learn intricate fingerprints. They generally deliver superior performance when compared to traditional RFF techniques that are often labour-intensive, inflexible, and only applicable to a particular emitter type or transmission scheme. In this paper, we present a generic and versatile machine learning (ML) framework for data-driven RFF with several popular downstream tasks such as SEI, emitter data association (EDA), and RF emitter clustering (RFEC). It is emitter-type agnostic. We then demonstrate the introduced framework for several tasks using real RF datasets for spaceborne surveillance, signal intelligence, and counter-drone applications.
[361] A Second-Order Spiking SSM for Wearables
Kartikay Agrawal, Abhijeet Vikram, Vedant Sharma, Vaishnavi Nagabhushana, Ayon Borthakur
Main category: cs.LG
TL;DR: SHaRe-SSM combines spiking neural networks with state space models for efficient long-sequence processing, outperforming transformers and first-order SSMs while eliminating matrix multiplications for resource-constrained applications.
Details
Motivation: To address the need for energy-efficient, multiplication-free computation for ultra-long sequence modeling, combining the strengths of spiking neural networks (energy efficiency, sparse processing) with state space models (scalable long-range sequence modeling without quadratic dependence on sequence length).Method: Proposes SHaRe-SSM, a second-order spiking state space model using a parallel scan formulation for fast computation over tens of thousands of time steps, and introduces a kernel-based spiking regressor for modeling dependencies in sequences up to 50k steps.
Result: SHaRe-SSM outperforms transformers and first-order SSMs on average, achieves 52.1x less energy consumption than ANN-based second-order SSM, and demonstrates superior long-range modeling capability for sequences up to 50k steps.
Conclusion: SHaRe-SSM provides an energy-efficient, multiplication-free solution for ultra-long sequence processing, making it a strong candidate for resource-constrained devices like wearables by combining the benefits of spiking neural networks and state space models.
Abstract: Spiking neural networks have garnered increasing attention due to their energy efficiency, multiplication-free computation, and sparse event-based processing. In parallel, state space models have emerged as scalable alternatives to transformers for long-range sequence modelling by avoiding quadratic dependence on sequence length. We propose SHaRe-SSM (Spiking Harmonic Resonate-and-Fire State Space Model), a second-order spiking SSM for classification and regression on ultra-long sequences. SHaRe-SSM outperforms transformers and first-order SSMs on average while eliminating matrix multiplications, making it highly suitable for resource-constrained applications. To ensure fast computation over tens of thousands of time steps, we leverage a parallel scan formulation of the underlying dynamical system. Furthermore, we introduce a kernel-based spiking regressor, which enables the accurate modelling of dependencies in sequences of up to 50k steps. Our results demonstrate that SHaRe-SSM achieves superior long-range modelling capability with energy efficiency (52.1x less than an ANN-based second-order SSM), positioning it as a strong candidate for resource-constrained devices such as wearables.
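The parallel-scan trick is generic for linear state updates: writing the recurrence $h_t = A_t h_{t-1} + b_t$ as an associative composition of affine maps lets `jax.lax.associative_scan` compute all states in logarithmic depth. Below is a damped-rotation toy with a resonate-and-fire flavour; SHaRe-SSM's actual spiking dynamics and parameterization are not reproduced:

```python
import jax
import jax.numpy as jnp

def combine(e1, e2):
    # Compose two affine updates h -> A h + b, batched over the time axis.
    A1, b1 = e1
    A2, b2 = e2
    return A2 @ A1, jnp.einsum('...ij,...j->...i', A2, b1) + b2

T, d = 50_000, 2                   # a 2D state gives second-order dynamics
theta, decay = 0.1, 0.99           # damped rotation ~ resonate-and-fire
A = decay * jnp.array([[jnp.cos(theta), -jnp.sin(theta)],
                       [jnp.sin(theta),  jnp.cos(theta)]])
As = jnp.broadcast_to(A, (T, d, d))
bs = 0.01 * jax.random.normal(jax.random.PRNGKey(0), (T, d))
# Cumulative composition; the b-component equals h_t when h_0 = 0.
_, hs = jax.lax.associative_scan(combine, (As, bs))
print(hs.shape)                    # (50000, 2), computed in O(log T) depth
```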
[362] Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Omer Jauhar Khan, Sudais Khan, Hafeez Anwar, Shahzeb Khan, Shams Ul Arifeen
Main category: cs.LG
TL;DR: PINNs with physics constraints predict spaghetti bridge weights, achieving R²=0.9603 and MAE=10.50 using a novel PIKAN architecture and limited data (15 real bridges augmented to 100 samples).
Details
Motivation: Physics Informed Neural Networks (PINNs) can embed physical laws into deep learning models, which is valuable for structural engineering tasks with limited data. The paper explores using PINNs to predict small-scale spaghetti bridge weights to understand load limits and failure modes in simplified structural models.Method: Proposes a framework incorporating physics-based constraints into prediction models. Introduces a novel Physics Informed Kolmogorov Arnold Network (PIKAN) architecture that blends universal function approximation theory with physical insights. Uses structural parameters collected manually or via computer vision. Dataset includes 15 real bridges augmented to 100 samples.
Result: Best model achieves R² score of 0.9603 and mean absolute error (MAE) of 10.50 units. Also provides a web-based interface for parameter entry and prediction. Demonstrates PINNs can offer reliable weight estimates even with limited data.
Conclusion: PINNs can provide reliable structural weight estimates with limited data and may help inform early-stage failure analysis in lightweight bridge designs. The approach shows promise for structural engineering applications where data is scarce.
Abstract: Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small-scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints into the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From an applied perspective, we also provide a web-based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early-stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
[363] Training Deep Physics-Informed Kolmogorov-Arnold Networks
Spyros Rigas, Fotios Anagnostopoulos, Michalis Papachristou, Georgios Alexandridis
Main category: cs.LG
TL;DR: Proposed RGA KANs with improved initialization overcome training instability in deep physics-informed KANs, outperforming baselines by orders of magnitude on PDE benchmarks.
Details
Motivation: Deep Chebyshev-based physics-informed KANs (cPIKANs) suffer from training instabilities when scaled to depth, limiting their applicability to PDE problems, similar to issues faced by multilayer perceptron-based models.Method: 1) Basis-agnostic Glorot-like initialization scheme preserving activation variance; 2) Residual-Gated Adaptive KANs (RGA KANs) architecture inspired by PirateNet to mitigate divergence in deep cPIKANs.
Result: RGA KANs consistently outperform parameter-matched cPIKANs and PirateNets by several orders of magnitude on nine standard forward PDE benchmarks, remaining stable where others diverge, and successfully traverse all training phases unlike baseline cPIKANs.
Conclusion: The proposed initialization scheme and RGA KAN architecture effectively address training instability in deep physics-informed KANs, enabling stable and accurate solutions to PDE problems where previous methods fail.
Abstract: Since their introduction, Kolmogorov-Arnold Networks (KANs) have been successfully applied across several domains, with physics-informed machine learning (PIML) emerging as one of the areas where they have thrived. In the PIML setting, Chebyshev-based physics-informed KANs (cPIKANs) have become the standard due to their computational efficiency. However, like their multilayer perceptron-based counterparts, cPIKANs face significant challenges when scaled to depth, leading to training instabilities that limit their applicability to several PDE problems. To address this, we propose a basis-agnostic, Glorot-like initialization scheme that preserves activation variance and yields substantial improvements in stability and accuracy over the default initialization of cPIKANs. Inspired by the PirateNet architecture, we further introduce Residual-Gated Adaptive KANs (RGA KANs), designed to mitigate divergence in deep cPIKANs where initialization alone is not sufficient. Through empirical tests and information bottleneck analysis, we show that RGA KANs successfully traverse all training phases, unlike baseline cPIKANs, which stagnate in the diffusion phase in specific PDE settings. Evaluations on nine standard forward PDE benchmarks under a fixed training pipeline with adaptive components demonstrate that RGA KANs consistently outperform parameter-matched cPIKANs and PirateNets - often by several orders of magnitude - while remaining stable in settings where the others diverge.
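In the same spirit, a Glorot-like rule for KAN coefficients rescales their variance by the effective fan-in and fan-out, counting one basis function per degree; the exact scaling below is our assumption, while the paper derives the actual basis-agnostic scheme:

```python
import torch

def glorot_like_kan_init(fan_in, fan_out, degree):
    # Treat each of the (degree + 1) basis functions on every edge as
    # contributing to pre-activation variance, then scale as in Glorot.
    # Hypothetical rule for illustration only.
    std = (2.0 / ((fan_in + fan_out) * (degree + 1))) ** 0.5
    return torch.randn(fan_out, fan_in, degree + 1) * std

W = glorot_like_kan_init(fan_in=64, fan_out=64, degree=5)
print(float(W.std()))   # ~ (2 / (128 * 6)) ** 0.5 ~ 0.051
```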
[364] Towards Causal Market Simulators
Dennis Thumm, Luis Ontaneda Mijares
Main category: cs.LG
TL;DR: TNCM-VAE combines VAE with structural causal models to generate counterfactual financial time series with causal reasoning capabilities for stress testing and scenario analysis.
Details
Motivation: Existing market generators using deep generative models lack causal reasoning capabilities essential for counterfactual analysis and risk assessment in financial applications.Method: Time-series Neural Causal Model VAE (TNCM-VAE) combines variational autoencoders with structural causal models, enforces causal constraints through DAGs in decoder architecture, and uses causal Wasserstein distance for training.
Result: Superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth on synthetic autoregressive models inspired by Ornstein-Uhlenbeck process.
Conclusion: The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.
Abstract: Market generators using deep generative models have shown promise for synthetic financial data generation, but existing approaches lack causal reasoning capabilities essential for counterfactual analysis and risk assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that combines variational autoencoders with structural causal models to generate counterfactual financial time series while preserving both temporal dependencies and causal relationships. Our approach enforces causal constraints through directed acyclic graphs in the decoder architecture and employs the causal Wasserstein distance for training. We validate our method on synthetic autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth. The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.
[365] Incremental Generation is Necessary and Sufficient for Universality in Flow-Based Modelling
Hossein Rouhvarzi, Anastasis Kratsios
Main category: cs.LG
TL;DR: Incremental flow-based denoising models require multiple steps for universal approximation of orientation-preserving homeomorphisms on [0,1]^d, with single-step flows being insufficient but compositions of K_d flows achieving O(n^{-1/d}) approximation rates.
Details
Motivation: To establish rigorous approximation-theoretic foundations for incremental flow-based denoising models, determining when incremental generation is necessary and sufficient for universal flow-based generation on natural function classes.Method: Using topological-dynamical arguments to prove impossibility of single-step autonomous flows, then exploiting algebraic properties to show compositions of K_d flows can approximate orientation-preserving Lipschitz homeomorphisms with dimension-dependent rates, and using linear lifting for structured universal approximation.
Result: Single-step flows are meagre and not universal, but compositions of at most K_d flows achieve O(n^{-1/d}) approximation rates for Lipschitz homeomorphisms, with dimension-free rates under smoothness assumptions, plus structured universal approximation for continuous functions and probability measures.
Conclusion: Incremental generation is both necessary and sufficient for universal flow-based generation on orientation-preserving homeomorphisms of [0,1]^d, providing rigorous theoretical justification for multi-step denoising models.
Abstract: Incremental flow-based denoising models have reshaped generative modelling, but their empirical advantage still lacks a rigorous approximation-theoretic foundation. We show that incremental generation is necessary and sufficient for universal flow-based generation on the largest natural class of self-maps of $[0,1]^d$ compatible with denoising pipelines, namely the orientation-preserving homeomorphisms of $[0,1]^d$. All our guarantees are uniform on the underlying maps and hence imply approximation both samplewise and in distribution. Using a new topological-dynamical argument, we first prove an impossibility theorem: the class of all single-step autonomous flows, independently of the architecture, width, depth, or Lipschitz activation of the underlying neural network, is meagre and therefore not universal in the space of orientation-preserving homeomorphisms of $[0,1]^d$. By exploiting algebraic properties of autonomous flows, we conversely show that every orientation-preserving Lipschitz homeomorphism on $[0,1]^d$ can be approximated at rate $O(n^{-1/d})$ by a composition of at most $K_d$ such flows, where $K_d$ depends only on the dimension. Under additional smoothness assumptions, the approximation rate can be made dimension-free, and $K_d$ can be chosen uniformly over the class being approximated. Finally, by linearly lifting the domain into one higher dimension, we obtain structured universal approximation results for continuous functions and for probability measures on $[0,1]^d$, the latter realized as pushforwards of empirical measures with vanishing $1$-Wasserstein error.
[366] Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs
Shasha Zhou, Mingyu Huang, Jack Cole, Charles Britton, Ming Yin, Jan Wolber, Ke Li
Main category: cs.LG
TL;DR: FAITH framework uses medical knowledge graphs for automated factuality evaluation of LLM responses without reference answers, achieving high correlation with clinician judgments.
Details
Motivation: LLMs show strong medical capabilities but require rigorous validation for safe healthcare deployment; need automated factuality assessment methods.Method: FAITH framework decomposes LLM responses into atomic claims, links them to medical knowledge graphs, and scores based on evidence paths without needing reference answers.
Result: KG-based evaluation achieves high correlation with clinician judgments, effectively distinguishes LLM capabilities, is robust to textual variances, and provides explainable scoring.
Conclusion: While limitations exist, leveraging knowledge graphs is a promising direction for automated factuality assessment in healthcare LLM applications.
Abstract: The recent proliferation of large language models (LLMs) holds the potential to revolutionize healthcare, with strong capabilities in diverse medical tasks. Yet, deploying LLMs in high-stakes healthcare settings requires rigorous verification and validation to understand any potential harm. This paper investigates the reliability and viability of using medical knowledge graphs (KGs) for the automated factuality evaluation of LLM-generated responses. To ground this investigation, we introduce FAITH, a framework designed to systematically probe the strengths and limitations of this KG-based approach. FAITH operates without reference answers by decomposing responses into atomic claims, linking them to a medical KG, and scoring them based on evidence paths. Experiments on diverse medical tasks with human subjective evaluations demonstrate that KG-grounded evaluation achieves considerably higher correlations with clinician judgments and can effectively distinguish LLMs with varying capabilities. It is also robust to textual variances. The inherent explainability of its scoring can further help users understand and mitigate the limitations of current LLMs. We conclude that while limitations exist, leveraging KGs is a prominent direction for automated factuality assessment in healthcare.
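A toy version of evidence-path scoring over a knowledge graph, assuming a simple inverse-hop-count credit rule: a claim linking two entities earns full credit for a direct edge, decaying credit for longer paths, and none if the entities are unreachable. FAITH's actual claim decomposition, entity linking, and scoring are richer than this networkx sketch, and the medical triples below are invented examples.

```python
import networkx as nx

def score_claim(kg: nx.Graph, subject: str, obj: str, max_hops: int = 2) -> float:
    """Score a (subject, object) claim by how directly the KG connects the
    two entities; a hypothetical stand-in for FAITH's evidence-path scoring."""
    if subject not in kg or obj not in kg or not nx.has_path(kg, subject, obj):
        return 0.0
    hops = nx.shortest_path_length(kg, subject, obj)
    if hops == 0:          # trivially true: subject and object coincide
        return 1.0
    return 0.0 if hops > max_hops else 1.0 / hops

kg = nx.Graph()
kg.add_edges_from([("metformin", "type 2 diabetes"),
                   ("type 2 diabetes", "hyperglycemia")])
print(score_claim(kg, "metformin", "type 2 diabetes"))  # 1.0 (direct edge)
print(score_claim(kg, "metformin", "hyperglycemia"))    # 0.5 (two hops)
```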
[367] Towards Reproducibility in Predictive Process Mining: SPICE – A Deep Learning Library
Oliver Stritzel, Nick Hühnerbein, Simon Rauch, Itzel Zarate, Lukas Fleischmann, Moike Buck, Attila Lischka, Christian Frey
Main category: cs.LG
TL;DR: SPICE is a Python framework that reimplements three popular deep-learning-based Predictive Process Mining methods in PyTorch with a common base framework for reproducible and robust benchmarking.
Details
Motivation: Existing Predictive Process Mining approaches lack reproducibility, transparency, usability for new datasets, and benchmarking capabilities, making comparisons between different implementations difficult.Method: Developed SPICE framework that reimplements three baseline deep-learning methods for PPM in PyTorch with a common base framework featuring rigorous configurability to enable reproducible comparisons.
Result: Compared SPICE to originally reported metrics and with fair metrics on 11 datasets, demonstrating its effectiveness for benchmarking.
Conclusion: SPICE provides a standardized framework for reproducible and robust comparison of Predictive Process Mining approaches, addressing current limitations in the field.
Abstract: In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches often lack reproducibility, transparency in decision making, and usability for incorporating novel datasets and benchmarking, making comparisons among different implementations very difficult. In this paper, we propose SPICE, a Python framework that reimplements three popular, existing baseline deep-learning-based methods for PPM in PyTorch, while designing a common base framework with rigorous configurability to enable reproducible and robust comparison of past and future modelling approaches. We compare SPICE to originally reported metrics and with fair metrics on 11 datasets.
[368] Look-Ahead Reasoning on Learning Platforms
Haiqing Zhu, Tijana Zrnic, Celestine Mendler-Dünner
Main category: cs.LG
TL;DR: Users engage in strategic behavior on learning platforms, considering both individual (level-k thinking) and collective reasoning approaches, with coordination offering benefits but requiring alignment between learner and user utilities.
Details
Motivation: Learning platforms often optimize for designer priorities rather than user needs, leading to strategic user behavior. Existing work focuses on individual responses to deployed models without considering how user actions are coupled and impact future predictions at scale.Method: Formalizes level-k thinking (individual strategic reasoning) and collective reasoning (coordinated user actions). Analyzes convergence properties of level-k thinking and characterizes benefits/limits of coordination through utility alignment analysis.
Result: Level-k thinking accelerates convergence to equilibrium but yields no long-term individual benefit. Collective reasoning offers coordination benefits but requires alignment between learner and user utilities. Provides first results on utility trade-offs of coordination in algorithmic systems.
Conclusion: Look-ahead reasoning generalizes algorithmic collective action. User coordination can be beneficial but depends on utility alignment. Strategic behavior analysis should consider coupled user actions and their impact on future predictions.
Abstract: On many learning platforms, the optimization criteria guiding model training reflect the priorities of the designer rather than those of the individuals they affect. Consequently, users may act strategically to obtain more favorable outcomes. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to a deployed model, without considering the behavior of other users. In contrast, look-ahead reasoning takes into account that user actions are coupled, and – at scale – impact future predictions. Within this framework, we first formalize level-k thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their joint impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner’s and the users’ utilities emerges as a key concept. Look-ahead reasoning can be seen as a generalization of algorithmic collective action; we thus offer the first results characterizing the utility trade-offs of coordination when contesting algorithmic systems.
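Level-k thinking is easiest to see in the classic p-beauty contest from behavioral economics, which the paper's formalization draws on. The sketch below is that textbook game, not the paper's learning-platform model, but it shows the same qualitative effect: deeper reasoning converges faster toward the equilibrium without changing the equilibrium itself.

```python
def level_k_guess(k: int, level0: float = 50.0, factor: float = 2 / 3) -> float:
    """Level-k guess in the p-beauty contest: level-0 players guess 50 on
    average; a level-k player best-responds to a population of level-(k-1)
    players by multiplying their guess by the factor."""
    guess = level0
    for _ in range(k):
        guess *= factor
    return guess

# Deeper reasoning moves play toward the Nash equilibrium at 0, but the
# equilibrium never changes, mirroring the paper's conclusion that level-k
# users gain no long-run individual benefit.
for k in range(5):
    print(k, round(level_k_guess(k), 2))  # 50.0, 33.33, 22.22, 14.81, 9.88
```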
[369] Deep Gaussian Process Proximal Policy Optimization
Matthijs van der Lende, Juan Cardenas-Cartagena
Main category: cs.LG
TL;DR: GPPO is a scalable RL algorithm using Deep Gaussian Processes for uncertainty estimation in continuous control tasks, maintaining PPO-level performance while providing calibrated uncertainty for safer exploration.
Details
Motivation: Deep neural networks in RL often lack calibrated uncertainty estimates, which is critical for balancing safe exploration and efficient learning in control tasks.Method: Introduces Deep Gaussian Process Proximal Policy Optimization (GPPO), a model-free actor-critic algorithm that leverages Deep Gaussian Processes to approximate both policy and value functions.
Result: GPPO maintains competitive performance with Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates.
Conclusion: GPPO enables safer and more effective exploration in RL through calibrated uncertainty estimation without sacrificing performance on standard benchmarks.
Abstract: Uncertainty estimation for Reinforcement Learning (RL) is a critical component in control tasks where agents must balance safe exploration and efficient learning. While deep neural networks have enabled breakthroughs in RL, they often lack calibrated uncertainty estimates. We introduce Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that leverages Deep Gaussian Processes (DGPs) to approximate both the policy and value function. GPPO maintains competitive performance with respect to Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.
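For reference, here is the standard PPO clipped surrogate objective that a GPPO-style learner also optimizes. The deep-Gaussian-process policy and value approximators that distinguish GPPO are not shown; this is only the shared PPO scaffolding, with toy tensors standing in for rollout data.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate objective, returned as a loss to
    minimize. GPPO keeps this objective and swaps the neural policy and
    value networks for deep Gaussian processes (not shown here)."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
print(ppo_clip_loss(logp_new, logp_old, torch.randn(8)))
```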
[370] Spectral Concentration at the Edge of Stability: Information Geometry of Kernel Associative Memory
Akira Tamamori
Main category: cs.LG
TL;DR: The paper reveals that the “Ridge of Optimization” in high-capacity kernel Hopfield networks corresponds to the Edge of Stability where the Fisher Information Matrix becomes singular, unifying learning dynamics and capacity through geometric principles.
Details
Motivation: To understand the origin of the "Ridge of Optimization" phenomenon in high-capacity kernel Hopfield networks, which exhibits extreme stability and was previously linked to Spectral Concentration but remained unexplained.Method: Analyze network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability where the Fisher Information Matrix becomes singular. Show that apparent Euclidean force antagonism manifests as Dual Equilibrium in Riemannian space.
Result: Demonstrates that the Ridge of Optimization is equivalent to the Edge of Stability boundary, providing a geometric interpretation where the Fisher Information Matrix singularity explains the extreme stability observed in these networks.
Conclusion: Unifies learning dynamics and capacity through the Minimum Description Length principle, offering a geometric theory of self-organized criticality that explains the origin of the Ridge phenomenon in kernel Hopfield networks.
Abstract: High-capacity kernel Hopfield networks exhibit a “Ridge of Optimization” characterized by extreme stability. While previously linked to “Spectral Concentration”, its origin remains elusive. Here, we analyze the network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability, a critical boundary where the Fisher Information Matrix becomes singular. We demonstrate that the apparent Euclidean force antagonism is a manifestation of “Dual Equilibrium” in the Riemannian space. This unifies learning dynamics and capacity via the Minimum Description Length principle, offering a geometric theory of self-organized criticality.
[371] xGR: Efficient Generative Recommendation Serving at Scale
Qingxiao Sun, Tongxuan Liu, Shen Zhang, Siyu Wu, Peijun Yang, Haotian Liang, Menxin Li, Xiaolong Ma, Zhiwei Liang, Ziyi Ren, Minchao Zhang, Xinyu Liu, Ke Zhang, Depei Qian, Hailong Yang
Main category: cs.LG
TL;DR: xGR is a serving system for generative recommendation that optimizes LLM-based recommendation workloads to achieve high throughput under strict latency constraints.
Details
Motivation: Generative recommendation (GR) using LLMs has different workload characteristics from traditional LLM serving - it processes long prompts but produces short, fixed-length outputs, with high computational costs in decode phases due to large beam width and time-consuming sorting overhead from vast item spaces.Method: xGR uses three key techniques: 1) unifies prefill and decode phases through staged computation and separated KV cache, 2) enables early sorting termination and mask-based item filtering with data structure reuse, and 3) reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism.
Result: Experiments with real-world recommendation service datasets show xGR achieves at least 3.49x throughput compared to state-of-the-art baselines under strict latency constraints.
Conclusion: xGR effectively addresses the unique challenges of generative recommendation serving, providing a specialized system that meets low-latency requirements in high-concurrency scenarios.
Abstract: Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR’s workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since the beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x throughput compared to the state-of-the-art baseline under strict latency constraints.
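A simplified sketch of the mask-based filtering and early-termination idea: selecting the k best beam candidates from a vast item space with a bounded min-heap avoids sorting the full space (O(n log k) instead of O(n log n)). This is a generic top-k routine on made-up scores, not xGR's actual implementation.

```python
import heapq
import random

def topk_with_mask(scores, mask, k):
    """Keep only the k best items in a bounded min-heap while skipping
    masked-out items, so the full item space is never sorted; a simplified
    stand-in for xGR's item filtering and early sorting termination."""
    heap = []  # min-heap of at most k (score, item) pairs
    for item, s in enumerate(scores):
        if not mask[item]:
            continue                       # item filtered out by the mask
        if len(heap) < k:
            heapq.heappush(heap, (s, item))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, item))
    return sorted(heap, reverse=True)      # only k elements left to sort

random.seed(0)
scores = [random.random() for _ in range(1_000_000)]  # vast item space
mask = [i % 3 != 0 for i in range(len(scores))]       # invalid items masked
print(topk_with_mask(scores, mask, k=4))
```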
[372] Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset
Atalay Denknalbant, Emre Sezdi, Zeki Furkan Kutlu, Polat Goktas
Main category: cs.LG
TL;DR: This paper demonstrates that alternative behavioral data (phone usage, shopping patterns, etc.) can effectively evaluate credit risk for underbanked individuals in Istanbul, achieving near-bureau-level discrimination power without formal credit records.
Details
Motivation: Financial exclusion limits entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often lack formal credit bureau files because their earnings flow through informal channels, making traditional credit evaluation impossible.Method: Created a synthetic dataset of 100,000 Istanbul residents reproducing 2025 Q1 census marginals and telecom patterns using retrieval-augmented generation with OpenAI o3. Each profile contains 7 socio-demographic variables and 9 alternative attributes (phone specs, shopping rhythm, subscription spend, car ownership, rent, credit card flag). Tested CatBoost, LightGBM, and XGBoost models in two versions: demo (socio-demographic only) and full (including alternative attributes).
Result: Alternative data raised AUC by about 1.3 percentage points and boosted balanced F1 from roughly 0.84 to 0.95 (14% gain). The concise set of behavioral attributes approached bureau-level discrimination power for borrowers lacking formal credit records.
Conclusion: The study provides an open synthetic dataset, reproducible modeling pipeline, and evidence that behavioral attributes can effectively evaluate underbanked borrowers. This gives lenders and regulators a transparent blueprint for extending fair and safe credit access to financially excluded populations.
Abstract: Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first quarter 2025 TÜİK census marginals and telecom usage patterns. Retrieval augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio demographic variables; Full models include both socio demographic and alternative attributes. Across five fold stratified validation the alternative block raises area under the curve by about one point three percentage points and lifts balanced $F_1$ from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.
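The demo-versus-full comparison can be mimicked on synthetic data: train one model on the socio-demographic block alone and another on its concatenation with the alternative block, then compare AUC. sklearn's GradientBoostingClassifier stands in for the paper's CatBoost/LightGBM/XGBoost models, and the random data below is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
demo = rng.normal(size=(n, 7))         # socio-demographic block (7 variables)
alt = rng.normal(size=(n, 9))          # alternative behavioural block (9 variables)
logit = demo[:, 0] + 2.0 * alt[:, 0]   # assume default risk depends on both blocks
y = (logit + rng.normal(size=n) > 0).astype(int)

for name, X in [("demo", demo), ("full", np.hstack([demo, alt]))]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
    print(name, round(roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]), 3))
```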
cs.MA
[373] On the Role of Contextual Information and Ego States in LLM Agent Behavior for Transactional Analysis Dialogues
Monika Zamojska, Jarosław A. Chudziak
Main category: cs.MA
TL;DR: This paper proposes a Transactional Analysis-inspired multi-agent system where agents have Parent/Adult/Child ego states with information retrieval, enhancing psychological realism in LLM-based simulations.
Details
Motivation: Current LLM agents lack psychological depth and consistency needed to model human thinking patterns, missing deeper goals, emotional conflicts, and motivations that drive real human interactions in social, political, and psychological research contexts.Method: A Multi-Agent System (MAS) inspired by Transactional Analysis (TA) theory, where each agent is divided into three ego states (Parent, Adult, Child) as separate knowledge structures with their own perspectives and reasoning styles, enhanced with information retrieval mechanisms to access relevant contextual information from vector stores.
Result: The architecture was evaluated through ablation tests in a simulated dialogue scenario, comparing agents with and without information retrieval. Results are promising and open new directions for exploring psychologically grounded structures in agent behavior.
Conclusion: The paper contributes an agent architecture that integrates Transactional Analysis theory with contextual information retrieval to enhance the realism of LLM-based multi-agent simulations, addressing current limitations in psychological depth and consistency.
Abstract: LLM-powered agents are now used in many areas, from customer support to education, and there is increasing interest in their ability to act more like humans. This includes fields such as social, political, and psychological research, where the goal is to model group dynamics and social behavior. However, current LLM agents often lack the psychological depth and consistency needed to capture the real patterns of human thinking. They usually provide direct or statistically likely answers, but they miss the deeper goals, emotional conflicts, and motivations that drive real human interactions. This paper proposes a Multi-Agent System (MAS) inspired by Transactional Analysis (TA) theory. In the proposed system, each agent is divided into three ego states - Parent, Adult, and Child. The ego states are treated as separate knowledge structures with their own perspectives and reasoning styles. To enrich their response process, they have access to an information retrieval mechanism that allows them to retrieve relevant contextual information from their vector stores. This architecture is evaluated through ablation tests in a simulated dialogue scenario, comparing agents with and without information retrieval. The results are promising and open up new directions for exploring how psychologically grounded structures can enrich agent behavior. The contribution is an agent architecture that integrates Transactional Analysis theory with contextual information retrieval to enhance the realism of LLM-based multi-agent simulations.
[374] MAPPO-LCR: Multi-Agent Policy Optimization with Local Cooperation Reward in Spatial Public Goods Games
Zhaoqilin Yang, Axin Xiang, Kedi Yang, Tianjun Liu, Youliang Tian
Main category: cs.MA
TL;DR: MAPPO-LCR uses centralized critic with local cooperation rewards to solve spatial public goods games where traditional methods fail due to payoff coupling and non-stationarity.
Details
Motivation: Existing evolutionary and reinforcement learning methods struggle with payoff coupling and non-stationarity in spatial public goods games with large interacting populations. Traditional PPO ignores the intrinsic coupling of individual returns through overlapping group interactions.Method: Introduces Multi-Agent PPO (MAPPO) with centralized critic for joint strategy evaluation, plus MAPPO-LCR variant that adds local cooperation reward aligning policy updates with surrounding cooperative density without changing game structure. Enables decentralized execution with population-level value estimation during training.
Result: Extensive simulations show stable cooperation emergence and reliable convergence across enhancement factors. Statistical analyses confirm MAPPO’s learning advantage over PPO in spatial public goods games.
Conclusion: MAPPO-LCR successfully addresses coupling limitations in spatial public goods games through centralized critic and local cooperation rewards, enabling effective cooperation emergence while preserving decentralized execution.
Abstract: Spatial public goods games model collective dilemmas where individual payoffs depend on population-level strategy configurations. Most existing studies rely on evolutionary update rules or value-based reinforcement learning methods. These approaches struggle to represent payoff coupling and non-stationarity in large interacting populations. This work introduces Multi-Agent Proximal Policy Optimization (MAPPO) into spatial public goods games for the first time. In these games, individual returns are intrinsically coupled through overlapping group interactions. Proximal Policy Optimization (PPO) treats agents as independent learners and ignores this coupling during value estimation. MAPPO addresses this limitation through a centralized critic that evaluates joint strategy configurations. To study neighborhood-level cooperation signals under this framework, we propose MAPPO with Local Cooperation Reward, termed MAPPO-LCR. The local cooperation reward aligns policy updates with surrounding cooperative density without altering the original game structure. MAPPO-LCR preserves decentralized execution while enabling population-level value estimation during training. Extensive simulations demonstrate stable cooperation emergence and reliable convergence across enhancement factors. Statistical analyses further confirm the learning advantage of MAPPO over PPO in spatial public goods games.
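A minimal sketch of a local cooperation reward on a lattice, assuming a von Neumann neighbourhood with periodic boundaries and an arbitrary shaping weight; the paper's exact reward definition and neighbourhood may differ.

```python
import numpy as np

def local_cooperation_reward(strategies: np.ndarray, weight: float = 0.1) -> np.ndarray:
    """Per-agent reward shaped by the cooperative density around each
    lattice site (1 = cooperate, 0 = defect), using the four von Neumann
    neighbours with periodic boundaries. Weight and neighbourhood are
    assumptions about MAPPO-LCR's local cooperation reward."""
    s = strategies.astype(float)
    neighbours = (np.roll(s, 1, axis=0) + np.roll(s, -1, axis=0) +
                  np.roll(s, 1, axis=1) + np.roll(s, -1, axis=1))
    return weight * neighbours / 4.0   # scaled cooperative density in [0, weight]

grid = (np.random.default_rng(0).random((5, 5)) < 0.5).astype(int)
print(local_cooperation_reward(grid))
```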
[375] Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems
Abhivansh Gupta
Main category: cs.MA
TL;DR: Proposes a Verifiability-First architecture with cryptographic attestations and Audit Agents for LLM-based agents, plus OPERA benchmark to measure misalignment detection speed and reliability rather than just likelihood.
Details
Motivation: As LLM-based agents become more autonomous and multi-modal, ensuring they remain controllable, auditable, and faithful to deployer intent becomes critical. Prior work showed agent personalities and tool access significantly influence misalignment.Method: Verifiability-First architecture with: (1) run-time attestations using cryptographic and symbolic methods, (2) lightweight Audit Agents for continuous verification of intent vs behavior using constrained reasoning, (3) challenge-response attestation protocols for high-risk operations. Plus OPERA benchmark suite to measure detectability, detection time, and resilience.
Result: Introduces OPERA (Observability, Provable Execution, Red-team, Attestation) benchmark suite and evaluation protocol designed to measure: (i) detectability of misalignment, (ii) time to detection under stealthy strategies, (iii) resilience of verifiability mechanisms to adversarial prompt and persona injection.
Conclusion: Shifts evaluation focus from measuring how likely misalignment is to measuring how quickly and reliably misalignment can be detected and remediated, emphasizing proactive verifiability over passive measurement.
Abstract: As LLM-based agents grow more autonomous and multi-modal, ensuring they remain controllable, auditable, and faithful to deployer intent becomes critical. Prior benchmarks measured the propensity for misaligned behavior and showed that agent personalities and tool access significantly influence misalignment. Building on these insights, we propose a Verifiability-First architecture that (1) integrates run-time attestations of agent actions using cryptographic and symbolic methods, (2) embeds lightweight Audit Agents that continuously verify intent versus behavior using constrained reasoning, and (3) enforces challenge-response attestation protocols for high-risk operations. We introduce OPERA (Observability, Provable Execution, Red-team, Attestation), a benchmark suite and evaluation protocol designed to measure (i) detectability of misalignment, (ii) time to detection under stealthy strategies, and (iii) resilience of verifiability mechanisms to adversarial prompt and persona injection. Our approach shifts the evaluation focus from how likely misalignment is to how quickly and reliably misalignment can be detected and remediated.
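One lightweight way to realize run-time attestation is a hash-chained action log: each entry commits to the digest of its predecessor, so tampering with history invalidates every later digest. The sketch below illustrates only this generic pattern; the paper's actual cryptographic and symbolic attestation protocols are not specified in this summary.

```python
import hashlib
import json

def append_action(log: list, action: dict) -> dict:
    """Append an agent action to a hash-chained audit log."""
    prev = log[-1]["digest"] if log else "0" * 64
    payload = json.dumps({"action": action, "prev": prev}, sort_keys=True)
    entry = {"action": action, "prev": prev,
             "digest": hashlib.sha256(payload.encode()).hexdigest()}
    log.append(entry)
    return entry

def verify(log: list) -> bool:
    """Recompute the chain; any edit to an earlier entry breaks verification."""
    prev = "0" * 64
    for e in log:
        payload = json.dumps({"action": e["action"], "prev": prev}, sort_keys=True)
        if e["prev"] != prev or hashlib.sha256(payload.encode()).hexdigest() != e["digest"]:
            return False
        prev = e["digest"]
    return True

log = []
append_action(log, {"tool": "search", "query": "quarterly report"})
append_action(log, {"tool": "email", "to": "auditor@example.com"})
print(verify(log))                    # True
log[0]["action"]["tool"] = "delete"   # tamper with history
print(verify(log))                    # False
```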
[376] Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems
Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao
Main category: cs.MA
TL;DR: A coordination framework for LLM agents that enables adaptiveness through dynamic task routing, bidirectional feedback, and parallel agent evaluation with competition.
Details
Motivation: Existing multi-agent frameworks rely on static workflows, fixed roles, and limited communication, reducing effectiveness in open-ended, high-complexity domains.Method: Proposes a coordination framework with three core mechanisms: dynamic task routing (reallocating tasks based on confidence/workload), bidirectional feedback (structured critiques for iterative improvement), and parallel agent evaluation (competition on ambiguous subtasks with evaluator-driven selection).
Result: Demonstrates substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines.
Conclusion: Highlights the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.
Abstract: Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.
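A toy version of confidence- and workload-based routing, assuming a simple linear scoring rule invented for illustration; the framework's bidirectional feedback and evaluator-driven selection mechanisms are not shown.

```python
def route_task(task: str, agents: list) -> dict:
    """Pick the agent maximizing confidence minus a workload penalty; the
    0.1 penalty coefficient is a hypothetical choice, not the paper's."""
    return max(agents, key=lambda a: a["confidence"].get(task, 0.0) - 0.1 * a["load"])

agents = [
    {"name": "extractor", "load": 3, "confidence": {"summarize": 0.6, "extract": 0.9}},
    {"name": "writer",    "load": 1, "confidence": {"summarize": 0.8, "extract": 0.4}},
]
chosen = route_task("summarize", agents)
chosen["load"] += 1  # reallocating the task updates the workload signal
print(chosen["name"])  # writer
```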
cs.MM
[377] Voxel-GS: Quantized Scaffold Gaussian Splatting Compression with Run-Length Coding
Chunyang Fu, Xiangrui Liu, Shiqi Wang, Zhu Li
Main category: cs.MM
TL;DR: Voxel-GS is a Gaussian splatting compression framework that uses differentiable quantization, Laplacian rate proxy, and run-length coding for high compression ratios with fast coding speeds.
Details
Motivation: Large Gaussian splatting point clouds need efficient compression, but prior methods use complex neural entropy models that are computationally expensive.Method: Uses differentiable quantization on Scaffold-GS attributes, Laplacian-based rate proxy for entropy constraint, and lossless compression with Octree and run-length coding.
Result: Achieves high compression ratios with significantly faster coding speeds than prior methods, with accurate rate estimation for run-length coding.
Conclusion: Voxel-GS provides an effective, lightweight alternative to complex neural entropy models for Gaussian splatting compression, achieving competitive performance with simpler components.
Abstract: Substantial Gaussian splatting format point clouds require effective compression. In this paper, we propose Voxel-GS, a simple yet highly effective framework that departs from the complex neural entropy models of prior work, instead achieving competitive performance using only a lightweight rate proxy and run-length coding. Specifically, we employ a differentiable quantization to discretize the Gaussian attributes of Scaffold-GS. Subsequently, a Laplacian-based rate proxy is devised to impose an entropy constraint, guiding the generation of high-fidelity and compact reconstructions. Finally, this integer-type Gaussian point cloud is compressed losslessly using Octree and run-length coding. Experiments validate that the proposed rate proxy accurately estimates the bitrate of run-length coding, enabling Voxel-GS to eliminate redundancy and optimize for a more compact representation. Consequently, our method achieves a remarkable compression ratio with significantly faster coding speeds than prior art. The code is available at https://github.com/zb12138/VoxelGS.
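Two of the building blocks are generic enough to sketch: straight-through uniform quantization (rounding in the forward pass, identity gradient in the backward pass) and run-length coding of the resulting integer symbols. The step size and data below are assumptions; Scaffold-GS attributes, the Laplacian rate proxy, and octree coding are not reproduced.

```python
import torch

def quantize_ste(x: torch.Tensor, step: float = 0.05) -> torch.Tensor:
    """Uniform quantization with a straight-through estimator: the forward
    pass rounds x/step to integers, while gradients pass through unchanged,
    keeping the pipeline trainable. The step size is illustrative."""
    scaled = x / step
    return scaled + (torch.round(scaled) - scaled).detach()

def run_length_encode(symbols):
    """Lossless run-length coding of the flattened integer attributes."""
    runs = []
    for v in symbols:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

x = torch.tensor([0.11, 0.12, 0.12, 0.31], requires_grad=True)
q = quantize_ste(x)                         # gradients still flow back to x
symbols = [int(v) for v in q.detach().tolist()]
print(symbols, run_length_encode(symbols))  # [2, 2, 2, 6] [[2, 3], [6, 1]]
```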
eess.AS
[378] Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models
Ali Alsayegh, Tariq Masood
Main category: eess.AS
TL;DR: Evaluation of 8 commercial speech-to-text services on dysarthric speech shows severity-dependent performance: mild dysarthria achieves 3-5% WER (approaching typical speech), while severe dysarthria exceeds 49% WER. GPT-4o shows consistent improvement with verbatim prompts, while Gemini variants degrade.
Details
Motivation: People with dysarthria face exclusion from voice-based human-machine interaction due to poor ASR performance. While MLLMs offer potential for contextual reasoning to compensate for acoustic degradation, their zero-shot capabilities for dysarthric speech remain uncharacterized.Method: Evaluated 8 commercial speech-to-text services on TORGO dysarthric speech corpus: 4 conventional ASR systems (AssemblyAI, Whisper large-v3, Deepgram Nova-3, Nova-3 Medical) and 4 MLLM-based systems (GPT-4o, GPT-4o Mini, Gemini 2.5 Pro, Gemini 2.5 Flash). Assessed lexical accuracy, semantic preservation, and cost-latency trade-offs.
Result: Severity-dependent degradation: mild dysarthria achieves 3-5% WER (approaching typical speech benchmarks), severe dysarthria exceeds 49% WER across all systems. Verbatim-transcription prompt yields architecture-specific effects: GPT-4o achieves 7.36 percentage point WER reduction with consistent improvement, while Gemini variants degrade. Semantic metrics show communicative intent remains partially recoverable despite high lexical error rates.
Conclusion: Establishes empirical baselines for evidence-based technology selection in assistive voice interface deployment for dysarthric speakers. Shows MLLMs can provide improvements but with architecture-dependent effects, and that semantic content remains partially recoverable even with poor lexical accuracy.
Abstract: Voice-based human-machine interaction is a primary modality for accessing intelligent systems, yet individuals with dysarthria face systematic exclusion due to recognition performance gaps. Whilst automatic speech recognition (ASR) achieves word error rates (WER) below 5% on typical speech, performance degrades dramatically for dysarthric speakers. Multimodal large language models (MLLMs) offer potential for leveraging contextual reasoning to compensate for acoustic degradation, yet their zero-shot capabilities remain uncharacterised. This study evaluates eight commercial speech-to-text services on the TORGO dysarthric speech corpus: four conventional ASR systems (AssemblyAI, Whisper large-v3, Deepgram Nova-3, Nova-3 Medical) and four MLLM-based systems (GPT-4o, GPT-4o Mini, Gemini 2.5 Pro, Gemini 2.5 Flash). Evaluation encompasses lexical accuracy, semantic preservation, and cost-latency trade-offs. Results demonstrate severity-dependent degradation: mild dysarthria achieves 3-5% WER approaching typical-speech benchmarks, whilst severe dysarthria exceeds 49% WER across all systems. A verbatim-transcription prompt yields architecture-specific effects: GPT-4o achieves 7.36 percentage point WER reduction with consistent improvement across all tested speakers, whilst Gemini variants exhibit degradation. Semantic metrics indicate that communicative intent remains partially recoverable despite elevated lexical error rates. These findings establish empirical baselines enabling evidence-based technology selection for assistive voice interface deployment.
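Since every system here is compared on word error rate, a self-contained reference implementation may help: WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                       # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# one deletion ("the") and one substitution (light -> lights): 2/5 errors
print(wer("turn on the kitchen light", "turn on kitchen lights"))  # 0.4
```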
[379] Review of MEMS Speakers for Audio Applications
Nils Wittek, Anton Melnikov, Bert Kaiser, André Zimmermann
Main category: eess.AS
TL;DR: This review paper analyzes MEMS speakers as compact alternatives to traditional speakers, covering ultrasound and thermoacoustic methods, comparing electrodynamic, piezoelectric, and electrostatic actuation, with piezoelectric MEMS showing dominance in performance from 1990-2025.
Details
Motivation: MEMS speakers offer compact, scalable alternatives to traditional voice coil speakers with potential for improved sound quality through precise semiconductor manufacturing. The review aims to provide an overview of the research landscape and identify promising approaches for full-spectrum audio performance.Method: The review classifies MEMS speakers by actuation principle (electrodynamic, piezoelectric, electrostatic) and analyzes performance indicators from 1990-2025. It includes ultrasound pulse-based and thermoacoustic sound generation methods, with comparative analysis focusing on miniaturization and efficiency.
Result: Piezoelectric MEMS speakers with direct air displacement show dominance in performance metrics. The review identifies upcoming research challenges and potential candidates for achieving full-spectrum audio performance, suggesting innovative approaches could enable wideband adoption of MEMS-only speakers.
Conclusion: MEMS speakers present promising alternatives to traditional speakers, with piezoelectric actuation showing particular promise. Future research should focus on innovative approaches to overcome current limitations and enable widespread adoption of MEMS-only speakers for full-spectrum audio applications.
Abstract: Microelectromechanical systems (MEMS) speakers are compact, scalable alternatives to traditional voice coil speakers, promising improved sound quality through precise semiconductor manufacturing. This review provides an overview of the research landscape, including ultrasound pulse-based and thermoacoustic sound generation, classifying MEMS speakers by actuation principle: electrodynamic, piezoelectric, and electrostatic. A comparative analysis of performance indicators from 1990-2025 highlights the dominance of piezoelectric MEMS with direct air displacement, focusing on miniaturization and efficiency. The review outlines upcoming research challenges and identifies potential candidates for achieving full-spectrum audio performance. A focus on innovative approaches could lead to wideband adoption of MEMS-only speakers.
[380] Towards a Single ASR Model That Generalizes to Disordered Speech
Jimmy Tobin, Katrin Tomanek, Subhashini Venugopalan
Main category: eess.AS
TL;DR: Adding just 1,000 hours of disordered speech data (less than 1% of total training data) to ASR fine-tuning yields 26-33% accuracy improvements on disordered speech without harming standard speech performance.
Details
Motivation: To improve speech recognition accessibility for people with speech disabilities by making ASR systems work better on disordered speech without compromising standard speech performance.Method: Integrated ~1,000 hours of disordered speech recordings into fine-tuning of a near state-of-the-art ASR baseline system, representing less than 1% of total training data.
Result: 33% improvement on prompted disordered speech, 26% improvement on spontaneous conversational disordered speech, no significant performance decline on standard benchmarks, and closed 64% of gap between baseline and personalized models.
Conclusion: Incorporating a small fraction of high-quality disordered speech data is an effective, easy step to make speech technology more accessible for users with speech disabilities from a fairness perspective.
Abstract: This study investigates the impact of integrating a dataset of disordered speech recordings ($\sim$1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. Contrary to what one might expect, despite the data being less than 1% of the training data of the ASR system, we find a considerable improvement in disordered speech recognition accuracy. Specifically, we observe a 33% improvement on prompted speech, and a 26% improvement on a newly gathered spontaneous, conversational dataset of disordered speech. Importantly, there is no significant performance decline on standard speech recognition benchmarks. Further, we observe that the proposed tuning strategy helps close the gap between the baseline system and personalized models by 64%, highlighting the significant progress as well as the room for improvement. Given the substantial benefits of our findings, this experiment suggests that from a fairness perspective, incorporating a small fraction of high-quality disordered speech data in a training recipe is an easy step that could be done to make speech technology more accessible for users with speech disabilities.
[381] Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Main category: eess.AS
TL;DR: Automatic temporal localization of key PE therapy fidelity elements from session audio/transcripts using fine-tuned Qwen2-Audio with LoRA, achieving 5.3s MAE for boundary prediction.
Details
Motivation: Manual review of PE therapy sessions for fidelity assessment is labor-intensive, creating a need for automated, scalable solutions to support quality assurance and clinician training.Method: Fine-tune Qwen2-Audio using LoRA on 30-second audio-transcript windows; use LLM-based prompting to generate fidelity labels for three protocol phases; predict normalized boundary offsets with soft supervision guided by task-specific prompts.
Result: Best configuration (LoRA rank 8, 30s windows) achieves 5.3s MAE on 308 real PE sessions, within typical rater tolerance for timestamp review, enabling practical fidelity quality control.
Conclusion: Introduces a privacy-preserving, scalable framework for automatic fidelity tracking in PE therapy with potential applications in clinician training, supervision, and quality assurance.
Abstract: Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements, identifying their start and stop times, directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases, therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3), are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 308 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3s across tasks, within typical rater tolerance for timestamp review, enabling practical fidelity QC. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a privacy-preserving, scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
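A minimal sketch of the LoRA parametrization used for this kind of fine-tuning: a frozen base linear layer plus a trainable rank-r update scaled by alpha/r. The dimensions below are toy values, not Qwen2-Audio's, and the boundary-regression head is not shown.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank update W + (alpha/r) B A,
    the standard LoRA parametrization. B starts at zero so the adapted
    layer initially matches the pre-trained one."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(64, 64), r=8)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```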
eess.IV
[382] Colormap-Enhanced Vision Transformers for MRI-Based Multiclass (4-Class) Alzheimer’s Disease Classification
Faisal Ahmed
Main category: eess.IV
TL;DR: PseudoColorViT-Alz uses colormap-enhanced Vision Transformers on MRI scans to achieve state-of-the-art Alzheimer’s disease classification with 99.79% accuracy and 100% AUC on OASIS-1 dataset.
Details
Motivation: MRI is crucial for Alzheimer's disease diagnosis, but subtle structural variations in brain MRI scans challenge conventional deep learning models to extract discriminative features effectively.Method: Proposes PseudoColorViT-Alz, a colormap-enhanced Vision Transformer framework that uses pseudo-color representations of MRI images to amplify anatomical texture and contrast cues, combined with Vision Transformers’ global feature learning capabilities.
Result: Achieves 99.79% accuracy with 100% AUC on OASIS-1 dataset for four-class classification (non-demented, moderate dementia, mild dementia, very mild dementia), surpassing recent 2024-2025 methods (96.1%-99.68% accuracy).
Conclusion: Pseudo-color augmentation combined with Vision Transformers significantly enhances MRI-based Alzheimer’s disease classification, offering a robust and interpretable framework that outperforms current methods for clinical decision-making and early detection.
Abstract: Magnetic Resonance Imaging (MRI) plays a pivotal role in the early diagnosis and monitoring of Alzheimer’s disease (AD). However, the subtle structural variations in brain MRI scans often pose challenges for conventional deep learning models to extract discriminative features effectively. In this work, we propose PseudoColorViT-Alz, a colormap-enhanced Vision Transformer framework designed to leverage pseudo-color representations of MRI images for improved Alzheimer’s disease classification. By combining colormap transformations with the global feature learning capabilities of Vision Transformers, our method amplifies anatomical texture and contrast cues that are otherwise subdued in standard grayscale MRI scans. We evaluate PseudoColorViT-Alz on the OASIS-1 dataset using a four-class classification setup (non-demented, moderate dementia, mild dementia, and very mild dementia). Our model achieves a state-of-the-art accuracy of 99.79% with an AUC of 100%, surpassing the performance of recent 2024–2025 methods, including CNN-based and Siamese-network approaches, which reported accuracies ranging from 96.1% to 99.68%. These results demonstrate that pseudo-color augmentation combined with Vision Transformers can significantly enhance MRI-based Alzheimer’s disease classification. PseudoColorViT-Alz offers a robust and interpretable framework that outperforms current methods, providing a promising tool to support clinical decision-making and early detection of Alzheimer’s disease.
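The core transformation is simple to sketch: normalize a grayscale slice and push it through a colormap to obtain a 3-channel pseudo-color image suitable as ViT input. The specific colormap the paper uses is not stated in this summary, so 'viridis' below is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

def pseudo_color(gray: np.ndarray, cmap_name: str = "viridis") -> np.ndarray:
    """Map a grayscale slice to a 3-channel pseudo-color image by
    normalizing to [0, 1] and applying a matplotlib colormap."""
    g = (gray - gray.min()) / (gray.max() - gray.min() + 1e-8)
    rgba = plt.get_cmap(cmap_name)(g)   # H x W x 4 array in [0, 1]
    return rgba[..., :3]                # drop alpha: RGB input for the ViT

slice2d = np.random.default_rng(0).random((224, 224))  # stand-in MRI slice
print(pseudo_color(slice2d).shape)  # (224, 224, 3)
```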
[383] Fetpype: An Open-Source Pipeline for Reproducible Fetal Brain MRI Analysis
Thomas Sanchez, Gerard Martí-Juan, David Meunier, Miguel Angel Gonzalez Ballester, Oscar Camara, Gemma Piella, Meritxell Bach Cuadra, Guillaume Auzias
Main category: eess.IV
TL;DR: Fetpype is an open-source Python library that standardizes and streamlines fetal brain MRI preprocessing and analysis workflows, addressing challenges of motion, noise, and complex multi-step processing.
Details
Motivation: Fetal brain MRI analysis faces significant challenges including fetal motion, low signal-to-noise ratio, and complex multi-step processing requiring specialized tools. Current lack of standardized, integrated workflows hinders reproducibility and adoption of advanced analysis techniques.Method: Developed Fetpype, an open-source Python library that provides standardized workflows for T2-weighted fetal brain MRI data, integrating motion correction, super-resolution reconstruction, segmentation, and surface extraction into user-friendly pipelines.
Result: Fetpype is publicly available on GitHub (https://github.com/fetpype/fetpype), providing researchers and clinicians with a standardized tool for reproducible fetal brain MRI analysis from raw images to processed volumes.
Conclusion: Fetpype addresses the standardization gap in fetal brain MRI analysis by offering an integrated, open-source solution that improves reproducibility and accessibility of advanced neurodevelopmental assessment techniques.
Abstract: Fetal brain Magnetic Resonance Imaging (MRI) is crucial for assessing neurodevelopment in utero. However, analyzing this data presents significant challenges due to fetal motion, low signal-to-noise ratio, and the need for complex multi-step processing, including motion correction, super-resolution reconstruction, segmentation, and surface extraction. While various specialized tools exist for individual steps, integrating them into robust, reproducible, and user-friendly workflows that go from raw images to processed volumes is not straightforward. This lack of standardization hinders reproducibility across studies and limits the adoption of advanced analysis techniques for researchers and clinicians. To address these challenges, we introduce Fetpype, an open-source Python library designed to streamline and standardize the preprocessing and analysis of T2-weighted fetal brain MRI data. Fetpype is publicly available on GitHub at https://github.com/fetpype/fetpype.
[384] UPMRI: Unsupervised Parallel MRI Reconstruction via Projected Conditional Flow Matching
Xinzhe Luo, Yingzhen Li, Chen Qin
Main category: eess.IV
TL;DR: UPMRI: Unsupervised MRI reconstruction using Projected Conditional Flow Matching that learns from undersampled k-space data only, achieving performance comparable to supervised methods without needing fully sampled ground-truth images.
Details
Motivation: Supervised deep learning for accelerated MRI reconstruction requires fully sampled ground-truth images which are impractical in clinical settings due to long scan times. Current self-supervised/unsupervised methods perform inadequately at high acceleration rates.Method: UPMRI uses Projected Conditional Flow Matching (PCFM) to learn the prior distribution of fully sampled parallel MRI data from only undersampled k-space measurements. Establishes theoretical link between marginal vector field in measurement space and optimal PCFM solution, resulting in cyclic dual-space sampling algorithm.
Result: Extensive evaluations on fastMRI brain and CMRxRecon cardiac datasets show UPMRI significantly outperforms state-of-the-art self-supervised and unsupervised baselines. Achieves reconstruction fidelity comparable to or better than leading supervised methods at high acceleration factors.
Conclusion: UPMRI bridges the gap between supervised and unsupervised MRI reconstruction by achieving high-quality results without requiring fully sampled training data, making it practical for clinical applications where ground-truth images are unavailable.
Abstract: Reconstructing high-quality images from substantially undersampled k-space data for accelerated MRI presents a challenging ill-posed inverse problem. While supervised deep learning has revolutionized this field, it relies heavily on large datasets of fully sampled ground-truth images, which are often impractical or impossible to acquire in clinical settings due to long scan times. Despite advances in self-supervised/unsupervised MRI reconstruction, their performance remains inadequate at high acceleration rates. To bridge this gap, we introduce UPMRI, an unsupervised reconstruction framework based on Projected Conditional Flow Matching (PCFM) and its unsupervised transformation. Unlike standard generative models, PCFM learns the prior distribution of fully sampled parallel MRI data by utilizing only undersampled k-space measurements. To reconstruct the image, we establish a novel theoretical link between the marginal vector field in the measurement space, governed by the continuity equation, and the optimal solution to the PCFM objective. This connection results in a cyclic dual-space sampling algorithm for high-quality reconstruction. Extensive evaluations on the fastMRI brain and CMRxRecon cardiac datasets demonstrate that UPMRI significantly outperforms state-of-the-art self-supervised and unsupervised baselines. Notably, it also achieves reconstruction fidelity comparable to or better than leading supervised methods at high acceleration factors, while requiring no fully sampled training data.
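For orientation, here is the standard conditional flow matching loss on the straight-line probability path $x_t = (1-t)x_0 + t x_1$ with target velocity $x_1 - x_0$, which projected and unsupervised variants such as PCFM build on. UPMRI's measurement-space projection and cyclic dual-space sampler are not reproduced here.

```python
import torch
import torch.nn as nn

def cfm_loss(v_net: nn.Module, x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    """Standard conditional flow matching loss: regress the network's
    velocity field at a random time t on the straight-line path against
    the constant target x1 - x0."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = v_net(torch.cat([xt, t], dim=1))  # condition on time via concatenation
    return ((pred - target) ** 2).mean()

d = 16
v_net = nn.Sequential(nn.Linear(d + 1, 64), nn.ReLU(), nn.Linear(64, d))
x0, x1 = torch.randn(32, d), torch.randn(32, d)  # noise and "data" samples
print(cfm_loss(v_net, x0, x1))
```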
[385] Resource-efficient medical image classification for edge devices
Mahsa Lavaei, Zahra Abadi, Salar Beigzad, Alireza Maleki
Main category: eess.IV
TL;DR: Medical image classification using quantization techniques for edge deployment, achieving reduced model size and latency while maintaining diagnostic accuracy.
Details
Motivation: Deep learning models for medical image classification face deployment challenges on resource-constrained edge devices due to computational and memory limitations, requiring efficient solutions for remote healthcare settings.Method: Employed model quantization techniques including quantization-aware training (QAT) and post-training quantization (PTQ) optimized for edge devices, reducing precision of model parameters and activations.
Result: Quantized models achieved substantial reductions in model size and inference latency, enabling real-time processing on edge hardware while maintaining clinically acceptable diagnostic accuracy.
Conclusion: Provides a practical pathway for deploying AI-driven medical diagnostics in remote and resource-limited settings, enhancing accessibility and scalability of healthcare technologies.
Abstract: Medical image classification is a critical task in healthcare, enabling accurate and timely diagnosis. However, deploying deep learning models on resource-constrained edge devices presents significant challenges due to computational and memory limitations. This research investigates a resource-efficient approach to medical image classification by employing model quantization techniques. Quantization reduces the precision of model parameters and activations, significantly lowering computational overhead and memory requirements without sacrificing classification accuracy. The study focuses on the optimization of quantization-aware training (QAT) and post-training quantization (PTQ) methods tailored for edge devices, analyzing their impact on model performance across medical imaging datasets. Experimental results demonstrate that quantized models achieve substantial reductions in model size and inference latency, enabling real-time processing on edge hardware while maintaining clinically acceptable diagnostic accuracy. This work provides a practical pathway for deploying AI-driven medical diagnostics in remote and resource-limited settings, enhancing the accessibility and scalability of healthcare technologies.
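A minimal post-training quantization sketch: symmetric int8 quantization of a weight tensor with a single per-tensor scale. Real PTQ/QAT pipelines also calibrate activations and often use per-channel scales, which are omitted here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric PTQ of a weight tensor to int8, returning the quantized
    values and the scale needed to dequantize (w ~ q * scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - q.astype(np.float32) * scale).max()
print(q.dtype, round(float(err), 4))  # int8 storage, small reconstruction error
```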
[386] A 28nm 0.22 μJ/token memory-compute-intensity-aware CNN-Transformer accelerator with hybrid-attention-based layer-fusion and cascaded pruning for semantic segmentation
Pingcheng Dong, Yonghao Tan, Xuejiao Liu, Peng Luo, Yu Liu, Luhong Liang, Yitong Zhou, Di Pang, Man-To Yung, Dong Zhang, Xijie Huang, Shih-Yang Liu, Yongkun Wu, Fengshi Tian, Chi-Ying Tsui, Fengbin Tu, Kwang-Ting Cheng
Main category: eess.IV
TL;DR: A 28nm CNN-Transformer accelerator for semantic segmentation achieves 3.86-10.91x energy reduction with 52.90 TOPS/W peak efficiency.
Details
Motivation: To address the energy inefficiency of semantic segmentation accelerators by combining CNN and Transformer architectures for better performance-per-watt.Method: Uses a hybrid attention unit, layer-fusion scheduler, and cascaded feature-map pruner in a 28nm chip design with INT8 precision.
Result: Achieves 3.86-to-10.91x energy reduction over previous designs with peak energy efficiency of 52.90 TOPS/W on a 13.93 mm² chip area.
Conclusion: The proposed CNN-Transformer accelerator demonstrates significant energy efficiency improvements for semantic segmentation through architectural innovations.
Abstract: This work presents a 28nm 13.93 mm² CNN-Transformer accelerator for semantic segmentation, achieving 3.86-to-10.91x energy reduction over previous designs. It features a hybrid attention unit, layer-fusion scheduler, and cascaded feature-map pruner, with peak energy efficiency of 52.90 TOPS/W (INT8).
[387] SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
N. A. Adarsh Pritam, Jeba Shiney O, Sanyam Jain
Main category: eess.IV
TL;DR: SkinGenBench benchmark shows generative architecture choice (StyleGAN2-ADA vs DDPMs) matters more than preprocessing complexity for synthetic dermoscopic images. StyleGAN2-ADA produces better synthetic images with lower FID/KID scores, and synthetic data augmentation improves melanoma detection by 8-15% F1-score.
Details
Motivation: To systematically investigate how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis, addressing the need for effective data augmentation in medical imaging where data scarcity is common.Method: Created SkinGenBench benchmark using 14,116 dermoscopic images from HAM10000 and MILK10K across five lesion classes. Evaluated two generative paradigms: StyleGAN2-ADA and DDPMs under basic geometric augmentation vs advanced artifact removal pipelines. Assessed synthetic images using perceptual metrics (FID, KID, IS), feature space analysis, and impact on diagnostic performance across five downstream classifiers.
Result: StyleGAN2-ADA consistently produced better synthetic images with the lowest FID (65.5) and KID (0.05) scores. Diffusion models generated higher-variance samples but with reduced perceptual fidelity. Advanced artifact removal provided only marginal improvements. Synthetic data augmentation substantially improved melanoma detection with 8-15% absolute gains in melanoma F1-score; ViT-B/16 achieved an F1 of 0.88 and ROC-AUC of 0.98, an improvement of approximately 14% over non-augmented baselines.
Conclusion: Generative architecture choice has stronger influence than preprocessing complexity on both image fidelity and diagnostic utility. Synthetic data augmentation can significantly improve melanoma detection performance, with StyleGAN2-ADA outperforming diffusion models for this medical imaging task.
Abstract: This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of 14,116 dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate two representative generative paradigms, StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs), under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID (65.5) and KID (0.05), while diffusion models generated higher-variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection with 8-15% absolute gains in melanoma F1-score, with ViT-B/16 achieving an F1 of 0.88 and ROC-AUC of 0.98, representing an improvement of approximately 14% over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench
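For readers unfamiliar with the fidelity metric used above, here is a minimal sketch of the standard FID computation between two feature sets; the random features stand in for real Inception embeddings, and this is not the benchmark's evaluation code.

```python
# Standard FID formula: ||mu1-mu2||^2 + Tr(C1 + C2 - 2*(C1 C2)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))

real = np.random.randn(500, 64)
fake = np.random.randn(500, 64) + 0.1   # slightly shifted distribution
print(f"FID: {fid(real, fake):.3f}")
```

Lower is better; KID follows the same spirit but uses an unbiased kernel-based estimate instead of the Gaussian assumption.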
[388] Breast Cancer Neoadjuvant Chemotherapy Treatment Response Prediction Using Aligned Longitudinal MRI and Clinical Data
Rahul Ravi, Ruizhe Li, Tarek Abdelfatah, Stephen Chan, Xin Chen
Main category: eess.IV
TL;DR: ML framework using longitudinal MRI and clinical data predicts breast cancer treatment response (PCR) and survival (RFS), with image registration improving feature extraction and radiomics outperforming deep learning methods.
Details
Motivation: To improve prediction of neoadjuvant chemotherapy response in breast cancer patients using longitudinal MRI data, enabling better treatment monitoring and outcome prediction.Method: Framework includes tumor segmentation, image registration, feature extraction (radiomics + 3 deep learning methods), feature selection, and ML modeling to predict PCR and 5-year RFS from longitudinal CE-MRI.
Result: Image registration improved predictive models; radiomics with logistic regression achieved best performance: PCR - AUC 0.88, accuracy 0.85; RFS - AUC 0.78, accuracy 0.72.
Conclusion: Image registration significantly enhances longitudinal feature learning; radiomics outperforms deep learning feature extractors with better performance and interpretability for treatment response prediction.
Abstract: Aim: This study investigates treatment response prediction to neoadjuvant chemotherapy (NACT) in breast cancer patients, using longitudinal contrast-enhanced magnetic resonance images (CE-MRI) and clinical data. The goal is to develop machine learning (ML) models to predict pathologic complete response (PCR, binary classification) and 5-year relapse-free survival status (RFS, binary classification). Method: The proposed framework includes tumour segmentation, image registration, feature extraction, and predictive modelling. Using the image registration method, MRI image features can be extracted and compared from the original tumour site at different time points, thereby monitoring intratumoral changes during the NACT process. Four feature extractors, one radiomics-based and three deep learning-based (MedicalNet, Segformer3D, SAM-Med3D), were implemented and compared. In combination with three feature selection methods and four ML models, predictive models were built and compared. Results: The proposed image registration-based feature extraction consistently improves the predictive models. In the PCR and RFS classification tasks, the logistic regression model trained on radiomic features performed best, with an AUC of 0.88 and classification accuracy of 0.85 for PCR classification, and an AUC of 0.78 and classification accuracy of 0.72 for RFS classification. Conclusions: The results show that the image registration method significantly improves longitudinal feature learning for predicting PCR and RFS. The radiomics feature extractor is more effective than the pre-trained deep learning feature extractors, with higher performance and better interpretability.
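A minimal sketch of the predictive-modelling stage follows, assuming radiomic features have already been extracted from registered longitudinal MRI; the synthetic data, the k=20 selection, and the cross-validation setup are hypothetical stand-ins, not the study's protocol.

```python
# Sketch: feature selection + logistic regression on radiomic feature tables.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))    # 100 patients x 200 longitudinal radiomic features
y = rng.integers(0, 2, size=100)   # binary PCR labels (synthetic)

model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=20),      # feature selection step
                      LogisticRegression(max_iter=1000)) # final classifier
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.2f}")
```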
[389] MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation
Saikat Roy, Yannick Kirchhoff, Constantin Ulrich, Maximillian Rokuss, Tassilo Wald, Fabian Isensee, Klaus Maier-Hein
Main category: eess.IV
TL;DR: MedNeXt-v2 is a compound-scaled 3D ConvNeXt architecture that achieves SOTA performance in 3D medical image segmentation through improved backbone design and large-scale supervised pretraining on 18k CT volumes.
Details
Motivation: Existing large-scale pretraining efforts focus on dataset size but overlook whether backbone networks are effective representation learners at scale. The authors identify that routinely used backbones in pretraining pipelines are often suboptimal.Method: 1) Comprehensive backbone benchmarking to show stronger from-scratch performance predicts better downstream performance after pretraining; 2) Introduce MedNeXt-v2 with 3D Global Response Normalization module and depth/width/context scaling; 3) Pretrain on 18k CT volumes; 4) Fine-tune across six CT/MR benchmarks (144 structures).
Result: State-of-the-art performance across six challenging CT and MR benchmarks, showing consistent gains over seven publicly released pretrained models. Key findings: stronger backbones yield better results on similar data, representation scaling disproportionately benefits pathological segmentation, modality-specific pretraining offers negligible benefit with full finetuning.
Conclusion: MedNeXt-v2 is established as a strong backbone for large-scale supervised representation learning in 3D medical image segmentation. The work emphasizes the importance of backbone architecture design alongside dataset scaling for effective representation learning.
Abstract: Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from-scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full finetuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D medical image segmentation. Our code and pretrained models are made available with the official nnUNet repository at: https://www.github.com/MIC-DKFZ/nnUNet
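Below is a sketch of a 3D Global Response Normalization layer, extending the ConvNeXt-v2 GRN formulation to volumetric tensors; the exact placement and parameterization inside MedNeXt-v2 may differ.

```python
# 3D Global Response Normalization (GRN) sketch, after ConvNeXt-v2's GRN.
import torch
import torch.nn as nn

class GRN3d(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, D, H, W)
        gx = torch.norm(x, p=2, dim=(2, 3, 4), keepdim=True)  # per-channel energy
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)   # divisive normalization
        return self.gamma * (x * nx) + self.beta + x          # residual, zero-init

x = torch.randn(2, 32, 16, 32, 32)
print(GRN3d(32)(x).shape)  # torch.Size([2, 32, 16, 32, 32])
```

With gamma initialized to zero, the layer starts as an identity map and learns to recalibrate channel responses during training.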
[390] V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim
Main category: eess.IV
TL;DR: V-Rex is a software-hardware co-designed accelerator that enables real-time streaming video LLM inference on edge devices by addressing KV cache bottlenecks through a training-free dynamic retrieval algorithm and specialized hardware.
Details
Motivation: Streaming video LLMs face fundamental memory and computational challenges due to growing KV caches with continuous video input, requiring iterative prefill stages that cause extensive computation, data transfer, and accuracy degradation - especially problematic for edge deployment.Method: V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm using temporal and spatial similarity-based token clustering to reduce KV cache memory. It also provides a compact hardware accelerator with a dynamic KV cache retrieval engine (DRE) featuring bit-level and early-exit computing units.
Result: Achieves 3.9-8.3 FPS real-time streaming video LLM inference on edge devices with negligible accuracy loss. DRE uses only 2.2% power and 2.0% area, delivering 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU.
Conclusion: V-Rex is the first comprehensive solution tackling KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
Abstract: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This growth requires an iterative prefill stage, a feature unique to streaming video LLMs, which incurs extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time (3.9-8.3 FPS), energy-efficient streaming video LLM inference on edge deployments with negligible accuracy loss. While the DRE accounts for only 2.2% of power and 2.0% of area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over an AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
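To illustrate the flavor of similarity-based KV cache clustering, here is a toy greedy variant that merges consecutive tokens whose keys are cosine-similar; the threshold, the greedy merge rule, and the synthetic cache are assumptions for illustration, not the ReSV algorithm.

```python
# Toy greedy KV-cache clustering: merge temporally adjacent, similar tokens.
import torch

def cluster_kv(keys: torch.Tensor, values: torch.Tensor, tau: float = 0.9):
    """Merge each token into the previous cluster if its key is cosine-similar
    to the running cluster centroid; returns the mean-pooled reduced cache."""
    k_sum, v_sum, counts = [keys[0].clone()], [values[0].clone()], [1]
    for k, v in zip(keys[1:], values[1:]):
        centroid = k_sum[-1] / counts[-1]
        if torch.cosine_similarity(k, centroid, dim=0) > tau:  # redundant token
            k_sum[-1] += k; v_sum[-1] += v; counts[-1] += 1
        else:                                                  # new cluster
            k_sum.append(k.clone()); v_sum.append(v.clone()); counts.append(1)
    n = torch.tensor(counts, dtype=keys.dtype).unsqueeze(1)
    return torch.stack(k_sum) / n, torch.stack(v_sum) / n

base = torch.randn(100, 64)                                   # 100 distinct "scenes"
K = base.repeat_interleave(10, dim=0) + 0.05 * torch.randn(1000, 64)
V = torch.randn(1000, 64)
Kc, Vc = cluster_kv(K, V)
print(f"cache reduced from {K.shape[0]} to {Kc.shape[0]} tokens")
```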
[391] MSDiff: Multi-Scale Diffusion Model for Ultra-Sparse View CT Reconstruction
Junyan Zhang, Mengxiao Geng, Pinhuang Tan, Yi Liu, Zhili Liu, Bin Huang, Qiegen Liu
Main category: eess.IV
TL;DR: MSDiff: A multi-scale diffusion model for ultra-sparse view CT reconstruction that improves image quality by integrating comprehensive and sparse sampling information with attention mechanisms.
Details
Motivation: Sparse-view CT reduces radiation exposure but creates reconstruction challenges. Score-based generative models struggle with ultra-sparse angles, leading to poor image quality.Method: Proposes MSDiff with multi-scale diffusion models that integrate comprehensive and sparse sampling. Uses precise diffusion adjustments to extract diverse noise distributions, designs equidistant masks to leverage projection data correlations, and employs attention mechanisms.
Result: Significantly improves image reconstruction quality under ultra-sparse angles with good generalization across various datasets.
Conclusion: MSDiff effectively addresses ultra-sparse CT reconstruction challenges by focusing on global information distribution while preserving local characteristics, outperforming existing methods.
Abstract: Computed Tomography (CT) technology reduces radiation hazards to the human body through sparse sampling, but fewer sampling angles pose challenges for image reconstruction. Score-based generative models are widely used in sparse-view CT reconstruction, but their performance diminishes significantly with a sharp reduction in projection angles. Therefore, we propose an ultra-sparse view CT reconstruction method utilizing multi-scale diffusion models (MSDiff), designed to concentrate on the global distribution of information and facilitate the reconstruction of sparse views with local image characteristics. Specifically, the proposed model ingeniously integrates information from both comprehensive sampling and selectively sparse sampling techniques. Through precise adjustments in the diffusion model, it is capable of extracting diverse noise distributions, furthering the understanding of the overall structure of images, and aiding the fully sampled model in recovering image information more effectively. By leveraging the inherent correlations within the projection data, we have designed an equidistant mask, enabling the model to focus its attention more effectively. Experimental results demonstrated that the multi-scale model approach significantly improved the quality of image reconstruction under ultra-sparse angles, with good generalization across various datasets.
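A minimal sketch of an equidistant projection mask is shown below, assuming an (angles × detectors) sinogram layout; the shapes and view counts are illustrative rather than the paper's configuration.

```python
# Equidistant view selection for ultra-sparse-view CT (illustrative shapes).
import numpy as np

def equidistant_mask(n_angles: int, n_keep: int) -> np.ndarray:
    """Keep n_keep views spaced as evenly as possible over n_angles."""
    idx = np.linspace(0, n_angles - 1, n_keep).round().astype(int)
    mask = np.zeros(n_angles, dtype=bool)
    mask[idx] = True
    return mask

sinogram = np.random.rand(720, 512)   # fully sampled: 720 views x 512 detectors
mask = equidistant_mask(720, 30)      # ultra-sparse: keep only 30 views
sparse = sinogram[mask]
print(sparse.shape)                   # (30, 512)
```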
[392] Dynamic PET Image Prediction Using a Network Combining Reversible and Irreversible Modules
Jie Sun, Junyan Zhang, Qian Xia, Chuanfu Sun, Yumei Chen, Yunjie Yang, Huafeng Liu, Wentao Zhu, Qiegen Liu
Main category: eess.IV
TL;DR: Proposes a deep learning framework to predict full dynamic PET images from early frames, reducing scan time while maintaining quality.
Details
Motivation: Dynamic PET imaging is clinically valuable but requires prolonged scan times that cause patient discomfort. There's a need to reduce scanning time while maintaining the ability to study tracer kinetics and metabolic processes.Method: Multi-module deep learning framework with reversible and irreversible modules that predicts kinetic parameter images from early dynamic PET frames, then generates complete dynamic PET images.
Result: Network demonstrated good predictive performance for kinetic parameters in simulated data, reconstructed high-quality dynamic PET images, and showed good generalization performance with clinical data.
Conclusion: The proposed method effectively reduces dynamic PET scanning time and has promising clinical application prospects.
Abstract: Dynamic positron emission tomography (PET) images can reveal the distribution of tracers in the organism and the dynamic processes involved in biochemical reactions, and dynamic PET is widely used in clinical practice. Despite the high effectiveness of dynamic PET imaging in studying the kinetics and metabolic processes of radiotracers, prolonged scan times can cause discomfort for both patients and medical personnel. This study proposes a dynamic frame prediction method for dynamic PET imaging, reducing dynamic PET scanning time by applying a multi-module deep learning framework composed of reversible and irreversible modules. The network can predict kinetic parameter images based on the early frames of dynamic PET images, and then generate complete dynamic PET images. In validation experiments with simulated data, our network demonstrated good predictive performance for kinetic parameters and was able to reconstruct high-quality dynamic PET images. Additionally, in clinical data experiments, the network exhibited good generalization performance, indicating that the proposed method has promising clinical application prospects.
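To illustrate the underlying idea of predicting late frames from early ones, here is a toy Patlak-style example that fits irreversible-uptake kinetic parameters to early time frames and then extrapolates the full time-activity curve; the input function, frame timing, and parameter values are made up, and the paper's deep network replaces this simple linear model.

```python
# Toy kinetic extrapolation: fit Patlak parameters on early frames, predict late ones.
import numpy as np

t = np.linspace(0.5, 60, 24)               # frame mid-times in minutes (synthetic)
cp = 10 * np.exp(-0.1 * t) + 1.0           # synthetic plasma input function
int_cp = np.cumsum(cp) * np.gradient(t)    # crude running integral of the input

ki_true, v_true = 0.05, 0.3
tissue = ki_true * int_cp + v_true * cp    # Patlak model: C(t) = Ki*∫Cp + V*Cp

early = slice(0, 8)                        # use only the first 8 frames
A = np.stack([int_cp[early], cp[early]], axis=1)
ki_fit, v_fit = np.linalg.lstsq(A, tissue[early], rcond=None)[0]

pred = ki_fit * int_cp + v_fit * cp        # extrapolated full curve
print(f"Ki={ki_fit:.3f} (true {ki_true}), "
      f"late-frame max error={np.abs(pred - tissue).max():.2e}")
```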
[393] Saturation-Aware Snapshot Compressive Imaging: Theory and Algorithm
Mengyu Zhao, Shirin Jalali
Main category: eess.IV
TL;DR: First theoretical analysis of SCI recovery under sensor saturation, showing optimal Bernoulli mask densities below 0.5, with novel SAPnet framework outperforming existing methods.
Details
Motivation: In SCI systems, multiplexing can cause sensor saturation that violates the linear model and degrades reconstruction quality. Existing methods lack theoretical understanding of recovery under saturation conditions.Method: Model clipping as element-wise nonlinearity, derive finite-sample recovery bound linking error to mask density and saturation extent. Use theory to optimize mask patterns and propose Saturation-Aware PnP Net (SAPnet) that enforces consistency with saturated measurements.
Result: Theoretical analysis reveals optimal Bernoulli mask densities below one-half, decreasing with stronger saturation. SAPnet significantly outperforms existing PnP-based methods on standard video-SCI benchmarks.
Conclusion: First theoretical characterization of SCI recovery under saturation provides design rules for mask optimization and enables development of superior reconstruction methods like SAPnet that explicitly handle saturation.
Abstract: Snapshot Compressive Imaging (SCI) uses coded masks to compress a 3D data cube into a single 2D snapshot. In practice, multiplexing can push intensities beyond the sensor’s dynamic range, producing saturation that violates the linear SCI model and degrades reconstruction. This paper provides the first theoretical characterization of SCI recovery under saturation. We model clipping as an element-wise nonlinearity and derive a finite-sample recovery bound for compression-based SCI that links reconstruction error to mask density and the extent of saturation. The analysis yields a clear design rule: optimal Bernoulli masks use densities below one-half, decreasing further as saturation strengthens. Guided by this principle, we optimize mask patterns and introduce a novel reconstruction framework, Saturation-Aware PnP Net (SAPnet), which explicitly enforces consistency with saturated measurements. Experiments on standard video-SCI benchmarks confirm our theory and demonstrate that SAPnet significantly outperforms existing PnP-based methods.
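The clipping nonlinearity at the heart of the analysis is easy to state in code; the sketch below simulates a saturated SCI measurement with Bernoulli masks (the density, saturation level, and shapes are illustrative, not the paper's setup).

```python
# Saturated SCI forward model: coded frames summed into one clipped snapshot.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
x = rng.random((T, H, W))                  # video cube (T frames)
masks = rng.random((T, H, W)) < 0.3        # Bernoulli masks, density 0.3 < 0.5
y_lin = (masks * x).sum(axis=0)            # linear SCI measurement
sat = 1.5                                  # sensor saturation level
y = np.clip(y_lin, 0, sat)                 # element-wise clipping nonlinearity
print(f"saturated pixels: {(y_lin > sat).mean():.1%}")
```

Denser masks push more summed intensity past the sensor's range, which is consistent with the paper's design rule that optimal Bernoulli densities stay below one-half and decrease as saturation strengthens.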
[394] The Eye as a Window to Systemic Health: A Survey of Retinal Imaging from Classical Techniques to Oculomics
Inamullah, Imran Razzak, Shoaib Jameel
Main category: eess.IV
TL;DR: Survey paper exploring how retinal imaging combined with AI (oculomics) can serve as a non-invasive window to detect and monitor both ocular and systemic diseases.
Details
Motivation: The unique vascularized anatomy of the human eye provides a window for assessing human health, enabling early detection and monitoring of diseases. Advancements in AI and imaging technology create opportunities to bridge the gap between ocular observations and systemic health insights.Method: This is a survey paper that reviews the evolution of retinal imaging techniques, analyzes the integration of AI-driven analysis, and examines the shift from classical retinal imaging to oculomics. The paper also identifies research gaps and future directions.
Result: The paper maps out the progression of retinal imaging technology and AI integration, highlighting how oculomics enables non-invasive markers for disease detection and monitoring. It identifies current hurdles and research gaps in the field.
Conclusion: Oculomics represents a promising frontier in ophthalmology that can provide systemic health insights through retinal imaging. While significant progress has been made, challenges remain that require further research to fully realize the potential of using the eye as a window to overall health.
Abstract: The unique vascularized anatomy of the human eye, visible through the retina, offers a window into human health. The retinal structure assists in early detection, monitoring of disease progression, and intervention for both ocular and non-ocular diseases. Advances in imaging technology leveraging Artificial Intelligence have seized this opportunity to bridge the gap between the eye and human health. This track paves the way for unveiling systemic health insights from the ocular system and providing surrogate non-invasive markers for timely identification and intervention. The new frontiers of oculomics in ophthalmology cover both ocular and systemic diseases and are attracting growing attention. In this survey paper, we explore the evolution of retinal imaging techniques, the pressing need for the integration of AI-driven analysis, and the shift of retinal imaging from classical techniques to oculomics. We also discuss hurdles that may be faced in the progression of oculomics, highlighting research gaps and future directions.
[395] Text-guided multi-stage cross-perception network for medical image segmentation
Gaoyu Chen, Haixia Pan
Main category: eess.IV
TL;DR: TMC network improves medical image segmentation by using text prompts and multi-stage cross-perception to enhance cross-modal interaction and feature representation.
Details
Motivation: Traditional segmentation methods like U-Net have weak semantic expression due to insufficient generalization and lack of interactivity. Existing text-guided methods suffer from insufficient cross-modal interaction and inadequate feature representation.Method: Proposes Text-guided Multi-stage Cross-perception network (TMC) with Multi-stage Cross-attention Module (MCM) for fine-grained semantic understanding and Multi-stage Alignment Loss (MA Loss) for cross-modal semantic consistency across feature levels.
Result: Achieves Dice scores of 84.65% (QaTa-COV19), 78.39% (MosMedData), and 88.09% (Duke-Breast-Cancer-MRI), outperforming both U-Net-based networks and existing text-guided methods.
Conclusion: TMC effectively addresses limitations of traditional and text-guided segmentation methods by enhancing cross-modal interaction and feature representation, demonstrating superior performance across multiple medical imaging datasets.
Abstract: Medical image segmentation plays a crucial role in clinical medicine, serving as a key tool for auxiliary diagnosis, treatment planning, and disease monitoring. However, traditional segmentation methods such as U-Net are often limited by weak semantic expression of target regions, which stems from insufficient generalization and a lack of interactivity. Incorporating text prompts offers a promising avenue to more accurately pinpoint lesion locations, yet existing text-guided methods are still hindered by insufficient cross-modal interaction and inadequate cross-modal feature representation. To address these challenges, we propose the Text-guided Multi-stage Cross-perception network (TMC). TMC incorporates a Multi-stage Cross-attention Module (MCM) to enhance the model’s understanding of fine-grained semantic details and a Multi-stage Alignment Loss (MA Loss) to improve the consistency of cross-modal semantics across different feature levels. Experimental results on three public datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI) demonstrate the superior performance of TMC, achieving Dice scores of 84.65%, 78.39%, and 88.09%, respectively, and consistently outperforming both U-Net-based networks and existing text-guided methods.
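As a minimal illustration of the text-guided cross-attention that modules like MCM build on, the sketch below lets flattened image features attend to encoded text tokens; the dimensions, single-stage wiring, and module name are hypothetical, not the TMC architecture.

```python
# One text-to-image cross-attention stage: image features query text tokens.
import torch
import torch.nn as nn

class TextImageCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # img: (B, N_pixels, C) queries; txt: (B, N_tokens, C) keys/values
        fused, _ = self.attn(query=img, key=txt, value=txt)
        return self.norm(img + fused)        # residual fusion of text cues

img = torch.randn(2, 32 * 32, 256)           # flattened image feature map
txt = torch.randn(2, 12, 256)                # encoded text prompt
print(TextImageCrossAttention()(img, txt).shape)  # torch.Size([2, 1024, 256])
```

A multi-stage design in the spirit of TMC would apply such a block at several feature levels and add an alignment loss to keep the cross-modal semantics consistent across stages.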