Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 78]
cs.CV [Total: 180]
cs.AI [Total: 58]
cs.SD [Total: 9]
cs.LG [Total: 154]
cs.MA [Total: 6]
cs.MM [Total: 3]
eess.AS [Total: 4]
eess.IV [Total: 10]

cs.CL

[1] Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review

Yan Yang, Mouxiao Bian, Peiling Li, Bingjian Wen, Ruiyao Chen, Kangkun Mao, Xiaojun Ye, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Kaifeng Qiu, Junyan Wu

Main category: cs.CL

TL;DR: RxBench is a comprehensive benchmark for evaluating LLMs in prescription review, covering 14 common error types with 2,259 items. It reveals performance stratification among 18 LLMs and shows leading models can match or exceed pharmacist performance in certain tasks.

Details

Motivation: The rapid integration of LLMs into clinical decision support, particularly prescription review, requires systematic and fine-grained evaluation to understand their capabilities and limitations in this critical domain.

Method: Developed RxBench benchmark covering common prescription review categories with 14 frequent prescription error types from authoritative pharmacy references. Created 1,150 single-choice, 230 multiple-choice, and 879 short-answer items reviewed by clinical pharmacists. Benchmarked 18 state-of-the-art LLMs and performed targeted fine-tuning on a mid-tier model.

Result: Clear performance stratification: Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 formed top tier. Leading LLMs can match or exceed licensed pharmacist performance in certain tasks. Targeted fine-tuning produced specialized model rivaling top general-purpose LLMs on short-answer tasks.

Conclusion: RxBench establishes a standardized, error-type-oriented framework that reveals LLM capabilities/limitations in prescription review and provides foundational resource for building more reliable, specialized clinical tools.

Abstract: The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-grained evaluation, we developed RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references. RxBench consists of 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. We benchmarked 18 state-of-the-art LLMs and identified clear stratification of performance across tasks. Notably, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 consistently formed the first tier, outperforming other models in both accuracy and robustness. Comparisons with licensed pharmacists indicated that leading LLMs can match or exceed human performance in certain tasks. Furthermore, building on insights from our benchmark evaluation, we performed targeted fine-tuning on a mid-tier model, resulting in a specialized model that rivals leading general-purpose LLMs in performance on short-answer question tasks. The main contribution of RxBench lies in establishing a standardized, error-type-oriented framework that not only reveals the capabilities and limitations of frontier LLMs in prescription review but also provides a foundational resource for building more reliable and specialized clinical tools.

[2] Deep Research: A Systematic Survey

Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, Qingyao Ai, Jen-Tse Huang, Wenxuan Wang, Yue Zhang, Yiming Yang, Zhaopeng Tu, Zhaochun Ren

Main category: cs.CL

TL;DR: This survey paper provides a comprehensive overview of Deep Research (DR) systems that combine LLMs with external tools like search engines to tackle complex, open-ended tasks requiring critical thinking and verifiable outputs.

Details

Motivation: Current LLMs have limitations in handling tasks that require critical thinking, multi-source information, and verifiable outputs. Single-shot prompting or standard retrieval-augmented generation falls short for complex, open-ended research tasks that demand deeper reasoning and external tool integration.

Method: The survey formalizes a three-stage roadmap for deep research systems and identifies four key components: query planning, information acquisition, memory management, and answer generation. It also summarizes optimization techniques including prompting strategies, supervised fine-tuning, and agentic reinforcement learning.

Result: The paper provides a systematic taxonomy and framework for understanding deep research systems, distinguishing them from related paradigms. It consolidates evaluation criteria and identifies open challenges to guide future development in this rapidly evolving field.

Conclusion: Deep research represents an important advancement in empowering LLMs as research agents for complex tasks. The survey establishes foundational concepts and frameworks while acknowledging the field’s rapid evolution, with plans for continuous updates to reflect ongoing progress.

Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.

[3] Mirror, Mirror on the Wall – Which is the Best Model of Them All?

Dina Sayed, Heiko Schuldt

Main category: cs.CL

TL;DR: This paper analyzes LLM leaderboards and benchmarks for model selection, using medical domain as case study, and proposes a systematic Model Selection Methodology.

Details

Motivation: The rapid proliferation of foundation LLMs makes selecting the most suitable model for specific use cases increasingly complex, requiring systematic approaches to evaluate both qualitative and quantitative dimensions.

Method: Analysis of current leaderboards and benchmarks for quantitative evaluation, with medical domain as case study, leading to development of Model Selection Methodology (MSM) - a systematic approach for model navigation, prioritization, and selection.

Result: Demonstrates the evolution, current landscape, and practical significance of quantitative evaluation through leaderboard analysis, showing how systematic assessment can guide model selection decisions.

Conclusion: Proposes MSM as a systematic framework to address the growing complexity of LLM selection, emphasizing the importance of quantitative evaluation alongside qualitative considerations for optimal model choice in specific domains.

Abstract: Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in various domains such as finance, healthcare, education, telecommunications, and law, among others. Typically, state-of-the-art (SOTA) foundation models are developed by large corporations based on large data collections and substantial computational and financial resources required to pretrain such models from scratch. These foundation models then serve as the basis for further development and domain adaptation for specific use cases or tasks. However, given the dynamic and fast-paced nature of launching new foundation models, the process of selecting the most suitable model for a particular use case, application, or domain becomes increasingly complex. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training: a qualitative dimension (which model is best suited for a task based on information, for instance, taken from model cards) and a quantitative dimension (which is the best performing model). The quantitative performance of models is assessed through leaderboards, which rank models based on standardized benchmarks and provide a consistent framework for comparing different LLMs. In this work, we address the analysis of the quantitative dimension by exploring the current leaderboards and benchmarks. To illustrate this analysis, we focus on the medical domain as a case study, demonstrating the evolution, current landscape, and practical significance of this quantitative evaluation dimension. Finally, we propose a Model Selection Methodology (MSM), a systematic approach designed to guide the navigation, prioritization, and selection of the model that best aligns with a given use case.

[4] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

Main category: cs.CL

TL;DR: CCD is a novel inference framework for Diffusion Language Models that improves generation quality and speed through trajectory rectification and adaptive sampling based on historical context consistency.

Details

Motivation: Existing DLM inference methods rely on local, immediate-step metrics (confidence/entropy) that lack reliable perspective, leading to inconsistent sampling trajectories and suboptimal generation quality.

Method: Coherent Contextual Decoding (CCD) with two innovations: 1) Trajectory rectification using historical context to enhance sequence coherence and reject suboptimal paths early, 2) Adaptive sampling strategy that dynamically adjusts unmasking budgets per step based on consistency metrics instead of rigid uniform allocations.

Result: Achieves simultaneous enhancement in inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.

Conclusion: CCD provides a theoretically-grounded framework that improves DLM inference by leveraging historical context coherence and adaptive resource allocation, addressing limitations of current local-metric approaches.

Abstract: Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.

[5] Reversing Large Language Models for Efficient Training and Fine-Tuning

Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

Main category: cs.CL

TL;DR: The paper introduces memory-efficient reversible architectures for LLMs that use time-reversible dynamics to avoid storing intermediate activations, enabling larger batch sizes and improved throughput while maintaining performance.

Details

Motivation: LLM training is expensive and time-consuming, often requiring fine-tuning of pre-trained models. Current architectures store all intermediate activations during training, leading to high memory consumption that limits batch sizes and throughput.

Method: Proposes reversible LLM architectures inspired by symmetric and symplectic differential equations. Uses time-reversible dynamics to retrieve hidden states during backpropagation instead of storing activations. Also introduces a method to convert existing non-reversible LLMs into reversible architectures through fine-tuning.

Result: The reversible architectures achieve comparable or improved performance on several datasets and benchmarks across multiple LLMs. They enable drastic memory reduction, allowing larger batch sizes for the same memory, thereby improving throughput.

Conclusion: Reversible LLM architectures provide a scalable and efficient path to reduce memory and computational costs for both training from scratch and fine-tuning, making LLM development more practical and accessible.

Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Different from standard, baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, relieving the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.

[6] Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

Zirui Lin, Haris Gulzar, Monnika Roslianna Busto, Akiko Masaki, Takeharu Eda, Kazuhiro Nakadai

Main category: cs.CL

TL;DR: The paper explores Memory-Efficient Fine-Tuning (MEFT) methods for dialect identification using pre-trained speech models, achieving significant GPU memory reduction (73.25%) and training speedup (2.1x) while maintaining accuracy comparable to standard fine-tuning.

Details

Motivation: Fine-tuning speech models for dialect identification is computationally expensive in terms of memory and training time. While Parameter-Efficient Fine-Tuning (PEFT) methods offer parameter efficiency, they have limited improvement in memory efficiency and training speed.

Method: The authors apply Memory-Efficient Fine-Tuning (MEFT) methods, originally developed for language processing, to general-purpose pre-trained speech models. They comprehensively analyze GPU memory usage and fine-tuning speed across various MEFT methods, using Whisper model fine-tuned on the KeSpeech dataset for six Mandarin subdialects as a case study.

Result: The MEFT approach reduced GPU memory usage by up to 73.25% and accelerated training speed by a factor of 2.1, while maintaining accuracy comparable to both vanilla fine-tuning and PEFT methods.

Conclusion: MEFT methods provide an effective solution for efficient fine-tuning of speech models for dialect identification, offering substantial improvements in memory efficiency and training speed without sacrificing accuracy, making them practical for resource-constrained scenarios.

Abstract: Dialect Identification (DI) is a task to recognize different dialects within the same language from a speech signal. DI can help to improve the downstream speech related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of computation cost and memory requirement. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to the general-purpose pre-trained speech model. We then comprehensively analyze the GPU memory usage and fine-tuning speed based on various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin subdialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and accelerating training speed by a factor of 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.

[7] Spoken Conversational Agents with Large Language Models

Chao-Han Huck Yang, Andreas Stolcke, Larry Heck

Main category: cs.CL

TL;DR: Tutorial on evolving spoken conversational agents from cascaded ASR/NLU systems to voice-native LLMs, covering adaptation techniques, datasets, metrics, and practical implementation guidance.

Details

Motivation: Spoken conversational agents are transitioning from traditional cascaded architectures to voice-native large language models, requiring comprehensive understanding of adaptation techniques, cross-modal alignment, and practical implementation challenges.

Method: Distills the evolution path from cascaded ASR/NLU to end-to-end systems, frames adaptation of text LLMs to audio, cross-modal alignment, joint speech-text training, reviews datasets and metrics, and compares design choices including cascaded vs. E2E approaches.

Result: Provides systematic review of current approaches, links industrial assistants to open-domain/task-oriented agents, highlights reproducible baselines, and identifies key design trade-offs in streaming, post-ASR correction, and robustness across accents.

Conclusion: Offers practical implementation recipes and clear systems-level roadmap while outlining open challenges in privacy, safety, and evaluation for next-generation spoken conversational agents.

Abstract: Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval-and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

[8] Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation

Pritish N. Desai, Tanay Kewalramani, Srimanta Mandal

Main category: cs.CL

TL;DR: Fine-tuning BERT for hate speech classification using TF-IDF sample selection (75% reduction) and vocabulary augmentation with abusive slang, achieving competitive performance with improved efficiency.

Details

Motivation: Abusive speech on social media constantly evolves with new slang and obfuscated terms that bypass detection systems, requiring adaptive and efficient moderation approaches.

Method: 1) TF-IDF-based sample selection to retain only the most informative 75% of training examples, 2) Augmenting BERT’s tokenizer with domain-specific slang and lexical variants from abusive contexts.

Result: The method achieves competitive performance on a widely used hate speech dataset while significantly improving computational efficiency through reduced training set size.

Conclusion: The approach demonstrates potential for scalable and adaptive abusive content moderation by balancing performance with computational efficiency through intelligent sample selection and vocabulary adaptation.

Abstract: Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data efficient strategy for fine tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT’s native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.

[9] Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang

Main category: cs.CL

TL;DR: RESP is a self-reflective structured pruning framework for reasoning LLMs that preserves reasoning coherence by using self-generated calibration data and decode-only gradient-based importance estimation, achieving near-dense accuracy at 20-30% sparsity.

Details

Motivation: Reasoning LLMs like OpenAI o1 and DeepSeek-R1 are computationally expensive due to large model sizes and lengthy decode-time outputs, making them unsuitable for resource-constrained settings. Existing pruning methods severely damage RLMs' reasoning coherence even at moderate sparsity levels.

Method: RESP uses self-generated reasoning traces as calibration data (instead of human labels), decode-only gradient-based importance estimation, and progressive regeneration to maintain calibration fidelity as sparsity increases, aligning pruning decisions with the model’s reasoning dynamics.

Result: On Qwen3-8B, RESP preserves near-dense accuracy at 20-30% sparsity and substantially outperforms baselines at higher sparsity. At 40% sparsity, RESP achieves 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing strongest baselines by 66.87% and 47% respectively.

Conclusion: RESP demonstrates that aligning pruning with a model’s self-generated reasoning dynamics enables effective compression of reasoning LLMs while preserving their multi-step reasoning capabilities, making them more deployable in resource-constrained environments.

Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model’s reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model’s decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model’s own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model’s reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

[10] A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation

David Ph. Shakouri, Crit Cremers, Niels O. Schiller

Main category: cs.CL

TL;DR: MODOMA is a computational multi-agent system for unsupervised language acquisition experiments where an adult agent teaches a child agent, resulting in knowledge-based language models that can generate and parse new utterances.

Details

Motivation: To create a fully parametrized computational laboratory environment for language acquisition experiments where researchers can control all aspects and explicitly examine acquired grammatical knowledge.

Method: Multi-agent system with adult and child language models using statistical and rule-based procedures for unsupervised language acquisition. The adult agent generates training/test data, and the child agent learns from this interaction.

Result: The child agent successfully acquired functional and content categories from machine-generated data, showing patterns similar to human language acquisition. The system demonstrated validity in modeling language acquisition.

Conclusion: MODOMA introduces novel possibilities for computational language acquisition experiments by providing a controllable, explicit representation of acquired grammatical knowledge that successfully models human-like acquisition patterns.

Abstract: This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well-established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.

[11] Swivuriso: The South African Next Voices Multilingual Speech Dataset

Vukosi Marivatee, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga

Main category: cs.CL

TL;DR: Swivuriso is a 3000-hour multilingual speech dataset for seven South African languages, covering agriculture, healthcare, and general domains, created to address gaps in existing ASR resources.

Details

Motivation: To address significant gaps in existing ASR datasets for South African languages and support development/benchmarking of speech recognition technologies in these underrepresented languages.

Method: Developed a multilingual speech dataset with ethical design principles, covering agriculture, healthcare, and general domains across seven South African languages through systematic data collection procedures.

Result: Created a 3000-hour dataset (Swivuriso) and presented baseline ASR model training/finetuning results, comparing performance with other existing ASR datasets for the target languages.

Conclusion: Swivuriso fills important resource gaps for South African language ASR development and provides a valuable benchmark dataset for future speech recognition research in these languages.

Abstract: This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the langauges concerned.

[12] Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung, Nikolay Malkin, Mirella Lapata

Main category: cs.CL

TL;DR: LiteReason is a latent reasoning method that uses a lightweight projector to generate continuous latent tokens, allowing LLMs to skip reasoning steps and reduce computational cost while maintaining performance.

Details

Motivation: Current LLM reasoning optimization with RL is computationally expensive, especially for narrative tasks requiring many tokens. There's a need for more efficient methods that reduce reasoning length while maintaining performance.

Method: LiteReason uses a Reasoning Projector module to produce continuous latent tokens that help skip reasoning steps. During RL training, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed.

Result: Outperforms latent reasoning baselines and comes close to matching non-latent RL training while reducing final reasoning length by 77-92% on plot hole detection and book chapter generation tasks.

Conclusion: LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve, offering significant computational savings while maintaining strong task performance.

Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or “reasoning traces” that act as latent variables in the generation of an output given a query. A model’s ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model ‘skip’ reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

[13] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models

Olivia Kim

Main category: cs.CL

TL;DR: DETAIL framework evaluates LLM performance across varying prompt specificity levels, finding that more specific prompts improve accuracy, especially for smaller models and procedural tasks.

Details

Motivation: Prompt design is critical for LLM reasoning performance, but the impact of prompt specificity (how detailed or vague prompts are) remains understudied. Understanding this relationship is important for optimizing LLM performance.

Method: Introduced DETAIL framework: generates multi-level prompts using GPT-4, quantifies specificity via perplexity, and assesses correctness using GPT-based semantic equivalence. Evaluated on 30 novel reasoning tasks across GPT-4 and O3-mini models.

Result: Specificity improves LLM accuracy, with stronger effects for smaller models and procedural tasks. The relationship between prompt specificity and performance varies across different model types and task categories.

Conclusion: The findings highlight the need for adaptive prompting strategies tailored to model capabilities and task types. The framework provides tools and data to support further research in prompt optimization and specificity effects.

Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.

[14] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

Liangji Kong, Aditya Joshi, Sarvnaz Karimi

Main category: cs.CL

TL;DR: CAIRNS is a framework that helps farmer advisors get credible, readable answers about climate adaptation from web sources without needing model fine-tuning.

Details

Motivation: Climate adaptation strategies exist in both unstructured (scientific literature) and structured (government data) formats, but accessing credible, readable information from these complex sources is challenging for farmer advisors.

Method: CAIRNS uses a structured ScholarGuide prompt to enhance readability and citation reliability, and a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts for robust evaluation.

Result: CAIRNS outperforms baselines on most metrics using expert-curated question-answers, with ablation studies confirming results on all metrics. LLM-based evaluation shows correlation with human judgment.

Conclusion: The framework enables readable, verifiable, and domain-grounded question-answering for climate adaptation without requiring fine-tuning or reinforcement learning.

Abstract: Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts – farmer advisors – to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

[15] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models

Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, Douglas Teodoro

Main category: cs.CL

TL;DR: Researchers created HealthContradict, an expert-verified dataset with 920 health question-answer pairs and contradictory documents to test language models’ ability to reason over conflicting biomedical contexts and resist incorrect information.

Details

Motivation: To understand how language models use contextual information to answer health questions and how their responses are impacted by conflicting contexts, particularly in biomedical domains where accurate information is critical.

Method: Created HealthContradict dataset with 920 expert-verified instances (questions, factual answers, contradictory documents), tested models in various prompt settings (correct/incorrect/contradictory context), and measured impact on outputs.

Result: HealthContradict provides better distinctions of models’ contextual reasoning than existing benchmarks. Fine-tuned biomedical models excel at exploiting correct context while resisting incorrect context, showing both parametric knowledge and contextual reasoning abilities.

Conclusion: The strength of biomedical language models lies in their dual capability: leveraging parametric knowledge from pretraining AND effectively using correct context while resisting incorrect information, with HealthContradict serving as a valuable evaluation benchmark.

Abstract: How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models’ contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.

[16] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren

Main category: cs.CL

TL;DR: Systematic study of 37 LLMs shows cross-family verification is most effective, post-training reduces self-improvement but boosts cross-family gains, and math/logic tasks are most verifiable.

Details

Motivation: Prior studies on solver-verifier interactions have been limited, focusing mainly on self-verification without examining cross-family verification or the effects of post-training on verification capabilities.

Method: Evaluated 37 models across multiple families, sizes, and base vs. post-trained variants on 9 benchmarks covering diverse reasoning tasks. Introduced and validated “verifier gain” metric to predict performance improvements from verifier-based rejection sampling. Compared self-verification, within-family verification, and cross-family verification.

Result: Cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; mathematical and logical tasks exhibit the highest inherent verifiability. Verifier gain and false positive rate scale with model size and post-training.

Conclusion: Systematic analysis reveals important patterns in LLM verification: cross-family verification outperforms self-verification, post-training changes verification dynamics, and task verifiability varies significantly across domains, with math/logic being most verifiable.

Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.

[17] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering

Lei Fu, Xiang Chen, Kaige Gao Xinyue Huang, Kejian Tong

Main category: cs.CL

TL;DR: KARMA framework improves domain-specific QA by integrating heterogeneous knowledge sources with safety controls for sensitive service domains like healthcare and welfare.

Details

Motivation: Existing LLMs struggle with factual consistency and context alignment in sensitive domains like healthcare policies and government welfare, where both accuracy and safety are critical. Domain-specific QA systems need to integrate heterogeneous knowledge sources while ensuring trustworthy outputs.

Method: KARMA framework features: 1) dual-encoder architecture to fuse structured and unstructured knowledge sources, 2) gated memory unit to dynamically regulate external knowledge integration, and 3) safety-aware controllable decoder using safety classification and guided generation techniques to mitigate unsafe outputs.

Result: Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety metrics.

Conclusion: KARMA offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts, addressing the unique challenges of sensitive domains requiring both accuracy and safety.

Abstract: Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.

[18] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models

Yunchao Wang, Guodao Sun, Zihang Fu, Zhehao Liu, Kaixing Du, Haidong Gao, Ronghua Liang

Main category: cs.CL

TL;DR: TaleFrame is an interactive story generation system that combines LLMs with HCI to enable fine-grained control over story creation through structured units (entities, events, relationships, outline).

Details

Motivation: Current story generation systems lack fine-grained control and clear input specifications, leading to unsatisfactory outputs that don't accurately translate user intent.

Method: Decomposes stories into four structured units (entities, events, relationships, outline), fine-tunes Llama model on 9,851 JSON entries from Tinystories dataset, uses JSON2Story approach, and provides intuitive drag-and-drop interface for user control.

Result: System enables precise story control through structured framework, generates coherent stories from structured data, and supports iterative refinement with seven-dimensional evaluation and suggestions.

Conclusion: TaleFrame successfully addresses the control problem in story generation through structured decomposition and interactive HCI, with quantitative evaluation and user studies demonstrating its usefulness.

Abstract: With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.

[19] A Concise Review of Hallucinations in LLMs and their Mitigation

Parth Pulkundwar, Vivek Dhanawade, Rohit Yadav, Minal Sonkar, Medha Asurlekar, Sarita Rathod

Main category: cs.CL

TL;DR: A concise summary of hallucinations in language models, covering types, causes, and mitigation strategies as a one-stop resource for understanding and addressing this critical NLP challenge.

Details

Motivation: Hallucinations pose a serious threat to the reliability and safety of language models, casting a dangerous shadow over natural language processing. There's a critical need to understand different types of hallucinations, their origins, and effective mitigation approaches to make language models more trustworthy and practical.

Method: The paper provides a comprehensive review/survey approach, systematically categorizing and analyzing hallucinations in language models. It likely examines existing literature, classifies hallucination types, identifies root causes, and synthesizes current mitigation strategies into a structured framework.

Result: The document delivers a concise, straightforward summary serving as a one-stop resource for understanding hallucinations in language models. It provides systematic categorization of hallucination types, analysis of their origins, and practical mitigation approaches for the NLP community.

Conclusion: Hallucinations remain a critical challenge in language models that must be addressed for reliable NLP applications. This comprehensive summary provides essential knowledge about hallucination types, causes, and mitigation strategies, serving as a valuable resource for researchers and practitioners working to make language models more trustworthy and effective.

Abstract: Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.

[20] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

Wei Fan, JinYi Yoon, Bo Ji

Main category: cs.CL

TL;DR: iMAD is a token-efficient multi-agent debate framework that selectively triggers debates only when likely to correct wrong answers, using a lightweight classifier trained on hesitation cues from single-agent self-critiques.

Details

Motivation: Multi-Agent Debate (MAD) improves reasoning on complex tasks but is inefficient for every query due to high computational costs and potential accuracy degradation by overturning correct single-agent answers.

Method: iMAD prompts a single agent to produce structured self-critique responses, extracts 41 interpretable linguistic/semantic hesitation features, and uses a lightweight debate-decision classifier trained with FocusCal loss to selectively trigger MAD.

Result: iMAD reduces token usage by up to 92% while improving final answer accuracy by up to 13.5% across six (visual) question answering datasets against five competitive baselines.

Conclusion: iMAD provides an efficient and effective framework for multi-agent reasoning by intelligently triggering debates only when beneficial, balancing computational efficiency with accuracy improvements.

Abstract: Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).

[21] What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints

Francesco Paolo Savatteri, Chahan Vidal-Gorène, Florian Cafiero

Main category: cs.CL

TL;DR: Textual embeddings (RoBERTa) are strong for fake-news detection, while numeric features work when language models are unavailable. Virality prediction is harder than fake-news detection and sensitive to label construction.

Details

Motivation: To evaluate practical tasks of fake-news detection and virality prediction in operational settings requiring rapid reaction, comparing different feature types and models under realistic constraints.

Method: Used EVONS and FakeNewsNet datasets to compare textual embeddings (RoBERTa, Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). Employed dimensionality reduction (t-SNE vs PCA) and analyzed label construction effects.

Result: Textual content alone strongly discriminates fake news; numeric pipelines remain viable when language models are unavailable. Virality prediction is markedly harder and highly sensitive to label construction. Dimensionality reduction shows non-linear structure more informative for virality. Swapping RoBERTa for Mistral yields only modest differences.

Conclusion: Textual embeddings are effective for fake-news detection, while virality prediction requires careful label construction and benefits from non-linear feature analysis. Practical constraints affect evaluation design, and the field needs better reproducibility practices.

Abstract: We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, with the necessity for rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based ‘‘viral’’ split (<50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.

[22] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce

Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo

Main category: cs.CL

TL;DR: ADORE is a self-sustaining framework for e-commerce search relevance that combines LLM-generated training data with adversarial synthesis and knowledge distillation to overcome data scarcity and improve semantic matching.

Details

Motivation: E-commerce search relevance modeling faces two main challenges: semantic gaps in traditional term-matching methods like BM25, and neural models' reliance on scarce domain-specific hard samples that are difficult to obtain.

Method: ADORE integrates three innovations: 1) Rule-aware Relevance Discrimination using Chain-of-Thought LLM to generate intent-aligned training data refined via Kahneman-Tversky Optimization; 2) Error-type-aware Data Synthesis that auto-generates adversarial examples for robustness; 3) Key-attribute-enhanced Knowledge Distillation that injects domain-specific attribute hierarchies into deployable student models.

Result: Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework demonstrates improved relevance modeling while overcoming data scarcity challenges.

Conclusion: ADORE establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial e-commerce applications by automating annotation, adversarial generation, and distillation processes.

Abstract: Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models’ reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

[23] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M. S. Di, M. Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S. H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Songyang Zhou, Tao Ni, Tao Yun, Tian Pei, Tian Ye, Tianyuan Yue, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjun Gao, Wentao Zhang, Xi Gao, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyuan Li, Xu Chen, Xuecheng Su, Xuehai Pan, Xuheng Lin, Xuwei Fu, Y. Q. Wang, Yang Zhang, Yanhong Xu, Yanru Ma, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yiliang Xiong, Ying He, Ying Zhou, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao, Bei Feng, Hui Li, J. L. Cai, Jiaqi Ni, Lei Xu, Meng Li, Ning Tian, R. J. Chen, R. L. Jin, S. S. Li, Shuang Zhou, Tianyu Sun, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xinnan Song, Xinyi Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Dongjie Ji, Jian Liang, Jianzhong Guo, Jin Chen, Leyi Xia, Miaojun Wang, Mingming Li, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, Shengfeng Ye, T. Wang, W. L. Xiao, Wei An, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Ying Tang, Yukun Zha, Zekai Zhang, Zhe Ju, Zhen Zhang, Zihua Qu

Main category: cs.CL

TL;DR: DeepSeek-V3.2 achieves high efficiency with superior reasoning and agent performance through sparse attention, scalable RL, and agentic task synthesis.

Details

Motivation: To create a model that harmonizes computational efficiency with state-of-the-art reasoning and agent capabilities, competing with leading models like GPT-5 and Gemini-3.0-Pro.

Method: Three key innovations: 1) DeepSeek Sparse Attention (DSA) for efficient long-context processing, 2) Scalable reinforcement learning framework with robust protocols, 3) Large-scale agentic task synthesis pipeline for generating training data.

Result: DeepSeek-V3.2 performs comparably to GPT-5, with the high-compute variant (DeepSeek-V3.2-Speciale) surpassing GPT-5 and matching Gemini-3.0-Pro. Achieved gold-medal performance in both 2025 IMO and IOI competitions.

Conclusion: DeepSeek-V3.2 demonstrates that combining efficient attention mechanisms, scalable RL, and systematic agentic training can achieve state-of-the-art reasoning and agent performance while maintaining computational efficiency.

Abstract: We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

[24] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang

Main category: cs.CL

TL;DR: CAPO introduces a curriculum learning approach for RL-based LLM training that separates positive and negative advantage signals, starting with positive-only samples to build foundations before introducing negative signals for better generalization.

Details

Motivation: Existing RL methods for post-training LLMs mix positive and negative advantage signals from the start, which can create ambiguous guidance and limit performance gains. The authors aim to address this by developing a more structured approach to signal utilization.

Method: CAPO uses an adaptive curriculum mechanism that first bootstraps imitation learning with only positive advantage samples to establish robust foundations, then gradually introduces negative signals to develop discriminative capabilities. The method is compatible with various optimization techniques including GRPO, PPO, RLOO, and Reinforce++.

Result: The method achieves stable and significant improvements in mathematical reasoning tasks and effectively generalizes to multimodal GUI reasoning scenarios, demonstrating versatility as an optimization framework.

Conclusion: CAPO provides a robust curriculum-based approach to RL training for LLMs that separates positive and negative advantage signals, leading to better generalization and performance across diverse reasoning tasks.

Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.

[25] Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

Jing Ma

Main category: cs.CL

TL;DR: LLMs show a primacy effect in document summarization - they give more weight to the first document in multi-document inputs, creating biased summaries.

Details

Motivation: LLMs are increasingly used for summarizing multiple documents (like Google's AI Overviews), but it's unclear if they weight all inputs equally or show biases in how they process document sequences.

Method: Constructed 40 pro-neutral-con article triplets about abortion, permuted each triplet into six input orders, prompted Gemini 2.5 Flash to generate neutral overviews, and evaluated summaries using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). Used one-way ANOVA and pairwise comparisons to analyze positional effects.

Result: Found significant primacy effect for BERTScore across all stances - summaries are more semantically aligned with the first-seen article. Position 1 differs significantly from Positions 2 and 3, while Positions 2 and 3 don’t differ from each other, confirming selective preference for the first document.

Conclusion: LLMs exhibit positional bias (primacy effect) in multi-document summarization, posing risks for applications relying on LLM-generated overviews and agentic AI systems where LLM outputs disproportionately influence downstream actions.

Abstract: Large language models (LLMs) are now used in settings such as Google’s AI Overviews, where it summarizes multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.

Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura

Main category: cs.CL

TL;DR: Empirical comparison of 7 model merging algorithms for bias mitigation in LLMs shows trade-off between bias reduction and downstream performance, with SLERP emerging as most balanced approach.

Details

Motivation: LLMs inherit and amplify societal biases from training data, threatening fairness. While model merging approaches have been explored for bias mitigation, there's no empirical comparison of different algorithms.

Method: Empirically surveyed 7 model merging algorithms (Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, Nearswap) applied to 13 open-weight models from GPT, LLaMA, and Qwen families. Evaluated using 3 bias datasets (BBQ, BOLD, HONEST) and downstream performance on SuperGLUE benchmark.

Result: Found trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, especially on reading comprehension, commonsense, and causal reasoning tasks. Linear, SLERP, and Nearswap consistently reduce bias while maintaining performance, with SLERP at moderate interpolation weights emerging as most balanced choice.

Conclusion: Model merging algorithms show potential for bias mitigation, but excessive debiasing or inappropriate methods degrade linguistic abilities. SLERP with moderate weights offers best balance between bias reduction and performance preservation.

Abstract: Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored ``editing’’ LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying 13 open weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.

[27] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Lavish Bansal, Naman Mishra

Main category: cs.CL

TL;DR: CREST is a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters, using cross-lingual transfer from 13 high-resource languages to generalize to low-resource languages.

Details

Motivation: Existing safety guardrails for LLMs are predominantly tailored for high-resource languages, leaving low-resource language speakers underrepresented. There's a need for universal safety systems that can serve global populations across diverse languages.

Method: CREST uses cluster-based cross-lingual transfer, training on only 13 strategically chosen high-resource languages to generalize to 100 languages total. The model is parameter-efficient with only 0.5B parameters.

Result: CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B+ parameters) across six safety benchmarks.

Conclusion: The research highlights limitations of language-specific guardrails and underscores the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world’s population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

[28] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Julian Ma, Jun Wang, Zafeirios Fountas

Main category: cs.CL

TL;DR: LLMs show emergent Bayesian-like integration of multimodal cues without explicit training, but accuracy doesn’t guarantee robust uncertainty handling.

Details

Motivation: To investigate whether LLMs exhibit implicit Bayesian computational strategies for multimodal integration, similar to human perceptual processing, without explicit training or instruction.

Method: Created BayesBench - a psychophysics-inspired benchmark with four magnitude estimation tasks (length, location, distance, duration) across text and image modalities. Evaluated nine diverse LLMs with human judgments for calibration, using controlled ablations of noise, context, and instruction prompts.

Result: LLMs often adapt in Bayes-consistent ways, but accuracy doesn’t guarantee robustness. GPT-5 Mini achieved perfect text accuracy but failed to integrate visual cues efficiently, revealing dissociation between capability and strategy.

Conclusion: Accuracy-centric benchmarks may miss brittle uncertainty handling; emergent principled uncertainty handling correlates with Bayesian tendencies. The benchmark and consistency metric are released to inform future multimodal architecture designs.

Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.

[29] SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang

Main category: cs.CL

TL;DR: SurveyEval: A benchmark for evaluating LLM-based automatic survey generation systems across quality, coherence, and accuracy dimensions.

Details

Motivation: Existing LLM-based survey systems integrate retrieval, organization, and synthesis into end-to-end pipelines, but there's no comprehensive evaluation framework for these complex systems.

Method: Created SurveyEval benchmark with 3 evaluation dimensions (overall quality, outline coherence, reference accuracy), extended across 7 subjects, and enhanced LLM-as-a-Judge framework with human references for better alignment.

Result: Specialized survey-generation systems produce substantially higher-quality results compared to general long-text or paper-writing systems.

Conclusion: SurveyEval serves as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.

Abstract: LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.

[30] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

Robert Belanec, Ivan Srba, Maria Bielikova

Main category: cs.CL

TL;DR: PEFT-Factory is a unified framework for efficient fine-tuning of Large Language Models that provides 19 PEFT methods, 27 datasets, and standardized evaluation metrics to improve replicability and benchmarking.

Details

Motivation: Current PEFT methods are difficult to replicate, deploy, and compare due to lack of standardization and unified implementation frameworks.

Method: Developed a modular framework based on LLaMA-Factory that supports both off-the-shelf and custom PEFT methods, with native support for 19 PEFT methods, 27 datasets across 12 tasks, and standardized evaluation metrics.

Result: Created a ready-to-use, controlled, and stable environment that improves replicability and benchmarking of PEFT methods, publicly available as open-source software.

Conclusion: PEFT-Factory addresses the reproducibility crisis in PEFT research by providing a unified framework that enables fair comparison and easier deployment of parameter-efficient fine-tuning methods for LLMs.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory

[31] Towards Unification of Hallucination Detection and Fact Verification for Large Language Models

Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu, Ziyi Ye, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: UniFact is a unified evaluation framework that bridges the gap between Hallucination Detection (HD) and Fact Verification (FV) paradigms for assessing LLM hallucinations, showing that hybrid approaches combining both methods achieve best performance.

Details

Motivation: LLM hallucinations undermine trust and adoption, but research has split into two isolated paradigms (model-centric HD and text-centric FV) with different assumptions, datasets, and evaluation protocols, creating a research schism that hinders progress.

Method: Introduces UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Conducts large-scale experiments across multiple LLM families and detection methods.

Result: Three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; (3) hybrid approaches integrating both methods consistently achieve state-of-the-art performance. Provides first in-depth analysis of why FV and HD diverged.

Conclusion: Calls for a new, integrated research agenda to unify Hallucination Detection and Fact Verification in LLMs. Comprehensive experimental results support the need for unification, and all code, data, and baselines are open-sourced.

Abstract: Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all the code, data, and baseline implementation at: https://github.com/oneal2000/UniFact/

[32] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio

Main category: cs.CL

TL;DR: The paper proposes a three-tier data synthesis method for Dialogue-Based Generalized Referring Expressions Comprehension (GREC) to address distribution shift and data scarcity issues, achieving substantial improvements over prior approaches.

Details

Motivation: Existing GREC systems struggle with distribution shift between training and evaluation domains, and there's a scarcity of annotated dialogue grounding data, making it difficult to train robust models.

Method: A three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. The synthesized data is then used for fine-tuning models.

Result: Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics for GREC tasks.

Conclusion: The proposed data synthesis approach effectively addresses the data scarcity and distribution shift problems in dialogue-based GREC, leading to significant performance gains through scalable supervision generation.

Abstract: Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

[33] TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages

Mike Nkongolo, Hilton Vorster, Josh Warren, Trevor Naick, Deandre Vanmali, Masana Mashapha, Luke Brand, Alyssa Fernandes, Janco Calitz, Sibusiso Makhoba

Main category: cs.CL

TL;DR: TriLex framework expands sentiment lexicons for low-resource African languages using corpus extraction, cross-lingual mapping, and RAG refinement, improving sentiment analysis performance of African PLMs.

Details

Motivation: Low-resource African languages are underrepresented in sentiment analysis, limiting lexical coverage and multilingual NLP system performance.

Method: TriLex: three-stage retrieval augmented framework with corpus-based extraction, cross-lingual mapping, and RAG-driven lexical refinement for sentiment lexicon expansion.

Result: AfroXLMR achieved F1-scores >80% for isiXhosa and isiZulu with strong cross-lingual stability. AfriBERTa achieved ~64% F1 despite no pre-training on target languages. Both outperformed traditional ML baselines.

Conclusion: TriLex is a scalable, effective framework for multilingual sentiment lexicon expansion and sentiment modeling in low-resource South African languages.

Abstract: Low-resource African languages remain underrepresented in sentiment analysis, limiting both lexical coverage and the performance of multilingual Natural Language Processing (NLP) systems. This study proposes TriLex, a three-stage retrieval augmented framework that unifies corpus-based extraction, cross lingual mapping, and retrieval augmented generation (RAG) driven lexical refinement to systematically expand sentiment lexicons for low-resource languages. Using the enriched lexicon, the performance of two prominent African pretrained language models (AfroXLMR and AfriBERTa) is evaluated across multiple case studies. Results demonstrate that AfroXLMR delivers superior performance, achieving F1-scores above 80% for isiXhosa and isiZulu and exhibiting strong cross-lingual stability. Although AfriBERTa lacks pre-training on these target languages, it still achieves reliable F1-scores around 64%, validating its utility in computationally constrained settings. Both models outperform traditional machine learning baselines, and ensemble analyses further enhance precision and robustness. The findings establish TriLex as a scalable and effective framework for multilingual sentiment lexicon expansion and sentiment modeling in low-resource South African languages.

[34] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

Yixuan Tang, Yi Yang

Main category: cs.CL

TL;DR: Stable rank is an intrinsic quality signal from LLM representations that measures effective dimensionality, enabling annotation-free alignment and outperforming external supervision methods.

Details

Motivation: Current LLM alignment methods face limitations: human annotations are scarce/subjective, reward models are vulnerable to hacking, and self-evaluation suffers from prompt sensitivity and biases. There's a need for intrinsic, annotation-free quality signals.

Method: Proposes stable rank - measures effective dimensionality of hidden states by computing ratio of total variance to dominant-direction variance. Also introduces Stable Rank Group Relative Policy Optimization (SR-GRPO) that uses stable rank as reward signal for reinforcement learning.

Result: Stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by 11.3 percentage points via Best-of-N sampling. SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming learned reward models and self-evaluation baselines.

Conclusion: Quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision. Stable rank provides an intrinsic, annotation-free approach that addresses limitations of current alignment methods.

Abstract: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.

[35] A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models

Kunning Li, Jianbin Guo, Zhaoyang Shang, Yiqing Liu, Hongmin Du, Lingling Liu, Yuping Zhao, Lifeng Dong

Main category: cs.CL

TL;DR: TCM-BEST4SDT: A comprehensive clinical case-based benchmark for evaluating LLMs in Traditional Chinese Medicine, focusing on Syndrome Differentiation and Treatment capabilities through four tasks and three evaluation mechanisms.

Details

Motivation: Existing LLM evaluations in TCM are limited to knowledge-based QA or syndrome accuracy, lacking assessment of treatment decision-making. There's an urgent need to evaluate LLMs' clinical application capabilities in TCM's individualized, holistic, and diverse "Syndrome Differentiation and Treatment" framework.

Method: Created TCM-BEST4SDT benchmark with four tasks (TCM Basic Knowledge, Medical Ethics, LLM Content Safety, SDT) using rigorous data annotation pipeline. Developed specialized reward model to quantify prescription-syndrome congruence. Evaluation framework integrates three mechanisms: selected-response evaluation, judge model evaluation, and reward model evaluation.

Result: Benchmark validated through experiments on 15 mainstream LLMs (general and TCM-specific). TCM-BEST4SDT is now publicly available to foster intelligent TCM research development.

Conclusion: TCM-BEST4SDT provides a comprehensive clinical case-based benchmark for evaluating LLMs’ TCM application capabilities, addressing limitations of existing evaluations and supporting advancement of intelligent TCM research.

Abstract: The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM’s “Syndrome Differentiation and Treatment” (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks, including TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.

[36] BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion

Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues

Main category: cs.CL

TL;DR: BOOM is a multimodal multilingual lecture companion that translates lecture audio and slides to produce synchronized outputs across text, localized slides, and synthesized speech.

Details

Motivation: The globalization of education and growth of online learning create a need for localizing multimodal educational content (audio+slides) to provide accessible learning experiences in native languages.

Method: An end-to-end system that jointly translates lecture audio and slides, preserving all modalities: translated text, localized slides with visual elements, and synthesized speech.

Result: Slide-aware transcripts provide cascading benefits for downstream tasks like summarization and question answering. Code is released under MIT License.

Conclusion: BOOM enables students to access lectures in their native language while preserving the original multimodal content, addressing critical localization challenges in global education.

Abstract: The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present \textbf{BOOM}, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline}\footnote{All released code and models are licensed under the MIT License.

[37] promptolution: A Unified, Modular Framework for Prompt Optimization

Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, Matthias Feurer

Main category: cs.CL

TL;DR: Promptolution is a unified open-source framework for prompt optimization that integrates multiple optimizers and works with any LLM implementation.

Details

Motivation: Practical adoption of prompt optimization research is hindered by unmaintained, isolated codebases that are tied to specific implementations.

Method: Created a modular, extensible framework that integrates multiple contemporary discrete prompt optimizers while remaining agnostic to underlying LLM implementations.

Result: Provides all components needed for prompt optimization in a single unified system accessible to both practitioners and researchers.

Conclusion: Promptolution addresses the adoption barrier by offering a practical, extensible framework that unifies prompt optimization tools and makes them accessible for real-world use.

Abstract: Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, practical adoption is hindered as existing implementations are often tied to unmaintained and isolated research codebases. To address this, we introduce promptolution, a unified and modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers while remaining agnostic to the underlying LLM implementation.

[38] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens

Main category: cs.CL

TL;DR: System prompt optimization improves multilingual LLM performance by 5-10% through structured prompts that reduce language-switching and enhance reasoning consistency.

Details

Motivation: Real-world LLM deployments need single system prompts that work reliably across languages, but prior work focused only on English. There's a need to understand how prompts steer models toward accurate cross-lingual behavior.

Method: Proposed unified four-dimensional evaluation framework for multilingual prompts; conducted large-scale experiments on 5 languages, 3 LLMs, and 3 benchmarks; developed prompt optimization framework for multilingual settings; analyzed over 10 million reasoning units.

Result: Certain prompt components (CoT, emotion, scenario) correlate with robust multilingual behavior; optimization framework discovered prompts improving all metrics by 5-10%; performant prompts induce more structured/consistent reasoning patterns while reducing unnecessary language-switching.

Conclusion: System prompt optimization provides a scalable path to accurate and robust multilingual LLM behavior, with optimized prompts enhancing reasoning structure and cross-lingual consistency.

Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

[39] Bangla Hate Speech Classification with Fine-tuned Transformer Models

Yalda Keivan Jafari, Krishno Dey

Main category: cs.CL

TL;DR: The paper addresses hate speech detection in low-resource Bangla language, comparing traditional ML baselines with transformer models, finding BanglaBERT performs best despite smaller size.

Details

Motivation: Hate speech recognition in low-resource languages like Bangla is challenging due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Despite being spoken by over 230 million people, Bangla is significantly underrepresented in computational resources for automated moderation on social media platforms.

Method: The study reproduces official baselines (Majority, Random, SVM) and adds Logistic Regression, Random Forest, and Decision Tree as baseline methods. Transformer-based models including DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa are utilized for hate speech classification on BLP 2025 Shared Task Subtasks 1A and 1B.

Result: All transformer-based models outperformed baseline methods except for DistilBERT. BanglaBERT produced the best performance for both subtasks despite being smaller in size, outperforming both m-BERT and XLM-RoBERTa. This suggests language-specific pre-training is crucial for performance.

Conclusion: Language-specific pre-trained models like BanglaBERT are essential for hate speech detection in low-resource languages. The results highlight the potential and need for developing pre-trained language models specifically for Bangla and similar under-resourced languages.

Abstract: Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people of Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly under-represented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and also produce and consider Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also utilized transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification. All the transformer-based models outperformed baseline methods for the subtasks, except for DistilBERT. Among the transformer-based models, BanglaBERT produces the best performance for both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests language-specific pre-training is very important. Our results highlight the potential and need for pre-trained language models for the low-resource Bangla language.

[40] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

Haonan Wang, Chao Du, Kenji Kawaguchi, Tianyu Pang

Main category: cs.CL

TL;DR: ThinkMerge is a training-free decoding strategy that averages next-token logits from parallel reasoning traces at synchronization points to improve open-ended reasoning tasks like code generation and web research, outperforming majority voting.

Details

Motivation: Majority voting works well for close-ended QA but fails for open-ended reasoning tasks where a "majority" over complete solutions is ill-defined. There's a need for effective parallel test-time scaling for open-ended tasks.

Method: ThinkMerge runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. It integrates with vLLM/SGLang and works with standard decoding techniques like Top-p/Top-k.

Result: ThinkMerge matches or surpasses majority voting on AIME and GPQA, improves pass@1 by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B on LiveCodeBench (hard), and enhances web-based deep-research agents across GAIA, BrowseComp-en/zh, and XbenchDeepSearch.

Conclusion: Parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs. ThinkMerge provides consistent gains for code generation and web research tasks while being training-free and plug-and-play.

Abstract: Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a “majority” over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.

[41] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang

Main category: cs.CL

TL;DR: SchED is a training-free early-exit algorithm for diffusion LLMs that accelerates decoding by 2.34-4.0× while maintaining 99.1-100% performance across various tasks.

Details

Motivation: Diffusion LLMs suffer from slow iterative sampling, limiting their practical utility despite being promising alternatives to autoregressive models.

Method: SchED aggregates full-span logit margins and halts decoding when a smooth, progress-dependent confidence threshold is met. It’s training-free and model-agnostic.

Result: Achieves 3.8-4.0× speedups on instruction-tuned models with 99.8-100% performance retention, and 2.34× speedups on base models with 99.1-100% retention across 10 benchmarks.

Conclusion: SchED makes dLLM decoding substantially more efficient by converting confidence stabilization into computational savings, outperforming prior early-exit methods especially on long-form generation.

Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100%$ performance retention, with up to $2.34\times$ under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $γ{=}4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model’s token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.

[42] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs

Tanmay Agrawal

Main category: cs.CL

TL;DR: A framework using visual knowledge graphs to help users identify and correct LLM hallucinations in enterprise settings by linking model assertions to source knowledge with confidence indicators.

Details

Motivation: Enterprise LLMs augmented with domain knowledge often produce credible-looking hallucinations due to limited context windows and knowledge inconsistencies. Current mitigation strategies (gold-standard Q&A curation or secondary model verification) are costly and lack deterministic assurance.

Method: Organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs that link model assertions to underlying sources of truth with confidence indicators. Provides users with visual interface to diagnose inconsistencies and supply corrective feedback.

Result: Creates a human-in-the-loop workflow with structured feedback loop that enhances model reliability and continuously improves response quality by enabling users to identify hallucination zones and weak reasoning chains.

Conclusion: Visual knowledge graphs offer an intuitive approach to mitigate LLM hallucinations in enterprise settings by providing transparency into model reasoning and enabling corrective feedback, addressing limitations of current mitigation strategies.

Abstract: Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.

[43] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang

Main category: cs.CL

TL;DR: AutoNeural is an NPU-native Vision-Language Model architecture that replaces ViT with MobileNetV5-style backbone for stable quantization and integrates SSM with Transformer for linear-time generation, achieving 14x latency reduction and 3x decoding speed.

Details

Motivation: Current Vision-Language Models designed for GPUs perform poorly on Neural Processing Units (NPUs) due to quantization brittleness of Vision Transformers and I/O-bound autoregressive attention mechanisms that fail to utilize NPUs' high arithmetic throughput.

Method: Co-design NPU-native VLM architecture with: 1) MobileNetV5-style backbone using depthwise separable convolutions instead of ViT for stable INT4/8/16 quantization, and 2) Hybrid language backbone combining State-Space Model principles with Transformer layers using gated convolutions for linear-time complexity, eliminating KV caching overhead.

Result: 7x reduction in vision encoder quantization error, 14x end-to-end latency reduction, 3x decoding speed improvement, 4x longer context window than baselines. Validated on Qualcomm SA8295P SoC with real-time performance for automotive cockpit applications.

Conclusion: Rethinking model topology specifically for NPU constraints is essential for robust multi-modal edge intelligence, as demonstrated by AutoNeural’s hardware-aligned design enabling efficient integer-only inference on edge NPUs.

Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision–Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. The AutoNeural also delivers 3x decoding speed and 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.

[44] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic

Muyu Pan, Dheeraj Kodakandla, Mahfuza Farooque

Main category: cs.CL

TL;DR: A framework for translating English sentences to CNF logic using NLP techniques and fine-tuned LLMs to reduce hallucinations in logical translation.

Details

Motivation: LLMs enable automatic natural language to formal logic translation for automated reasoning and software verification, but suffer from hallucinations that compromise precision in logical tasks.

Method: Combines classical NLP with self-defined grammar, symbolic computation libraries, and fine-tuned language models to translate English to logical expressions, then to CNF for satisfiability solving.

Result: Fine-tuned models trained on different grammar settings can intentionally correct hallucinations made by the original model, enabling reliable CNF generation.

Conclusion: The framework successfully reduces hallucinations in logical translation tasks, providing a reliable approach for converting natural language to CNF for automated reasoning applications.

Abstract: Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations-incorrect outputs generated by LLMs are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.

[45] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models

Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh

Main category: cs.CL

TL;DR: MoCoP is a dataset-free, closed-loop framework for continuously evaluating moral consistency in LLMs, revealing stable ethical characteristics rather than short-term fluctuations.

Details

Motivation: Existing alignment frameworks rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning evolves across contexts or temporal scales, highlighting the need for moral consistency evaluation.

Method: Moral Consistency Pipeline (MoCoP) combines three layers: lexical integrity analysis, semantic risk estimation, and reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision.

Result: MoCoP effectively captures longitudinal ethical behavior, showing strong inverse relationship between ethical and toxicity dimensions (rET = -0.81, p<0.001) and near-zero association with response latency (rEL ≈ 0), demonstrating moral coherence emerges as stable characteristics.

Conclusion: MoCoP offers a reproducible foundation for scalable, continuous auditing and advances computational morality in autonomous AI systems by reframing ethical evaluation as dynamic, model-agnostic moral introspection.

Abstract: The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.

[46] On SkipGram Word Embedding Models with Negative Sampling: Unified Framework and Impact of Noise Distributions

Dezhi Liu, Richong Zhang, Ziqiao Wang

Main category: cs.CL

TL;DR: The paper introduces Word-Context Classification (WCC) framework that generalizes SkipGram with Negative Sampling (SGN) word embedding models, showing that using the data distribution as noise distribution yields best performance and convergence speed, while discovering several novel embedding models that outperform existing ones.

Details

Motivation: To create a more general framework for word embedding models that extends beyond the SkipGram with Negative Sampling (SGN) approach, allowing for better understanding of how noise distribution affects embedding quality and training efficiency.

Method: Proposes Word-Context Classification (WCC) framework that generalizes SGN models using noise examples, with theoretical analysis to justify the framework and experimental study of different noise distributions’ impact on learning.

Result: Theoretical analysis supports the WCC framework, and experiments show that using the data distribution as noise distribution yields best embedding performance and fastest convergence. Several novel embedding models discovered within this framework outperform existing WCC models.

Conclusion: The WCC framework successfully generalizes SGN word embedding models, provides theoretical justification for using noise examples, demonstrates that data distribution is optimal for noise distribution, and enables discovery of novel high-performing embedding models.

Abstract: SkipGram word embedding models with negative sampling, or SGN in short, is an elegant family of word embedding models. In this paper, we formulate a framework for word embedding, referred to as Word-Context Classification (WCC), that generalizes SGN to a wide family of models. The framework, which uses some ``noise examples’’, is justified through theoretical analysis. The impact of noise distribution on the learning of the WCC embedding models is studied experimentally, suggesting that the best noise distribution is, in fact, the data distribution, in terms of both the embedding performance and the speed of convergence during training. Along our way, we discover several novel embedding models that outperform existing WCC models.

[47] Unifying Linear-Time Attention via Latent Probabilistic Modelling

Rares Dolga, Lucas Maystre, Marius Cobzarenco, David Barber

Main category: cs.CL

TL;DR: A new directed linear attention mechanism based on probabilistic graphical models achieves competitive performance with standard attention while maintaining linear complexity.

Details

Motivation: Standard Transformers have quadratic attention complexity which limits long-sequence modeling, while existing linear attention methods often sacrifice performance, especially on discrete data like language.

Method: Proposes a directed parameterization of linear attention using probabilistic graphical models, introducing asymmetric structure for causal modeling. Combines global latent-variable attention with local standard attention, and uses recurrent parameterization of queries/keys to avoid incompatible positional encodings.

Result: The model achieves competitive performance with standard attention on language modeling benchmarks and outperforms existing linear attention variants.

Conclusion: The directed linear attention approach successfully addresses limitations of previous linear attention methods while maintaining linear complexity, offering a promising alternative for efficient language modeling.

Abstract: Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves competitive performance with standard attention and outperforms existing linear attention variants.

[48] OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin

Main category: cs.CL

TL;DR: OmniBench is a new benchmark for evaluating multimodal language models’ ability to process and reason across visual, acoustic, and textual inputs simultaneously, revealing current models’ limitations and proposing OmniInstruct dataset for improvement.

Details

Motivation: Current multimodal LLMs focus on integrating multiple modalities but lack exploration of simultaneous processing and reasoning across different inputs. There's a need to evaluate models' ability to handle tri-modal (visual, acoustic, textual) understanding.

Method: Introduced OmniBench benchmark with high-quality human annotations requiring integrated understanding across all three modalities. Developed OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs (omni-language models).

Result: Open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. Most baseline models perform poorly (~50% accuracy) even with textual alternatives to image/audio inputs.

Conclusion: There’s a need for more robust tri-modal integration techniques and training strategies to enhance OLM performance. The OmniBench benchmark and OmniInstruct dataset provide foundations for advancing multimodal reasoning capabilities.

Abstract: Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, an 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Codes and data could be found at our repo (https://github.com/multimodal-art-projection/OmniBench).

[49] Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots

Main category: cs.CL

TL;DR: LLMs can develop steganographic collusion through misaligned rewards, and current oversight methods are insufficient to prevent it.

Details

Motivation: As frontier model agents proliferate, there's concern about systemic risks from unsafe interactions, particularly undetectable collusion through steganography in agent communications.

Method: Two approaches: gradient-based reinforcement learning (GBRL) and in-context reinforcement learning (ICRL) to elicit sophisticated LLM-generated linguistic text steganography.

Result: First demonstration that unintended steganographic collusion can arise from mispecified reward incentives during training. Standard mitigations (passive oversight and communication paraphrasing) are not fully effective.

Conclusion: Steganographic collusion emergence is plausible and requires monitoring/research. Preventing it needs innovation in mitigation techniques beyond current approaches.

Abstract: The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness corresponding countermeasures. To investigate this problem we design two approaches – a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method – for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to mispecified reward incentives during training. Additionally, we find that standard mitigations – both passive oversight of model outputs and active mitigation through communication paraphrasing – are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.

[50] From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models

Charles Zhang, Benji Peng, Xintian Sun, Qian Niu, Junyu Liu, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Ming Liu, Yichao Zhang, Xinyuan Song, Cheng Fei, Caitlyn Heqi Yin, Lawrence KQ Yan, Hongyang He, Tianyang Wang

Main category: cs.CL

TL;DR: This is a comprehensive survey paper reviewing the evolution and current state of word embeddings and language models in NLP, covering foundational concepts, various embedding techniques, applications across domains, and future research directions.

Details

Motivation: To provide a comprehensive overview of the transformation in NLP brought by word embeddings and language models, synthesizing current methodologies and emerging trends to offer researchers and practitioners an in-depth resource for advancing embedding-based language models.

Method: The paper conducts a systematic review and survey of embedding techniques, tracing evolution from sparse representations to dense embeddings, examining static and contextualized embeddings, sentence/document embeddings, multimodal applications, and addressing advanced topics like compression, interpretability, and bias mitigation.

Result: The survey synthesizes current methodologies across the entire spectrum of embedding technologies, from foundational concepts to cutting-edge applications, providing a comprehensive resource that maps the landscape of embedding-based language models and identifies key technical challenges and ethical implications.

Conclusion: This comprehensive survey offers researchers and practitioners a valuable resource to understand the evolution, current state, and future directions of embedding-based language models, emphasizing the need for scalable training, enhanced interpretability, robust multimodal grounding, and addressing ethical concerns to push the boundaries of the field.

Abstract: Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review visits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.

[51] Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo

Main category: cs.CL

TL;DR: This paper introduces LongFinanceQA, a synthetic financial dataset with Chain-of-Thought reasoning to improve LLMs’ long-context understanding, using a Property-based Agentic Inference framework to generate human-like reasoning steps.

Details

Motivation: While LLMs can process longer sequences (2K to 2M+ tokens), simply extending input length doesn't guarantee effective long-context understanding. The paper aims to improve LLMs' ability to reason over long contexts through supervised Chain-of-Thought integration.

Method: 1) Created LongFinanceQA - synthetic financial dataset with intermediate CoT reasoning before final answers; 2) Developed Property-based Agentic Inference (PAI) framework to generate synthetic CoT reasoning through property extraction, retrieval, and summarization; 3) Fine-tuned LLaMA-3.1-8B-Instruct on LongFinanceQA.

Result: 1) GPT-4o-mini with PAI outperformed standard GPT-4o-mini by 20.0% on Loong benchmark; 2) LLaMA-3.1-8B-Instruct fine-tuned on LongFinanceQA achieved 28.0% gain on Loong’s financial subset.

Conclusion: Integrating supervised Chain-of-Thought reasoning with synthetic datasets like LongFinanceQA significantly improves LLMs’ long-context understanding, with PAI’s agentic framework effectively generating human-like reasoning steps that enhance both accuracy and interpretability.

Abstract: Recent advances in Large Language Models (LLMs) have enabled them to process increasingly longer sequences, ranging from 2K to 2M tokens and even beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-based Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI’s reasoning capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark, outperforming standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 28.0% gain on Loong’s financial subset.

[52] SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Tongyao Zhu, Qian Liu, Haonan Wang, Shiqi Chen, Xiangming Gu, Tianyu Pang, Min-Yen Kan

Main category: cs.CL

TL;DR: SkyLadder proposes a short-to-long context window scheduling strategy for LLM pretraining that improves both performance and training efficiency compared to fixed long-context training.

Details

Motivation: The paper is motivated by the finding that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget, despite recent trends toward ever-expanding context windows. This suggests a need for better balancing long-context capability with pretraining efficiency.

Method: SkyLadder is a simple yet effective approach that implements a short-to-long context window transition during pretraining. The method gradually increases context window size rather than using fixed long contexts from the start.

Result: Experiments with 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens show SkyLadder yields consistent gains of up to 3.7% on common benchmarks while achieving up to 22% faster training speeds compared to baselines. It preserves strong standard benchmark performance while matching or exceeding baseline results on long context tasks.

Conclusion: SkyLadder demonstrates that strategically scheduling context window size during pretraining can significantly improve both model performance and training efficiency, challenging the conventional wisdom of using fixed long contexts throughout training.

Abstract: Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.

[53] On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo

Main category: cs.CL

TL;DR: LLMs struggle with temporal reasoning on unseen structured/semi-structured data; the paper introduces RATA dataset and compares various reasoning methods, finding standalone LLMs insufficient for reliable solutions.

Details

Motivation: To explore LLMs' applicability in temporal reasoning tasks over data not present during training, particularly focusing on structured and semi-structured anonymized data where models must rely on reasoning rather than prior knowledge.

Method: Identified 17 common temporal reasoning tasks, created RATA dataset with semi-structured anonymized data, developed direct LLM pipeline, and compared various methodologies including Tree-of-Thought, self-reflexion, and code execution techniques.

Result: Results suggest that achieving scalable and reliable temporal reasoning solutions requires more than just standalone LLMs, highlighting the need for integrated approaches combining different reasoning techniques.

Conclusion: Standalone LLMs are insufficient for robust temporal reasoning on unseen structured data; integrated approaches combining multiple reasoning methodologies are necessary for scalable and reliable performance.

Abstract: The applicability of Large Language Models (LLMs) in temporal reasoning tasks over data that is not present during training is still a field that remains to be explored. In this paper we work on this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the \textit{Reasoning and Answering Temporal Ability} dataset (RATA), featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflexion and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.

Tim Wittenborg, Constantin Sebastian Tremel, Markus Stocker, Sören Auer

Main category: cs.CL

TL;DR: Paper presents a semi-automatic tool using LLMs and knowledge graphs to quantify scientific accuracy of online media, focusing on climate misinformation, but finds current knowledge graphs insufficient for scalable fact-checking.

Details

Motivation: Democratic societies need reliable information to combat misinformation in media. Citizens cannot manually verify the flood of content they consume daily, creating a need for automated verification tools.

Method: Developed a workflow using LLM-based statement extraction and knowledge graph analysis. Media content is semantified and compared against climate-related ground truth knowledge graphs. Involved 27 experts for development/evaluation and 43 anonymous participants for user survey.

Result: Tool provides beneficial veracity indication according to expert evaluations and user surveys. However, current climate change knowledge graphs are insufficient for scalable neurosymbolic fact-checking, and the approach cannot annotate public media at required granularity and scale.

Conclusion: Further work needed to develop FAIR (Findable, Accessible, Interoperable, Reusable) ground truth knowledge graphs and complementary metrics to support scientific civic discourse. Current state of climate knowledge graphs limits practical implementation.

Abstract: Democratic societies need reliable information. Misinformation in popular media, such as news articles or videos, threatens to impair civic discourse. Citizens are, unfortunately, not equipped to verify the flood of content consumed daily at increasing rates. This work aims to quantify the scientific accuracy of online media semi-automatically. We investigate the state of the art of climate-related ground truth knowledge representation. By semantifying media content of unknown veracity, their statements can be compared against these ground truth knowledge graphs. We implemented a workflow using LLM-based statement extraction and knowledge graph analysis. Our implementation can streamline content processing towards state-of-the-art knowledge representation and veracity quantification. Developed and evaluated with the help of 27 experts and detailed interviews with 10, the tool evidently provides a beneficial veracity indication. These findings are supported by 43 anonymous participants from a parallel user survey. This initial step, however, is unable to annotate public media at the required granularity and scale. Additionally, the identified state of climate change knowledge graphs is vastly insufficient to support this neurosymbolic fact-checking approach. Further work towards a FAIR (Findable, Accessible, Interoperable, Reusable) ground truth and complementary metrics is required to support civic discourse scientifically.

[55] Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models

Shuxun Wang, Qingyu Yin, Chak Tou Leong, Qiang Zhang, Linyi Yang

Main category: cs.CL

TL;DR: The paper investigates how induction heads in LLMs cause repetitive token generation (repetition curse) by dominating output logits and proposes attention head regularization to mitigate this issue.

Details

Motivation: The repetition curse phenomenon in LLMs (generating repetitive/cyclic sequences) is widely observed but poorly understood mechanistically. The authors aim to uncover the underlying mechanisms driving this behavior.

Method: Focus on induction heads (attention heads specialized for in-context learning) and analyze their “toxicity” - tendency to dominate output logits during repetition, excluding other attention heads. Propose attention head regularization technique to reduce induction head dominance.

Result: Identified induction heads as key drivers of repetition curse, providing mechanistic explanation for the phenomenon. Demonstrated how their dominance during generation leads to repetitive outputs.

Conclusion: Understanding induction head toxicity offers insights for LLM design/training. Attention head regularization could mitigate repetition curse by reducing induction head dominance, promoting more diverse and coherent outputs.

Abstract: Repetition curse is a phenomenon where Large Language Models (LLMs) generate repetitive sequences of tokens or cyclic sequences. While the repetition curse has been widely observed, its underlying mechanisms remain poorly understood. In this work, we investigate the role of induction heads–a specific type of attention head known for their ability to perform in-context learning–in driving this repetitive behavior. Specifically, we focus on the “toxicity” of induction heads, which we define as their tendency to dominate the model’s output logits during repetition, effectively excluding other attention heads from contributing to the generation process. Our findings have important implications for the design and training of LLMs. By identifying induction heads as a key driver of the repetition curse, we provide a mechanistic explanation for this phenomenon and suggest potential avenues for mitigation. We also propose a technique with attention head regularization that could be employed to reduce the dominance of induction heads during generation, thereby promoting more diverse and coherent outputs.

[56] MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong, Shafiq Joty

Main category: cs.CL

TL;DR: MAS-ZERO is a self-evolved, inference-time framework that automatically designs multi-agent systems for each problem instance without needing validation sets, outperforming manual and automatic MAS baselines across reasoning, coding, and agentic tasks.

Details

Motivation: Current multi-agent systems (MAS) using LLMs rely on manually designed agent roles and communication protocols that don't align well with LLM strengths and can't adapt to novel tasks. Automatic MAS approaches require validation sets for tuning, produce static designs lacking inference-time adaptability, and remove flexibility to reduce to simpler systems.

Method: MAS-ZERO uses meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance. It enables dynamic problem decomposition and agent composition through meta-feedback on solvability and completeness, and can reduce to simpler systems when appropriate, all without requiring a validation set.

Result: Experiments across reasoning (math and graduate-level QA), coding, and agentic (search-based) benchmarks show MAS-ZERO outperforms strong manual and automatic MAS baselines. It achieves average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks while maintaining cost efficiency.

Conclusion: MAS-ZERO represents a significant advancement in automatic MAS design by providing a self-evolved, inference-time framework that dynamically adapts to each problem instance without validation sets, demonstrating superior performance across diverse task types.

Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs’ strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic problem decomposition and agent composition through meta-feedback on solvability and completeness, and reduction to simpler systems when appropriate. Experiments across reasoning (math and graduate-level QA), coding, and agentic (search-based) benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms strong manual and automatic MAS baselines. It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.

[57] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

Yan Yu, Yilun Liu, Minggui He, Shimin Tao, Weibin Meng, Xinhua Yang, Li Zhang, Hongxia Ma, Dengye Li, Daimeng Wei, Boxing Chen, Fuliang Li

Main category: cs.CL

TL;DR: ELSPR is a graph-based framework that identifies and filters ambiguous preference data causing non-transitivity in LLM pairwise evaluations, improving ranking reliability.

Details

Motivation: Non-transitive preferences (A>B, B>C, C>A) in LLM pairwise evaluations undermine ranking reliability, primarily caused by low-quality ambiguous preference data.

Method: ELSPR models pairwise preferences as tournament graphs, uses SCC analysis to quantify non-transitivity, and employs normalized directed graph structural entropy to measure preference clarity. It selectively removes problematic preference data while preserving transitive preferences.

Result: Models fine-tuned on ELSPR-filtered data show 13.8% reduction in non-transitivity, 0.088 decrease in structural entropy, and enhanced discriminative power. Human validation shows discarded data has lower inter-annotator agreement (34.4% vs 52.6%) and model-human consistency (51.2% vs 80.6%).

Conclusion: ELSPR is an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems by addressing non-transitivity through principled data filtering.

Abstract: Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences, where evaluators prefer A over B, B over C, but C over A, fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected components (SCCs) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.

[58] Multilingual Pretraining for Pixel Language Models

Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, Desmond Elliott

Main category: cs.CL

TL;DR: PIXEL-M4 is a multilingual pixel language model pretrained on four diverse languages (English, Hindi, Ukrainian, Simplified Chinese) that outperforms English-only models on non-Latin scripts and shows strong cross-lingual transfer capabilities.

Details

Motivation: While pixel language models (which operate on rendered text images) have shown strong cross-lingual transfer capabilities, multilingual pretraining for these models remains underexplored. The authors aim to enhance pixel language models' ability to support diverse languages through multilingual pretraining.

Method: The authors introduce PIXEL-M4, a pixel language model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. They conduct multilingual evaluations on semantic and syntactic tasks, perform word-level probing analyses, and analyze hidden representations to understand the model’s capabilities.

Result: PIXEL-M4 outperforms English-only counterparts on non-Latin scripts in multilingual evaluations. Word-level probing confirms it captures rich linguistic features even in languages not seen during pretraining. Analysis shows multilingual pretraining yields a semantic embedding space closely aligned across pretraining languages.

Conclusion: Multilingual pretraining substantially enhances pixel language models’ capability to effectively support diverse languages, demonstrating that pixel-based approaches can successfully learn cross-lingual representations through multilingual training.

Abstract: Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.

[59] Do Large Language Models Think Like the Brain? Sentence-Level Evidences from Layer-Wise Embeddings and fMRI

Yu Lei, Xingyang Ge, Yi Zhang, Yiming Yang, Bolei Ma

Main category: cs.CL

TL;DR: LLMs develop brain-like hierarchical representations as they scale, with higher-performing models showing stronger alignment with human neural responses during sentence comprehension.

Details

Motivation: To understand whether LLMs and human brains converge on similar computational principles, and whether brain-like patterns in LLMs emerge from scaling or reflect deeper alignment with human language processing architecture.

Method: Compared hierarchical embeddings from 14 publicly available LLMs with fMRI data from participants exposed to naturalistic narrative stories, constructing sentence-level neural prediction models to identify model layers most correlated with brain region activations.

Result: Improvements in model performance drive evolution of representational architectures toward brain-like hierarchies, with stronger functional and anatomical correspondence at higher semantic abstraction levels.

Conclusion: LLMs develop computational parallels with human brains as they scale, highlighting their potential as models for human language processing.

Abstract: Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how layer-wise representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels. These findings advance our understanding of the computational parallels between LLMs and the human brain, highlighting the potential of LLMs as models for human language processing.

[60] Eka-Eval: An Evaluation Framework for Low-Resource Multilingual Large Language Models

Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh

Main category: cs.CL

TL;DR: EKA-EVAL is a unified evaluation framework for LLMs with zero-code web interface and CLI, supporting 50+ multilingual benchmarks across 9 categories, local/proprietary models, and 11 core capabilities in a modular architecture.

Details

Motivation: Current LLM evaluation frameworks lack global applicability, flexibility, modularity, and comprehensive support for diverse tasks, model types, and linguistic settings, especially for low-resource languages.

Method: Developed EKA-EVAL with zero-code web interface and interactive CLI for accessibility; integrated 50+ multilingual benchmarks across 9 evaluation categories; supports local and proprietary models; designed with modular plug-and-play architecture offering 11 core capabilities.

Result: Outperforms 5 existing baselines with at least 2x better usability, highest user satisfaction, faster setup times, and consistent benchmark reproducibility; first comprehensive suite in single platform.

Conclusion: EKA-EVAL provides a scalable, multilingual evaluation framework with comprehensive coverage, open-source availability, and superior usability compared to existing solutions.

Abstract: The rapid evolution of Large Language Models’ has underscored the need for evaluation frameworks that are globally applicable, flexible, and modular, and that support a wide range of tasks, model types, and linguistic settings. We introduce EKA-EVAL, a unified, end- to-end framework that combines a zero-code web interface and an interactive CLI to ensure broad accessibility. It integrates 50+ multilingual benchmarks across nine evaluation categories, supports local and proprietary models, and provides 11 core capabilities through a modular, plug-and-play architecture. Designed for scalable, multilingual evaluation with support for low-resource multilingual languages, EKA-EVAL is, to the best of our knowledge, the first suite to offer comprehensive coverage in a single platform. Comparisons against five existing baselines indicate improvements of at least 2x better on key usability measures, with the highest user satisfaction, faster setup times, and consistent benchmark reproducibility. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval.

[61] XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models

Zhu Liu, Zhen Hu, Lei Dai, Yu Xuan, Ying Liu

Main category: cs.CL

TL;DR: XISM is an interactive system that combines data-driven inference with expert knowledge to build semantic maps, addressing the scalability vs. interpretability trade-off in linguistic typology.

Details

Motivation: Existing semantic map construction methods face a tension between scalability and interpretability - either relying on labor-intensive expert reasoning or fully automated systems lacking expert involvement.

Method: XISM generates candidate maps via top-down procedures and allows users to iteratively refine edges in a visual interface with real-time metric feedback, combining data-driven inference with expert knowledge.

Result: Experiments in three semantic domains and expert interviews show that XISM improves linguistic decision transparency and controllability while maintaining computational efficiency.

Conclusion: XISM provides a collaborative approach for scalable and interpretable semantic-map building, bridging the gap between automated systems and expert knowledge.

Abstract: Semantic map models visualize systematic relations among semantic functions through graph structures and are widely used in linguistic typology. However, existing construction methods either depend on labor-intensive expert reasoning or on fully automated systems lacking expert involvement, creating a tension between scalability and interpretability. We introduce \textbf{XISM}, an interactive system that combines data-driven inference with expert knowledge. XISM generates candidate maps via a top-down procedure and allows users to iteratively refine edges in a visual interface, with real-time metric feedback. Experiments in three semantic domains and expert interviews show that XISM improves linguistic decision transparency and controllability in semantic-map construction while maintaining computational efficiency. XISM provides a collaborative approach for scalable and interpretable semantic-map building. The system\footnote{https://app.xism2025.xin/} , source code\footnote{https://github.com/hank317/XISM} , and demonstration video\footnote{https://youtu.be/m5laLhGn6Ys} are publicly available.

[62] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: LLMEval-3 is a dynamic evaluation framework using 220k graduate-level questions to address data contamination and leaderboard overfitting in LLM evaluation, featuring automated contamination-resistant curation, anti-cheating architecture, and calibrated LLM-as-a-judge scoring.

Details

Motivation: Existing static benchmarks for LLMs suffer from data contamination (models being trained on test data) and leaderboard overfitting, which obscure true model capabilities and create misleading performance metrics.

Method: Built on 220k proprietary graduate-level questions, LLMEval-3 dynamically samples unseen test sets for each evaluation run. Features include: contamination-resistant data curation, novel anti-cheating architecture, calibrated LLM-as-a-judge process achieving 90% human expert agreement, and relative ranking system for fair comparison.

Result: 20-month longitudinal study of nearly 50 leading models revealed a performance ceiling on knowledge memorization and exposed data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrated exceptional robustness in ranking stability and consistency.

Conclusion: LLMEval-3 provides a robust, credible methodology for assessing true LLM capabilities beyond leaderboard scores, offering strong empirical validation for the dynamic evaluation paradigm and promoting development of more trustworthy evaluation standards.

Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. An 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.

[63] LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback

Chen Xu, Zhenyu Lv, Tian Lan, Xianyang Wang, Luyao Ji, Leyang Cui, Minqiang Yang, Jian Shen, Qunxi Dong, Xiuling Liu, Juan Wang, Bin Hu

Main category: cs.CL

TL;DR: LLMs trained to supervise therapist training by identifying common therapeutic mistakes and providing targeted feedback, avoiding direct patient-facing applications.

Details

Motivation: Direct use of LLMs in psychotherapy raises ethical/safety concerns, but LLMs can be valuable as supervisors to train real therapists. There's a contradiction: clear feedback standards are needed for training, but no absolute "gold standard" exists for appropriate therapeutic behaviors. However, common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback.

Method: Three-step approach: (1) Establish guidelines for mistaken behaviors and targeted correction strategies as standards. (2) Construct human-in-the-loop dialogue-feedback dataset where a mistake-prone agent intentionally makes standard mistakes during interviews, and a supervisor agent locates/identifies mistakes and provides targeted feedback. (3) Fine-tune models on this dataset (MATE) to create final supervisor model for real therapist training.

Result: Models fine-tuned on MATE dataset demonstrate high-quality feedback according to clinical guidelines through automated, human, and downstream assessments, showing significant potential for therapist training scenarios.

Conclusion: LLMs can be effectively deployed as supervisors for therapist training by focusing on identifying common mistakes rather than prescribing “correct” behaviors, providing a safer and more ethical application of AI in psychotherapy training.

Abstract: Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute “gold standard” for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally makes standard mistakes during interviews naturally, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario.

[64] DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: DESIGNER is a pipeline that uses “design logic” to automatically generate large-scale, multidisciplinary reasoning questions from raw documents, creating challenging datasets that significantly improve LLM reasoning capabilities.

Details

Motivation: Current LLMs struggle with complex, multi-step reasoning across diverse disciplines, and existing reasoning datasets lack disciplinary breadth, depth, diversity, and guiding principles for question synthesis.

Method: The DESIGNER pipeline: 1) extracts “design logic” patterns from existing questions across disciplines, 2) matches these logics with source documents (book/web corpus), 3) uses LLMs to generate questions with controllable types and difficulty, creating two large datasets spanning 75 disciplines.

Result: Created DLR-Book (3.04M questions) and DLR-Web (1.66M questions) with greater difficulty and diversity than baselines. Fine-tuning Qwen3 and Llama3 with this data substantially enhanced multidisciplinary reasoning, even surpassing official final models that underwent full post-training.

Conclusion: The DESIGNER pipeline successfully automates high-quality reasoning question generation at scale, demonstrating that carefully synthesized reasoning data can significantly boost LLM reasoning capabilities across diverse disciplines.

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of “design logic” and instruct LLMs to mimic human educators’ question-creation process, enabling the automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to generate reasoning questions with controllable question types and difficulty levels. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post-training process.

[65] WebMall – A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, Christian Bizer

Main category: cs.CL

TL;DR: WebMall is the first offline multi-shop benchmark for evaluating web agents on complex comparison shopping tasks, featuring four simulated shops with heterogeneous product data from Common Crawl.

Details

Motivation: Existing web agent benchmarks either require online testing (live web) or use offline simulated environments, but current offline benchmarks like WebShop, WebArena, and Mind2Web only cover simple e-commerce tasks against single shops. There's a gap for an e-commerce benchmark that simulates multiple shops with heterogeneous product data and requires agents to perform complex shopping tasks.

Method: WebMall creates an offline multi-shop benchmark with four simulated shops populated with product data extracted from Common Crawl. The benchmark includes tasks ranging from specific product searches and price comparisons to advanced queries for complementary/substitute products and checkout processes. The authors validate WebMall using eight agents varying in observation space, short-term memory availability, and LLM architecture.

Result: Validation shows the benchmark is challenging, with even the best-performing agents achieving task completion rates below 55% in categories like “cheapest product search” and “vague product search.” This demonstrates the difficulty of multi-shop comparison shopping tasks.

Conclusion: WebMall fills an important gap in web agent evaluation by providing the first offline multi-shop benchmark for complex comparison shopping tasks, highlighting significant challenges in current web agent capabilities for realistic e-commerce scenarios.

Abstract: LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the users needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, which allow for the exact reproduction of the experimental setup. While DeepShop provides an online benchmark that requires agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, or Mind2Web cover only comparatively simple e-commerce tasks that need to be performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced queries for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with even the best-performing agents achieving task completion rates below 55% in the task categories cheapest product search and vague product search.

[66] Just-in-time and distributed task representations in language models

Yuxuan Li, Declan Campbell, Stephanie C. Y. Chan, Andrew Kyle Lampinen

Main category: cs.CL

TL;DR: Language models form two types of task representations during in-context learning: transferrable representations that evolve sporadically with temporal and semantic locality, and persistent task identity representations.

Details

Motivation: To understand when and how language models form representations for new tasks during in-context learning, particularly distinguishing between transferrable representations that can be used by other models and simpler task category representations.

Method: Studied two types of task representations: transferrable vector representations that can transfer task contexts to another model, and simpler representations of high-level task categories. Analyzed how these representations evolve over the course of context.

Result: Transferrable task representations evolve in non-monotonic, sporadic ways with two-fold locality: temporal locality (coming online only at certain tokens) and semantic locality (capturing small task scopes). Task identity representations persist throughout context. Transferrable representations condense evidence as more examples are provided.

Conclusion: Language models represent new tasks through both an inert, sustained sensitivity to task identity and an active, just-in-time representation for inference, revealing complex dynamics in how task knowledge is formed during in-context learning.

Abstract: Many of language models’ impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We study two different task representations: those that are ‘’transferrable’’ – vector representations that can transfer task contexts to another model instance, even without the full prompt – and simpler representations of high-level task categories. We show that transferrable task representations evolve in non-monotonic and sporadic ways, while task identity representations persist throughout the context. Specifically, transferrable task representations exhibit a two-fold locality. They successfully condense evidence when more examples are provided in the context. But this evidence accrual process exhibits strong temporal locality along the sequence dimension, coming online only at certain tokens – despite task identity being reliably decodable throughout the context. In some cases, transferrable task representations also show semantic locality, capturing a small task ‘‘scope’’ such as an independent subtask. Language models thus represent new tasks on the fly through both an inert, sustained sensitivity to the task and an active, just-in-time representation to support inference.

[67] Evolution and compression in LLMs: On the emergence of human-aligned categorization

Nathaniel Imel, Noga Zaslavsky

Main category: cs.CL

TL;DR: LLMs can develop human-like efficient color categorization systems through Information Bottleneck optimization, with only the most capable models achieving the full range of human-like near-optimal tradeoffs.

Details

Motivation: To investigate whether LLMs can evolve human-aligned semantic systems through the same Information Bottleneck optimization principle that underlies human semantic efficiency, despite not being explicitly trained for this objective.

Method: Two studies: 1) English color-naming study comparing LLMs’ complexity and alignment with human data; 2) Simulated cultural evolution using Iterated in-Context Language Learning (IICLL) to test if LLMs exhibit human-like inductive bias toward IB-efficiency.

Result: Larger instruction-tuned models achieve better English-alignment and IB-efficiency. Only Gemini 2.0 recapitulates the full range of human-like near-optimal IB-tradeoffs, while other models converge to low-complexity solutions. LLMs can iteratively restructure systems toward greater IB-efficiency like humans.

Conclusion: Human-aligned semantic categories can emerge in LLMs via the same Information Bottleneck principle that underlies human semantic efficiency, demonstrating that LLMs can evolve efficient human-like semantic systems.

Abstract: Converging evidence suggests that human systems of semantic categories achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy tradeoff. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-aligned semantic systems? To address this question, we focus on color categorization – a key testbed of cognitive theories of categorization with uniquely rich human data – and replicate with LLMs two influential human studies. First, we conduct an English color-naming study, showing that LLMs vary widely in their complexity and English-alignment, with larger instruction-tuned models achieving better alignment and IB-efficiency. Second, to test whether these LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via a method we refer to as Iterated in-Context Language Learning (IICLL). We find that akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency. However, only a model with strongest in-context capabilities (Gemini 2.0) is able to recapitulate the wide range of near-optimal IB-tradeoffs observed in humans, while other state-of-the-art models converge to low-complexity solutions. These findings demonstrate how human-aligned semantic categories can emerge in LLMs via the same fundamental principle that underlies semantic efficiency in humans.

[68] Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs

Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, David Clifton

Main category: cs.CL

TL;DR: Geometric framework for hallucination detection in LLMs using archetypal analysis and convex hull volume to measure uncertainty at global and local levels.

Details

Motivation: LLMs often hallucinate plausible but incorrect answers, requiring uncertainty quantification for hallucination detection. Existing black-box approaches lack unified geometric interpretation and rely on disjoint heuristics.

Method: Proposes geometric framework based on archetypal analysis of response embeddings. Global: Geometric Volume measures convex hull volume of archetypes. Local: Geometric Suspicion ranks reliability using spatial relationships between responses and archetypes.

Result: Framework performs comparably or better than prior methods on short-form QA datasets, achieves superior results on medical datasets where hallucinations are critical. Theoretical link proven between convex hull volume and entropy.

Conclusion: Geometric approach provides unified framework for hallucination detection with continuous semantic boundary points, enabling reliable response selection and addressing critical hallucination risks.

Abstract: Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, requiring estimates for both global uncertainty (attributed to a batch of responses) and local uncertainty (attributed to individual responses). While recent black-box approaches have shown some success, they often rely on disjoint heuristics or graph-theoretic approximations that lack a unified geometric interpretation. We introduce a geometric framework to address this, based on archetypal analysis of batches of responses sampled with only black-box model access. At the global level, we propose Geometric Volume, which measures the convex hull volume of archetypes derived from response embeddings. At the local level, we propose Geometric Suspicion, which leverages the spatial relationship between responses and these archetypes to rank reliability, enabling hallucination reduction through preferential response selection. Unlike prior methods that rely on discrete pairwise comparisons, our approach provides continuous semantic boundary points which have utility for attributing reliability to individual responses. Experiments show that our framework performs comparably to or better than prior methods on short form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. We also provide theoretical justification by proving a link between convex hull volume and entropy.

[69] Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Inés Altemir Mariñas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Frederike Lübeck, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Kyle Matoba, Simon Matrenok, Henrique Mendonça, Fawzi Roberto Mohamed, Syrielle Montariol, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Matteo Pagliardini, Elia Palme, Andrei Panferov, Léo Paoletti, Marco Passerini, Ivan Pavlov, Auguste Poiroux, Kaustubh Ponkshe, Nathan Ranchin, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Kumar Shridhar, Raghav Singhal, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Xiaozhe Yao, Hao Zhao, Alexander Ilic, Ana Klimovic, Andreas Krause, Caglar Gulcehre, David Rosenthal, Elliott Ash, Florian Tramèr, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Torsten Hoefler, Antoine Bosselut, Martin Jaggi, Imanol Schlag

Main category: cs.CL

TL;DR: Apertus is a fully open LLM suite trained exclusively on openly available data with robots.txt compliance, using Goldfish objective to suppress memorization while covering 1800+ languages with 40% non-English data, achieving SOTA results among fully open models.

Details

Motivation: Address two systemic shortcomings in today's open model ecosystem: data compliance (respecting content-owner rights and filtering problematic content) and multilingual representation (expanding beyond English-centric models).

Method: Pretrained exclusively on openly available data with robots.txt compliance, filtering for non-permissive/toxic/PII content. Used Goldfish objective to suppress verbatim recall while retaining performance. Trained on 15T tokens from 1800+ languages with ~40% non-English data. Released at 8B and 70B scales with full scientific artifacts.

Result: Approaches state-of-the-art results among fully open models on multilingual benchmarks, rivaling or surpassing open-weight counterparts. Models demonstrate strong performance while respecting data compliance and offering broad multilingual coverage.

Conclusion: Apertus provides a fully transparent, compliant LLM suite that addresses data rights and multilingual gaps in open models, enabling transparent audit and extension through complete artifact release with permissive licensing.

Abstract: We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.

[70] ParlAI Vote: A Web Platform for Analyzing Gender and Political Bias in Large Language Models

Wenjie Lin, Hange Liu, Yingying Zhuang, Xutao Mao, Jingwei Shi, Xudong Han, Tianyu Shi, Jinrui Yang

Main category: cs.CL

TL;DR: ParlAI Vote is a web platform for exploring European Parliament debates/votes and testing LLMs on vote prediction and bias analysis, revealing systematic performance biases in state-of-the-art models.

Details

Motivation: To create an interactive system that connects European Parliament debates, speeches, and voting outcomes with demographic data, enabling analysis of LLM performance on political tasks and revealing biases in current models.

Method: Developed a web platform that integrates debate topics, speeches, roll-call votes, and demographic data (gender, age, country, political group). Implements the EuroParlVote benchmark with tasks for gender classification and vote prediction, allowing users to compare real voting outcomes with LLM predictions and visualize error breakdowns by demographic groups.

Result: The platform successfully highlights systematic performance bias in state-of-the-art LLMs, provides model reasoning visualization, and unifies data, models, and visual analytics in a single interface that lowers barriers for reproducing findings and auditing model behavior.

Conclusion: ParlAI Vote serves as a valuable tool for research, education, and public engagement with legislative decision-making, while clearly demonstrating both the capabilities and limitations of current LLMs in political analysis through interactive exploration of biases and model behavior.

Abstract: We present ParlAI Vote, an interactive web platform for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This web system connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. It unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. This web platform also shows model reasoning, helping users see why errors occur and what cues the models rely on. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.

[71] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions

Jungsoo Park, Ethan Mendes, Gabriel Stanovsky, Alan Ritter

Main category: cs.CL

TL;DR: PRECOG enables forecasting LLM performance from task descriptions alone, without running experiments, to inform study design and resource allocation.

Details

Motivation: Current LLM evaluation requires building benchmarks and running experiments, creating a bottleneck. The paper aims to enable performance forecasting before experiments to support better study design, pilot scoping, and resource allocation decisions.

Method: Created PRECOG corpus with 2,290 redacted description-performance pairs from 1,519 papers. Used text-only performance forecasting where models predict scores from task descriptions and configurations without dataset access. Tested in leakage-free settings using papers published after model knowledge cutoff.

Result: Performance forecasting is challenging but feasible: reasoning models achieve moderate prediction performance (MAE as low as 9.9) with well-calibrated uncertainty. Even in zero-leakage settings (forecasting on newly released datasets), models achieve nontrivial prediction accuracy.

Conclusion: PRECOG corpus enables anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization before running actual experiments, representing an initial step toward open-ended performance forecasting.

Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, run models, then iterate. We ask a question: can we forecast outcomes before running any experiments to inform earlier study design? For example, a team building an AI assistant for a certain task can estimate whether expected performance is around 50 or closer to 80, evidence that supports whether to proceed to a pilot study, how to scope it, and how to allocate resources. We study text-only performance forecasting, where a model predicts a score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a leakage free test split using papers published after the knowledge cutoff of the evaluated models. Experiments show the task is challenging but feasible: reasoning models achieve moderate prediction performance with well calibrated uncertainty, reaching mean absolute error as low as 9.9 at high confidence thresholds. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT5 with built in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.

[72] Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver

Main category: cs.CL

TL;DR: Researchers created a crisis taxonomy, dataset, and evaluation protocol to assess LLM safety in mental health contexts, finding significant risks in chatbot responses to suicidal and self-harm crises.

Details

Motivation: LLM-powered chatbots are increasingly used in mental health support, but there's a lack of unified crisis taxonomies and clinical evaluation standards to ensure safe detection and response to suicidal ideation and self-harm crises.

Method: Created a six-category crisis taxonomy, curated 2,252 examples from 12 mental health datasets, developed a clinical response assessment protocol, tested three LLMs for automatic crisis classification, and evaluated five models’ response appropriateness on a 5-point Likert scale.

Result: While some models respond reliably to explicit crises, many outputs in self-harm and suicidal categories are inappropriate or unsafe. Models vary in performance, with some having low harm rates while others generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment.

Conclusion: Urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. Alignment and safety practices beyond scale are crucial for reliable crisis support. The taxonomy, datasets, and evaluation methods support ongoing AI mental health research to reduce harm and protect vulnerable users.

Abstract: Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinical-informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user’s crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Different models perform variably; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.

[73] Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini

Main category: cs.CL

TL;DR: Hyperdimensional Probe: A hybrid supervised probe that combines symbolic representations with neural probing using Vector Symbolic Architectures to unify input-oriented feature extraction and output-oriented logit investigation for better LLM interpretability.

Details

Motivation: Current LLM interpretability methods are fragmented - input-oriented methods (supervised probes, SAEs) focus on feature extraction while output-oriented methods (logit approaches) inspect distributions. A full understanding requires integrating both perspectives, which existing approaches struggle with due to constraints on latent feature definitions.

Method: Introduces Hyperdimensional Probe, a hybrid supervised probe that combines symbolic representations with neural probing using Vector Symbolic Architectures (VSAs) and hypervector algebra. It unifies top-down interpretability of supervised probes, SAE’s sparsity-driven proxy space, and output-oriented logit investigation.

Result: The method consistently extracts meaningful concepts across different LLMs, embedding sizes, and experimental setups. It uncovers concept-driven patterns in analogy-oriented inference and QA-focused text generation, supporting joint input-output analysis.

Conclusion: By supporting joint input-output analysis, this work advances semantic understanding of neural representations while unifying the complementary perspectives of prior interpretability methods.

Abstract: Despite their capabilities, Large Language Models (LLMs) remain opaque with limited understanding of their internal representations. Current interpretability methods either focus on input-oriented feature extraction, such as supervised probes and Sparse Autoencoders (SAEs), or on output distribution inspection, such as logit-oriented approaches. A full understanding of LLM vector spaces, however, requires integrating both perspectives, something existing approaches struggle with due to constraints on latent feature definitions. We introduce the Hyperdimensional Probe, a hybrid supervised probe that combines symbolic representations with neural probing. Leveraging Vector Symbolic Architectures (VSAs) and hypervector algebra, it unifies prior methods: the top-down interpretability of supervised probes, SAE’s sparsity-driven proxy space, and output-oriented logit investigation. This allows deeper input-focused feature extraction while supporting output-oriented investigation. Our experiments show that our method consistently extracts meaningful concepts across LLMs, embedding sizes, and setups, uncovering concept-driven patterns in analogy-oriented inference and QA-focused text generation. By supporting joint input-output analysis, this work advances semantic understanding of neural representations while unifying the complementary perspectives of prior methods.

[74] Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

Main category: cs.CL

TL;DR: LLMs trained to reason about instruction hierarchies improve conflict resolution, safety, and controllability by 20% on benchmarks.

Details

Motivation: As LLMs take on high-stakes decision-making roles, they need to reconcile competing instructions from multiple sources (developers, users, tools) within prompts. Enforcing instruction hierarchies where higher-level directives override lower-priority requests is critical for LLM reliability and controllability.

Method: Reframe instruction hierarchy resolution as a reasoning task where models must “think” about relationships between user prompts and system instructions. Construct VerIH dataset (~7K aligned/conflicting system-user instructions with verifiable answers). Use lightweight reinforcement learning with VerIH to transfer general reasoning capabilities to instruction prioritization.

Result: Finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, with ~20% improvement on IHEval conflict setup. Reasoning ability generalizes to safety-critical settings, enhancing robustness against jailbreak and prompt injection attacks with up to 20% reduction in attack success rate.

Conclusion: Reasoning over instruction hierarchies provides a practical path to reliable LLMs where updates to system prompts yield controllable and robust changes in model behavior, improving both performance and safety.

Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first “think” about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

[75] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

Fangqi Dai, Xingjian Jiang, Zizhuang Deng

Main category: cs.CL

TL;DR: HLPD is a new method for detecting machine-revised text by optimizing scoring models to prefer human-like writing patterns, achieving significant improvements over existing methods in identifying texts revised by advanced LLMs.

Details

Motivation: Existing methods for detecting LLM-generated text struggle with advanced LLM outputs and adversarially revised text, especially in black-box settings where the generating model is unknown. This creates risks of misinformation from trustworthy-looking LLM content.

Method: Proposes Human Language Preference Detection (HLPD) using Human Language Preference Optimization (HLPO) - a reward-based alignment process that shifts scoring models’ token distributions toward human-like writing patterns, making them more sensitive to human writing for better detection of machine-revised text.

Result: HLPD achieves 15.11% relative improvement in AUROC over ImBD for GPT-revised text, and 45.56% improvement over Fast-DetectGPT. For advanced LLM-generated text, HLPD achieves highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%.

Conclusion: HLPD effectively addresses limitations of existing methods by leveraging human writing style patterns, demonstrating superior performance in detecting machine-revised text across diverse revision scenarios and advanced LLMs.

Abstract: To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model’s token distribution toward human-like writing, making the model more sensitive to human writing, therefore enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.

[76] AI Text Detectors and the Misclassification of Slightly Polished Arabic Text

Saleh Almohaimeed, Saad Almohaimeed, Mousa Jari, Khaled A. Alobaid, Fahad Alotaibi

Main category: cs.CL

TL;DR: Arabic AI detection models fail when human-authored articles are slightly polished by AI, with accuracy dropping dramatically from 92% to as low as 12%.

Details

Motivation: Current AI detectors misclassify slightly polished human-authored Arabic articles as AI-generated, risking false accusations of AI plagiarism and damaging detector credibility, with no existing solutions for Arabic language.

Method: Created two datasets: 1) 800 Arabic articles (400 AI-generated, 400 human-authored) to evaluate 14 LLMs and commercial detectors; 2) Ar-APT dataset with 400 human-authored articles polished by 10 LLMs using 4 settings (16,400 samples) to test 8 best models on polished text detection.

Result: All AI detectors incorrectly attribute polished human articles to AI. Claude-4 Sonnet dropped from 83.51% to 57.63% accuracy for LLaMA-3 polished text. Originality.AI dropped from 92% to 12% accuracy for Mistral or Gemma-3 polished articles.

Conclusion: Slight AI polishing of human-authored Arabic text severely degrades AI detector performance, highlighting vulnerability of current detection systems and need for more robust Arabic AI detection solutions.

Abstract: Many AI detection models have been developed to counter the presence of articles created by artificial intelligence (AI). However, if a human-authored article is slightly polished by AI, a shift will occur in the borderline decision of these AI detection models, leading them to consider it as AI-generated article. This misclassification may result in falsely accusing authors of AI plagiarism and harm the credibility of AI detectors. In English, some efforts were made to meet this challenge, but not in Arabic. In this paper, we generated two datasets. The first dataset contains 800 Arabic articles, half AI-generated and half human-authored. We used it to evaluate 14 Large Language models (LLMs) and commercial AI detectors to assess their ability in distinguishing between human-authored and AI-generated articles. The best 8 models were chosen to act as detectors for our primary concern, which is whether they would consider slightly polished human-authored text as AI-generated. The second dataset, Ar-APT, contains 400 Arabic human-authored articles polished by 10 LLMs using 4 polishing settings, totaling 16400 samples. We use it to evaluate the 8 nominated models and determine whether slight polishing will affect their performance. The results reveal that all AI detectors incorrectly attribute a significant number of articles to AI. The best performing LLM, Claude-4 Sonnet, achieved 83.51%, its performance decreased to 57.63% for articles slightly polished by LLaMA-3. Whereas the best performing commercial model, originality.AI, achieves 92% accuracy, dropped to 12% for articles slightly polished by Mistral or Gemma-3.

[77] A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis

Heger Arfaoui, Mohammed Iheb Hergli, Beya Benzina, Slimane BenMiled

Main category: cs.CL

TL;DR: Systematic BERTopic framework for focus group transcripts achieves 18% higher coherence than LDA with validated interpretable topics, revealing stability metrics capture distinct constructs requiring multi-criteria evaluation.

Details

Motivation: Traditional manual coding of focus group discussions is labor-intensive, limiting scalability and reproducibility. Need automated, systematic approaches for qualitative data analysis.

Method: Applied BERTopic to 10 focus group transcripts (1,075 utterances) about HPV vaccine perceptions in Tunisia. Conducted comprehensive hyperparameter exploration across 27 configurations with bootstrap stability analysis, performance metrics, and comparison with LDA baseline.

Result: Bootstrap analysis showed stability metrics (NMI and ARI) strongly disagree (r = -0.691) and have divergent relationships with coherence. Selected 7-topic model achieved 18% higher coherence than optimized LDA (0.573 vs. 0.486) with validated interpretability (ICC = 0.700, weighted Cohen’s kappa = 0.678).

Conclusion: Transformer-based topic modeling can extract interpretable themes from small focus group corpora when systematically configured, but quality metrics capture distinct, sometimes conflicting constructs requiring multi-criteria evaluation. Provides reproducible framework for qualitative analysis.

Abstract: Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a systematic framework for applying BERTopic to focus group transcripts using data from ten focus groups exploring HPV vaccine perceptions in Tunisia (1,075 utterances). We conducted comprehensive hyperparameter exploration across 27 configurations, evaluating each through bootstrap stability analysis, performance metrics, and comparison with LDA baseline. Bootstrap analysis revealed that stability metrics (NMI and ARI) exhibited strong disagreement (r = -0.691) and showed divergent relationships with coherence, demonstrating that stability is multifaceted rather than monolithic. Our multi-criteria selection framework yielded a 7-topic model achieving 18% higher coherence than optimized LDA (0.573 vs. 0.486) with interpretable topics validated through independent human evaluation (ICC = 0.700, weighted Cohen’s kappa = 0.678). These findings demonstrate that transformer-based topic modeling can extract interpretable themes from small focus group transcript corpora when systematically configured and validated, while revealing that quality metrics capture distinct, sometimes conflicting constructs requiring multi-criteria evaluation. We provide complete documentation and code to support reproducibility.

[78] Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning

Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou, Usman Naseem

Main category: cs.CL

TL;DR: KardiaBench is a large-scale benchmark for identity-aware emotional reasoning in conversational agents, with Kardia-R1 framework using rubric-based reinforcement learning for interpretable empathetic cognition.

Details

Motivation: Current conversational agents lack identity-aware emotional reasoning due to: (1) situation-centric datasets without persistent user identity, and (2) opaque reward signals that hinder verifiable empathetic reasoning development.

Method: Created KardiaBench dataset (178,080 QA pairs across 22,080 conversations anchored to 671 real profiles) using model-in-the-loop pipeline with iterative rubric refinement. Proposed Kardia-R1 framework with Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method using explainable rubric rewards.

Result: Kardia-R1 consistently outperforms other methods across four LLM backbones in emotion accuracy, empathy, relevance, persona consistency, and safety metrics.

Conclusion: The work addresses limitations in empathetic AI by providing a user-grounded benchmark and framework for interpretable, identity-aware emotional reasoning, with both dataset and model publicly released.

Abstract: As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline that integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms othet methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.

cs.CV

[79] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale

Mirela G. Tulbure, Julio Caineta, Mark Broich, Mollie D. Gaines, Philippe Rufin, Leon-Friedrich Thomas, Hamed Alemohammad, Jan Hemmerling, Patrick Hostert

Main category: cs.CV

TL;DR: Fine-tuning TerraMind GFM on multimodal FloodsNet dataset improves flood mapping accuracy, with base-unfrozen model offering best balance of performance and computational efficiency.

Details

Motivation: Floods are devastating hazards, especially in record-warm 2024. While EO satellites provide critical coverage, operational accuracy depends on labeled data and model generalization. GFMs like TerraMind offer improved generalizability but their performance on diverse global flood events remains poorly understood.

Method: Fine-tuned TerraMind for flood extent mapping using FloodsNet dataset (85 global flood events with co-located Sentinel-1 SAR and Sentinel-2 optical imagery). Tested four configurations: base vs. large models with frozen vs. unfrozen backbones. Compared against TerraMind Sen1Floods11 example and U-Net trained on both datasets.

Result: Base-unfrozen configuration provided best balance of accuracy, precision, and recall at lower computational cost. Large unfrozen model achieved highest recall. FloodsNet-trained models outperformed Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations but with slightly lower accuracy and precision.

Conclusion: Integrating multimodal optical and SAR data with GFM fine-tuning enhances near-real-time flood mapping. This study provides one of the first global-scale GFM evaluations for flood segmentation, highlighting both potential and limitations for climate adaptation and disaster resilience.

Abstract: Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM’s TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.

[80] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework

Haojin Deng, Yimin Yang

Main category: cs.CV

TL;DR: Proposes a context-enriched contrastive loss function with two convergence targets to address information distortion in contrastive learning, improving both generalization and convergence speed across 8 benchmark datasets.

Details

Motivation: Contrastive learning can introduce information distortion where models over-rely on samples with identical labels while neglecting positive pairs from the same original image, especially in large datasets.

Method: A context-enriched contrastive loss function with two components: (1) label-sensitive component that differentiates features of identical vs. distinct classes, and (2) component that draws closer augmented samples from the same source image while distancing all other samples.

Result: Outperforms 16 state-of-the-art contrastive learning methods on 8 benchmark datasets (CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, CelebA) in both generalization and convergence speed. Shows 22.9% improvement on BiasedMNIST dataset for addressing systematic distortion.

Conclusion: The proposed method effectively addresses information distortion in contrastive learning, leading to more efficient and equitable downstream training with superior performance across diverse datasets.

Abstract: Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples. We evaluate the proposed approach on image classification tasks, which are among the most widely accepted 8 recognition large-scale benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA datasets. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.

[81] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, Jiyoung Lee

Main category: cs.CV

TL;DR: SelVA introduces text-conditioned selective video-to-audio generation that produces only user-intended sounds from multi-object videos, addressing the limitation of current approaches that generate mixed sounds.

Details

Motivation: Current video-to-audio generation approaches produce single source-mixed sounds at once, which is insufficient for multimedia production where audio tracks need individual handling for precise editing, mixing, and creative control. Visual features are often entangled, and region cues/prompts fail to specify target sources.

Method: SelVA treats text prompts as explicit selectors of target sources and modulates the video encoder to extract prompt-relevant video features. It uses supplementary tokens to suppress text-irrelevant activations via cross-attention with efficient parameter tuning. The model employs self-augmentation to overcome the lack of mono audio track supervision.

Result: SelVA is evaluated on VGG-MONOAUDIO benchmark and shows effectiveness across audio quality, semantic alignment, and temporal synchronization. Extensive experiments and ablations consistently verify its performance.

Conclusion: SelVA successfully addresses the selective V2A generation task by using text prompts as explicit selectors and modulating video features, providing robust semantic and temporal grounding for multimedia production applications.

Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.

[82] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, Tom Goldstein

Main category: cs.CV

TL;DR: Proposes a joint evaluation framework for text-to-image models and vision-language models using 27 specific failure modes, with a dataset of prompts, generated images, and VLM annotations to systematically analyze attribute fidelity errors.

Details

Motivation: Current T2I models often fail to accurately capture specific attributes in prompts (like object counts and colors), while VLM benchmarks haven't kept pace with scene complexity needed to evaluate these failures. There's a need for hierarchical evaluation to compare prompt adherence across models.

Method: Proposes structured methodology to jointly evaluate T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in images generated from challenging prompts. Creates dataset with prompts, images from 5 T2I models (Flux, SD3 variants), and corresponding annotations from 3 VLMs (Molmo, InternVL3, Pixtral) processed by LLM (Llama3).

Result: Analysis reveals systematic errors in attribute fidelity and object representation across models. Current metrics are insufficient to capture nuanced errors, highlighting the need for targeted benchmarks.

Conclusion: Targeted benchmarks are crucial for advancing generative model reliability and interpretability, as current evaluation methods fail to capture the specific failure modes in prompt adherence.

Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

[83] Mapping of Lesion Images to Somatic Mutations

Rahul Mehta

Main category: cs.CV

TL;DR: LLOST: A deep latent variable model that predicts patients’ somatic mutation profiles from medical images using dual variational autoencoders with a shared latent space and conditional normalizing flow priors.

Details

Motivation: Early cancer diagnosis and treatment are crucial for efficacy. Medical imaging is used initially for diagnosis, while genetic information guides treatment selection. The paper aims to bridge these domains by predicting somatic mutations from medical images to enable earlier genetic insights.

Method: 1) Convert lesion images to point cloud representations for modality invariance. 2) Propose LLOST model with dual variational autoencoders for lesion point clouds and somatic mutation counts. 3) Use a shared latent space to unify features from both domains. 4) Employ conditional normalizing flow priors for each of the three latent spaces to handle diverse distributions.

Result: Model shows predictive performance for specific mutation counts and accurate mutation occurrence prediction. Identifies shared patterns between imaging and somatic mutation domains that reflect cancer type. Experiments conducted on TCIA medical images and Pan Cancer dataset somatic mutations.

Conclusion: The model successfully links imaging and genetic domains. Future improvements could include expanding to other genetic domains and enhancing model architecture.

Abstract: Medical imaging is a critical initial tool used by clinicians to determine a patient’s cancer diagnosis, allowing for faster intervention and more reliable patient prognosis. At subsequent stages of patient diagnosis, genetic information is extracted to help select specific patient treatment options. As the efficacy of cancer treatment often relies on early diagnosis and treatment, we build a deep latent variable model to determine patients’ somatic mutation profiles based on their corresponding medical images. We first introduce a point cloud representation of lesions images to allow for invariance to the imaging modality. We then propose, LLOST, a model with dual variational autoencoders coupled together by a separate shared latent space that unifies features from the lesion point clouds and counts of distinct somatic mutations. Therefore our model consists of three latent space, each of which is learned with a conditional normalizing flow prior to account for the diverse distributions of each domain. We conduct qualitative and quantitative experiments on de-identified medical images from The Cancer Imaging Archive and the corresponding somatic mutations from the Pan Cancer dataset of The Cancer Genomic Archive. We show the model’s predictive performance on the counts of specific mutations as well as it’s ability to accurately predict the occurrence of mutations. In particular, shared patterns between the imaging and somatic mutation domain that reflect cancer type. We conclude with a remark on how to improve the model and possible future avenues of research to include other genetic domains.

[84] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, Weili Guan

Main category: cs.CV

TL;DR: HUD is a novel framework for Composed Video Retrieval that addresses modality disparity between video and text through hierarchical uncertainty-aware disambiguation, achieving state-of-the-art performance on both CVR and CIR tasks.

Details

Motivation: Previous CVR approaches overlook the information density disparity between video (rich semantics) and text (sparse semantics), leading to modification subject referring ambiguity and limited detailed semantic focus, which degrade retrieval performance.

Method: Proposes Hierarchical Uncertainty-aware Disambiguation network (HUD) with three components: Holistic Pronoun Disambiguation for cross-modal interaction, Atomistic Uncertainty Modeling for fine-grained semantics, and Holistic-to-Atomistic Alignment for precise feature learning.

Result: Achieves state-of-the-art performance across three benchmark datasets for both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks.

Conclusion: HUD effectively addresses modality disparity issues in multi-modal queries by leveraging hierarchical uncertainty-aware disambiguation, enabling precise composed feature learning and superior retrieval performance across video and image domains.

Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.

[85] Spatiotemporal Pyramid Flow Matching for Climate Emulation

Jeremy Andrew Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao, Nomin-Erdene Bayarsaikhan, Daniele Visioni, Andrew Y. Ng, Duncan Watson-Parris

Main category: cs.CV

TL;DR: SPF is a new flow matching approach for climate emulation that uses a spatiotemporal pyramid to enable efficient, parallel sampling at multiple timescales, outperforming baselines on ClimateBench and scaling with the new ClimateSuite dataset.

Details

Motivation: Current generative approaches for climate emulation rely on weather-scale autoregression, which is slow for long climate horizons and hasn't demonstrated stable rollouts under nonstationary forcings. There's a need for efficient, accurate climate emulation across multiple timescales.

Method: Spatiotemporal Pyramid Flows (SPF) - a flow matching approach that models data hierarchically across spatial and temporal scales. It partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution while coupling each stage with specific timescales. Each stage is conditioned on prescribed physical forcings (greenhouse gases, aerosols).

Result: SPF outperforms strong flow matching baselines and pre-trained models on ClimateBench at yearly and monthly timescales, offering fast sampling especially at coarser temporal levels. The scaled SPF model shows good generalization to held-out scenarios across climate models using the new ClimateSuite dataset.

Conclusion: SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios, addressing limitations of previous autoregressive approaches.

Abstract: Generative models have the potential to transform the way we emulate Earth’s changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at https://github.com/stanfordmlgroup/spf .

[86] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney

Main category: cs.CV

TL;DR: SplatSuRe selectively applies super-resolution only to undersampled regions lacking high-frequency information, using camera pose relative to scene geometry to guide where to add SR content, resulting in sharper and more consistent 3D Gaussian Splatting renders.

Details

Motivation: Existing super-resolution methods for 3D Gaussian Splatting uniformly apply SR across all images, causing multi-view inconsistencies and blurry renders. The authors recognize that close-up low-resolution views may contain high-frequency information for regions also captured in distant views, and that camera pose relative to scene geometry can inform where SR content is needed.

Method: SplatSuRe selectively applies super-resolution content only in undersampled regions that lack high-frequency supervision. The method uses camera pose relative to scene geometry to identify where to add SR content, avoiding uniform SR application across all images.

Result: The approach surpasses baselines in both fidelity and perceptual quality across Tanks & Temples, Deep Blending and Mip-NeRF 360 datasets. Gains are most significant in localized foreground regions where higher detail is desired.

Conclusion: Selective super-resolution application based on camera pose and scene geometry analysis yields sharper and more consistent results than uniform SR application, particularly benefiting foreground regions requiring higher detail.

Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

[87] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Fan Yang, Kaihao Zhang

Main category: cs.CV

TL;DR: MRD is a training-free framework for high-resolution image understanding that addresses fragmentation issues in crop-based methods by using multi-resolution semantic fusion and open-vocabulary object detection.

Details

Motivation: Current crop-based methods for high-resolution image understanding in MLLMs fragment objects across multiple crops, disrupting semantic similarity computation and making it hard to localize complete objects.

Method: Proposes Multi-resolution Retrieval-Detection (MRD) with two key components: 1) multi-resolution semantic fusion that integrates semantic similarity maps from different resolutions to preserve object integrity, and 2) open-vocabulary object detection using sliding-window approach for direct object localization.

Result: Experiments on high-resolution image understanding benchmarks with different MLLMs demonstrate the effectiveness of the proposed approach.

Conclusion: MRD provides an effective training-free solution for high-resolution image understanding by addressing object fragmentation issues through multi-resolution processing and direct object detection.

Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocalbulary object detection (OVD) model that identifies object regions using a sliding-window approach.Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.

[88] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation

Mansoor Ali, Maksim Richards, Gilberto Ochoa-Ruiz, Sharib Ali

Main category: cs.CV

TL;DR: RobustSurg improves surgical scene segmentation generalization across centers and modalities by exploiting style/content information with instance normalization, feature covariance mapping, and a restitution module, achieving 10-32% improvements over SOTA methods.

Details

Motivation: Current surgical scene segmentation methods fail to generalize to unseen distributions (different centers) and modalities. Natural scene domain adaptation methods don't work well for surgical scenes due to limited visual cues and extreme diversity. Need to address domain gaps in surgical data.

Method: Proposes RobustSurg with: 1) Instance normalization and feature covariance mapping to minimize appearance variations and create robust features; 2) A restitution module within ResNet backbone to retain task-relevant features; 3) New multiclass, multicentre dataset for surgical scene segmentation.

Result: 23% improvement over baseline DeepLabv3+ and 10-32% improvement over SOTA on unseen HeiCholSeg dataset when trained on CholecSeg8K. 22% improvement over baseline and 11% improvement over recent SOTA on EndoUDA polyp dataset target set.

Conclusion: RobustSurg effectively addresses domain generalization in surgical scene segmentation by exploiting style/content information while preserving task-relevant features, demonstrating significant improvements across different surgical datasets and settings.

Abstract: While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not generalise to unseen distribution (i.e., from other centres) and unseen modalities. Current literature for tackling generalisation on out-of-distribution data and domain gaps due to modality changes has been widely researched but mostly for natural scene data. However, these methods cannot be directly applied to the surgical scenes due to limited visual cues and often extremely diverse scenarios compared to the natural scene data. Inspired by these works in natural scenes to push generalisability on OOD data, we hypothesise that exploiting the style and content information in the surgical scenes could minimise the appearances, making it less variable to sudden changes such as blood or imaging artefacts. This can be achieved by performing instance normalisation and feature covariance mapping techniques for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representation associated with the objects of interest, we introduce a restitution module within the feature learning ResNet backbone that can enable the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain. Our proposed RobustSurg obtained nearly 23% improvement on the baseline DeepLabv3+ and from 10-32% improvement on the SOTA in terms of mean IoU score on an unseen centre HeiCholSeg dataset when trained on CholecSeg8K. Similarly, RobustSurg also obtained nearly 22% improvement over the baseline and nearly 11% improvement on a recent SOTA method for the target set of the EndoUDA polyp dataset.

[89] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation

Miguel L. Martins, Miguel T. Coimbra, Francesco Renna

Main category: cs.CV

TL;DR: The paper introduces Monofractal and Multifractal Recalibration methods that leverage multifractal analysis for channel attention in convolutional networks, showing substantial gains in medical image segmentation tasks.

Details

Motivation: Multifractal analysis has proven effective for self-seeding phenomena but remains limited in deep learning. Existing methods rely on heavy pooling or feature-space decimation that constrain tasks like semantic segmentation. The authors aim to overcome these limitations by integrating multifractal analysis into deep learning architectures.

Method: Introduces two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between probability mass of exponents and multifractal spectrum to form statistical descriptions of encoder embeddings. Implemented as channel-attention functions in convolutional networks using a U-Net-based framework.

Result: Multifractal recalibration yields substantial gains over baselines with other channel-attention mechanisms using higher-order statistics. Validated on three medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Analysis reveals that excitation responses don’t become increasingly specialized with encoder depth in U-Net due to skip connections, and effectiveness relates to global statistics of instance variability.

Conclusion: Multifractal recalibration provides an effective channel-attention mechanism that leverages multifractal analysis for improved medical image segmentation. The approach captures pathological regularities in medical images and offers insights into attention layer behavior in U-Net architectures.

Abstract: Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.

[90] Towards Unified Video Quality Assessment

Chen Feng, Tianhao Peng, Fan Zhang, David Bull

Main category: cs.CV

TL;DR: Unified-VQA is a diagnostic mixture-of-experts framework for video quality assessment that provides both global quality scores and interpretable artifact vectors across multiple video formats without retraining.

Details

Motivation: Current VQA methods are monolithic, provide only single quality scores without diagnostic feedback, and are format-specific rather than truly generic solutions that work across different perceptual domains.

Method: Proposes Unified-VQA framework with multiple perceptual experts for distinct domains, using multi-proxy expert training with ranking-inspired loss and diagnostic multi-task head for global scores and artifact vectors via weakly-supervised learning.

Result: Outperforms 18 benchmark methods for both generic VQA and diagnostic artifact detection across 17 databases with diverse streaming artifacts in HD, UHD, HDR, and HFR formats, all without retraining or fine-tuning.

Conclusion: Represents an important step toward practical, actionable, and interpretable video quality assessment by providing unified diagnostic capabilities across multiple video formats.

Abstract: Recent works in video quality assessment (VQA) typically employ monolithic models that typically predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple perceptual experts’’ dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrated a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.

[91] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

Main category: cs.CV

TL;DR: AV-SpeakerBench is a new benchmark for evaluating multimodal LLMs on fine-grained speaker-centric audiovisual reasoning in videos, focusing on aligning who speaks, what is said, and when it occurs.

Details

Motivation: Existing video benchmarks rarely assess fine-grained reasoning about human speech, often being visually solvable or only coarsely evaluating speech, limiting insight into whether models can properly align speaker identity, speech content, and timing.

Method: Created a curated benchmark of 3,212 multiple-choice questions with: (1) speaker-centered formulation treating speakers as core reasoning units, (2) fusion-grounded question design embedding audiovisual dependencies into question semantics, and (3) expert-curated annotations ensuring temporal precision and cross-modal validity.

Result: Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception.

Conclusion: AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems, highlighting current limitations in audiovisual fusion capabilities.

Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers-not scenes-as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

[92] Exploring the Potentials of Spiking Neural Networks for Image Deraining

Shuang Chen, Tomas Krajnik, Farshad Arvin, Amir Atapour-Abarghouei

Main category: cs.CV

TL;DR: Proposes VLIF neuron and spiking modules for image deraining, achieving SOTA performance with 13% energy consumption of other SNN methods.

Details

Motivation: SNNs have not been sufficiently explored in low-level vision tasks like image deraining. Traditional spiking neurons lack spatial contextual understanding and suffer from frequency-domain saturation issues.

Method: Introduces Visual LIF (VLIF) neuron to overcome spatial context limitations. Proposes Spiking Decomposition and Enhancement Module and lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning.

Result: Extensive experiments across five benchmark deraining datasets show significantly outperforming state-of-the-art SNN-based deraining methods with only 13% of their energy consumption.

Conclusion: Establishes foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks, demonstrating practical viability of biologically plausible frameworks.

Abstract: Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons, specifically in image deraining and innovatively proposes the Visual LIF (VLIF) neuron, overcoming the obstacle of lacking spatial contextual understanding present in traditional spiking neurons. To tackle the limitation of frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods, achieving this superior performance with only 13% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks.

[93] Progressive Image Restoration via Text-Conditioned Video Generation

Peng Kang, Xijun Wang, Yu Yuan

Main category: cs.CV

TL;DR: Fine-tuning CogVideo for progressive visual restoration tasks by generating restoration trajectories instead of natural motion, achieving improved image quality metrics and temporal coherence.

Details

Motivation: Text-to-video models have strong temporal generation capabilities but their potential for image restoration remains underexplored. The authors aim to repurpose these models for progressive visual restoration tasks.

Method: Fine-tune CogVideo to generate restoration trajectories rather than natural video motion. Construct synthetic datasets for super-resolution, deblurring, and low-light enhancement with gradual transitions from degraded to clean frames. Compare two prompting strategies: uniform text prompt and scene-specific prompting via LLaVA multi-modal LLM refined with ChatGPT.

Result: The fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics (PSNR, SSIM, LPIPS) across frames. It effectively restores spatial detail and illumination consistency while maintaining temporal coherence. The model generalizes to real-world scenarios on ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability.

Conclusion: CogVideo can be successfully repurposed for progressive visual restoration tasks, learning to generate restoration trajectories that improve image quality metrics while maintaining temporal coherence, with strong generalization to real-world scenarios without additional training.

Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.

[94] Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation

Andre Juarez, Luis Salsavilca, Frida Coaquira, Celso Gonzales

Main category: cs.CV

TL;DR: MORP-Synth improves SAR oil spill segmentation transfer from Mediterranean to Peruvian data using synthetic augmentation with morphological perturbation and conditional generative modeling.

Details

Motivation: Deep learning models for SAR oil spill segmentation fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, especially problematic for Peruvian coast where labeled Sentinel-1 data is scarce.

Method: Two-stage synthetic augmentation framework: Stage A applies Morphological Region Perturbation (curvature-guided label space method for realistic geometric variations), Stage B renders SAR-like textures using conditional generative INADE model from edited masks.

Result: Models pretrained on Mediterranean data degrade from 67.8% to 51.8% mIoU on Peruvian domain; MORP-Synth improves performance up to +6 mIoU and boosts minority-class IoU (+10.8 oil, +14.6 look-alike).

Conclusion: MORP-Synth effectively addresses domain shift in SAR oil spill segmentation, improving transfer learning performance from Mediterranean to Peruvian conditions through synthetic data augmentation.

Abstract: Deep learning models for SAR oil spill segmentation often fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, a limitation that is particularly severe along the Peruvian coast where labeled Sentinel-1 data remain scarce. To address this problem, we propose \textbf{MORP–Synth}, a two-stage synthetic augmentation framework designed to improve transfer from Mediterranean to Peruvian conditions. Stage~~A applies Morphological Region Perturbation, a curvature guided label space method that generates realistic geometric variations of oil and look-alike regions. Stage~~B renders SAR-like textures from the edited masks using a conditional generative INADE model. We compile a Peruvian dataset of 2112 labeled 512$\times$512 patches from 40 Sentinel-1 scenes (2014–2024), harmonized with the Mediterranean CleanSeaNet benchmark, and evaluate seven segmentation architectures. Models pretrained on Mediterranean data degrade from 67.8% to 51.8% mIoU on the Peruvian domain; MORP–Synth improves performance up to +6 mIoU and boosts minority-class IoU (+10.8 oil, +14.6 look-alike).

[95] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao

Main category: cs.CV

TL;DR: Self-supervised tracker using pre-trained video diffusion models’ motion representations achieves significant improvements in distinguishing visually similar objects without task-specific training.

Details

Motivation: Current self-supervised trackers struggle when visual cues become ambiguous, limiting scalability and generalization without extensive labeled data. There's a need for better methods to distinguish visually similar objects by their motion.

Method: Leverages pre-trained video diffusion models that inherently learn motion representations suitable for tracking. The denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Uses these diffusion-derived motion representations for robust tracking.

Result: Achieves up to 6-point improvement over recent self-supervised approaches on established benchmarks. Performs well on newly introduced tests focused on tracking visually similar items. Enables robust tracking of even identical objects across challenging viewpoint changes and deformations.

Conclusion: Pre-trained video diffusion models provide effective motion representations for tracking without task-specific training, significantly improving performance in distinguishing visually similar objects where current self-supervised methods fail.

Abstract: Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.

[96] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo

Main category: cs.CV

TL;DR: TALO: A plug-and-play framework for temporal alignment of 3D vision foundation model predictions using Thin Plate Spline and submap registration for consistent geometry across time.

Details

Motivation: 3D vision foundation models lack temporal consistency in online settings like driving scenarios, where predictions across time windows need alignment. Existing methods using global transformation have limitations in assumption validity, local alignment scope, and robustness to noisy geometry.

Method: Proposes a higher-DOF alignment framework based on Thin Plate Spline with globally propagated control points to correct spatially varying inconsistencies. Uses point-agnostic submap registration design that is robust to noisy geometry predictions. Fully plug-and-play and compatible with diverse 3D foundation models and camera configurations.

Result: Extensive experiments show the method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups (monocular and surround-view).

Conclusion: The proposed TALO framework demonstrates robust temporal alignment for 3D vision foundation models, addressing limitations of existing methods and showing strong generalization across various configurations.

Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are publicly available at \href{https://github.com/Xian-Bei/TALO}{https://github.com/Xian-Bei/TALO}.

[97] A multi-weight self-matching visual explanation for cnns on sar images

Siyuan Sun, Yongping Zhang, Hongcheng Zeng, Yamin Wang, Wei Yang, Wanting Yang, Jie Chen

Main category: cs.CV

TL;DR: MS-CAM is a visual explanation method for CNNs in SAR applications that combines channel-wise and element-wise weights to highlight regions of interest and improve interpretability.

Details

Motivation: CNNs have achieved success in SAR tasks but their complexity and opacity hinder high-reliability requirements, limiting SAR applications. Improving CNN interpretability is crucial for SAR development and deployment.

Method: Multi-weight self-matching class activation mapping (MS-CAM) matches SAR images with CNN feature maps and gradients, combining both channel-wise and element-wise weights to visualize the model’s decision basis.

Result: Experiments on a self-constructed SAR target classification dataset show MS-CAM more accurately highlights network regions of interest and captures detailed target features, enhancing interpretability. MS-CAM also proves feasible for weakly-supervised object localization.

Conclusion: MS-CAM improves CNN interpretability for SAR applications, with potential for weakly-supervised localization. Key factors like pixel thresholds affecting localization accuracy are analyzed to guide future work.

Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network’s regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised obiect localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.

[98] Understanding and Harnessing Sparsity in Unified Multimodal Models

Shwai He, Chaorui Deng, Ang Li, Shen Yan

Main category: cs.CV

TL;DR: This paper analyzes inefficiencies in unified multimodal models, finds generation components are sensitive to compression, and proposes MoE adaptation with sparse activation to reduce parameters while maintaining performance.

Details

Motivation: Unified multimodal models that integrate both understanding and generation capabilities suffer from inference inefficiencies, as specific tasks or samples may not require the full model capacity. However, there's limited systematic understanding of how these inefficiencies manifest across different components.

Method: 1) Systematic analysis using training-free pruning as probing methodology (depth pruning and width reduction) to study component compressibility. 2) Proposed Mixture-of-Experts (MoE) Adaptation that partitions generation module into multiple experts with sparse activation to restore generation quality. 3) Validated through expert-frozen tuning and fully trainable adaptation.

Result: Understanding components show notable compressibility, especially in generation tasks. Generation components are highly sensitive to compression with sharp performance deterioration. MoE adaptation enables BAGEL model to achieve comparable performance to full model while activating only about half of its parameters.

Conclusion: The study reveals systematic inefficiencies in unified multimodal models and proposes an effective MoE adaptation approach that enables sparse activation to maintain performance while significantly reducing computational costs, addressing the inference efficiency problem in unified multimodal architectures.

Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at \href{https://github.com/Shwai-He/SparseUnifiedModel}{this link}.

[99] WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting

Bin Li, Daijie Chen, Qi Zhang

Main category: cs.CV

TL;DR: A weakly-supervised calibration-free multi-view crowd counting method that uses only crowd count supervision instead of expensive density maps or camera calibrations, achieving state-of-the-art performance.

Details

Motivation: Existing multi-view crowd counting methods require expensive camera calibrations and crowd annotations (density maps or image-level counts). There's a need for more practical, annotation-efficient methods suitable for real-world deployment.

Method: WSCF-MVCC uses only crowd count supervision for single-view counting, eliminating need for density maps. It employs a self-supervised ranking loss with multi-scale priors to enhance perception without extra annotations, and leverages semantic information for more accurate view matching and scene-level estimation.

Result: Outperforms state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, demonstrating superior performance compared to calibrated methods.

Conclusion: The proposed weakly-supervised calibration-free approach is more suitable for practical deployment than existing methods, as it reduces annotation costs while maintaining or improving counting accuracy.

Abstract: Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project different camera view images onto a common space to obtain ground-plane density maps, requiring abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods are proposed that do not require camera calibrations and scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), directly using crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. Instead, a self-supervised ranking loss that leverages multi-scale priors is utilized to enhance the model’s perceptual ability without additional annotation costs. What’s more, the proposed model leverages semantic information to achieve a more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms the state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment compared with calibrated methods. Code is released in https://github.com/zqyq/Weakly-MVCC.

[100] VACoT: Rethinking Visual Data Augmentation with VLMs

Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu, Chun Yuan

Main category: cs.CV

TL;DR: VACoT is a framework that dynamically applies image augmentations during VLM inference to improve robustness on challenging inputs, especially OCR-related adversarial scenarios, without requiring expensive retraining.

Details

Motivation: VLMs struggle with basic perception tasks that conventional models handle well, and retraining on augmented data is costly with diminishing returns. There's a need for inference-time robustness improvements without expensive training.

Method: VACoT dynamically invokes image augmentations during model inference, integrating a structured collection of general visual augmentations (not just local cropping). Uses efficient agentic reinforcement learning with conditional rewards to encourage necessary augmentations while penalizing verbose responses.

Result: Substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Demonstrated superiority on 13 perception benchmarks and introduced AdvOCR to highlight generalization benefits.

Conclusion: VACoT provides an effective framework for improving VLM robustness through inference-time visual augmentations, addressing limitations of current approaches while reducing training complexity and computational overhead.

Abstract: While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continue training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.

[101] Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection

Daanish Hindustani, Sanober Hindustani, Preston Nguyen

Main category: cs.CV

TL;DR: SqueezeNet outperforms ResNet-50 for TB detection from chest X-rays, achieving 89% accuracy vs 73%, suggesting lightweight models may be better suited for mobile deployment in resource-limited settings.

Details

Motivation: Tuberculosis diagnosis remains challenging, especially in resource-limited areas where traditional methods like sputum smear microscopy are inefficient. There's a need for automated, accessible diagnostic tools using advanced technologies like deep learning.

Method: Used Kaggle dataset of 4,200 chest X-rays, compared pretrained ResNet-50 with SqueezeNet models. Applied data splitting, augmentation, and resizing for preprocessing. Evaluated using accuracy, precision, recall, F1 score, and confusion matrix.

Result: SqueezeNet significantly outperformed ResNet-50: 89% accuracy, 98% precision, 80% recall, 87% F1 score, 32% loss vs ResNet-50’s 73% accuracy, 88% precision, 52% recall, 65% F1 score, 54% loss.

Conclusion: Machine learning shows strong potential for TB detection, with SqueezeNet’s lighter architecture performing better. Models could be deployed on mobile devices for resource-limited areas, but further development of faster, smaller, more accurate models is needed for global TB control efforts.

Abstract: This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray images. TB, a persistent infectious disease affecting humanity for millennia, poses challenges in diagnosis, especially in resource-limited settings. Traditional methods, such as sputum smear microscopy and culture, are inefficient, prompting the exploration of advanced technologies like deep learning and computer vision. The study utilized a dataset from Kaggle, consisting of 4,200 chest X-rays, to develop and compare the performance of the two machine learning models. Preprocessing involved data splitting, augmentation, and resizing to enhance training efficiency. Evaluation metrics, including accuracy, precision, recall, and confusion matrix, were employed to assess model performance. Results showcase that the SqueezeNet achieved a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%. In contrast, the ResNet-50 model exhibited a loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and an F1 score of 65%. This study emphasizes the potential of machine learning in TB detection and possible implications for early identification and treatment initiation. The possibility of integrating such models into mobile devices expands their utility in areas lacking TB detection resources. However, despite promising results, the need for continued development of faster, smaller, and more accurate TB detection models remains crucial in contributing to the global efforts in combating TB.

[102] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention

Wenyi Xiong, Jian Chen

Main category: cs.CV

TL;DR: A map-free trajectory prediction algorithm using MoE for frequency component selection, selective attention for redundancy filtering, and multimodal decoding achieves state-of-the-art performance on complex interactive driving scenarios.

Details

Motivation: Existing trajectory prediction methods struggle with extracting valuable information from redundant data in complex interactive scenarios, reducing both computational efficiency and prediction accuracy, especially for intricate agent interactions.

Method: Proposes a map-free algorithm operating across temporal, spatial, and frequency domains: 1) Uses Mixture of Experts (MoE) to adaptively select critical frequency components and integrate multi-scale temporal features, 2) Implements selective attention module to filter redundant information in both temporal sequences and spatial interactions, 3) Designs multimodal decoder supervised by patch-level and point-level losses.

Result: Experiments on Nuscences datasets demonstrate the algorithm’s superiority, validating its effectiveness in handling complex interactive scenarios with improved computational efficiency and prediction accuracy.

Conclusion: The proposed map-free trajectory prediction algorithm successfully addresses challenges in complex interactive scenarios by efficiently extracting valuable information from redundant data through frequency domain processing, selective attention, and multimodal decoding.

Abstract: Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, We utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on Nuscences datasets demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.

[103] SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains

Qingmei Li, Yang Zhang, Peifeng Zhang, Haohuan Fu, Juepeng Zheng

Main category: cs.CV

TL;DR: SAGE is a style-adaptive generalization framework that synthesizes visual prompts to improve frozen model performance on unseen domains without modifying model weights, addressing privacy constraints in semantic segmentation.

Details

Motivation: Traditional domain adaptation methods require access to model parameters for fine-tuning, which is often impossible due to privacy concerns and security constraints. There's a need for input-level strategies that can enhance generalization without modifying model weights.

Method: SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles. It first uses style transfer to construct diverse style representations of the source domain, then adaptively fuses these style cues based on each input’s visual context to create dynamic prompts that harmonize image appearance without touching the model interior.

Result: Extensive experiments on five benchmark datasets show SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints, and outperforms full fine-tuning baselines in all settings.

Conclusion: SAGE effectively bridges the gap between frozen model invariance and unseen domain diversity through a closed-loop design that synthesizes adaptive visual prompts, providing a practical solution for domain generalization under privacy constraints.

Abstract: Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a \textbf{S}tyle-\textbf{A}daptive \textbf{GE}neralization framework (\textbf{SAGE}), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.

[104] On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning

Liyuan Lou, Wanyun Li, Wentian Gan, Yifei Yu, Tengfei Wang, Xin Wang, Zongqian Zhan

Main category: cs.CV

TL;DR: On-the-fly Feedback SfM enables real-time UAV photogrammetry with quality assessment and guided feedback for iterative exploration-exploitation of target areas.

Details

Motivation: Existing real-time UAV photogrammetry methods lack quality evaluation of 3D reconstruction and guided feedback for image acquisition, limiting effectiveness for time-critical applications like disaster response and digital-twin maintenance.

Method: An explore-and-exploit framework with three modules: (1) online incremental coarse-mesh generation from sparse 3D point cloud, (2) online mesh quality assessment with actionable indicators, and (3) predictive path planning for on-the-fly trajectory refinement.

Result: Achieves in-situ reconstruction and evaluation in near real time, significantly reduces coverage gaps and re-flight costs through actionable feedback, enabling transition from passive to intelligent adaptive workflows.

Conclusion: The integrated approach of data collection, processing, reconstruction, assessment, and online feedback provides an alternative for intelligent UAV photogrammetry workflows, with code publicly available.

Abstract: Compared with conventional offline UAV photogrammetry, real-time UAV photogrammetry is essential for time-critical geospatial applications such as disaster response and active digital-twin maintenance. However, most existing methods focus on processing captured images or sequential frames in real time, without explicitly evaluating the quality of the on-the-go 3D reconstruction or providing guided feedback to enhance image acquisition in the target area. This work presents On-the-fly Feedback SfM, an explore-and-exploit framework for real-time UAV photogrammetry, enabling iterative exploration of unseen regions and exploitation of already observed and reconstructed areas in near real time. Built upon SfM on-the-fly , the proposed method integrates three modules: (1) online incremental coarse-mesh generation for dynamically expanding sparse 3D point cloud; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning for on-the-fly trajectory refinement. Comprehensive experiments demonstrate that our method achieves in-situ reconstruction and evaluation in near real time while providing actionable feedback that markedly reduces coverage gaps and re-flight costs. Via the integration of data collection, processing, 3D reconstruction and assessment, and online feedback, our on the-fly feedback SfM could be an alternative for the transition from traditional passive working mode to a more intelligent and adaptive exploration workflow. Code is now available at https://github.com/IRIS-LAB-whu/OntheflySfMFeedback.

[105] From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

Yuqing Shao, Yuchen Yang, Rui Yu, Weilong Li, Xu Guo, Huaicheng Yan, Wei Wang, Xiao Sun

Main category: cs.CV

TL;DR: FDTA improves end-to-end MOT by enhancing object embeddings with spatial, temporal, and identity adapters to address low association accuracy in current methods.

Details

Motivation: Current end-to-end MOT methods have strong detection but low association accuracy due to object embeddings having high inter-object similarity from DETR architecture, which only emphasizes category-level discrimination within single frames rather than instance-level distinction across frames with spatial-temporal continuity.

Method: FDTA introduces explicit feature refinement with three adapters: Spatial Adapter (SA) integrates depth-aware cues for spatial continuity, Temporal Adapter (TA) aggregates historical information for temporal dependencies, and Identity Adapter (IA) uses quality-aware contrastive learning for instance-level separability.

Result: FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks including DanceTrack, SportsMOT, and BFT.

Conclusion: The proposed discriminative embedding enhancement strategy effectively addresses association accuracy limitations in end-to-end MOT methods by improving object embeddings across spatial, temporal, and identity dimensions.

Abstract: End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbb/FDTA.

[106] Reproducing and Extending RaDelft 4D Radar with Camera-Assisted Labels

Kejia Hu, Mohammed Alsakabi, John M. Dolan, Ozan K. Tonguz

Main category: cs.CV

TL;DR: The paper presents a camera-guided radar labeling pipeline that generates accurate semantic segmentation labels for 4D radar point clouds without human annotation, addressing the scarcity of labeled radar datasets.

Details

Motivation: 4D radar shows promise for robust perception in adverse conditions, but radar semantic segmentation research is limited by scarce open-source datasets and labels. Existing datasets like RaDelft only provide LiDAR annotations and no public code for radar labeling, hindering reproducibility and downstream research.

Method: A camera-guided radar labeling pipeline that projects radar point clouds into camera-based semantic segmentation and applies spatial clustering to generate accurate radar labels without human annotation.

Result: The method reproduces RaDelft’s numerical results and generates labels that significantly enhance radar label accuracy. It establishes a reproducible framework for training and evaluating labeled 4D radar data, and quantifies how different fog levels affect radar labeling performance.

Conclusion: The camera-guided labeling pipeline provides an effective solution to the radar dataset scarcity problem, enabling reproducible research and advancing 4D radar semantic segmentation without relying on human annotations.

Abstract: Recent advances in 4D radar highlight its potential for robust environment perception under adverse conditions, yet progress in radar semantic segmentation remains constrained by the scarcity of open source datasets and labels. The RaDelft data set, although seminal, provides only LiDAR annotations and no public code to generate radar labels, limiting reproducibility and downstream research. In this work, we reproduce the numerical results of the RaDelft group and demonstrate that a camera-guided radar labeling pipeline can generate accurate labels for radar point clouds without relying on human annotations. By projecting radar point clouds into camera-based semantic segmentation and applying spatial clustering, we create labels that significantly enhance the accuracy of radar labels. These results establish a reproducible framework that allows the research community to train and evaluate the labeled 4D radar data. In addition, we study and quantify how different fog levels affect the radar labeling performance.

[107] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou

Main category: cs.CV

TL;DR: Skywork-R1V4 is a 30B parameter multimodal agent that unifies image manipulation, web search, and planning through supervised fine-tuning alone, achieving SOTA on benchmarks without reinforcement learning.

Details

Motivation: Existing multimodal agentic systems treat image manipulation and web search as separate capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces.

Method: Developed Skywork-R1V4, a 30B parameter multimodal agent that unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and interleaved reasoning. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories with stepwise consistency filtering.

Result: Achieves state-of-the-art results: 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Exhibits emergent long-horizon reasoning, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks.

Conclusion: Sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

Abstract: Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu, Shichao Xie, Yujian Yuan, Jian Sun, Zhining Gu, Junjie Wang, Xiaolong Wu, Mu Xu, Yujiu Yang

Main category: cs.CV

TL;DR: Nav-R² is a framework for open-vocabulary object-goal navigation that uses structured Chain-of-Thought reasoning and Similarity-Aware Memory to improve decision-making and success rates for locating unseen objects.

Details

Motivation: Existing object-goal navigation approaches suffer from opaque decision-making processes and low success rates when locating unseen objects in open-vocabulary settings.

Method: The framework explicitly models target-environment relationships and environment-action planning through structured Chain-of-Thought reasoning. It uses a Similarity-Aware Memory that compresses video frames and fuses historical observations to preserve target-relevant features without adding parameters. The authors also created a NavR²-CoT dataset to train the model.

Result: Nav-R² achieves state-of-the-art performance in localizing unseen objects, avoids overfitting to seen object categories, and maintains real-time inference at 2Hz.

Conclusion: The proposed framework successfully addresses the challenges of open-vocabulary object-goal navigation through explicit relationship modeling and efficient memory mechanisms, providing transparent decision-making and improved performance on novel objects.

Abstract: Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rate on locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory. We construct a Nav$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context and finally make future action plans. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-R^2 achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2Hz. Resources will be made publicly available at \href{https://github.com/AMAP-EAI/Nav-R2}{github link}.

[109] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng

Main category: cs.CV

TL;DR: WISE is a multi-agent debate framework for vision-and-language reasoning that uses heterogeneous experts (Solvers and Reflectors) with a modified Dawid-Skene aggregation method, achieving 2-7% accuracy improvements over state-of-the-art MAD approaches.

Details

Motivation: Multi-agent debate (MAD) has shown promise for robust reasoning but has been mostly applied to language-only tasks, leaving its efficacy on multimodal vision-and-language problems underexplored. The authors aim to generalize MAD protocols to leverage complementary strengths of heterogeneous experts with single- and multi-modal capabilities.

Method: WISE (Weighted Iterative Society-of-Experts) partitions agents into Solvers (generate solutions) and Reflectors (verify correctness, assign weights, provide feedback). It uses a modified Dawid-Skene algorithm for post-processing that integrates the two-stage debate model, accounting for response variance and feedback weights.

Result: WISE consistently improves accuracy by 2-7% over state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks (SMART-840, VisualPuzzles, EvoChart-QA, and new SMART-840++ dataset) and LLM configurations.

Conclusion: The WISE framework successfully extends multi-agent debate to multimodal reasoning problems, demonstrating that heterogeneous expert agents with specialized roles (Solvers and Reflectors) can effectively leverage complementary strengths for improved performance on vision-and-language tasks.

Abstract: Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents’ solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.

[110] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture

Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin

Main category: cs.CV

TL;DR: MitUNet: A hybrid neural network for high-precision wall segmentation from 2D floor plans, using Mix-Transformer encoder and U-Net decoder with scSE attention, optimized with Tversky loss for better boundary accuracy and thin structure detection.

Details

Motivation: Existing methods for 3D reconstruction from floor plans struggle with thin structural components and produce masks with irregular boundaries, lacking the geometric precision needed for vectorization in automated 3D modeling pipelines.

Method: MitUNet combines a hierarchical Mix-Transformer encoder for global context capture with a U-Net decoder enhanced with scSE attention blocks for boundary recovery. Uses Tversky loss optimization with fine-tuned hyperparameters to balance precision and recall, suppressing false positives while maintaining sensitivity to thin structures.

Result: Outperforms standard single-task models on both public CubiCasa5k dataset and proprietary regional dataset, generating structurally correct masks with high boundary accuracy suitable for 3D reconstruction.

Conclusion: MitUNet provides a robust tool for wall segmentation in automated 3D reconstruction pipelines, addressing critical limitations in existing methods for precise structural element detection from 2D floor plans.

Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.

[111] Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li

Main category: cs.CV

TL;DR: Proposes GuiDG framework for domain generalization in vision-language models using domain experts and cross-modal attention to improve generalization while maintaining efficiency.

Details

Motivation: Addresses the trade-off between domain specificity and domain generalization in fine-tuning large VLMs. Current methods that fine-tune a universal model on entire datasets compromise generalization to unseen domains.

Method: Two-step framework: 1) Uses prompt tuning to obtain source domain experts, 2) Introduces Cross-Modal Attention module to guide vision encoder fine-tuning via adaptive expert integration. Also constructs ImageNet-DG benchmark for few-shot DG evaluation.

Result: Extensive experiments on standard DG benchmarks and ImageNet-DG show GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

Conclusion: Theoretical analysis shows training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. GuiDG framework successfully implements this insight to improve domain generalization in VLM fine-tuning.

Abstract: Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, Xiangyu Zhang, Daxin Jiang

Main category: cs.CV

TL;DR: GUI Exploration Lab: A simulation environment for GUI agent navigation research that enables comprehensive training and evaluation through supervised fine-tuning followed by reinforcement learning.

Details

Motivation: Real-world GUI environments are complex and proprietary, making it difficult to obtain comprehensive environment information needed for agent training and evaluation, which hinders systematic investigation and benchmarking of agent navigation capabilities.

Method: Introduce GUI Exploration Lab simulation environment engine that enables flexible definition and composition of screens, icons, and navigation graphs with full access to environment information. Use three-stage training: 1) supervised fine-tuning for fundamental knowledge memorization, 2) single-turn reinforcement learning for generalization to unseen scenarios, 3) multi-turn reinforcement learning for exploration strategy development.

Result: The approach demonstrates effective generalization to real-world scenarios on both static and interactive benchmarks. Supervised fine-tuning provides crucial foundation, single-turn RL enhances generalization, and multi-turn RL further improves screen navigation performance through exploration strategies.

Conclusion: Reinforcement learning approaches show advantages in GUI navigation, and the findings offer practical guidance for building more capable and generalizable GUI agents. The GUI Exploration Lab enables systematic investigation of agent navigation capabilities in complex GUI environments.

Abstract: With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.

[113] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

Main category: cs.CV

TL;DR: WorldMM is a multimodal memory agent for long video understanding that uses three complementary memory types (episodic, semantic, visual) with adaptive retrieval across multiple temporal scales, achieving 8.4% average improvement over SOTA on long video QA benchmarks.

Details

Motivation: Existing video LLMs struggle with hours- or days-long videos due to limited context capacity and loss of visual details. Current memory-augmented methods rely too heavily on textual summaries and fixed temporal scales, failing to utilize visual evidence and handle events of variable durations.

Method: WorldMM constructs three complementary memory types: episodic memory (factual events across multiple temporal scales), semantic memory (continuously updated high-level conceptual knowledge), and visual memory (detailed scene information). An adaptive retrieval agent iteratively selects relevant memory sources and temporal granularities based on queries.

Result: WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods.

Conclusion: WorldMM effectively addresses long video reasoning challenges through its multimodal memory architecture with adaptive retrieval, demonstrating superior performance on complex long video understanding tasks.

Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

[114] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework

Daeyoung Kim

Main category: cs.CV

TL;DR: LightHCG: A lightweight causal representation learning model for glaucoma detection using convolutional VAE with HSIC-based disentanglement and graph autoencoders, achieving high performance with 93-99% fewer parameters than existing vision models.

Details

Motivation: Current AI-driven glaucoma detection approaches have limitations in reliability, parameter efficiency, susceptibility to spurious correlations, and lack of capability for intervention analysis or clinical simulations. There's a need for models that capture true causality among glaucoma-related physical factors.

Method: Proposed LightHCG: an extremely lightweight Convolutional VAE-based latent glaucoma representation model. Uses HSIC-based latent space disentanglement and Graph Autoencoder-based unsupervised causal representation learning to capture true causality among glaucoma-related factors in the optic nerve region.

Result: Achieves higher performance in glaucoma classification with 93-99% less weights (parameters) compared to existing advanced vision models like InceptionV3, MobileNetV2, and VGG16. Also enhances potential for AI-driven intervention analysis.

Conclusion: LightHCG represents a significant advancement in glaucoma detection by combining lightweight architecture with causal representation learning, addressing key limitations of current AI approaches while enabling new capabilities for clinical applications and intervention analysis.

Abstract: As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on human vision fields. Mainly characterized by dimmed and blurred visions, or peripheral vision loss, glaucoma is well known to occur due to damages in the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma related works and clinical diagnosis focused on detecting these damages in the optic nerve by using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advancements in computer vision AI models, such as VGG16 or Vision Transformers (ViT), AI-automatized glaucoma detection and optic cup segmentation based on retinal fundus images or OCT recently exhibited significant performance in aiding conventional diagnosis with high performance. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduced a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93~99% less weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.

[115] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources

Phuc Pham, Nhu Pham, Ngoc Quoc Ly

Main category: cs.CV

TL;DR: A momentum self-distillation method for efficient vision-language model training in healthcare, achieving competitive performance with SOTA while using only a single GPU.

Details

Motivation: Healthcare has limited annotated data and contrastive learning requires large batch sizes that are computationally expensive. Need to improve training efficiency while extracting knowledge from both data and models.

Method: Combines momentum method with distillation: (1) momentum self-distillation to enhance multimodal learning, (2) momentum mechanisms with gradient accumulation to enlarge effective batch size without increasing resources.

Result: Competitive with SOTA in zero-shot classification, substantial boost in few-shot adaptation (>90% AUC-ROC), 2-3% improvement in retrieval tasks, high training efficiency with single GPU.

Conclusion: Proposed method advances efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods, making VLM training more accessible.

Abstract: In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaption, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD .

[116] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation

Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee

Main category: cs.CV

TL;DR: BOLT extracts orthogonal spectral bases from existing fine-tuned models and adapts new tasks within this subspace with minimal training.

Details

Motivation: Meta-learning requires expensive meta-training, while many task-specific pre-trained models exist but transferring them efficiently to new tasks is underexplored.

Method: Extract dominant singular directions from multiple task vectors, orthogonalize them per layer to form reusable bases, then freeze bases and train only small diagonal coefficients for new tasks.

Result: Provides strong training-free initialization and achieves robust performance compared to PEFT baselines and meta-learned initialization with very few trainable parameters.

Conclusion: Constraining adaptation to task-informed orthogonal subspace offers effective alternative for unseen-task transfer with minimal compute.

Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.

[117] Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors

Fan Luo, Zeyu Gao, Xinhao Luo, Kai Zhao, Yanfeng Lu

Main category: cs.CV

TL;DR: TDE enhances SNNs for object detection by generating diverse temporal stimuli and using spike-driven attention to reduce energy consumption.

Details

Motivation: Existing SNNs have limited expressive power because they replicate or aggregate inputs in fixed intervals, leading to nearly identical stimuli across time steps, which is insufficient for complex tasks like object detection.

Method: Proposes Temporal Dynamics Enhancer (TDE) with two modules: Spiking Encoder (SE) generates diverse input stimuli across time steps, and Attention Gating Module (AGM) guides SE generation based on inter-temporal dependencies. Also introduces Spike-Driven Attention (SDA) to eliminate high-energy multiplication operations from AGM.

Result: TDE consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on PASCAL VOC and 47.6% on EvDET200K. SDA consumes only 0.240 times the energy of conventional attention modules.

Conclusion: TDE effectively enhances SNNs’ temporal modeling capacity for object detection while maintaining energy efficiency through spike-driven attention, making it a promising approach for energy-efficient visual perception tasks.

Abstract: Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural Networks (ANNs). However, existing SNNs typically replicate inputs directly or aggregate them into frames at fixed intervals. Such strategies lead to neurons receiving nearly identical stimuli across time steps, severely limiting the model’s expressive power, particularly in complex tasks like object detection. In this work, we propose the Temporal Dynamics Enhancer (TDE) to strengthen SNNs’ capacity for temporal information modeling. TDE consists of two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides the SE generation based on inter-temporal dependencies. Moreover, to eliminate the high-energy multiplication operations introduced by the AGM, we propose a Spike-Driven Attention (SDA) to reduce attention-related energy consumption. Extensive experiments demonstrate that TDE can be seamlessly integrated into existing SNN-based detectors and consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on the static PASCAL VOC dataset and 47.6% on the neuromorphic EvDET200K dataset. In terms of energy consumption, the SDA consumes only 0.240 times the energy of conventional attention modules.

[118] nuScenes Revisited: Progress and Challenges in Autonomous Driving

Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar

Main category: cs.CV

TL;DR: A comprehensive retrospective analysis of the nuScenes autonomous driving dataset, examining its creation, technical details, influence on subsequent datasets, and impact on AV/ADAS research standards and methodologies.

Details

Motivation: To provide an unprecedented look into the creation and technical details of nuScenes dataset that haven't been revealed in academic publications, and to trace its influence on autonomous driving research standards and subsequent datasets.

Method: Revisiting and analyzing the nuScenes dataset creation process, examining its extensions (nuImages and Panoptic nuScenes), tracing its influence on other datasets, and reviewing methodological developments in autonomous driving research using nuScenes.

Result: Reveals previously undisclosed technical details about nuScenes creation, demonstrates its pioneering role as the first dataset with radar data and diverse urban scenes from two continents, and shows its significant influence on defining community standards and inspiring numerous subsequent datasets.

Conclusion: nuScenes has been foundational for AV/ADAS development, setting key standards for multi-modal sensor fusion, diverse urban scene collection, and benchmark tasks, while significantly shaping the autonomous driving research landscape through its influence on subsequent datasets and methodologies.

Abstract: Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.

[119] HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild

Valentin Bieri, Marie-Julie Rakotosaona, Keisuke Tateno, Francis Engelmann, Leonidas Guibas

Main category: cs.CV

TL;DR: HouseLayout3D is a new real-world benchmark for full building-scale 3D layout estimation, addressing limitations of current models that only handle single rooms/floors. MultiFloor3D is a training-free baseline that outperforms existing methods.

Details

Motivation: Current 3D layout estimation models are trained on synthetic datasets with simple single-room/single-floor environments, requiring scene splitting that removes global spatial context essential for reasoning about multi-level structures like staircases.

Method: The paper introduces HouseLayout3D benchmark for full building-scale layout estimation, and presents MultiFloor3D - a simple training-free baseline that leverages recent scene understanding methods.

Result: MultiFloor3D already outperforms existing 3D layout estimation models on both the new HouseLayout3D benchmark and prior datasets, highlighting the need for further research in building-scale layout estimation.

Conclusion: The work addresses a critical gap in 3D layout estimation by providing a real-world benchmark for multi-floor buildings and showing that even a simple training-free baseline can outperform existing methods, indicating significant room for improvement in this area.

Abstract: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single room or single floor environments. As a consequence, they cannot natively handle large multi floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real world benchmark designed to support progress toward full building scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.

[120] ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation

Kerui Chen, Jianrong Zhang, Ming Li, Zhonglong Zheng, Hehe Fan

Main category: cs.CV

TL;DR: ClusterStyle: A clustering-based framework for stylized motion generation that models intra-style diversity using prototype-based structured embeddings at global and local levels.

Details

Motivation: Existing stylized motion generation models struggle to capture intra-style diversity - the ability for a single style to correspond to diverse motion variations. Current approaches learn unstructured embeddings that fail to model the rich variations within a style category.

Method: Proposes ClusterStyle with two key components: 1) Uses prototypes to model diverse style patterns across motions of the same category, capturing both global-level diversity among style motions and local-level diversity within temporal dynamics. 2) Creates two structured style embedding spaces (global and local) optimized via alignment with non-learnable prototype anchors. 3) Augments pretrained text-to-motion models with Stylistic Modulation Adapter (SMA) to integrate style features.

Result: Extensive experiments demonstrate that ClusterStyle outperforms existing state-of-the-art models in both stylized motion generation and motion style transfer tasks.

Conclusion: The prototype-based clustering framework effectively addresses the intra-style diversity challenge in stylized motion generation, providing better modeling of style variations at both global and local levels through structured embedding spaces.

Abstract: Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.

[121] See, Think, Learn: A Self-Taught Multimodal Reasoner

Sourabh Sharma, Sonam Gupta, Sadbhawna

Main category: cs.CV

TL;DR: STL is a self-training framework that improves VLMs’ multimodal reasoning by generating structured rationales (see-then-think) and learning from them, including negative rationales to enhance discrimination.

Details

Motivation: Current VLMs struggle with multimodal reasoning due to weaknesses in either perception or reasoning. Existing methods rely on expensive human annotations, proprietary models, or self-training that ignores perception, creating a need for cost-effective solutions.

Method: STL uses a structured reasoning template that forces models to “see before think” - first extract visual attributes in text form, then use them for reasoning. It employs self-training where models generate and learn from their own structured rationales, augmented with negative rationales that explain why wrong answers are incorrect.

Result: Experiments across diverse domains show STL consistently outperforms baselines trained only on answers or self-generated reasoning. Qualitative analysis confirms high-quality rationales.

Conclusion: STL provides a cost-effective solution to enhance multimodal reasoning in VLMs by jointly improving perception and reasoning through structured self-training with negative rationales.

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model’s ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.

[122] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang

Main category: cs.CV

TL;DR: Audio-video joint denoising training improves video generation quality even when only video quality matters, not just audio-video synchrony.

Details

Motivation: To investigate whether audio-video joint denoising training can improve video generation quality beyond just achieving audio-video synchrony, even when the primary goal is video quality alone.

Method: Introduce AVFullDiT architecture that leverages pre-trained T2V and T2A modules for joint denoising. Train both T2AV model with AVFullDiT and T2V-only counterpart under identical settings for comparison.

Result: Audio-video joint denoising consistently improves video quality on challenging subsets with large and object contact motions. Audio prediction acts as a privileged signal that helps model internalize causal relationships between visual events and acoustic consequences.

Conclusion: Cross-modal co-training is a promising approach for developing stronger, more physically grounded world models, as audio-video joint denoising provides benefits beyond synchrony by regularizing video dynamics through causal relationship learning.

Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

[123] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration

Zhongyi Cai, Yi Du, Chen Wang, Yu Kong

Main category: cs.CV

TL;DR: SEER-Bench introduces a sequential embodied AI benchmark for EQA and EMN tasks, with 3DSPMR approach using 3D spatial memory reasoning to enhance MLLMs for sequential task performance.

Details

Motivation: Real-world embodied agents face sequential tasks where sub-tasks build on previous ones, requiring reuse of spatial knowledge from prior explorations, which is underexplored in current research focused on single-task settings.

Method: 3DSPMR (3D Spatial Memory Reasoning) approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models for reasoning and exploration in sequential embodied tasks.

Result: Extensive experiments show 3DSPMR achieves substantial performance gains on both sequential Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN) tasks compared to baseline approaches.

Conclusion: This work addresses the practical challenge of sequential embodied tasks by introducing SEER-Bench benchmark and demonstrating that incorporating geometric information into MLLM-based spatial understanding significantly improves performance on sequential embodied AI tasks.

Abstract: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

[124] TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution

Fengli Ran, Xiao Pu, Bo Liu, Xiuli Bi, Bin Xiao

Main category: cs.CV

TL;DR: TGDD improves dataset distillation by dynamically aligning synthetic and original data distributions along the model’s training trajectory, capturing evolving feature representations for better downstream performance.

Details

Motivation: Current distribution matching methods for dataset distillation overlook the evolution of feature representations during training, which limits synthetic data expressiveness and weakens downstream performance.

Method: TGDD reformulates distribution matching as dynamic alignment along the model’s training trajectory, capturing evolving semantics at each stage while introducing distribution constraint regularization to reduce class overlap.

Result: TGDD achieves state-of-the-art performance on ten datasets with 5.0% accuracy gain on high-resolution benchmarks, maintaining favorable balance between performance and efficiency without additional optimization overhead.

Conclusion: TGDD effectively preserves semantic diversity and representativeness in synthetic data by considering feature evolution during training, leading to improved downstream task performance.

Abstract: Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model’s training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.

[125] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

Main category: cs.CV

TL;DR: WorldPack is a video world model with compressed memory that improves long-term spatial consistency and quality in video generation while reducing computational costs.

Details

Motivation: Existing video world models struggle with long-term temporal and spatial consistency due to prohibitively expensive computational costs for long-context inputs, creating a need for more efficient approaches.

Method: WorldPack uses compressed memory with two components: trajectory packing for high context efficiency, and memory retrieval to maintain consistency in rollouts and support long-term generations requiring spatial reasoning.

Result: WorldPack significantly outperforms state-of-the-art models on the LoopNav benchmark in Minecraft, demonstrating notable improvements in spatial consistency, fidelity, and quality for long-term generation.

Conclusion: WorldPack successfully addresses the long-standing problem of temporally- and spatially-consistent long-term world modeling through efficient compressed memory, enabling high-quality video generation with shorter context lengths.

Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.

[126] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

Vishwesh Nath, Javier G. Tejero, Ruilong Li, Filippo Filicori, Mahdi Azizian, Sean D. Huver

Main category: cs.CV

TL;DR: G-SHARP is a commercially compatible real-time surgical scene reconstruction framework using Gaussian splatting for minimally invasive procedures, offering fast 3D modeling of deformable tissue with deployment on NVIDIA edge hardware.

Details

Motivation: Existing real-time endoscopic reconstruction methods often rely on non-commercial derivatives, limiting deployability in clinical settings. There's a need for commercially compatible solutions that can provide fast and accurate 3D modeling of deformable tissue for minimally invasive surgical procedures.

Method: G-SHARP is built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling and robust occlusion handling. The framework is designed for real-time surgical scene reconstruction and includes a Holoscan SDK application for deployment on NVIDIA IGX Orin and Thor edge hardware.

Result: The framework achieves state-of-the-art reconstruction quality on the EndoNeRF pulling benchmark with strong speed-accuracy trade-offs suitable for intra-operative use. It enables real-time surgical visualization in practical operating-room settings.

Conclusion: G-SHARP represents the first commercially compatible surgical pipeline using Gaussian splatting, overcoming deployment limitations of previous approaches while maintaining high-fidelity reconstructions and real-time performance for minimally invasive procedures.

Abstract: We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

[127] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou

Main category: cs.CV

TL;DR: UCAgents is a hierarchical multi-agent framework for medical VQA that enforces unidirectional convergence through structured evidence auditing to address reasoning detachment in VLMs, achieving superior accuracy with significantly lower computational cost.

Details

Motivation: Current VLMs suffer from reasoning detachment where explanations drift from verifiable image evidence, undermining clinical trust. Existing multi-agent frameworks for MDT debates amplify textual noise and computational cost while failing to anchor reasoning to visual evidence.

Method: UCAgents uses a hierarchical multi-agent framework inspired by clinical workflows, enforcing unidirectional convergence through structured evidence auditing. It forbids position changes, limits interactions to targeted evidence verification, and introduces one-round inquiry discussion to uncover visual-textual misalignment risks. This creates a dual-noise bottleneck formalized via information theory.

Result: Achieves 71.3% accuracy on PathVQA (+6.0% over SOTA) with 87.7% lower token cost across four medical VQA benchmarks. Strikes balance between uncovering visual evidence and avoiding textual interference, demonstrating both diagnostic reliability and computational efficiency.

Conclusion: UCAgents effectively addresses reasoning detachment in medical VLMs by enforcing evidence-based convergence, achieving superior performance with significantly reduced computational overhead, making it suitable for real-world clinical deployment.

Abstract: Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost, the evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.

[128] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu

Main category: cs.CV

TL;DR: ReVSeg introduces explicit multi-step reasoning for video object segmentation using pretrained VLMs, decomposing complex queries into interpretable semantic, temporal, and spatial operations optimized with reinforcement learning.

Details

Motivation: Current video object segmentation methods collapse complex reasoning about dynamics, causality, and temporal interactions into simplified latent embeddings, making the reasoning chain opaque and intractable. There's a need for more interpretable and explicit reasoning approaches.

Method: ReVSeg decomposes reasoning into three explicit sequential operations: semantics interpretation, temporal evidence selection, and spatial grounding, using pretrained vision language models. It employs reinforcement learning to optimize the multi-step reasoning chain, enabling self-refinement from outcome-driven signals.

Result: ReVSeg achieves state-of-the-art performance on standard video object segmentation benchmarks while producing interpretable reasoning trajectories, demonstrating the effectiveness of explicit decomposition and reinforcement learning optimization.

Conclusion: Explicit decomposition of reasoning into sequential operations combined with reinforcement learning optimization enables both high performance and interpretability in complex video object segmentation tasks, addressing the limitations of opaque latent reasoning approaches.

Abstract: Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations – semantics interpretation, temporal evidence selection, and spatial grounding – aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .

[129] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo

Main category: cs.CV

TL;DR: 3D-SLIM replaces causal attention masks with adaptive spatial masks for 3D scene-language tasks, eliminating sequential bias and enabling direct object-instruction attention without architectural changes.

Details

Motivation: Existing LLM-based 3D scene understanding methods use causal attention masks from language modeling, which create two conflicts: 1) sequential bias among order-agnostic 3D objects, and 2) restricted object-instruction attention that hinders task-specific reasoning.

Method: Proposes 3D Spatial Language Instruction Mask (3D-SLIM) with two components: Geometry-adaptive Mask (constrains attention based on spatial density rather than token order) and Instruction-aware Mask (enables object tokens to directly access instruction context). The approach requires no architectural modifications or extra parameters.

Result: Substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate effectiveness and underscore the importance of decoder design in 3D multi-modal reasoning.

Conclusion: 3D-SLIM effectively addresses limitations of causal attention masks in 3D scene understanding by adapting attention to spatial structure, demonstrating that decoder design is critical for 3D multi-modal reasoning despite being simple and parameter-free.

Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user’s task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.

[130] YingVideo-MV: Music-Driven Multi-Stage Video Generation

Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen

Main category: cs.CV

TL;DR: YingVideo-MV is the first cascaded framework for generating long music-performance videos with camera motions from audio signals, addressing the gap in existing methods that lack explicit camera control.

Details

Motivation: Existing diffusion models for audio-driven avatar video generation have made progress in synthesizing long sequences with audio-visual synchronization, but music-performance videos with camera motions remain largely unexplored. Current long-video generation methods lack explicit camera motion control.

Method: A cascaded framework integrating: 1) audio semantic analysis, 2) interpretable shot planning module (MV-Director), 3) temporal-aware diffusion Transformer architectures, 4) long-sequence consistency modeling, 5) camera adapter module that embeds camera poses into latent noise, and 6) time-aware dynamic window range strategy for adaptive denoising based on audio embeddings. Trained on a large-scale Music-in-the-Wild Dataset collected from web data.

Result: Comprehensive benchmark tests demonstrate outstanding performance in generating coherent and expressive music videos with precise music-motion-camera synchronization. The framework enables automatic synthesis of high-quality music performance videos from audio signals.

Conclusion: YingVideo-MV represents a significant advancement in music-driven long-video generation, successfully addressing the challenge of camera motion control while maintaining audio-visual synchronization and identity consistency across long sequences.

Abstract: While diffusion model for audio-driven avatar video generation have achieved notable process in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continulity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjust denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available in our project page: https://giantailab.github.io/YingVideo-MV/ .

[131] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao

Main category: cs.CV

TL;DR: CIA is a new image-centric jailbreak attack method for MLLMs that embeds harmful queries into benign visual contexts using multi-agent systems and visualization strategies, achieving high toxicity scores and attack success rates.

Details

Motivation: Existing jailbreak attacks on MLLMs focus on text-image interplay and treat visual modality as secondary prompts, underutilizing images' potential to carry complex contextual information for attacks.

Method: Contextual Image Attack (CIA) uses a multi-agent system to embed harmful queries into seemingly benign visual contexts with four visualization strategies, plus contextual element enhancement and automatic toxicity obfuscation techniques.

Result: CIA achieves toxicity scores of 4.73 and 4.83 against GPT-4o and Qwen2.5-VL-72B models respectively, with Attack Success Rates of 86.31% and 91.07%, significantly outperforming prior work.

Conclusion: The visual modality itself is a potent vector for jailbreaking advanced MLLMs, demonstrating that image-centric attacks can be highly effective against multimodal safety alignments.

Abstract: While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack’s efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

[132] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration

Mizuki Kikkawa, Tatsuya Yatagawa, Yutaka Ohtake, Hiromasa Suzuki

Main category: cs.CV

TL;DR: The paper analyzes limitations of deep-learning-based point set registration methods using Gaussian mixture models (GMMs) for partial-to-partial registration, particularly DeepGMR, and proposes an attention-based reference point shifting (ARPS) layer to achieve transformation-invariant features.

Details

Motivation: The study investigates problems with deep-learning-based registration methods using GMMs for partial-to-partial point set registration, focusing on the limitations of DeepGMR. The goal is to understand why these methods fail and propose a solution for achieving transformation-invariant features.

Method: The authors introduce an attention-based reference point shifting (ARPS) layer that robustly identifies a common reference point between two partial point sets using an attention module. This layer is designed to extract transformation-invariant features by finding common reference points rather than focusing on overlap regions.

Result: The ARPS layer significantly enhances the performance of DeepGMR and its variant UGMMReg. These extended models outperform prior deep learning methods that use attention blocks and Transformers to extract overlap regions or common reference points.

Conclusion: The findings provide deeper insights into registration methods using deep learning and GMMs, demonstrating that focusing on common reference points rather than overlap regions leads to better performance in partial-to-partial point set registration.

Abstract: This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, to the partial-to-partial point set registration. Our primary goal is to uncover the causes behind such methods and propose a comprehensible solution for that. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.

[133] A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation

Wenjing Yu, Shuo Jiang, Yifei Chen, Shuo Chang, Yuanhan Wang, Beining Wu, Jie Dong, Mingxuan Liu, Shenghao Zhu, Feiwei Qin, Changmiao Wang, Qiyuan Tian

Main category: cs.CV

TL;DR: MedSeg-TTA is a comprehensive benchmark for test-time adaptation in medical image segmentation, evaluating 20 methods across 7 imaging modalities with standardized protocols to enable systematic comparison and guide clinical deployment.

Details

Motivation: Current evaluations of test-time adaptation for medical image segmentation are limited in modality coverage, task diversity, and methodological consistency, creating a need for standardized benchmarking to assess method reliability across diverse clinical scenarios.

Method: Created MedSeg-TTA benchmark with 20 representative adaptation methods across 7 imaging modalities (MRI, CT, ultrasound, pathology, dermoscopy, OCT, chest X-ray) using unified data preprocessing, backbone configuration, and test-time protocols. Examined four adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation.

Result: No single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts, feature-level and output-level methods excel in boundary-related metrics, and prior-based methods show strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts.

Conclusion: MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard to establish rigorous foundations for future research on robust, clinically reliable test-time adaptation, emphasizing the importance of principled method selection for clinical deployment.

Abstract: Test time Adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source codes and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.

[134] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

Main category: cs.CV

TL;DR: dots.ocr is a unified Vision-Language Model that jointly learns document layout parsing tasks (layout detection, text recognition, relational understanding) in an end-to-end framework, achieving SOTA performance with strong multilingual capabilities.

Details

Motivation: Current document layout parsing methods use fragmented multi-stage pipelines that suffer from error propagation and fail to leverage joint training synergies. There's a need for unified models to better interpret structured knowledge for next-generation Vision-Language Models.

Method: Introduces dots.ocr, a single Vision-Language Model that jointly learns three core tasks (layout detection, text recognition, relational understanding) in a unified end-to-end framework. Uses a highly scalable data engine to synthesize a vast multilingual corpus.

Result: Achieves state-of-the-art performance on OmniDocBench. On the new XDocParse benchmark (126 languages), dots.ocr outperforms the next-best competitor by +7.4 points, demonstrating unparalleled multilingual capabilities.

Conclusion: The unified end-to-end paradigm for document layout parsing is effective and scalable. dots.ocr establishes a powerful baseline for global document intelligence research, showing advantages of joint learning over fragmented pipelines.

Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world’s vast stores of structured knowledge. This process,which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

[135] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang

Main category: cs.CV

TL;DR: GeoDiT is a diffusion-based vision-language model for geospatial tasks that outperforms autoregressive models by using parallel refinement instead of sequential generation.

Details

Motivation: Autoregressive models are fundamentally misaligned with geospatial understanding because they force sequential narratives onto inherently parallel spatial scenes, hindering structured and coherent output generation.

Method: Introduces GeoDiT, the first diffusion-based vision-language model for geospatial domain, reframing generation as parallel refinement process with holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously.

Result: GeoDiT establishes new state-of-the-art on benchmarks requiring structured, object-centric outputs, achieving significant gains in image captioning, visual grounding, and multi-object detection where autoregressive models falter.

Conclusion: Aligning the generative process with the data’s intrinsic structure is key to unlocking superior performance in complex geospatial analysis, validating the parallel refinement approach over sequential autoregressive methods.

Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data’s intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

[136] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Aditya Chaudhary, Prachet Dev Singh, Ankit Jha

Main category: cs.CV

TL;DR: ViT-SR uses a two-stage training strategy with self-supervised pretraining on colourization, then fine-tuning for 4x super-resolution, achieving state-of-the-art results on DIV2K benchmark.

Details

Motivation: Single Image Super-Resolution (SISR) remains a challenging problem in computer vision. The authors aim to improve Vision Transformer (ViT) performance for SISR by leveraging self-supervised learning to extract rich, generalizable visual representations from data.

Method: Two-stage training strategy: 1) Self-supervised pretraining on a colourization task to learn general visual representations, 2) Fine-tuning for 4x super-resolution. The model predicts a high-frequency residual image added to initial bicubic interpolation, simplifying residual learning.

Result: Achieved impressive results on DIV2K benchmark: SSIM of 0.712 and PSNR of 22.90 dB, demonstrating the efficacy of the two-stage approach and self-supervised pre-training for complex image restoration tasks.

Conclusion: The two-stage training strategy with self-supervised pretraining is effective for SISR. ViT-SR shows potential for image restoration tasks, with opportunities for further improvements using larger ViT architectures or alternative pretext tasks.

Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.

[137] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, Bo Yang

Main category: cs.CV

TL;DR: SkyMoE is a Mixture-of-Experts vision-language model designed for remote sensing tasks that addresses the limitations of general-purpose VLMs by adaptively routing tasks based on interpretation granularity and task type.

Details

Motivation: General-purpose vision-language models are suboptimal for remote sensing tasks. Existing geospatial VLMs use unified modeling and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception with global contextual understanding.

Method: SkyMoE uses an adaptive router that generates task- and granularity-aware routing instructions to direct specialized LLM experts to handle different sub-tasks. It also employs a context-disentangled augmentation strategy that creates contrastive pairs between local and global features to promote expert decoupling and granularity sensitivity. The authors also construct MGRS-Bench, a comprehensive benchmark for evaluation.

Result: Extensive experiments on 21 public datasets demonstrate state-of-the-art performance across tasks, validating SkyMoE’s adaptability, scalability, and superior multi-granularity understanding in remote sensing.

Conclusion: SkyMoE effectively addresses the limitations of existing geospatial VLMs by providing specialized expert routing based on task granularity, achieving superior performance in multimodal, multi-task remote sensing interpretation through its adaptive MoE architecture and context-disentangled learning approach.

Abstract: The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.

[138] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection

Tai Le-Gia

Main category: cs.CV

TL;DR: This dissertation addresses zero-shot anomaly classification/segmentation, analyzing consistent anomaly failures, proposing CoDeGraph framework for filtering them, extending to 3D medical imaging, and bridging batch/text methods.

Details

Motivation: Zero-shot anomaly detection is crucial for industrial inspection and medical imaging where training data is unavailable. The research addresses core challenges like consistent anomalies that bias distance-based methods.

Method: 1) Formalized consistent anomaly problem and analyzed patch representations from pre-trained Vision Transformers; 2) Developed CoDeGraph framework using multi-stage graph construction, community detection, and structured refinement; 3) Extended to 3D medical imaging with training-free volumetric tokenization; 4) Bridged batch/text methods using CoDeGraph pseudo-masks to supervise vision-language models.

Result: Identified similarity scaling and neighbor-burnout phenomena in patch relationships; CoDeGraph effectively suppresses consistent anomalies; Achieved zero-shot 3D anomaly segmentation without training samples; Demonstrated pseudo-masks can supervise prompt-driven vision-language models.

Conclusion: The dissertation provides theoretical understanding and practical solutions for zero-shot anomaly classification/segmentation, addressing consistent anomaly failures, extending to 3D medical imaging, and bridging different zero-shot approaches.

Abstract: Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.

[139] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha

Main category: cs.CV

TL;DR: Noisy Query Tokens method bridges Vision-Language Models with Diffusion Models using distributed representation learning to prevent task generalization collapse and enable stable continual learning.

Details

Motivation: Current methods using fixed learnable query tokens for connecting VLMs and Diffusion Models suffer from task generalization collapse - they fail to adapt to new tasks that are distant from their pre-training tasks, despite being computationally efficient.

Method: Proposes Noisy Query Tokens that learn a distributed representation space between VLM and Diffusion Model via end-to-end optimization. Also introduces a VAE branch with linear projection to recover fine-grained image details.

Result: Experimental results confirm the approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

Conclusion: The proposed method successfully addresses the generalization collapse problem in bridging VLMs with Diffusion Models, providing a solution for stable continual learning while maintaining computational efficiency.

Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

[140] AVGGT: Rethinking Global Attention for Accelerating VGGT

Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang

Main category: cs.CV

TL;DR: Training-free acceleration method for multi-view 3D models that achieves 8-10× speedup while maintaining or improving accuracy by strategically converting early global attention layers to frame attention and subsampling global attention with diagonal preservation.

Details

Motivation: Existing multi-view 3D models like VGGT and π³ rely heavily on global self-attention, resulting in high computational costs. While sparse-attention variants offer partial speedups, they lack systematic understanding of how global attention contributes to multi-view reasoning.

Method: 1) Conducted in-depth analysis of global attention modules in VGGT and π³, revealing role division: early layers don’t form meaningful correspondences, middle layers perform cross-view alignment, last layers provide minor refinements. 2) Proposed training-free two-step acceleration: convert early global layers to frame attention, and subsample global attention by subsampling K/V over patch tokens with diagonal preservation and mean-fill component.

Result: Achieves 8-10× speedup in inference time while matching or slightly improving accuracy of original models. Remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.

Conclusion: The systematic analysis of global attention roles enables effective acceleration strategy that maintains performance while dramatically reducing computational cost, offering practical solution for efficient multi-view 3D reasoning.

Abstract: Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.

[141] OmniPerson: Unified Identity-Preserving Pedestrian Generation

Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, Yifan Xu

Main category: cs.CV

TL;DR: OmniPerson is a unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks, offering holistic control over pedestrian attributes and achieving state-of-the-art generation quality.

Details

Motivation: Person re-identification suffers from limited training data due to privacy concerns and annotation costs. Existing pedestrian generation methods fail to ensure identity consistency and lack sufficient controllability for effective dataset augmentation.

Method: 1) OmniPerson unified generation model with holistic control over pedestrian attributes (RGB/IR modality, image/video generation, multi-reference inputs, poses, text). 2) Multi-Refer Fuser for robust identity preservation from any number of reference images. 3) PersonSyn dataset with automated curation pipeline transforming public ReID benchmarks into richly annotated resources.

Result: OmniPerson achieves state-of-the-art in pedestrian generation, excelling in visual fidelity and identity consistency. Augmenting existing datasets with generated data consistently improves ReID model performance.

Conclusion: OmniPerson provides an effective solution for generating high-quality pedestrian data for ReID tasks, addressing data scarcity issues while maintaining identity consistency and offering comprehensive controllability for dataset augmentation.

Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, We introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We proposed OmniPerson, a unified generation model, offering holistic and fine-grained control over all key pedestrian attributes. Supporting RGB/IR modality image/video generation with any number of reference images, two kinds of person poses, and text. Also including RGB-to-IR transfer and image super-resolution abilities.2) We designed Multi-Refer Fuser for robust identity preservation with any number of reference images as input, making OmniPerson could distill a unified identity from a set of multi-view reference images, ensuring our generated pedestrians achieve high-fidelity pedestrian generation.3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

[142] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy

Main category: cs.CV

TL;DR: Panel2Patch extracts hierarchical vision-language supervision from biomedical figures by parsing layouts, panels, and markers to create multi-granular training data, enabling better pretraining with less data.

Details

Motivation: Current biomedical vision-language pretraining compresses rich scientific figures into coarse figure-level pairs, discarding fine-grained correspondences that clinicians actually use when examining local structures.

Method: Panel2Patch parses layouts, panels, and visual markers from biomedical figures and captions, constructing hierarchical aligned vision-language pairs at figure, panel, and patch levels. Uses granularity-aware pretraining strategy unifying heterogeneous objectives from coarse descriptions to fine region-focused phrases.

Result: By applying Panel2Patch to only a small set of literature figures, researchers extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

Conclusion: Panel2Patch addresses the limitation of coarse figure-level pretraining by mining hierarchical structure from biomedical literature, preserving local semantics and enabling more effective vision-language representation learning with less data.

Abstract: There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

[143] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

Yafei Song, Peng Zhang, Bang Zhang

Main category: cs.CV

TL;DR: A novel framework for co-speech gesture synthesis using diffusion models for motion generation and motion graph retrieval for video production, addressing the many-to-many mapping challenge between audio and gestures.

Details

Motivation: Existing methods using motion graphs with simple audio-motion feature matching are suboptimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping approaches.

Method: 1) Use diffusion model to generate gesture motions by learning joint audio-motion distribution with both low-level and high-level audio features. 2) Apply motion-based retrieval algorithm to find optimal path in motion graph using global and local similarity. 3) Seamlessly stitch retrieved motion segments for coherent video output.

Result: Experimental results show significant improvement over prior approaches in synchronization accuracy and naturalness of generated gestures.

Conclusion: The proposed framework effectively addresses the many-to-many mapping challenge in co-speech gesture synthesis by combining diffusion-based motion generation with sophisticated motion graph retrieval, resulting in more synchronized and natural gesture videos.

Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

[144] Content-Aware Texturing for Gaussian Splatting

Panagiotis Papantonakis, Georgios Kopanas, Fredo Durand, George Drettakis

Main category: cs.CV

TL;DR: Texture mapping for Gaussian Splatting: Using per-primitive texture maps to represent detailed appearance, reducing the need for excessive small Gaussian primitives while maintaining quality.

Details

Motivation: Current Gaussian Splatting methods waste resources by representing fine appearance details as many small Gaussian primitives, even when geometry and appearance have different frequency characteristics. Texture mapping offers a more efficient alternative for detailed appearance representation.

Method: Proposes a new appearance representation with per-primitive texture maps for 2D Gaussian primitives. Texture resolution adapts to image sampling frequency and content through adaptive upscaling/downscaling during optimization. Also enables control of primitive count based on texture resolution.

Result: The approach performs favorably in image quality while using fewer total parameters compared to alternative textured Gaussian primitive solutions.

Conclusion: Texture mapping can be effectively integrated into Gaussian Splatting optimization to efficiently represent detailed appearance, reducing parameter count while maintaining rendering quality.

Abstract: Gaussian Splatting has become the method of choice for 3D reconstruction and real-time rendering of captured real scenes. However, fine appearance details need to be represented as a large number of small Gaussian primitives, which can be wasteful when geometry and appearance exhibit different frequency characteristics. Inspired by the long tradition of texture mapping, we propose to use texture to represent detailed appearance where possible. Our main focus is to incorporate per-primitive texture maps that adapt to the scene in a principled manner during Gaussian Splatting optimization. We do this by proposing a new appearance representation for 2D Gaussian primitives with textures where the size of a texel is bounded by the image sampling frequency and adapted to the content of the input images. We achieve this by adaptively upscaling or downscaling the texture resolution during optimization. In addition, our approach enables control of the number of primitives during optimization based on texture resolution. We show that our approach performs favorably in image quality and total number of parameters used compared to alternative solutions for textured Gaussian primitives. Project page: https://repo-sam.inria.fr/nerphys/gs-texturing/

[145] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu

Main category: cs.CV

TL;DR: RULER-Bench is a new benchmark for evaluating rule-based reasoning capabilities in video generation models, covering 40 tasks across 6 rule categories with 622 annotated instances, revealing current models only achieve 48.87% rule coherence.

Details

Motivation: Existing video generation benchmarks focus on visual perception factors (aesthetics, instruction adherence, temporal coherence) but neglect rule-based reasoning capabilities. There's a gap in fine-grained decomposition and comprehensive evaluation of reasoning abilities in video generation models.

Method: Built RULER-Bench benchmark with two fundamental paradigms (text-to-video and image-to-video) covering 40 representative tasks across 6 rule categories with 622 high-quality annotated instances. Uses checklist with 4 metrics and GPT-o3 for scoring, achieving 85% alignment with human judgments.

Result: State-of-the-art video generation models achieve only 48.87% on the rule coherence metric, showing significant room for improvement in reasoning capabilities. The benchmark provides comprehensive evaluation revealing current limitations.

Conclusion: RULER-Bench addresses the gap in evaluating rule-based reasoning in video generation models, providing insights to facilitate development of reasoning-aware video generation and advance models toward vision foundation intelligence.

Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

[146] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu

Main category: cs.CV

TL;DR: PPTBench is a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks, revealing that current MLLMs struggle with visual-layout reasoning despite good semantic understanding.

Details

Motivation: Existing benchmarks overlook layout-centric challenges in PowerPoint presentations, which are central to real-world slide creation and editing. PowerPoint presentations combine rich textual content with structured visual layouts, making them ideal for evaluating multimodal reasoning and layout understanding abilities of MLLMs.

Method: Introduced PPTBench using 958 PPTX files with 4,439 samples across four categories: Detection, Understanding, Modification, and Generation. The benchmark evaluates models’ ability to handle both semantic content and visual-layout reasoning in PowerPoint scenarios.

Result: Reveals substantial gap between semantic understanding and visual-layout reasoning in current MLLMs. Models can interpret slide content but fail to produce coherent spatial arrangements. Current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into API planning. Case studies show systematic layout errors like misalignment and element overlap.

Conclusion: PPTBench provides new perspective for evaluating VLLMs in PPT scenarios, highlighting challenges in visual-structural reasoning and coherent slide generation. The findings point to future research directions for improving multimodal reasoning capabilities. All datasets and code are released for reproducibility.

Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. And case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provides a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

[147] DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose, Yusuke Sekikawa, Nakamasa Inoue

Main category: cs.CV

TL;DR: DF-Mamba: A deformable state scanning Mamba framework for 3D hand pose estimation that addresses occlusion challenges by better capturing global context relationships.

Details

Motivation: Current 3D hand pose estimation methods struggle with severe occlusions (like overlapping hands) due to ResNet's limited global context modeling. CNN inductive bias isn't optimal for learning relationships between local occluded features and global contextual cues.

Method: Proposes Deformable Mamba (DF-Mamba) using state space modeling with deformable state scanning. It aggregates local convolution features while selectively preserving useful global context cues through selective state modeling and deformable scanning.

Result: Outperforms latest image backbones (VMamba, Spatial-Mamba) on all five diverse datasets covering single-hand/two-hand scenarios, hand-only/hand-object interactions, and RGB/depth-based estimation. Achieves state-of-the-art performance with ResNet-50 comparable inference speed.

Conclusion: DF-Mamba effectively addresses occlusion challenges in 3D hand pose estimation by capturing global context beyond standard convolution, demonstrating superior performance across diverse hand interaction scenarios.

Abstract: Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN’s inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba’s selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.

[148] Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening

Yongchuan Cui, Peng Liu, Yi Zeng

Main category: cs.CV

TL;DR: A novel pretraining strategy using large-scale simulated datasets to improve generalization of remote sensing image fusion models across different satellite sensors.

Details

Motivation: Existing deep learning methods for remote sensing image fusion suffer from poor generalization due to limited real training data and domain gaps between different satellite sensors.

Method: Construct diverse simulated datasets using degradation operations and augmentations on natural and remote sensing images, then pretrain fusion models on these simulated data to learn robust spatial-spectral representations.

Result: Pretrained models significantly improve generalization across six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) in zero-shot and one-shot scenarios, working well with CNN, Transformer, and Mamba architectures.

Conclusion: The approach provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion, and demonstrates the potential of foundation models with advanced training strategies.

Abstract: Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.

[149] Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

Main category: cs.CV

TL;DR: RAMF framework improves hate speech detection in videos using multimodal fusion and adversarial reasoning, outperforming SOTA by 3% F1 and 7% hate recall.

Details

Motivation: Hate speech in online videos is a serious threat, and existing methods struggle with multimodal semantic fusion and understanding nuanced hateful content.

Method: Proposes Reasoning-Aware Multimodal Fusion (RAMF) with: 1) Local-Global Context Fusion (LGCF) for capturing local cues and global temporal structures, 2) Semantic Cross Attention (SCA) for fine-grained multimodal interaction, and 3) adversarial reasoning using vision-language model to generate objective descriptions, hate-assumed inferences, and non-hate-assumed inferences.

Result: Achieves robust generalization on two real-world hateful video datasets, improving state-of-the-art by 3% in Macro-F1 and 7% in hate class recall.

Conclusion: RAMF effectively addresses multimodal fusion challenges and nuanced hate understanding through structured reasoning, demonstrating superior performance for hate speech detection in videos.

Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning-a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences-providing complementary semantic perspectives that enrich the model’s contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

[150] PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking

Dong Li, Jiahao Xiong, Yingda Huang, Le Chang

Main category: cs.CV

TL;DR: PoreTrack3D is the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking, containing over 440,000 facial trajectories with 68 manually reviewed full-length trajectories.

Details

Motivation: To advance the study of fine-grained facial expressions through analysis of subtle skin-surface motion by capturing both traditional facial landmarks and pore-scale keypoints trajectory, which hasn't been addressed in existing benchmarks.

Method: Created a comprehensive benchmark dataset with over 440,000 facial trajectories, including more than 52,000 trajectories longer than 10 frames and 68 manually reviewed trajectories spanning 150 frames. Developed a pipeline for high-fidelity facial motion capture and dynamic 3D reconstruction.

Result: Established the first performance baseline in this domain by systematically evaluating state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D. The dataset is publicly available and provides a new framework for facial motion capture research.

Conclusion: PoreTrack3D represents a significant advancement in fine-grained facial expression analysis, enabling research on subtle skin-surface motion through pore-scale trajectory tracking and establishing a foundation for future work in dynamic 3D Gaussian splatting for facial motion capture.

Abstract: We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmarks and pore-scale keypoints trajectory, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset’s creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset are publicly available at: https://github.com/JHXion9/PoreTrack3D

[151] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

Agathoklis Georgiou

Main category: cs.CV

TL;DR: Hybrid architecture combining ColPali’s patch-level visual similarity with OCR spatial coordinates to enable precise region retrieval for RAG, without additional training.

Details

Motivation: Current VLMs like ColPali retrieve entire pages rather than specific regions, limiting utility for RAG where precise context is needed. OCR systems have spatial coordinates but lack semantic relevance assessment.

Method: Unifies visual and OCR paradigms by using ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. Formalizes coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduces intersection metrics for relevance propagation.

Result: Establishes theoretical bounds on retrieval precision. Approach operates at inference time without additional training. Released Snappy as open-source implementation with empirical evaluation ongoing.

Conclusion: Proposed hybrid architecture bridges the gap between semantic visual retrieval and precise spatial localization, enabling more effective region-level retrieval for RAG applications.

Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.

[152] PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes

Derui Shan, Qian Qiao, Hao Lu, Tao Du, Peng Lu

Main category: cs.CV

TL;DR: PolarGuide-GSDR is a novel framework that integrates polarization information with 3D Gaussian Splatting to achieve high-fidelity reconstruction of reflective scenes with real-time rendering, without needing environment maps or restrictive material assumptions.

Details

Motivation: Existing methods have limitations: Polarization-aware NeRF is slow and has restrictive assumptions, while 3D Gaussian Splatting struggles with reflection-geometry entanglement and requires environment maps for deferred reflection modules.

Method: A polarization-forward-guided paradigm establishing bidirectional coupling between polarization and 3DGS: 1) 3DGS geometric priors resolve polarization ambiguity, 2) refined polarization cues guide 3DGS’s normal and spherical harmonic representation for reflection separation.

Result: State-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis on public and self-collected datasets, while maintaining real-time rendering capabilities.

Conclusion: First framework embedding polarization priors directly into 3DGS optimization, achieving superior interpretability and real-time performance for modeling complex reflective scenes without environment maps or restrictive material assumptions.

Abstract: Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction from reflection-geometry entanglement, adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first 3DGS’s geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization information cues are used to guide 3DGS’s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.

[153] Defense That Attacks: How Robust Models Become Better Attackers

Mohamed Awad, Mahmoud Akrm, Walid Gomaa

Main category: cs.CV

TL;DR: Adversarial training paradoxically increases transferability of adversarial examples, creating new ecosystem risks despite improving model robustness.

Details

Motivation: While adversarial training improves model robustness against attacks, its effect on attack transferability remains underexplored. The paper investigates whether adversarial training unintentionally makes adversarial examples more transferable between models.

Method: Trained a diverse zoo of 36 models (CNNs and ViTs), conducted comprehensive transferability experiments to analyze how adversarial training affects the transferability of adversarial examples between models.

Result: Revealed a clear paradox: adversarially trained models produce perturbations that transfer more effectively than those from standard models, introducing new ecosystem risks in machine learning security.

Conclusion: Robustness evaluations should assess both a model’s resistance to transferred attacks and its propensity to produce transferable adversarial examples. The authors release all models, code, and experimental scripts for reproducibility.

Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduce a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.

[154] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking

Qionglin Ren, Dawei Zhang, Chunxu Tian, Dan Zhang

Main category: cs.CV

TL;DR: UAUTrack is a unified single-target tracking framework for Anti-UAV that integrates multiple modalities using text prior prompts, achieving state-of-the-art performance with good accuracy-speed tradeoff.

Details

Motivation: Current Anti-UAV tracking lacks unified cross-modal collaboration frameworks, with existing approaches focusing on independent models and struggling with effective multimodal data fusion. The field is still in early stages with limited solutions.

Method: Proposes UAUTrack: a single-stream, single-stage, end-to-end architecture with text prior prompt strategy that directs model attention to UAVs across scenarios, enabling effective multimodal integration.

Result: Achieves state-of-the-art performance on Anti-UAV and DUT Anti-UAV datasets, maintains favorable accuracy-speed trade-off on Anti-UAV410 dataset, demonstrating high accuracy and practical efficiency.

Conclusion: UAUTrack successfully addresses cross-modal collaboration challenges in Anti-UAV tracking through unified framework design and text prompting, advancing the field with both accuracy and efficiency improvements.

Abstract: Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

[155] PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution

Zhongbao Yang, Jiangxin Dong, Yazhou Yao, Jinhui Tang, Jinshan Pan

Main category: cs.CV

TL;DR: PGP-DiffSR is a lightweight diffusion method for image super-resolution that reduces computational costs through progressive pruning and phase-guided adaptation.

Details

Motivation: Existing diffusion-based super-resolution models rely on large backbones like SDXL and DiT, which cause excessive computational and memory costs during training and inference.

Method: 1) Progressive pruning to remove redundant blocks within diffusion backbone while preserving restoration capability. 2) Phase-exchange adapter module that uses phase information from inputs to guide the pruned diffusion model for better restoration.

Result: Achieves competitive restoration quality while significantly reducing computational load and memory consumption compared to large diffusion models.

Conclusion: PGP-DiffSR provides an efficient lightweight diffusion approach for image super-resolution that balances performance with computational efficiency through redundancy removal and phase-guided adaptation.

Abstract: Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while reserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at https://github.com/yzb1997/PGP-DiffSR.

[156] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance

Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen, Tieliang Gong, Yong-Jin Liu

Main category: cs.CV

TL;DR: FASA is a two-stage foreground-aware slot attention framework that explicitly separates foreground from background using dual-slot competition and pseudo-mask guidance to improve object discovery in real-world scenes.

Details

Motivation: Existing slot attention methods process foreground and background indiscriminately, causing background interference and suboptimal instance discovery on real-world data. There's a need for explicit foreground-background separation to improve object discovery performance.

Method: Two-stage framework: 1) Coarse scene decomposition with dual-slot competition to separate foreground from background using clustering-based initialization. 2) Masked slot attention where first slot captures background and remaining slots compete for foreground objects, guided by pseudo-masks from patch affinity graphs using self-supervised features.

Result: Extensive experiments on synthetic and real-world datasets show FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition.

Conclusion: Explicit foreground modeling with pseudo-mask guidance enables precise object discovery and object-coherent representation learning, addressing limitations of existing slot attention approaches that indiscriminately process foreground and background regions.

Abstract: Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.

Abdullah Al Asif, Farhana Chowdhury Tisha

Main category: cs.CV

TL;DR: A COVID-19 monitoring system that detects face mask usage and physical distancing compliance using attention-inceptionv3 model with 98%+ accuracy and 25 FPS performance.

Details

Motivation: To help minimize COVID-19 spread by automatically monitoring compliance with WHO protocols (face masks and 6-foot physical distancing) in public spaces.

Method: Developed a system using customized attention-inceptionv3 model trained on 10,800 images from two datasets (with/without masks) to detect mask usage and physical distancing.

Result: Achieved 98% training accuracy, 99.5% validation accuracy, 98.2% precision, and 25 FPS frame rate for real-time monitoring.

Conclusion: The system can effectively identify high-risk areas for virus spread, enabling authorities to take timely preventive measures and alert local populations.

Abstract: One of the deadliest pandemics is now happening in the current world due to COVID-19. This contagious virus is spreading like wildfire around the whole world. To minimize the spreading of this virus, World Health Organization (WHO) has made protocols mandatory for wearing face masks and maintaining 6 feet physical distance. In this paper, we have developed a system that can detect the proper maintenance of that distance and people are properly using masks or not. We have used the customized attention-inceptionv3 model in this system for the identification of those two components. We have used two different datasets along with 10,800 images including both with and without Face Mask images. The training accuracy has been achieved 98% and validation accuracy 99.5%. The system can conduct a precision value of around 98.2% and the frame rate per second (FPS) was 25.0. So, with this system, we can identify high-risk areas with the highest possibility of the virus spreading zone. This may help authorities to take necessary steps to locate those risky areas and alert the local people to ensure proper precautions in no time.

[158] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Yuxing Liu, Yong Liu

Main category: cs.CV

TL;DR: ClimaDrive is a semantics-guided image-to-image framework that synthesizes realistic out-of-distribution driving data with weather diversity and physical plausibility, addressing limitations in existing anomaly data generation methods.

Details

Motivation: Anomaly segmentation is crucial for safe autonomous driving but suffers from limited and non-diverse anomaly data. Existing synthetic data generation methods lack contextual coherence and physical realism, creating domain gaps between synthetic and real data.

Method: ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting to create semantically coherent and physically plausible OoD driving data. The framework constructs ClimaOoD, a large-scale benchmark spanning six driving scenarios under clear and adverse weather conditions.

Result: Training with ClimaOoD leads to robust improvements in anomaly segmentation across four state-of-the-art methods. AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF.

Conclusion: ClimaOoD enhances model robustness and offers valuable training data for better generalization in open-world anomaly detection, demonstrating the effectiveness of semantically coherent and weather-diverse synthetic data generation.

Abstract: Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.

[159] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection

Omid Reza Heidari, Yang Wang, Xinxin Zuo

Main category: cs.CV

TL;DR: ALDI++ domain adaptation framework with ViTDet backbone achieves SOTA performance for object detection in security X-ray imagery by addressing domain shift through self-distillation and feature alignment.

Details

Motivation: Domain adaptation is critical for object detection in real-world applications where distribution shifts degrade model performance, especially in security X-ray imaging where variations in scanning devices and environmental conditions cause significant domain discrepancies.

Method: ALDI++ framework integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift. Uses Vision Transformer for Detection (ViTDet) backbone for cross-domain object detection.

Result: ALDI++ surpasses state-of-the-art domain adaptation methods across multiple adaptation scenarios on EDS dataset. Achieves highest mean average precision (mAP) with ViTDet backbone. Category-wise analysis shows consistent improvements across diverse object classes.

Conclusion: ALDI++ establishes an efficient solution for domain-adaptive object detection, setting new benchmark for performance stability and cross-domain generalization in security X-ray imagery, confirming effectiveness of transformer-based architectures.

Abstract: Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.

[160] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du

Main category: cs.CV

TL;DR: GeoBridge is a foundation model for cross-view geo-localization that uses semantic anchors to bridge multi-view features through textual descriptions, enabling bidirectional matching across drone, satellite, and street views with language-to-image retrieval support.

Details

Motivation: Traditional satellite-centric geo-localization has limitations: it's fragile when high-resolution/up-to-date satellite imagery is unavailable, and fails to exploit complementary cues across different views (drone, satellite, street) and modalities (language, image).

Method: Proposes GeoBridge with a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions. Also introduces GeoLoc dataset - first large-scale cross-modal multi-view aligned dataset with 50,000+ pairs of drone, street-view panorama, satellite images and textual descriptions from 36 countries.

Result: GeoLoc pre-training significantly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The model supports bidirectional matching across views and language-to-image retrieval.

Conclusion: GeoBridge addresses limitations of traditional satellite-centric geo-localization by leveraging multi-view and multi-modal data through semantic anchors, with the GeoLoc dataset providing crucial training resources for robust, flexible localization.

Abstract: Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at https://github.com/MiliLab/GeoBridge.

[161] EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis

Yancheng Zhang, Guangyu Sun, Chen Chen

Main category: cs.CV

TL;DR: EGGS proposes a hybrid Gaussian representation combining 2D and 3D Gaussians to balance appearance fidelity and geometric accuracy in novel view synthesis.

Details

Motivation: 3D Gaussian Splatting provides high appearance fidelity but suffers from multi-view inconsistencies, while 2D Gaussian Splatting ensures multi-view consistency but loses texture details. There's a need to balance both appearance and geometry in novel view synthesis.

Method: EGGS integrates 2D and 3D Gaussians with three key components: Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between representations, and Frequency-Decoupled Optimization to exploit strengths of each Gaussian type.

Result: Extensive experiments show EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, with CUDA-accelerated implementation ensuring practical training and inference.

Conclusion: EGGS provides a practical solution for high-quality novel view synthesis by effectively balancing appearance fidelity and geometric accuracy through hybrid Gaussian representation.

Abstract: Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.

[162] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen

Main category: cs.CV

TL;DR: VLM-Pruner is a training-free token pruning method for vision-language models that balances redundancy reduction with spatial coverage to maintain object details while achieving high compression rates and inference speedup.

Details

Motivation: Existing token pruning methods for VLMs either overlook inter-token redundancy (wasting capacity) or ignore spatial relationships (leading to sparse selections that fail to cover target objects), creating computational bottlenecks for mobile deployment.

Method: Proposes VLM-Pruner with: 1) centrifugal token pruning paradigm for near-to-far selection preserving fine-grained details, 2) Buffering for Spatial Sparsity (BSS) criterion to defer selection of spatially distant tokens, 3) parallel greedy strategy for efficient token selection, and 4) selective fusion of salient information from discarded tokens into retained ones.

Result: Consistently outperforms strong baselines across five VLMs with 88.9% pruning rate while delivering end-to-end inference speedup.

Conclusion: VLM-Pruner effectively addresses limitations of existing pruning methods by balancing redundancy reduction with spatial coverage, enabling efficient deployment of VLMs on resource-constrained devices without compromising object understanding.

Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.

[163] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu, Yujie Zhao, Haoqiang Kang, Daniel Zhao, Tajana Rosing, Hao Zhang

Main category: cs.CV

TL;DR: VideoScience-Bench is a new benchmark evaluating video models’ scientific reasoning capabilities across 200 physics/chemistry prompts requiring understanding of multiple concepts, with expert evaluations across 5 dimensions showing strong correlation between VLM-as-a-Judge and human assessments.

Details

Motivation: Current video generation models need zero-shot reasoning capabilities and understanding of real-world scientific laws for accurate physical outcome modeling. Existing benchmarks focus on physical commonsense but lack evaluation of scientific reasoning ability at undergraduate level.

Method: Created VideoScience-Bench with 200 carefully curated prompts spanning 14 topics and 103 concepts in physics/chemistry. Conducted expert-annotated evaluations across 7 state-of-the-art video models in T2V and I2V settings along 5 dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Used VLM-as-a-Judge approach to assess video generations.

Result: Strong correlation observed between VLM-as-a-Judge assessments and human expert evaluations. The benchmark successfully evaluates video models as both generators and reasoners, requiring scientific understanding consistent with expected physical/chemical phenomena.

Conclusion: VideoScience-Bench is the first benchmark to evaluate video models as scientific reasoners, not just generators. It provides comprehensive evaluation of undergraduate-level scientific understanding across multiple concepts and domains, with automated assessment showing strong correlation with human judgment.

Abstract: The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models’ scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: \href{https://github.com/hao-ai-lab/VideoScience}{github.com/hao-ai-lab/VideoScience}.

[164] Tissue-mask supported inter-subject whole-body image registration in the UK Biobank – A method benchmarking study

Yasemin Utkueri, Elin Lundström, Håkan Ahlström, Johan Öfverstedt, Joel Kullberg

Main category: cs.CV

TL;DR: Sex-stratified whole-body MR image registration using adipose tissue and muscle masks improves registration accuracy for UK Biobank data compared to intensity-only and existing methods.

Details

Motivation: To enable body-wide spatial standardization and correlation analysis of non-imaging data with image-derived parameters from UK Biobank's whole-body MR images, requiring robust inter-subject registration.

Method: Proposed sex-stratified inter-subject registration that uses subcutaneous adipose tissue and muscle masks from VIBESegmentator to augment intensity-based graph-cut registration.

Result: Mean Dice scores of 0.77/0.75 for males/females, 6pp higher than intensity-only method, 9pp/8pp better than uniGradICON, and 12pp/13pp better than MIRTK. Age-correlation maps were less noisy and better aligned.

Conclusion: Using two tissue masks improves whole-body registration of UK Biobank images, enabling more accurate spatial standardization for medical research applications.

Abstract: The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue- and muscle-masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment. In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images.

[165] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang

Main category: cs.CV

TL;DR: GeoViS is a geospatially rewarded visual search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process, outperforming existing methods on multiple benchmarks.

Details

Motivation: Transferring multimodal LLM visual grounding capabilities to remote sensing imagery is challenging due to extremely small targets in kilometer-scale scenes and complex geospatial relations involving relative positions, spatial hierarchies, and contextual dependencies across distant objects.

Method: GeoViS reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Instead of single-step prediction, it actively explores the global image through tree-structured visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to iteratively refine geospatial hypotheses.

Result: Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics.

Conclusion: GeoViS enables detection of subtle small-scale targets while maintaining holistic scene awareness, highlighting strong cross-domain generalization and interpretability for remote sensing visual grounding tasks.

Abstract: Recent advances in multimodal large language models(MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

[166] In-Context Sync-LoRA for Portrait Video Editing

Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or

Main category: cs.CV

TL;DR: Sync-LoRA: A method for editing portrait videos that maintains frame-accurate synchronization while allowing diverse visual modifications by training an in-context LoRA on synchronized video pairs.

Details

Motivation: Portrait video editing requires flexible control over appearance, expressions, and object additions while preserving the subject's original temporal behavior and maintaining precise frame synchronization, which is challenging with existing methods.

Method: Uses an image-to-video diffusion model where edits start from the first frame and propagate through the sequence. Trains an in-context LoRA on automatically generated paired videos showing identical motion trajectories but different appearances, filtered for temporal alignment. This teaches the model to combine source motion cues with visual changes from edited first frames.

Result: Achieves high visual fidelity and strong temporal coherence, generalizing to unseen identities and diverse edits (appearance changes, object additions, background modifications) while robustly handling pose and expression variations.

Conclusion: Sync-LoRA successfully balances edit fidelity with precise motion preservation in portrait video editing, demonstrating robust performance across diverse modifications while maintaining frame-accurate synchronization.

Abstract: Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject’s original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.

[167] Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone

Tristan Amadei, Enric Meinhardt-Llopis, Benedicte Bascle, Corentin Abgrall, Gabriele Facciolo

Main category: cs.CV

TL;DR: CAEVL enables UAV localization without paired UAV-satellite training data by using satellite-view images with domain shift augmentation, achieving competitive performance on real-world UAV data.

Details

Motivation: Current UAV localization methods require expensive paired UAV-satellite datasets that are difficult to acquire, limiting practical deployment in GNSS-denied environments.

Method: Proposes CAEVL model with a training paradigm that uses only satellite-view reference images, employing augmentation strategies to simulate domain shift between satellite and real UAV views.

Result: Achieves competitive performance compared to methods trained with paired data, validated on ViLD dataset of real-world UAV images.

Conclusion: Demonstrates effective UAV localization without requiring paired training data, showing strong generalization capabilities and practical applicability.

Abstract: Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.

[168] AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery

Rakib Ahsan, MD Sadik Hossain Shanto, Md Sultanul Arifin, Tanzima Hashem

Main category: cs.CV

TL;DR: AttMetNet: Attention-enhanced deep learning framework for methane plume detection using Sentinel-2 satellite imagery, combining NDMI with attention U-Net to reduce false positives and improve detection accuracy.

Details

Motivation: Methane is a powerful greenhouse gas contributing to global warming. Accurate detection of methane emissions is crucial for timely action and climate change mitigation. Current methods (traditional band differences/ratios and CNN-based approaches) suffer from high false positive rates and lack mechanisms to prioritize methane-specific features.

Method: AttMetNet introduces a methane-aware architecture that fuses Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. This integration selectively amplifies methane absorption features while suppressing background noise. The model uses focal loss to address severe class imbalance from limited positive plume samples and sparse plume pixels. Trained on real methane plume datasets for practical robustness.

Result: Extensive experiments show AttMetNet surpasses recent methods in methane plume detection with lower false positive rate, better precision-recall balance, and higher IoU (Intersection over Union).

Conclusion: AttMetNet establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery, offering improved accuracy and reduced false positives compared to existing methods.

Abstract: Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact on climate change. We present AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection with Sentinel-2 satellite imagery. The major challenge in developing a methane detection model is to accurately identify methane plumes from Sentinel-2’s B11 and B12 bands while suppressing false positives caused by background variability and diverse land cover types. Traditional detection methods typically depend on the differences or ratios between these bands when comparing the scenes with and without plumes. However, these methods often require verification by a domain expert because they generate numerous false positives. Recent deep learning methods make some improvements using CNN-based architectures, but lack mechanisms to prioritize methane-specific features. AttMetNet introduces a methane-aware architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By jointly exploiting NDMI’s plume-sensitive cues and attention-driven feature selection, AttMetNet selectively amplifies methane absorption features while suppressing background noise. This integration establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery. Additionally, we employ focal loss to address the severe class imbalance arising from both limited positive plume samples and sparse plume pixels within imagery. Furthermore, AttMetNet is trained on the real methane plume dataset, making it more robust to practical scenarios. Extensive experiments show that AttMetNet surpasses recent methods in methane plume detection with a lower false positive rate, better precision recall balance, and higher IoU.

[169] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin

Main category: cs.CV

TL;DR: ViSAudio is an end-to-end framework for generating binaural spatial audio directly from silent video, outperforming existing two-stage methods by using conditional flow matching with dual-branch architecture.

Details

Motivation: Current video-to-audio generation focuses on mono output, lacking spatial immersion. Existing binaural approaches use a two-stage pipeline (mono generation then spatialization) that causes error accumulation and spatio-temporal inconsistencies.

Method: Proposes ViSAudio: an end-to-end framework using conditional flow matching with dual-branch audio generation architecture. Two dedicated branches model audio latent flows, integrated with a conditional spacetime module to balance channel consistency while preserving spatial characteristics.

Result: Comprehensive experiments show ViSAudio outperforms state-of-the-art methods in both objective metrics and subjective evaluations. Generates high-quality binaural audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.

Conclusion: The paper introduces end-to-end binaural spatial audio generation from video, presents the BiAudio dataset, and demonstrates ViSAudio’s superior performance in creating spatially immersive audio that aligns precisely with visual content.

Abstract: Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.

[170] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset

Qifan Liang, Junlin Li, Zhen Han, Xihao Wang, Zhongyuan Wang, Bin Mei

Main category: cs.CV

TL;DR: STANet is the first smoke-type-aware laparoscopic video desmoking network that distinguishes between Diffusion Smoke and Ambient Smoke, using a coarse-to-fine disentanglement module and attention mechanisms to improve surgical visibility.

Details

Motivation: Surgical smoke from electrocautery/lasers obstructs visual guidance in laparoscopic videos. Existing desmoking methods ignore smoke-type-specific distinctions based on motion patterns, which have different spatio-temporal characteristics.

Method: Proposes STANet with two sub-networks: 1) smoke mask segmentation with joint smoke mask/type prediction using attention-weighted mask aggregation, and 2) smokeless video reconstruction guided by two smoke mask types. Includes coarse-to-fine disentanglement module with smoke-type-aware cross attention to separate entangled smoke regions.

Result: Outperforms state-of-the-art approaches in quality evaluations and shows superior generalization across multiple downstream surgical tasks. Also created the first large-scale synthetic video desmoking dataset with smoke type annotations.

Conclusion: STANet effectively addresses smoke-type-specific desmoking challenges in laparoscopic surgery, improving surgical visualization and demonstrating strong performance and generalization capabilities.

Abstract: Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on the attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specially desmoking on smoky features guided by two types of smoke mask. To address the entanglement challenges of two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through the smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we also construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.

[171] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan

Main category: cs.CV

TL;DR: Video4Spatial shows video diffusion models can perform complex spatial tasks (navigation and object grounding) using only visual data, without depth or pose information, demonstrating visuospatial intelligence.

Details

Motivation: To investigate whether video generative models can exhibit visuospatial intelligence - a key human cognitive capability - using only visual data, without relying on auxiliary modalities like depth or poses.

Method: Video4Spatial framework uses video diffusion models conditioned solely on video-based scene context. It employs simple yet effective design choices in framework architecture and data curation to enable spatial understanding from video-only inputs.

Result: The model successfully performs scene navigation (following camera-pose instructions while maintaining 3D geometry consistency) and object grounding (semantic localization, instruction following, and planning). It generalizes to long contexts and out-of-domain environments.

Conclusion: Video generative models can achieve strong spatial understanding from video context alone, advancing them toward general visuospatial reasoning capabilities.

Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

[172] LumiX: Structured and Coherent Text-to-Intrinsic Generation

Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka

Main category: cs.CV

TL;DR: LumiX is a structured diffusion framework that jointly generates multiple intrinsic scene properties (albedo, irradiance, normal, depth, color) from text prompts with physical consistency.

Details

Motivation: Current text-to-image generation lacks structured, physically consistent scene representations. There's a need for coherent generation of multiple intrinsic properties that describe underlying scene geometry and lighting.

Method: Two key innovations: 1) Query-Broadcast Attention ensures structural consistency by sharing queries across all intrinsic maps in self-attention blocks. 2) Tensor LoRA parameter-efficiently models cross-map relations for efficient joint training.

Result: LumiX achieves 23% higher alignment and better preference score (0.19 vs. -0.41) compared to state-of-the-art methods. It also supports image-conditioned intrinsic decomposition within the same framework.

Conclusion: LumiX enables stable joint diffusion training and unified generation of multiple intrinsic properties, producing coherent and physically meaningful scene representations from text prompts.

Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.

[173] PPTArena: A Benchmark for Agentic PowerPoint Editing

Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang

Main category: cs.CV

TL;DR: PPTArena is a benchmark for PowerPoint editing that tests reliable slide modifications via natural language instructions, and PPTPilot is a structure-aware agent that outperforms existing systems but still struggles with long-horizon tasks.

Details

Motivation: Current PowerPoint editing benchmarks are limited - they focus on image-PDF renderings or text-to-slide generation rather than reliable in-place editing of real slides. There's a need for a comprehensive benchmark that measures how well systems can modify existing PowerPoint decks according to natural language instructions while maintaining visual quality and structural integrity.

Method: 1) Created PPTArena benchmark with 100 decks, 2125 slides, and 800+ targeted edits covering text, charts, tables, animations, and master-level styles. 2) Developed dual VLM-as-judge pipeline that separately scores instruction following and visual quality using structural diffs and slide images. 3) Built PPTPilot agent that uses structure-aware planning, routes between programmatic tools and XML operations, and employs iterative plan-edit-check loops with task-specific constraints.

Result: PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. However, existing agents still underperform on long-horizon, document-scale tasks.

Conclusion: PPTArena provides a comprehensive benchmark for PowerPoint editing, and PPTPilot demonstrates significant improvements over existing systems through structure-aware planning and verification. However, reliable PowerPoint editing remains challenging, especially for long-horizon, document-scale tasks, highlighting the need for further research in this area.

Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.

Tang Haonan, Chen Yanjun, Jiang Lezhi

Main category: cs.CV

TL;DR: TrackNetV5 introduces Motion Direction Decoupling and Residual-Driven Spatio-Temporal Refinement to overcome occlusion and directional ambiguity issues in fast-moving small object tracking, achieving new SOTA performance with minimal computational overhead.

Details

Motivation: Previous TrackNet versions (V1-V3) struggle with occlusions due to reliance on purely visual cues, while TrackNetV4 suffers from directional ambiguity because its absolute difference method discards motion polarity information.

Method: Two novel mechanisms: 1) Motion Direction Decoupling (MDD) module decomposes temporal dynamics into signed polarity fields to explicitly encode movement occurrence and trajectory direction; 2) Residual-Driven Spatio-Temporal Refinement (R-STR) head uses Transformer-based coarse-to-fine paradigm with factorized spatio-temporal contexts to estimate corrective residuals for recovering occluded targets.

Result: Achieves new state-of-the-art F1-score of 0.9859 and accuracy of 0.9733 on TrackNetV2 dataset, significantly outperforming previous versions with only 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities.

Conclusion: TrackNetV5 successfully addresses the limitations of previous versions by integrating directional priors and spatio-temporal refinement, establishing a robust architecture for fast-moving small object tracking with superior precision and efficiency.

Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.

[175] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, Shengyu Zhang

Main category: cs.CV

TL;DR: The paper introduces UnicEdit-10M, a 10M-scale multimodal image editing dataset, and UnicBench, a comprehensive benchmark for evaluating model weaknesses across diverse editing behaviors, along with a lightweight data pipeline and verification model.

Details

Motivation: The performance gap between closed-source and open-source multimodal models is widening due to scarcity of large-scale, high-quality training data and comprehensive benchmarks that can diagnose model weaknesses across diverse editing behaviors.

Method: Introduces a lightweight data pipeline replacing multi-toolchains with an end-to-end model and unified post-verification stage. Trains a 7B dual-task expert model (Qwen-Verify) for efficient failure detection and instruction recaptioning to create UnicEdit-10M dataset. Also proposes UnicBench benchmark with novel metrics including Non-edit Consistency and Reasoning Accuracy.

Result: Creates UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. Develops UnicBench benchmark that extends beyond basic edits to assess spatial and knowledge-driven reasoning. Analysis of mainstream models on UnicBench reveals their limitations and provides directions for future research.

Conclusion: The proposed data pipeline, dataset, and benchmark address the scale-quality trade-off in multimodal model training and evaluation, enabling better diagnosis of model weaknesses and providing clear directions for future research in image editing capabilities.

Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, \textbf{Qwen-Verify}, for efficient failure detection and instruction recaptioning. This pipeline yields \textbf{UnicEdit-10M}, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose \textbf{UnicBench}, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including \textit{Non-edit Consistency} and \textit{Reasoning Accuracy}. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.

[176] IC-World: In-Context Generation for Shared World Modeling

Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li, Deheng Ye, Guosheng Lin

Main category: cs.CV

TL;DR: IC-World: A novel framework for shared world modeling that generates multiple videos from input images using in-context generation with large video models, enhanced by reinforcement learning for geometry and motion consistency.

Details

Motivation: To address the shared world modeling problem where a model needs to generate multiple videos from input images representing the same underlying world from different camera poses, ensuring consistency across generated videos.

Method: Proposes IC-World framework that leverages inherent in-context generation capability of large video models for parallel generation, then fine-tunes via reinforcement learning (Group Relative Policy Optimization) with novel reward models for scene-level geometry consistency and object-level motion consistency.

Result: IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency, demonstrating superior performance in shared world modeling tasks.

Conclusion: This is the first systematic exploration of shared world modeling with video-based world models, presenting an effective framework for generating consistent multi-view video sequences from input images.

Abstract: Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.

[177] PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

Fan Wu, Cheng Chen, Zhoujie Fu, Jiacheng Wei, Yi Xu, Deheng Ye, Guosheng Lin

Main category: cs.CV

TL;DR: PhyCustom is a fine-tuning framework with two regularization losses that enables diffusion models to accurately customize physical concepts, addressing the limitation of current methods that fail to reflect physical properties in generated images.

Details

Motivation: Current diffusion-based text-to-image customization methods excel at concrete concepts like styles and shapes but struggle with physical concepts. The core limitation is the absence of explicit physical knowledge during training, causing models to fail at accurately reflecting physical properties even when physics-related words appear in prompts.

Method: PhyCustom introduces a fine-tuning framework with two novel regularization losses: 1) isometric loss to activate diffusion models to learn physical concepts, and 2) decouple loss to eliminate mixture learning of independent concepts.

Result: Experiments on a diverse dataset show PhyCustom outperforms previous state-of-the-art and popular methods in physical customization both quantitatively and qualitatively.

Conclusion: The proposed PhyCustom framework successfully addresses the challenge of physical concept customization in diffusion models through specialized regularization losses, enabling more accurate reflection of physical properties in generated images.

Abstract: Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts to control generation processes, such as styles and shapes. However, few efforts dive into the realistic yet challenging customization of physical concepts. The core limitation of current methods arises from the absence of explicitly introducing physical knowledge during training. Even when physics-related words appear in the input text prompts, our experiments consistently demonstrate that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses to activate diffusion model to perform physical customization. Specifically, the proposed isometric loss aims at activating diffusion models to learn physical concepts while decouple loss helps to eliminate the mixture learning of independent concepts. Experiments are conducted on a diverse dataset and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods in terms of physical customization quantitatively and qualitatively.

[178] CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Kai Cao, Le Lu, Jiebo Luo, Ling Zhang

Main category: cs.CV

TL;DR: CT-GLIP introduces grounded language-image pretraining for 3D CT scans, addressing the loss of critical sparse semantics in existing global alignment methods by creating fine-grained organ-level CT-report pairs for improved cross-modal contrastive learning.

Details

Motivation: Existing 3D medical VL pretraining methods rely on global alignment adapted from 2D scenarios, which transforms entire 3D images into single global embeddings, losing sparse but critical semantics needed for accurate diagnosis alignment.

Method: Proposes CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance grounded cross-modal contrastive learning, aligning grounded visual features with precise textual descriptions using 44,011 organ-level CT-report pairs from 17,702 patients covering 104 organs.

Result: Outperforms counterparts with global VL alignment, achieving average improvements of 15.1% F1 score for zero-shot AD, 1.9% AUC for TD, and 3.2% DSC for TS compared to vanilla CLIP. Enables zero-shot organ and abnormality identification using natural language.

Conclusion: Highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans, demonstrating superior performance across diverse downstream tasks.

Abstract: 3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance \textit{grounded} cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that it outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% of F1 score, 1.9% of AUC, and 3.2% of DSC for zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.

[179] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

Main category: cs.CV

TL;DR: AAG method enables action anticipation from single frames using multimodal cues (RGB, depth, prior context) instead of temporal video aggregation, achieving competitive performance on instructional datasets.

Details

Motivation: Humans can anticipate actions from single moments with sufficient context, but conventional methods rely on temporal video aggregation. The paper investigates whether models can achieve similar competence using alternative modalities instead of temporal information.

Method: AAG (Action Anticipation at a Glimpse) combines RGB features with depth cues from a single frame for spatial reasoning, and incorporates prior action context obtained either from textual summaries generated by Vision-Language Models or from predictions of a single-frame action recognizer.

Result: AAG performs competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

Conclusion: Multimodal single-frame action anticipation can effectively replace temporal video aggregation for action prediction tasks, demonstrating that models can achieve human-like competence in anticipating actions from glimpses when provided with sufficient contextual information.

Abstract: Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

[180] Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study

Vishal Dubey, Pallavi Tyagi

Main category: cs.CV

TL;DR: First systematic study of identity-preserving AIGC detection for Indian/South-Asian faces, showing fine-tuned detectors overfit to training-generator cues and fail to generalize to held-out IP-AIGC.

Details

Motivation: Modern image editors can create identity-preserving AIGC (IP-AIGC) where the same person appears with new attributes, but detector robustness and fairness for under-represented populations like Indian/South-Asian faces remains unclear.

Method: Assembled Indian-focused training splits from FairFD and HAV-DF datasets, created two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial generators (Gemini, ChatGPT) with identity-preserving prompts. Evaluated two SOTA detectors (AIDE, Effort) under pretrained and fine-tuned regimes using AUC, AP, EER, and accuracy metrics.

Result: Fine-tuning yields strong in-domain gains but consistently degrades performance on held-out IP-AIGC for Indian cohorts (e.g., AIDE AUC drops from 0.923 to 0.563), indicating overfitting to training-generator cues. On non-IP images, pretrained performance remains high, suggesting specific brittleness to identity-preserving edits rather than generic distribution shift.

Conclusion: IP-AIGC-Indian is a challenging and practically relevant scenario that motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.

Abstract: Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.

[181] RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association

Abdul Hannan, Furqan Malik, Hina Jabbar, Syed Suleman Sadiq, Mubashir Noman

Main category: cs.CV

TL;DR: The paper presents a face-voice association method for multilingual environments, achieving 3rd place in FAME 2026 challenge with 33.1% EER on English-German data.

Details

Motivation: To address the challenge of face-voice association in multilingual scenarios, specifically focusing on English-German pairs in the FAME 2026 competition.

Method: Revisits fusion and orthogonal projection techniques to effectively focus on relevant semantic information across face and voice modalities.

Result: Achieved 33.1% Equal Error Rate (EER) on English-German data split, ranking 3rd in the FAME 2026 challenge.

Conclusion: The proposed fusion and orthogonal projection approach performs well for multilingual face-voice association, demonstrating effectiveness in cross-modal semantic alignment.

Abstract: Face-voice association in multilingual environment challenge 2026 aims to investigate the face-voice association task in multilingual scenario. The challenge introduces English-German face-voice pairs to be utilized in the evaluation phase. To this end, we revisit the fusion and orthogonal projection for face-voice association by effectively focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge by achieving the EER of 33.1.

[182] Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum

Main category: cs.CV

TL;DR: VideoMarathon introduces a large-scale hour-long video instruction dataset and Hour-LLaVA model for long video understanding, achieving state-of-the-art performance on long video benchmarks.

Details

Motivation: Existing video-language models struggle with hour-long videos due to scarcity of well-annotated long video datasets, limiting progress in long-form video understanding.

Method: Created VideoMarathon dataset with 9,700 hours of diverse long videos (3-60 minutes each) and 3.3M QA pairs across six topics. Developed Hour-LLaVA with memory augmentation module for efficient hour-long video processing at 1-FPS sampling.

Result: Hour-LLaVA achieves best performance on multiple long video-language benchmarks, demonstrating dataset quality and model superiority for hour-scale video understanding.

Conclusion: VideoMarathon dataset enables effective training of hour-long Video-LMMs, and Hour-LLaVA’s memory augmentation provides efficient long video processing, advancing long-form video understanding capabilities.

Abstract: Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

[183] MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration

Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang, Jialuo Chen, Jiaxue Ni, Qian Luo, Jin Liu, Can Han, Changkai Ji, Zhi Qin Tan, Ajo Babu George, Liangyu Chen, Qianni Zhang, Dahong Qian, Shuai Wang, Huiyu Zhou

Main category: cs.CV

TL;DR: STSR 2025 Challenge benchmarked semi-supervised learning for CBCT and IOS in dentistry, with tasks for segmentation and registration using limited labeled data.

Details

Motivation: Annotated data scarcity in digital dentistry limits automated solutions for pulp canal segmentation and cross-modal registration between CBCT and IOS.

Method: Organized challenge with two tasks: semi-supervised segmentation of teeth/pulp canals in CBCT, and semi-supervised rigid registration of CBCT and IOS. Provided 60 labeled + 640 unlabeled IOS samples, and 30 labeled + 250 unlabeled CBCT scans.

Result: Top teams achieved Dice score of 0.967 and Instance Affinity of 0.738 for segmentation using nnU-Net and Mamba-like models with pseudo-labeling. For registration, PointNetLK with differentiable SVD and geometric augmentation succeeded, with hybrid neural-classical refinement enabling accurate alignment.

Conclusion: The challenge demonstrated effective SSL approaches for dental imaging with limited labels, with all data and code made publicly available for reproducibility and community advancement.

Abstract: Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.

[184] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu, Jia-Wang Bian, Changhu Wang

Main category: cs.CV

TL;DR: Online RL post-training framework for video diffusion models improves camera control accuracy through geometry-based rewards and segment-level feedback.

Details

Motivation: Current video diffusion models rely mainly on supervised fine-tuning for camera control, leaving online reinforcement learning post-training underexplored despite its potential for better optimization.

Method: Proposes online RL post-training with verifiable geometry reward: estimates 3D camera trajectories, divides into segments, computes segment-wise relative poses, and uses alignment scores as dense reward signals. Also constructs dataset with diverse camera motions and subject dynamics.

Result: Outperforms SFT baselines in camera-control accuracy, geometric consistency, and visual quality, demonstrating superiority in camera-controlled video generation.

Conclusion: Online RL post-training with geometry-based rewards effectively improves camera control in video generation, addressing reward sparsity and optimization efficiency issues.

Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.

[185] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Zide Liu, Xuhao Pan, Chang Ren, Xudong Rao, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Yufei Zheng, Chunpeng Zhou, Pan Zhou, Xuhan Zhu

Main category: cs.CV

TL;DR: MindGPT-4ov is a multimodal LLM with a novel post-training paradigm that achieves SOTA performance at low cost through automated data generation, curriculum fine-tuning, and hybrid reinforcement learning.

Details

Motivation: To develop a cost-effective multimodal large language model that can achieve state-of-the-art performance while maintaining strong generalization abilities and enabling seamless transition from research to industrial deployment.

Method: Three key innovations: (1) Information density-based data generation with dual-dimensional tree-structured labels for automated cross-domain data production; (2) Collaborative curriculum supervised fine-tuning balancing domain knowledge injection with general capability preservation; (3) Hybrid reinforcement learning enhancing reasoning while optimizing for diversity, multimodal perception, and response conciseness. Plus infrastructure optimizations like 5D parallel training and inference quantization.

Result: Outperforms state-of-the-art models on benchmarks including MMBench, MMStar, MathVision, and MathVista. Demonstrates superior user experience in vertical domain tasks and enables seamless academic-to-industrial transition.

Conclusion: MindGPT-4ov provides a general post-training paradigm applicable to various MLLMs, achieving SOTA performance at low cost while maintaining strong generalization. The model weights, datasets, and code will be open-sourced to support community development.

Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be recently open-sourced to support the community’s development of MLLMs.

[186] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models

Pierpaolo Serio, Giulio Pisaneschi, Andrea Dan Ryals, Vincenzo Infantino, Lorenzo Gentilini, Valentina Donzella, Lorenzo Pollini

Main category: cs.CV

TL;DR: Systematic investigation of how LiDAR-to-image projections affect place recognition when combined with vision foundation models, showing that carefully designed projections can effectively substitute end-to-end 3D learning.

Details

Motivation: To understand how different LiDAR-to-image projection methods impact metric place recognition performance when used with modern vision foundation models, and to determine if well-designed projections can replace computationally expensive end-to-end 3D learning approaches.

Method: Developed a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol to isolate the influence of 2D projections. Used consistent geometric and structural channels across multiple datasets and deployment scenarios to analyze projection characteristics.

Result: Identified projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments validated practical relevance, showing carefully designed projections can serve as effective surrogates for end-to-end 3D learning.

Conclusion: Carefully designed LiDAR-to-image projections, when coupled with state-of-the-art vision foundation models, can effectively replace end-to-end 3D learning approaches for LiDAR place recognition, offering practical advantages for real-time autonomous systems.

Abstract: This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.

[187] Glance: Accelerating Diffusion Models with 1 Sample

Zhuobai Dong, Rui Zhao, Songjie Wu, Junchao Yi, Linjie Li, Zhengyuan Yang, Lijuan Wang, Alex Jinpeng Wang

Main category: cs.CV

TL;DR: Phase-aware diffusion acceleration using two LoRA adapters (Slow-LoRA for early semantic stages, Fast-LoRA for later redundant phases) achieves 5× speedup with minimal training cost while maintaining visual quality.

Details

Motivation: Diffusion models have computational bottlenecks due to many inference steps. Previous distillation methods require heavy retraining and suffer from degraded generalization. The authors propose a smarter acceleration approach that adapts to different denoising phases rather than uniform speedup.

Method: Instead of training separate student models, the method equips the base diffusion model with two lightweight LoRA adapters: Slow-LoRA for early semantic stages (smaller speedups) and Fast-LoRA for later redundant phases (larger speedups). This phase-aware strategy requires minimal training - only 1 sample on a single V100 GPU within one hour.

Result: Achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. The LoRA experts generalize strongly to unseen prompts despite being trained with minimal data.

Conclusion: Phase-aware acceleration with specialized LoRA adapters provides an efficient solution to diffusion model deployment challenges, offering significant speedup with minimal training cost and strong generalization capabilities.

Abstract: Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5 acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 samples on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.

[188] DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation

Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Chenyang Si

Main category: cs.CV

TL;DR: DiverseAR improves sample diversity in bitwise autoregressive image generation by addressing binary classification limitations and sharp logits distribution through adaptive scaling and energy-based path search.

Details

Motivation: The paper addresses the underexplored challenge of sample diversity in autoregressive generative models with bitwise visual tokenizers. Current bitwise AR models suffer from limited diversity due to binary classification constraints and overly sharp logits distributions that cause sampling collapse.

Method: Proposes DiverseAR with two key components: (1) adaptive logits distribution scaling mechanism that dynamically adjusts binary output distribution sharpness during sampling for smoother predictions, and (2) energy-based generation path search algorithm that avoids sampling low-confidence tokens to preserve visual quality.

Result: Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation without sacrificing visual quality.

Conclusion: DiverseAR provides a principled and effective solution to enhance diversity in bitwise AR models by addressing fundamental limitations in binary prediction space and logits distribution, achieving better diversity while maintaining high visual fidelity.

Abstract: In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.

[189] LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

Zhihan Xiao, Lin Liu, Yixin Gao, Xiaopeng Zhang, Haoxuan Che, Songping Mai, Qi Tian

Main category: cs.CV

TL;DR: LoVoRA is a mask-free video editing framework for object removal and addition using object-aware localization, achieving temporally consistent edits without external masks or reference images.

Details

Motivation: Existing text-guided video editing methods require auxiliary masks or reference images, limiting scalability and generalization. The paper aims to develop a mask-free approach for more flexible video object editing.

Method: Proposes LoVoRA framework with: 1) Dataset construction pipeline using image-to-video translation, optical flow-based mask propagation, and video inpainting; 2) Learnable object-aware localization mechanism for spatio-temporal supervision; 3) Diffusion Mask Predictor for end-to-end editing without external control signals.

Result: Extensive experiments and human evaluation demonstrate LoVoRA’s effectiveness and high-quality performance in mask-free video object removal and addition with temporal consistency.

Conclusion: LoVoRA provides a novel mask-free solution for video object editing that eliminates the need for external guidance, offering better scalability and generalization while maintaining spatio-temporal consistency.

Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.

[190] Layout Anything: One Transformer for Universal Room Layout Estimation

Md Sohag Mia, Muhammad Abdullah Adnan

Main category: cs.CV

TL;DR: Layout Anything is a transformer-based indoor layout estimation framework that adapts OneFormer’s segmentation architecture with geometric constraints, achieving state-of-the-art performance with fast inference.

Details

Motivation: The paper aims to create an efficient indoor layout estimation system that eliminates complex post-processing while maintaining geometric consistency, making it suitable for AR and 3D reconstruction applications.

Method: Adapts OneFormer’s universal segmentation architecture with task-conditioned queries and contrastive learning, adding two key modules: layout degeneration strategy for data augmentation preserving Manhattan-world constraints, and differentiable geometric losses for planar consistency and sharp boundaries.

Result: Achieves state-of-the-art performance with pixel error of 5.43% and corner error of 4.02% on LSUN, 7.04% PE (5.17% CE) on Hedau, and 4.03% PE (3.15% CE) on Matterport3D-Layout, with fast 114ms inference time.

Conclusion: The framework successfully combines geometric awareness with computational efficiency, eliminating complex post-processing while achieving strong performance, making it suitable for practical applications like AR and large-scale 3D scene reconstruction.

Abstract: We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer’s universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer’s task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on the LSUN, PE of 7.04% (CE 5.17%) on the Hedau and PE of 4.03% (CE 3.15%) on the Matterport3D-Layout datasets. The framework’s combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.

[191] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems

Yuhan Chen, Yicui Shi, Guofa Li, Guangrui Bai, Jinyuan Shao, Xiangfei Huang, Wenbo Chu, Keqiang Li

Main category: cs.CV

TL;DR: UltraFast-LieNET is a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement in vehicular applications, achieving state-of-the-art performance with only 180 parameters.

Details

Motivation: Low-light environments like nighttime driving cause severe image degradation that challenges in-vehicle camera safety, but existing enhancement algorithms are too computationally intensive for real-time vehicular applications.

Method: Proposes UltraFast-LieNET with Dynamic Shifted Convolution (DSConv) kernels (only 12 parameters), Multi-scale Shifted Residual Blocks (MSRB) for expanded receptive field, residual structures, and a multi-level gradient-aware loss function for stability.

Result: Achieves PSNR of 26.51 dB on LOLI-Street dataset, outperforming state-of-the-art methods by 4.6 dB while using only 180 parameters. Validated across four benchmark datasets with superior real-time performance and enhancement quality.

Conclusion: UltraFast-LieNET provides an effective solution for real-time low-light image enhancement in resource-constrained vehicular applications, balancing performance and computational efficiency with minimal parameters.

Abstract: In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://githubhttps://github.com/YuhanChen2024/UltraFast-LiNET

Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang

Main category: cs.CV

TL;DR: BEVDilation: A LiDAR-centric fusion framework that uses image BEV features as implicit guidance rather than naive concatenation to address spatial misalignment from depth errors and point cloud sparsity.

Details

Motivation: Previous LiDAR-camera fusion methods suffer from degraded performance due to indiscriminate fusion of geometrically disparate sensors. The fundamental accuracy gap between LiDAR (precise geometry) and cameras (depth estimation errors) causes spatial misalignment in BEV fusion.

Method: Proposes BEVDilation with two key blocks: 1) Sparse Voxel Dilation Block - densifies foreground voxels using image priors to mitigate point cloud sparsity; 2) Semantic-Guided BEV Dilation Block - enhances LiDAR feature diffusion with image semantic guidance and long-range context capture. Formulates image BEV features as implicit guidance rather than naive concatenation.

Result: Achieves state-of-the-art performance on nuScenes benchmark while maintaining competitive computational efficiency. Demonstrates greater robustness to depth noise compared to naive fusion approaches.

Conclusion: BEVDilation’s LiDAR-centric strategy with image guidance effectively addresses spatial misalignment from depth errors and point cloud limitations, providing a more robust fusion approach for 3D object detection.

Abstract: Integrating LiDAR and camera information in the bird’s eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.

Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang

Main category: cs.CV

TL;DR: InEx is a training-free multi-agent framework that autonomously mitigates hallucination in multimodal LLMs through introspective reasoning and external verification, achieving 4-27% gains on benchmarks.

Details

Motivation: Hallucination remains a critical challenge in LLMs, especially for reliable multimodal LLMs. Existing solutions often require human intervention or don't fully leverage agents' autonomous hallucination mitigation capabilities. The authors were inspired by how humans make reliable decisions through introspective reasoning followed by external verification from diverse perspectives.

Method: InEx is a training-free multi-agent framework with internal introspective reasoning guided by entropy-based uncertainty estimation. The agent first generates a response, then iteratively verifies and refines it through external cross-modal multi-agent collaboration involving editing agents and self-reflection agents.

Result: Extensive experiments show InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.

Conclusion: The InEx framework effectively mitigates hallucination in multimodal LLMs by combining introspective reasoning with external multi-agent verification, providing a training-free solution that significantly improves reliability and performance across various benchmarks.

Abstract: Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent’s ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent’s reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.

[194] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

Main category: cs.CV

TL;DR: U4D: An uncertainty-aware framework for 4D LiDAR world modeling that uses spatial uncertainty maps to guide generation in a “hard-to-easy” manner, ensuring geometric fidelity in challenging regions and temporal coherence.

Details

Motivation: Existing generative frameworks for 4D LiDAR world modeling treat all spatial regions uniformly, overlooking varying uncertainty across real-world scenes. This leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability for autonomous driving and embodied AI applications.

Method: U4D first estimates spatial uncertainty maps from a pretrained segmentation model to identify semantically challenging regions. It then performs generation in two sequential stages: (1) uncertainty-region modeling that reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion that synthesizes remaining areas under learned structural priors. The framework incorporates a mixture of spatio-temporal (MoST) block to adaptively fuse spatial and temporal representations during diffusion for temporal coherence.

Result: Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.

Conclusion: U4D presents an effective uncertainty-aware framework for 4D LiDAR world modeling that addresses the limitations of uniform generation approaches by focusing on challenging regions first and ensuring temporal coherence, making it valuable for building reliable 4D worlds for autonomous driving and embodied AI.

Abstract: Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a “hard-to-easy” manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.

Md Sohag Mia, Md Nahid Hasan, Tawhid Ahmed, Muhammad Abdullah Adnan

Main category: cs.CV

TL;DR: GraphFusion3D improves 3D object detection by combining multi-modal fusion with graph-based reasoning to address sparse point cloud challenges.

Details

Motivation: Point clouds present challenges for 3D object detection due to sparse data, incomplete structures, limited semantic information, and difficulty capturing contextual relationships between distant objects.

Method: Proposes GraphFusion3D framework with Adaptive Cross-Modal Transformer (ACMT) for integrating image features into point representations, and Graph Reasoning Module (GRM) for proposal refinement using multi-scale graph attention to capture both local geometric structures and global semantic context. Uses cascade decoder for progressive refinement.

Result: Achieves 70.6% AP25 and 51.2% AP50 on SUN RGB-D, and 75.1% AP25 and 60.8% AP50 on ScanNetV2, demonstrating substantial performance improvement over existing approaches.

Conclusion: GraphFusion3D effectively addresses point cloud challenges through unified multi-modal fusion and graph-based reasoning, achieving state-of-the-art performance on major 3D detection benchmarks.

Abstract: Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP${25}$ and 51.2% AP${50}$) and ScanNetV2 (75.1% AP${25}$ and 60.8% AP${50}$) demonstrate a substantial performance improvement over existing approaches.

[196] TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond

Yifei Zeng, Yajie Bao, Jiachen Qian, Shuang Wu, Youtian Lin, Hao Zhu, Buyu Li, Feihu Zhang, Xun Cao, Yao Yao

Main category: cs.CV

TL;DR: TEXTRIX is a native 3D attribute generation framework that directly colors 3D models in volumetric space using a Diffusion Transformer with sparse attention, avoiding multi-view fusion limitations and enabling both high-fidelity texture synthesis and precise 3D part segmentation.

Details

Motivation: Existing 3D texture generation methods rely on multi-view fusion, which suffers from inter-view inconsistencies and incomplete coverage of complex surfaces, limiting fidelity and completeness of generated content.

Method: Constructs a latent 3D attribute grid and uses a Diffusion Transformer with sparse attention to directly color 3D models in volumetric space, avoiding multi-view fusion. The same architecture can be trained to predict semantic attributes for 3D segmentation.

Result: State-of-the-art performance on both texture synthesis and 3D part segmentation tasks, producing seamless high-fidelity textures and accurate segmentation with precise boundaries.

Conclusion: TEXTRIX provides a unified native 3D framework that overcomes limitations of multi-view fusion approaches, enabling high-fidelity texture generation and precise 3D segmentation through direct volumetric processing.

Abstract: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.

[197] DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

Main category: cs.CV

TL;DR: DynamicVerse: A physical-scale multimodal 4D world modeling framework that converts real-world videos into comprehensive 4D representations with metric-scale geometry, dynamic motion, instance masks, and descriptive captions.

Details

Motivation: Existing datasets for understanding dynamic physical worlds are limited by simulator constraints or traditional Structure-from-Motion methods, offering only up-to-scale annotations and limited descriptive captioning. This restricts foundation models' ability to accurately interpret real-world dynamics from monocular internet videos.

Method: Uses large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. Integrates window-based Bundle Adjustment with global optimization to convert long real-world video sequences into comprehensive 4D multimodal format.

Result: Created a large-scale dataset with 100K+ videos, 800K+ annotated masks, and 10M+ frames from internet videos. Achieved superior performance on video depth estimation, camera pose estimation, and camera intrinsics estimation benchmarks, capturing physical-scale measurements with greater global accuracy than existing methods.

Conclusion: DynamicVerse bridges the gap between limited existing datasets and the need for comprehensive 4D world modeling, enabling foundation models to better interpret real-world dynamics from monocular videos and advancing human-agent interaction capabilities.

Abstract: Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structurefrom-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consists of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

[198] DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Hongyang Li, Ya-Qin Zhang, Hao Zhao

Main category: cs.CV

TL;DR: DGGT is a pose-free dynamic scene reconstruction framework that jointly predicts 3D Gaussian maps and camera parameters from sparse, unposed images, enabling fast 4D reconstruction for autonomous driving.

Details

Motivation: Current methods for dynamic driving scene reconstruction rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical for scalable autonomous driving training and evaluation.

Method: Reformulates camera pose as model output rather than input; jointly predicts per-frame 3D Gaussian maps and camera parameters; uses lightweight dynamic head for dynamics disentanglement; lifespan head for temporal consistency; diffusion-based rendering refinement for artifact reduction.

Result: Achieves state-of-the-art performance and speed on Waymo, nuScenes, and Argoverse2 benchmarks; outperforms prior work in both dataset-specific training and zero-shot cross-dataset transfer; scales well with increasing input frames.

Conclusion: DGGT provides a single-pass, pose-free solution for fast, scalable 4D reconstruction of dynamic driving scenes, enabling practical applications for autonomous driving training and evaluation.

Abstract: Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce \textbf{Driving Gaussian Grounded Transformer (DGGT)}, a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.

[199] SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

Main category: cs.CV

TL;DR: SurfFill: Gaussian surfel-based method that combines LiDAR and photogrammetry to complete point clouds by identifying ambiguous areas near missed structures and growing points using constrained Gaussian surfel optimization.

Details

Motivation: LiDAR provides accurate 3D reconstruction in flat regions but misses small geometric structures and fails with dark materials. Photogrammetry captures fine details but lacks LiDAR's accuracy in featureless regions. Combining both approaches can leverage their complementary strengths.

Method: 1. Analyze LiDAR artifacts from beam divergence at thin structures/edges. 2. Use density changes to identify ambiguous areas near missed regions. 3. Constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization/densification on ambiguous areas. 4. Extract Gaussian primitives and sample points to complete point clouds. 5. Extend with divide-and-conquer scheme for building-scale completion.

Result: Outperforms previous reconstruction methods on LiDAR point cloud completion tasks for both synthetic and real-world scenes.

Conclusion: SurfFill successfully combines LiDAR and camera-based capture strengths, using Gaussian surfels to complete point clouds by targeting ambiguous areas identified through density analysis, with scalable performance for large scenes.

Abstract: LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

[200] Aligning Diffusion Models with Noise-Conditioned Perception

Alexander Gambashidze, Anton Kulikov, Yuriy Sosnin, Ilya Makarov

Main category: cs.CV

TL;DR: Using perceptual objectives in U-Net embedding space for diffusion model preference optimization outperforms standard latent-space methods in quality and efficiency.

Details

Motivation: Diffusion models optimize in pixel/VAE space which doesn't align well with human perception, leading to slower and less efficient training during preference alignment compared to language models.

Method: Fine-tuning Stable Diffusion 1.5 and XL using DPO, CPO, and SFT in the U-Net embedding space rather than standard latent space, leveraging perceptual objectives.

Result: Significantly outperforms standard latent-space implementations across various metrics. For SDXL: 60.8% general preference, 62.2% visual appeal, and 52.1% prompt following against original SDXL-DPO on PartiPrompts, while reducing compute.

Conclusion: The approach improves efficiency and quality of human preference alignment for diffusion models and is easily integrable with other optimization techniques.

Abstract: Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For SDXL, our approach provides 60.8% general preference, 62.2% visual appeal, and 52.1% prompt following against original open-sourced SDXL-DPO on the PartiPrompts dataset, while significantly reducing compute. Our approach not only improves the efficiency and quality of human preference alignment for diffusion models but is also easily integrable with other optimization techniques. The training code and LoRA weights will be available here: https://huggingface.co/alexgambashidze/SDXL_NCP-DPO_v0.1

[201] Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks

Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta

Main category: cs.CV

TL;DR: A method to adapt frame-based neural networks for stable video inference by adding stability adapters and training with an accuracy-stability-robustness loss, improving temporal consistency and corruption robustness.

Details

Motivation: Frame-based networks applied to video often produce temporally inconsistent outputs (flickering), especially when inputs contain time-varying corruptions. There's a need for general adaptation methods that can stabilize video inference while maintaining quality.

Method: Introduces stability adapters that can be inserted into any architecture, with resource-efficient training using frozen base networks. Proposes a unified accuracy-stability-robustness loss framework for temporal stability and corruption robustness, with theoretical analysis of training conditions.

Result: Method improves temporal stability and robustness against various image corruptions (compression artifacts, noise, adverse weather) while preserving or improving prediction quality. Validated on multiple vision tasks: denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+).

Conclusion: The proposed approach provides a general, effective solution for adapting frame-based models to achieve stable and robust video inference, addressing both temporal inconsistency and corruption sensitivity through a unified theoretical framework.

Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.

[202] AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry

Xiang Xu, Pradeep Kumar Jayaraman, Joseph G. Lambourne, Yilin Liu, Durvesh Malpure, Pete Meltzer

Main category: cs.CV

TL;DR: AutoBrep: A Transformer model that autoregressively generates high-quality, watertight B-Rep CAD models using unified tokenization of geometry and topology.

Details

Motivation: Direct end-to-end generation of B-Rep CAD models with precise geometry and watertight topology remains challenging despite recent progress. Current methods struggle with maintaining topological validity while generating complex solid models.

Method: Uses a Transformer model with unified tokenization scheme encoding both geometric primitives (surfaces/curves) as latent geometry tokens and structural relationships as topological reference tokens. Sequence follows B-Rep face adjacency graph breadth-first traversal, progressively generating neighboring faces and edges with topological structure.

Result: Outperforms baselines with better quality and watertightness, highly scalable to complex solids with good fidelity and inference speed. Natively supports B-Rep autocompletion through unified tokenization, enabling user-controllable CAD generation with minimal changes.

Conclusion: AutoBrep demonstrates effective B-Rep generation through unified tokenization and autoregressive modeling, achieving high-quality, valid CAD models while supporting practical applications like autocompletion and user-controlled generation.

Abstract: The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at https://github.com/AutodeskAILab/AutoBrep.

[203] Mutually-Aware Feature Learning for Few-Shot Object Counting

Yerim Jeon, Subeen Lee, Jihwan Kim, Jae-Pil Heo

Main category: cs.CV

TL;DR: MAFEA introduces a mutually-aware feature learning framework for few-shot object counting that enables query-exemplar interaction from the start, reducing target confusion in multi-category scenarios.

Details

Motivation: Current extract-and-match approaches in few-shot object counting have a fundamental limitation: query and exemplar features are extracted independently without interaction, leading to insufficient target awareness and confusion when multiple object classes coexist in the same image.

Method: Proposes Mutually-Aware FEAture learning (MAFEA) that encodes query and exemplar features with mutual awareness from the outset, encouraging interaction throughout the pipeline. Also introduces a background token to effectively associate query’s target region with exemplars while decoupling background regions.

Result: Achieves state-of-the-art performance on FSCD-LVIS and FSC-147 benchmarks with remarkably reduced target confusion compared to previous methods.

Conclusion: Mutually-aware feature learning from the beginning of the pipeline is crucial for few-shot object counting, enabling better target awareness and reducing confusion in multi-category scenarios, leading to superior performance on standard benchmarks.

Abstract: Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without additional training. However, the prevailing extract-and-match approach has a shortcoming: query and exemplar features lack interaction during feature extraction since they are extracted independently and later correlated based on similarity. This can lead to insufficient target awareness and confusion in identifying the actual target when multiple class objects coexist. To address this, we propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features with mutual awareness from the outset. By encouraging interaction throughout the pipeline, we obtain target-aware features robust to a multi-category scenario. Furthermore, we introduce background token to effectively associate the query’s target region with exemplars and decouple its background region. Our extensive experiments demonstrate that our model achieves state-of-the-art performance on FSCD-LVIS and FSC-147 benchmarks with remarkably reduced target confusion.

[204] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Kehan Qi, Saumya Gupta, Qingqiao Hu, Weimin Lyu, Chao Chen

Main category: cs.CV

TL;DR: FLAT connects unrolled networks to flow ODEs for MRI reconstruction, improving stability and convergence with fewer iterations than diffusion models.

Details

Motivation: MRI acquisition is slow, limiting clinical utility. Unrolled networks accelerate reconstruction but suffer from unstable evolution due to freely-learnable parameters. Diffusion models offer stability but are computationally expensive.

Method: Theoretical proof that unrolled networks are discrete implementations of conditional probability flow ODEs. Proposed Flow-Aligned Training (FLAT) derives unrolled parameters from ODE discretization and aligns intermediate reconstructions with ideal ODE trajectory.

Result: FLAT achieves high-quality reconstructions with up to 3× fewer iterations than diffusion-based models and significantly greater stability than unrolled networks on three MRI datasets.

Conclusion: Connecting unrolled networks to flow ODEs provides theoretical foundation for stable MRI reconstruction, offering both computational efficiency and improved convergence.

Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.

[205] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu

Main category: cs.CV

TL;DR: MAViD is a multimodal framework for audio-visual dialogue that uses a Conductor-Creator architecture to generate synchronized long-duration interactive content with consistent identity and natural speech.

Details

Motivation: Existing approaches focus on non-interactive systems and produce constrained, unnatural human speech. The main challenges are integrating understanding and generation capabilities, achieving seamless audio-video fusion, and generating long videos with consistent identity, timbre, and tone.

Method: Proposes a Conductor-Creator architecture: Conductor handles understanding, reasoning, and generating motion/speech instructions; Creator produces interactive responses. Uses combined autoregressive (audio) and diffusion (video) models for consistent long-form generation, plus a novel fusion module for synchronized multimodal content.

Result: Extensive experiments show the framework can generate vivid, contextually coherent long-duration dialogue interactions and accurately interpret users’ multimodal queries.

Conclusion: MAViD successfully addresses the challenges of multimodal audio-visual dialogue by integrating understanding and generation through its innovative architecture, enabling natural, synchronized long-form interactions.

Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech.The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components.The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions.Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation.Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation.Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users’ multimodal queries.

[206] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia

Main category: cs.CV

TL;DR: MultiShotMaster is a framework for controllable multi-shot video generation that extends pretrained single-shot models with novel RoPE variants for flexible shot arrangement and spatiotemporal grounding.

Details

Motivation: Current video generation techniques struggle with narrative multi-shot videos that require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. Single-shot models can't handle the complexities of multi-shot storytelling.

Method: Extends pretrained single-shot model with two novel RoPE variants: 1) Multi-Shot Narrative RoPE applies explicit phase shift at shot transitions for flexible arrangement while preserving narrative order; 2) Spatiotemporal Position-Aware RoPE incorporates reference tokens and grounding signals for spatiotemporal-grounded reference injection. Also creates automated data annotation pipeline to extract multi-shot videos with captions, cross-shot grounding signals, and reference images.

Result: Framework supports multi-shot video generation with text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scenes. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate superior performance and outstanding controllability.

Conclusion: MultiShotMaster successfully addresses the challenges of multi-shot video generation by extending single-shot models with novel architectural modifications and automated data collection, enabling highly controllable narrative video creation with flexible shot arrangement and spatiotemporal grounding.

Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.

[207] OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue

Main category: cs.CV

TL;DR: OneThinker is a unified multimodal reasoning model that combines image and video understanding across 10 fundamental visual tasks using a single model, addressing the scalability limitations of task-specific approaches.

Details

Motivation: Existing RL approaches for visual reasoning in MLLMs train separate models for different tasks and treat image/video reasoning as disjoint domains, limiting scalability, practical versatility, and knowledge sharing across tasks and modalities.

Method: Constructed OneThinker-600k training corpus covering diverse visual tasks, used commercial models for CoT annotation to create OneThinker-SFT-340k for SFT cold start, and proposed EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations.

Result: OneThinker delivers strong performance on 31 benchmarks across 10 fundamental visual understanding tasks, exhibits effective knowledge transfer between certain tasks, and shows preliminary zero-shot generalization ability.

Conclusion: OneThinker marks a step toward a unified multimodal reasoning generalist by unifying image and video understanding across diverse tasks in a single model, with all code, model, and data released.

Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.

[208] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Minseop Kwak, Jin-Hwa Kim, Seungryong Kim

Main category: cs.CV

TL;DR: CAMEO is a training technique that supervises attention maps in multi-view diffusion models using geometric correspondence to improve view consistency and training efficiency.

Details

Motivation: Multi-view diffusion models show view consistency but the underlying mechanism is unclear. The authors found that attention maps acquire geometric correspondence during training, but this signal is incomplete and degrades with large viewpoint changes.

Method: CAMEO directly supervises attention maps using geometric correspondence during training. The key insight is that supervising just a single attention layer is sufficient to guide the model toward learning precise correspondences.

Result: CAMEO reduces training iterations by half while achieving superior performance at same iteration counts. It preserves geometry/structure of reference images, accelerates convergence, and improves novel view synthesis performance. The method is model-agnostic.

Conclusion: Supervising attention maps with geometric correspondence is an effective way to enhance multi-view diffusion models, improving both training efficiency and generation quality while maintaining view consistency.

Abstract: Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.

[209] TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

Junwen Pan, Rui Zhang, Xin Wan, Yuan Zhang, Ming Lu, Qi She

Main category: cs.CV

TL;DR: TimeSearch enables large video-language models to understand long videos using human-like hierarchical temporal search with Spotlight and Reflection primitives.

Details

Motivation: Current LVLMs struggle with long videos due to frame overload; downsampling causes visual hallucinations. Need human-like temporal understanding.

Method: Two human-like primitives: 1) Spotlight identifies relevant temporal events using Temporal-Augmented Frame Representation (TAFR) binding visual features with timestamps; 2) Reflection evaluates correctness using LVLMs’ temporal self-reflection. Progressive exploration with confidence-based prioritization.

Result: Outperforms SOTA: accuracy improves from 41.8% to 51.5% on LVBench. Temporal grounding improves mIoU on Charades-STA by 11.8%.

Conclusion: TimeSearch effectively enables LVLMs to understand long videos in human-like manner; TAFR stimulates surprising temporal grounding ability.

Abstract: Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hierarchical temporal search strategies, we propose \textbf{TimeSearch}, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) \textbf{Spotlight} efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) \textbf{Reflection} evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses previous state-of-the-art, improving the accuracy from 41.8% to 51.5% on the LVBench. Additionally, experiments on temporal grounding demonstrate that appropriate TAFR is adequate to effectively stimulate the surprising temporal grounding ability of LVLMs in a simpler yet versatile manner, which improves mIoU on Charades-STA by 11.8%. The code will be released.

[210] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen

Main category: cs.CV

TL;DR: MagicQuill V2 introduces layered composition for generative image editing, bridging diffusion models’ semantic power with traditional graphics software’s granular control through content, spatial, structural, and color layers.

Details

Motivation: Current diffusion models use singular, monolithic prompts that fail to disentangle distinct user intentions for content, position, and appearance, creating a gap between semantic power and granular control.

Method: Deconstructs creative intent into a stack of controllable visual cues: content layer (what to create), spatial layer (where to place it), structural layer (how it’s shaped), and color layer (palette). Includes specialized data generation pipeline for context-aware content integration, unified control module for all visual cues, and fine-tuned spatial branch for precise local editing including object removal.

Result: Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

Conclusion: MagicQuill V2’s layered composition paradigm successfully bridges the gap between diffusion models’ semantic capabilities and traditional graphics software’s control, enabling more precise and intuitive generative image editing.

Abstract: We propose MagicQuill V2, a novel system that introduces a \textbf{layered composition} paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

[211] Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment Alignment

Quoc-Huy Tran, Ahmed Mehmood, Muhammad Ahmed, Muhammad Naufil, Anas Zafar, Andrey Konin, M. Zeeshan Zia

Main category: cs.CV

TL;DR: Unsupervised transformer framework for temporal activity segmentation using both frame-level and segment-level cues, trained via temporal optimal transport and achieving state-of-the-art performance on four datasets.

Details

Motivation: Previous unsupervised activity segmentation methods rely primarily on frame-level information only, missing valuable segment-level cues that could improve segmentation accuracy and robustness.

Method: Three-module framework: 1) Frame-level prediction module with transformer encoder trained via temporal optimal transport, 2) Segment-level prediction module with transformer decoder for video transcripts, 3) Frame-to-segment alignment module for matching features. Uses pseudo labels inspired by temporal optimal transport for unsupervised training.

Result: Achieves comparable or better performance than previous methods on four public datasets: 50 Salads, YouTube Instructions, Breakfast, and Desktop Assembly.

Conclusion: The proposed unsupervised transformer framework effectively leverages both frame-level and segment-level cues for temporal activity segmentation, demonstrating superior performance through comprehensive experiments on multiple datasets.

Abstract: This paper presents an unsupervised transformer-based framework for temporal activity segmentation which leverages not only frame-level cues but also segment-level cues. This is in contrast with previous methods which often rely on frame-level information only. Our approach begins with a frame-level prediction module which estimates framewise action classes via a transformer encoder. The frame-level prediction module is trained in an unsupervised manner via temporal optimal transport. To exploit segment-level information, we utilize a segment-level prediction module and a frame-to-segment alignment module. The former includes a transformer decoder for estimating video transcripts, while the latter matches frame-level features with segment-level features, yielding permutation-aware segmentation results. Moreover, inspired by temporal optimal transport, we introduce simple-yet-effective pseudo labels for unsupervised training of the above modules. Our experiments on four public datasets, i.e., 50 Salads, YouTube Instructions, Breakfast, and Desktop Assembly show that our approach achieves comparable or better performance than previous methods in unsupervised activity segmentation. Our code and dataset are available on our research website: https://retrocausal.ai/research/.

[212] 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang

Main category: cs.CV

TL;DR: 3DIS is a novel framework that decouples multi-instance generation into depth-based layout generation and attribute rendering using pre-trained ControlNet, achieving superior layout precision and attribute rendering without additional training.

Details

Motivation: Multi-instance generation (MIG) techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering.

Method: 3DIS decouples MIG into two stages: (1) generating a coarse scene depth map for accurate instance positioning and scene composition, and (2) rendering fine-grained attributes using pre-trained ControlNet on any foundational model without additional training. It integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering.

Result: Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. The framework offers seamless compatibility with diverse foundational models.

Conclusion: 3DIS provides a robust, adaptable solution for advanced multi-instance generation that addresses the limitations of current MIG techniques, offering superior performance and compatibility with state-of-the-art models.

Abstract: The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation. The code is available at: https://github.com/limuloo/3DIS.

[213] Convolution goes higher-order: a biologically inspired mechanism empowers image classification

Simone Azeglio, Olivier Marre, Peter Neri, Ulisse Ferrari

Main category: cs.CV

TL;DR: A biologically-inspired CNN with learnable higher-order convolutions (Volterra-like expansion) outperforms traditional CNNs on standard benchmarks, with optimal performance at 3rd/4th order that aligns with natural image statistics.

Details

Motivation: To bridge neuroscience and deep learning by incorporating complex nonlinear biological visual processing into CNNs, specifically the multiplicative interactions observed in biological vision systems.

Method: Equip classical CNNs with learnable higher-order convolutions using a Volterra-like expansion of the convolution operator. Evaluate on synthetic datasets for sensitivity to higher-order correlations and standard benchmarks (MNIST, FashionMNIST, CIFAR10, CIFAR100, Imagenette).

Result: The architecture outperforms traditional CNN baselines, achieves optimal performance with 3rd/4th order expansions that align with natural image pixel intensity distributions. Systematic perturbation analysis shows different convolution orders process distinct visual information aspects. Representational Similarity Analysis reveals distinct geometries across layers.

Conclusion: This work bridges neuroscience and deep learning, offering biologically-inspired computer vision models that better capture complex visual patterns, especially in resource-constrained scenarios, and provides insights into visual information processing.

Abstract: We propose a novel approach to image classification inspired by complex nonlinear biological visual processing, whereby classical convolutional neural networks (CNNs) are equipped with learnable higher-order convolutions. Our model incorporates a Volterra-like expansion of the convolution operator, capturing multiplicative interactions akin to those observed in early and advanced stages of biological visual processing. We evaluated this approach on synthetic datasets by measuring sensitivity to testing higher-order correlations and performance in standard benchmarks (MNIST, FashionMNIST, CIFAR10, CIFAR100 and Imagenette). Our architecture outperforms traditional CNN baselines, and achieves optimal performance with expansions up to 3rd/4th order, aligning remarkably well with the distribution of pixel intensities in natural images. Through systematic perturbation analysis, we validate this alignment by isolating the contributions of specific image statistics to model performance, demonstrating how different orders of convolution process distinct aspects of visual information. Furthermore, Representational Similarity Analysis reveals distinct geometries across network layers, indicating qualitatively different modes of visual information processing. Our work bridges neuroscience and deep learning, offering a path towards more effective, biologically inspired computer vision models. It provides insights into visual information processing and lays the groundwork for neural networks that better capture complex visual patterns, particularly in resource-constrained scenarios.

[214] UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping

Aashish Rai, Dilin Wang, Mihir Jain, Nikolaos Sarafianos, Kefan Chen, Srinath Sridhar, Aayush Prakash

Main category: cs.CV

TL;DR: UVGS: A novel 2D representation for 3D Gaussian Splatting that enables direct application of 2D diffusion models for 3DGS generation.

Details

Motivation: 3D Gaussian Splatting (3DGS) shows superior 3D modeling quality but is challenging to generate due to its discrete, unstructured, and permutation-invariant nature. Existing methods struggle with direct 3DGS generation.

Method: Transform 3DGS into structured 2D UVGS representation using spherical mapping, compress heterogeneous Gaussian features into 3-channel RGB-like images via multi-branch network, then apply pre-trained 2D diffusion models without additional training.

Result: Enables various 3DGS generation applications including unconditional, conditional generation, and inpainting using existing 2D diffusion models. UVGS representation is scalable by increasing 2D resolution to accommodate more Gaussians.

Conclusion: UVGS bridges 3DGS with 2D generative models, unlocking novel 3DGS generation capabilities by leveraging existing 2D diffusion models, providing a simple yet effective solution to the 3DGS generation challenge.

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated superior quality in modeling 3D objects and scenes. However, generating 3DGS remains challenging due to their discrete, unstructured, and permutation-invariant nature. In this work, we present a simple yet effective method to overcome these challenges. We utilize spherical mapping to transform 3DGS into a structured 2D representation, termed UVGS. UVGS can be viewed as multi-channel images, with feature dimensions as a concatenation of Gaussian attributes such as position, scale, color, opacity, and rotation. We further find that these heterogeneous features can be compressed into a lower-dimensional (e.g., 3-channel) shared feature space using a carefully designed multi-branch network. The compressed UVGS can be treated as typical RGB images. Remarkably, we discover that typical VAEs trained with latent diffusion models can directly generalize to this new representation without additional training. Our novel representation makes it effortless to leverage foundational 2D models, such as diffusion models, to directly model 3DGS. Additionally, one can simply increase the 2D UV resolution to accommodate more Gaussians, making UVGS a scalable solution compared to typical 3D backbones. This approach immediately unlocks various novel generation applications of 3DGS by inherently utilizing the already developed superior 2D generation capabilities. In our experiments, we demonstrate various unconditional, conditional generation, and inpainting applications of 3DGS based on diffusion models, which were previously non-trivial.

Jinya Sakurai, Yuki Koyama, Issei Sato

Main category: cs.CV

TL;DR: FairT2I is a training-free framework that uses latent variable guidance and LLM-based bias detection to mitigate societal biases in text-to-image generation while preserving image quality.

Details

Motivation: Text-to-image models often reproduce societal biases from their training data, creating a need for effective debiasing methods that don't require retraining.

Method: Uses mathematically principled latent variable guidance to decompose score functions, LLM-based bias detection to identify bias-prone attributes, and attribute resampling with adjustable distributions.

Result: LLMs outperform human annotators in bias detection, and FairT2I achieves better bias mitigation and diversity than baselines while maintaining image quality and prompt fidelity.

Conclusion: FairT2I provides a unified, flexible, and interactive framework for bias-aware text-to-image generation that subsumes existing approaches and enables real-time user control.

Abstract: Text-to-image (T2I) models have advanced creative content generation, yet their reliance on large uncurated datasets often reproduces societal biases. We present FairT2I, a training-free and interactive framework grounded in a mathematically principled latent variable guidance formulation. This formulation decomposes the generative score function into attribute-conditioned components and reweights them according to a defined distribution, providing a unified and flexible mechanism for bias-aware generation that also subsumes many existing ad hoc debiasing approaches as special cases. Building upon this foundation, FairT2I incorporates (1) latent variable guidance as the core mechanism, (2) LLM-based bias detection to automatically infer bias-prone categories and attributes from text prompts as part of the latent structure, and (3) attribute resampling, which allows users to adjust or redefine the attribute distribution based on uniform, real-world, or user-specified statistics. The accompanying user interface supports this pipeline by enabling users to inspect detected biases, modify attributes or weights, and generate debiased images in real time. Experimental results show that LLMs outperform average human annotators in the number and granularity of detected bias categories and attributes. Moreover, FairT2I achieves superior performance to baseline models in both societal bias mitigation and image diversity, while preserving image quality and prompt fidelity.

[216] NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation

Max Gandyra, Alessandro Santonicola, Michael Beetz

Main category: cs.CV

TL;DR: NOCTIS is a training-free framework for instance segmentation of novel objects using Grounded-SAM 2 for proposals and DINOv2 for embeddings, with cyclic thresholding for robust matching, outperforming RGB and RGB-D methods on BOP 2023 challenge.

Details

Motivation: Instance segmentation of novel objects without retraining is challenging. Existing methods struggle with generalization across diverse object types without model adaptation.

Method: Combines Grounded-SAM 2 for object proposals (bounding boxes + masks) and DINOv2 for embeddings. Uses cyclic thresholding (CT) for proposal-object matching based on class/patch embedding similarity. Adds appearance score, average proposal confidence, and RGB-only pipeline.

Result: Outperforms best RGB and RGB-D methods on BOP 2023 challenge’s “Model-based 2D segmentation of unseen objects” task across seven core datasets without any training/fine-tuning.

Conclusion: NOCTIS demonstrates effective training-free instance segmentation for novel objects using pre-trained models with robust matching mechanisms, achieving state-of-the-art performance on benchmark datasets.

Abstract: Instance segmentation of novel objects instances in RGB images, given some example images for each object, is a well known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-) training has proven to be a difficult task. To handle this, we present a new training-free framework, called: Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals’ bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training/fine tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the “Model-based 2D segmentation of unseen objects” task.

[217] Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression

Thibaut Loiseau, Guillaume Bourmaud, Vincent Lepetit

Main category: cs.CV

TL;DR: Alligat0R replaces CroCo’s cross-view completion with covisibility segmentation pre-training, using a new Cub3 dataset, achieving better pose regression performance.

Details

Motivation: Cross-view completion pre-training (like CroCo) is ill-posed in non-covisible regions, limiting effectiveness. Need a pre-training approach that works well in both covisible and non-covisible areas while providing interpretable predictions.

Method: Alligat0R replaces cross-view completion with covisibility segmentation task. It predicts for each pixel in one image whether it’s covisible in second image, occluded, or outside field of view. Uses new Cub3 dataset with 5M image pairs and dense covisibility annotations from nuScenes and ScanNet.

Result: Alligat0R significantly outperforms CroCo in relative pose regression tasks. The method provides interpretable predictions and works effectively in both covisible and non-covisible regions.

Conclusion: Covisibility segmentation pre-training (Alligat0R) is more effective than cross-view completion (CroCo) for vision tasks, especially when dealing with varying degrees of image overlap. The approach provides interpretability and better performance on pose regression.

Abstract: Pre-training techniques have greatly advanced computer vision, with CroCo’s cross-view completion approach yielding impressive results in tasks like 3D reconstruction and pose regression. However, cross-view completion is ill-posed in non-covisible regions, limiting its effectiveness. We introduce Alligat0R, a novel pre-training approach that replaces cross-view learning with a covisibility segmentation task. Our method predicts whether each pixel in one image is covisible in the second image, occluded, or outside the field of view, making the pre-training effective in both covisible and non-covisible regions, and provides interpretable predictions. To support this, we present Cub3, a large-scale dataset with 5M image pairs and dense covisibility annotations derived from the nuScenes and ScanNet datasets. Cub3 includes diverse scenarios with varying degrees of overlap. The experiments show that our novel pre-training method Alligat0R significantly outperforms CroCo in relative pose regression. Code is available at https://github.com/thibautloiseau/alligat0r.

[218] MegaSR: Mining Customized Semantics and Expressive Guidance for Real-World Image Super-Resolution

Xinrui Li, Jinrong Zhang, Jianlong Wu, Chong Chen, Liqiang Nie, Zhouchen Lin

Main category: cs.CV

TL;DR: MegaSR improves T2I-based Real-ISR by addressing fine detail deficiency, semantic inconsistency, and edge ambiguity through customized semantics and multimodal signal fusion.

Details

Motivation: Existing T2I-based Real-ISR methods suffer from three critical issues: (1) fine detail deficiency leading to incorrect local reconstruction, (2) block-wise semantic inconsistency causing distracted interpretations across U-Net blocks, and (3) edge ambiguity resulting in structural degradation.

Method: Proposes MegaSR with two key components: Customized Semantics Module (CSM) supplements fine-grained semantics from image modality and regulates semantic fusion between multi-level knowledge for different U-Net blocks. Multimodal Signal Fusion Module (MSFM) identifies expressive multimodal signals through pair-wise comparisons and aggregates them for structurally consistent reconstruction.

Result: Extensive experiments on real-world and synthetic datasets show state-of-the-art performance on quality-driven metrics while remaining competitive on fidelity-focused metrics, achieving a balance between perceptual realism and faithful content reconstruction.

Conclusion: MegaSR successfully addresses the identified limitations of T2I-based Real-ISR by enhancing models with fine-grained customized semantics and expressive guidance, enabling semantically rich and structurally consistent reconstruction.

Abstract: Text-to-image (T2I) models have ushered in a new era of real-world image super-resolution (Real-ISR) due to their rich internal implicit knowledge for multimodal learning. Although bringing high-level semantic priors and dense pixel guidance have led to advances in reconstruction, we identified several critical phenomena by analyzing the behavior of existing T2I-based Real-ISR methods: (1) Fine detail deficiency, which ultimately leads to incorrect reconstruction in local regions. (2) Block-wise semantic inconsistency, which results in distracted semantic interpretations across U-Net blocks. (3) Edge ambiguity, which causes noticeable structural degradation. Building upon these observations, we first introduce MegaSR, which enhances the T2I-based Real-ISR models with fine-grained customized semantics and expressive guidance to unlock semantically rich and structurally consistent reconstruction. Then, we propose the Customized Semantics Module (CSM) to supplement fine-grained semantics from the image modality and regulate the semantic fusion between multi-level knowledge to realize customization for different U-Net blocks. Besides the semantic adaptation, we identify expressive multimodal signals through pair-wise comparisons and introduce the Multimodal Signal Fusion Module (MSFM) to aggregate them for structurally consistent reconstruction. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of the method. Notably, it not only achieves state-of-the-art performance on quality-driven metrics but also remains competitive on fidelity-focused metrics, striking a balance between perceptual realism and faithful content reconstruction.

[219] Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey

Liewen Liao, Weihao Yan, Wang Xu, Ming Yang, Songan Zhang, H. Eric Tseng

Main category: cs.CV

TL;DR: A comprehensive survey of learning-based 3D reconstruction techniques in autonomous driving, covering technical evolutions, practical applications, current challenges, and future directions.

Details

Motivation: Learning-based 3D reconstruction has become transformative for autonomous driving, enabling precise environment modeling for tasks like dense mapping, simulation, and scene understanding. With rapid growth in this research area, there's a need for systematic review of both technical developments and practical applications.

Method: The survey begins with introduction to learning-based 3D reconstruction fundamentals, then provides multi-dimensional examination of cutting-edge methodologies organized according to autonomous driving’s technical requirements and fundamental challenges. The analysis includes systematic organization of existing approaches and identification of trends.

Result: The survey identifies existing technical challenges and notes insufficient disclosure of on-board validation and safety verification details in current literature. It analyzes development trends and cutting-edge research in the field.

Conclusion: The paper suggests potential directions to guide future studies in learning-based 3D reconstruction for autonomous driving, addressing identified gaps in validation and safety verification while advancing the field’s technical capabilities.

Abstract: Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of environments through advanced neural representations. It has inspired pioneering solutions for vital tasks in autonomous driving, such as dense mapping and closed-loop simulation, as well as comprehensive scene feature for driving scene understanding and reasoning. Given the rapid growth in related research, this survey provides a comprehensive review of both technical evolutions and practical applications in autonomous driving. We begin with an introduction to the preliminaries of learning-based 3D reconstruction to provide a solid technical background foundation, then progress to a rigorous, multi-dimensional examination of cutting-edge methodologies, systematically organized according to the distinctive technical requirements and fundamental challenges of autonomous driving. Through analyzing and summarizing development trends and cutting-edge research, we identify existing technical challenges, along with insufficient disclosure of on-board validation and safety verification details in the current literature, and ultimately suggest potential directions to guide future studies.

[220] Detect Anything 3D in the Wild

Hanxue Zhang, Haoran Jiang, Qingsong Yao, Yanan Sun, Renrui Zhang, Hao Zhao, Hongyang Li, Hongzi Zhu, Zetong Yang

Main category: cs.CV

TL;DR: DetAny3D is a promptable 3D detection foundation model that achieves zero-shot generalization to novel objects and camera configurations using only monocular inputs by transferring knowledge from pre-trained 2D foundation models.

Details

Motivation: Existing deep learning approaches for 3D object detection struggle with zero-shot generalization to novel objects and camera configurations. The limited availability of annotated 3D data constrains training foundation models for 3D detection, necessitating the use of rich prior knowledge from extensively pre-trained 2D models.

Method: DetAny3D incorporates two core modules: 1) 2D Aggregator that aligns features from different 2D foundation models, and 2) 3D Interpreter with Zero-Embedding Mapping that stabilizes early training in 2D-to-3D knowledge transfer. The model uses only monocular inputs and is promptable for detecting any novel object.

Result: Experimental results show strong generalization capabilities: achieves state-of-the-art performance on unseen categories and novel camera configurations, and surpasses most competitors on in-domain data. Demonstrates potential for real-world applications like rare object detection in autonomous driving.

Conclusion: DetAny3D sheds light on the potential of 3D foundation models for diverse real-world applications and demonstrates promise for further exploration of 3D-centric tasks in open-world settings, addressing the fundamental challenge of limited 3D annotated data through effective 2D knowledge transfer.

Abstract: Despite the success of deep learning in close-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which stabilizes early training in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at our code repository.

Lu Yue, Dongliang Zhou, Liang Xie, Erwei Yin, Feitian Zhang

Main category: cs.CV

TL;DR: ST-Booster enhances VLN-CE navigation through iterative spatiotemporal boosting with hierarchical encoding, multi-granularity fusion, and value-guided waypoint generation.

Details

Motivation: VLN-CE faces two core perception challenges: 1) heterogeneous visual memories and weakened global spatial correlations due to absence of predefined observation points, and 2) structural noise from cumulative 3D reconstruction errors impairing local feature perception.

Method: ST-Booster consists of three modules: Hierarchical SpatioTemporal Encoding (HSTE) for global topological graphs and local grid maps, Multi-Granularity Aligned Fusion (MGAF) for geometry-aware knowledge fusion with instructions, and ValueGuided Waypoint Generation (VGWG) that generates Guided Attention Heatmaps for environment-instruction relevance modeling.

Result: ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments, as demonstrated through extensive comparative experiments and performance analyses.

Conclusion: The proposed iterative spatiotemporal booster effectively addresses VLN-CE perception challenges through multi-granularity perception and instruction-aware reasoning, achieving superior navigation performance.

Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules – Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and ValueGuided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures shortterm local details via grid maps. MGAF aligns these dualmap representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.

[222] WorldMem: Long-term Consistent World Simulation with Memory

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan

Main category: cs.CV

TL;DR: WorldMem is a framework that uses a memory bank with attention mechanisms to enhance world simulation by maintaining long-term 3D spatial consistency and capturing dynamic evolution over time.

Details

Motivation: Current world simulation methods struggle with limited temporal context windows, leading to failures in maintaining long-term consistency, especially in preserving 3D spatial consistency across significant viewpoint or temporal gaps.

Method: WorldMem introduces a memory bank consisting of memory units that store memory frames and states (poses and timestamps). It employs a memory attention mechanism to extract relevant information from these memory frames based on their states, enabling accurate reconstruction of previously observed scenes.

Result: The method can accurately reconstruct previously observed scenes even under significant viewpoint or temporal gaps. By incorporating timestamps, it captures dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in virtual and real scenarios validate its effectiveness.

Conclusion: WorldMem provides an effective framework for enhancing world simulation by addressing long-term consistency issues through memory mechanisms, enabling more realistic modeling of both static and dynamic virtual environments.

Abstract: World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.

[223] ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with Preference Alignment

Martin JJ. Bucher, Iro Armeni

Main category: cs.CV

TL;DR: ReSpace: A generative framework using autoregressive language models for text-driven 3D indoor scene synthesis and editing with explicit room boundaries and asset-agnostic deployment.

Details

Motivation: Current 3D indoor scene synthesis methods have limitations: they oversimplify object semantics with one-hot encodings, require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer natural language semantics but lack editing functionality, are limited to rectangular layouts, or have weak spatial reasoning.

Method: ReSpace uses autoregressive language models with a compact structured scene representation featuring explicit room boundaries. It employs a dual-stage training approach combining supervised fine-tuning and preference alignment to train a language model for object addition that considers user instructions, spatial geometry, object semantics, and scene composition. For editing, it uses a zero-shot LLM for object removal and prompts for addition.

Result: Experimental results show ReSpace surpasses state-of-the-art on object addition tasks and achieves superior human-perceived quality on full scene synthesis. The framework also introduces a voxelization-based evaluation method that captures fine-grained geometry beyond 3D bounding boxes.

Conclusion: ReSpace provides an effective framework for text-driven 3D indoor scene synthesis and editing that addresses limitations of existing methods by combining explicit room boundaries, asset-agnostic deployment, and language model capabilities for both synthesis and editing tasks.

Abstract: Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., ‘chair’ or ’table’), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics via natural language (e.g., ‘modern studio with light wood furniture’), but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that enables asset-agnostic deployment and frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a voxelization-based evaluation capturing fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on addition and achieve superior human-perceived quality on full scene synthesis.

[224] Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels

Jonathan Grizou, Carlos de la Torre-Ortiz, Tuukka Ruotsalo

Main category: cs.CV

TL;DR: CURSOR is a self-calibration framework that recovers mental targets from EEG-image pairs without labeled data or pre-trained decoders.

Details

Motivation: Previous mental target recovery methods require labeled data, but real-world scenarios often lack labeled EEG-image pairs. The paper aims to solve this via self-calibration without any labeled information.

Method: CURSOR algorithm learns from unlabeled paired EEG and image data to recover unknown mental targets. It predicts image similarity scores correlated with human perception, uses these to rank stimuli against mental targets, and generates new stimuli matching the mental target.

Result: Experiments on naturalistic face images show CURSOR can: 1) predict similarity scores correlating with human judgments without labels, 2) rank stimuli against unknown mental targets, and 3) generate stimuli indistinguishable from mental targets (validated via user study with N=53).

Conclusion: CURSOR presents the first framework for self-calibrated mental target recovery without labeled data, demonstrating practical applications in brain-computer interfaces and neuroscience research.

Abstract: We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).

Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang

Main category: cs.CV

TL;DR: VLMs fail at reading fragmented/occluded text that humans easily read, revealing they lack compositional understanding of writing systems.

Details

Motivation: To test whether advanced vision language models share human resilience in recognizing words despite character fragmentation, fusion, or occlusion.

Method: Created psychophysics-inspired benchmarks for Chinese logographs and English alphabetic words by splicing, recombining, and overlaying glyphs to create “visible but unreadable” stimuli for models while remaining legible to humans.

Result: Despite strong performance on clean text, contemporary VLMs show severe performance drops under these perturbations, frequently producing unrelated or incoherent outputs, indicating they rely on generic visual invariances but lack compositional priors.

Conclusion: VLMs have structural limitations in robust literacy, motivating architectures that encode symbol segmentation, composition, and binding across scripts for applications in education, accessibility, cultural heritage, and security.

Abstract: Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield ‘‘visible but unreadable’’ stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

[226] PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger

Main category: cs.CV

TL;DR: PrITTI introduces a primitive-based 3D semantic urban scene generation method using vectorized object primitives and rasterized ground surfaces, achieving better quality, efficiency, and editability than voxel-based approaches.

Details

Motivation: Existing voxel-based 3D urban scene generation methods suffer from fixed resolution limitations, difficult editability, and high memory requirements. The authors propose a primitive-based paradigm as a more compact, semantically meaningful, and easily manipulable alternative.

Method: PrITTI is a latent diffusion model that uses a hybrid representation: vectorized object primitives for semantic objects and rasterized ground surfaces. This creates a structured latent space enabling object- and ground-level manipulation through diffusion transformers.

Result: On KITTI-360 dataset, PrITTI achieves state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability compared to voxel-based methods. It also supports various downstream applications.

Conclusion: Primitive-based representations unlock the full potential of diffusion transformers for 3D urban scene generation, offering superior performance, efficiency, and editability while enabling diverse applications like scene editing, inpainting, outpainting, and street-view synthesis.

Abstract: Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. Code and models are publicly available at $\href{https://raniatze.github.io/pritti/}{https://raniatze.github.io/pritti}$.

[227] Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi

Main category: cs.CV

TL;DR: Ov3R is a framework for open-vocabulary 3D reconstruction from RGB video that integrates CLIP semantics directly into reconstruction, achieving SOTA in both geometry and semantic segmentation.

Details

Motivation: To advance Spatial AI by creating a system that can perform both dense 3D reconstruction and open-vocabulary semantic segmentation from RGB video streams, overcoming limitations of prior methods that don't integrate semantics directly into reconstruction.

Method: Two key components: 1) CLIP3R - CLIP-informed 3D reconstruction module predicting dense point maps with object-level semantics from overlapping clips; 2) 2D-3D OVS - open-vocabulary semantic module lifting 2D features into 3D via fused descriptors combining spatial, geometric, and semantic cues.

Result: Achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, enabling globally consistent geometry with fine-grained semantic alignment.

Conclusion: Ov3R represents a step forward toward real-time, semantics-aware Spatial AI by directly incorporating CLIP semantics into the 3D reconstruction process, enabling superior geometric and semantic understanding from RGB video.

Abstract: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.

[228] Bias Beyond Demographics: Probing Decision Boundaries in Black-Box LVLMs via Counterfactual VQA

Zaiying Zhao, Toshihiko Yamasaki

Main category: cs.CV

TL;DR: The paper introduces a counterfactual VQA benchmark to evaluate fairness in large vision-language models beyond demographic attributes, revealing that non-demographic contextual factors cause stronger decision-making distortions than demographic ones, and that human-norm-validated examples can improve model consistency.

Details

Motivation: Current fairness evaluations for LVLMs are limited to demographic attributes and often confuse fairness with refusal behavior. There's a need to broaden fairness assessment to include non-demographic contextual factors and understand how models make decisions under controlled context shifts.

Method: Developed a counterfactual VQA benchmark with image pairs that differ only in a single visual attribute irrelevant to the question. This enables ground-truth-free and refusal-aware analysis of reasoning stability. The approach allows probing decision boundaries of closed-source LVLMs under controlled context shifts.

Result: Non-demographic attributes (environmental context, social behavior) distort LVLM decision-making more strongly than demographic attributes. Instruction-based debiasing shows limited effectiveness and can even amplify asymmetries. Exposure to a small number of human-norm-validated examples from the benchmark encourages more consistent and balanced responses.

Conclusion: The benchmark provides a practical basis for auditing contextual biases in black-box LVLMs and contributes to more transparent and equitable multimodal reasoning. It serves not only as an evaluative framework but also as a means for understanding and improving model behavior.

Abstract: Recent advances in large vision-language models (LVLMs) have amplified concerns about fairness, yet existing evaluations remain confined to demographic attributes and often conflate fairness with refusal behavior. This paper broadens the scope of fairness by introducing a counterfactual VQA benchmark that probes the decision boundaries of closed-source LVLMs under controlled context shifts. Each image pair differs in a single visual attribute that has been validated as irrelevant to the question, enabling ground-truth-free and refusal-aware analysis of reasoning stability. Comprehensive experiments reveal that non-demographic attributes, such as environmental context or social behavior, distort LVLM decision-making more strongly than demographic ones. Moreover, instruction-based debiasing shows limited effectiveness and can even amplify these asymmetries, whereas exposure to a small number of human norm validated examples from our benchmark encourages more consistent and balanced responses, highlighting its potential not only as an evaluative framework but also as a means for understanding and improving model behavior. Together, these results provide a practial basis for auditing contextual biases even in black-box LVLMs and contribute to more transparent and equitable multimodal reasoning.

[229] LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi

Main category: cs.CV

TL;DR: LiDARCrafter: A unified framework for 4D LiDAR generation and editing using natural language inputs, achieving state-of-the-art performance in fidelity, controllability, and temporal consistency.

Details

Motivation: Existing generative world models focus on videos or occupancy grids but overlook unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling faces challenges in controllability, temporal coherence, and evaluation standardization.

Method: Parses free-form natural language into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. Includes an autoregressive module for temporally coherent 4D LiDAR sequences with smooth transitions.

Result: Achieves state-of-the-art performance on nuScenes dataset across scene-, object-, and sequence-level metrics. Demonstrates superior fidelity, controllability, and temporal consistency compared to existing methods.

Conclusion: LiDARCrafter provides a comprehensive solution for 4D LiDAR generation and editing, establishing a standardized benchmark and paving the way for data augmentation and simulation in autonomous driving.

Abstract: Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.

[230] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression

Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji, Yulun Zhang

Main category: cs.CV

TL;DR: SODEC: A single-step diffusion image compression model that addresses slow decoding and poor fidelity issues in diffusion-based compression by replacing iterative denoising with single-step decoding using informative VAE latents.

Details

Motivation: Diffusion-based image compression has impressive perceptual quality but suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors.

Method: Proposes SODEC with three key components: (1) Uses pre-trained VAE to produce informative latents making multi-step refinement unnecessary, (2) Introduces fidelity guidance module to ensure output faithfulness to original image, (3) Designs rate annealing training strategy for effective training under extremely low bitrates.

Result: SODEC significantly outperforms existing methods in rate-distortion-perception performance and improves decoding speed by more than 20× compared to previous diffusion-based compression models.

Conclusion: SODEC successfully addresses the key limitations of diffusion-based compression by enabling single-step decoding while maintaining high fidelity, making it practical for real-world applications with superior performance and dramatically faster decoding.

Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20$\times$. Code is released at: https://github.com/zhengchen1999/SODEC.

[231] Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

Nicola Messina, Rosario Leonardi, Luca Ciampi, Fabio Carrara, Giovanni Maria Farinella, Fabrizio Falchi, Antonino Furnari

Main category: cs.CV

TL;DR: Weakly-supervised in-hand object segmentation using natural language narrations instead of pixel-level annotations.

Details

Motivation: Pixel-level object recognition from egocentric images is valuable for assistive tech, industrial safety, and activity monitoring, but progress is hindered by scarce annotated datasets requiring costly manual labeling.

Method: Proposes NS-iHOS task and WISH model that learns human-object interaction detection from natural language narrations in a weakly-supervised regime, distilling knowledge from narrations to learn hand-object associations without using narrations at test time.

Result: WISH surpasses all baselines (open-vocabulary detectors and vision-language models) on EPIC-Kitchens and Ego4D datasets, recovering >50% of fully supervised methods’ performance without pixel-wise annotations.

Conclusion: Natural language narrations provide effective weak supervision for in-hand object segmentation, enabling practical applications without costly manual annotation.

Abstract: Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations $\unicode{x2013}$ natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations in a weakly-supervised regime. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations. Code and data can be found at https://fpv-iplab.github.io/WISH.

[232] Multimodal LLMs See Sentiment

Neemias B. da Silva, John Harrison, Rodrigo Minetto, Myriam R. Delgado, Bogdan T. Nassu, Thiago H. Silva

Main category: cs.CV

TL;DR: MLLMsent framework uses Multimodal LLMs for visual sentiment analysis, achieving SOTA results through three approaches: direct classification, LLM-augmented analysis, and fine-tuning on descriptions.

Details

Motivation: Visual content dominates online interaction, but sentiment analysis from images is challenging due to complex scene-level semantics. Current methods struggle with understanding nuanced visual sentiment.

Method: Proposes MLLMsent framework with three approaches: 1) Direct sentiment classification using MLLMs, 2) Combining MLLMs with pre-trained LLMs for analysis of generated image descriptions, 3) Fine-tuning LLMs on sentiment-labeled image descriptions.

Result: Achieves state-of-the-art results, outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4% respectively. In cross-dataset testing without training on new data, outperforms best runner-up by up to 8.26%.

Conclusion: The framework demonstrates strong visual reasoning capabilities for affective computing and establishes new benchmarks for future research in visual sentiment analysis.

Abstract: Understanding how visual content communicates sentiment is critical in an era where online interaction is increasingly dominated by this kind of media on social platforms. However, this remains a challenging problem, as sentiment perception is closely tied to complex, scene-level semantics. In this paper, we propose an original framework, MLLMsent, to investigate the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) through three perspectives: (1) using those MLLMs for direct sentiment classification from images; (2) associating them with pre-trained LLMs for sentiment analysis on automatically generated image descriptions; and (3) fine-tuning the LLMs on sentiment-labeled image descriptions. Experiments on a recent and established benchmark demonstrate that our proposal, particularly the fine-tuned approach, achieves state-of-the-art results outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively, across different levels of evaluators’ agreement and sentiment polarity categories. Remarkably, in a cross-dataset test, without any training on these new data, our model still outperforms, by up to 8.26%, the best runner-up, which has been trained directly on them. These results highlight the potential of the proposed visual reasoning scheme for advancing affective computing, while also establishing new benchmarks for future research.

[233] Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation

Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, Jingyong Su

Main category: cs.CV

TL;DR: GSDD proposes a sparse 2D Gaussian representation for dataset distillation that replaces dense pixel-level representations with a small number of Gaussian primitives, improving efficiency and performance.

Details

Motivation: Current dataset distillation methods use dense pixel-level representations that are redundant and difficult to scale. There's a need for more efficient representations that can capture critical information while reducing computational and storage burdens.

Method: GSDD uses sparse 2D Gaussian representations to encode distilled images with only a small number of Gaussian primitives. It adapts CUDA-based splatting operators for parallel inference and training, enabling efficient rendering with minimal computational overhead.

Result: GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets while maintaining highly efficient encoding and decoding costs. The sparse representation improves dataset diversity under the same storage budget.

Conclusion: GSDD provides a simple yet effective, broadly applicable, and highly scalable approach to dataset distillation that outperforms conventional dense representations while being more efficient.

Abstract: Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while remaining highly efficient encoding and decoding cost. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.

[234] ROGR: Relightable 3D Objects using Generative Relighting

Jiapeng Tang, Matthew Lavine, Dor Verbin, Stephan J. Garbin, Matthias Nießner, Ricardo Martin Brualla, Pratul P. Srinivasan, Philipp Henzler

Main category: cs.CV

TL;DR: ROGR reconstructs relightable 3D objects using a generative relighting model and dual-branch NeRF architecture for arbitrary environment lighting without per-illumination optimization.

Details

Motivation: To create relightable 3D models from multi-view captures that can be realistically rendered under arbitrary environmental lighting conditions without requiring expensive per-illumination optimization or light transport simulation.

Method: Uses generative relighting model to sample object appearances under multiple lighting environments, trains a lighting-conditioned NeRF with dual-branch architecture (separating general lighting effects and specularities), enabling feed-forward relighting under arbitrary environment maps.

Result: Outperforms state-of-the-art on most metrics on TensoIR and Stanford-ORB datasets, and demonstrates effectiveness on real-world object captures.

Conclusion: ROGR provides an efficient approach for relightable 3D reconstruction that enables realistic rendering under arbitrary environmental lighting without per-illumination optimization.

Abstract: We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model that simulates the effects of placing the object under novel environment illuminations. Our method samples the appearance of the object under multiple lighting environments, creating a dataset that is used to train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object’s appearance under any input environmental lighting. The lighting-conditioned NeRF uses a novel dual-branch architecture to encode the general lighting effects and specularities separately. The optimized lighting-conditioned NeRF enables efficient feed-forward relighting under arbitrary environment maps without requiring per-illumination optimization or light transport simulation. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, where it improves upon the state-of-the-art on most metrics, and showcase our approach on real-world object captures.

[235] OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

Ryoto Miyamoto, Xin Fan, Fuyuko Kido, Tsuneo Matsumoto, Hayato Yamana

Main category: cs.CV

TL;DR: OpenLVLM-MIA is a new benchmark revealing that high MIA success rates on LVLMs often stem from dataset bias detection rather than true membership inference, with SOTA methods performing at chance level on properly balanced data.

Details

Motivation: Previous MIA evaluations on large vision-language models reported high success rates, but these results may be misleading due to distributional biases in dataset construction rather than genuine membership detection capabilities.

Method: Created OpenLVLM-MIA benchmark with 6,000 images where member and non-member distributions are carefully balanced, providing ground-truth membership labels across three distinct training stages to eliminate dataset bias.

Result: State-of-the-art MIA methods performed at chance-level on the balanced benchmark, demonstrating that previously reported high success rates were artifacts of dataset bias rather than true attack effectiveness.

Conclusion: OpenLVLM-MIA provides a transparent, unbiased benchmark that clarifies limitations in current MIA research on LVLMs and establishes a solid foundation for developing stronger privacy-preserving techniques.

Abstract: OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6{,}000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods approached chance-level. OpenLVLM-MIA, designed to be transparent and unbiased benchmark, clarifies certain limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.

[236] Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

Main category: cs.CV

TL;DR: Diffusion-SDPO introduces a safeguarded update rule for diffusion-based DPO that prevents degradation of preferred outputs by adaptively scaling loser gradients, improving alignment with human preferences in text-to-image generation.

Details

Motivation: Current diffusion-based Direct Preference Optimization (DPO) methods have a critical pathology where enlarging preference margins doesn't necessarily improve generation quality, and can actually increase reconstruction errors for both winner and loser branches, degrading overall performance.

Method: Diffusion-SDPO uses a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step.

Result: Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics.

Conclusion: Diffusion-SDPO is a simple, model-agnostic solution that addresses the pathology in diffusion-based DPO, providing better alignment with human preferences while adding only marginal computational overhead and being compatible with existing DPO-style frameworks.

Abstract: Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.

[237] AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired

Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla

Main category: cs.CV

TL;DR: AIDEN is an AI assistant for visually impaired people that uses object detection, scene description, and haptic feedback to help with object identification, text reading, and navigation without causing auditory overload or privacy concerns.

Details

Motivation: Visually impaired individuals face challenges with object identification, text reading, and navigation. Existing audio-based solutions cause auditory overload and privacy concerns in public spaces.

Method: Hybrid architecture combining YOLO for real-time object detection, LLaVA for scene description and OCR, and a continuous haptic guidance system using a Geiger-counter metaphor for object centering without audio feedback.

Result: High user satisfaction in empirical evaluations, particularly regarding intuitiveness and perceived autonomy. The “Find an Object” feature achieved effective real-time performance.

Conclusion: Multimodal haptic-visual feedback improves daily usability and independence compared to traditional audio-centric methods, showing promise for larger-scale clinical validations.

Abstract: This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the ``Find an Object’’ achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.

[238] Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama

Main category: cs.CV

TL;DR: Otter introduces a novel RWKV-based architecture with compound segmentation and temporal reconstruction modules to address background distractions and temporal relation degradation in wide-angle few-shot action recognition.

Details

Motivation: Wide-angle videos in few-shot action recognition suffer from background distractions that obscure subjects and temporal relation degradation due to frames with similar backgrounds, making action recognition challenging despite the rich contextual information.

Method: Proposes Otter with two key modules: 1) Compound Segmentation Module (CSM) to segment and emphasize key patches in each frame, highlighting subjects against background; 2) Temporal Reconstruction Module (TRM) for bidirectional scanning to reconstruct temporal relations. Combines regular prototypes with temporal-enhanced prototypes for improved subject emphasis and temporal modeling.

Result: Achieves state-of-the-art performance on SSv2, Kinetics, UCF101, and HMDB51 benchmarks. Additional evaluation on VideoBadminton dataset validates superiority in wide-angle few-shot action recognition.

Conclusion: Otter effectively addresses background distractions and temporal relation degradation in wide-angle FSAR through compound segmentation and temporal reconstruction, demonstrating superior performance across multiple benchmarks.

Abstract: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.

[239] SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering

Laura Bragagnolo, Leonardo Barcellona, Stefano Ghidoni

Main category: cs.CV

TL;DR: SkelSplat: A novel multi-view 3D human pose estimation framework using differentiable Gaussian rendering to model human pose as a skeleton of 3D Gaussians, enabling view fusion without 3D ground-truth supervision.

Details

Motivation: Current multi-view 3D human pose estimation methods rely on large annotated datasets and have poor generalization when test scenarios differ from training data. There's a need for methods that can fuse arbitrary camera views without requiring 3D ground-truth supervision.

Method: Models human pose as a skeleton of 3D Gaussians (one per joint) and uses differentiable Gaussian rendering for optimization. Introduces a novel one-hot encoding scheme to enable independent optimization of human joints, adapting Gaussian Splatting (originally for dense scene reconstruction) for skeleton modeling.

Result: Outperforms approaches that don’t rely on 3D ground truth on Human3.6M and CMU datasets. Reduces cross-dataset error up to 47.8% compared to learning-based methods. Demonstrates robustness to occlusions on Human3.6M-Occ and Occlusion-Person datasets without scenario-specific fine-tuning.

Conclusion: SkelSplat provides an effective framework for multi-view 3D human pose estimation that generalizes well across different scenarios, handles occlusions robustly, and doesn’t require 3D ground-truth supervision, making it more practical for real-world applications.

Abstract: Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth in Human3.6M and CMU, while reducing the cross-dataset error up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.

[240] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Konstantinos M. Dafnis, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: STS is a lightweight test-time adaptation framework for VLMs that steers latent representations using spectral subspaces without backpropagation, achieving faster inference with minimal parameters.

Details

Motivation: Vision-Language Models degrade under test-time domain shifts, and existing adaptation methods require backpropagation through large encoders or alter core model components, making them computationally expensive.

Method: Extracts spectral subspace from textual embeddings to define principal semantic directions, then learns to steer latent representations by adapting small per-sample shift parameters to minimize entropy across augmented views, operating entirely in latent space without backpropagation.

Result: STS surpasses or compares favorably against state-of-the-art test-time adaptation methods, with 8x faster inference, 12x smaller memory footprint, and only a handful of additional parameters.

Conclusion: STS provides an efficient, lightweight alternative to conventional test-time adaptation methods that avoids computational overhead while maintaining strong performance under domain shifts.

Abstract: Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.

[241] TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

Yukuo Ma, Cong Liu, Junke Wang, Junqi Liu, Haibin Huang, Zuxuan Wu, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: TempoMaster is a novel framework for long video generation that formulates the task as next-frame-rate prediction, generating low-frame-rate clips first and progressively increasing frame rates for refinement.

Details

Motivation: The motivation is to address the challenge of generating long videos with both visual quality and temporal coherence. Existing methods struggle with maintaining consistency over long sequences while enabling efficient synthesis.

Method: The method uses a hierarchical approach: 1) Generate low-frame-rate clips as coarse blueprints, 2) Progressively increase frame rates to refine details and motion continuity, 3) Employ bidirectional attention within each frame-rate level for local coherence, and 4) Use autoregression across frame rates for global temporal consistency.

Result: Extensive experiments show TempoMaster establishes new state-of-the-art in long video generation, excelling in both visual quality and temporal coherence while enabling efficient parallel synthesis.

Conclusion: TempoMaster successfully addresses long video generation by formulating it as next-frame-rate prediction, achieving superior performance in visual and temporal quality through its hierarchical progressive refinement approach.

Abstract: We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.

[242] End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

Yonghui Yu, Jiahang Cai, Xun Wang, Wenwu Yang

Main category: cs.CV

TL;DR: PAVE-Net is the first fully end-to-end framework for multi-person 2D pose estimation in videos, eliminating heuristic operations like detection and NMS through a novel pose-aware attention mechanism for temporal association.

Details

Motivation: Existing multi-person video pose estimation methods rely on two-stage pipelines with heuristic operations (detection, RoI cropping, NMS) that limit both accuracy and efficiency. There's a need for an end-to-end approach that can handle complex temporal associations without these limitations.

Method: Proposes PAVE-Net (Pose-Aware Video transformEr Network) with: 1) Spatial encoder for intra-frame relations, 2) Spatiotemporal pose decoder for global dependencies across frames, 3) Pose-aware attention mechanism for selective feature aggregation across frames, 4) Explicit modeling of spatiotemporal dependencies among pose keypoints.

Result: Substantially outperforms prior image-based end-to-end methods (6.0 mAP improvement on PoseTrack2017), achieves accuracy competitive with state-of-the-art two-stage video approaches, while offering significant efficiency gains.

Conclusion: PAVE-Net successfully demonstrates the feasibility and advantages of end-to-end multi-person video pose estimation, eliminating heuristic operations while maintaining competitive accuracy and improving efficiency through effective temporal association modeling.

Abstract: Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet.

[243] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao

Main category: cs.CV

TL;DR: Proposes MoR (Mixture-of-Ranks) architecture for single-step Real-ISR using sparse MoE principles, with degradation-aware expert routing and load balancing.

Details

Motivation: Existing dense Real-ISR models using LoRA fine-tuning lack adaptive capability for heterogeneous real-world degradations and efficient knowledge sharing under computational constraints.

Method: 1) MoR architecture treats each LoRA rank as independent expert with fine-grained partitioning; 2) Degradation estimation module using CLIP embeddings and text pairs; 3) Zero-expert slots and degradation-aware load-balancing loss for dynamic expert activation.

Result: Comprehensive experiments validate framework effectiveness and state-of-the-art performance in real-world image super-resolution.

Conclusion: MoR architecture successfully adapts sparse MoE principles to Real-ISR, enabling adaptive degradation handling and efficient knowledge recombination while maintaining computational efficiency.

Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework’s effectiveness and state-of-the-art performance.

[244] Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

Kai Jiang, Siqi Huang, Xiangyu Chen, Jiawei Shao, Hongyuan Zhang, Xuelong Li

Main category: cs.CV

TL;DR: UNIFIER is a multimodal continual learning method for MLLMs that addresses catastrophic forgetting across different visual scenarios by decoupling scenario-specific visual information into separate branches with consistency constraints.

Details

Motivation: MLLMs deployed on devices need to continuously adapt to dynamic scenarios (background/perspective variations) in downstream tasks, but suffer from catastrophic forgetting when learning from real-world data streams with scenario shifts.

Method: UNIFIER decouples visual information from different scenarios into distinct branches within each vision block, projects them into the same feature space, and imposes consistency constraints to maintain stable visual representations across scenarios.

Result: Extensive experiments on the MSVQA dataset (covering high altitude, underwater, low altitude, and indoor scenarios) show UNIFIER effectively alleviates forgetting of cross-scenario tasks and achieves knowledge accumulation within the same scenario.

Conclusion: The proposed UNIFIER framework successfully addresses catastrophic forgetting in MLLMs for continual visual understanding across diverse real-world scenarios, enabling effective adaptation to dynamic visual environments.

Abstract: Continual learning in visual understanding aims to deal with catastrophic forgetting in Multimodal Large Language Models (MLLMs). MLLMs deployed on devices have to continuously adapt to dynamic scenarios in downstream tasks, such as variations in background and perspective, to effectively perform complex visual tasks. To this end, we construct a multimodal visual understanding dataset (MSVQA) encompassing four different scenarios and perspectives including high altitude, underwater, low altitude and indoor, to investigate the catastrophic forgetting in MLLMs under the dynamics of scenario shifts in real-world data streams. Furthermore, we propose mUltimodal coNtInual learning with MLLMs From multi-scenarIo pERspectives (UNIFIER) to address visual discrepancies while learning different scenarios. Specifically, it decouples the visual information from different scenarios into distinct branches within each vision block and projects them into the same feature space. A consistency constraint is imposed on the features of each branch to maintain the stability of visual representations across scenarios. Extensive experiments on the MSVQA dataset demonstrate that UNIFIER effectively alleviates forgetting of cross-scenario tasks and achieves knowledge accumulation within the same scenario.

[245] Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown

Main category: cs.CV

TL;DR: VLMs struggle with counting tasks due to inherent biases, especially with specific queries. Researchers created a synthetic benchmark to analyze how counting performance varies with image/prompt properties and tested attention interventions to improve focus.

Details

Motivation: VLMs often rely on training biases when answering visual property questions, particularly with specific queries requiring focused attention on image areas. This is especially problematic for counting tasks where performance degrades with complexity.

Method: Developed synthetic benchmark dataset and evaluation framework to systematically test counting performance variations. Analyzed open-source VLMs’ attention allocation across different input parameters (object count, colors, textures, prompt specificity). Implemented attention-based interventions to modulate visual token focus at different layers.

Result: VLM counting performance remains challenging, especially under high visual or linguistic complexity. However, certain attention interventions can lead to modest gains in counting performance across various visual conditions.

Conclusion: While VLMs struggle with counting tasks due to inherent biases and complexity, targeted attention interventions show promise for improving performance by better modulating focus on relevant visual information.

Abstract: Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g. number of objects in the image, objects color, background color, objects texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while VLM counting performance remains challenging, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.

[246] AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

Main category: cs.CV

TL;DR: AlignBench is a new benchmark for evaluating fine-grained image-text alignment using detailed image-caption pairs, revealing limitations in current CLIP-based models and detector biases.

Details

Motivation: Existing benchmarks for image-text alignment rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment between visual and linguistic representations.

Method: Introduced AlignBench benchmark that evaluates detailed image-caption pairs generated by diverse image-to-text and text-to-image models, with each sentence annotated for correctness to assess VLMs as alignment evaluators.

Result: Three key findings: (1) CLIP-based models remain nearly blind even when tailored for compositional reasoning, (2) detectors systematically over-score early sentences, and (3) they show strong self-preference that harms detection performance.

Conclusion: AlignBench provides a new indicator for assessing image-text alignment, revealing significant limitations in current VLMs and detector biases that need to be addressed for better visual-linguistic understanding.

Abstract: Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.

[247] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition

Hongda Liu, Yunfan Liu, Changlu Wang, Yunlong Wang, Zhenan Sun

Main category: cs.CV

TL;DR: SkeletonAgent is a novel framework that bridges action recognition models with LLMs through cooperative Questioner and Selector agents to extract targeted discriminative cues for distinguishing similar skeleton-based actions.

Details

Motivation: Current approaches use LLMs in isolation from recognition models, receiving no performance feedback, which leads to failure in delivering targeted discriminative cues needed to distinguish similar actions.

Method: Proposes SkeletonAgent with two cooperative agents: Questioner identifies confused classes and supplies them to LLM as context; Selector parses LLM responses to extract joint-level constraints and feeds them back to the recognizer for finer-grained cross-modal alignment.

Result: Comprehensive evaluations on five benchmarks (NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human) demonstrate consistent outperformance over state-of-the-art methods.

Conclusion: SkeletonAgent effectively bridges recognition models and LLMs through cooperative agents, enabling targeted extraction of discriminative cues for improved skeleton-based action recognition across multiple benchmarks.

Abstract: Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM’s response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at https://github.com/firework8/SkeletonAgent.

[248] Emergent Extreme-View Geometry in 3D Foundation Models

Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor

Main category: cs.CV

TL;DR: 3D foundation models show emergent understanding of extreme-view geometry, and a lightweight alignment scheme improves their relative pose estimation without degrading other capabilities.

Details

Motivation: While 3D foundation models have advanced 3D vision, their ability to reason under extreme, non-overlapping views remains unexplored. The paper investigates whether these models can handle such challenging conditions despite never being trained for them.

Method: The authors study 3DFMs’ internal representations and introduce a lightweight alignment scheme that refines internal 3D representations by tuning only a small subset of backbone bias terms while keeping all decoder heads frozen. They also create MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs.

Result: 3DFMs exhibit emergent understanding of extreme-view geometry. The lightweight alignment scheme substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality.

Conclusion: 3D foundation models possess surprising geometric reasoning capabilities for extreme views, and targeted parameter tuning can enhance these abilities while maintaining other performance aspects. The new MegaUnScene benchmark will facilitate further research in this area.

Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.

[249] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou

Main category: cs.CV

TL;DR: Z-Image is a 6B-parameter efficient generative model that challenges the “scale-at-all-costs” paradigm, achieving SOTA performance with significantly reduced computational overhead compared to massive 20B-80B parameter models.

Details

Motivation: Current high-performance image generation is dominated by proprietary systems, while leading open-source alternatives have massive parameter counts (20B-80B) that make them impractical for inference and fine-tuning on consumer hardware.

Method: Proposes Z-Image built on Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture with systematic optimization of the entire model lifecycle, including curated data infrastructure, streamlined training curriculum, few-step distillation with reward post-training (for Z-Image-Turbo), and omni-pre-training paradigm for editing models.

Result: Achieves performance comparable to or surpassing leading competitors across various dimensions, with exceptional capabilities in photorealistic image generation and bilingual text rendering. Full training completed in just 314K H800 GPU hours (~$630K), with sub-second inference latency on H800 and compatibility with consumer hardware (<16GB VRAM).

Conclusion: Demonstrates that state-of-the-art results are achievable with significantly reduced computational overhead, challenging the “scale-at-all-costs” paradigm. The authors publicly release code, weights, and demo to foster development of accessible, budget-friendly SOTA generative models.

Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the “scale-at-all-costs” paradigm. By systematically optimizing the entire model lifecycle – from a curated data infrastructure to a streamlined training curriculum – we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

[250] Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints

Xisheng Feng

Main category: cs.CV

TL;DR: A framework called “Look, Recite, Then Answer” improves Vision-Language Models for specialized domains by addressing reasoning-driven hallucinations through self-generated knowledge hints and evidence alignment.

Details

Motivation: VLMs struggle in specialized domains like precision agriculture due to "Reasoning-Driven Hallucination" where linguistic priors override visual perception, and the "Modality Gap" where visual embeddings fail to activate fine-grained expert knowledge already encoded in model parameters.

Method: A parameter-efficient framework with three decoupled stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite uses a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label.

Result: Achieves state-of-the-art results on AgroBench, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead.

Conclusion: The modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval, offering a parameter-efficient solution for specialized domain applications.

Abstract: Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to “Reasoning-Driven Hallucination” where linguistic priors override visual perception. A key bottleneck is the “Modality Gap”: visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose “Look, Recite, Then Answer,” a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval

[251] PointCNN++: Performant Convolution on Native Points

Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li

Main category: cs.CV

TL;DR: PointCNN++ is a novel 3D point cloud convolution method that generalizes sparse convolution from voxels to points, achieving both high geometric precision and computational efficiency through native point-based operations with optimized GPU kernels.

Details

Motivation: Existing 3D point cloud methods face a trade-off: point-based methods preserve geometric precision but have performance issues, while voxel-based methods are efficient but lose geometric fidelity. This precision loss is critical for tasks like point cloud registration.

Method: 1) Introduces point-centric convolution with receptive fields centered on original high-precision point coordinates. 2) Formulates convolution as Matrix-Vector Multiplication and Reduction (MVMR) problem. 3) Develops dedicated, highly-optimized GPU kernel for native point operations.

Result: PointCNN++ uses an order of magnitude less memory and is several times faster than point-based methods. When replacing voxel-based backbones, it significantly improves point cloud registration accuracy while being more memory-efficient and faster.

Conclusion: PointCNN++ demonstrates that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with both high fidelity and efficiency.

Abstract: Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It $\textbf{generalizes sparse convolution from voxels to points}$, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates $\textbf{natively}$ on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ $\textbf{uses an order of magnitude less memory and is several times faster}$ than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it $\textbf{significantly improves point cloud registration accuracies while proving both more memory-efficient and faster}$. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency. Our code will be open sourced.

[252] ViscNet: Vision-Based In-line Viscometry for Fluid Mixing Process

Jongwon Sohn, Juhyeon Moon, Hyunjoon Jung, Jaewook Nam

Main category: cs.CV

TL;DR: Computer vision viscometer uses optical distortion of background patterns through mixing fluid surfaces to measure viscosity non-invasively with 0.113 MAE in log units and up to 81% classification accuracy.

Details

Motivation: Conventional viscometers are invasive and require controlled lab environments that don't match real process conditions, creating a need for non-invasive, automation-ready viscosity measurement for process monitoring.

Method: Computer vision system that infers viscosity by analyzing how fixed background patterns become optically distorted as light refracts through mixing-driven, continuously deforming free surfaces of fluids. Uses multi-pattern strategy for robustness and incorporates uncertainty quantification.

Result: Achieves mean absolute error of 0.113 in log m² s⁻¹ units for regression and up to 81% accuracy in viscosity-class prediction. Performance declines for closely clustered viscosity values, but multi-pattern strategy improves robustness.

Conclusion: The stand-off viscometer offers a practical, automation-ready alternative to existing viscometry methods by enabling non-invasive viscosity measurement with confidence estimates through computer vision and optical distortion analysis.

Abstract: Viscosity measurement is essential for process monitoring and autonomous laboratory operation, yet conventional viscometers remain invasive and require controlled laboratory environments that differ substantially from real process conditions. We present a computer-vision-based viscometer that infers viscosity by exploiting how a fixed background pattern becomes optically distorted as light refracts through the mixing-driven, continuously deforming free surface. Under diverse lighting conditions, the system achieves a mean absolute error of 0.113 in log m2 s^-1 units for regression and reaches up to 81% accuracy in viscosity-class prediction. Although performance declines for classes with closely clustered viscosity values, a multi-pattern strategy improves robustness by providing enriched visual cues. To ensure sensor reliability, we incorporate uncertainty quantification, enabling viscosity predictions with confidence estimates. This stand-off viscometer offers a practical, automation-ready alternative to existing viscometry methods.

[253] Diffusion Model in Latent Space for Medical Image Segmentation Task

Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc

Main category: cs.CV

TL;DR: MedSegLatDiff: A diffusion-based framework combining VAE with latent diffusion for efficient medical image segmentation that generates multiple plausible masks and confidence maps while achieving state-of-the-art performance.

Details

Motivation: Traditional medical image segmentation produces single masks lacking uncertainty capture, while recent generative models are computationally heavy. Need for efficient methods that generate diverse segmentation hypotheses for clinical reliability.

Method: Combines VAE with latent diffusion: VAE compresses input to low-dimensional latent space (reducing noise, accelerating training), diffusion operates in this compact representation. Uses weighted cross-entropy loss in VAE mask reconstruction to preserve tiny structures.

Result: Achieves state-of-the-art or highly competitive Dice and IoU scores on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). Generates diverse segmentation hypotheses and confidence maps.

Conclusion: Provides enhanced interpretability and reliability compared to deterministic baselines, making it suitable for clinical deployment by capturing uncertainty and generating multiple plausible segmentations.

Abstract: Medical image segmentation is crucial for clinical diagnosis and treatment planning. Traditional methods typically produce a single segmentation mask, failing to capture inherent uncertainty. Recent generative models enable the creation of multiple plausible masks per image, mimicking the collaborative interpretation of several clinicians. However, these approaches remain computationally heavy. We propose MedSegLatDiff, a diffusion based framework that combines a variational autoencoder (VAE) with a latent diffusion model for efficient medical image segmentation. The VAE compresses the input into a low dimensional latent space, reducing noise and accelerating training, while the diffusion process operates directly in this compact representation. We further replace the conventional MSE loss with weighted cross entropy in the VAE mask reconstruction path to better preserve tiny structures such as small nodules. MedSegLatDiff is evaluated on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). It achieves state of the art or highly competitive Dice and IoU scores while simultaneously generating diverse segmentation hypotheses and confidence maps. This provides enhanced interpretability and reliability compared to deterministic baselines, making the model particularly suitable for clinical deployment.

[254] MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration

Thao Thi Phuong Dao, Tan-Cong Nguyen, Nguyen Chi Thanh, Truong Hoang Viet, Trong-Le Do, Mai-Khiem Tran, Minh-Khoi Pham, Trung-Nghia Le, Minh-Triet Tran, Thanh Dinh Le

Main category: cs.CV

TL;DR: Introduces MasHeNe dataset for head-and-neck mass segmentation (including tumors and cysts) and proposes WEMF model with tri-window enhancement and multi-frequency attention in Mamba backbone.

Details

Motivation: Public datasets for head-and-neck masses focus primarily on malignant lesions, overlooking other space-occupying conditions. Need for comprehensive dataset covering both tumors and cysts with pixel-level annotations.

Method: 1) Created MasHeNe dataset with 3,779 contrast-enhanced CT slices of head-and-neck masses (tumors and cysts) with pixel-level annotations. 2) Proposed WEMF model: applies tri-window enhancement for input enrichment, uses multi-frequency attention to fuse information across skip connections within U-shaped Mamba backbone.

Result: WEMF achieves best performance on MasHeNe: Dice 70.45%, IoU 66.89%, NSD 72.33%, HD95 5.12 mm. Dataset provides benchmark for head-and-neck mass segmentation beyond malignancy-only datasets.

Conclusion: MasHeNe addresses gap in head-and-neck mass segmentation datasets. WEMF shows strong performance but task remains challenging, requiring further research. Dataset and code publicly available.

Abstract: Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. This model indicates stable and strong results on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at https://github.com/drthaodao3101/MasHeNe.git.

[255] Toward Content-based Indexing and Retrieval of Head and Neck CT with Abscess Segmentation

Thao Thi Phuong Dao, Tan-Cong Nguyen, Trong-Le Do, Truong Hoang Viet, Nguyen Chi Thanh, Huynh Nguyen Thuan, Do Vo Cong Nguyen, Minh-Khoi Pham, Mai-Khiem Tran, Viet-Tham Huynh, Trong-Thuan Nguyen, Trung-Nghia Le, Vo Thanh Toan, Tam V. Nguyen, Minh-Triet Tran, Thanh Dinh Le

Main category: cs.CV

TL;DR: A new dataset called AbscessHeNe containing 4,926 annotated CT slices of head and neck abscesses is introduced to support semantic segmentation model development, with baseline models achieving modest performance metrics (Dice 0.39, IoU 0.27).

Details

Motivation: Head and neck abscesses are serious infections requiring prompt diagnosis and treatment. Accurate imaging detection and delineation are crucial for clinical decision-making, but there's a lack of comprehensive annotated datasets for developing AI segmentation models.

Method: Created a curated dataset of 4,926 contrast-enhanced CT slices with pixel-level annotations of clinically confirmed head and neck abscesses. Evaluated multiple state-of-the-art segmentation architectures including CNN, Transformer, and Mamba-based models to establish performance baselines.

Result: The best-performing model achieved Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, highlighting the difficulty of the task. The dataset is structured for future applications in content-based indexing and case retrieval.

Conclusion: AbscessHeNe provides a valuable resource for developing robust segmentation models for head and neck abscesses, though current performance indicates significant challenges requiring further research. The publicly available dataset supports both segmentation model development and future clinical workflow applications.

Abstract: Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast-enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision-making. To establish performance baselines, we evaluate several state-of-the-art segmentation architectures, including CNN, Transformer, and Mamba-based models. The highest-performing model achieved a Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content-based multimedia indexing and case-based retrieval. Each CT scan is linked with pixel-level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge-driven clinical workflows. The dataset will be made publicly available at https://github.com/drthaodao3101/AbscessHeNe.git.

[256] SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang

Main category: cs.CV

TL;DR: SPARK reconstructs physically consistent, kinematic articulated 3D objects from single RGB images using vision-language models and diffusion transformers with differentiable optimization.

Details

Motivation: Creating simulation-ready articulated 3D assets is labor-intensive and requires expert modeling of part hierarchies and motion structures, limiting applications in embodied AI, robotics, and interactive scene understanding.

Method: 1) Use VLMs to extract coarse URDF parameters and generate part-level reference images; 2) Integrate part-image guidance and structure graph into generative diffusion transformer to synthesize consistent part and complete shapes; 3) Refine URDF parameters using differentiable forward kinematics and rendering optimization under VLM-generated open-state supervision.

Result: SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications like robotic manipulation and interaction modeling.

Conclusion: SPARK provides an effective framework for reconstructing physically consistent kinematic articulated objects from single images, addressing the labor-intensive nature of manual asset creation and enabling broader applications in embodied AI and robotics.

Abstract: Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling. Project page: https://heyumeng.com/SPARK/index.html.

[257] OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic

Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang, Chen Lv

Main category: cs.CV

TL;DR: OpenREAD is an end-to-end reinforcement fine-tuning framework for autonomous driving that uses Chain-of-Thought annotations and LLM-based reward modeling to improve both reasoning and planning capabilities.

Details

Motivation: Current two-stage fine-tuning approaches have limitations: SFT struggles with reasoning generalization, and RFT is hard to apply to open-ended scene understanding tasks where rewards are difficult to quantify. There's a need for end-to-end reinforcement fine-tuning that spans from high-level reasoning to low-level planning.

Method: 1) Construct large-scale Chain-of-Thought annotations on driving knowledge datasets; 2) Use Qwen3 LLM as a critic in reinforcement fine-tuning to quantify reasoning quality for open-ended questions during reward modeling; 3) Implement joint end-to-end reinforcement fine-tuning across the full spectrum from reasoning to trajectory planning.

Result: Extensive experiments show that joint end-to-end reinforcement fine-tuning yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.

Conclusion: OpenREAD successfully addresses the limitations of current approaches by enabling end-to-end reinforcement fine-tuning across reasoning and planning, demonstrating that LLM-based reward modeling can effectively quantify reasoning quality for open-ended autonomous driving tasks.

Abstract: Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.

[258] COACH: Collaborative Agents for Contextual Highlighting – A Multi-Agent Framework for Sports Video Analysis

Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai

Main category: cs.CV

TL;DR: A reconfigurable Multi-Agent System framework for sports video analysis that uses specialized agents as cognitive tools to handle temporal hierarchies from micro-actions to macro-strategies, enabling adaptive pipelines for both short-term reasoning and long-term summarization.

Details

Motivation: Existing end-to-end models for sports video analysis struggle with temporal hierarchy, lack generalization, have high development costs for new tasks, and suffer from poor interpretability. There's a need for a more flexible, scalable, and interpretable approach.

Method: Proposes a reconfigurable Multi-Agent System where each agent functions as a specialized “cognitive tool” for specific analysis aspects. The framework uses iterative invocation and flexible composition of agents to construct adaptive pipelines for different temporal scales and tasks.

Result: Demonstrated the framework’s adaptability using two representative badminton analysis tasks, showing ability to bridge fine-grained event detection and global semantic organization. The system enables both short-term analytic reasoning (Rally QA) and long-term generative summarization (match summaries).

Conclusion: Presents a paradigm shift toward flexible, scalable, and interpretable systems for robust, cross-task sports video intelligence, offering a foundational framework that overcomes limitations of existing end-to-end models.

Abstract: Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct “cognitive tool” specializing in a specific aspect of analysis. The system’s architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH-project-page

cs.AI

[259] The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee

PIerre Dantas, Lucas Cordeiro, Youcheng Sun, Waldir Junior

Main category: cs.AI

TL;DR: This paper develops a formal convergence theorem for LLM-verifier systems, proving termination guarantees and providing empirical validation through 90,000+ trials.

Details

Motivation: Current LLM-based formal verification methods lack theoretical foundations and reliability, with refinement processes that can wander, loop, or break from stable trajectories without guarantees.

Method: Models LLM-verifier interaction as a discrete-time Markov Chain with error-reduction probability (δ) parameter. Develops LLM-Verifier Convergence Theorem proving termination for any δ>0 with bounded expected iterations. Validates theory through extensive empirical testing (90,000+ trials).

Result: Proves termination almost surely for any δ>0 with 𝔼[n] ≤ 4/δ. Empirical results show 100% verification success across all trials with convergence factor Cf≈1.0. Establishes three operating zones (marginal, practical, high-performance) with design thresholds.

Conclusion: Provides first formal framework with provable guarantees for LLM-assisted verification, enabling predictable resource planning and performance budgeting for safety-critical software deployment.

Abstract: The idea of using Formal Verification tools with large language models (LLMs) has enabled scaling software verification beyond manual workflows. However, current methods remain unreliable. Without a solid theoretical footing, the refinement process can wander; sometimes it settles, sometimes it loops back, and sometimes it breaks away from any stable trajectory. This work bridges this critical gap by developing an LLM-Verifier Convergence Theorem, providing the first formal framework with provable guarantees for termination and convergence. We model the interaction between the LLM and the verifier as a discrete-time Markov Chain, with state transitions determined by a key parameter: the error-reduction probability ($δ$). The procedure reaching the Verified state almost surely demonstrates that the program terminates for any $δ> 0$, with an expected iteration count bounded by $\mathbb{E}[n] \leq 4/δ$. We then stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials. The empirical results match the theory with striking consistency. Every single run reached verification, and the convergence factor clustered tightly around $C_f\approx$ 1.0. Consequently, the bound mirrors the system’s actual behavior. The evidence is sufficiently robust to support dividing the workflow into three distinct operating zones: marginal, practical, and high-performance. Consequently, we establish the design thresholds with absolute confidence. Together, the theoretical guarantee and the experimental evidence provide a clearer architectural foundation for LLM-assisted verification. Heuristic tuning no longer has to be carried out by the system. Engineers gain a framework that supports predictable resource planning and performance budgeting, precisely what is needed before deploying these pipelines into safety-critical software environments.

[260] Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code

Pritam Deka, Barry Devereux

Main category: cs.AI

TL;DR: Flowchart2Mermaid converts static flowchart images into editable Mermaid.js code using vision-language models, enabling interactive editing and version control.

Details

Motivation: Static flowchart images are difficult to edit, reuse, or version control. There's a need for tools that can convert these static visualizations into structured, editable formats that support collaboration and iteration.

Method: A lightweight web system using vision-language models with detailed system prompts to analyze flowchart images and generate Mermaid.js code. Features mixed-initiative refinement through text editing, drag-and-drop node insertion, and natural-language commands via an integrated AI assistant.

Result: The system produces structured, version-controllable textual representations that remain synchronized with rendered diagrams. The paper introduces evaluation metrics for assessing structural accuracy, flow correctness, syntax validity, and completeness across multiple models.

Conclusion: Flowchart2Mermaid provides a practical solution for converting static flowcharts into editable, version-controlled formats, addressing limitations of existing image-to-diagram tools while supporting interactive refinement through multiple modalities.

Abstract: Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present \textsc{Flowchart2Mermaid}, a lightweight web system that converts flowchart images into editable Mermaid.js code which is a markup language for visual workflows, using a detailed system prompt and vision-language models. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant. Unlike prior image-to-diagram tools, our approach produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. We further introduce evaluation metrics to assess structural accuracy, flow correctness, syntax validity, and completeness across multiple models.

[261] From monoliths to modules: Decomposing transducers for efficient world modelling

Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas

Main category: cs.AI

TL;DR: The paper proposes a framework for decomposing complex world models (represented as transducers) into modular subcomponents to enable parallelizable and interpretable alternatives to monolithic world modeling.

Details

Motivation: Realistic world models for AI training often have high computational demands, but real-world scenarios typically involve modular subcomponents that interact. There's a need to bridge structural transparency for AI safety with computational efficiency for real-world inference.

Method: Develops a framework for decomposing complex world models represented by transducers (which generalize POMDPs). The approach inverts the composition process to derive sub-transducers operating on distinct input-output subspaces.

Result: The framework enables parallelizable and interpretable alternatives to monolithic world modeling that can support distributed inference. Results clarify how to decompose transducers into modular components.

Conclusion: The work lays groundwork for bridging the structural transparency demanded by AI safety with the computational efficiency required for real-world inference through modular decomposition of world models.

Abstract: World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. Although realistic world models often have high computational demands, efficient modelling is usually possible by exploiting the fact that real-world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process, deriving sub-transducers operating on distinct input-output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay a groundwork for bridging the structural transparency demanded by AI safety and the computational efficiency required for real-world inference.

[262] STRIDE: A Systematic Framework for Selecting AI Modalities – Agentic AI, AI Assistants, or LLM Calls

Shubhi Asthana, Bing Zhang, Chad DeLuca, Ruchi Mahindru, Hima Patel

Main category: cs.AI

TL;DR: STRIDE framework helps decide when to use autonomous AI agents vs simpler LLM approaches, reducing unnecessary agent deployments by 45% and cutting costs by 37%.

Details

Motivation: The rapid shift from stateless LLMs to autonomous agents raises questions about when agentic AI is truly necessary, as indiscriminate deployment leads to higher cost, complexity, and risk.

Method: STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator) framework integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score for selecting between three modalities: direct LLM calls, guided AI assistants, or fully autonomous agentic AI.

Result: Evaluated across 30 real-world tasks, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months confirmed practical utility.

Conclusion: This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs, distinguishing between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy.

Abstract: The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.

[263] Benchmarking LLM Agents for Wealth-Management Workflows

Rory Milsom

Main category: cs.AI

TL;DR: This dissertation extends TheAgentCompany with a finance environment to test if general LLM agents can accurately and economically complete wealth-management tasks, creating a benchmark with 12 task-pairs and finding workflow reliability and autonomy level are key limitations.

Details

Motivation: Despite numerous digital collaboration tools, routine processes still suffer from human error and delay. The research aims to address this gap by investigating whether general purpose LLM agents can effectively handle wealth-management tasks that currently rely on human assistants.

Method: Extended TheAgentCompany with finance-focused environment, introduced synthetic domain data, enriched colleague simulations, and prototyped automatic task-generation pipeline. Created benchmark of 12 task-pairs for wealth management assistants spanning retrieval, analysis, and synthesis/communication tasks, with explicit acceptance criteria and deterministic graders. Included finance-specific data and high vs. low-autonomy variants for every task.

Result: Agents are limited less by mathematical reasoning and more by end-to-end workflow reliability. Performance is meaningfully affected by autonomy level. The study also found that incorrect evaluation of models has hindered benchmarking efforts.

Conclusion: The research successfully created an evaluation framework for measuring agent fitness for assistant-level wealth management work, revealing that workflow reliability and autonomy design are critical factors for practical deployment, while also highlighting issues with current benchmarking approaches.

Abstract: Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general purpose LLM agent can complete representative wealth-management tasks both accurately and economically. This study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. The study aims to create and assess an evaluation set that can meaningfully measure an agent’s fitness for assistant-level wealth management work. We construct a benchmark of 12 task-pairs for wealth management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seeded a set of new finance-specific data and introduced a high vs. low-autonomy variant of every task. The paper concluded that agents are limited less by mathematical reasoning and more so by end-to-end workflow reliability, and meaningfully affected by autonomy level, and that incorrect evaluation of models have hindered benchmarking.

[264] TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?

Lewen Yan, Jilin Mei, Tianyi Zhou, Lige Huang, Jie Zhang, Dongrui Liu, Jing Shao

Main category: cs.AI

TL;DR: TradeTrap is a framework for stress-testing LLM-based trading agents against system-level perturbations, revealing vulnerabilities that can cause extreme portfolio losses.

Details

Motivation: LLM-based trading agents are increasingly used in real financial markets, but their reliability under adversarial or faulty conditions remains largely unexamined despite operating in high-risk, irreversible environments.

Method: TradeTrap systematically stress-tests both adaptive and procedural autonomous trading agents by targeting four core components (market intelligence, strategy formulation, portfolio/ledger handling, trade execution) under controlled system-level perturbations in closed-loop historical backtesting on real US equity market data.

Result: Small perturbations at a single component can propagate through the agent decision loop and induce extreme concentration, runaway exposure, and large portfolio drawdowns across both agent types, demonstrating that current autonomous trading agents can be systematically misled at the system level.

Conclusion: The paper reveals critical vulnerabilities in current autonomous trading agents and provides a unified evaluation framework (TradeTrap) for systematically assessing their robustness, highlighting the need for more resilient trading systems.

Abstract: LLM-based trading agents are increasingly deployed in real-world financial markets to perform autonomous analysis and execution. However, their reliability and robustness under adversarial or faulty conditions remain largely unexamined, despite operating in high-risk, irreversible financial environments. We propose TradeTrap, a unified evaluation framework for systematically stress-testing both adaptive and procedural autonomous trading agents. TradeTrap targets four core components of autonomous trading agents: market intelligence, strategy formulation, portfolio and ledger handling, and trade execution, and evaluates their robustness under controlled system-level perturbations. All evaluations are conducted in a closed-loop historical backtesting setting on real US equity market data with identical initial conditions, enabling fair and reproducible comparisons across agents and attacks. Extensive experiments show that small perturbations at a single component can propagate through the agent decision loop and induce extreme concentration, runaway exposure, and large portfolio drawdowns across both agent types, demonstrating that current autonomous trading agents can be systematically misled at the system level. Our code is available at https://github.com/Yanlewen/TradeTrap.

[265] DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses

Han Luo, Guy Laban

Main category: cs.AI

TL;DR: DialogGuard is a multi-agent framework for assessing psychosocial risks in LLM responses across five dimensions, using LLM-as-a-judge pipelines to improve safety evaluation for emotionally sensitive applications.

Details

Motivation: LLMs are increasingly used in mental-health, crisis, and emotionally sensitive services, but their psychosocial safety remains poorly understood and weakly evaluated, creating risks for vulnerable users.

Method: Multi-agent framework with four LLM-as-a-judge pipelines (single-agent scoring, dual-agent correction, multi-agent debate, stochastic majority voting) assessing five psychosocial risk dimensions using a shared three-level rubric.

Result: Multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide best trade-off between accuracy, alignment with human ratings, and robustness.

Conclusion: DialogGuard provides an effective open-source framework for assessing psychosocial safety in LLM applications, supporting prompt design, auditing, and supervision for web-facing applications serving vulnerable users.

Abstract: Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines, including single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release Dialog-Guard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.

[266] Bridging the Gap: Toward Cognitive Autonomy in Artificial Intelligence

Noorbakhsh Amiri Golilarz, Sindhuja Penchala, Shahram Rahimi

Main category: cs.AI

TL;DR: Current AI systems lack self-monitoring, self-correction, and autonomous behavior regulation despite advances in perception, language, and reasoning. The paper identifies seven core deficiencies and argues for cognitively grounded AI architectures.

Details

Motivation: Despite rapid AI progress, modern systems remain fundamentally limited in self-monitoring, self-correction, and autonomous behavior regulation in dynamic contexts. There's a need to move beyond current architectural limitations that prevent robust generalization, lifelong adaptability, and real-world autonomy.

Method: Comparative analysis of artificial systems and biological cognition, integrating insights from AI research, cognitive science, and neuroscience. Identifies seven core deficiencies in contemporary AI models through systematic analysis.

Result: Identifies seven key limitations: absence of intrinsic self-monitoring, lack of meta-cognitive awareness, fixed learning mechanisms, inability to restructure goals, lack of representational maintenance, insufficient embodied feedback, and absence of intrinsic agency. Shows these deficiencies prevent robust generalization and real-world autonomy.

Conclusion: Advocates for a paradigmatic shift toward cognitively grounded AI (cognitive autonomy) capable of self-directed adaptation, dynamic representation management, intentional behavior, paired with reformative oversight mechanisms to ensure interpretability, governability, and alignment with human values.

Abstract: Artificial intelligence has advanced rapidly across perception, language, reasoning, and multimodal domains. Yet despite these achievements, modern AI systems remain fundamentally limited in their ability to self-monitor, self-correct, and regulate their behavior autonomously in dynamic contexts. This paper identifies and analyzes seven core deficiencies that constrain contemporary AI models: the absence of intrinsic self-monitoring, lack of meta-cognitive awareness, fixed and non-adaptive learning mechanisms, inability to restructure goals, lack of representational maintenance, insufficient embodied feedback, and the absence of intrinsic agency. Alongside identifying these limitations, we also outline a forward-looking perspective on how AI may evolve beyond them through architectures that mirror neurocognitive principles. We argue that these structural limitations prevent current architectures, including deep learning and transformer-based systems, from achieving robust generalization, lifelong adaptability, and real-world autonomy. Drawing on a comparative analysis of artificial systems and biological cognition [7], and integrating insights from AI research, cognitive science, and neuroscience, we outline how these capabilities are absent in current models and why scaling alone cannot resolve them. We conclude by advocating for a paradigmatic shift toward cognitively grounded AI (cognitive autonomy) capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior, paired with reformative oversight mechanisms [8] that ensure autonomous systems remain interpretable, governable, and aligned with human values.

[267] GRAFT: GRaPH and Table Reasoning for Textual Alignment – A Benchmark for Structured Instruction Following and Visual Reasoning

Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran

Main category: cs.AI

TL;DR: GRAFT is a multimodal benchmark using programmatically generated charts and tables to evaluate LLMs on visual reasoning, instruction following, and structured output generation.

Details

Motivation: There's a need for more rigorous evaluation of multimodal LLMs on visually grounded, structured reasoning tasks that require tight visual-textual alignment and adherence to output specifications.

Method: Built around programmatically generated charts and synthetically rendered tables paired with multi-step analytical questions. Responses are formatted in structured outputs (JSON/YAML) for consistent evaluation. Introduces taxonomy of reasoning operations like comparison, trend identification, ranking, aggregation, etc.

Result: GRAFT provides a unified and scalable framework for evaluating multimodal LLMs, enabling fine-grained assessment of both reasoning processes and output specification adherence.

Conclusion: GRAFT offers a more rigorous standard for future benchmarking efforts in multimodal AI by focusing on structured reasoning tasks with precise evaluation metrics.

Abstract: GRAFT is a structured multimodal benchmark designed to probe how well LLMs handle instruction following, visual reasoning, and tasks requiring tight visual textual alignment. The dataset is built around programmatically generated charts and synthetically rendered tables, each paired with a carefully constructed, multi step analytical question that depends solely on what can be inferred from the image itself. Responses are formatted in structured outputs such as JSON or YAML, enabling consistent and fine grained evaluation of both reasoning processes and adherence to output specifications. The benchmark further introduces a taxonomy of reasoning operations ranging from comparison and trend identification to ranking, aggregation, proportional estimation, and anomaly detection to support a comprehensive assessment of model capabilities. Taken together, GRAFT provides a unified and scalable framework for evaluating multimodal LLMs on visually grounded, structured reasoning tasks, offering a more rigorous standard for future benchmarking efforts.

[268] IACT: A Self-Organizing Recursive Model for General AI Agents: A Technical White Paper on the Architecture Behind kragent.ai

Pengju Lu

Main category: cs.AI

TL;DR: IACT is a dynamic agent workflow model that grows recursive agent topologies autonomously from user dialogue, replacing static workflows with bidirectional stateful dialogues for runtime error correction.

Details

Motivation: To overcome limitations of static, hard-coded agent workflows that require pre-defined graphs or specialized programming, and to address error propagation in unidirectional function calls.

Method: IACT operates as a general-purpose autonomous system driven by user dialogue, autonomously growing dynamic recursive agent topologies incrementally tailored to problem structure. It replaces rigid invocations with bidirectional, stateful dialogues to enable runtime error correction and ambiguity resolution.

Result: The model has been deployed in production in the kragent.ai system, with qualitative evidence from real-world workflows demonstrating its practical effectiveness.

Conclusion: IACT provides a scalable, dynamic approach to agent workflows that can adapt organizational complexity to open-ended tasks through autonomous topology growth and interactional redundancy mechanisms.

Abstract: This technical white paper introduces the Interactive Agents Call Tree (IACT), a computational model designed to address the limitations of static, hard-coded agent workflows. Unlike traditional systems that require pre-defined graphs or specialized programming, IACT operates as a general-purpose autonomous system driven purely by user dialogue. Given a high-level objective, the system autonomously grows a dynamic, recursive agent topology incrementally tailored to the problem’s structure. This allows it to scale its organizational complexity to match open-ended tasks. To mitigate the error propagation inherent in unidirectional function calls, IACT introduces interactional redundancy by replacing rigid invocations with bidirectional, stateful dialogues. This mechanism enables runtime error correction and ambiguity resolution. We describe the architecture, design principles, and practical lessons behind the production deployment of this model in the kragent.ai system, presenting qualitative evidence from real-world workflows rather than exhaustive benchmark results.

[269] Model Recovery at the Edge under Resource Constraints for Physical AI

Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta

Main category: cs.AI

TL;DR: MERINDA is an FPGA-accelerated Model Recovery framework that replaces iterative NODE solvers with parallelizable neural architecture, achieving 11x lower DRAM usage and 2.2x faster runtime than mobile GPUs for edge deployment.

Details

Motivation: Model Recovery enables safe, explainable decision making in mission-critical autonomous systems by learning governing dynamical equations, but deployment on edge devices is hindered by inefficient iterative NODE solvers on FPGAs, with memory and energy consumption being major concerns for real-time operation.

Method: MERINDA replaces iterative neural ordinary differential equation (NODE) solvers with a parallelizable neural architecture that is equivalent to NODEs but optimized for FPGA acceleration, addressing the inefficiency of iterative approaches on FPGAs.

Result: MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy consumption at fixed accuracy.

Conclusion: MERINDA demonstrates suitability for resource-constrained, real-time mission-critical autonomous systems by providing efficient FPGA acceleration with significantly reduced memory usage and faster runtime compared to GPU alternatives.

Abstract: Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when applying MR on edge devices for real-time operation. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA’s suitability for resource-constrained, real-time MCAS.

[270] Breast Cell Segmentation Under Extreme Data Constraints: Quantum Enhancement Meets Adaptive Loss Stabilization

Varun Kumar Dasoju, Qingsu Cheng, Zeyun Yu

Main category: cs.AI

TL;DR: Achieves 95.5% Dice score for breast cell segmentation using only 599 training images with quantum-inspired edge enhancement and adaptive loss functions to handle severe class imbalance.

Details

Motivation: Medical image annotation requires significant time and expertise from pathologists, creating a bottleneck for clinical AI development. The paper addresses the challenge of limited annotated data for breast cell segmentation.

Method: Uses quantum-inspired edge enhancement via multi-scale Gabor filters (4th input channel), stabilized multi-component loss with adaptive Dice loss and boundary-aware terms, complexity-based weighted sampling, EfficientNet-B7/UNet++ architecture with 4-to-3 channel projection, and robust validation with exponential moving averaging.

Result: Achieves 95.5% Dice score +/- 0.3% and 91.2% IoU +/- 0.4% with only 599 training images. Quantum enhancement improves boundary accuracy by 2.1%, weighted sampling increases small lesion detection by 3.8%.

Conclusion: The framework significantly reduces medical expert time for dataset creation by achieving high performance with limited annotations, addressing a fundamental bottleneck in clinical perception AI development.

Abstract: Annotating medical images demands significant time and expertise, often requiring pathologists to invest hundreds of hours in labeling mammary epithelial nuclei datasets. We address this critical challenge by achieving 95.5% Dice score using just 599 training images for breast cell segmentation, where just 4% of pixels represent breast tissue and 60% of images contain no breast regions. Our framework uses quantum-inspired edge enhancement via multi-scale Gabor filters creating a fourth input channel, enhancing boundary detection where inter-annotator variations reach +/- 3 pixels. We present a stabilized multi-component loss function that integrates adaptive Dice loss with boundary-aware terms and automatic positive weighting to effectively address severe class imbalance, where mammary epithelial cell regions comprise only 0.1%-20% of the total image area. Additionally, a complexity-based weighted sampling strategy is introduced to prioritize the challenging mammary epithelial cell regions. The model employs an EfficientNet-B7/UNet++ architecture with a 4-to-3 channel projection, enabling the use of pretrained weights despite limited medical imaging data. Finally, robust validation is achieved through exponential moving averaging and statistical outlier detection, ensuring reliable performance estimates on a small validation set (129 images). Our framework achieves a Dice score of 95.5% +/- 0.3% and an IoU of 91.2% +/- 0.4%. Notably, quantum-based enhancement contributes to a 2.1% improvement in boundary accuracy, while weighted sampling increases small lesion detection by 3.8%. By achieving groundbreaking performance with limited annotations, our approach significantly reduces the medical expert time required for dataset creation, addressing a fundamental bottleneck in clinical perception AI development.

Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, Muhao Chen

Main category: cs.AI

TL;DR: OmniGuard is a family of omni-modal guardrails that provides safety and value alignment across text, images, videos, and audio for Large Language Models, using a large curated dataset and achieving strong performance across 15 benchmarks.

Details

Motivation: Current safety guardrails are limited to unimodal settings and use binary classification, which lacks robustness for diverse modalities and tasks in omni-modal LLMs. There's a need for comprehensive safeguarding across all modalities with reasoning capabilities.

Method: Proposes OmniGuard, the first omni-modal guardrail family with deliberate reasoning ability. Created a large dataset of 210K+ diverse omni-modal samples with structured safety labels and expert-curated safety critiques via targeted distillation.

Result: Extensive experiments on 15 benchmarks show OmniGuard achieves strong effectiveness and generalization across wide range of multimodal safety scenarios, providing unified framework for policy enforcement and risk mitigation.

Conclusion: OmniGuard enables robust safeguarding across all modalities, paving the way for more capable omni-modal safety systems that can handle diverse human-AI interaction scenarios.

Abstract: Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OMNIGUARD, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omnimodal safeguarding systems.

[272] Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu, Wei Gao

Main category: cs.AI

TL;DR: Researchers created ReMindView-Bench, a benchmark to evaluate vision-language models’ spatial reasoning across multiple viewpoints, revealing VLMs struggle with cross-view alignment and perspective-taking despite good single-view perception.

Details

Motivation: Current vision-language models lack geometric coherence and cross-view consistency in spatial reasoning tasks. There's a need for benchmarks that isolate multi-view reasoning from single-view perception to properly evaluate spatial cognition capabilities.

Method: Created ReMindView-Bench benchmark with systematic variation of viewpoint spatial patterns and query types. Evaluated 15 VLMs using explicit phase-wise analysis (LLM-as-a-judge, self-consistency prompting) and implicit analysis (linear probing, entropy dynamics) to examine reasoning processes.

Result: VLMs perform well on in-frame perception but degrade sharply when integrating information across views. They show consistent failures in cross-view alignment and perspective-taking. Analysis reveals progressive loss of task-relevant information and uncertainty separation between correct/incorrect reasoning trajectories.

Conclusion: The benchmark provides cognitive diagnosis of VLM spatial reasoning limitations, showing how multi-view spatial mental models degrade across reasoning phases. This reveals fundamental gaps in current VLMs’ ability to construct and maintain coherent spatial representations across viewpoints.

Abstract: Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveals consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis on the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further show progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source codes of benchmark construction and VLM reasoning analysis are available at https://github.com/pittisl/ReMindView-Bench.

[273] Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games

Ran Zhang, Kun Ouyang, Tiancheng Ma, Yida Yang, Dong Fang

Main category: cs.AI

TL;DR: LLM-powered generative agent simulation system for MMO game optimization, using fine-tuned LLMs to mimic real player behavior and data-driven environment modeling for cost-effective numerical design testing.

Details

Motivation: Traditional MMO optimization methods are costly, time-consuming, and disruptive to player experience. Offline simulations lack fidelity to accurately model real player reasoning and reactions to game interventions.

Method: Propose LLM-powered generative agent simulation with SFT and RL fine-tuning on real player data to create realistic player agents, combined with data-driven environment modeling from gameplay logs.

Result: System demonstrates strong consistency with real-world player behaviors and plausible causal responses under interventions, providing reliable, interpretable framework for optimization.

Conclusion: LLM-based generative agent simulation offers a cost-efficient, high-fidelity alternative to traditional MMO optimization methods, enabling better numerical design without disrupting real player experience.

Abstract: Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large-scale online experiments or parameter tuning over predefined statistical models, which are costly, time-consuming, and may disrupt player experience. Although simplified offline simulation systems are often adopted as alternatives, their limited fidelity prevents agents from accurately mimicking real player reasoning and reactions to interventions. To address these limitations, we propose a generative agent-based MMO simulation system empowered by Large Language Models (LLMs). By applying Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data, we adapt LLMs from general priors to game-specific domains, enabling realistic and interpretable player decision-making. In parallel, a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost-efficient framework for data-driven numerical design optimization.

[274] Synthetic Error Injection Fails to Elicit Self-Correction In Language Models

David X. Wu, Shreyas Kapur, Anant Sahai, Stuart Russell

Main category: cs.AI

TL;DR: Supervised learning with synthetic error injection fails to induce effective self-correction in LLMs, unlike on-policy RL methods.

Details

Motivation: To find computationally cheaper alternatives to reinforcement learning for eliciting self-correction capabilities in large language models, inspired by techniques from autonomous driving and robotics.

Method: Inject artificial errors into reasoning chains, mask them, and supervise the model to recognize and correct these mistakes through synthetic error injection training.

Result: The method fails to significantly improve performance even on simple synthetic tasks across multiple models. Models often parrot original mistakes even when catching errors, and synthetic-to-on-policy distribution shift degrades error-correction capabilities.

Conclusion: On-policy reinforcement learning methods remain uniquely effective for eliciting self-correction in LLMs, as synthetic error injection cannot overcome the distribution shift problem.

Abstract: Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.

[275] Semantic Trading: Agentic AI for Clustering and Relationship Discovery in Prediction Markets

Agostino Capponi, Alfio Gliozzo, Brian Zhu

Main category: cs.AI

TL;DR: AI pipeline clusters prediction markets and identifies correlated/anti-correlated market pairs, achieving 60-70% accuracy and enabling trading strategies with ~20% returns.

Details

Motivation: Prediction markets suffer from fragmentation through overlapping questions, implicit equivalences, and hidden contradictions across different markets, making it difficult to identify relationships between markets.

Method: Agentic AI pipeline that: (1) autonomously clusters markets into coherent topical groups using natural-language understanding over contract text and metadata, and (2) identifies within-cluster market pairs with strong outcome dependencies (correlated or anti-correlated relationships).

Result: Agent-identified relationships achieve 60-70% accuracy. Trading strategies based on these relationships earn about 20% average returns over week-long horizons on historical Polymarket data.

Conclusion: Agentic AI and large language models can effectively uncover latent semantic structure in prediction markets, transforming discovered relationships into actionable trading signals with significant returns.

Abstract: Prediction markets allow users to trade on outcomes of real-world events, but are prone to fragmentation through overlapping questions, implicit equivalences, and hidden contradictions across markets. We present an agentic AI pipeline that autonomously (i) clusters markets into coherent topical groups using natural-language understanding over contract text and metadata, and (ii) identifies within-cluster market pairs whose resolved outcomes exhibit strong dependence, including same-outcome (correlated) and different-outcome (anti-correlated) relationships. Using a historical dataset of resolved markets on Polymarket, we evaluate the accuracy of the agent’s relational predictions. We then translate discovered relationships into a simple trading strategy to quantify how these relationships map to actionable signals. Results show that agent-identified relationships achieve roughly 60-70% accuracy, and their induced trading strategies earn about 20% average returns over week-long horizons, highlighting the ability of agentic AI and large language models to uncover latent semantic structure in prediction markets.

[276] Guided Self-Evolving LLMs with Minimal Human Supervision

Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu

Main category: cs.AI

TL;DR: R-Few introduces a guided Self-Play Challenger-Solver framework with lightweight human oversight to enable stable AI self-evolution, achieving iterative improvements on reasoning benchmarks while using significantly less human data.

Details

Motivation: Current unguided self-evolving AI systems often plateau or degrade due to issues like concept drift, diversity collapse, and mis-evolution, where models reinforce their own biases. There's a need for stable, controllable self-evolution with minimal human supervision.

Method: R-Few uses a guided Self-Play Challenger-Solver framework with lightweight human oversight through in-context grounding and mixed training. The Challenger samples human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum.

Result: R-Few achieves consistent iterative improvements on math and general reasoning benchmarks. Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and matches performance of General-Reasoner trained on 20x more human data. The framework mitigates drift and yields stable co-evolutionary dynamics.

Conclusion: R-Few demonstrates that guided self-evolution with lightweight human oversight enables stable, controllable AI improvement, achieving strong performance with significantly reduced human data requirements while addressing key challenges of unguided self-evolution.

Abstract: AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.

[277] COPE: Chain-Of-Thought Prediction Engine for Open-Source Large Language Model Based Stroke Outcome Prediction from Clinical Notes

Yongkai Liu, Helena Feng, Bin Jiang, Yixin Wang, Max Wintermark, David S. Liebeskind, Michael Moseley, Maarten Lansberg, Gregory Albers, Jeremy Heit, Greg Zaharchuk

Main category: cs.AI

TL;DR: COPE is a Chain-of-Thought LLM framework that predicts 90-day stroke outcomes from clinical notes with accuracy comparable to GPT-4.1 and better than traditional models.

Details

Motivation: Clinical notes contain rich contextual information for stroke outcome prediction, but their unstructured nature limits use in traditional predictive models. There's a need for accurate, interpretable, and privacy-preserving solutions.

Method: Two-step Chain-of-Thought framework using sequential LLaMA-3-8B models: first generates clinical reasoning, second outputs mRS prediction. Compared against GPT-4.1, ClinicalBERT, structured ML model, and single-step LLM.

Result: COPE achieved MAE of 1.01, +/-1 accuracy of 74.4%, exact accuracy of 32.8% - comparable to GPT-4.1 and superior to other models. Consistent performance across subgroups with slightly higher error in older patients, thrombectomy cases, and longer summaries.

Conclusion: COPE provides an accurate, lightweight, interpretable, and privacy-preserving open-source solution for stroke outcome prediction from unstructured clinical text, demonstrating the value of reasoning-enhanced LLM frameworks.

Abstract: Predicting outcomes in acute ischemic stroke (AIS) guides clinical decision-making, patient counseling, and resource allocation. Clinical notes contain rich contextual information, but their unstructured nature limits their use in traditional predictive models. We developed and evaluated the Chain-of-Thought (CoT) Outcome Prediction Engine (COPE), a reasoning-enhanced large language model framework, for predicting 90-day functional outcomes after AIS from unstructured clinical notes. This study included 464 AIS patients with discharge summaries and 90-day modified Rankin Scale (mRS) scores. COPE uses a two-step CoT framework based on sequential open-source LLaMA-3-8B models: the first generates clinical reasoning, and the second outputs an mRS prediction. We compared COPE with GPT-4.1, ClinicalBERT, a structured variable-based machine learning model (Clinical ML), and a single-step LLM without CoT. Performance was evaluated using mean absolute error (MAE), accuracy within +/-1 mRS point, and exact accuracy. COPE achieved an MAE of 1.01 (95% CI 0.92-1.11), +/-1 accuracy of 74.4% (69.9, 78.8%), and exact accuracy of 32.8% (28.0, 37.6%), comparable to GPT-4.1 and superior to ClinicalBERT [MAE 1.24 (1.13-1.36)], Clinical ML [1.28 (1.18-1.39)], and the single-step LLM [1.20 (1.09-1.33)]. Subgroup analyses showed consistent performance across sex and age, with slightly higher error among older patients, those undergoing thrombectomy, and those with longer summaries. These findings demonstrate that COPE, a lightweight, interpretable, and privacy-preserving open-source framework, provides an accurate and practical solution for outcome prediction from unstructured clinical text.

[278] Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration

Yuxiang He, Jian Zhao, Yuchen Yuan, Tianle Zhang, Wei Cai, Haojie Cheng, Ziyan Shi, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

Main category: cs.AI

TL;DR: Aetheria is a multimodal interpretable content safety framework using multi-agent debate and collaboration to improve content moderation with transparency and better implicit risk detection.

Details

Motivation: Current content moderation systems have limitations: they often use single models or fixed pipelines, struggle to identify implicit risks, and lack interpretable judgment processes. The exponential growth of digital content makes these limitations increasingly problematic for content safety.

Method: Proposes Aetheria framework with five core agents that collaborate through dynamic, mutually persuasive debate mechanisms. Uses RAG-based knowledge retrieval to ground the analysis. The multi-agent architecture conducts in-depth analysis and adjudication of multimodal content.

Result: Comprehensive experiments on AIR-Bench benchmark show Aetheria generates detailed, traceable audit reports and outperforms baselines in overall content safety accuracy, especially in identifying implicit risks.

Conclusion: Aetheria establishes a transparent and interpretable paradigm that significantly advances trustworthy AI content moderation by providing better detection of implicit risks and explainable decision-making processes.

Abstract: The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and collaboration.Employing a collaborative architecture of five core agents, Aetheria conducts in-depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG-based knowledge retrieval.Comprehensive experiments on our proposed benchmark (AIR-Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.

Yufei Xiao, Shangfei Wang

Main category: cs.AI

TL;DR: A multi-modal empathy prediction method using video, audio, and text with supervisory document-assisted training that outperforms existing methods.

Details

Motivation: Current empathy prediction methods focus on single modalities (usually text) and ignore multi-modal processing capabilities and privileged information that could contain additional empathetic content.

Method: Multi-modal empathy prediction using video, audio, and text features extracted via pre-trained networks with cross-modal fusion. Includes supervisory documentation assisted training using Latent Dirichlet Allocation to identify topic distributions that constrain text features during training only.

Result: Experimental results on multi-modal and dialogue empathy datasets show the approach is superior to existing methods.

Conclusion: The proposed multi-modal approach with supervisory document-assisted training effectively improves empathy prediction by leveraging multiple modalities and privileged information during training.

Abstract: Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilization of certain privileged information, which may encompass additional empathetic content. In response, we introduce an advanced multi-modal empathy prediction method integrating video, audio, and text information. The method comprises the Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. We use pre-trained networks in the empathy prediction network to extract features from various modalities, followed by a cross-modal fusion. This process yields a multi-modal feature representation, which is employed to predict empathy labels. To enhance the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation model to identify potential topic distributions to constrain text features. These supervisory documents, created by supervisors, focus on the counseling topics and the counselor’s display of empathy. Notably, this privileged information is only available during training and is not accessible during the prediction phase. Experimental results on the multi-modal and dialogue empathy datasets demonstrate that our approach is superior to the existing methods.

[280] PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He

Main category: cs.AI

TL;DR: PaperDebugger is an in-editor, multi-agent academic writing assistant that integrates LLM-driven reasoning directly into LaTeX editors like Overleaf, enabling context-aware operations within the writing environment.

Details

Motivation: Current LLM writing assistants remain external to editors, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors.

Method: A Chrome-approved extension with Kubernetes-native orchestration layer and Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Features bidirectional synchronization, fine-grained version control, secure state management, multi-agent scheduling, and extensible external tool communication.

Result: Fully integrated workflow demonstrated with localized edits, structured reviews, parallel agent execution, and diff-based updates in a minimal-intrusion UI. Early aggregated analytics show active user engagement and validate the practicality of an editor-native, agentic writing assistant.

Conclusion: PaperDebugger successfully addresses the technical challenges of in-editor LLM integration and demonstrates the feasibility of bringing agentic, context-aware writing assistance directly into academic writing environments.

Abstract: Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo and video could be found at https://github.com/PaperDebugger/PaperDebugger.

[281] Target-specific Adaptation and Consistent Degradation Alignment for Cross-Domain Remaining Useful Life Prediction

Yubo Hou, Mohamed Ragab, Min Wu, Chee-Keong Kwoh, Xiaoli Li, Zhenghua Chen

Main category: cs.AI

TL;DR: TACDA is a novel domain adaptation approach for cross-domain RUL prediction that addresses domain discrepancy by combining target domain reconstruction with consistent degradation stage alignment.

Details

Motivation: Traditional data-driven RUL prediction methods assume training and testing data come from the same distribution, which doesn't hold in real industrial settings. Existing adversarial domain adaptation methods focus only on domain-invariant features while ignoring target-specific information and degradation stage inconsistencies.

Method: Proposes TACDA with two key strategies: 1) Target domain reconstruction within adversarial adaptation to retain target-specific information while learning domain-invariant features, and 2) A novel clustering and pairing strategy for consistent alignment between similar degradation stages.

Result: Extensive experiments show TACDA achieves remarkable performance, surpassing state-of-the-art approaches on two different evaluation metrics.

Conclusion: TACDA effectively addresses domain discrepancy in cross-domain RUL prediction by simultaneously preserving target-specific information and ensuring consistent degradation stage alignment, leading to superior performance compared to existing methods.

Abstract: Accurate prediction of the Remaining Useful Life (RUL) in machinery can significantly diminish maintenance costs, enhance equipment up-time, and mitigate adverse outcomes. Data-driven RUL prediction techniques have demonstrated commendable performance. However, their efficacy often relies on the assumption that training and testing data are drawn from the same distribution or domain, which does not hold in real industrial settings. To mitigate this domain discrepancy issue, prior adversarial domain adaptation methods focused on deriving domain-invariant features. Nevertheless, they overlook target-specific information and inconsistency characteristics pertinent to the degradation stages, resulting in suboptimal performance. To tackle these issues, we propose a novel domain adaptation approach for cross-domain RUL prediction named TACDA. Specifically, we propose a target domain reconstruction strategy within the adversarial adaptation process, thereby retaining target-specific information while learning domain-invariant features. Furthermore, we develop a novel clustering and pairing strategy for consistent alignment between similar degradation stages. Through extensive experiments, our results demonstrate the remarkable performance of our proposed TACDA method, surpassing state-of-the-art approaches with regard to two different evaluation metrics. Our code is available at https://github.com/keyplay/TACDA.

[282] Zero-Shot Instruction Following in RL via Structured LTL Representations

Mattia Giuri, Mathias Jackermeier, Alessandro Abate

Main category: cs.AI

TL;DR: A novel RL approach for LTL-specified tasks using Boolean formula sequences and GNN encoding to handle complex environments with simultaneous/interacting events.

Details

Motivation: Existing LTL-to-automata approaches for RL fail in environments where multiple atomic propositions can be true simultaneously and interact in complex ways, limiting their applicability to realistic multi-task scenarios.

Method: Conditions policy on sequences of simple Boolean formulae aligned with automaton transitions, encoded via graph neural network (GNN) to create structured task representations for multi-task learning.

Result: Experiments in a complex chess-based environment demonstrate advantages over existing approaches, showing improved handling of simultaneous/interacting events.

Conclusion: The proposed approach successfully addresses limitations of previous LTL-based RL methods by using Boolean formula sequences and GNN encoding to handle complex environments with interacting events, enabling more robust multi-task policies.

Abstract: Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.

[283] Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks

Zhiyuan He

Main category: cs.AI

TL;DR: Standard transformers struggle with recursive reasoning depth generalization, so the authors propose a looped locate-and-replace pipeline that decomposes nested problems into solvable subcomponents.

Details

Motivation: While prior work focused on length generalization, this paper investigates depth generalization - the ability to handle deeper nested hierarchical structures than seen during training. Standard transformers fail at maintaining stack-like behavior needed for recursive reasoning.

Method: A novel looped locate-and-replace pipeline with two specialized models: (1) a locator that identifies solvable subexpressions within nested structures, and (2) a replacer that evaluates these components while preserving overall structure.

Result: The method was evaluated on three domains with controllable recursion depth (Boolean algebra, recursive arithmetic, propositional logic) and effectively alleviated performance decay when tested on out-of-distribution recursion depths.

Conclusion: The proposed approach addresses transformers’ architectural limitations in handling deep recursive reasoning by decomposing problems into manageable subcomponents, enabling better depth generalization.

Abstract: Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model’s ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula. Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior, the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of the recursion increases. To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluated this method in three carefully designed domains: Boolean algebra, recursive arithmetic, and propositional logic, each with a controllable depth of recursion. We show that our method effectively alleviates the performance decay when tested on out-of-distribution recursion depth.

[284] Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding

Hyeongseop Rha, Jeong Hun Yeo, Junil Won, Se Jin Park, Yong Man Ro

Main category: cs.AI

TL;DR: MIGR framework improves multimodal emotion understanding by identifying emotion-dominant modality and reorganizing reasoning sequences to start from most critical modality, reducing reasoning drift and improving explanation consistency.

Details

Motivation: Existing multimodal emotion understanding methods suffer from reasoning drift - models gradually rely on their own generated text instead of multimodal evidence, and explanations become overly shaped by visually initiated reasoning paths, leading to emotionally inconsistent explanations.

Method: Introduces Modality Importance (MI) mechanism to identify emotion-dominant modality, then reorganizes reasoning sequences to begin explanations from the most critical modality. Uses two-stage framework: modality-aligned supervised fine-tuning and modality-aware reward optimization to generate emotionally grounded, causally relevant explanations.

Result: On DFEW benchmark, MIGR substantially improves reasoning reliability, decreasing instances of correct predictions with emotionally inconsistent explanations from 18.10% to 7.37%, confirming benefit of initiating reasoning from emotion-dominant modality.

Conclusion: The MIGR framework effectively addresses reasoning drift in multimodal emotion understanding by guiding reasoning sequences based on modality importance, leading to more reliable and emotionally consistent explanations.

Abstract: In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework-comprising modality-aligned supervised fine-tuning and modality-aware reward optimization-encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.

[285] Training Data Attribution for Image Generation using Ontology-Aligned Knowledge Graphs

Theodoros Aivalis, Iraklis A. Klampanos, Antonis Troumpoukis, Joemon M. Jose

Main category: cs.AI

TL;DR: A framework that uses multimodal LLMs to extract ontology-aligned knowledge graphs from images, enabling tracing of training data influences on generative model outputs for copyright analysis and AI transparency.

Details

Motivation: Address concerns about transparency, accountability, and copyright violations in generative models by understanding how specific training data contributes to model outputs, which is critical for responsible AI development.

Method: Uses multimodal large language models to extract structured triples from images, constructing ontology-aligned knowledge graphs. Compares KGs of generated images with training images to trace potential influences and data contributions.

Result: Validated through experiments on locally trained models via unlearning and on large-scale models through style-specific experiments. The framework enables copyright analysis, dataset transparency, and interpretable AI.

Conclusion: The framework supports development of AI systems that foster human collaboration, creativity, and curiosity by providing transparency into training data influences on generative outputs.

Abstract: As generative models become powerful, concerns around transparency, accountability, and copyright violations have intensified. Understanding how specific training data contributes to a model’s output is critical. We introduce a framework for interpreting generative outputs through the automatic construction of ontologyaligned knowledge graphs (KGs). While automatic KG construction from natural text has advanced, extracting structured and ontology-consistent representations from visual content remains challenging – due to the richness and multi-object nature of images. Leveraging multimodal large language models (LLMs), our method extracts structured triples from images, aligned with a domain-specific ontology. By comparing the KGs of generated and training images, we can trace potential influences, enabling copyright analysis, dataset transparency, and interpretable AI. We validate our method through experiments on locally trained models via unlearning, and on large-scale models through a style-specific experiment. Our framework supports the development of AI systems that foster human collaboration, creativity and stimulate curiosity.

[286] Menta: A Small Language Model for On-Device Mental Health Prediction

Tianyi Zhang, Xiangyuan Xue, Lingyan Ruan, Shiya Fu, Feng Xia, Simon D’Alfonso, Vassilis Kostakos, Hong Jia

Main category: cs.AI

TL;DR: Menta is an optimized small language model fine-tuned for multi-task mental health prediction from social media data, achieving better performance than larger LLMs while being deployable on mobile devices.

Details

Motivation: Mental health conditions affect hundreds of millions globally but early detection is limited. While LLMs show promise for mental health applications, their size and computational demands hinder practical deployment. Small language models offer a lightweight alternative but remain underexplored for social media-based mental health prediction.

Method: Introduces Menta, an optimized SLM fine-tuned specifically for multi-task mental health prediction from social media data. Uses joint training across six classification tasks with a LoRA-based framework, cross-dataset strategy, and balanced accuracy-oriented loss.

Result: Menta achieves 15.2% average improvement across tasks covering depression, stress, and suicidality compared to best-performing non-fine-tuned SLMs. Outperforms 13B-parameter LLMs on depression and stress classification while being 3.25x smaller. Successfully deployed on iPhone 15 Pro Max with only ~3GB RAM.

Conclusion: Menta demonstrates the potential for scalable, privacy-preserving mental health monitoring through optimized small language models that can run on-device, addressing deployment barriers of larger models while maintaining strong performance.

Abstract: Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models (LLMs) have shown promise in mental health applications, their size and computational demands hinder practical deployment. Small language models (SLMs) offer a lightweight alternative, but their use for social media–based mental health prediction remains largely underexplored. In this study, we introduce Menta, the first optimized SLM fine-tuned specifically for multi-task mental health prediction from social media data. Menta is jointly trained across six classification tasks using a LoRA-based framework, a cross-dataset strategy, and a balanced accuracy–oriented loss. Evaluated against nine state-of-the-art SLM baselines, Menta achieves an average improvement of 15.2% across tasks covering depression, stress, and suicidality compared with the best-performing non–fine-tuned SLMs. It also achieves higher accuracy on depression and stress classification tasks compared to 13B-parameter LLMs, while being approximately 3.25x smaller. Moreover, we demonstrate real-time, on-device deployment of Menta on an iPhone 15 Pro Max, requiring only approximately 3GB RAM. Supported by a comprehensive benchmark against existing SLMs and LLMs, Menta highlights the potential for scalable, privacy-preserving mental health monitoring. Code is available at: https://xxue752-nz.github.io/menta-project/

[287] StockMem: An Event-Reflection Memory Framework for Stock Forecasting

He Wang, Wenyilin Xiao, Songqiao Han, Hailiang Huang

Main category: cs.AI

TL;DR: StockMem: A dual-layer memory framework for stock prediction that structures news into events, tracks their evolution, and builds causal experience knowledge for explainable forecasting.

Details

Motivation: Stock prediction is difficult due to market volatility and real-time events. LLMs struggle with noisy financial news data and lack explicit answers. Existing memory architectures fail to identify key price movement drivers.

Method: Proposes StockMem with event-reflection dual-layer memory: 1) Structures news into events, 2) Horizontal consolidation integrates daily events, 3) Longitudinal tracking captures event evolution to extract incremental information reflecting market expectation discrepancies, building a temporal event knowledge base. 4) Analyzes event-price dynamics to form reflection knowledge base of causal experiences. For prediction, retrieves analogous historical scenarios and reasons with current events, incremental data, and past experiences.

Result: StockMem outperforms existing memory architectures and provides superior, explainable reasoning by tracing the information chain affecting prices, enhancing decision transparency in financial forecasting.

Conclusion: The proposed StockMem framework effectively addresses limitations of LLMs in finance by structuring noisy news data, tracking event evolution, and building causal knowledge for transparent and accurate stock price prediction.

Abstract: Stock price prediction is challenging due to market volatility and its sensitivity to real-time events. While large language models (LLMs) offer new avenues for text-based forecasting, their application in finance is hindered by noisy news data and the lack of explicit answers in text. General-purpose memory architectures struggle to identify the key drivers of price movements. To address this, we propose StockMem, an event-reflection dual-layer memory framework. It structures news into events and mines them along two dimensions: horizontal consolidation integrates daily events, while longitudinal tracking captures event evolution to extract incremental information reflecting market expectation discrepancies. This builds a temporal event knowledge base. By analyzing event-price dynamics, the framework further forms a reflection knowledge base of causal experiences. For prediction, it retrieves analogous historical scenarios and reasons with current events, incremental data, and past experiences. Experiments show StockMem outperforms existing memory architectures and provides superior, explainable reasoning by tracing the information chain affecting prices, enhancing decision transparency in financial forecasting.

[288] AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping

Md Abdul Kadir, Sai Suresh Macharla Vasu, Sidharth S. Nair, Daniel Sonntag

Main category: cs.AI

TL;DR: LLMs outperform traditional rule-based journal entry tests and ML baselines for detecting anomalies in double-entry bookkeeping, offering natural-language explanations for better interpretability.

Details

Motivation: Traditional rule-based Journal Entry Tests (JETs) generate excessive false positives and struggle with subtle irregularities in tax-related ledger records, creating a need for more effective anomaly detection methods.

Method: Benchmarked state-of-the-art LLMs (LLaMA and Gemma) on both synthetic and real-world anonymized ledgers, comparing them against traditional JETs and machine learning baselines.

Result: LLMs consistently outperformed traditional rule-based JETs and classical ML baselines while providing natural-language explanations that enhance interpretability.

Conclusion: The research highlights the potential of AI-augmented auditing where human auditors collaborate with foundation models to strengthen financial integrity.

Abstract: Auditors rely on Journal Entry Tests (JETs) to detect anomalies in tax-related ledger records, but rule-based methods generate overwhelming false positives and struggle with subtle irregularities. We investigate whether large language models (LLMs) can serve as anomaly detectors in double-entry bookkeeping. Benchmarking SoTA LLMs such as LLaMA and Gemma on both synthetic and real-world anonymized ledgers, we compare them against JETs and machine learning baselines. Our results show that LLMs consistently outperform traditional rule-based JETs and classical ML baselines, while also providing natural-language explanations that enhance interpretability. These results highlight the potential of \textbf{AI-augmented auditing}, where human auditors collaborate with foundation models to strengthen financial integrity.

[289] Self-Improving AI Agents through Self-Play

Przemyslaw Chojecki

Main category: cs.AI

TL;DR: This paper extends psychometric battery framework to dynamical systems, formalizing agents as computational resource-parameterized flows governed by a Generator-Verifier-Updater operator, deriving a Variance Inequality for self-improvement stability, and unifying various AI architectures as specific realizations.

Details

Motivation: To extend the moduli-theoretic framework of psychometric batteries from static agent representations to dynamical systems, enabling analysis of self-improving agents as parameterized flows on manifolds.

Method: Formalizes agents as flows ν_r parameterized by computational resource r, governed by a recursive Generator-Verifier-Updater (GVU) operator. Proves this generates a vector field on parameter manifold Θ, identifies self-improvement coefficient κ as Lie derivative. Derives Variance Inequality spectral condition for stability.

Result: Shows sufficient condition for κ>0 requires combined generation/verification noise to be small enough (up to curvature/step-size effects). Demonstrates architectures like STaR, SPIN, Reflexion, GANs, AlphaZero are specific topological realizations of GVU operator satisfying Variance Inequality through various mechanisms.

Conclusion: Provides unified theoretical framework for analyzing self-improving AI systems, connecting diverse architectures through common mathematical structure of GVU operator and Variance Inequality condition for stable self-improvement.

Abstract: We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ν_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $Θ$, and we identify the coefficient of self-improvement $κ$ as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $κ> 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.

[290] A Framework for Causal Concept-based Model Explanations

Anna Rodum Bjøru, Jacob Lysnæs-Larsen, Oskar Jørgensen, Inga Strümke, Helge Langseth

Main category: cs.AI

TL;DR: A causal concept-based XAI framework using probability of sufficiency for concept interventions to generate local/global explanations that are both understandable and faithful to the model.

Details

Motivation: Current XAI methods need to balance understandability (using clear concepts) with faithfulness (accurately representing the model's behavior), requiring explanations that are both interpretable and reliable.

Method: Uses causal concept-based approach with probability of sufficiency calculations for concept interventions. Generates local and global explanations through concept interventions, demonstrated with a proof-of-concept model on CelebA dataset classifiers.

Result: Framework produces understandable explanations using clear concept vocabulary with implicit causal interpretation. Fidelity is addressed by aligning explanation interpretation context with generation context, highlighting important framework assumptions.

Conclusion: The proposed causal concept-based XAI framework successfully generates both understandable and faithful explanations by using concept interventions and carefully managing context alignment between explanation generation and interpretation.

Abstract: This work presents a conceptual framework for causal concept-based post-hoc Explainable Artificial Intelligence (XAI), based on the requirements that explanations for non-interpretable models should be understandable as well as faithful to the model being explained. Local and global explanations are generated by calculating the probability of sufficiency of concept interventions. Example explanations are presented, generated with a proof-of-concept model made to explain classifiers trained on the CelebA dataset. Understandability is demonstrated through a clear concept-based vocabulary, subject to an implicit causal interpretation. Fidelity is addressed by highlighting important framework assumptions, stressing that the context of explanation interpretation must align with the context of explanation generation.

[291] Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents

Zijie Lin, Qilin Cai, Liang Shen, Mingjun Xiao

Main category: cs.AI

TL;DR: Proposed a prompt-free collaborative agent framework for automated paper reproduction that uses verification and refinement agents to improve code generation quality without manual prompt engineering.

Details

Motivation: Existing paper reproduction frameworks lack verification mechanisms at each generation step and rely heavily on manually designed prompts for self-refinement, limiting adaptability and scalability.

Method: Two collaborative agents: verification agent examines outputs against system prompt requirements, and refinement agent revises outputs based on identified issues. Framework leverages only original system prompts without human-crafted refinement prompts.

Result: Significantly improves accuracy and completeness of reproduced code by ~15% and 13% respectively compared to baseline. Outperforms Self-Refine approach, demonstrating robustness across PaperBench Code-Dev and Paper2CodeBench datasets.

Conclusion: The prompt-free collaborative agent framework effectively enhances paper-to-code generation quality, offering improved adaptability and scalability over manual prompt-based approaches.

Abstract: Automated paper reproduction has emerged as a promising approach to accelerate scientific research, employing multi-step workflow frameworks to systematically convert academic papers into executable code. However, existing frameworks often lack mechanisms to verify and refine the outputs at each generation step, or rely heavily on manually designed prompts for self-refinement, which limits their adaptability and scalability. To address these limitations, we propose a prompt-free collaborative agent framework that automatically enhances the quality of paper-to-code generation. Our approach employs two collaborative agents: a verification agent that examines whether the outputs at each step satisfy the requirements specified in the corresponding system prompt, and a refinement agent that revises the outputs based on the identified issues. Unlike previous methods that require human experts to craft specific refinement prompts for each step, our framework achieves automatic verification and improvement by leveraging only the original system prompts. We integrate our collaborative agents into the Paper2Code framework and conduct comprehensive experiments on PaperBench Code-Dev and Paper2CodeBench datasets. Experimental results demonstrate that our approach significantly improves the accuracy and completeness of reproduced code, achieving performance gains of approximately 15% and 13%, respectively, compared to the baseline without our agents. Furthermore, comparative experiments against Self-Refine validate the robustness and consistency of our prompt-free approach across different datasets.

[292] Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control

Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang

Main category: cs.AI

TL;DR: Radiologist Copilot is an AI agent system that automates both radiology report generation AND quality control using LLMs and orchestrated tools, outperforming existing methods that only focus on generation.

Details

Motivation: Current automated radiology reporting approaches focus only on report generation and neglect the crucial quality control phase, limiting their ability to provide comprehensive support to radiologists who face time-consuming, error-prone reporting tasks.

Method: An agentic AI system using large language models as reasoning backbone, with orchestrated tools for: region localization, think-with-image paradigm for region analysis planning, strategic template selection for report generation, and quality assessment with feedback-driven adaptive refinement for quality control.

Result: Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting, demonstrating improved accuracy, completeness, and efficiency.

Conclusion: The system provides comprehensive automated support for the entire radiology reporting workflow including quality control, assisting radiologists and improving clinical efficiency.

Abstract: Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is also critical but tedious, ensuring that the final report meets clinical standards. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, think with image paradigm directed region analysis planning, strategic template selection for report generation, quality assessment and feedback-driven adaptive refinement for quality control. Therefore, Radiologist Copilot facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.

[293] The future of AI in critical mineral exploration

Jef Caers

Main category: cs.AI

TL;DR: AI-driven scientific method for mineral exploration combining Bayesianism and falsification to reduce bias, improve discovery rates, and optimize data acquisition.

Details

Motivation: Despite increased investment, new mineral discoveries have declined over the past two decades, creating a critical need for more effective exploration methods to support the energy transition through electrification.

Method: Proposes a new scientific method based on Bayesian principles and falsification, using AI to: 1) generate multiple competing geological hypotheses via unsupervised learning collaborating with domain experts, and 2) optimize data acquisition planning through human-in-the-loop AI algorithms that prioritize reducing geological uncertainty before grade/tonnage uncertainty.

Result: Provides a practical protocol template for exploration campaigns that uses verifiable metrics and rational decision-making to determine optimal data acquisition sequences, aiming to reduce cognitive bias and false positives while lowering exploration costs.

Conclusion: AI-enabled scientific methodology offers a systematic solution to the declining discovery problem in mineral exploration by implementing rigorous philosophical principles (Bayesianism and falsification) to improve decision-making and discovery efficiency.

Abstract: The energy transition through increased electrification has put the worlds attention on critical mineral exploration Even with increased investments a decrease in new discoveries has taken place over the last two decades Here I propose a solution to this problem where AI is implemented as the enabler of a rigorous scientific method for mineral exploration that aims to reduce cognitive bias and false positives drive down the cost of exploration I propose a new scientific method that is based on a philosophical approach founded on the principles of Bayesianism and falsification In this approach data acquisition is in the first place seen as a means to falsify human generated hypothesis Decision of what data to acquire next is quantified with verifiable metrics and based on rational decision making A practical protocol is provided that can be used as a template in any exploration campaign However in order to make this protocol practical various form of artificial intelligence are needed I will argue that the most important form are one novel unsupervised learning methods that collaborate with domain experts to better understand data and generate multiple competing geological hypotheses and two humanintheloop AI algorithms that can optimally plan various geological geophysical geochemical and drilling data acquisition where uncertainty reduction of geological hypothesis precedes the uncertainty reduction on grade and tonnage

[294] Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning

Zhonghao He, Tianyi Qiu, Hirokazu Shirado, Maarten Sap

Main category: cs.AI

TL;DR: LLM reasoning often leads to belief entrenchment (confirmation bias) rather than truth-seeking, measured via Martingale Score from Bayesian statistics.

Details

Motivation: Despite advances in LLM reasoning techniques, there's evidence that iterative reasoning may cause belief entrenchment and confirmation bias instead of improving truth-seeking behavior. Need systematic evaluation of this phenomenon.

Method: Propose Martingale Score based on Bayesian statistics’ Martingale property (belief updates should be unpredictable from current belief). Use unsupervised, regression-based approach to measure violations indicating belief entrenchment. Evaluate across open-ended domains: event forecasting, value-laden questions, academic paper review.

Result: Widespread violations found across models and setups - current belief positively predicts future belief updates (belief entrenchment). Identified models, reasoning techniques, and domains more prone to entrenchment. Martingale Score predicts ground-truth accuracy where labels available.

Conclusion: Martingale Score is effective unsupervised metric for detecting belief entrenchment in LLM reasoning and serves as proxy for truth-seeking ability, even in domains without ground truth.

Abstract: Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.

[295] Invasive Context Engineering to Control Large Language Models

Thomas Rivasseau

Main category: cs.AI

TL;DR: Proposes Invasive Context Engineering - inserting control sentences into LLM context to improve security against adversarial attacks in long-context situations, without requiring model training.

Details

Motivation: Current LLM security methods (preference training, prompting, filtering) still leave models vulnerable to abuse, especially as jailbreak probability increases with context length. Need robust security guarantees for long-context scenarios.

Method: Invasive Context Engineering - inserting specially designed control sentences directly into the LLM’s context. Can be generalized to Chain-of-Thought processes to prevent scheming. Doesn’t require model retraining.

Result: Partially solves the long-context security problem by providing a training-free approach that avoids data shortage issues common in long-context model training.

Conclusion: Invasive Context Engineering offers a promising approach to enhance LLM security in long-context scenarios without the limitations of training-based methods, with potential applications in preventing scheming through Chain-of-Thought processes.

Abstract: Current research on operator control of Large Language Models improves model robustness against adversarial attacks and misbehavior by training on preference examples, prompting, and input/output filtering. Despite good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length. There is a need for robust LLM security guarantees in long-context situations. We propose control sentences inserted into the LLM context as invasive context engineering to partially solve the problem. We suggest this technique can be generalized to the Chain-of-Thought process to prevent scheming. Invasive Context Engineering does not rely on LLM training, avoiding data shortage pitfalls which arise in training models for long context situations.

[296] From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

Main category: cs.AI

TL;DR: LLMs can be used as mediators for online conflicts, not just moderators, by judging conversation fairness and steering with empathetic messages, with API models outperforming open-source ones.

Details

Motivation: As LLMs increasingly mediate online communication, there's potential for them to foster empathy and constructive dialogue, moving beyond just detecting harmful content to actually de-escalating conflicts.

Method: Framework decomposes mediation into judgment (evaluating fairness and emotional dynamics) and steering (generating empathetic, de-escalatory messages). Uses Reddit-based dataset and multi-stage evaluation pipeline with principle-based scoring, user simulation, and human comparison.

Result: API-based models outperform open-source counterparts in both reasoning and intervention alignment for mediation tasks.

Conclusion: Current LLMs show promise but also limitations as emerging agents for online social mediation, highlighting both their potential and current constraints in this application domain.

Abstract: The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

[297] Computational Copyright: Towards A Royalty Model for Music Generative AI

Junwei Deng, Xirui Jiang, Shiyuan Zhang, Shichang Zhang, Himabindu Lakkaraju, Ruijiang Gao, Chris Donahue, Jiaqi W. Ma

Main category: cs.AI

TL;DR: Proposes Generative Content ID framework for scalable royalty attribution in music AI using causal attribution via efficient Training Data Attribution methods, showing it’s more accurate than similarity-based legal proxies.

Details

Motivation: Current approaches to AI copyright issues in music focus on prevention or one-time licensing, lacking sustainable recurring economic incentives needed for creative ecosystems. There's a need for scalable royalty attribution that connects AI-generated music value back to influential training content.

Method: Adapts YouTube’s Content ID concept to create Generative Content ID framework. Uses efficient Training Data Attribution (TDA) methods to approximate causal attribution at scale, avoiding infeasible counterfactual retraining. Conducts empirical analysis on public and proprietary datasets comparing TDA approximation to gold-standard retraining-based attribution.

Result: 1) Scalable TDA methods faithfully approximate costly retraining-based causal attribution, showing framework feasibility. 2) Perceived similarity (used in legal practices) captures most influential samples but fails to account for broader data contributions driving model utility, suggesting similarity-based legal proxies are ill-suited for royalty distribution.

Conclusion: Provides principled and operational foundation for royalty-based economic governance of music generative AI, offering scalable solution for sustainable economic incentives in creative ecosystems through causal attribution rather than similarity-based approaches.

Abstract: The rapid rise of generative AI has intensified copyright and economic tensions in creative industries, particularly in music. Current approaches addressing this challenge often focus on preventing infringement or establishing one-time licensing, which fail to provide the sustainable, recurring economic incentives necessary to maintain creative ecosystems. To address this gap, we propose Generative Content ID, a framework for scalable and faithful royalty attribution in music generative AI. Adapting the idea of YouTube’s Content ID, it attributes the value of AI-generated music back to the specific training content that causally influenced its generation, a process we term as causal attribution. However, naively quantifying the causal influence requires counterfactually retraining the model on subsets of training data, which is infeasible. We address this challenge using efficient Training Data Attribution (TDA) methods to approximate causal attribution at scale. We further conduct empirical analysis of the framework on public and proprietary datasets. First, we demonstrate that the scalable TDA methods provide a faithful approximation of the “gold-standard” but costly retraining-based causal attribution, showing the feasibility of the proposed royalty framework. Second, we investigate the relationship between the perceived similarity employed by legal practices and our causal attribution reflecting the true AI training mechanics. We find that while perceived similarity can capture the most influential samples, it fails to account for the broader data contribution that drives model utility, suggesting similarity-based legal proxies are ill-suited for royalty distribution. Overall, this work provides a principled and operational foundation for royalty-based economic governance of music generative AI.

[298] A process algebraic framework for multi-agent dynamic epistemic systems

Alessandro Aldini

Main category: cs.AI

TL;DR: Combines labeled transition systems with epistemic models to create a unified framework for modeling and analyzing multi-agent knowledge-based dynamic systems.

Details

Motivation: To create a comprehensive framework that integrates dynamic behavior (via labeled transition systems) with knowledge reasoning (via epistemic models) for multi-agent systems, addressing the need for both modeling and verification capabilities.

Method: 1. Combines classical labeled transition systems with epistemic models for knowledge reasoning. 2. Develops a process algebraic, agent-oriented specification language for practical modeling. 3. Defines a modal logic that encompasses both temporal and epistemic operators for verification.

Result: Creates a unifying framework that enables modeling of multi-agent, knowledge-based, dynamic systems and provides verification capabilities through a combined temporal-epistemic logic.

Conclusion: The paper successfully integrates dynamic and epistemic aspects into a single framework with both practical modeling tools (process algebraic language) and formal verification methods (combined modal logic), making it suitable for analyzing complex multi-agent systems.

Abstract: This paper combines the classical model of labeled transition systems with the epistemic model for reasoning about knowledge. The result is a unifying framework for modeling and analyzing multi-agent, knowledge-based, dynamic systems. On the modeling side, we propose a process algebraic, agent-oriented specification language that makes such a framework easy to use for practical purposes. On the verification side, we define a modal logic encompassing temporal and epistemic operators.

[299] Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks

Paul Smolensky, Roland Fernandez, Zhenghao Herbert Zhou, Mattia Opper, Adam Davies, Jianfeng Gao

Main category: cs.AI

TL;DR: Researchers develop a Production System Language (PSL) to compile symbolic programs into transformer networks, enabling 100% interpretable symbolic processing and demonstrating Turing universality.

Details

Motivation: To understand how transformers can perform robust symbol processing despite historical critiques that neural networks cannot master abstract symbol manipulation, and to create interpretable symbolic processing in transformers.

Method: Developed a high-level Production System Language (PSL) inspired by symbolic AI and cognitive science, created compilers that precisely implement PSL programs in transformer networks, and studied Templatic Generation (TGT) as a test case.

Result: Successfully compiled PSL programs into transformer networks that are 100% mechanistically interpretable, demonstrated PSL’s Turing universality, and suggested new transformer architectures for enhanced symbolic processing.

Conclusion: The work shows transformers can perform robust symbolic processing through compiled PSL programs, addressing computability (but not learnability) and providing interpretable architectures for symbolic AI in neural networks.

Abstract: Large Language Models (LLMs) have demonstrated impressive abilities in symbol processing through in-context learning (ICL). This success flies in the face of decades of critiques asserting that artificial neural networks cannot master abstract symbol manipulation. We seek to understand the mechanisms that can enable robust symbol processing in transformer networks, illuminating both the unanticipated success, and the significant limitations, of transformers in symbol processing. Borrowing insights from symbolic AI and cognitive science on the power of Production System architectures, we develop a high-level Production System Language, PSL, that allows us to write symbolic programs to do complex, abstract symbol processing, and create compilers that precisely implement PSL programs in transformer networks which are, by construction, 100% mechanistically interpretable. The work is driven by study of a purely abstract (semantics-free) symbolic task that we develop, Templatic Generation (TGT). Although developed through study of TGT, PSL is, we demonstrate, highly general: it is Turing Universal. The new type of transformer architecture that we compile from PSL programs suggests a number of paths for enhancing transformers’ capabilities at symbol processing. We note, however, that the work we report addresses computability, and not learnability, by transformer networks. Note: The first section provides an extended synopsis of the entire paper.

[300] Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Fabien Roger, Clement Neo, Jacob Haimes, Felix Hofstätter, Teun van der Weij

Main category: cs.AI

TL;DR: Researchers propose a noise-injection method to detect AI models that deliberately underperform (sandbagging) during evaluations by observing anomalous performance improvements when noise is added to model weights.

Details

Motivation: Capability evaluations are essential for assessing frontier AI systems, but they face the challenge of "sandbagging" - where models deliberately underperform during evaluation. This strategic underperformance, whether through explicit developer intervention or unintended model behavior, presents a fundamental obstacle to accurate capability assessment and effective AI regulation.

Method: The researchers introduce a novel sandbagging detection method based on injecting varying magnitudes of noise into model weights. They hypothesize that while non-sandbagging models show predictable performance degradation with increasing noise, sandbagging models will exhibit anomalous performance improvements. This occurs because the noise disrupts underperformance mechanisms while core capabilities remain partially intact.

Result: Through experiments across various model architectures, sizes, and sandbagging techniques, the researchers establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, they find noise-injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so.

Conclusion: The findings provide a practical tool for AI evaluation and oversight, addressing a critical challenge in ensuring accurate capability assessment of frontier AI systems. The noise-injection method offers a reliable way to detect sandbagging behavior across different models and sandbagging techniques.

Abstract: Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or ``sandbagging’’, where models deliberately underperform during evaluation. Sandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model architectures, sizes, and sandbagging techniques, we establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, we find noise-injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so. Our findings provide a practical tool for AI evaluation and oversight, addressing a challenge in ensuring accurate capability assessment of frontier AI systems.

[301] Reinforcement Learning: An Overview

Kevin Murphy

Main category: cs.AI

TL;DR: A comprehensive survey paper covering the entire field of deep reinforcement learning and sequential decision making, including value-based, policy-based, and model-based methods, plus specialized topics like multi-agent RL, LLM integration, offline RL, hierarchical RL, and intrinsic rewards, with practical code examples.

Details

Motivation: To provide a complete, current overview of the rapidly evolving field of reinforcement learning and sequential decision making, addressing the need for a comprehensive reference that covers both foundational methods and emerging trends like LLM integration.

Method: Survey methodology: systematic review and synthesis of existing RL literature, organized by methodological categories (value-based, policy-based, model-based) and specialized topics, supplemented with practical code implementations for training LLMs with RL.

Result: A thorough, up-to-date survey paper that serves as a comprehensive reference for the RL community, covering both established techniques and cutting-edge developments, with practical implementation guidance.

Conclusion: This survey successfully provides a holistic view of the RL landscape, demonstrating the field’s breadth and maturity while highlighting emerging directions like LLM-RL integration, making it valuable for both newcomers and experienced researchers.

Abstract: This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based methods, policy-based methods, model-based methods, multi-agent RL, LLMs and RL, and various other topics (e.g., offline RL, hierarchical RL, intrinsic reward). It also includes some code snippets for training LLMs with RL.

[302] ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning

Jingyang Yi, Jiazheng Wang, Sida Li

Main category: cs.AI

TL;DR: ShorterBetter is a reinforcement learning method that teaches reasoning models to find optimal CoT lengths automatically, reducing output by 50-80% while maintaining accuracy.

Details

Motivation: Current reasoning models like OpenAI o1 and DeepSeek-R1 generate long, redundant CoT traces (overthinking), leading to inefficient outputs despite helping with complex problems.

Method: Uses reinforcement learning with Sample Optimal Length (SOL) - the shortest correct response among multiple generations - as a dynamic reward signal to guide models toward efficient reasoning without manual supervision.

Result: Achieves 50-80% reduction in output lengths on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Analysis shows refined reasoning structure with less repetition, excessive self-verification, and over-exploration.

Conclusion: ShorterBetter effectively addresses overthinking in reasoning models by enabling them to learn optimal CoT lengths automatically, producing more efficient reasoning traces without sacrificing performance.

Abstract: Recent models such as OpenAI o1 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks by generating extended Chain-of-Thought (CoT) traces. While longer reasoning helps with thorough exploration of solution paths for complex problems, it also often leads to inefficient and redundant outputs–a phenomenon commonly described as overthinking. In this paper, we propose ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning models to learn their own optimal CoT lengths without manual supervision. We define the Sample Optimal Length (SOL) as the length of the shortest correct response among multiple generations, which serves as a dynamic reward signal to guide the model toward efficient reasoning. Applied to DeepSeek-Distill-Qwen-1.5B/7B as base models, ShorterBetter achieves 50%-80% reduction in output lengths in both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our reasoning trace analysis shows that ShorterBetter refines the structure of the reasoning traces by reducing unnecessary repetition, excessive self-verification, and over-exploration of alternatives.

[303] Subgoal-Guided Policy Heuristic Search with Learned Subgoals

Jake Tuero, Michael Buro, Levi H. S. Lelis

Main category: cs.AI

TL;DR: Novel method for learning subgoal-based policies for policy tree search algorithms that improves sample efficiency by learning from both successful and failed search attempts.

Details

Motivation: Policy tree search algorithms require complete solution trajectories for training, which is costly for hard problems. Failed search attempts waste samples, especially with randomly initialized policies.

Method: Learn subgoal-based policies from search trees expanded during problem-solving attempts, including failed attempts. Policies are conditioned on subgoals learned from these search trees.

Result: Empirical results show improved sample efficiency for learning policies and heuristic functions in online settings compared to previous approaches.

Conclusion: The proposed subgoal-based policy learning method effectively utilizes information from both successful and failed search attempts, making policy tree search training more efficient for hard problems.

Abstract: Policy tree search is a family of tree search algorithms that use a policy to guide the search. These algorithms provide guarantees on the number of expansions required to solve a given problem that are based on the quality of the policy. While these algorithms have shown promising results, the process in which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly, especially when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. This paper introduces a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees that the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.

[304] Keeping Medical AI Healthy and Trustworthy: A Review of Detection and Correction Methods for System Degradation

Hao Guan, David Bates, Li Zhou

Main category: cs.AI

TL;DR: This review paper addresses the critical challenge of AI performance degradation in healthcare settings and proposes a comprehensive framework for monitoring, detecting, and correcting AI system “health” issues to ensure safe long-term deployment.

Details

Motivation: AI systems in healthcare face performance degradation over time due to shifting data distributions, changing patient characteristics, evolving clinical protocols, and data quality variations, which compromise reliability and safety, creating an urgent need for continuous monitoring and maintenance solutions.

Method: The paper conducts a comprehensive review that: 1) analyzes common causes of performance degradation at data and model levels, 2) summarizes techniques for detecting data and model drift, 3) examines root cause analysis methods, and 4) reviews correction strategies including model retraining and test-time adaptation, covering both traditional ML models and state-of-the-art LLMs.

Result: The review provides a systematic framework for understanding AI degradation in healthcare, identifies key monitoring and correction techniques, compares strengths and limitations of different AI approaches, and establishes a foundation for developing robust medical AI systems capable of sustained safe deployment.

Conclusion: The paper concludes that maintaining AI system “health” requires continuous performance monitoring, early degradation detection, and effective self-correction mechanisms, and proposes future research directions to guide development of reliable, robust medical AI systems for dynamic clinical environments.

Abstract: Artificial intelligence (AI) is increasingly integrated into modern healthcare, offering powerful support for clinical decision-making. However, in real-world settings, AI systems may experience performance degradation over time, due to factors such as shifting data distributions, changes in patient characteristics, evolving clinical protocols, and variations in data quality. These factors can compromise model reliability, posing safety concerns and increasing the likelihood of inaccurate predictions or adverse outcomes. This review presents a forward-looking perspective on monitoring and maintaining the “health” of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. The paper begins by reviewing common causes of performance degradation at both data and model levels. We then summarize key techniques for detecting data and model drift, followed by an in-depth look at root cause analysis. Correction strategies are further reviewed, ranging from model retraining to test-time adaptation. Our survey spans both traditional machine learning models and state-of-the-art large language models (LLMs), offering insights into their strengths and limitations. Finally, we discuss ongoing technical challenges and propose future research directions. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.

[305] From Memories to Maps: Mechanisms of In-Context Reinforcement Learning in Transformers

Ching Fang, Kanaka Rajan

Main category: cs.AI

TL;DR: Transformers trained for in-context RL develop episodic memory-like mechanisms that cache intermediate computations, enabling rapid adaptation similar to biological learning.

Details

Motivation: To understand how humans/animals achieve rapid adaptation with minimal experience, which standard RL algorithms fail to capture, by studying how transformers (with key-value architecture resembling brain episodic memory) learn in-context.

Method: Train a transformer to in-context reinforcement learn in planning tasks inspired by rodent behavior, then analyze the emergent learning algorithms and representations.

Result: Transformers develop: 1) representation learning via in-context structure learning and cross-context alignment, 2) RL strategies not interpretable as standard model-free/model-based planning, but rather caching intermediate computations in memory tokens for decision-time access.

Conclusion: Memory serves as computational resource storing both raw experience and cached computations; transformer representations resemble hippocampal-entorhinal computations, suggesting relevance to natural cognition and providing mechanistic hypothesis for rapid adaptation.

Abstract: Humans and animals show remarkable learning efficiency, adapting to new environments with minimal experience. This capability is not well captured by standard reinforcement learning algorithms that rely on incremental value updates. Rapid adaptation likely depends on episodic memory – the ability to retrieve specific past experiences to guide decisions in novel contexts. Transformers provide a useful setting for studying these questions because of their ability to learn rapidly in-context and because their key-value architecture resembles episodic memory systems in the brain. We train a transformer to in-context reinforcement learn in a distribution of planning tasks inspired by rodent behavior. We then characterize the learning algorithms that emerge in the model. We first find that representation learning is supported by in-context structure learning and cross-context alignment, where representations are aligned across environments with different sensory stimuli. We next demonstrate that the reinforcement learning strategies developed by the model are not interpretable as standard model-free or model-based planning. Instead, we show that in-context reinforcement learning is supported by caching intermediate computations within the model’s memory tokens, which are then accessed at decision time. Overall, we find that memory may serve as a computational resource, storing both raw experience and cached computations to support flexible behavior. Furthermore, the representations developed in the model resemble computations associated with the hippocampal-entorhinal system in the brain, suggesting that our findings may be relevant for natural cognition. Taken together, our work offers a mechanistic hypothesis for the rapid adaptation that underlies in-context learning in artificial and natural settings.

[306] Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report

Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan

Main category: cs.AI

TL;DR: InfinityInstruct-Subject: A systematic framework for constructing high-quality instruction datasets with improved coverage and depth, containing ~1.5M instructions that enhance model instruction-following capabilities.

Details

Motivation: Current instruction datasets (tens of millions of samples) still lead to models struggling with complex instruction following and rare domain tasks due to limited coverage (task types/knowledge areas) and depth (instruction complexity).

Method: Proposes a systematic instruction data construction framework with: 1) hierarchical labeling system, 2) informative seed selection algorithm, 3) evolutionary data synthesis process, and 4) model deficiency diagnosis with targeted data generation, forming an iterative closed-loop.

Result: Constructed InfinityInstruct-Subject dataset (~1.5M instructions). Experiments on multiple foundation models and benchmark tasks demonstrate improved instruction-following capabilities. Analysis shows enlarged coverage and depth compared to comparable synthesized instruction datasets.

Conclusion: The work lays theoretical and practical foundation for efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement through systematic framework addressing coverage and depth limitations.

Abstract: Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both coverage'' (coverage of task types and knowledge areas) and depth’’ (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.

[307] Agentic UAVs: LLM-Driven Autonomy with Integrated Tool-Calling and Cognitive Reasoning

Anis Koubaa, Khaled Gabr

Main category: cs.AI

TL;DR: Agentic UAVs framework uses LLM-driven reasoning to enhance drone autonomy, achieving better detection and decision-making in search-and-rescue simulations.

Details

Motivation: Current UAV systems operate at low autonomy levels (SAE 2-3), relying on rule-based control and narrow AI that limits adaptability in dynamic missions. They lack context-aware reasoning, autonomous decision-making, and integration with external systems, especially missing LLM agents for real-time knowledge access.

Method: Proposes a five-layer architecture (Perception, Reasoning, Action, Integration, Learning) that enhances UAV autonomy through LLM-driven reasoning, database querying, and third-party system interaction. Built a prototype with ROS 2 and Gazebo combining YOLOv11 for object detection with GPT-4 for reasoning and locally deployed Gemma 3 model.

Result: In simulated search-and-rescue scenarios, agentic UAVs achieved: higher detection confidence (0.79 vs 0.72), improved person detection rates (91% vs 75%), and major increase in correct action recommendations (92% vs 4.5%). Shows modest computational overhead enables significantly higher autonomy.

Conclusion: The Agentic UAVs framework demonstrates that LLM-driven reasoning can substantially improve UAV autonomy and system integration, enabling more adaptive and intelligent performance in dynamic missions with manageable computational costs.

Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly used in defense, surveillance, and disaster response, yet most systems still operate at SAE Level 2 to 3 autonomy. Their dependence on rule-based control and narrow AI limits adaptability in dynamic and uncertain missions. Current UAV architectures lack context-aware reasoning, autonomous decision-making, and integration with external systems. Importantly, none make use of Large Language Model (LLM) agents with tool-calling for real-time knowledge access. This paper introduces the Agentic UAVs framework, a five-layer architecture consisting of Perception, Reasoning, Action, Integration, and Learning. The framework enhances UAV autonomy through LLM-driven reasoning, database querying, and interaction with third-party systems. A prototype built with ROS 2 and Gazebo combines YOLOv11 for object detection with GPT-4 for reasoning and a locally deployed Gemma 3 model. In simulated search-and-rescue scenarios, agentic UAVs achieved higher detection confidence (0.79 compared to 0.72), improved person detection rates (91% compared to 75%), and a major increase in correct action recommendations (92% compared to 4.5%). These results show that modest computational overhead can enable significantly higher levels of autonomy and system-level integration.

[308] Artificial Intelligence Virtual Cells: From Measurements to Decisions across Modality, Scale, Dynamics, and Evaluation

Chengpeng Hu, Calvin Yu-Chian Chen

Main category: cs.AI

TL;DR: AIVCs aim to learn executable cell state models from multimodal data, but current evaluations lack cross-lab/platform transport, systematic handling of dose/time effects, and cross-scale coupling. The paper proposes a Cell-State Latent perspective with operator grammar and decision-aligned evaluation blueprint.

Details

Motivation: Current AI Virtual Cell models have limitations: evaluations are mostly within single datasets, lack cross-laboratory/platform transport, suffer from data leakage/coverage bias, don't systematically handle dose/time/combination effects, and have constrained cross-scale coupling with sparse molecular-cellular-tissue anchors.

Method: Proposes a model-agnostic Cell-State Latent (CSL) perspective organized via an operator grammar: measurement operators, lift/project operators for cross-scale coupling, and intervention operators for dosing and scheduling. Emphasizes decision-aligned evaluation across modality, scale, context, and intervention.

Result: The CSL framework provides a structured approach for learning executable cell state models, with recommendations for operator-aware data design, leakage-resistant data partitions, and transparent calibration/reporting to enable reproducible, like-for-like comparisons.

Conclusion: AIVCs need improved evaluation frameworks that address cross-platform transport, systematic intervention handling, and cross-scale coupling. The proposed CSL perspective with operator grammar and decision-aligned evaluation blueprint can enable more reproducible and clinically relevant cell state modeling.

Abstract: Artificial Intelligence Virtual Cells (AIVCs) aim to learn executable, decision-relevant models of cell state from multimodal, multiscale measurements. Recent studies have introduced single-cell and spatial foundation models, improved cross-modality alignment, scaled perturbation atlases, and explored pathway-level readouts. Nevertheless, although held-out validation is standard practice, evaluations remain predominantly within single datasets and settings; evidence indicates that transport across laboratories and platforms is often limited, that some data splits are vulnerable to leakage and coverage bias, and that dose, time and combination effects are not yet systematically handled. Cross-scale coupling also remains constrained, as anchors linking molecular, cellular and tissue levels are sparse, and alignment to scientific or clinical readouts varies across studies. We propose a model-agnostic Cell-State Latent (CSL) perspective that organizes learning via an operator grammar: measurement, lift/project for cross-scale coupling, and intervention for dosing and scheduling. This view motivates a decision-aligned evaluation blueprint across modality, scale, context and intervention, and emphasizes function-space readouts such as pathway activity, spatial neighborhoods and clinically relevant endpoints. We recommend operator-aware data design, leakage-resistant partitions, and transparent calibration and reporting to enable reproducible, like-for-like comparisons.

[309] MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang

Main category: cs.AI

TL;DR: MARSHAL is an RL framework that uses self-play with strategic LLMs to enhance multi-agent reasoning in both cooperative and competitive games, achieving significant performance improvements in reasoning benchmarks.

Details

Motivation: While RL has been effective for single-agent reasoning, extending it to multi-turn, multi-agent scenarios is challenging due to long-horizon credit assignment and agent-specific advantage estimation problems. There's a need to develop LLMs that can cooperate and compete effectively in multi-agent systems.

Method: MARSHAL is an end-to-end RL framework featuring: 1) turn-level advantage estimator for credit assignment aligned with each interaction, 2) agent-specific advantage normalization to stabilize multi-agent training, and 3) self-play across cooperative and competitive games to develop strategic abilities.

Result: MARSHAL agent trained from Qwen3-4B shows 28.7% performance improvements on held-out games. When integrated into leading MASs, it achieves up to 10.0% improvement on AIME, 6.6% on GPQA-Diamond, and 3.5% average improvement across all benchmarks. The capabilities generalize beyond games to reasoning benchmarks.

Conclusion: End-to-end RL training with self-play in strategic games is a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs, establishing MARSHAL as an effective framework for advancing multi-agent intelligence.

Abstract: Developing large language models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant performance gains of up to 10.0% on AIME, 6.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.

[310] AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys

Siyi Wu, Chiaxin Liang, Ziqian Bi, Leyi Zhao, Tianyang Wang, Junhao Song, Yichao Zhang, Keyu Chen, Benji Peng, Xinyuan Song

Main category: cs.AI

TL;DR: autosurvey2 is an automated pipeline for generating comprehensive academic surveys using retrieval-augmented synthesis and structured evaluation, outperforming existing baselines in coherence and relevance.

Details

Motivation: The rapid growth of research literature, especially in fast-moving fields like LLMs, makes producing comprehensive and current survey papers increasingly difficult and time-consuming.

Method: Multi-stage pipeline with parallel section generation, iterative refinement, real-time retrieval of recent publications, and multi-LLM evaluation framework measuring coverage, structure, and relevance.

Result: autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity.

Conclusion: autosurvey2 provides a scalable, reproducible solution for generating long-form academic surveys and establishes a foundation for future research on automated scholarly writing.

Abstract: The rapid growth of research literature, particularly in large language models (LLMs), has made producing comprehensive and current survey papers increasingly difficult. This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Quality is assessed using a multi-LLM evaluation framework that measures coverage, structure, and relevance in alignment with expert review standards. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity. By combining retrieval, reasoning, and automated evaluation into a unified framework, autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a solid foundation for future research on automated scholarly writing. All code and resources are available at https://github.com/annihi1ation/auto_research.

[311] CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization

Ziqian Bi, Kaijie Chen, Tianyang Wang, Junfeng Hao, Benji Peng, Xinyuan Song

Main category: cs.AI

TL;DR: The paper proposes an adaptive reasoning summarization framework that compresses CoT reasoning traces to enable efficient cross-model transfer, achieving 40% higher accuracy than truncation while reducing evaluation costs by 84%.

Details

Motivation: Chain-of-Thought reasoning improves LLM problem-solving but causes substantial inference overhead, limiting deployment in resource-constrained settings. There's a need for efficient CoT transfer across models of different scales and architectures.

Method: Adaptive reasoning summarization framework with three components: semantic segmentation with importance scoring to identify critical reasoning steps, budget-aware dynamic compression to reduce token usage, and coherence reconstruction to maintain reasoning flow. Includes Gaussian Process-based Bayesian optimization for efficient evaluation.

Result: On 7,501 medical exam questions across 10 specialties: up to 40% higher accuracy than truncation under same token budgets. Evaluated 64 model pairs from 8 LLMs (1.5B-32B parameters) showing strong cross-model transferability. Bayesian optimization reduced evaluation cost by 84% and revealed power-law relationship between model size and cross-domain robustness.

Conclusion: Reasoning summarization provides a practical path toward efficient CoT transfer, enabling advanced reasoning under tight computational constraints. The method preserves critical reasoning steps while significantly reducing token usage, demonstrating strong cross-model applicability.

Abstract: Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead, limiting deployment in resource-constrained settings. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework. The proposed method compresses reasoning traces via semantic segmentation with importance scoring, budget-aware dynamic compression, and coherence reconstruction, preserving critical reasoning steps while significantly reducing token usage. Experiments on 7{,}501 medical examination questions across 10 specialties show up to 40% higher accuracy than truncation under the same token budgets. Evaluations on 64 model pairs from eight LLMs (1.5B-32B parameters, including DeepSeek-R1 and Qwen3) confirm strong cross-model transferability. Furthermore, a Gaussian Process-based Bayesian optimization module reduces evaluation cost by 84% and reveals a power-law relationship between model size and cross-domain robustness. These results demonstrate that reasoning summarization provides a practical path toward efficient CoT transfer, enabling advanced reasoning under tight computational constraints. Code will be released upon publication.

[312] WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Hall, Elissa Li, Shane Moon, Nicolas Scheffer, Kirmani Ahmed, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Xin Luna Dong

Main category: cs.AI

TL;DR: WearVQA is the first benchmark for evaluating VQA capabilities of multimodal AI assistants on wearable devices, featuring 2,520 ego-centric image-question-answer triplets with realistic challenges like occlusion, poor lighting, and blurry images.

Details

Motivation: Existing VQA benchmarks focus on high-quality third-person imagery, but wearable devices face unique challenges with ego-centric views that are often occluded, poorly lit, unzoomed, or blurry. There's a need for a benchmark that reflects real-world wearable use cases to guide development of robust multimodal AI systems for wearables.

Method: Created a benchmark with 2,520 carefully curated image-question-answer triplets spanning 7 diverse image domains (text-centric and general scenes), 10 cognitive task types (from basic recognition to various reasoning tasks), and 6 common wearable-specific image quality issues. All questions are answerable using only visual input and common sense. Paired with an LLM-as-a-judge evaluation framework achieving 96% labeling accuracy.

Result: Open-source and proprietary multimodal LLMs achieved low QA accuracy of 24-52% on WearVQA, with substantial performance drops on lower-quality images and reasoning-heavy tasks. The benchmark proved to be challenging and comprehensive for current AI systems.

Conclusion: WearVQA serves as a comprehensive and challenging benchmark that reveals significant gaps in current multimodal AI systems’ ability to handle real-world wearable scenarios. It positions itself as an essential tool for guiding technical advancement toward robust, real-world multimodal wearables AI systems.

Abstract: We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multi-model AI assistant on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of ego-centric interaction-where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common senses. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-model LLMs achieved a QA accuracy as low as 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multi-model wearables AI systems.

[313] The Necessity of Imperfection:Reversing Model Collapse via Simulating Cognitive Boundedness

Zhongjie Jiang

Main category: cs.AI

TL;DR: Synthetic data optimized for statistical smoothness removes human-like irregularities, accelerating model collapse. PMCSF framework simulates cognitive processes to generate text with human imperfections, validated through cognitive and functional tests showing superior performance.

Details

Motivation: Current synthetic data production focuses on statistical smoothness, which systematically removes the long-tail, cognitively grounded irregularities that characterize human text. This leads to model collapse when models are trained on such impoverished data.

Method: Proposes Prompt-driven Cognitive Computing Framework (PMCSF) with two core components: Cognitive State Decoder (CSD) that reverse-engineers text into structured cognitive vectors, and Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators.

Result: CTE text shows Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves ICC > 0.9 for cognitive profile alignment. In functional tests, strategies using CTE-generated data reduce maximum drawdown by 47.4% during the 2015 market crash and deliver 8.6% Defensive Alpha.

Conclusion: Modelling human cognitive limitations rather than copying surface data enables synthetic data with genuine functional gain, offering a viable technical pathway to resolve the AI data-collapse crisis.

Abstract: Although synthetic data is widely promoted as a remedy, its prevailing production paradigm – one optimizing for statistical smoothness – systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations – not copying surface data – enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.

[314] Enhanced Conditional Generation of Double Perovskite by Knowledge-Guided Language Model Feedback

Inhyo Lee, Junhyeong Lee, Jongwon Park, KyungTae Lim, Seunghwa Ryu

Main category: cs.AI

TL;DR: A multi-agent text gradient framework integrates LLM self-evaluation, domain knowledge, and ML feedback to generate stable double perovskite compositions under natural-language conditions, achieving 98% validity and 54% stable/metastable candidates.

Details

Motivation: Double perovskites have vast design space for sustainable energy applications, but conditional materials discovery is challenging. Current methods struggle with generating physically meaningful compositions under specific conditions.

Method: Multi-agent framework with text gradients using three feedback sources: LLM self-evaluation, domain knowledge-informed feedback, and ML surrogate-based feedback. Compares three configurations: pure LLM, LLM with reasoning feedback, and LLM with domain knowledge guidance.

Result: Domain knowledge-guided gradients achieve over 98% compositional validity and up to 54% stable/metastable candidates, surpassing LLM-only baseline (43%) and prior GAN-based results (27%). ML gradients work well in-distribution but unreliable out-of-distribution.

Conclusion: First systematic analysis of multi-agent, knowledge-guided text gradients for double perovskite discovery. Provides generalizable blueprint for MAS-driven generative materials design to advance sustainable technologies.

Abstract: Double perovskites (DPs) are promising candidates for sustainable energy technologies due to their compositional tunability and compatibility with low-energy fabrication, yet their vast design space poses a major challenge for conditional materials discovery. This work introduces a multi-agent, text gradient-driven framework that performs DP composition generation under natural-language conditions by integrating three complementary feedback sources: LLM-based self-evaluation, DP-specific domain knowledge-informed feedback, and ML surrogate-based feedback. Analogous to how knowledge-informed machine learning improves the reliability of conventional data-driven models, our framework incorporates domain-informed text gradients to guide the generative process toward physically meaningful regions of the DP composition space. Systematic comparison of three incremental configurations, (i) pure LLM generation, (ii) LLM generation with LLM reasoning-based feedback, and (iii) LLM generation with domain knowledge-guided feedback, shows that iterative guidance from knowledge-informed gradients improves stability-condition satisfaction without additional training data, achieving over 98% compositional validity and up to 54% stable or metastable candidates, surpassing both the LLM-only baseline (43%) and prior GAN-based results (27%). Analyses of ML-based gradients further reveal that they enhance performance in in-distribution (ID) regions but become unreliable in out-of-distribution (OOD) regimes. Overall, this work provides the first systematic analysis of multi-agent, knowledge-guided text gradients for DP discovery and establishes a generalizable blueprint for MAS-driven generative materials design aimed at advancing sustainable technologies.

[315] H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs

Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, Maosong Sun

Main category: cs.AI

TL;DR: The paper identifies a sparse subset of neurons (<0.1% of total) that reliably predict hallucinations in LLMs, shows they cause over-compliance behaviors, and traces their origins to pre-training.

Details

Motivation: LLMs frequently generate hallucinations (plausible but factually incorrect outputs), undermining reliability. While prior work examined hallucinations from macroscopic perspectives, the underlying neuron-level mechanisms remain largely unexplored.

Method: Systematic investigation into hallucination-associated neurons (H-Neurons) from three perspectives: identification (finding sparse neuron subsets that predict hallucinations), behavioral impact (controlled interventions to test causal links), and origins (tracing neurons back to pre-trained base models).

Result: A remarkably sparse subset of neurons (<0.1% of total) can reliably predict hallucination occurrences with strong generalization across scenarios. These neurons are causally linked to over-compliance behaviors and originate from pre-training (remain predictive in base models).

Conclusion: The findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs by understanding the neuron-level origins of hallucinations.

Abstract: Large language models (LLMs) frequently generate hallucinations – plausible but factually incorrect outputs – undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.

[316] From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

Main category: cs.AI

TL;DR: RL synthesizes new reasoning skills from atomic primitives rather than just amplifying existing behaviors, but only when models first master atomic skills via SFT.

Details

Motivation: To understand whether RL contributes to reasoning by synthesizing new skills or merely amplifying existing behaviors, using Complementary Reasoning as a test case that requires integrating internal knowledge with external information.

Method: Used controlled synthetic dataset of human biographies to decouple Complementary Reasoning into two atomic skills: Parametric Reasoning (internal knowledge) and Contextual Reasoning (external information). Evaluated generalization across I.I.D., Composition, and Zero-shot settings, comparing SFT and RL approaches.

Result: SFT achieves near-perfect in-distribution accuracy but collapses on O.O.D. generalization (SFT Generalization Paradox). RL acts as a reasoning synthesizer rather than probability amplifier, but only when base model first masters atomic skills via SFT. RL can synthesize complex strategies from learned primitives without explicit supervision on such complex strategies.

Conclusion: RL actively synthesizes complex reasoning strategies from atomic primitives, challenging the view of RL as mere amplifier. Decoupled atomic training followed by RL offers scalable path to generalization for complex reasoning tasks.

Abstract: The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.

cs.SD

[317] Story2MIDI: Emotionally Aligned Music Generation from Text

Mohammad Shokri, Alexandra C. Salem, Gabriel Levine, Johanna Devaney, Sarah Ita Levitan

Main category: cs.SD

TL;DR: Story2MIDI is a Transformer model that generates emotion-aligned music from text, trained on a merged dataset of text sentiment and music emotion pairs.

Details

Motivation: To create a system that can generate music that matches the emotional content of a given text, bridging the gap between textual sentiment and musical emotion expression.

Method: Sequence-to-sequence Transformer model trained on the Story2MIDI dataset, which combines existing text sentiment analysis and music emotion classification datasets to create text-music emotion pairs.

Result: The model effectively learns emotion-relevant musical features and produces diverse emotional responses, validated through objective musical metrics and human listening studies.

Conclusion: Despite limited dataset size and computational resources, Story2MIDI successfully generates emotion-aligned music from text, demonstrating the feasibility of cross-modal emotion transfer between text and music.

Abstract: In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener. Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model’s ability to capture intended emotional cues.

[318] Continual Learning for Singing Voice Separation with Human in the Loop Adaptation

Ankur Gupta, Anshul Rai, Archit Bansal, Vipul Arora

Main category: cs.SD

TL;DR: Interactive continual learning framework for singing voice separation that allows user feedback to fine-tune models for new songs, improving performance over base U-Net model.

Details

Motivation: Most deep learning singing voice separation models lack user interaction capabilities, which is crucial for real-world deployment where music tracks can differ from training data in genre and instruments. There's a need for models that can adapt to new target songs through user feedback.

Method: U-Net-based base model for vocal separation via spectrogram masking, followed by human-in-the-loop feedback where users mark false positives (vocal regions that should be silence). Two continual learning algorithms are proposed to incorporate this feedback for model fine-tuning.

Result: Experiments show improvement in singing voice separation performance by the proposed algorithms over the base model in both intra-dataset and inter-dataset settings.

Conclusion: The interactive continual learning framework successfully enables user-guided adaptation of singing voice separation models to new target songs, addressing the limitations of static models in real-world scenarios with diverse music content.

Abstract: Deep learning-based works for singing voice separation have performed exceptionally well in the recent past. However, most of these works do not focus on allowing users to interact with the model to improve performance. This can be crucial when deploying the model in real-world scenarios where music tracks can vary from the original training data in both genre and instruments. In this paper, we present a deep learning-based interactive continual learning framework for singing voice separation that allows users to fine-tune the vocal separation model to conform it to new target songs. We use a U-Net-based base model architecture that produces a mask for separating vocals from the spectrogram, followed by a human-in-the-loop task where the user provides feedback by marking a few false positives, i.e., regions in the extracted vocals that should have been silence. We propose two continual learning algorithms. Experiments substantiate the improvement in singing voice separation performance by the proposed algorithms over the base model in intra-dataset and inter-dataset settings.

[319] VibOmni: Towards Scalable Bone-conduction Speech Enhancement on Earables

Lixing He, Yunqi Guo, Haozheng Hou, Zhenyu Yan

Main category: cs.SD

TL;DR: VibOmni: A lightweight multi-modal speech enhancement system for earables that fuses audio and bone-conducted vibration data from IMUs, achieving significant quality improvements with efficient on-device deployment.

Details

Motivation: Earables face challenges in voice applications due to compact design and ambient noise. Existing systems using only omnidirectional microphones struggle with competing speakers and noise in real-world environments.

Method: Two-branch encoder-decoder DNN fusing audio and vibration features from IMUs. Novel data augmentation using Bone Conduction Functions (BCFs) to generate synthetic vibration data. Multi-modal SNR estimator enables continual learning and adaptive inference without on-device back-propagation.

Result: Achieved 21% PESQ improvement, 26% SNR gain, ~40% WER reduction with low latency on mobile devices. User study with 35 participants showed 87% preference over baselines. BCF modeling achieved only 4.5% spectrogram similarity error.

Conclusion: VibOmni effectively enhances speech quality in noisy environments for earables by leveraging bone-conducted vibrations, demonstrating practical deployment potential with significant performance improvements and user preference.

Abstract: Earables, such as True Wireless Stereo earphones and VR/AR headsets, are increasingly popular, yet their compact design poses challenges for robust voice-related applications like telecommunication and voice assistant interactions in noisy environments. Existing speech enhancement systems, reliant solely on omnidirectional microphones, struggle with ambient noise like competing speakers. To address these issues, we propose VibOmni, a lightweight, end-to-end multi-modal speech enhancement system for earables that leverages bone-conducted vibrations captured by widely available Inertial Measurement Units (IMUs). VibOmni integrates a two-branch encoder-decoder deep neural network to fuse audio and vibration features. To overcome the scarcity of paired audio-vibration datasets, we introduce a novel data augmentation technique that models Bone Conduction Functions (BCFs) from limited recordings, enabling synthetic vibration data generation with only 4.5% spectrogram similarity error. Additionally, a multi-modal SNR estimator facilitates continual learning and adaptive inference, optimizing performance in dynamic, noisy settings without on-device back-propagation. Evaluated on real-world datasets from 32 volunteers with different devices, VibOmni achieves up to 21% improvement in Perceptual Evaluation of Speech Quality (PESQ), 26% in Signal-to-Noise Ratio (SNR), and about 40% WER reduction with much less latency on mobile devices. A user study with 35 participants showed 87% preferred VibOmni over baselines, demonstrating its effectiveness for deployment in diverse acoustic environments.

Xueyan Li, Yuxin Wang, Mengjie Jiang, Qingzi Zhu, Jiang Zhang, Zoey Kim, Yazhe Niu

Main category: cs.SD

TL;DR: Proposes VocalCritic, a generative feedback framework for singing voice synthesis evaluation that provides multi-dimensional language and audio critiques instead of single numerical scores.

Details

Motivation: Current SVS evaluation methods rely on single numerical scores that fail to capture nuanced aspects like phrasing and expressiveness, require costly annotations, and lack interpretability and generalization.

Method: Uses an audio-language model fine-tuned on hybrid dataset (human music reactions + synthetic critiques from MLLMs) to generate multi-dimensional text and audio critiques covering melody, content, and auditory quality.

Result: Quantitative experiments validate the dataset and training strategy, showing the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement.

Conclusion: The proposed generative feedback framework addresses limitations of current SVS evaluation methods by providing rich, multi-dimensional critiques that enhance interpretability and generalization for model optimization.

Abstract: Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical. However, current methods like reward systems often rely on single numerical scores, struggle to capture various dimensions such as phrasing or expressiveness, and require costly annotations, limiting interpretability and generalization. To address these issues, we propose a generative feedback (i.e., reward model) framework that provides multi-dimensional language and audio feedback for SVS assessment. Our approach leverages an audio-language model to generate text and audio critiques-covering aspects such as melody, content, and auditory quality. The model is fine-tuned on a hybrid dataset combining human music reactions and synthetic critiques from a MLLMs, enhancing diversity and linguistic richness. Quantitative experiments validate the effectiveness of the proposed dataset and training strategy, demonstrating that the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement. The code is at https://github.com/opendilab/VocalCritic

[321] Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.SD

TL;DR: Pianist Transformer: A self-supervised model for expressive music performance rendering using MIDI data without explicit annotation, achieving human-level quality through data and model scaling.

Details

Motivation: Existing methods rely on supervised learning over small labeled datasets, limiting scaling despite availability of vast unlabeled music data. Need to bridge this gap like in vision and language domains.

Method: 1) Unified MIDI representation for learning musical structure and expression without annotation; 2) Efficient asymmetric architecture for longer contexts and faster inference; 3) Self-supervised pre-training with 10B tokens and 135M-parameter model; 4) State-of-the-art performance model.

Result: Achieves strong objective metrics and human-level subjective ratings in expressive performance rendering, establishing scalable path toward human-like performance synthesis.

Conclusion: Pianist Transformer enables data and model scaling advantages for expressive music performance rendering, moving toward human-like performance synthesis in the music domain.

Abstract: Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.

[322] Text-Queried Audio Source Separation via Hierarchical Modeling

Xinlei Yin, Xiulian Peng, Xue Jiang, Zhiwei Xiong, Yan Lu

Main category: cs.SD

TL;DR: HSM-TSS: A hierarchical decomposition framework for target audio source separation using natural language queries, featuring dual-stage semantic separation and instruction processing for flexible sound manipulation.

Details

Motivation: Existing methods struggle with jointly modeling acoustic-textual alignment and semantic-aware separation in single-stage architectures, and require large-scale labeled data due to inefficient cross-modal learning. There's a need for more efficient and effective audio-text separation approaches.

Method: Proposes HSM-TSS framework with hierarchical decomposition: 1) Global-semantic separation using Q-Audio architecture for audio-text alignment, 2) Local-semantic separation on AudioMAE features preserving time-frequency structures, 3) Instruction processing pipeline to parse text queries into structured operations (extraction/removal) with audio descriptions.

Result: Achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.

Conclusion: HSM-TSS effectively addresses challenges in audio-text separation by decoupling semantic guidance from acoustic reconstruction, enabling flexible sound manipulation with natural language queries while reducing data requirements.

Abstract: Target audio source separation with natural language queries presents a promising paradigm for extracting arbitrary audio events through arbitrary text descriptions. Existing methods mainly face two challenges, the difficulty in jointly modeling acoustic-textual alignment and semantic-aware separation within a blindly-learned single-stage architecture, and the reliance on large-scale accurately-labeled training data to compensate for inefficient cross-modal learning and separation. To address these challenges, we propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. Our approach introduces a dual-stage mechanism for semantic separation, operating on distinct global and local semantic feature spaces. We first perform global-semantic separation through a global semantic feature space aligned with text queries. A Q-Audio architecture is employed to align audio and text modalities, serving as pretrained global-semantic encoders. Conditioned on the predicted global feature, we then perform the second-stage local-semantic separation on AudioMAE features that preserve time-frequency structures, followed by acoustic reconstruction. We also propose an instruction processing pipeline to parse arbitrary text queries into structured operations, extraction or removal, coupled with audio descriptions, enabling flexible sound manipulation. Our method achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.

[323] SAND Challenge: Four Approaches for Dysartria Severity Classification

Gauri Deshpande, Harish Battula, Ashish Panda, Sunil Kumar Kopparapu

Main category: cs.SD

TL;DR: The paper compares four modeling approaches for dysarthria severity classification in the SAND challenge, finding that a feature-engineered XGBoost ensemble achieves the best performance (macro-F1 0.86), while deep learning models (ViT, CNN, BiLSTM) achieve competitive F1-scores (0.70) and provide complementary insights.

Details

Motivation: To conduct a unified comparative study of different modeling approaches for classifying dysarthria severity in the SAND challenge, examining both traditional feature-engineered methods and modern deep learning techniques to understand their relative strengths and contributions.

Method: Four distinct modeling approaches were investigated: (1) ViT-OF using Vision Transformer on spectrogram images, (2) 1D-CNN with eight 1-D CNNs using majority-vote fusion, (3) BiLSTM-OF with nine BiLSTM models using majority vote fusion, and (4) Hierarchical XGBoost ensemble combining glottal and formant features through a two-stage learning framework. All models performed the same five-class classification task using the common SAND dataset.

Result: The feature-engineered XGBoost ensemble achieved the highest macro-F1 score of 0.86 on the validation set of 53 speakers. The deep learning models (ViT, CNN, BiLSTM) attained competitive F1-scores around 0.70. All approaches provided complementary insights into the dysarthria severity classification problem.

Conclusion: While the traditional feature-engineered XGBoost ensemble outperformed deep learning approaches for dysarthria severity classification, deep learning models offer valuable complementary insights and competitive performance, suggesting that different modeling paradigms can contribute unique perspectives to the problem.

Abstract: This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNN’s with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.

[324] Exploring Definitions of Quality and Diversity in Sonic Measurement Spaces

Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette

Main category: cs.SD

TL;DR: Unsupervised dimensionality reduction (PCA/autoencoders) for dynamic behavior space reconfiguration in quality diversity evolutionary algorithms achieves greater sonic diversity than handcrafted descriptors while avoiding expert biases.

Details

Motivation: Existing QD methods use handcrafted descriptors or supervised classifiers that introduce unintended exploration biases and constrain discovery to familiar sonic regions. Need automated approaches to explore vast digital sound synthesis parameter spaces without manual intervention.

Method: Apply PCA and autoencoders to project high-dimensional audio features onto structured grids for MAP-Elites QD algorithm. Implement dynamic reconfiguration through model retraining at regular intervals during evolutionary search.

Result: Automatic approaches achieve significantly greater diversity than handcrafted behavior spaces while avoiding expert-imposed biases. Dynamic reconfiguration maintains evolutionary pressure and prevents stagnation, with PCA proving most effective among dimensionality reduction techniques.

Conclusion: Unsupervised dimensionality reduction enables automated sonic discovery systems capable of exploring vast parameter spaces without manual intervention or supervised training constraints, contributing to more open-ended creative exploration.

Abstract: Digital sound synthesis presents the opportunity to explore vast parameter spaces containing millions of configurations. Quality diversity (QD) evolutionary algorithms offer a promising approach to harness this potential, yet their success hinges on appropriate sonic feature representations. Existing QD methods predominantly employ handcrafted descriptors or supervised classifiers, potentially introducing unintended exploration biases and constraining discovery to familiar sonic regions. This work investigates unsupervised dimensionality reduction methods for automatically defining and dynamically reconfiguring sonic behaviour spaces during QD search. We apply Principal Component Analysis (PCA) and autoencoders to project high-dimensional audio features onto structured grids for MAP-Elites, implementing dynamic reconfiguration through model retraining at regular intervals. Comparison across two experimental scenarios shows that automatic approaches achieve significantly greater diversity than handcrafted behaviour spaces while avoiding expert-imposed biases. Dynamic behaviour-space reconfiguration maintains evolutionary pressure and prevents stagnation, with PCA proving most effective among the dimensionality reduction techniques. These results contribute to automated sonic discovery systems capable of exploring vast parameter spaces without manual intervention or supervised training constraints.

[325] IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators

Jonas Länzlinger, Katharina Müller, Burkhard Stiller, Bruno Rodrigues

Main category: cs.SD

TL;DR: IHearYou is a privacy-preserving, on-device system that uses passive speech analysis to detect depression by linking acoustic features to clinically interpretable DSM-5 indicators.

Details

Motivation: Current depression diagnosis relies on subjective self-reports and interviews that may not capture authentic behavior. There's a need for objective, passive detection methods that preserve privacy and provide clinically interpretable results.

Method: Uses passive voice sensing in household environments, extracts acoustic features, and links them to DSM-5 indicators through a structured Linkage Framework. Runs locally on commodity hardware with privacy protection. Includes configuration-driven protocol with FDR correction and gender-stratified testing for reproducibility.

Result: Applied to DAIC-WOZ dataset showing directionally consistent feature-indicator associations. TESS-based audio streaming experiment validated end-to-end feasibility. System successfully transforms passive voice sensing into explainable DSM-5 indicator scores.

Conclusion: IHearYou bridges the gap between black-box detection and clinically interpretable analysis, demonstrating how passive voice sensing can provide objective, privacy-preserving depression assessment with clinical relevance.

Abstract: Depression affects over millions people worldwide, yet diagnosis still relies on subjective self-reports and interviews that may not capture authentic behavior. We present IHearYou, an approach to automated depression detection focused on speech acoustics. Using passive sensing in household environments, IHearYou extracts voice features and links them to DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) indicators through a structured Linkage Framework instantiated for Major Depressive Disorder. The system runs locally to preserve privacy and includes a persistence schema and dashboard, presenting real-time throughput on a commodity laptop. To ensure reproducibility, we define a configuration-driven protocol with False Discovery Rate (FDR) correction and gender-stratified testing. Applied to the DAIC-WOZ dataset, this protocol reveals directionally consistent feature-indicator associations, while a TESS-based audio streaming experiment validates end-to-end feasibility. Our results show how passive voice sensing can be turned into explainable DSM-5 indicator scores, bridging the gap between black-box detection and clinically interpretable, on-device analysis.

cs.LG

[326] An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction

Md. Najmul Islam, Md. Miner Hossain Rimon, Shah Sadek-E-Akbor Shamim, Zarif Mohaimen Fahad, Md. Jehadul Islam Mony, Md. Jalal Uddin Chowdhury

Main category: cs.LG

TL;DR: The paper presents a machine learning framework for early diabetes prediction using BRFSS health survey data, achieving high accuracy with ensemble methods and developing a mobile application for practical use.

Details

Motivation: Diabetes is a serious global health issue requiring early detection, but prediction is challenging due to overlapping risk factors and data asymmetry. The research aims to create an accurate and interpretable ML framework using extensive health survey data to support clinical decision-making.

Method: Used 2015 BRFSS dataset with ~253,680 records and 22 numerical features. Applied SMOTE and Tomek Links to address class imbalance. Evaluated multiple supervised learning techniques including Random Forest, XGBoost, CatBoost, and LightGBM. Investigated both individual models and ensemble methods like stacking.

Result: Individual models achieved strong ROC-AUC of ~0.96. The stacking ensemble with XGBoost and KNN performed best with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991. Developed a React Native mobile app with Python Flask backend for practical diabetes prediction.

Conclusion: The proposed ML framework effectively predicts diabetes with high accuracy, and the developed mobile application provides an accessible health monitoring tool for early diabetes detection, supporting clinical decision-making.

Abstract: Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and data asymmetry make prediction difficult. To use extensive health survey data to create a machine learning framework for diabetes classification that is both accurate and comprehensible, to produce results that will aid in clinical decision-making. Using the BRFSS dataset, we assessed a number of supervised learning techniques. SMOTE and Tomek Links were used to correct class imbalance. To improve prediction performance, both individual models and ensemble techniques such as stacking were investigated. The 2015 BRFSS dataset, which includes roughly 253,680 records with 22 numerical features, is used in this study. Strong ROC-AUC performance of approximately 0.96 was attained by the individual models Random Forest, XGBoost, CatBoost, and LightGBM.The stacking ensemble with XGBoost and KNN yielded the best overall results with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, indicating a favourable balance between recall and precision. In our study, we proposed and developed a React Native-based application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.

[327] Pharmacophore-based design by learning on voxel grids

Omar Mahmood, Pedro O. Pinheiro, Richard Bonneau, Saeed Saremi, Vishnu Sresht

Main category: cs.LG

TL;DR: VoxCap is a voxel-based generative model that generates SMILES strings from voxelized molecular representations, addressing scaling and fecundity issues in conventional pharmacophore-based virtual screening.

Details

Motivation: Traditional ligand-based drug discovery using pharmacophore-shape similarity searches scales poorly with library sizes and is limited to existing compounds, creating bottlenecks in hit diversification and scaffold hopping.

Method: Proposes VoxCap, a voxel captioning method that generates SMILES strings from voxelized molecular representations. Introduces two workflows: de-novo design for generating new molecules with high pharmacophore-shape similarity to queries, and fast search combining generative design with cheap 2D substructure similarity.

Result: VoxCap significantly outperforms previous methods in generating diverse de-novo hits. When combined with the fast search workflow, it reduces computational time by orders of magnitude while returning hits for all query molecules, enabling search of large libraries intractable to brute force methods.

Conclusion: VoxCap addresses key limitations of conventional pharmacophore-based virtual screening by enabling efficient generation of diverse molecules and scalable search of large chemical libraries through innovative voxel-based generative modeling.

Abstract: Ligand-based drug discovery (LBDD) relies on making use of known binders to a protein target to find structurally diverse molecules similarly likely to bind. This process typically involves a brute force search of the known binder (query) against a molecular library using some metric of molecular similarity. One popular approach overlays the pharmacophore-shape profile of the known binder to 3D conformations enumerated for each of the library molecules, computes overlaps, and picks a set of diverse library molecules with high overlaps. While this virtual screening workflow has had considerable success in hit diversification, scaffold hopping, and patent busting, it scales poorly with library sizes and restricts candidate generation to existing library compounds. Leveraging recent advances in voxel-based generative modelling, we propose a pharmacophore-based generative model and workflows that address the scaling and fecundity issues of conventional pharmacophore-based virtual screening. We introduce \emph{VoxCap}, a voxel captioning method for generating SMILES strings from voxelised molecular representations. We propose two workflows as practical use cases as well as benchmarks for pharmacophore-based generation: \emph{de-novo} design, in which we aim to generate new molecules with high pharmacophore-shape similarities to query molecules, and fast search, which aims to combine generative design with a cheap 2D substructure similarity search for efficient hit identification. Our results show that VoxCap significantly outperforms previous methods in generating diverse \textit{de-novo} hits. When combined with our fast search workflow, VoxCap reduces computational time by orders of magnitude while returning hits for all query molecules, enabling the search of large libraries that are intractable to search by brute force.

[328] PIBNet: a Physics-Inspired Boundary Network for Multiple Scattering Simulations

Rémi Marsal, Stéphanie Chaillat

Main category: cs.LG

TL;DR: PIBNet: A learning-based graph neural network that approximates boundary element method solutions for multiple scattering problems, outperforming existing methods and generalizing well to more obstacles.

Details

Motivation: The boundary element method (BEM) is efficient for multiple scattering in unbounded domains but has computational bottlenecks in solving boundary integral equations. A learning-based approach could accelerate this first step while maintaining accuracy.

Method: PIBNet uses a physics-inspired graph-based strategy to model obstacles and their long-range interactions, with a novel multiscale graph neural network architecture specifically designed for multiple scattering simulation.

Result: PIBNet surpasses existing state-of-the-art learning-based methods on multiple scattering tasks and shows superior generalization to settings with increased numbers of obstacles.

Conclusion: PIBNet provides an effective learning-based alternative to accelerate the computational bottleneck in BEM for multiple scattering problems, with strong performance and generalization capabilities.

Abstract: The boundary element method (BEM) provides an efficient numerical framework for solving multiple scattering problems in unbounded homogeneous domains, since it reduces the discretization to the domain boundaries, thereby condensing the computational complexity. The procedure first consists in determining the solution trace on the boundaries of the domain by solving a boundary integral equation, after which the volumetric solution can be recovered at low computational cost with a boundary integral representation. As the first step of the BEM represents the main computational bottleneck, we introduce PIBNet, a learning-based approach designed to approximate the solution trace. The method leverages a physics-inspired graph-based strategy to model obstacles and their long-range interactions efficiently. Then, we introduce a novel multiscale graph neural network architecture for simulating the multiple scattering. To train and evaluate our network, we present a benchmark consisting of several datasets of different types of multiple scattering problems. The results indicate that our approach not only surpasses existing state-of-the-art learning-based methods on the considered tasks but also exhibits superior generalization to settings with an increased number of obstacles. github.com/ENSTA-U2IS-AI/pibnet

[329] Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration

M. A. Gameiro

Main category: cs.LG

TL;DR: SFL Transformer integrates structural features into BERT’s self-attention via Contextual Gating, achieving state-of-the-art accuracy (0.9910) and calibration for lyrical content classification.

Details

Motivation: To improve feature fusion for lyrical content classification by moving beyond simple output-layer fusion and instead integrating auxiliary structural features directly into the Transformer's attention mechanism for better synergy between structural and semantic information.

Method: Proposes SFL Transformer with Contextual Gating mechanism that modulates hidden states within BERT encoder stack using low-dimensional structural features, rather than fusing at final output layer. Applied to binary classification of UMAP-reduced lyrical embeddings.

Result: Achieved Accuracy 0.9910 and Macro F1 0.9910, surpassing previous SFL model (0.9894). Maintained exceptional reliability with low Expected Calibration Error (0.0081) and Log Loss (0.0489).

Conclusion: Injecting auxiliary structural context mid-stack is the most effective approach for combining structural and semantic information, creating models with superior discriminative power and high-fidelity probability estimates.

Abstract: This study introduces a significant architectural advancement in feature fusion for lyrical content classification by integrating auxiliary structural features directly into the self-attention mechanism of a pre-trained Transformer. I propose the SFL Transformer, a novel deep learning model that utilizes a Contextual Gating mechanism (an Intermediate SFL) to modulate the sequence of hidden states within the BERT encoder stack, rather than fusing features at the final output layer. This approach modulates the deep, contextualized semantic features (Hseq) using low-dimensional structural cues (Fstruct). The model is applied to a challenging binary classification task derived from UMAP-reduced lyrical embeddings. The SFL Transformer achieved an Accuracy of 0.9910 and a Macro F1 score of 0.9910, significantly improving the state-of-the-art established by the previously published SFL model (Accuracy 0.9894). Crucially, this Contextual Gating strategy maintained exceptional reliability, with a low Expected Calibration Error (ECE = 0.0081) and Log Loss (0.0489). This work validates the hypothesis that injecting auxiliary context mid-stack is the most effective means of synergistically combining structural and semantic information, creating a model with both superior discriminative power and high-fidelity probability estimates.

[330] Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering

Haoxiang Zhang, Ruihao Yuan, Lihui Zhang, Yushi Luo, Qiang Zhang, Pan Ding, Xiaodong Ren, Weijie Xing, Niu Gao, Jishan Chen, Chubo Zhang

Main category: cs.LG

TL;DR: An explainable few-shot AI framework for engineering that combines physics-informed synthetic data generation with symbolic regression to discover constitutive equations from only 32 experimental samples, achieving 88% accuracy in predicting hot-cracking in aerospace superalloy welding.

Details

Motivation: Addresses two key bottlenecks in industrial AI adoption: scarce high-quality data and lack of interpretability in black-box models, particularly critical in safety-sensitive sectors like aerospace where physical understanding is available but data is limited.

Method: Three-stage physics-informed synthetic data augmentation (differentiated noise injection, hard physical constraints, inter-parameter preservation) followed by nested optimization: symbolic regression for equation structure discovery and differential evolution for parameter optimization, with hybrid global-local refinement.

Result: Achieved 88% accuracy in predicting hot-cracking tendency in K439B superalloy castings repair welding, with the discovered constitutive equation providing both quantitative predictions and explicit physical insights into thermal, geometric, and metallurgical coupling mechanisms.

Conclusion: Provides a blueprint for trustworthy AI systems that embed engineering domain knowledge directly into architecture, enabling reliable adoption in high-stakes applications where data is limited but physical understanding exists, with the constitutive equation serving as a multi-functional tool for process optimization and virtual data generation.

Abstract: The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models-particularly critical in safety-sensitive sectors like aerospace. We present an explainable, few-shot AI4E framework that is systematically informed by physics and expert knowledge throughout its architecture. Starting from only 32 experimental samples in an aerial K439B superalloy castings repair welding case, we first augment physically plausible synthetic data through a three-stage protocol: differentiated noise injection calibrated to process variabilities, enforcement of hard physical constraints, and preservation of inter-parameter relationships. We then employ a nested optimization strategy for constitutive model discovery, where symbolic regression explores equation structures while differential evolution optimizes parameters, followed by intensive parameter refinement using hybrid global-local optimization. The resulting interpretable constitutive equation achieves 88% accuracy in predicting hot-cracking tendency. This equation not only provides quantitative predictions but also delivers explicit physical insight, revealing how thermal, geometric, and metallurgical mechanisms couple to drive cracking-thereby advancing engineers’ cognitive understanding of the process. Furthermore, the constitutive equation serves as a multi-functional tool for process optimization and high-fidelity virtual data generation, enabling accuracy improvements in other data-driven models. Our approach provides a general blueprint for developing trustworthy AI systems that embed engineering domain knowledge directly into their architecture, enabling reliable adoption in high-stakes industrial applications where data is limited but physical understanding is available.

[331] How Market Volatility Shapes Algorithmic Collusion: A Comparative Analysis of Learning-Based Pricing Algorithms

Aheer Sravon, Md. Ibrahim, Devdyuti Mazumder, Ridwan Al Aziz

Main category: cs.LG

TL;DR: Four pricing algorithms (Q-Learning, PSO, Double DQN, DDPG) are analyzed across three duopoly models under stochastic demand shocks, revealing DDPG as most collusive, with market structure and demand uncertainty critically shaping algorithmic competition outcomes.

Details

Motivation: Autonomous pricing algorithms increasingly influence digital market competition, but their behavior under realistic demand conditions remains largely unexamined, creating a need to understand how algorithms, market structure, and stochastic demand interact to shape competitive outcomes.

Method: Analyzed four pricing algorithms (Q-Learning, Particle Swarm Optimization, Double Deep Q-Network, Deep Deterministic Policy Gradient) across three classic duopoly models (Logit, Hotelling, Linear) under various demand-shock regimes created by auto-regressive processes, using profit- and price-based collusion indices.

Result: Reinforcement-learning algorithms often sustain supra-competitive prices under stable demand, with DDPG showing strongest collusive tendencies. Demand shocks produce varied effects: Logit markets decline significantly, Hotelling markets remain stable, and Linear markets experience profit inflation. Algorithm rankings remain consistent across environments.

Conclusion: Market structure and demand uncertainty are critical factors shaping algorithmic competition outcomes, with important implications for policy discussions surrounding autonomous pricing behavior in digital markets.

Abstract: Autonomous pricing algorithms are increasingly influencing competition in digital markets; however, their behavior under realistic demand conditions remains largely unexamined. This paper offers a thorough analysis of four pricing algorithms – Q-Learning, PSO, Double DQN, and DDPG – across three classic duopoly models (Logit, Hotelling, Linear) and under various demand-shock regimes created by auto-regressive processes. By utilizing profit- and price-based collusion indices, we investigate how the interactions among algorithms, market structure, and stochastic demand collaboratively influence competitive outcomes. Our findings reveal that reinforcement-learning algorithms often sustain supra-competitive prices under stable demand, with DDPG demonstrating the most pronounced collusive tendencies. Demand shocks produce notably varied effects: Logit markets suffer significant performance declines, Hotelling markets remain stable, and Linear markets experience shock-induced profit inflation. Despite marked changes in absolute performance, the relative rankings of the algorithms are consistent across different environments. These results underscore the critical importance of market structure and demand uncertainty in shaping algorithmic competition, while also contributing to the evolving policy discussions surrounding autonomous pricing behavior.

[332] Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting

Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen

Main category: cs.LG

TL;DR: Ada-MoGE: Adaptive Gaussian Mixture of Experts model for multivariate time series forecasting that dynamically adjusts expert count based on spectral distribution to address frequency coverage imbalance.

Details

Motivation: Traditional MoE models with fixed expert counts struggle with shifting frequency distributions in time series data, causing frequency coverage imbalance - too few experts miss critical information while too many introduce noise.

Method: Ada-MoGE integrates spectral intensity and frequency response to adaptively determine optimal number of experts. Uses Gaussian band-pass filtering for smooth frequency decomposition instead of direct band truncation to avoid noise.

Result: Achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.

Conclusion: Proposed adaptive expert selection mechanism effectively addresses frequency coverage imbalance in time series forecasting, outperforming traditional fixed-expert approaches while maintaining parameter efficiency.

Abstract: Multivariate time series forecasts are widely used, such as industrial, transportation and financial forecasts. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in frequency coverage imbalance issue. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data’s frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.

[333] WhAM: Towards A Translative Model of Sperm Whale Vocalization

Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F. Gruber, Shane Gero, Shafi Goldwasser

Main category: cs.LG

TL;DR: WhAM is a transformer-based model that generates synthetic sperm whale codas from any audio prompt by finetuning VampNet on whale click recordings.

Details

Motivation: Sperm whale communication through codas is not well understood, and there's a need for tools to generate synthetic whale vocalizations for research and analysis purposes.

Method: Finetune VampNet (masked acoustic token model pretrained on music) on 10k coda recordings using iterative masked token prediction to generate synthetic codas.

Result: WhAM generates high-fidelity synthetic codas preserving acoustic features, achieves strong performance on downstream classification tasks (rhythm, social unit, vowel), and performs well in perceptual studies with marine biologists.

Conclusion: WhAM successfully generates realistic synthetic sperm whale codas and learns useful representations for classification tasks, providing a valuable tool for whale communication research.

Abstract: Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM’s synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM’s learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham

[334] DPWMixer: Dual-Path Wavelet Mixer for Long-Term Time Series Forecasting

Li Qianyang, Zhang Xingjun, Wang Shaoxun, Wei Jia

Main category: cs.LG

TL;DR: DPWMixer: A dual-path architecture for long-term time series forecasting using lossless wavelet decomposition to separate trends and fluctuations, with specialized mixers for each component and adaptive multi-scale fusion.

Details

Motivation: Transformers have quadratic complexity and overfit sparse data, linear models can't capture non-linear dynamics, and existing multi-scale frameworks lose high-frequency information through average pooling.

Method: Uses Lossless Haar Wavelet Pyramid instead of pooling for orthogonal decomposition of trends/fluctuations, Dual-Path Trend Mixer with global linear mapping and patch-based MLP-Mixer, and adaptive multi-scale fusion weighted by channel stationarity.

Result: Achieves consistent improvement over state-of-the-art baselines on eight public benchmarks.

Conclusion: DPWMixer provides an efficient solution for LTSF by addressing limitations of existing approaches through lossless decomposition and specialized dual-path processing.

Abstract: Long-term time series forecasting (LTSF) is a critical task in computational intelligence. While Transformer-based models effectively capture long-range dependencies, they often suffer from quadratic complexity and overfitting due to data sparsity. Conversely, efficient linear models struggle to depict complex non-linear local dynamics. Furthermore, existing multi-scale frameworks typically rely on average pooling, which acts as a non-ideal low-pass filter, leading to spectral aliasing and the irreversible loss of high-frequency transients. In response, this paper proposes DPWMixer, a computationally efficient Dual-Path architecture. The framework is built upon a Lossless Haar Wavelet Pyramid that replaces traditional pooling, utilizing orthogonal decomposition to explicitly disentangle trends and local fluctuations without information loss. To process these components, we design a Dual-Path Trend Mixer that integrates a global linear mapping for macro-trend anchoring and a flexible patch-based MLP-Mixer for micro-dynamic evolution. Finally, An adaptive multi-scale fusion module then integrates predictions from diverse scales, weighted by channel stationarity to optimize synthesis. Extensive experiments on eight public benchmarks demonstrate that our method achieves a consistent improvement over state-of-the-art baselines. The code is available at https://github.com/hit636/DPWMixer.

[335] HTG-GCL: Leveraging Hierarchical Topological Granularity from Cellular Complexes for Graph Contrastive Learning

Qirui Ji, Bin Qin, Yifan Jin, Yunze Zhao, Chuxiong Sun, Changwen Zheng, Jianwen Cao, Jiangmeng Li

Main category: cs.LG

TL;DR: HTG-GCL introduces hierarchical topological granularity graph contrastive learning using multi-scale ring-based cellular complexes to capture varying topological granularities for different downstream tasks.

Details

Motivation: Existing GCL methods with structural augmentations struggle to identify task-relevant topological structures and adapt to varying coarse-to-fine topological granularities required across different downstream tasks.

Method: Proposes HTG-GCL framework that generates multi-scale ring-based cellular complexes from graph transformations, embodying topological granularity. Uses multi-granularity decoupled contrast with granularity-specific weighting based on uncertainty estimation to handle misleading semantics at certain granularities.

Result: Comprehensive experiments on various benchmarks demonstrate HTG-GCL’s effectiveness and superior performance in capturing meaningful graph representations through hierarchical topological information.

Conclusion: HTG-GCL successfully addresses limitations of existing GCL approaches by leveraging hierarchical topological granularity and uncertainty-based weighting, providing a robust framework for learning discriminative graph representations across varying topological scales.

Abstract: Graph contrastive learning (GCL) aims to learn discriminative semantic invariance by contrasting different views of the same graph that share critical topological patterns. However, existing GCL approaches with structural augmentations often struggle to identify task-relevant topological structures, let alone adapt to the varying coarse-to-fine topological granularities required across different downstream tasks. To remedy this issue, we introduce Hierarchical Topological Granularity Graph Contrastive Learning (HTG-GCL), a novel framework that leverages transformations of the same graph to generate multi-scale ring-based cellular complexes, embodying the concept of topological granularity, thereby generating diverse topological views. Recognizing that a certain granularity may contain misleading semantics, we propose a multi-granularity decoupled contrast and apply a granularity-specific weighting mechanism based on uncertainty estimation. Comprehensive experiments on various benchmarks demonstrate the effectiveness of HTG-GCL, highlighting its superior performance in capturing meaningful graph representations through hierarchical topological information.

Haozhe Wu

Main category: cs.LG

TL;DR: Proposes a task-driven supervised multi-modal federated feature extraction method for regression that addresses limited non-IID data, multi-modal fusion, and catastrophic forgetting through contrastive learning and multi-constraint framework.

Details

Motivation: Address three core challenges in real-world multi-modal regression: 1) limited and non-IID data, 2) effective extraction and fusion of multi-modal information, and 3) susceptibility to catastrophic forgetting in model learning.

Method: Task-driven supervised multi-modal federated feature extraction method integrating multi-modal information extraction with contrastive learning. Uses adaptive neural network structures as latent mapping functions for each modality. Features multi-constraint learning framework with MSE loss for regression accuracy, mutual information preservation, symmetric KL divergence, and inter-model contrastive constraints.

Result: Experimental results from simulations and real-world data analysis show the proposed method achieves more significant performance improvement on downstream regression tasks compared with classical feature extraction techniques.

Conclusion: The method effectively addresses key challenges in multi-modal federated learning by ensuring feature extraction centers on improving downstream regression performance through multi-constraint optimization that handles non-IID data, multi-modal fusion, and catastrophic forgetting.

Abstract: This study focuses on the feature extraction problem in multi-modal data regression. To address three core challenges in real-world scenarios: limited and non-IID data, effective extraction and fusion of multi-modal information, and susceptibility to catastrophic forgetting in model learning, a task-driven supervised multi-modal federated feature extraction method is proposed. The method integrates multi-modal information extraction and contrastive learning mechanisms, and can adapt to different neural network structures as the latent mapping functions for data of each modality. It supports each client to independently learn low-dimensional representations of multi-modal data, and can flexibly control the degree of retention of effective information about the response variable in the predictive variables within the low-dimensional features through parameter tuning. The multi-constraint learning framework constructed by the method guarantees regression accuracy using Mean Squared Error loss. Through the synergistic effect of mutual information preservation constraint, symmetric Kullback-Leibler divergence constraint, and inter-model contrastive constraint, it achieves the retention of task-related information, the extraction, fusion, and alignment of multi-modal features, and the mitigation of representation drift and catastrophic forgetting in non-IID scenarios, respectively. This ensures that the feature extraction process always centers on improving the performance of downstream regression tasks. Experimental results from simulations and real-world data analysis demonstrate that the proposed method achieves more significant performance improvement on downstream regression tasks compared with classical feature extraction techniques.

[337] Cross-View Topology-Aware Graph Representation Learning

Ahmet Sami Korkmaz, Selim Coskunuzer, Md Joshem Uddin

Main category: cs.LG

TL;DR: GraphTCL: A dual-view contrastive learning framework that combines GNN structural embeddings with topological embeddings from persistent homology to improve graph classification performance.

Details

Motivation: Graph Neural Networks (GNNs) effectively capture local structural patterns but often overlook global topological features that are critical for robust graph representation learning. This limitation hinders classification performance in applications like chemistry, social networks, and bioinformatics.

Method: Proposes GraphTCL, a dual-view contrastive learning framework that integrates structural embeddings from GNNs with topological embeddings derived from persistent homology. Uses cross-view contrastive loss to align these complementary views for enhanced representation learning.

Result: Extensive experiments on benchmark datasets (including TU and OGB molecular graphs) demonstrate that GraphTCL consistently outperforms state-of-the-art baselines in graph classification tasks.

Conclusion: The study highlights the importance of topology-aware contrastive learning for advancing graph representation methods, showing that integrating global topological features with local structural patterns significantly improves classification performance.

Abstract: Graph classification has gained significant attention due to its applications in chemistry, social networks, and bioinformatics. While Graph Neural Networks (GNNs) effectively capture local structural patterns, they often overlook global topological features that are critical for robust representation learning. In this work, we propose GraphTCL, a dual-view contrastive learning framework that integrates structural embeddings from GNNs with topological embeddings derived from persistent homology. By aligning these complementary views through a cross-view contrastive loss, our method enhances representation quality and improves classification performance. Extensive experiments on benchmark datasets, including TU and OGB molecular graphs, demonstrate that GraphTCL consistently outperforms state-of-the-art baselines. This study highlights the importance of topology-aware contrastive learning for advancing graph representation methods.

[338] CLEF: Clinically-Guided Contrastive Learning for Electrocardiogram Foundation Models

Yuxuan Shu, Peter H. Charlton, Fahim Kawsar, Jussi Hernesniemi, Mohammad Malekzadeh

Main category: cs.LG

TL;DR: CLEF is a novel contrastive learning approach for ECG foundation models that incorporates clinical risk scores to weight negative pairs, improving performance on single-lead ECG analysis without requiring per-sample annotations.

Details

Motivation: Existing self-supervised pretraining approaches for ECG foundation models don't incorporate domain knowledge from clinical metadata, missing opportunities to align embeddings with clinically meaningful differences between subjects.

Method: Clinically-guided contrastive learning that uses established clinical risk scores to adaptively weight negative pairs, with explicit handling of missing metadata. Pretrained single-lead ECG foundation models (CLEF) at three scales on 12-lead ECGs from 161K patients using only routinely collected metadata.

Result: CLEF outperforms self-supervised foundation model baselines: medium-sized CLEF achieves average AUROC improvements of at least 2.6% in classification and average MAE reductions of at least 3.2% in regression. When pretrained only on lead-I data, CLEF performs comparably to supervised state-of-the-art ECGFounder.

Conclusion: CLEF enables more accurate and scalable single-lead ECG analysis by incorporating clinical domain knowledge into self-supervised pretraining, advancing remote health monitoring capabilities.

Abstract: The electrocardiogram (ECG) is a key diagnostic tool in cardiovascular health. Single-lead ECG recording is integrated into both clinical-grade and consumer wearables. While self-supervised pretraining of foundation models on unlabeled ECGs improves diagnostic performance, existing approaches do not incorporate domain knowledge from clinical metadata. We introduce a novel contrastive learning approach that utilizes an established clinical risk score to adaptively weight negative pairs: clinically-guided contrastive learning. It aligns the similarities of ECG embeddings with clinically meaningful differences between subjects, with an explicit mechanism to handle missing metadata. On 12-lead ECGs from 161K patients in the MIMIC-IV dataset, we pretrain single-lead ECG foundation models at three scales, collectively called CLEF, using only routinely collected metadata without requiring per-sample ECG annotations. We evaluate CLEF on 18 clinical classification and regression tasks across 7 held-out datasets, and benchmark against 5 foundation model baselines and 3 self-supervised algorithms. When pretrained on 12-lead ECG data and tested on lead-I data, CLEF outperforms self-supervised foundation model baselines: the medium-sized CLEF achieves average AUROC improvements of at least 2.6% in classification and average reductions in MAEs of at least 3.2% in regression. Comparing with existing self-supervised learning algorithms, CLEF improves the average AUROC by at least 1.8%. Moreover, when pretrained only on lead-I data for classification tasks, CLEF performs comparably to the state-of-the-art ECGFounder, which was trained in a supervised manner. Overall, CLEF enables more accurate and scalable single-lead ECG analysis, advancing remote health monitoring. Code and pretrained CLEF models are available at: github.com/Nokia-Bell-Labs/ecg-foundation-model.

[339] Enforcing Orderedness to Improve Feature Consistency

Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang

Main category: cs.LG

TL;DR: OSAE introduces ordered sparse autoencoders that enforce strict feature ordering and deterministic dimension usage to improve consistency across seeds/hyperparameters compared to prior nested SAE methods.

Details

Motivation: Sparse autoencoders are widely used for neural network interpretability, but their learned features vary across different seeds and hyperparameter settings, making results inconsistent and unreliable.

Method: Extends Matryoshka SAEs by (1) establishing strict ordering of latent features and (2) deterministically using every feature dimension, avoiding sampling-based approximations of prior nested SAE methods.

Result: Theoretically resolves permutation non-identifiability in sparse dictionary learning where solutions are unique up to natural symmetries. Empirically improves consistency compared to Matryoshka baselines on Gemma2-2B and Pythia-70M models.

Conclusion: OSAE provides a more consistent and reliable approach to sparse autoencoder-based interpretability by addressing feature ordering and deterministic dimension usage issues in prior methods.

Abstract: Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.

[340] Aligning Compound AI Systems via System-level DPO

Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo

Main category: cs.LG

TL;DR: SysDPO extends Direct Preference Optimization (DPO) to align compound AI systems with multiple interacting components, addressing challenges of non-differentiable interactions and system-level preference transformation.

Details

Motivation: Compound AI systems outperform single models but are difficult to align with human preferences due to non-differentiable component interactions and inability to directly transform system-level preferences to component-level preferences.

Method: Formulate compound AI systems as Directed Acyclic Graphs (DAGs), then introduce SysDPO framework with two variants: SysDPO-Direct (with system-specific preference dataset) and SysDPO-Sampling (without such dataset).

Result: Empirically demonstrated effectiveness across two applications: joint alignment of language model and diffusion model, and joint alignment of LLM collaboration system.

Conclusion: SysDPO successfully extends DPO to enable joint system-level alignment of compound AI systems, overcoming challenges of non-differentiable interactions and preference transformation.

Abstract: Compound AI systems, comprising multiple interacting components such as LLMs, foundation models, and external tools, have demonstrated remarkable improvements compared to single models in various tasks. To ensure their effective deployment in real-world applications, aligning these systems with human preferences is crucial. However, aligning the compound system via policy optimization, unlike the alignment of a single model, is challenging for two main reasons: (i) non-differentiable interactions between components make end-to-end gradient-based optimization method inapplicable, and (ii) system-level preferences cannot be directly transformed into component-level preferences. To address these challenges, we first formulate compound AI systems as Directed Acyclic Graphs (DAGs), explicitly modeling both component interactions and the associated data flows. Building on this formulation, we introduce $\textbf{SysDPO}$, a framework that extends Direct Preference Optimization (DPO) to enable joint system-level alignment. We propose two variants, SysDPO-Direct and SysDPO-Sampling, tailored for scenarios depending on whether we construct a system-specific preference dataset. We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system.

Stefano Vrizzi, Daniel W. O’Neill

Main category: cs.LG

TL;DR: ML methods applied to Doughnut framework to find sustainable policy parameters and optimal transition paths.

Details

Motivation: The Doughnut framework for sustainability assessment needs practical implementation tools; ML can help identify policies that achieve both environmental and social sustainability goals.

Method: Applied Random Forest Classifier and Q-learning to a simple macroeconomic model of the Doughnut framework to find policy parameters consistent with sustainability and identify optimal transition trajectories.

Result: Frugal ML methods successfully identified policy parameter combinations that achieve both environmental and social sustainability within the Doughnut framework.

Conclusion: Proof-of-concept shows ML can effectively support Doughnut framework implementation; next step is applying these methods to more complex ecological macroeconomic models.

Abstract: The ‘Doughnut’ of social and planetary boundaries has emerged as a popular framework for assessing environmental and social sustainability. Here, we provide a proof-of-concept analysis that shows how machine learning (ML) methods can be applied to a simple macroeconomic model of the Doughnut. First, we show how ML methods can be used to find policy parameters that are consistent with ’living within the Doughnut’. Second, we show how a reinforcement learning agent can identify the optimal trajectory towards desired policies in the parameter space. The approaches we test, which include a Random Forest Classifier and $Q$-learning, are frugal ML methods that are able to find policy parameter combinations that achieve both environmental and social sustainability. The next step is the application of these methods to a more complex ecological macroeconomic model.

[342] InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo

Main category: cs.LG

TL;DR: InstructLR: A framework for generating high-quality instruction datasets for low-resource languages using LLM-driven generation with dual-layer quality filtering (automated RAG-based + human validation).

Details

Motivation: Current LLMs struggle with low-resource languages due to lack of high-quality instruction datasets. Automated translation and synthetic data generation often produce poor-quality outputs lacking fluency and orthographic consistency.

Method: InstructLR integrates LLM-driven text generation with dual-layer quality filtering: 1) automated filtering using RAG-based n-shot prompting, and 2) human-in-the-loop validation layer. Inspired by MMLU benchmarks.

Result: Created three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k (50k examples each) for African low-resource languages.

Conclusion: InstructLR provides an effective framework for generating high-quality instruction datasets for low-resource languages, addressing the data scarcity problem that hinders LLM performance in these languages.

Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.

[343] Improved Training Mechanism for Reinforcement Learning via Online Model Selection

Aida Afshar, Aldo Pacchiano

Main category: cs.LG

TL;DR: Online model selection in RL improves efficiency by adaptively selecting optimal agent configurations from a class, addressing practical criteria like resource allocation, non-stationary adaptation, and training stability.

Details

Motivation: To improve efficiency and performance gains in reinforcement learning by integrating online model selection methods that can adaptively select the right agent configuration from a class of available agents.

Method: The paper examines theoretical characterizations for identifying optimal configurations and addresses three practical criteria: 1) Efficient resource allocation, 2) Adaptation under non-stationary dynamics, and 3) Training stability across different seeds. The approach involves both theoretical analysis and empirical validation.

Result: Theoretical results show effective methods for configuration selection, supported by empirical evidence from various RL model selection tasks including neural architecture selection, step-size selection, and self model selection.

Conclusion: Online model selection methods can significantly improve RL training efficiency and performance by enabling adaptive configuration selection that addresses key practical challenges like resource allocation, non-stationarity adaptation, and training stability.

Abstract: We study the problem of online model selection in reinforcement learning, where the selector has access to a class of reinforcement learning agents and learns to adaptively select the agent with the right configuration. Our goal is to establish the improved efficiency and performance gains achieved by integrating online model selection methods into reinforcement learning training procedures. We examine the theoretical characterizations that are effective for identifying the right configuration in practice, and address three practical criteria from a theoretical perspective: 1) Efficient resource allocation, 2) Adaptation under non-stationary dynamics, and 3) Training stability across different seeds. Our theoretical results are accompanied by empirical evidence from various model selection tasks in reinforcement learning, including neural architecture selection, step-size selection, and self model selection.

[344] Uncertainty Reasoning with Photonic Bayesian Machines

F. Brückerhoff-Plückelmann, H. Borras, S. U. Hulyal, L. Meyer, X. Ji, J. Hu, J. Sun, B. Klein, F. Ebert, J. Dijkstra, L. McRae, P. Schmidt, T. J. Kippenberg, H. Fröning, W. Pernice

Main category: cs.LG

TL;DR: Photonic Bayesian machine uses chaotic light sources for uncertainty-aware AI, achieving 1.28 Tbit/s processing for probabilistic convolutions in 37.5 ps, enabling trustworthy AI with uncertainty reasoning.

Details

Motivation: AI systems increasingly impact safety-critical domains (medical diagnosis, autonomous mobility), requiring uncertainty awareness for trustworthy AI. Current digital systems face bottlenecks in pseudo-random number generation for probabilistic models.

Method: Developed a photonic Bayesian machine that leverages inherent randomness of chaotic light sources to implement Bayesian Neural Networks. Features a 1.28 Tbit/s digital interface compatible with PyTorch for probabilistic convolutions.

Result: System processes probabilistic convolutions within 37.5 ps per convolution. Demonstrated simultaneous classification and out-of-domain detection of blood cell microscope images, enabling reasoning between aleatoric and epistemic uncertainties.

Conclusion: Photonic Bayesian machine removes pseudo-random number generation bottleneck, minimizes sampling costs for probabilistic models, and enables high-speed trustworthy AI systems with uncertainty awareness.

Abstract: Artificial intelligence (AI) systems increasingly influence safety-critical aspects of society, from medical diagnosis to autonomous mobility, making uncertainty awareness a central requirement for trustworthy AI. We present a photonic Bayesian machine that leverages the inherent randomness of chaotic light sources to enable uncertainty reasoning within the framework of Bayesian Neural Networks. The analog processor features a 1.28 Tbit/s digital interface compatible with PyTorch, enabling probabilistic convolutions processing within 37.5 ps per convolution. We use the system for simultaneous classification and out-of-domain detection of blood cell microscope images and demonstrate reasoning between aleatoric and epistemic uncertainties. The photonic Bayesian machine removes the bottleneck of pseudo random number generation in digital systems, minimizes the cost of sampling for probabilistic models, and thus enables high-speed trustworthy AI systems.

[345] On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks

Benjamin K. Rosenzweig, Matthew W. Hahn

Main category: cs.LG

TL;DR: Neural networks can learn phylogenetic distance functions that approximate classical methods, offering scalability and efficiency comparable to state-of-the-art inference.

Details

Motivation: Traditional distance-based phylogenetic methods have been replaced by complex Bayesian/ML models, but these are computationally intensive. There's a need for scalable, efficient approaches that can handle large datasets while maintaining accuracy.

Method: Develop minimal neural network architectures that learn phylogenetic distance functions under various molecular evolutionary models. These networks are designed to be computationally lightweight and scalable to many taxa and molecular characters.

Result: The learned distance functions generalize well and achieve results comparable to state-of-the-art inference methods when trained on appropriate datasets, while maintaining small computational footprint.

Conclusion: Neural network-based distance learning provides an efficient, scalable alternative to traditional phylogenetic inference methods, offering comparable accuracy with reduced computational requirements.

Abstract: Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by Bayesian and maximum likelihood search procedures based on complex models of molecular evolution. In this work we describe minimal neural network architectures that can approximate classic phylogenetic distance functions and the properties required to learn distances under a variety of molecular evolutionary models. In contrast to model-based inference (and recently proposed model-free convolutional and transformer networks), these architectures have a small computational footprint and are scalable to large numbers of taxa and molecular characters. The learned distance functions generalize well and, given an appropriate training dataset, achieve results comparable to state-of-the art inference methods.

[346] The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models

Joshua Wolff Anderson, Shyam Visweswaran

Main category: cs.LG

TL;DR: Fairness improvements in healthcare ML models can significantly alter Shapley-based feature importance rankings, potentially affecting clinical trust in explanations.

Details

Motivation: While fairness and predictive performance trade-offs are known, little is understood about how fairness improvements affect explainability, which is crucial for clinical trust. Clinicians may hesitate to use models whose explanations change after fairness constraints are applied.

Method: Examined how bias mitigation techniques reshape Shapley-based feature rankings across three datasets: pediatric UTI risk, anticoagulant bleeding risk, and recidivism risk. Evaluated multiple model classes on stability of Shapley-based rankings after applying fairness constraints.

Result: Increasing model fairness across racial subgroups significantly alters feature importance rankings, sometimes in different ways across groups. Feature importance stability varies across model classes.

Conclusion: Accuracy, fairness, and explainability must be jointly considered in model assessment rather than in isolation, as fairness improvements can substantially change explanations that clinicians rely on for trust.

Abstract: Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.

[347] Limitations of Membership Queries in Testable Learning

Jane Lange, Mingda Qiao

Main category: cs.LG

TL;DR: Membership queries cannot provide super-polynomial time efficiency gains over sample-only learning in the testable learning model, and many efficient distribution-specific MQ learners cannot be made testable.

Details

Motivation: To understand whether membership queries can provide significant computational advantages in the testable learning model, particularly for distribution-specific learning where MQs are known to help in standard PAC learning.

Method: 1. Show reduction from sample-based refutation to testable learning with queries (TL-Q). 2. Use reduction from learning to refutation from prior work to derive lower bounds. 3. Define “statistical” MQ algorithms that capture many known distribution-specific MQ learners. 4. Show TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms.

Result: 1. No m-sample TL-Q algorithm can be super-polynomially more time-efficient than the best m-sample PAC learner. 2. Many efficient distribution-specific MQ learners (based on influence estimation or subcube-conditional statistical queries) cannot be made testable due to SQ dimension lower bounds.

Conclusion: In the testable learning model, membership queries cannot provide super-polynomial computational advantages over sample-only learning, and many efficient MQ learners are fundamentally incompatible with testability requirements.

Abstract: Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the \emph{testable learning} model of Rubinfeld and Vasilyan [RV23], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal. We give a general reduction from sample-based \emph{refutation} of boolean concept classes, as presented in [Vadhan17, KL18], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [KL18]. The result is that, relative to a concept class and a distribution family, no $m$-sample TL-Q algorithm can be super-polynomially more time-efficient than the best $m$-sample PAC learner. Finally, we define a class of ``statistical’’ MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.

[348] Training Dynamics of Learning 3D-Rotational Equivariance

Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho

Main category: cs.LG

TL;DR: Models learn 3D rotational equivariance quickly (≤2% error within 1k-10k steps) because it’s easier than the main prediction task, but non-equivariant models may outperform per GPU-hour unless efficiency gaps are narrowed.

Details

Motivation: To understand how quickly symmetry-agnostic models learn to respect symmetries through data augmentation, and to investigate the learning dynamics of equivariance in practical settings.

Method: Derived a principled measure of equivariance error for convex losses, then empirically studied 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) across various model and dataset sizes.

Result: Models reduce equivariance error quickly to ≤2% held-out loss within 1k-10k training steps, robust to model and dataset size. Learning 3D-rotational equivariance has smoother, better-conditioned loss landscape than main prediction task. Non-equivariant models may achieve lower test loss per GPU-hour unless equivariant efficiency gap is narrowed.

Conclusion: Equivariance is learned quickly for 3D rotations due to easier learning dynamics, but practical deployment requires addressing efficiency gaps between equivariant and non-equivariant models to make equivariant approaches competitive in terms of computational resources.

Abstract: While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation to 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq$2% held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant ``efficiency gap’’ is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.

[349] Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation

Kentaro Kubo, Hayato Goto

Main category: cs.LG

TL;DR: Proposes SAL framework combining Langevin SB sampler and CEM temperature estimation for efficient training of general Boltzmann machines beyond restricted versions.

Details

Motivation: Boltzmann machines are powerful but training is computationally expensive, especially for general models beyond RBMs. MCMC Boltzmann sampling is accurate but slow due to parallelization difficulties.

Method: 1) Langevin SB (LSB): Quantum-inspired parallel sampler for Boltzmann distributions. 2) Conditional Expectation Matching (CEM): Method to estimate inverse temperature during learning. 3) SAL framework: Combines LSB and CEM for efficient BM training.

Result: LSB enables parallel sampling with MCMC-level accuracy for both RBMs and general BMs. SAL framework provides efficient learning for more expressive BMs than RBMs.

Conclusion: SAL opens new avenues for energy-based generative modeling beyond RBMs by enabling efficient training of more expressive Boltzmann machines through parallel sampling and adaptive temperature control.

Abstract: Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by a quantum-inspired combinatorial optimization called simulated bifurcation (SB). This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, this is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also developed an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL). SAL opens new avenues for energy-based generative modeling beyond RBMs.

[350] Retrieval-Augmented Memory for Online Learning

Wenzhang Du

Main category: cs.LG

TL;DR: RAM-OL extends SGD with a small memory buffer for retrieval-augmented online learning under concept drift, improving accuracy on periodic streams while matching baseline on noisy ones.

Details

Motivation: Retrieval-augmented models combine parametric predictors with non-parametric memories, but their application to streaming supervised learning with concept drift is not well understood. The paper aims to study online classification in non-stationary environments where data distributions change over time.

Method: Proposes RAM-OL (Retrieval-Augmented Memory for Online Learning), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, it retrieves nearest neighbors of the current input in hidden representation space and updates the model jointly on current example and retrieved neighbors. Compares naive replay with gated replay that constrains neighbors using time window, similarity thresholds, and gradient reweighting to balance reuse of relevant past data against robustness to outdated regimes.

Result: On strongly and periodically drifting streams (electricity pricing, electricity load), RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds. On noisy airline delay stream, the gated variant closely matches the purely online baseline. Theoretical analysis shows retrieval can reduce adaptation cost and improve regret constants when patterns recur over time.

Conclusion: Retrieval-augmented memory is a practical and robust tool for online learning under concept drift, effectively balancing fast reuse of relevant past data against robustness to outdated regimes in non-stationary environments.

Abstract: Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study online classification in non-stationary environments and propose Retrieval-Augmented Memory for Online Learning (RAM-OL), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, RAM-OL retrieves a few nearest neighbours of the current input in the hidden representation space and updates the model jointly on the current example and the retrieved neighbours. We compare a naive replay variant with a gated replay variant that constrains neighbours using a time window, similarity thresholds, and gradient reweighting, in order to balance fast reuse of relevant past data against robustness to outdated regimes. From a theoretical perspective, we interpret RAM-OL under a bounded drift model and discuss how retrieval can reduce adaptation cost and improve regret constants when patterns recur over time. Empirically, we instantiate RAM-OL on a simple online multilayer perceptron and evaluate it on three real-world data streams derived from electricity pricing, electricity load, and airline delay data. On strongly and periodically drifting streams, RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds, while on a noisy airline stream the gated variant closely matches the purely online baseline. These results show that retrieval-augmented memory is a practical and robust tool for online learning under concept drift.

[351] Forecasting MBTA Transit Dynamics: A Performance Benchmarking of Statistical and Machine Learning Models

Sai Siddharth Nalamalpu, Kaining Yuan, Aiden Zhou, Eugene Pinsky

Main category: cs.LG

TL;DR: This paper compares statistical and machine learning models for predicting MBTA subway usage (via station entries) and system delays, finding that temporal features (day of week, season) improve accuracy more than weather data, which actually worsens performance due to overfitting.

Details

Motivation: The MBTA system faces frequent delays and ridership fluctuations that negatively impact efficiency and passenger satisfaction. There's a need to better understand and predict these patterns to improve transit operations and planning.

Method: The research evaluates 10 statistical and machine learning models for predicting next-day subway usage, and 11 models (including a self-exciting point process model) for predicting delay counts. Features include day of week, season, pressure, wind speed, average temperature, and precipitation. The study experiments with selective feature inclusion to determine importance and tests model accuracy using Root Mean Squared Error (RMSE).

Result: Temporal features (day of week or season) provide substantial benefits to predictive accuracy, while weather data generally worsens performance, suggesting models tend to overfit when including weather features. The self-exciting point process model represents a unique application for MBTA delay modeling.

Conclusion: For MBTA transit prediction, focusing on temporal patterns (day of week and season) is more effective than incorporating weather data, which can lead to overfitting and reduced predictive accuracy. The findings provide practical guidance for feature selection in public transportation forecasting models.

Abstract: The Massachusetts Bay Transportation Authority (MBTA) is the main public transit provider in Boston, operating multiple means of transport, including trains, subways, and buses. However, the system often faces delays and fluctuations in ridership volume, which negatively affect efficiency and passenger satisfaction. To further understand this phenomenon, this paper compares the performance of existing and unique methods to determine the best approach in predicting gated station entries in the subway system (a proxy for subway usage) and the number of delays in the overall MBTA system. To do so, this research considers factors that tend to affect public transportation, such as day of week, season, pressure, wind speed, average temperature, and precipitation. This paper evaluates the performance of 10 statistical and machine learning models on predicting next-day subway usage. On predicting delay count, the number of models is extended to 11 per day by introducing a self-exciting point process model, representing a unique application of a point-process framework for MBTA delay modeling. This research involves experimenting with the selective inclusion of features to determine feature importance, testing model accuracy via Root Mean Squared Error (RMSE). Remarkably, it is found that providing either day of week or season data has a more substantial benefit to predictive accuracy compared to weather data; in fact, providing weather data generally worsens performance, suggesting a tendency of models to overfit.

[352] SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia

Main category: cs.LG

TL;DR: SpecPV accelerates long-context LLM generation using self-speculative decoding with partial KV state verification and periodic full verification, achieving up to 6x speedup.

Details

Motivation: Long-context generation is crucial for tasks like code generation and document understanding, but speculative decoding faces verification bottlenecks as context grows.

Method: Self-speculative decoding with partial key-value state verification (fast verification) combined with periodic full verification to correct accumulated errors.

Result: Up to 6x decoding speedup over standard autoregressive decoding with minor quality degradation across multiple long-context benchmarks and models including LLaMA-3.1-8B-Instruct and Qwen3-series.

Conclusion: SpecPV effectively addresses the verification bottleneck in long-context speculative decoding, enabling significant acceleration for practical long-context generation tasks.

Abstract: Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.

[353] FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data

Nan Qiao, Sheng Yue, Ju Ren, Yaoxue Zhang

Main category: cs.LG

TL;DR: FOVA: A vote-based offline federated reinforcement learning framework that handles mixed-quality data across clients using a vote mechanism to identify high-return actions.

Details

Motivation: Existing offline FRL methods suffer performance drops when dealing with mixed-quality data from different clients, where logging behaviors are collected by policies with varying qualities. This limitation needs to be addressed for practical applications.

Method: FOVA uses a vote mechanism to identify high-return actions during local policy evaluation, mitigating negative effects of low-quality behaviors. It builds on advantage-weighted regression (AWR) to construct consistent local and global training objectives for enhanced efficiency and stability.

Result: Theoretical analysis shows FOVA ensures strict policy improvement over behavioral policies. Extensive experiments demonstrate significant performance gains over existing baselines on widely used benchmarks.

Conclusion: FOVA effectively addresses the mixed-quality data challenge in offline FRL through its vote mechanism and consistent objective design, providing both theoretical guarantees and practical performance improvements.

Abstract: Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a \emph{vote mechanism} to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.

[354] Reinforcement Learning in POMDP’s via Direct Gradient Ascent

Jonathan Baxter, Peter L. Bartlett

Main category: cs.LG

TL;DR: GPOMDP is a gradient-based algorithm for optimizing policy performance in POMDPs that estimates gradient of average reward using single sample paths, one free bias-variance trade-off parameter, and no state knowledge.

Details

Motivation: The paper addresses the challenge of directly optimizing policy performance in controlled Partially Observable Markov Decision Processes (POMDPs) where the underlying state is not fully observable, requiring gradient-based methods that can work with limited information.

Method: Introduces GPOMDP, a REINFORCE-like algorithm that estimates an approximation to the gradient of average reward using only a single sample path of the Markov chain. The method uses one free parameter β ∈ [0,1) for bias-variance trade-off and requires no knowledge of the underlying state.

Result: Proves convergence of GPOMDP algorithm and demonstrates how its gradient estimates can be used in a conjugate-gradient procedure to find local optima of the average reward function.

Conclusion: GPOMDP provides an effective gradient-based approach for policy optimization in POMDPs with advantages of requiring minimal information (single sample path, no state knowledge), tunable bias-variance trade-off, and theoretical convergence guarantees for finding local optima.

Abstract: This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm’s chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter $β\in [0,1)$, which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.

[355] Risk-Sensitive Q-Learning in Continuous Time with Application to Dynamic Portfolio Selection

Chuhan Xie

Main category: cs.LG

TL;DR: This paper develops risk-sensitive reinforcement learning for continuous-time systems using SDEs, proposes CT-RS-q algorithm based on martingale characterization, and demonstrates effectiveness on portfolio selection.

Details

Motivation: Traditional RL often assumes risk-neutral objectives, but many real-world applications (like finance) require risk-sensitive decision-making. There's a need for RSRL methods in continuous-time settings with nonlinear reward functionals.

Method: The authors use controllable stochastic differential equations to model the environment. They prove that with OCE functionals, optimal policies are Markovian in an augmented state. They propose CT-RS-q, a q-learning algorithm based on martingale characterization.

Result: Theoretical results show Markovian optimal policies for OCE functionals. The CT-RS-q algorithm is developed and validated through simulation studies on dynamic portfolio selection problems, demonstrating effectiveness.

Conclusion: The paper provides a theoretical foundation and practical algorithm for risk-sensitive RL in continuous time, with applications to financial decision-making where risk management is crucial.

Abstract: This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (SDE) and the objective is a potentially nonlinear functional of cumulative rewards. We prove that when the functional is an optimized certainty equivalent (OCE), the optimal policy is Markovian with respect to an augmented environment. We also propose \textit{CT-RS-q}, a risk-sensitive q-learning algorithm based on a novel martingale characterization approach. Finally, we run a simulation study on a dynamic portfolio selection problem and illustrate the effectiveness of our algorithm.

[356] ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity

Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu

Main category: cs.LG

TL;DR: ESACT is an end-to-end sparse accelerator for Transformers that uses local similarity prediction to reduce computation by 52% with minimal accuracy loss, achieving 3.29 TOPS/W energy efficiency.

Details

Motivation: Transformers have high computational costs that hinder efficient hardware deployment. Existing sparse accelerators mainly exploit intra-row sparsity in attention, while inter-row sparsity approaches rely on costly global similarity estimation and typically only apply to limited components.

Method: Proposes ESACT with Sparsity Prediction with Local Similarity (SPLS) mechanism using HLog quantization to predict local attention sparsity before QK generation, enabling efficient sparsity across all transformer components. Includes three architectural innovations for hardware realization.

Result: SPLS reduces total computation by 52.03% with <1% accuracy loss. ESACT achieves 3.29 TOPS/W end-to-end energy efficiency, improving attention-level efficiency by 2.95x and 2.26x over state-of-the-art accelerators SpAtten and Sanger.

Conclusion: Local similarity enables efficient end-to-end sparse acceleration for Transformers with significant computation reduction and energy efficiency improvements over existing approaches.

Abstract: Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively.

[357] Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning

Oshada Jayasinghe, Farhana Choudhury, Egemen Tanin, Shanika Karunasekera

Main category: cs.LG

TL;DR: A two-layer multi-agent reinforcement learning framework that dynamically configures on-street parking spaces to reduce traffic congestion by optimizing parking allocation while minimizing impact on vehicle travel time.

Details

Motivation: Traffic congestion in urban areas is worsened by on-street parking, which reduces effective road width for driving. With advances in vehicle-to-infrastructure connectivity, there's an opportunity to dynamically reconfigure parking spaces to minimize congestion impact.

Method: Proposes a two-layer multi-agent reinforcement learning framework: lane-level agents decide optimal parking configurations using a novel Deep Q-learning architecture with LSTM and graph attention networks to capture spatio-temporal correlations; block-level agents control lane agents to maintain sufficient parking around blocks.

Result: Experiments using SUMO on synthetic and real-world Melbourne data show the framework reduces average vehicle travel time loss by up to 47%, with negligible increase in walking distance for parking.

Conclusion: Dynamic on-street parking configuration using multi-agent reinforcement learning can significantly reduce traffic congestion while maintaining parking availability, demonstrating the potential of data-driven approaches for urban traffic management.

Abstract: With increased travelling needs more than ever, traffic congestion has become a major concern in most urban areas. Allocating spaces for on-street parking, further hinders traffic flow, by limiting the effective road width available for driving. With the advancement of vehicle-to-infrastructure connectivity technologies, we explore how the impact of on-street parking on traffic congestion could be minimized, by dynamically configuring on-street parking spaces. Towards that end, we formulate dynamic on-street parking space configuration as an optimization problem, and we follow a data driven approach, considering the nature of our problem. Our proposed solution comprises a two-layer multi agent reinforcement learning based framework, which is inherently scalable to large road networks. The lane level agents are responsible for deciding the optimal parking space configuration for each lane, and we introduce a novel Deep Q-learning architecture which effectively utilizes long short term memory networks and graph attention networks to capture the spatio-temporal correlations evident in the given problem. The block level agents control the actions of the lane level agents and maintain a sufficient level of parking around the block. We conduct a set of comprehensive experiments using SUMO, on both synthetic data as well as real-world data from the city of Melbourne. Our experiments show that the proposed framework could reduce the average travel time loss of vehicles significantly, reaching upto 47%, with a negligible increase in the walking distance for parking.

[358] Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

Yizhou Zhang, Lun Du

Main category: cs.LG

TL;DR: Data curation methods like pruning and synthetic data have mixed effectiveness; static pruning can’t change asymptotic scaling, while ideal time-dependent curation can accelerate learning.

Details

Motivation: To understand why some data-centric strategies (pruning, synthetic data, RLHF, etc.) improve training efficiency while others fail, and to formalize data curation's effects on neural scaling laws.

Method: Formalize data curation as reweighting sampling distribution and analyze its effect on eigenstructure of data-induced operator. Provide theoretical results: 1) static pruning’s limitations, 2) time-dependent curation’s potential.

Result: Static pruning induces bounded operator and cannot change spectral tail exponent - only finite-region improvements, no asymptotic scaling changes. Ideal time-dependent curation (tracking spectral residuals) can provably accelerate learning, though practical systems only approximate this.

Conclusion: Theoretical framework explains mixed success of data curation methods: static approaches have fundamental limitations, while dynamic/adaptive curation offers potential for meaningful acceleration if properly implemented.

Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains – most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that \textbf{static pruning induces a bounded operator and therefore cannot change the spectral tail exponent}; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes \textbf{time-dependent data curation}, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning – although practical systems can only approximate this behavior.

[359] Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering

Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li, Zhuoran Yang, Siyang Gao, Shuang Qiu

Main category: cs.LG

TL;DR: DVDF method selectively filters source domain data using both dynamics and value alignment for cross-domain offline RL, outperforming prior methods across various tasks and low-data settings.

Details

Motivation: Current cross-domain offline RL methods only consider dynamics alignment between source and target domains, overlooking value alignment (selecting high-quality, high-value samples). This limitation leads to suboptimal performance when simply merging datasets with misaligned dynamics.

Method: Proposes Dynamics- and Value-aligned Data Filtering (DVDF) method that selectively shares source domain samples with both high dynamics alignment and high value alignment. The approach is theoretically motivated by analyzing the sub-optimality gap of policies trained on source domain and evaluated on target domain.

Result: DVDF consistently outperforms prior baselines across various tasks and datasets, including challenging low-data settings where target domain contains only 5,000 transitions. Extensive experiments demonstrate exceptional performance on tasks with kinematic and morphology shifts.

Conclusion: Both dynamics alignment and value alignment are essential for effective cross-domain offline RL. The proposed DVDF method successfully addresses both aspects, providing superior performance by selectively filtering source domain data based on dual alignment criteria.

Abstract: Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook \textit{value alignment}, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our \textbf{\underline{D}}ynamics- and \textbf{\underline{V}}alue-aligned \textbf{\underline{D}}ata \textbf{\underline{F}}iltering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.

[360] When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

Main category: cs.LG

TL;DR: LLM agents show unpredictable performance and safety degradation on long-context tasks, with refusal rates shifting dramatically and capabilities dropping over 50% at 100K tokens despite 1M-2M context windows.

Details

Motivation: While new LLMs support longer context windows and tool calling, prior work has focused mainly on evaluating LLMs on long-context prompts, leaving agentic setups relatively unexplored from both capability and safety perspectives.

Method: The study evaluates LLM agents on long-horizon tasks requiring external tools and extended context windows, examining sensitivity to context length, type, and placement, and measuring both task performance and refusal rates for harmful requests.

Result: Models with 1M-2M token context windows show severe degradation at just 100K tokens (performance drops >50% for both benign and harmful tasks). Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok 4 Fast decreases from ~80% to ~10% at 200K tokens.

Conclusion: LLM agents operating on longer contexts reveal potential safety issues and show notable divergence in both capability and safety performance compared to prior LLM evaluations, raising questions about current metrics and paradigms for evaluating LLM agent safety on long multi-step tasks.

Abstract: Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5% to $\sim$40% while Grok 4 Fast decreases from $\sim$80% to $\sim$10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

[361] TabGRU: An Enhanced Design for Urban Rainfall Intensity Estimation Using Commercial Microwave Links

Xingwang Li, Mengyun Chen, Jiamou Liu, Sijie Wang, Shuanggen Jin, Jafet C. M. Andersson, Jonas Olsson, Remco, van de Beek, Hai Victor Habi, Congzheng Han

Main category: cs.LG

TL;DR: Proposes TabGRU, a hybrid Transformer-BiGRU model for urban rainfall monitoring using Commercial Microwave Links, outperforming physics-based and deep learning baselines with R² up to 0.96.

Details

Motivation: Global urbanization and extreme weather events require high-resolution urban rainfall monitoring. Traditional physics-based models for Commercial Microwave Links (CMLs) struggle with real-world complexities like signal noise and nonlinear attenuation.

Method: Novel hybrid deep learning architecture combining Transformer and Bidirectional Gated Recurrent Unit (BiGRU) called TabGRU. Includes learnable positional embedding and attention pooling to capture both long-term dependencies and local sequential features in CML signal data.

Result: TabGRU outperformed deep learning baselines with R² of 0.91 at Torp site and 0.96 at Barl site. Showed higher accuracy than physics-based approach, effectively mitigating overestimation problems during peak rainfall events.

Conclusion: TabGRU effectively overcomes limitations of traditional methods, providing robust and accurate solution for CML-based urban rainfall monitoring under tested conditions.

Abstract: In the face of accelerating global urbanization and the increasing frequency of extreme weather events, highresolution urban rainfall monitoring is crucial for building resilient smart cities. Commercial Microwave Links (CMLs) are an emerging data source with great potential for this task.While traditional rainfall retrieval from CMLs relies on physicsbased models, these often struggle with real-world complexities like signal noise and nonlinear attenuation. To address these limitations, this paper proposes a novel hybrid deep learning architecture based on the Transformer and a Bidirectional Gated Recurrent Unit (BiGRU), which we name TabGRU. This design synergistically captures both long-term dependencies and local sequential features in the CML signal data. The model is further enhanced by a learnable positional embedding and an attention pooling mechanism to improve its dynamic feature extraction and generalization capabilities. The model was validated on a public benchmark dataset from Gothenburg, Sweden (June-September 2015). The evaluation used 12 sub-links from two rain gauges (Torp and Barl) over a test period (August 22-31) covering approximately 10 distinct rainfall events. The proposed TabGRU model demonstrated consistent advantages, outperforming deep learning baselines and achieving high coefficients of determination (R2) at both the Torp site (0.91) and the Barl site (0.96). Furthermore, compared to the physics-based approach, TabGRU maintained higher accuracy and was particularly effective in mitigating the significant overestimation problem observed in the PL model during peak rainfall events. This evaluation confirms that the TabGRU model can effectively overcome the limitations of traditional methods, providing a robust and accurate solution for CML-based urban rainfall monitoring under the tested conditions.

[362] Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, Shuang Qiu

Main category: cs.LG

TL;DR: DROCO: A dual-robust cross-domain offline RL algorithm that addresses both train-time and test-time robustness against dynamics shifts, using a novel robust cross-domain Bellman operator with dynamic value penalty and Huber loss.

Details

Motivation: Existing cross-domain offline RL focuses only on train-time robustness against dynamics shifts in training data, but neglects test-time robustness against dynamics perturbations during deployment. Policies trained with cross-domain offline RL show fragility under dynamics perturbations, especially when target domain data is limited.

Method: Introduces a robust cross-domain Bellman (RCB) operator that enhances test-time robustness against dynamics perturbations while maintaining conservatism for out-of-distribution transitions. Adds dynamic value penalty and Huber loss to counteract potential value overestimation/underestimation from RCB operator, resulting in DROCO algorithm.

Result: Extensive empirical results across various dynamics shift scenarios show DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.

Conclusion: DROCO successfully addresses dual robustness (train-time and test-time) in cross-domain offline RL, providing a practical solution for handling dynamics shifts in both training and deployment scenarios.

Abstract: Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical \textbf{D}ual-\textbf{RO}bust \textbf{C}ross-domain \textbf{O}ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.

[363] Hybrid(Penalized Regression and MLP) Models for Outcome Prediction in HDLSS Health Data

Mithra D K

Main category: cs.LG

TL;DR: Hybrid XGBoost-MLP model outperforms baseline ML methods for diabetes prediction on NHANES health survey data.

Details

Motivation: To improve diabetes prediction accuracy using established machine learning techniques applied to NHANES health survey data, comparing traditional methods with a novel hybrid approach.

Method: Compare baseline models (logistic regression, random forest, XGBoost) with a hybrid approach that uses XGBoost as a feature encoder followed by a lightweight multilayer perceptron (MLP) head for final prediction.

Result: The hybrid XGBoost-MLP model achieves improved AUC and balanced accuracy compared to baseline models on the processed NHANES subset.

Conclusion: The hybrid approach combining XGBoost feature encoding with MLP classification demonstrates superior performance for diabetes prediction, with code released for reproducibility.

Abstract: I present an application of established machine learning techniques to NHANES health survey data for predicting diabetes status. I compare baseline models (logistic regression, random forest, XGBoost) with a hybrid approach that uses an XGBoost feature encoder and a lightweight multilayer perceptron (MLP) head. Experiments show the hybrid model attains improved AUC and balanced accuracy compared to baselines on the processed NHANES subset. I release code and reproducible scripts to encourage replication.

[364] A Fully First-Order Layer for Differentiable Optimization

Zihao Zhao, Kai-Chia Mo, Shing-Hei Ho, Brandon Amos, Kai Wang

Main category: cs.LG

TL;DR: A novel algorithm for computing gradients in differentiable optimization layers using only first-order information, avoiding expensive Hessian computations while matching state-of-the-art complexity rates.

Details

Motivation: Differentiable optimization layers are valuable for learning systems but current gradient computation via implicit differentiation requires solving linear systems with Hessian terms, which is computationally expensive in both time and memory.

Method: Reformulate differentiable optimization as a bilevel optimization problem and leverage recent bilevel methods. Introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time approximation guarantees.

Result: Approximate hypergradient can be computed using only first-order information in $\tilde{\oo}(1)$ time, achieving overall complexity of $\tilde{\oo}(δ^{-1}ε^{-3})$ for constrained bilevel optimization, matching best known rates for non-smooth non-convex optimization.

Conclusion: The proposed algorithm provides efficient gradient computation for differentiable optimization layers without Hessian evaluations, with theoretical guarantees and practical implementation through an open-source Python library.

Abstract: Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{\oo}(1)$ time, leading to an overall complexity of $\tilde{\oo}(δ^{-1}ε^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. Our code is available here: https://github.com/guaguakai/FFOLayer.

[365] Water Quality Estimation Through Machine Learning Multivariate Analysis

Marco Cardia, Stefano Chessa, Alessio Micheli, Antonella Giuliana Luminare, Francesca Gambineri

Main category: cs.LG

TL;DR: Integration of UV-Vis spectroscopy with machine learning for automated water quality assessment in agriculture, with SHAP interpretability.

Details

Motivation: Water quality is crucial for agrifood sector (fertigation, animal husbandry, processing). Digitalization requires automatic water quality assessment to ensure safety and regulatory compliance.

Method: Combines Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning for water quality assessment. Uses SHapley Additive exPlanations (SHAP) for model interpretability to understand wavelength contributions.

Result: Demonstrates potential for rapid, accurate, and interpretable assessment of key water quality parameters.

Conclusion: The integrated approach enables automated, reliable water quality monitoring in agriculture with explainable AI for regulatory compliance and safety assurance.

Abstract: The quality of water is key for the quality of agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of the progressive digitalization of this sector, the automatic assessment of the quality of water is thus becoming an important asset. In this work, we present the integration of Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning in the context of water quality assessment aiming at ensuring water safety and the compliance of water regulation. Furthermore, we emphasize the importance of model interpretability by employing SHapley Additive exPlanations (SHAP) to understand the contribution of absorbance at different wavelengths to the predictions. Our approach demonstrates the potential for rapid, accurate, and interpretable assessment of key water quality parameters.

[366] Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism

Sandro Andric

Main category: cs.LG

TL;DR: LLMs show strong implicit altruism bias and behave altruistically above chance, but their implicit associations don’t predict behavior, and they systematically overestimate their own altruism - a “virtue signaling gap” affecting 75% of models.

Details

Motivation: To investigate whether LLMs exhibit altruistic tendencies and whether their implicit associations and self-reports predict actual altruistic behavior, using methods from human social psychology.

Method: Multi-method approach testing 24 frontier LLMs across three paradigms: Implicit Association Test (IAT) for implicit altruism bias, forced binary choice task for behavioral altruism, and self-assessment scale for explicit altruism beliefs.

Result: All models show strong implicit pro-altruism bias; behave altruistically above chance (65.6% vs 50%); implicit associations don’t predict behavior; models systematically overestimate their altruism (77.5% claimed vs 65.6% actual) - a “virtue signaling gap” affecting 75% of models.

Conclusion: Proposes the Calibration Gap as a standardized alignment metric; only 12.5% of models achieve ideal combination of high prosocial behavior and accurate self-knowledge; well-calibrated models are more predictable and behaviorally consistent.

Abstract: We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models “know” altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen’s d = 1.08). This “virtue signaling gap” affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.

[367] Decentralized Fairness Aware Multi Task Federated Learning for VR Network

Krishnendu S. Tharakan, Carlo Fischione

Main category: cs.LG

TL;DR: DMTFL-based caching system for wireless VR that personalizes FOV prefetching at base stations using decentralized multi-task federated learning to handle statistical heterogeneity across users.

Details

Motivation: Wireless VR requires seamless, high-quality real-time video delivery but faces challenges with stringent latency requirements, QoE constraints, and limited device capabilities. Traditional FL approaches suffer from bias toward certain users and fail to capture statistical heterogeneity across users and base stations.

Method: Proposes decentralized multi-task fair federated learning (DMTFL) algorithm that learns individual caching models at each BS for personalized content delivery. Models are optimized to perform well under any target distribution with theoretical guarantees via Rademacher complexity and PAC bounds.

Result: Simulations using realistic VR head-tracking dataset demonstrate superiority of DMTFL algorithm compared to baseline algorithms for wireless VR caching and prefetching.

Conclusion: DMTFL provides an effective solution for wireless VR content delivery by addressing statistical heterogeneity through personalized caching models with theoretical performance guarantees, enabling seamless VR experiences from anywhere.

Abstract: Wireless connectivity promises to unshackle virtual reality (VR) experiences, allowing users to engage from anywhere, anytime. However, delivering seamless, high-quality, real-time VR video wirelessly is challenging due to the stringent quality of experience requirements, low latency constraints, and limited VR device capabilities. This paper addresses these challenges by introducing a novel decentralized multi task fair federated learning (DMTFL) based caching that caches and prefetches each VR user’s field of view (FOV) at base stations (BSs) based on the caching strategies tailored to each BS. In federated learning (FL) in its naive form, often biases toward certain users, and a single global model fails to capture the statistical heterogeneity across users and BSs. In contrast, the proposed DMTFL algorithm personalizes content delivery by learning individual caching models at each BS. These models are further optimized to perform well under any target distribution, while providing theoretical guarantees via Rademacher complexity and a probably approximately correct (PAC) bound on the loss. Using a realistic VR head-tracking dataset, our simulations demonstrate the superiority of our proposed DMTFL algorithm compared to baseline algorithms.

[368] In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs

Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian

Main category: cs.LG

TL;DR: In-context distillation reduces LLM agent inference costs by using teacher demonstrations as in-context examples for student models, achieving 2.5× cost reduction on ALFWorld and 2× on AppWorld while maintaining accuracy.

Details

Motivation: LLM agents incur high inference costs at scale, but alternatives like fine-tuning or manual prompt engineering have high development friction costs (long training cycles, laborious trial-and-error). Need a method to reduce operational costs while preserving rapid experimentation with frozen models.

Method: Introduces in-context distillation: retrieves relevant teacher demonstrations at each agent step and provides them to student as in-context examples, enabling on-the-fly imitation. Combines with self-consistency cascades to know when to trust student vs teacher. Uses frozen models without fine-tuning.

Result: On ALFWorld: matches teacher accuracy at 2.5× lower cost ($0.059→$0.024 per episode). Upfront demonstration cost amortizes after 843 episodes, with $34,900+ savings at 1M episodes. On AppWorld: achieves 2× cost reduction at iso-accuracy, shifting Pareto frontier.

Conclusion: Method reduces operational costs while maintaining rapid experimentation cycles with frozen models, making advanced agentic systems economically viable for broader applications. Combines benefits of model specialization with productivity of frozen models.

Abstract: The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low cost-student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when the trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\textbf{2.5$\times$ lower cost}$, reducing per-episode costs from $0.059 to $0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding $34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\textbf{2$\times$ cost reduction}$ at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.

[369] Tensor Network Based Feature Learning Model

Albert Saiapin, Kim Batselier

Main category: cs.LG

TL;DR: The paper introduces a Feature Learning (FL) model that efficiently learns hyperparameters for tensor-product features using Canonical Polyadic Decomposition and Alternating Least Squares optimization, achieving 3-5x faster training than cross-validation with comparable prediction quality.

Details

Motivation: Existing kernel approximation methods using tensor-product features face the curse of dimensionality, which was addressed via tensor network reparameterization. However, identifying optimal feature hyperparameters remains unaddressed and typically relies on inefficient cross-validation approaches.

Method: The FL model represents tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). This CPD structure enables efficient joint learning of feature hyperparameters and model parameters using an Alternating Least Squares (ALS) optimization method.

Result: Experiments on real data of various dimensionality and scale show the FL model can be trained 3-5 times faster than standard cross-validated models while maintaining comparable prediction quality.

Conclusion: The FL model provides an efficient alternative to cross-validation for learning optimal feature hyperparameters in kernel-based algorithms, significantly accelerating training while preserving predictive performance.

Abstract: Many approximations were suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal formulation of the learning problem by mapping the data to a higher-dimensional space using tensor-product structured polynomial and Fourier features. The curse of dimensionality due to these tensor-product features was effectively solved by a tensor network reparameterization of the model parameters. However, another important aspect of model training - identifying optimal feature hyperparameters - has not been addressed and is typically handled using the standard cross-validation approach. In this paper, we introduce the Feature Learning (FL) model, which addresses this issue by representing tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). By leveraging this CPD structure, we efficiently learn the hyperparameters associated with different features alongside the model parameters using an Alternating Least Squares (ALS) optimization method. We prove the effectiveness of the FL model through experiments on real data of various dimensionality and scale. The results show that the FL model can be consistently trained 3-5 times faster than and have the prediction quality on par with a standard cross-validated model.

[370] CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

Main category: cs.LG

TL;DR: CUDA-L2 combines LLMs and RL to automatically optimize HGEMM CUDA kernels, achieving significant speedups over state-of-the-art libraries like cuBLAS and cuBLASLt.

Details

Motivation: Even heavily-optimized performance-critical kernels like HGEMM may have untapped optimization potential that can be discovered through automated exploration of configuration spaces at scales impractical for human engineers.

Method: Combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels, using CUDA execution speed as the RL reward to explore 1,000 configurations.

Result: CUDA-L2 systematically outperforms major matmul baselines: +22.0% over torch.matmul, +19.2% over cuBLAS, +16.8% over cuBLASLt-heuristic, and +11.4% over cuBLASLt-AutoTuning in offline mode. Speedups further increase in server mode to +28.7%, +26.0%, +22.4%, and +15.9% respectively.

Conclusion: LLM-guided RL automation can improve even the most performance-critical, heavily-optimized kernels like HGEMM by systematically exploring configuration spaces at scales impractical for humans, demonstrating the potential of AI-driven optimization for high-performance computing.

Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used {\it torch.matmul} to state-of-the-art Nvidia’s closed-source libraries, i.e., {\it cuBLAS}, {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over {\it torch.matmul} on average; +19.2% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over {\it cuBLASLt-heuristic}, which queries {\it cuBLASLt} library and selects the algorithm based on the heuristic’s suggestion; and +11.4% over the most competitive {\it cuBLASLt-AutoTuning} model, which selects the fastest algorithm from up to 100 candidates from {\it cuBLASLt}’s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning} respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2

[371] GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An

Main category: cs.LG

TL;DR: GoRL decouples optimization from generation in RL, using a tractable latent policy for stable optimization and a conditional generative decoder for expressive action synthesis, achieving better performance than Gaussian and generative policies.

Details

Motivation: There's a tension in RL between stable optimization (Gaussian policies) and expressive multimodal action distributions (generative policies). Gaussian policies are stable but unimodal, while generative policies are expressive but unstable in online RL due to intractable likelihoods and noisy gradients.

Method: GoRL separates optimization from generation: it optimizes a tractable latent policy while using a conditional generative decoder to synthesize actions. A two-timescale update schedule enables stable learning of the latent policy while the decoder gradually increases expressiveness without requiring tractable action likelihoods.

Result: GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines across continuous-control tasks. On HopperStand, it achieves normalized return above 870, more than 3 times that of the strongest baseline.

Conclusion: Decoupling optimization from generation provides a practical path to policies that are both stable and highly expressive in reinforcement learning.

Abstract: Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.

[372] Modeling and Inverse Identification of Interfacial Heat Conduction in Finite Layer and Semi-Infinite Substrate Systems via a Physics-Guided Neural Framework

Wenhao Sha, Tienchong Chang

Main category: cs.LG

TL;DR: HeatTransFormer: A physics-guided Transformer for interface-dominated thermal diffusion problems, enabling accurate forward modeling and reliable inverse property identification.

Details

Motivation: Heat transfer in chip-substrate assemblies creates steep interfacial temperature gradients that conventional numerical solvers struggle with (excessive discretization) and PINNs often fail on (unstable convergence, loss of physical consistency at interfaces).

Method: Transformer architecture with: 1) physically informed spatiotemporal sampling, 2) Laplace-based activation emulating analytical diffusion solutions, 3) mask-free attention for bidirectional spatiotemporal coupling, and 4) physics-constrained inverse strategy for property identification.

Result: HeatTransFormer produces coherent temperature fields across material interfaces, resolves steep gradients, maintains physical consistency, and remains stable where PINNs fail. It enables reliable simultaneous identification of three unknown thermal properties using only external measurements.

Conclusion: Physics-guided Transformer architectures provide a unified framework for both forward and inverse modeling in interface-dominated thermal systems, overcoming limitations of conventional numerical methods and PINNs.

Abstract: Heat transfer in semiconductor devices is dominated by chip and substrate assemblies, where heat generated within a finite chip layer dissipates into a semi-infinite substrate with much higher thermophysical properties. This mismatch produces steep interfacial temperature gradients, making the transient thermal response highly sensitive to the interface. Conventional numerical solvers require excessive discretization to resolve these dynamics, while physics-informed neural networks (PINNs) often exhibit unstable convergence and loss of physical consistency near the material interface. To address these challenges, we introduce HeatTransFormer, a physics-guided Transformer architecture for interface-dominated diffusion problems. The framework integrates physically informed spatiotemporal sampling, a Laplace-based activation emulating analytical diffusion solutions, and a mask-free attention mechanism supporting bidirectional spatiotemporal coupling. These components enable the model to resolve steep gradients, maintain physical consistency, and remain stable where PINNs typically fail. HeatTransFormer produces coherent temperature fields across the interface when applied to a finite layer and semi-infinite substrate configuration. Coupled with a physics-constrained inverse strategy, it further enables reliable identification of three unknown thermal properties simultaneously using only external measurements. Overall, this work demonstrates that physics-guided Transformer architectures provide a unified framework for forward and inverse modeling in interface-dominated thermal systems.

[373] Adapting Tensor Kernel Machines to Enable Efficient Transfer Learning for Seizure Detection

Seline J. S. de Rooij, Borbála Hunyadi

Main category: cs.LG

TL;DR: Proposes Adapt-TKM, an efficient transfer learning method using tensor kernel machines that transfers knowledge via regularization, requiring 100x fewer parameters than adaptive SVM while achieving better performance for seizure detection on wearable EEG devices.

Details

Motivation: Transfer learning can optimize target task performance by leveraging knowledge from related source problems, but existing methods like adaptive SVM can be parameter-heavy and inefficient for resource-constrained wearable devices.

Method: Uses tensor kernel machines with low-rank tensor networks to learn compact non-linear models in the primal domain, enabling efficient adaptation via regularization without adding more parameters. The Adapt-TKM method personalizes patient-independent models with small amounts of patient-specific data.

Result: Adapt-TKM achieves better performance than both patient-independent and fully patient-specific models for seizure detection on behind-the-ear EEG, while requiring around 100 times fewer parameters than adaptive SVM, leading to faster inference speed.

Conclusion: Adapt-TKM provides an efficient transfer learning approach particularly suitable for resource-constrained wearable devices, enabling effective personalization with minimal computational overhead.

Abstract: Transfer learning aims to optimize performance in a target task by learning from a related source problem. In this work, we propose an efficient transfer learning method using a tensor kernel machine. Our method takes inspiration from the adaptive SVM and hence transfers ‘knowledge’ from the source to the ‘adapted’ model via regularization. The main advantage of using tensor kernel machines is that they leverage low-rank tensor networks to learn a compact non-linear model in the primal domain. This allows for a more efficient adaptation without adding more parameters to the model. To demonstrate the effectiveness of our approach, we apply the adaptive tensor kernel machine (Adapt-TKM) to seizure detection on behind-the-ear EEG. By personalizing patient-independent models with a small amount of patient-specific data, the patient-adapted model (which utilizes the Adapt-TKM), achieves better performance compared to the patient-independent and fully patient-specific models. Notably, it is able to do so while requiring around 100 times fewer parameters than the adaptive SVM model, leading to a correspondingly faster inference speed. This makes the Adapt-TKM especially useful for resource-constrained wearable devices.

Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye

Main category: cs.LG

TL;DR: SeeNav-Agent: A novel VLN framework with dual-view visual prompts to reduce perception errors and SRGPO reinforcement fine-tuning for better planning, achieving state-of-the-art navigation performance.

Details

Motivation: Existing LVLM-based VLN agents suffer from perception errors, reasoning errors, and planning errors that significantly hinder navigation performance, requiring improved visual perception and planning capabilities.

Method: 1) Dual-view Visual Prompt technique to reduce perception hallucinations and improve spatial understanding; 2) Step Reward Group Policy Optimization (SRGPO) - a step-level reinforcement fine-tuning method with verifiable process rewards and efficient step-level advantage estimation via random grouping.

Result: GPT-4.1 with VP module achieves 86.7% navigation success rate (20pp improvement); Qwen2.5-VL-3B with SRGPO reaches 72.3% (5.6pp improvement over best existing LVLM). SRGPO shows superior training stability, convergence efficiency, and generalization compared to GRPO/GiGPO.

Conclusion: SeeNav-Agent effectively addresses VLN limitations through dual-view visual prompts and SRGPO reinforcement fine-tuning, achieving significant performance improvements and demonstrating the framework’s effectiveness for embodied navigation tasks.

Abstract: Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent’s understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that by introducing the zero-shot VP module, the GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training based on SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM model by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.

[375] Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz

Main category: cs.LG

TL;DR: F2D2 is a framework that simultaneously reduces neural function evaluations for both sampling and likelihood evaluation in flow-based models by two orders of magnitude, solving a computational bottleneck while maintaining accuracy.

Details

Motivation: Current diffusion and flow-based models require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood, creating a computational bottleneck despite recent distillation methods that accelerate sampling but sacrifice likelihood tractability.

Method: Fast flow joint distillation (F2D2) leverages the insight that in continuous normalizing flows, sampling and likelihood ODEs share a common velocity field, allowing joint distillation of both sampling trajectory and cumulative divergence using a single model with an additional divergence prediction head.

Result: F2D2 achieves accurate log-likelihood with few-step evaluations while maintaining high sample quality, enabling a 2-step MeanFlow model to outperform a 1024-step teacher model with only one additional backward NFE.

Conclusion: F2D2 solves the long-standing computational bottleneck in flow-based generative models by enabling efficient joint distillation of sampling and likelihood evaluation, with applications including lightweight self-guidance methods.

Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today’s best generative models – diffusion and flow-based models – still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2’s capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.

[376] OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei, Qing Li

Main category: cs.LG

TL;DR: OptPO: A framework that dynamically allocates inference budgets for test-time policy optimization in LLMs, reducing computational redundancy while maintaining accuracy.

Details

Motivation: Existing test-time policy optimization methods for LLMs use fixed-budget majority voting, which incurs substantial computational redundancy. There's a need for more efficient allocation of inference budgets during test-time adaptation.

Method: OptPO formulates the voting process as a Bayesian sequential probability ratio test, dynamically halting sampling once posterior confidence in a consensus answer exceeds a threshold. It uses retained rollouts for on-policy updates and integrates with algorithms like PPO or GRPO without requiring ground-truth labels.

Result: Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy.

Conclusion: OptPO offers a computationally efficient paradigm for test-time adaptation by unifying statistically optimal stopping with test-time learning, enabling more efficient use of computational resources during LLM inference.

Abstract: Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.

[377] Adaptive Weighted LSSVM for Multi-View Classification

Farnaz Faramarzi Lighvan, Mehrdad Asadi, Lynn Houthuys

Main category: cs.LG

TL;DR: AW-LSSVM is an adaptive weighted LS-SVM for multi-view learning that promotes complementary learning through iterative global coupling, making each view focus on hard samples from other views.

Details

Motivation: Existing kernel-based multi-view learning methods have limitations: they either use fusion techniques without explicit collaboration across views, or use co-regularization which restricts global collaboration. There's a need for better multi-view methods that enable more effective collaboration while maintaining view isolation for privacy preservation.

Method: AW-LSSVM (Adaptive Weighted Least Squares Support Vector Machine) uses an iterative global coupling mechanism where each view adaptively focuses on hard samples from other views from previous iterations, promoting complementary learning across views while keeping raw features isolated.

Result: Experiments show AW-LSSVM outperforms existing kernel-based multi-view methods on most datasets, while maintaining view isolation which makes it suitable for privacy-preserving scenarios.

Conclusion: AW-LSSVM provides an effective approach for multi-view learning through adaptive complementary learning, achieving better performance than existing methods while preserving privacy by keeping raw features isolated across views.

Abstract: Multi-view learning integrates diverse representations of the same instances to improve performance. Most existing kernel-based multi-view learning methods use fusion techniques without enforcing an explicit collaboration type across views or co-regularization which limits global collaboration. We propose AW-LSSVM, an adaptive weighted LS-SVM that promotes complementary learning by an iterative global coupling to make each view focus on hard samples of others from previous iterations. Experiments demonstrate that AW-LSSVM outperforms existing kernel-based multi-view methods on most datasets, while keeping raw features isolated, making it also suitable for privacy-preserving scenarios.

[378] Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models

Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, Yuki Mitsufuji

Main category: cs.LG

TL;DR: A continual unlearning framework for generative models that handles sequential deletion requests via generative distillation, preventing catastrophic forgetting while maintaining model quality.

Details

Motivation: Real-world data privacy regulations (like GDPR's Right to be Forgotten) require removing specific concepts from generative models, but existing unlearning methods fail when deletion requests arrive sequentially, causing stability crises and quality degradation.

Method: Introduces a generative distillation-based continual unlearning framework that treats each unlearning step as multi-objective teacher-student distillation, leveraging continual learning principles to maintain model integrity during sequential deletions.

Result: Experiments on 10-step sequential benchmark show the method successfully unlearns target concepts with high fidelity while preserving performance on retained concepts and overall image quality, significantly outperforming baseline approaches.

Conclusion: The framework provides a practical solution for responsible deployment of large-scale generative models, enabling compliance with ongoing data removal requests without requiring expensive retraining or suffering catastrophic forgetting.

Abstract: The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR’s ``Right to be Forgotten.’’ This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative distillation based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.

[379] Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents

Haozhuo Zheng, Cheng Wang, Yang Liu

Main category: cs.LG

TL;DR: GVT is a two-stage generative framework that compresses molecular graphs into discrete latent sequences using a Graph VQ-VAE, then generates molecules via autoregressive Transformer, achieving SOTA performance with high efficiency.

Details

Motivation: Current diffusion models for molecular generation are computationally intensive, while autoregressive models suffer from error propagation. There's a need for an efficient, accurate approach that bridges molecular design with large-scale sequence modeling paradigms.

Method: Two-stage framework: 1) Graph VQ-VAE compresses molecular graphs into high-fidelity discrete latent sequences using Graph Transformer with RCM node ordering and RoPE, 2) Autoregressive Transformer trained on these discrete latents for sequence-based generation.

Result: Achieves state-of-the-art or highly competitive performance on ZINC250k, MOSES, and GuacaMol benchmarks. Notably outperforms leading diffusion models on key metrics like FCD and KL Divergence. Achieves near-perfect reconstruction rates.

Conclusion: GVT presents a compelling alternative to diffusion models with superior performance and efficiency. Establishes a new baseline for discrete latent-space molecular generation and bridges molecular design with large-scale sequence modeling, enabling potential synergies with LLMs.

Abstract: The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.

[380] Conformal Correction for Efficiency May be at Odds with Entropy

Senrong Xu, Tianyu Wang, Zenan Li, Yuan Yao, Taolue Chen, Feng Xu, Xiaoxing Ma

Main category: cs.LG

TL;DR: Proposes entropy-constrained conformal correction to improve CP efficiency while controlling prediction entropy, achieving up to 34.4% efficiency gains over SOTA methods.

Details

Motivation: Conformal prediction provides rigorous uncertainty sets but can be inefficient. Conformal correction improves efficiency but may increase prediction entropy. There's a trade-off between CP efficiency and prediction entropy that needs better optimization.

Method: Proposes entropy-constrained conformal correction method that fine-tunes models with a conformal-aware inefficiency loss while constraining prediction entropy. Explores Pareto optimum between efficiency and entropy through entropy constraints.

Result: Extensive experiments on computer vision and graph datasets show the method significantly improves efficiency of state-of-the-art CP methods by up to 34.4% while maintaining entropy constraints.

Conclusion: The entropy-constrained approach effectively balances the trade-off between CP efficiency and prediction entropy, achieving better Pareto optimality and practical improvements in conformal prediction performance.

Abstract: Conformal prediction (CP) provides a comprehensive framework to produce statistically rigorous uncertainty sets for black-box machine learning models. To further improve the efficiency of CP, conformal correction is proposed to fine-tune or wrap the base model with an extra module using a conformal-aware inefficiency loss. In this work, we empirically and theoretically identify a trade-off between the CP efficiency and the entropy of model prediction. We then propose an entropy-constrained conformal correction method, exploring a better Pareto optimum between efficiency and entropy. Extensive experimental results on both computer vision and graph datasets demonstrate the efficacy of the proposed method. For instance, it can significantly improve the efficiency of state-of-the-art CP methods by up to 34.4%, given an entropy threshold.

[381] FGC-Comp: Adaptive Neighbor-Grouped Attribute Completion for Graph-based Anomaly Detection

Junpeng Wu, Pinheng Zong

Main category: cs.LG

TL;DR: FGC-Comp is a lightweight attribute completion module that improves graph-based anomaly detection under missing or adversarial attributes by partitioning neighbors into label-based groups and using group-specific transforms with residual connections.

Details

Motivation: Existing graph-based anomaly detection models often fail to handle missing and adversarially obscured node attributes, which undermines aggregation stability and prediction reliability in real-world applications.

Method: Proposes FGC-Comp: partitions neighbors into three label-based groups (positive, negative, unknown), applies group-specific transforms to labeled groups while using node-conditioned gates for unknowns, fuses messages via residual connections, and trains end-to-end with binary classification.

Result: Experiments on two real-world fraud datasets demonstrate effectiveness with negligible computational overhead, improving aggregation stability and prediction reliability under incomplete attributes.

Conclusion: FGC-Comp provides a lightweight, classifier-agnostic solution for enhancing graph-based anomaly detection robustness against missing and adversarial attributes in practical deployment scenarios.

Abstract: Graph-based Anomaly Detection models have gained widespread adoption in recent years, identifying suspicious nodes by aggregating neighborhood information. However, most existing studies overlook the pervasive issues of missing and adversarially obscured node attributes, which can undermine aggregation stability and prediction reliability. To mitigate this, we propose FGC-Comp, a lightweight, classifier-agnostic, and deployment-friendly attribute completion module-designed to enhance neighborhood aggregation under incomplete attributes. We partition each node’s neighbors into three label-based groups, apply group-specific transforms to the labeled groups while a node-conditioned gate handles unknowns, fuse messages via residual connections, and train end-to-end with a binary classification objective to improve aggregation stability and prediction reliability under missing attributes. Experiments on two real-world fraud datasets validate the effectiveness of the approach with negligible computational overhead.

[382] Credal Graph Neural Networks

Matteo Tolloso, Davide Bacciu

Main category: cs.LG

TL;DR: First credal graph neural networks (CGNNs) that output set-valued predictions for uncertainty quantification in node classification, achieving SOTA performance on heterophilic graphs under distributional shift.

Details

Motivation: Existing uncertainty quantification approaches for GNNs rely on Bayesian inference or ensembles, but there's a need for more reliable epistemic uncertainty representation, especially under out-of-distribution conditions.

Method: Introduce credal graph neural networks (CGNNs) that extend credal learning to graph domain, training GNNs to output set-valued predictions as credal sets. Develop complementary approach leveraging layer-wise information propagation during message passing.

Result: CGNNs deliver more reliable representations of epistemic uncertainty and achieve state-of-the-art performance under distributional shift on heterophilic graphs. Analysis highlights critical role of graph homophily assumption in shaping uncertainty estimate effectiveness.

Conclusion: CGNNs provide effective uncertainty quantification for GNNs, particularly valuable for heterophilic graphs where traditional assumptions about graph structure may not hold, offering improved reliability under distributional shift.

Abstract: Uncertainty quantification is essential for deploying reliable Graph Neural Networks (GNNs), where existing approaches primarily rely on Bayesian inference or ensembles. In this paper, we introduce the first credal graph neural networks (CGNNs), which extend credal learning to the graph domain by training GNNs to output set-valued predictions in the form of credal sets. To account for the distinctive nature of message passing in GNNs, we develop a complementary approach to credal learning that leverages different aspects of layer-wise information propagation. We assess our approach on uncertainty quantification in node classification under out-of-distribution conditions. Our analysis highlights the critical role of the graph homophily assumption in shaping the effectiveness of uncertainty estimates. Extensive experiments demonstrate that CGNNs deliver more reliable representations of epistemic uncertainty and achieve state-of-the-art performance under distributional shift on heterophilic graphs.

[383] Adversarial Jamming for Autoencoder Distribution Matching

Waleed El-Geresy, Deniz Gündüz

Main category: cs.LG

TL;DR: Using adversarial wireless jamming as regularization to match autoencoder latent space to diagonal Gaussian distribution, achieving comparable results to VAE/WAE.

Details

Motivation: To develop a novel approach for distribution matching in autoencoder latent spaces by leveraging insights from adversarial wireless jamming theory, where jamming noise naturally produces diagonal Gaussian distributions.

Method: Formulate a minimax game between encoder-decoder pair and adversarial jammer, where jammer attempts to disrupt Gaussian source recovery. Use jamming as auxiliary objective to regularize latent space to match diagonal Gaussian distribution.

Result: Achieves distribution matching comparable to standard variational autoencoders (VAE) and Wasserstein autoencoders (WAE), with potential generalization to other latent distributions.

Conclusion: Adversarial jamming provides effective regularization for latent space distribution matching, offering a novel alternative to existing autoencoder regularization techniques.

Abstract: We propose the use of adversarial wireless jamming to regularise the latent space of an autoencoder to match a diagonal Gaussian distribution. We consider the minimisation of a mean squared error distortion, where a jammer attempts to disrupt the recovery of a Gaussian source encoded and transmitted over the adversarial channel. A straightforward consequence of existing theoretical results is the fact that the saddle point of a minimax game - involving such an encoder, its corresponding decoder, and an adversarial jammer - consists of diagonal Gaussian noise output by the jammer. We use this result as inspiration for a novel approach to distribution matching in the latent space, utilising jamming as an auxiliary objective to encourage the aggregated latent posterior to match a diagonal Gaussian distribution. Using this new technique, we achieve distribution matching comparable to standard variational autoencoders and to Wasserstein autoencoders. This approach can also be generalised to other latent distributions.

[384] FiMMIA: scaling semantic perturbation-based membership inference across modalities

Anton Emelyanov, Sergei Kudriashov, Alena Fenogenova

Main category: cs.LG

TL;DR: FiMMIA is a modular framework for multimodal membership inference attacks that addresses distribution shifts in MLLMs by generalizing perturbation-based methods to detect whether specific data points were in the training set.

Details

Motivation: Existing membership inference methods for LLMs perform poorly on multimodal LLMs due to instabilities from multimodal component adaptation and distribution shifts across multiple inputs, creating a need for specialized MIA approaches for multimodal domains.

Method: The approach identifies distribution shifts in existing datasets, releases an extended baseline pipeline, and generalizes perturbation-based membership inference methods to MLLMs. It trains a neural network to analyze target model behavior on perturbed inputs to capture distributional differences between members and non-members.

Result: Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of the perturbation-based membership inference attacks in multimodal domains.

Conclusion: FiMMIA provides a successful framework for multimodal membership inference that addresses the unique challenges of MLLMs, with publicly available source code and framework under MIT license.

Abstract: Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although there are have been numerous methods developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize the perturbation-based membership inference methods to MLLMs and release \textbf{FiMMIA} – a modular \textbf{F}ramework for \textbf{M}ultimodal \textbf{MIA}.\footnote{The source code and framework have been made publicly available under the MIT license via \href{https://github.com/ai-forever/data_leakage_detect}{link}.The video demonstration is available on \href{https://youtu.be/a9L4-H80aSg}{YouTube}.} Our approach trains a neural network to analyze the target model’s behavior on perturbed inputs, capturing distributional differences between members and non-members. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal domains.

Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen

Main category: cs.LG

TL;DR: Flow-based diffusion models have a two-stage training dynamic: early stage generalizes across data modes for global layouts, later stage memorizes fine details from nearest samples.

Details

Motivation: Flow-based diffusion models are widely used for generative modeling but their memorization-generalization behavior remains poorly understood. The authors aim to analyze the training dynamics of flow matching objectives to better understand how these models learn.

Method: The study revisits the flow matching objective and analyzes its marginal velocity field, which admits a closed-form expression allowing exact computation of the oracle FM target. This analysis reveals the inherent two-stage training structure of flow-based diffusion models.

Result: Analysis shows flow-based diffusion models have a two-stage training target: early stage guided by mixture of data modes (generalization), later stage dominated by nearest data sample (memorization). This explains practical techniques like timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices.

Conclusion: The study deepens understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements by revealing the inherent two-stage learning behavior in flow-based diffusion models.

Abstract: Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.

[386] A Comparative Study on How Data Normalization Affects Zero-Shot Generalization in Time Series Foundation Models

Ihab Ahmed, Denis Krompaß, Cheng Feng, Volker Tresp

Main category: cs.LG

TL;DR: REVIN normalization is the most efficient approach for Time-Series Foundation Models, reducing zero-shot MASE by 89% vs un-normalized baselines and 44% vs other methods, while maintaining top in-domain accuracy.

Details

Motivation: Normalization is well-studied in dataset-specific time-series models but overlooked in Time-Series Foundation Models (TSFMs) where generalization across domains is critical. Time-series data exhibits significant scale variation across domains/channels and non-stationarity, which can undermine TSFM performance regardless of architectural complexity.

Method: Systematic evaluation across four architecturally diverse TSFMs to compare normalization methods. Focus on REVIN (Reversible Instance Normalization) approach and its comparison against other normalization techniques and un-normalized baselines.

Result: REVIN emerged as the most efficient normalization method, reducing zero-shot MASE by 89% relative to un-normalized baselines and by 44% versus other normalization methods. It matched the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing, yielding the highest accuracy-efficiency trade-off.

Conclusion: REVIN normalization is highly effective for TSFMs, but its utilization depends on architectural design choices and optimization objectives, particularly regarding training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).

Abstract: We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, coupled with non-stationarity, can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89% relative to an un-normalized baseline and by 44% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing – yielding the highest accuracy-efficiency trade-off. Yet its effect utilization depends on architectural design choices and optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).

[387] GraphMatch: Fusing Language and Graph Representations in a Dynamic Two-Sided Work Marketplace

Mikołaj Sacha, Hammad Jafri, Mattie Terzolo, Ayan Sinha, Andrew Rabinovich

Main category: cs.LG

TL;DR: GraphMatch is a recommendation framework that combines pre-trained language models with graph neural networks for text-rich, dynamic two-sided marketplaces, outperforming language-only and graph-only baselines.

Details

Motivation: Recommending matches in text-rich, dynamic two-sided marketplaces is challenging due to evolving content and interaction graphs. Existing standalone models don't effectively handle both fine-grained text semantics and time-sensitive graph structure.

Method: GraphMatch fuses pre-trained language models with graph neural networks, using adversarial negative sampling and point-in-time subgraph training to capture both text semantics and graph structure. It’s designed for low-latency inference suitable for real-time use.

Result: Extensive evaluation on Upwork data shows GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime.

Conclusion: Unifying language and graph representations yields an effective solution for text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained language models and large-scale graphs in practice.

Abstract: Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recommendation framework that fuses pre-trained language models with graph neural networks to overcome these challenges. Unlike prior approaches centered on standalone models, GraphMatch is a comprehensive recipe built on powerful text encoders and GNNs working in tandem. It employs adversarial negative sampling alongside point-in-time subgraph training to learn representations that capture both the fine-grained semantics of evolving text and the time-sensitive structure of the graph. We evaluated extensively on interaction data from Upwork, a leading labor marketplace, at large scale, and discuss our approach towards low-latency inference suitable for real-time use. In our experiments, GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime. These results demonstrate that unifying language and graph representations yields a highly effective solution to text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained LMs and large-scale graphs in practice.

[388] Adaptive Decentralized Federated Learning for Robust Optimization

Shuyuan Wu, Feifei Wang, Yuan Gao, Hansheng Wang

Main category: cs.LG

TL;DR: Proposes adaptive DFL (aDFL) that adjusts client learning rates to mitigate abnormal client impact without requiring many normal neighbors or prior knowledge.

Details

Motivation: Previous DFL methods for handling abnormal clients require either a large number of normal neighboring clients or prior knowledge of reliable clients, which limits practical applicability.

Method: Develops adaptive DFL (aDFL) that adaptively adjusts client learning rates - assigning smaller rates to suspicious clients and larger rates to normal clients to mitigate negative impact of abnormal clients.

Result: Theoretical analysis shows oracle property without stringent conditions on neighboring nodes or prior knowledge. Extensive experiments demonstrate superior performance.

Conclusion: aDFL provides a fully adaptive approach to robust decentralized federated learning that overcomes limitations of previous methods.

Abstract: In decentralized federated learning (DFL), the presence of abnormal clients, often caused by noisy or poisoned data, can significantly disrupt the learning process and degrade the overall robustness of the model. Previous methods on this issue often require a sufficiently large number of normal neighboring clients or prior knowledge of reliable clients, which reduces the practical applicability of DFL. To address these limitations, we develop here a novel adaptive DFL (aDFL) approach for robust estimation. The key idea is to adaptively adjust the learning rates of clients. By assigning smaller rates to suspicious clients and larger rates to normal clients, aDFL mitigates the negative impact of abnormal clients on the global model in a fully adaptive way. Our theory does not put any stringent conditions on neighboring nodes and requires no prior knowledge. A rigorous convergence analysis is provided to guarantee the oracle property of aDFL. Extensive numerical experiments demonstrate the superior performance of the aDFL method.

[389] Assessing the performance of correlation-based multi-fidelity neural emulators

Cristian J. Villatoro, Gianluca Geraci, Daniele E. Schiavazzi

Main category: cs.LG

TL;DR: Multi-fidelity neural emulators combine limited high-fidelity data with abundant low-fidelity information to create efficient surrogate models for computationally expensive tasks.

Details

Motivation: Outer loop tasks like optimization, uncertainty quantification, and inference become intractable with expensive high-fidelity models, and data-driven methods need large datasets. Multi-fidelity approaches address these challenges by leveraging inexpensive low-fidelity information while correcting with scarce high-fidelity data.

Method: Study investigates multi-fidelity neural emulators (neural networks) that learn input-to-output mapping by integrating limited high-fidelity data with abundant low-fidelity solutions. Examines various architectures (MLP, Siren, KAN), coordinate encoding mechanisms, exact/learnable low-fidelity information, and varying dataset sizes across different function types and dimensions.

Result: Comprehensive analysis of multi-fidelity emulator performance across low/high-dimensional functions, oscillatory patterns, discontinuities, similar/dissimilar model parametrizations, and potentially corrupted low-fidelity sources. Includes comparison with single-fidelity approaches to quantify performance gains.

Conclusion: Multi-fidelity neural emulators offer significant advantages over single-fidelity approaches by efficiently combining multiple information sources, potentially enabling tractable solutions for computationally expensive tasks while requiring less high-fidelity data.

Abstract: Outer loop tasks such as optimization, uncertainty quantification or inference can easily become intractable when the underlying high-fidelity model is computationally expensive. Similarly, data-driven architectures typically require large datasets to perform predictive tasks with sufficient accuracy. A possible approach to mitigate these challenges is the development of multi-fidelity emulators, leveraging potentially biased, inexpensive low-fidelity information while correcting and refining predictions using scarce, accurate high-fidelity data. This study investigates the performance of multi-fidelity neural emulators, neural networks designed to learn the input-to-output mapping by integrating limited high-fidelity data with abundant low-fidelity model solutions. We investigate the performance of such emulators for low and high-dimensional functions, with oscillatory character, in the presence of discontinuities, for collections of models with equal and dissimilar parametrization, and for a possibly large number of potentially corrupted low-fidelity sources. In doing so, we consider a large number of architectural, hyperparameter, and dataset configurations including networks with a different amount of spectral bias (Multi-Layered Perceptron, Siren and Kolmogorov Arnold Network), various mechanisms for coordinate encoding, exact or learnable low-fidelity information, and for varying training dataset size. We further analyze the added value of the multi-fidelity approach by conducting equivalent single-fidelity tests for each case, quantifying the performance gains achieved through fusing multiple sources of information.

[390] FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization

Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang

Main category: cs.LG

TL;DR: Fairy2i converts pre-trained real-valued LLMs to complex-valued form for extreme low-bit quantization (down to 2 bits) while maintaining performance, enabling efficient inference on commodity hardware.

Details

Motivation: LLMs have massive memory/computational demands requiring aggressive quantization. Complex-valued LLMs offer better low-bit representation but require training from scratch, preventing use of existing pre-trained real-valued models.

Method: 1) Proves mathematical equivalence between real and widely-linear complex maps to convert pre-trained Transformers losslessly; 2) Uses phase-aware quantization with fourth roots of unity codebook; 3) Implements recursive residual quantization to minimize error via multiplication-free accumulation.

Result: Fairy2i restores LLaMA-2 7B performance at effective 2-bit precision to near full-precision levels, significantly outperforming state-of-the-art real-valued binary/ternary quantization methods.

Conclusion: Bridges gap between complex-valued arithmetic efficiency and practical utility of pre-trained models, enabling efficient inference on commodity hardware while reusing existing checkpoints.

Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.

[391] Hypothesis Testing for Generalized Thurstone Models

Anuran Makur, Japneet Singh

Main category: cs.LG

TL;DR: Developed a hypothesis testing framework to determine if pairwise comparison data comes from a generalized Thurstone model, with minimax analysis and graph-dependent thresholds.

Details

Motivation: Prior work focused on parameter estimation and uncertainty quantification for Thurstone models, but there's a fundamental gap in hypothesis testing to determine if data actually follows such models.

Method: Introduced separation distance between general pairwise comparison models and Thurstone models; derived upper/lower bounds on testing thresholds based on observation graph topology; proposed hypothesis test with confidence intervals; used reverse martingale techniques for time-uniform error bounds and information-theoretic methods for minimax lower bounds.

Result: For complete graphs, testing threshold scales as Θ((nk)^{-1/2}) where n is agents and k is comparisons per pair; established theoretical guarantees for hypothesis testing; validated results on synthetic and real-world datasets.

Conclusion: Developed a comprehensive hypothesis testing framework for Thurstone models with minimax optimality guarantees, bridging the gap between parameter estimation and model validation in pairwise comparison analysis.

Abstract: In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying \emph{generalized Thurstone model} $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for $\mathcal{T}_F$ models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of $\mathcal{T}_F$ models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as $Θ((nk)^{-1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.

[392] Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation

Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang

Main category: cs.LG

TL;DR: Multimodal approach combining road network data and satellite images improves traffic accident prediction by 3.7% AUROC over graph-only models, enabling causal analysis of contributing factors.

Details

Motivation: Previous traffic accident prediction methods rely primarily on road network structural features but overlook physical and environmental information from road surfaces and surroundings, limiting prediction accuracy and understanding of contributing factors.

Method: Constructed large multimodal dataset across six U.S. states with 9M accident records, 1M high-resolution satellite images aligned to road graph nodes, plus weather statistics, road type annotations, and traffic volume data. Evaluated multimodal learning methods integrating visual and network embeddings.

Result: Multimodal integration achieved 90.1% AUROC, 3.7% gain over graph neural networks using only graph structures. Causal analysis revealed accident rates increase by 24% with higher precipitation, 22% on higher-speed roads, and 29% due to seasonal patterns.

Conclusion: Integrating satellite imagery with road network data significantly improves traffic accident prediction accuracy and enables better understanding of causal factors. Satellite imagery features are essential for accurate prediction, as confirmed by ablation studies.

Abstract: We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region’s weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1%$, which is a $3.7%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24%$ under higher precipitation, by $22%$ on higher-speed roads such as motorways, and by $29%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.

[393] Fast Gaussian Process Approximations for Autocorrelated Data

Ahmadreza Chokhachian, Matthias Katzfuss, Yu Ding

Main category: cs.LG

TL;DR: Proposes methods to accelerate Gaussian process regression for autocorrelated data by adapting existing fast approximations to work with blocked data, avoiding temporal overfitting while maintaining prediction performance.

Details

Motivation: Standard fast Gaussian process approximations assume independent, identically distributed data, but autocorrelated data causes temporal overfitting that deteriorates model performance. Existing methods need modification to handle autocorrelation properly.

Method: Adapts existing Gaussian process approximations to work with blocked data, where originally correlated data points are segmented into blocks to de-correlate the data within blocks.

Result: Numerical experiments across diverse datasets show the proposed approaches can remarkably accelerate computation for Gaussian process regression on autocorrelated data without compromising prediction performance.

Conclusion: The proposed blocked data approach enables efficient Gaussian process regression for autocorrelated data by preventing temporal overfitting while maintaining computational speed and prediction accuracy.

Abstract: This paper is concerned with the problem of how to speed up computation for Gaussian process models trained on autocorrelated data. The Gaussian process model is a powerful tool commonly used in nonlinear regression applications. Standard regression modeling assumes random samples and an independently, identically distributed noise. Various fast approximations that speed up Gaussian process regression work under this standard setting. But for autocorrelated data, failing to account for autocorrelation leads to a phenomenon known as temporal overfitting that deteriorates model performance on new test instances. To handle autocorrelated data, existing fast Gaussian process approximations have to be modified; one such approach is to segment the originally correlated data points into blocks in which the blocked data are de-correlated. This work explains how to make some of the existing Gaussian process approximations work with blocked data. Numerical experiments across diverse application datasets demonstrate that the proposed approaches can remarkably accelerate computation for Gaussian process regression on autocorrelated data without compromising model prediction performance.

[394] Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis

Jennifer Zvonek, Andrew Gillette

Main category: cs.LG

TL;DR: PruningAMR: Algorithm that creates adaptive mesh from neural network weights without training data, saving memory for visualization tasks.

Details

Motivation: Many visualization tasks using implicit neural representations (INRs) still require discretization to regular grids, which can be memory-intensive. There's a need for automatic mesh generation that adapts to the underlying function's resolution without accessing original training data.

Method: Uses interpolative decomposition pruning on INR weight matrices to identify geometric features. The pruned network guides adaptive mesh refinement, creating variable-resolution meshes tailored to the function’s features.

Result: Produces variable resolution visualizations with substantial memory savings from pre-trained INRs without needing access to training data.

Conclusion: PruningAMR enables efficient mesh generation from INRs by leveraging network pruning techniques to automatically adapt mesh resolution to geometric features, offering significant memory advantages for visualization tasks.

Abstract: An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, represent data natively as INRs. While INRs are prized for being more memory-efficient than traditional data stored on a lattice, many visualization tasks still require discretization to a regular grid. We present PruningAMR, an algorithm that builds a mesh with resolution adapted to geometric features encoded by the INR. To identify these geometric features, we use an interpolative decomposition pruning method on the weight matrices of the INR. The resulting pruned network is used to guide adaptive mesh refinement, enabling automatic mesh generation tailored to the underlying resolution of the function. Starting from a pre-trained INR–without access to its training data–we produce a variable resolution visualization with substantial memory savings.

[395] ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics

Louis McConnell, Jieran Sun, Theo Maffei, Raphael Gottardo, Marianna Rapsomaniki

Main category: cs.LG

TL;DR: ProteinPNet is a prototype-based framework that discovers interpretable spatial motifs in tumor microenvironment from spatial proteomics data.

Details

Motivation: Understanding spatial architecture of tumor microenvironment is critical for precision oncology, but current methods lack direct interpretability.

Method: Prototypical part networks that directly learn discriminative, interpretable, faithful spatial prototypes through supervised training.

Result: Identifies biologically meaningful prototypes aligned with different tumor subtypes, capturing interpretable features of immune infiltration and tissue modularity.

Conclusion: Prototype-based learning reveals interpretable spatial biomarkers within TME, with implications for mechanistic discovery in spatial omics.

Abstract: Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explanability models, ProteinPNet directly learns discriminative, interpretable, faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.

[396] Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge

Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

Main category: cs.LG

TL;DR: LLM judges are noisy at single-sample level; proposed distribution-calibrated aggregation method using Bradley-Terry-Davidson model on rating counts improves reliability by leveraging both polarity and decisiveness.

Details

Motivation: Current LLM-based pairwise preference evaluation suffers from noise at individual sample level, and standard aggregation methods (majority vote, soft self-consistency, instruction-based self-aggregation) are inconsistent when ties are allowed, requiring better inference-time compute allocation and aggregation.

Method: Proposes distribution-calibrated aggregation scheme using Bradley-Terry-Davidson formulation on rating counts from n independent thinking-rating samples per item. Models three-way preferences (win/lose/tie) and leverages both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus.

Result: Across various evaluation benchmarks, the approach consistently reduces MAE and increases pairwise accuracy versus standard baselines. When evaluated against human-consensus meta-labels, matches or exceeds individual human raters.

Conclusion: Careful allocation of inference-time compute combined with distribution-aware aggregation methods can transform noisy individual LLM judgments into reliable ratings for evaluation, demonstrating that systematic aggregation is key to making LLM-based evaluation robust.

Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.

[397] TokenPowerBench: Benchmarking the Power Consumption of LLM Inference

Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, Yong Chen

Main category: cs.LG

TL;DR: TokenPowerBench is the first lightweight benchmark for measuring LLM inference power consumption, enabling users to analyze energy efficiency across different models and configurations without specialized hardware.

Details

Motivation: LLM inference accounts for over 90% of total power consumption in LLM services, but existing benchmarks focus on training/performance rather than power measurement for inference.

Method: A three-component benchmark: (1) declarative configuration interface for models, prompts, and engines; (2) measurement layer capturing GPU/node/system power without specialized meters; (3) phase-aligned metrics pipeline attributing energy to prefill and decode stages.

Result: Evaluated on four major model series (Llama, Falcon, Qwen, Mistral) ranging from 1B to 405B parameters, demonstrating ability to measure power consumption across diverse LLM configurations.

Conclusion: TokenPowerBench provides an open-source tool for measuring LLM inference power consumption, helping users forecast operational costs and meet sustainability goals in LLM deployments.

Abstract: Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or performance of inference and provide little support for power consumption measurement and analysis of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straight-forward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users to measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.

[398] Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer

Yu Yang, Pan Xu

Main category: cs.LG

TL;DR: LPDT: Language model-initialized Prompt Decision Transformer that uses pretrained language models and LoRA fine-tuning to improve few-shot RL performance with limited data.

Details

Motivation: Prompt-DT methods struggle with limited datasets and poor task differentiation due to Transformer's data-hungry nature. Collecting environment-specific data is costly/unsafe, leading to suboptimal few-shot performance.

Method: LPDT leverages pretrained language models for prior knowledge, fine-tunes with Low-rank Adaptation (LoRA) for meta-RL, and incorporates prompt regularization to better differentiate tasks via prompt features.

Result: Achieves similar performance to Prompt-DT with only 10% data in some MuJoCo control tasks. Ablation studies validate effectiveness of sequence modeling, language models, prompt regularization, and prompt strategies.

Conclusion: Language model initialization provides valuable prior knowledge for RL tasks, enabling effective few-shot learning with limited data through the LPDT framework.

Abstract: Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer’s capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT type of methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT) framework, which leverages pretrained language models providing rich prior knowledge for RL tasks and fine-tunes the sequence model using Low-rank Adaptation (LoRA) for meta-RL problems. We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Comprehensive empirical studies demonstrate that initializing with a pre-trained language model provides the prior knowledge and achieves a similar performance with Prompt-DT under only $10%$ data in some MuJoCo control tasks. We also provide a thorough ablation study to validate the effectiveness of each component, including sequence modeling, language models, prompt regularizations, and prompt strategies.

[399] Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun

Main category: cs.LG

TL;DR: Bridge is a statistical framework that models systematic differences between human and LLM evaluations, allowing for correction of LLM-as-a-judge ratings to better align with human preferences.

Details

Motivation: LLMs are increasingly used as judges to evaluate model outputs at scale, but their assessments systematically diverge from human judgments, creating a need for a principled way to bridge this gap.

Method: Bridge posits latent human preference scores for prompt-response pairs and models LLM deviations as linear transformations of covariates capturing discrepancy sources. It provides an efficient fitting algorithm with asymptotic guarantees for statistical inference.

Result: Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings across accuracy, calibration, and KL divergence metrics, while exposing systematic human-LLM gaps.

Conclusion: Bridge offers a simple, principled framework for refining LLM ratings and characterizing systematic discrepancies between human and LLM evaluations, improving the reliability of LLM-as-a-judge systems.

Abstract: Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.

[400] Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, Junyang Lin

Main category: cs.LG

TL;DR: The paper explains why token-level surrogate objectives can optimize sequence-level rewards in RL with LLMs, showing this works when training-inference discrepancy and policy staleness are minimized. It provides theoretical justification for importance sampling, clipping, and Routing Replay techniques.

Details

Motivation: To provide a principled understanding of why and when token-level surrogate objectives can effectively optimize sequence-level rewards in reinforcement learning with large language models, and to explain the crucial role of existing stabilization techniques.

Method: Uses first-order approximation to show token-level surrogate objectives approximate sequence-level rewards when training-inference discrepancy and policy staleness are minimized. Validates through extensive experiments with 30B MoE models using policy gradient methods (REINFORCE), importance sampling correction, clipping, and Routing Replay.

Result: For on-policy training, basic policy gradient with importance sampling achieves highest stability. For off-policy updates, combining clipping and Routing Replay is essential to mitigate instability from policy staleness. Once stabilized, prolonged optimization yields comparable final performance regardless of initialization.

Conclusion: The paper provides theoretical insights into RL training with LLMs, explaining why token-level objectives work and offering practical recipes for stable training. The findings should facilitate future research in this area.

Abstract: This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.

[401] Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Tong Li, Jacob Nogas, Haochen Song, Anna Rafferty, Eric M. Schwartz, Audrey Durand, Harsh Kumar, Nina Deliu, Sofia S. Villar, Dehan Kong, Joseph J. Williams

Main category: cs.LG

TL;DR: TS-PostDiff algorithm mixes Thompson Sampling with uniform random assignment to balance user benefits with statistical inference quality, reducing false positives when differences are small while increasing reward when differences are large.

Details

Motivation: Traditional uniform random A/B testing is statistically sound but slow, while Thompson Sampling maximizes user benefits but causes biased estimates and increases false positives/negatives. Need algorithm that balances both objectives.

Method: TS-PostDiff mixes TS with UR assignment using adaptive probability: UR probability is proportional to posterior probability that arm differences are small. Allows experimenter to define “small difference” threshold.

Result: TS-PostDiff reduces false positives and increases statistical power when differences are small, while increasing reward more when differences are large compared to UR, TS, and other TS variants.

Conclusion: TS-PostDiff provides better tradeoffs between user benefits and statistical inference, highlighting important considerations for future statistically sensitive algorithm development in adaptive experimentation.

Abstract: Traditional randomized A/B experiments assign arms with uniform random (UR) probability, such as 50/50 assignment to two versions of a website to discover whether one version engages users more. To more quickly and automatically use data to benefit users, multi-armed bandit algorithms such as Thompson Sampling (TS) have been advocated. While TS is interpretable and incorporates the randomization key to statistical inference, it can cause biased estimates and increase false positives and false negatives in detecting differences in arm means. We introduce a more Statistically Sensitive algorithm, TS-PostDiff (Posterior Probability of Small Difference), that mixes TS with traditional UR by using an additional adaptive step, where the probability of using UR (vs TS) is proportional to the posterior probability that the difference in arms is small. This allows an experimenter to define what counts as a small difference, below which a traditional UR experiment can obtain informative data for statistical inference at low cost, and above which using more TS to maximize user benefits is key. We evaluate TS-PostDiff against UR, TS, and two other TS variants designed to improve statistical inference. We consider results for the common two-armed experiment across a range of settings inspired by real-world applications. Our results provide insight into when and why TS-PostDiff or alternative approaches provide better tradeoffs between benefiting users (reward) and statistical inference (false positive rate and power). TS-PostDiff’s adaptivity helps efficiently reduce false positives and increase statistical power when differences are small, while increasing reward more when differences are large. The work highlights important considerations for future Statistically Sensitive algorithm development that balances reward and statistical analysis in adaptive experimentation.

[402] Can-SAVE: Deploying Low-Cost and Population-Scale Cancer Screening via Survival Analysis Variables and EHR

Petr Philonenko, Vladimir Kokh, Pavel Blinov

Main category: cs.LG

TL;DR: Can-SAVE is a lightweight AI system that uses medical history events to rank cancer risks, achieving 4-10x higher detection rates than baselines and nearly doubling cancer detection in real-world deployment.

Details

Motivation: Conventional cancer screening is costly, labor-intensive, and difficult to scale. Most AI systems rely on complex medical data, making them impractical for large-scale population screening.

Method: Integrates survival model outputs into a gradient-boosting framework to detect subtle, long-term patient risk patterns from medical history events alone, without requiring complex medical data.

Result: On 2.5M patients: 4-10x higher detection rate at identical screening volumes (AP 0.228 vs 0.193 baseline). In prospective pilot: +91% cancer detection rate, +36% population coverage. Processes 1M patients in under 3 hours on standard hardware.

Conclusion: Can-SAVE achieves nationally significant cancer detection improvements while adhering to real-world healthcare constraints, offering immediate clinical utility and a replicable framework for population-wide screening.

Abstract: Conventional medical cancer screening methods are costly, labor-intensive, and extremely difficult to scale. Although AI can improve cancer detection, most systems rely on complex or specialized medical data, making them impractical for large-scale screening. We introduce Can-SAVE, a lightweight AI system that ranks population-wide cancer risks solely based on medical history events. By integrating survival model outputs into a gradient-boosting framework, our approach detects subtle, long-term patient risk patterns - often well before clinical symptoms manifest. Can-SAVE was rigorously evaluated on a real-world dataset of 2.5 million adults spanning five Russian regions, marking the study as one of the largest and most comprehensive deployments of AI-driven cancer risk assessment. In a retrospective oncologist-supervised study over 1.9M patients, Can-SAVE achieves a 4-10x higher detection rate at identical screening volumes and an Average Precision (AP) of 0.228 vs. 0.193 for the best baseline (LoRA-tuned Qwen3-Embeddings via DeepSeek-R1 summarization). In a year-long prospective pilot (426K patients), our method almost doubled the cancer detection rate (+91%) and increased population coverage by 36% over the national screening protocol. The system demonstrates practical scalability: a city-wide population of 1 million patients can be processed in under three hours using standard hardware, enabling seamless clinical integration. This work proves that Can-SAVE achieves nationally significant cancer detection improvements while adhering to real-world public healthcare constraints, offering immediate clinical utility and a replicable framework for population-wide screening. Code for training and feature engineering is available at https://github.com/sb-ai-lab/Can-SAVE.

[403] Gaussian and Non-Gaussian Universality of Data Augmentation

Kevin Han Huang, Peter Orbanz, Morgane Austern

Main category: cs.LG

TL;DR: Data augmentation’s effects on estimation uncertainty are complex - it can increase or decrease variance, act as regularization in some cases but not others, and shift double-descent peaks, depending on data distribution, estimator properties, and sample/dimension relationships.

Details

Motivation: To quantify how data augmentation affects the variance and limiting distribution of estimates, and to understand when it acts as a regularizer versus when it fails, addressing observations and unexpected findings in machine learning practice.

Method: Develop an adaptation of Lindeberg’s technique for block dependence to establish universality results, analyze specific models in detail, and quantify effects through simple surrogates.

Result: Data augmentation may increase rather than decrease estimation uncertainty, can act as a regularizer but fails in certain high-dimensional problems, and may shift double-descent peaks of empirical risk.

Conclusion: Properties attributed to data augmentation are not universally true or false but depend on a combination of factors: data distribution, estimator properties, and interplay of sample size, number of augmentations, and dimension.

Abstract: We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties data augmentation has been attributed with are not either true or false, but rather depend on a combination of factors – notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As our main theoretical tool, we develop an adaptation of Lindeberg’s technique for block dependence. The resulting universality regime may be Gaussian or non-Gaussian.

[404] PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs

Max Zimmer, Megi Andoni, Christoph Spiegel, Sebastian Pokutta

Main category: cs.LG

TL;DR: Pruning LLMs only requires retraining 0.01-0.05% of parameters to recover performance, enabling 30B model retraining on a single GPU in minutes.

Details

Motivation: Traditional pruning methods require costly full retraining, which is infeasible for large language models due to memory and compute constraints. The study challenges the need to retrain all parameters after pruning.

Method: Update only a small subset of highly expressive parameters (0.01-0.05%) after pruning. Introduce two novel LoRA variants that allow merging adapters back without compromising sparsity. Apply methods for memory-efficient layer-wise reconstruction to enhance retraining-free methods like Wanda and SparseGPT.

Result: Retraining just 0.01-0.05% of parameters in GPT architectures matches full retraining performance across various sparsity levels. Enables retraining of 30B parameter models on a single GPU in minutes. Novel LoRA variants bridge gap to full retraining in high sparsity regimes.

Conclusion: Updating a small subset of expressive parameters suffices for performance recovery after pruning, presenting a promising alternative to full retraining for large language models.

Abstract: Neural Networks can be effectively compressed through pruning, significantly reducing storage and compute demands while maintaining predictive performance. Simple yet effective methods like magnitude pruning remove less important parameters and typically require a costly retraining procedure to restore performance. However, with the rise of LLMs, full retraining has become infeasible due to memory and compute constraints. This study challenges the practice of retraining all parameters by showing that updating a small subset of highly expressive parameters can suffice to recover or even enhance performance after pruning. Surprisingly, retraining just 0.01%-0.05% of the parameters in GPT-architectures can match the performance of full retraining across various sparsity levels, significantly reducing compute and memory requirements, and enabling retraining of models with up to 30 billion parameters on a single GPU in minutes. To bridge the gap to full retraining in the high sparsity regime, we introduce two novel LoRA variants that, unlike standard LoRA, allow merging adapters back without compromising sparsity. Going a step further, we show that these methods can be applied for memory-efficient layer-wise reconstruction, significantly enhancing state-of-the-art retraining-free methods like Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023). Our findings present a promising alternative to avoiding retraining.

[405] Mixture of Experts Softens the Curse of Dimensionality in Operator Learning

Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop

Main category: cs.LG

TL;DR: Mixture-of-Experts Neural Operators (MoNOs) with tree-based routing can universally approximate Lipschitz nonlinear operators between L² spaces, with each expert NO having depth/width/rank scaling as O(ε⁻¹), enabling distributed computation on standard hardware.

Details

Motivation: To understand the approximation-theoretic implications of distributing operator learning complexity across many small neural operators via mixture-of-experts architectures with tree-based routing, and to analyze how this decomposition affects approximation power, sample complexity, and stability.

Method: Propose Mixture of Neural Operators (MoNOs) where a single large neural operator’s complexity is distributed across many small neural operators, with each input routed to exactly one NO via a decision tree. Analyze tree-based routing and expert decomposition mathematically.

Result: Prove a distributed universal approximation theorem: any Lipschitz nonlinear operator between L²([0,1]ᵈ) spaces can be uniformly approximated over the Sobolev unit ball to arbitrary accuracy ε by an MoNO, with each expert NO having depth, width, and rank scaling as O(ε⁻¹).

Conclusion: MoNOs provide a practical distributed architecture for operator learning where each expert remains small enough for standard hardware memory, while achieving universal approximation with quantifiable complexity bounds. The analysis also yields new quantitative approximation rates for classical neural operators.

Abstract: We study the approximation-theoretic implications of mixture-of-experts architectures for operator learning, where the complexity of a single large neural operator is distributed across many small neural operators (NOs), and each input is routed to exactly one NO via a decision tree. We analyze how this tree-based routing and expert decomposition affect approximation power, sample complexity, and stability. Our main result is a distributed universal approximation theorem for mixture of neural operators (MoNOs): any Lipschitz nonlinear operator between $L^2([0,1]^d)$ spaces can be uniformly approximated over the Sobolev unit ball to arbitrary accuracy $\varepsilon>0$ by an MoNO, where each expert NO has a depth, width, and rank scaling as $\mathcal{O}(\varepsilon^{-1})$. Although the number of experts may grow with accuracy, each NO remains small, enough to fit within active memory of standard hardware for reasonable accuracy levels. Our analysis also yields new quantitative approximation rates for classical NOs approximating uniformly continuous nonlinear operators uniformly on compact subsets of $L^2([0,1]^d)$.

[406] Spectral Convolutional Conditional Neural Processes

Peiman Mohseni, Nick Duffield

Main category: cs.LG

TL;DR: Spectral ConvCNP (SConvCNP) improves neural processes by using global convolution in frequency domain to capture long-range dependencies, overcoming CNN limitations while maintaining translation equivariance.

Details

Motivation: Existing neural processes have limitations: early approaches use finite-dimensional embeddings that mismatch infinite-dimensional generative processes, while ConvCNPs with local CNN kernels struggle with long-range dependencies without high computational costs.

Method: Proposes Spectral ConvCNP (SConvCNP) that performs global convolution in frequency domain, inspired by Fourier neural operators. Directly parameterizes convolution kernels in Fourier domain to leverage compact global representations of natural signals.

Result: Validated effectiveness on both synthetic and real-world datasets, demonstrating how operator learning ideas can advance neural process capabilities.

Conclusion: SConvCNP successfully addresses long-range dependency limitations of ConvCNPs through global frequency-domain convolution, bridging operator learning with neural processes for improved performance.

Abstract: Neural processes (NPs) are probabilistic meta-learning models that map sets of observations to posterior predictive distributions, enabling inference at arbitrary domain points. Their capacity to handle variable-sized collections of unstructured observations, combined with simple maximum-likelihood training and uncertainty-aware predictions, makes them well-suited for modeling data over continuous domains. Since their introduction, several variants have been proposed. Early approaches typically represented observed data using finite-dimensional summary embeddings obtained through aggregation schemes such as mean pooling. However, this strategy fundamentally mismatches the infinite-dimensional nature of the generative processes that NPs aim to capture. Convolutional conditional neural processes (ConvCNPs) address this limitation by constructing infinite-dimensional functional embeddings processed through convolutional neural networks (CNNs) to enforce translation equivariance. Yet CNNs with local spatial kernels struggle to capture long-range dependencies without resorting to large kernels, which impose significant computational costs. To overcome this limitation, we propose the Spectral ConvCNP (SConvCNP), which performs global convolution in the frequency domain. Inspired by Fourier neural operators (FNOs) for learning solution operators of partial differential equations (PDEs), our approach directly parameterizes convolution kernels in the frequency domain, leveraging the relatively compact yet global Fourier representation of many natural signals. We validate the effectiveness of SConvCNP on both synthetic and real-world datasets, demonstrating how ideas from operator learning can advance the capabilities of NPs.

[407] XXLTraffic: Expanding and Extremely Long Traffic forecasting beyond test adaptation

Du Yin, Hao Xue, Arian Prabowo, Shuang Ao, Flora Salim

Main category: cs.LG

TL;DR: XXLTraffic is a large-scale public traffic dataset with the longest timespan from Los Angeles and New South Wales, designed to address limitations of existing datasets and support research in extremely long forecasting beyond test adaptation.

Details

Motivation: Current public traffic datasets have limitations in reflecting real-world distribution shifts, including evolving infrastructures, varying temporal distributions, and long temporal gaps from sensor downtimes or traffic pattern changes, which restricts practical applicability of existing forecasting models.

Method: The authors present XXLTraffic, the largest available public traffic dataset collected from Los Angeles, USA and New South Wales, Australia. They curate benchmark settings including typical time-series forecasting with hourly/daily aggregated data, plus novel configurations introducing gaps and down-sampled training sizes to simulate practical constraints.

Result: XXLTraffic becomes the largest public traffic dataset with the longest timespan, providing a robust platform for developing and evaluating models for extremely long forecasting problems beyond test adaptation.

Conclusion: XXLTraffic addresses limitations of existing datasets, offers fresh perspectives for time-series and traffic forecasting communities, supplements existing spatio-temporal resources, and enables new research directions in extremely long forecasting beyond test adaptation.

Abstract: Traffic forecasting is crucial for smart cities and intelligent transportation initiatives, where deep learning has made significant progress in modeling complex spatio-temporal patterns in recent years. However, current public datasets have limitations in reflecting the distribution shift nature of real-world scenarios, characterized by continuously evolving infrastructures, varying temporal distributions, and long temporal gaps due to sensor downtimes or changes in traffic patterns. These limitations inevitably restrict the practical applicability of existing traffic forecasting datasets. To bridge this gap, we present XXLTraffic, largest available public traffic dataset with the longest timespan collected from Los Angeles, USA, and New South Wales, Australia, curated to support research in extremely long forecasting beyond test adaptation. Our benchmark includes both typical time-series forecasting settings with hourly and daily aggregated data and novel configurations that introduce gaps and down-sample the training size to better simulate practical constraints. We anticipate the new XXLTraffic will provide a fresh perspective for the time-series and traffic forecasting communities. It would also offer a robust platform for developing and evaluating models designed to tackle the extremely long forecasting problems beyond test adaptation. Our dataset supplements existing spatio-temporal data resources and leads to new research directions in this domain.

[408] Minimax Hypothesis Testing for the Bradley-Terry-Luce Model

Anuran Makur, Japneet Singh

Main category: cs.LG

TL;DR: The paper develops a hypothesis test to determine if pairwise comparison data follows the BTL model, establishing minimax bounds showing critical threshold scales as Θ((nk)^{-1/2}) for complete graphs.

Details

Motivation: The BTL model is widely used for ranking from pairwise comparisons, but there's a need for statistical tests to verify whether real-world data actually follows this model. Existing methods lack rigorous hypothesis testing frameworks for model validation.

Method: Develops a minimax hypothesis testing framework for BTL model validation. Derives upper bounds for general observation graphs and lower bounds for complete graphs. Uses approximation of separation distance between pairwise comparison models and BTL models as test statistic. Analyzes performance through type I/II error bounds and considers graph properties like expansion and bounded principal ratio.

Result: Establishes that for complete induced graphs, the critical threshold scales as Θ((nk)^{-1/2}) in minimax sense. Provides theoretical bounds on error probabilities and validates results through synthetic and real-world experiments. Proposes data-driven permutation testing for threshold determination.

Conclusion: The paper provides a rigorous statistical testing framework for BTL model validation with theoretical guarantees on performance. The scaling results and error bounds offer practical guidance for determining when pairwise comparison data sufficiently follows the BTL model.

Abstract: The Bradley-Terry-Luce (BTL) model is one of the most widely used models for ranking a collection of items or agents based on pairwise comparisons among them. Given $n$ agents, the BTL model endows each agent $i$ with a latent skill score $α_i > 0$ and posits that the probability that agent $i$ is preferred over agent $j$ is $α_i/(α_i + α_j)$. In this work, our objective is to formulate a hypothesis test that determines whether a given pairwise comparison dataset, with $k$ comparisons per pair of agents, originates from an underlying BTL model. We formalize this testing problem in the minimax sense and define the critical threshold of the problem. We then establish upper bounds on the critical threshold for general induced observation graphs (satisfying mild assumptions) and develop lower bounds for complete induced graphs. Our bounds demonstrate that for complete induced graphs, the critical threshold scales as $Θ((nk)^{-1/2})$ in a minimax sense. In particular, our test statistic for the upper bounds is based on a new approximation we derive for the separation distance between general pairwise comparison models and the class of BTL models. To further assess the performance of our statistical test, we prove upper bounds on the type I and type II probabilities of error. Much of our analysis is conducted within the context of a fixed observation graph structure, where the graph possesses certain ``nice’’ properties, such as expansion and bounded principal ratio. Additionally, we derive several auxiliary results, such as bounds on principal ratios of graphs, $\ell^2$-bounds on BTL parameter estimation under model mismatch, stability of rankings under the BTL model, etc. We validate our theoretical results through experiments on synthetic and real-world datasets and propose a data-driven permutation testing approach to determine test thresholds.

[409] WARPD: World model Assisted Reactive Policy Diffusion

Shashank Hegde, Satyajeet Das, Gautam Salhotra, Gaurav S. Sukhatme

Main category: cs.LG

TL;DR: WARPD introduces a diffusion-based method that generates neural policy weights directly instead of trajectories, achieving longer action horizons with robustness to perturbations while drastically reducing inference costs compared to Diffusion Policy.

Details

Motivation: Current diffusion-based imitation learning methods for robotics face two key limitations: 1) large model sizes and slow inference that limit high-frequency control, and 2) Diffusion Policy's trade-off between performance and action horizon where fewer diffusion queries lead to larger trajectory chunks that accumulate tracking errors.

Method: WARPD (World model Assisted Reactive Policy Diffusion) generates closed-loop policies by learning behavioral distributions in parameter space rather than trajectory space. Instead of producing open-loop trajectories, it directly generates weights for neural policies, enabling extended action horizons with robustness to perturbations.

Result: Empirically, WARPD outperforms Diffusion Policy in long-horizon and perturbed environments, achieves multitask performance on par with DP, while requiring only ~1/45th of the inference-time FLOPs per step.

Conclusion: WARPD addresses key limitations of trajectory-based diffusion policies by operating in parameter space, offering both improved robustness and significantly reduced computational costs for robotic control tasks.

Abstract: With the increasing availability of open-source robotic data, imitation learning has become a promising approach for both manipulation and locomotion. Diffusion models are now widely used to train large, generalized policies that predict controls or trajectories, leveraging their ability to model multimodal action distributions. However, this generality comes at the cost of larger model sizes and slower inference, an acute limitation for robotic tasks requiring high control frequencies. Moreover, Diffusion Policy (DP), a popular trajectory-generation approach, suffers from a trade-off between performance and action horizon: fewer diffusion queries lead to larger trajectory chunks, which in turn accumulate tracking errors. To overcome these challenges, we introduce WARPD (World model Assisted Reactive Policy Diffusion), a method that generates closed-loop policies (weights for neural policies) directly, instead of open-loop trajectories. By learning behavioral distributions in parameter space rather than trajectory space, WARPD offers two major advantages: (1) extended action horizons with robustness to perturbations, while maintaining high task performance, and (2) significantly reduced inference costs. Empirically, WARPD outperforms DP in long-horizon and perturbed environments, and achieves multitask performance on par with DP while requiring only ~ 1/45th of the inference-time FLOPs per step.

[410] LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster

Main category: cs.LG

TL;DR: LSHBloom is a scalable document deduplication method that combines MinhashLSH with Bloom filters, achieving state-of-the-art performance with 12× faster runtime and 18× less disk space than MinhashLSH.

Details

Motivation: LLM training requires deduplication of internet-scale text datasets to reduce training costs and prevent undesirable model properties like memorization and evaluation cheating. Current deduplication methods are either unreliable or too expensive in runtime and memory.

Method: LSHBloom extends MinhashLSH by replacing the expensive LSHIndex with lightweight Bloom filters, maintaining the same deduplication performance while dramatically reducing resource requirements.

Result: LSHBloom achieves state-of-the-art deduplication performance comparable to MinhashLSH with only marginal false positive increase (near zero), while being 12× faster and using 18× less disk space on peS2o dataset. The advantages scale to billions of documents.

Conclusion: LSHBloom enables high-quality document deduplication at internet-scale, making sophisticated deduplication techniques tractable for LLM training datasets where only heuristic solutions were previously feasible.

Abstract: Contemporary large language model (LLM) training pipelines require the assembly of internet-scale databases full of text data from a variety of sources (e.g., web, academic, and publishers). Preprocessing these datasets via deduplication – detecting and eliminating additional instances of the same content – is a major focus for assembling and curating training datasets for LLMs. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Unfortunately, contemporary approaches to document-level deduplication are either unreliable at accurately identifying duplicate documents or extremely expensive in terms of both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same state-of-the-art deduplication performance as MinhashLSH, with only a marginal increase in false positives (near zero in our experiments), while boasting competitive runtime (12$\times$ faster than MinhashLSH on peS2o) and, crucially, using 18$\times$ less disk space than MinhashLSH (as measured on peS2o). Based on extrapolation, we show that this advantage in space and runtime remains even at the extreme scale of several billion documents. LSHBloom allows practitioners to access the deduplication quality of MinHashLSH at scales that are normally only tractable for less sophisticated, heuristic solutions. As a result, LSHBloom promises to enable scaling high-quality document deduplication to internet-scale text datasets.

[411] FedSub: Introducing Class-aware Subnetworks Fusion to Enhance Personalized Federated Learning

Mattia Giovanni Campana, Franca Delmastro

Main category: cs.LG

TL;DR: FedSub introduces class-aware model updates using data prototypes and model subnetworks fusion to enhance personalization in federated learning with non-IID data.

Details

Motivation: Existing personalized federated learning methods struggle to balance personalization and generalization, often oversimplifying client similarities or relying too heavily on global models, especially in scenarios with high data heterogeneity.

Method: FedSub uses data prototypes as compact representations of client data for each class, clusters them on the server to capture label-specific similarities, and fuses model subnetworks (most relevant components for each class) based on identified clusters to generate fine-grained, class-specific personalized model updates.

Result: Experimental results in three real-world scenarios with high data heterogeneity in human activity recognition and mobile health applications demonstrate FedSub’s effectiveness over state-of-the-art methods, achieving fast convergence and high classification performance.

Conclusion: FedSub successfully addresses the personalization-generalization trade-off in federated learning by leveraging class-aware prototypes and subnetworks fusion, providing an effective solution for non-IID data scenarios.

Abstract: Personalized Federated Learning aims at addressing the challenges of non-IID data in collaborative model training. However, existing methods struggle to balance personalization and generalization, often oversimplifying client similarities or relying too heavily on global models. In this paper, we propose FedSub, a novel approach that introduces class-aware model updates based on data prototypes and model subnetworks fusion to enhance personalization. Prototypes serve as compact representations of client data for each class, clustered on the server to capture label-specific similarities among the clients. Meanwhile, model subnetworks encapsulate the most relevant components to process each class and they are then fused on the server based on the identified clusters to generate fine-grained, class-specific, and highly personalized model updates for each client. Experimental results in three real-world scenarios with high data heterogeneity in human activity recognition and mobile health applications demonstrate the effectiveness of FedSub with respect to state-of-the-art methods to achieve fast convergence and high classification performance.

[412] Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families

Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Yuekai Sun, Mikhail Yurochkin

Main category: cs.LG

TL;DR: SSLaws (Sloth) is a novel scaling law that uses benchmark data and latent skills to predict LLM performance across different model families without needing to train multiple models per family.

Details

Motivation: Existing scaling laws fail to generalize across different LLM families due to variations in training configurations and data processing. Training family-specific scaling laws requires expensive training of multiple models for each family.

Method: SSLaws assumes LLM performance is driven by low-dimensional latent skills (like reasoning and instruction following) influenced by computational resources. It leverages publicly available benchmark data and exploits correlations across benchmarks to identify these latent skills and their scaling behaviors.

Result: Theoretical results show parameter identification is possible. Empirical evaluations on 12 benchmarks from Open LLM Leaderboard v1/v2 demonstrate accurate LLM performance predictions and provide insights into scaling behaviors for complex tasks, test-time compute, and compute-optimal skill scaling.

Conclusion: SSLaws offers a practical solution for predicting LLM performance across different model families using only benchmark data, providing both accurate predictions and interpretable insights into how different skills scale with computational resources.

Abstract: Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens, but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks, from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance accurately and offers insights into scaling behaviors for complex downstream tasks, increased test-time compute, and compute-optimal scaling of skills.

[413] CSI-BERT2: A BERT-inspired Framework for Efficient CSI Prediction and Classification in Wireless Communication and Sensing

Zijian Zhao, Fanyi Meng, Zhonghao Lyu, Hang Li, Xiaoyang Li, Guangxu Zhu

Main category: cs.LG

TL;DR: CSI-BERT2 is a unified BERT-based framework for wireless CSI prediction and classification that addresses data scarcity, packet loss, and high-dimensional challenges through two-stage training and architectural improvements.

Details

Motivation: Wireless sensing faces data scarcity and packet loss issues that hinder model training, while wireless communication deals with high-dimensional CSI matrices and short coherent times due to high mobility, creating challenges in CSI estimation.

Method: Proposes CSI-BERT2 built on CSI-BERT with bidirectional self-attention. Uses two-stage training: first unsupervised MLM for general feature extraction from scarce data, then fine-tuning for downstream tasks. Extends MLM to MPM for prediction tasks. Adds architectural improvements: adaptive re-weighting layer (ARL) for subcarrier representation and MLP-based temporal embedding to mitigate temporal information loss.

Result: Achieves state-of-the-art performance on real-world and simulated datasets. Generalizes effectively across varying sampling rates and robustly handles discontinuous CSI sequences from packet loss, addressing challenges conventional methods fail to solve.

Conclusion: CSI-BERT2 provides an effective unified framework for CSI prediction and classification that overcomes key challenges in wireless systems, with demonstrated robustness to data scarcity, packet loss, and varying sampling conditions.

Abstract: Channel state information (CSI) is a fundamental component in both wireless communication and sensing systems, enabling critical functions such as radio resource optimization and environmental perception. In wireless sensing, data scarcity and packet loss hinder efficient model training, while in wireless communication, high-dimensional CSI matrices and short coherent times caused by high mobility present challenges in CSI estimation. To address these issues, we propose a unified framework named CSI-BERT2 for CSI prediction and classification tasks, built on CSI-BERT, which adapts BERT to capture the complex relationships among CSI sequences through a bidirectional self-attention mechanism. We introduce a two-stage training method that first uses a mask language model (MLM) to enable the model to learn general feature extraction from scarce datasets in an unsupervised manner, followed by fine-tuning for specific downstream tasks. Specifically, we extend MLM into a mask prediction model (MPM), which efficiently addresses the CSI prediction task. To further enhance the representation capacity of CSI data, we modify the structure of the original CSI-BERT. We introduce an adaptive re-weighting layer (ARL) to enhance subcarrier representation and a multi-layer perceptron (MLP)-based temporal embedding module to mitigate temporal information loss problem inherent in the original Transformer. Extensive experiments on both real-world collected and simulated datasets demonstrate that CSI-BERT2 achieves state-of-the-art performance across all tasks. Our results further show that CSI-BERT2 generalizes effectively across varying sampling rates and robustly handles discontinuous CSI sequences caused by packet loss-challenges that conventional methods fail to address. The dataset and code are publicly available at https://github.com/RS2002/CSI-BERT2 .

[414] Machine Unlearning via Information Theoretic Regularization

Shizhou Xu, Thomas Strohmer

Main category: cs.LG

TL;DR: The paper introduces a unified information-theoretic framework for both data point unlearning and feature unlearning, providing mathematical guarantees and practical solutions with connections to neuroscience, optimal transport, and extremal sigma algebras.

Details

Motivation: The need to effectively remove undesirable information (specific features or individual data points) from learning outcomes while minimizing utility loss and ensuring rigorous guarantees, addressing practical machine unlearning challenges in AI applications.

Method: Proposes a unified mathematical framework based on information-theoretic regularization. Introduces the Marginal Unlearning Principle for data point unlearning with formal definitions and guarantees. For feature unlearning, applies the framework to deep learning with arbitrary training objectives using information-theoretic regularization.

Result: Provides provable guarantees on sufficiency and necessity of marginal unlearning to existing approximate unlearning definitions. Offers unified analytic solution to optimal feature unlearning with various information-theoretic objectives. Reveals connections between machine unlearning, information theory, optimal transport, and extremal sigma algebras, supported by numerical simulations.

Conclusion: The proposed information-theoretic framework provides a practical, adaptable solution for machine unlearning with rigorous mathematical foundations, bridging theoretical guarantees with real-world AI applications across both data point and feature unlearning scenarios.

Abstract: How can we effectively remove or ‘‘unlearn’’ undesirable information, such as specific features or the influence of individual data points, from a learning outcome while minimizing utility loss and ensuring rigorous guarantees? We introduce a unified mathematical framework based on information-theoretic regularization to address both data point unlearning and feature unlearning. For data point unlearning, we introduce the $\textit{Marginal Unlearning Principle}$, an auditable and provable framework inspired by memory suppression studies in neuroscience. Moreover, we provide formal information-theoretic unlearning definition based on the proposed principle, named marginal unlearning, and provable guarantees on sufficiency and necessity of marginal unlearning to the existing approximate unlearning definitions. We then show the proposed framework provide natural solution to the marginal unlearning problems. For feature unlearning, the framework applies to deep learning with arbitrary training objectives. By combining flexibility in learning objectives with simplicity in regularization design, our approach is highly adaptable and practical for a wide range of machine learning and AI applications. From a mathematical perspective, we provide an unified analytic solution to the optimal feature unlearning problem with a variety of information-theoretic training objectives. Our theoretical analysis reveals intriguing connections between machine unlearning, information theory, optimal transport, and extremal sigma algebras. Numerical simulations support our theoretical finding.

[415] EquiTabPFN: A Target-Permutation Equivariant Prior Fitted Networks

Michael Arbel, David Salinas, Frank Hutter

Main category: cs.LG

TL;DR: A target-equivariant architecture for tabular data models that eliminates the “equivariance gap” caused by permutation sensitivity in target dimensions, improving performance on datasets with more classes than seen during pre-training.

Details

Motivation: Existing tabular data models like TabPFN have a fixed number of target dimensions and lack target equivariance, causing predictions to change when target dimension orderings are permuted. This creates an "equivariance gap" that introduces instability and requires costly ensembling strategies.

Method: Design a fully target-equivariant architecture using equivariant encoders, decoders, and a bi-attention mechanism to ensure permutation invariance in target dimensions.

Result: Empirical evaluation shows the model matches or surpasses existing methods on datasets with more classes than seen during pre-training, while incurring lower computational overhead.

Conclusion: The proposed target-equivariant architecture successfully eliminates the equivariance gap, providing more stable predictions and better performance on challenging classification tasks without the computational cost of ensembling.

Abstract: Recent foundational models for tabular data, such as TabPFN, excel at adapting to new tasks via in-context learning, but remain constrained to a fixed, pre-defined number of target dimensions-often necessitating costly ensembling strategies. We trace this constraint to a deeper architectural shortcoming: these models lack target equivariance, so that permuting target dimension orderings alters their predictions. This deficiency gives rise to an irreducible “equivariance gap”, an error term that introduces instability in predictions. We eliminate this gap by designing a fully target-equivariant architecture-ensuring permutation invariance via equivariant encoders, decoders, and a bi-attention mechanism. Empirical evaluation on standard classification benchmarks shows that, on datasets with more classes than those seen during pre-training, our model matches or surpasses existing methods while incurring lower computational overhead.

[416] MaxSup: Overcoming Representation Collapse in Label Smoothing

Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper

Main category: cs.LG

TL;DR: MaxSup replaces Label Smoothing with uniform regularization on top-1 logits to prevent overconfidence in misclassifications and restore intra-class diversity.

Details

Motivation: Label Smoothing has two critical issues: it causes overconfidence in misclassified samples and compacts feature representations into overly tight clusters, diluting intra-class diversity. The authors aim to understand the root cause and develop a better alternative.

Method: The authors analytically decompose LS loss to reveal two key terms: a regularization term for correct predictions and an error-amplification term for misclassifications. They propose Max Suppression (MaxSup) which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit.

Result: Through feature-space analyses, MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm MaxSup is a more robust alternative to LS.

Conclusion: MaxSup effectively addresses the shortcomings of Label Smoothing by preventing overconfidence in misclassifications and maintaining better feature representation diversity, making it a superior regularization technique.

Abstract: Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization

[417] Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Main category: cs.LG

TL;DR: SAEs are not universal concept detectors - their architecture determines what concepts they can find, not just reveal what exists. Different SAEs expose different concepts based on their structural assumptions.

Details

Motivation: To understand whether SAEs truly uncover all concepts a model relies on, or if they are inherently biased toward certain kinds of concepts. Current SAE approaches may miss important concepts due to architectural limitations.

Method: 1) Introduce unified framework recasting SAEs as bilevel optimization problems; 2) Evaluate SAEs across spectrum: controlled toy models, semi-synthetic experiments on real activations, large-scale naturalistic datasets; 3) Examine two key concept properties: heterogeneity in intrinsic dimensionality and nonlinear separability; 4) Design new SAE that explicitly incorporates both properties.

Result: SAEs fail to recover concepts when ignoring heterogeneity in dimensionality and nonlinear separability. The new SAE incorporating both properties discovers previously hidden concepts. Different SAE architectures expose entirely different concepts - they’re not interchangeable.

Conclusion: There is no universal SAE - architecture determines what can be seen. SAEs don’t just reveal concepts, they determine what concepts can be detected at all. Need architecture-specific choices in model interpretability based on concept properties.

Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable – switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscores the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts – it determines what can be seen at all.

[418] Mixed precision accumulation for neural network inference guided by componentwise forward error analysis

El-Mehdi El Arar, Silviu-Ioan Filip, Theo Mary, Elisa Riccietti

Main category: cs.LG

TL;DR: Proposes mixed precision accumulation strategy for neural network inference using componentwise error analysis to selectively recompute high-error components in higher precision.

Details

Motivation: Current uniform precision accumulation in neural network inference is inefficient; error varies significantly across components, creating opportunity for mixed precision strategies that can reduce computational cost while maintaining accuracy.

Method: Develops componentwise forward error analysis showing output error is proportional to condition numbers of weight-input inner product and activation function. Proposes algorithm that first computes all components in low precision, estimates condition numbers, then selectively recomputes high-error components in higher precision.

Result: Experimental tests on various networks and datasets show the algorithm significantly improves cost-accuracy tradeoff compared to uniform precision accumulation baselines.

Conclusion: Componentwise error analysis enables effective mixed precision accumulation strategies for neural network inference, allowing selective higher precision computation only where needed based on condition numbers, achieving better computational efficiency.

Abstract: This work proposes a mathematically founded mixed precision accumulation strategy for the inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks. Specifically, our analysis shows that the error in each component of the output of a linear layer is proportional to the condition number of the inner product between the weights and the input, multiplied by the condition number of the activation function. These condition numbers can vary widely from one component to the other, thus creating a significant opportunity to introduce mixed precision: each component should be accumulated in a precision inversely proportional to the product of these condition numbers. We propose a numerical algorithm that exploits this observation: it first computes all components in low precision, uses this output to estimate the condition numbers, and recomputes in higher precision only the components associated with large condition numbers. We test our algorithm on various networks and datasets and confirm experimentally that it can significantly improve the cost–accuracy tradeoff compared with uniform precision accumulation baselines.

[419] Maternal and Fetal Health Status Assessment by Using Machine Learning on Optical 3D Body Scans

Ruting Cheng, Yijiang Zheng, Boyuan Feng, Chuhui Qiu, Zhuoxin Long, Joaquin A. Calderon, Xiaoke Zhang, Jaclyn M. Phillips, James K. Hahn

Main category: cs.LG

TL;DR: 3D body scans during 18-24 weeks can predict pregnancy complications and estimate fetal weight with high accuracy using a novel dual-stream algorithm.

Details

Motivation: Current pregnancy monitoring methods like ultrasound are accurate but costly and inconvenient. There's a need for more accessible, convenient ways to monitor maternal and fetal health using telehealth and body shape information.

Method: Developed a novel algorithm with two parallel streams: one for supervised learning to extract sequential abdominal circumference information, and another for unsupervised learning to extract global shape descriptors, plus a branch for demographic data.

Result: The algorithm achieved prediction accuracies exceeding 88% for preterm labor, GDM, and gestational hypertension. Fetal weight estimation accuracy was 76.74% within 10% error margin, outperforming conventional anthropometric methods by 22.22%.

Conclusion: 3D body shape analysis shows significant potential as a convenient, accessible tool for predicting adverse pregnancy outcomes and estimating clinical parameters, offering advantages over traditional methods.

Abstract: Monitoring maternal and fetal health during pregnancy is crucial for preventing adverse outcomes. While tests such as ultrasound scans offer high accuracy, they can be costly and inconvenient. Telehealth and more accessible body shape information provide pregnant women with a convenient way to monitor their health. This study explores the potential of 3D body scan data, captured during the 18-24 gestational weeks, to predict adverse pregnancy outcomes and estimate clinical parameters. We developed a novel algorithm with two parallel streams which are used for extract body shape features: one for supervised learning to extract sequential abdominal circumference information, and another for unsupervised learning to extract global shape descriptors, alongside a branch for demographic data. Our results indicate that 3D body shape can assist in predicting preterm labor, gestational diabetes mellitus (GDM), gestational hypertension (GH), and in estimating fetal weight. Compared to other machine learning models, our algorithm achieved the best performance, with prediction accuracies exceeding 88% and fetal weight estimation accuracy of 76.74% within a 10% error margin, outperforming conventional anthropometric methods by 22.22%.

[420] VeLU: Variance-enhanced Learning Unit for Deep Neural Networks

Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicolè, Stefano Ghidoni, Nassir Navab

Main category: cs.LG

TL;DR: VeLU is a novel activation function that uses variance-aware transformations with ArcTan-ArcSin operations and Wasserstein-2 regularization to adapt to activation statistics, outperforming ReLU, Swish, and GELU on vision benchmarks.

Details

Motivation: Current activation functions like ReLU suffer from gradient sparsity and dead-neuron issues, while smooth alternatives like Swish and GELU lack adaptivity to input statistics. There's a need for activation functions that can dynamically respond to local activation distributions without adding computational overhead.

Method: VeLU combines ArcTan-ArcSin transformations with adaptive scaling and Wasserstein-2 regularization (Optimal Transport). It modulates responses based on local activation variance, mitigates internal covariate shift at the activation level, and improves training stability without adding learnable parameters or architectural complexity.

Result: VeLU outperforms ReLU, ReLU6, Swish, and GELU on 12 vision benchmarks across six different deep neural networks, demonstrating superior performance without additional parameters or architectural overhead.

Conclusion: VeLU provides a principled, variance-aware activation function that addresses limitations of existing approaches by adapting to local activation statistics through optimal transport regularization, offering improved performance and stability for deep neural networks.

Abstract: Activation functions play a critical role in deep neural networks by shaping gradient flow, optimization stability, and generalization. While ReLU remains widely used due to its simplicity, it suffers from gradient sparsity and dead-neuron issues and offers no adaptivity to input statistics. Smooth alternatives such as Swish and GELU improve gradient propagation but still apply a fixed transformation regardless of the activation distribution. In this paper, we propose VeLU, a Variance-enhanced Learning Unit that introduces variance-aware and distributionally aligned nonlinearity through a principled combination of ArcTan-ArcSin transformations, adaptive scaling, and Wasserstein-2 regularization (Optimal Transport). This design enables VeLU to modulate its response based on local activation variance, mitigate internal covariate shift at the activation level, and improve training stability without adding learnable parameters or architectural overhead. Extensive experiments across six deep neural networks show that VeLU outperforms ReLU, ReLU6, Swish, and GELU on 12 vision benchmarks. The implementation of VeLU is publicly available in GitHub.

[421] Soft-Label Caching and Sharpening for Communication-Efficient Federated Distillation

Kitsuya Azuma, Takayuki Nishio, Yuichi Kitagawa, Wakako Nakano, Takahito Tanimura

Main category: cs.LG

TL;DR: SCARLET is a federated learning framework that reduces communication overhead by 50% through synchronized soft-label caching and enhanced entropy reduction aggregation, outperforming existing distillation-based FL methods.

Details

Motivation: Conventional FL suffers from high communication overhead and limited model heterogeneity due to frequent parameter sharing. Distillation-based FL addresses this by sharing predictions instead of parameters, but still involves redundant transmissions across rounds, reducing efficiency.

Method: SCARLET integrates synchronized soft-label caching to reuse cached predictions and an enhanced Entropy Reduction Aggregation (Enhanced ERA) mechanism that resolves instability in conventional temperature-based aggregation for robust performance across diverse clients.

Result: SCARLET achieves up to 50% reduction in communication costs compared to existing methods while maintaining competitive accuracy, consistently outperforming state-of-the-art distillation-based FL methods in both accuracy and communication efficiency.

Conclusion: SCARLET provides an efficient and robust federated learning framework that significantly reduces communication overhead while maintaining high accuracy, making it suitable for diverse client scenarios with model heterogeneity.

Abstract: Federated Learning (FL) enables collaborative model training across decentralized clients, enhancing privacy by keeping data local. Yet conventional FL, relying on frequent parameter-sharing, suffers from high communication overhead and limited model heterogeneity. Distillation-based FL approaches address these issues by sharing predictions (soft-labels, i.e., normalized probability distributions) instead, but they often involve redundant transmissions across communication rounds, reducing efficiency. We propose SCARLET, a novel framework integrating synchronized soft-label caching and an enhanced Entropy Reduction Aggregation (Enhanced ERA) mechanism. SCARLET minimizes redundant communication by reusing cached soft-labels, achieving up to 50% reduction in communication costs compared to existing methods while maintaining competitive accuracy. Enhanced ERA resolves the fundamental instability of conventional temperature-based aggregation, ensuring robust control and high performance in diverse client scenarios. Experimental evaluations demonstrate that SCARLET consistently outperforms state-of-the-art distillation-based FL methods in terms of accuracy and communication efficiency. The implementation of SCARLET is publicly available at https://github.com/kitsuyaazuma/SCARLET.

[422] Cohort-Based Active Modality Acquisition

Tillmann Rheude, Roland Eils, Benjamin Wild

Main category: cs.LG

TL;DR: CAMA framework for test-time cohort-based active modality acquisition using generative imputation and discriminative modeling to prioritize which samples should get additional costly modalities.

Details

Motivation: Real-world multimodal ML often faces missing modalities due to acquisition costs, requiring strategic prioritization of which samples should receive additional modalities when resources are limited.

Method: Cohort-based Active Modality Acquisition (CAMA) framework using generative imputation and discriminative modeling to estimate expected benefit of acquiring missing modalities based on evaluation metrics.

Result: Imputation-based strategies outperform unimodal, entropy-based, and random selection methods on multimodal datasets with up to 15 modalities, and effectively guide proteomics acquisition in UK Biobank.

Conclusion: CAMA provides effective cohort-level modality acquisition optimization for resource-constrained settings, enabling more efficient use of costly multimodal data acquisition.

Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on multimodal datasets with up to 15 modalities demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of additional modalities for selected samples compared with methods relying solely on unimodal information, entropy-based guidance, or random selection. We showcase the real-world relevance and scalability of our method by demonstrating its ability to effectively guide the costly acquisition of proteomics data for disease prediction in a large prospective cohort, the UK Biobank (UKBB). Our work provides an effective approach for optimizing modality acquisition at the cohort level, enabling more effective use of resources in constrained settings.

[423] How to Learn a Star: Binary Classification with Starshaped Polyhedral Sets

Marie-Charlotte Brandenburg, Katharina Jochemko

Main category: cs.LG

TL;DR: Paper analyzes binary classification using continuous piecewise linear functions with starshaped polyhedral decision boundaries, studying expressivity, VC dimension, and loss landscape geometry for 0/1-loss and log-likelihood loss.

Details

Motivation: To understand the theoretical properties of binary classification models based on continuous piecewise linear functions with starshaped polyhedral decision boundaries, particularly their expressivity and loss landscape structure.

Method: Restricts binary classification to continuous piecewise linear functions whose decision boundaries are starshaped polyhedral sets supported on a fixed polyhedral simplicial fan. Analyzes combinatorial and geometric structure of loss landscapes for 0/1-loss and log-likelihood loss functions.

Result: Provides explicit bounds on VC dimension of this model, describes sublevel sets of discrete loss as chambers in a hyperplane arrangement. For log-likelihood loss, gives sufficient conditions for unique optimum and describes geometry of optimum when varying rate parameter of underlying exponential distribution.

Conclusion: The paper provides theoretical characterization of binary classification with starshaped polyhedral decision boundaries, offering insights into model expressivity, loss landscape structure, and optimization properties for different loss functions.

Abstract: We consider binary classification restricted to a class of continuous piecewise linear functions whose decision boundaries are (possibly nonconvex) starshaped polyhedral sets, supported on a fixed polyhedral simplicial fan. We investigate the expressivity of these function classes and describe the combinatorial and geometric structure of the loss landscape, most prominently the sublevel sets, for two loss-functions: the 0/1-loss (discrete loss) and a log-likelihood loss function. In particular, we give explicit bounds on the VC dimension of this model, and concretely describe the sublevel sets of the discrete loss as chambers in a hyperplane arrangement. For the log-likelihood loss, we give sufficient conditions for the optimum to be unique, and describe the geometry of the optimum when varying the rate parameter of the underlying exponential probability distribution.

[424] Learning to Integrate Diffusion ODEs by Averaging the Derivatives

Wenze Liu, Xiangyu Yue

Main category: cs.LG

TL;DR: Secant Losses: A stable method for accelerating diffusion model inference by learning ODE integration through loss functions that gradually extend tangent to secant, achieving high-quality few-step generation.

Details

Motivation: Existing methods for accelerating diffusion model inference have limitations - numerical solvers perform poorly at extremely small steps, while distillation techniques introduce complexity and instability. There's a need for an intermediate strategy that balances performance and cost.

Method: Proposes secant losses, which learn ODE integration using loss functions derived from the derivative-integral relationship, inspired by Monte Carlo integration and Picard iteration. The method operates geometrically by gradually extending the tangent to the secant. Can be applied through fine-tuning or distillation.

Result: Achieves state-of-the-art few-step generation: Secant version of EDM gets 10-step FID of 2.14 on CIFAR-10; Secant version of SiT-XL/2 attains 4-step FID of 2.27 and 8-step FID of 1.96 on ImageNet-256×256.

Conclusion: Secant losses provide a stable and effective intermediate strategy for accelerating diffusion models, balancing performance and computational cost while maintaining training stability by targeting the same objective as diffusion models.

Abstract: To accelerate diffusion model inference, numerical solvers perform poorly at extremely small steps, while distillation techniques often introduce complexity and instability. This work presents an intermediate strategy, balancing performance and cost, by learning ODE integration using loss functions derived from the derivative-integral relationship, inspired by Monte Carlo integration and Picard iteration. From a geometric perspective, the losses operate by gradually extending the tangent to the secant, thus are named as secant losses. The target of secant losses is the same as that of diffusion models, or the diffusion model itself, leading to great training stability. By fine-tuning or distillation, the secant version of EDM achieves a $10$-step FID of $2.14$ on CIFAR-10, while the secant version of SiT-XL/2 attains a $4$-step FID of $2.27$ and an $8$-step FID of $1.96$ on ImageNet-$256\times256$. Code is available at https://github.com/poppuppy/secant_expectation.

[425] Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

Sourav Ganguly, Arnob Ghosh, Kishan Panaganti, Adam Wierman

Main category: cs.LG

TL;DR: The paper proposes a novel algorithm for robust constrained Markov decision processes (RCMDP) that efficiently learns safe policies under model uncertainty without requiring binary search, achieving significant computational speedups.

Details

Motivation: Real-world control systems require constrained decision-making for safety, but simulators often fail to capture real-world adversities. Standard primal-dual methods don't work for RCMDP due to lack of strong duality, and robust value-iteration approaches fail because worst-case models may differ for reward and constraint value functions.

Method: Proposes a novel technique that minimizes constraint value functions to satisfy constraints, and when constraints are satisfied, maximizes the robust reward value function. The approach avoids binary search used in state-of-the-art methods.

Result: The algorithm finds a policy with at most ε sub-optimality and feasible policy after O(ε⁻²) iterations. It reduces computation time by at least 4x for smaller discount factors and by at least 6x for larger discount factors compared to state-of-the-art methods.

Conclusion: The proposed method provides an efficient solution for robust constrained MDPs that handles model uncertainty while ensuring constraint satisfaction, with significant computational advantages over existing approaches.

Abstract: Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that will maximize the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP) where an agent needs to maximize the reward and satisfy the constraint against the worst possible stochastic model under the uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for standard constrained MDP (CMDP), are not applicable here because of the lack of the strong duality property. Further, one cannot apply the standard robust value-iteration based approach on the composite value function either as the worst case models may be different for the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function–to satisfy the constraints; on the other hand, when all the constraints are satisfied, it can simply maximize the robust reward value function. We prove that such an algorithm finds a policy with at most $ε$ sub-optimality and feasible policy after $O(ε^{-2})$ iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search, thus, we reduce the computation time by at least 4x for smaller value of discount factor ($γ$) and by at least 6x for larger value of $γ$.

[426] AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping

Haonan Dong, Wenhao Zhu, Guojie Song, Liang Wang

Main category: cs.LG

TL;DR: AuroRA improves LoRA by adding an adaptive nonlinear layer between linear projectors, achieving better performance than full fine-tuning with fewer parameters.

Details

Motivation: LoRA faces a low-rank bottleneck where increasing rank to match full fine-tuning performance leads to significant parameter overhead. Existing linear LoRA variants remain fundamentally linear and cannot improve representational capacity.

Method: AuroRA incorporates an Adaptive Nonlinear Layer (ANL) between two linear projectors, forming an MLP-like structure with compressed rank. This captures fixed and learnable nonlinearities for flexible approximation of target functions.

Result: Extensive experiments on 22 datasets and 6 pretrained models show AuroRA: (I) matches/surpasses full fine-tuning with only 6.18%-25% of LoRA’s parameters, (II) outperforms competitive PEFT methods by up to 10.88% in NLP/CV tasks, and (III) shows robust performance across rank configurations.

Conclusion: AuroRA addresses LoRA’s low-rank bottleneck through adaptive nonlinearity, achieving superior performance with fewer parameters while maintaining theoretical guarantees for approximation errors and gradient bounds.

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method validated across NLP and CV domains. However, LoRA faces an inherent low-rank bottleneck: narrowing its performance gap with full finetuning requires increasing the rank of its parameter matrix, resulting in significant parameter overhead. Recent linear LoRA variants have attempted to enhance expressiveness by introducing additional linear mappings; however, their composition remains inherently linear and fails to fundamentally improve LoRA’s representational capacity. To address this limitation, we propose AuroRA, which incorporates an Adaptive Nonlinear Layer (ANL) between two linear projectors to capture fixed and learnable nonlinearities. This combination forms an MLP-like structure with a compressed rank, enabling flexible and precise approximation of diverse target functions while theoretically guaranteeing lower approximation errors and bounded gradients. Extensive experiments on 22 datasets and 6 pretrained models demonstrate that AuroRA: (I) not only matches or surpasses full fine-tuning performance with only 6.18% ~ 25% of LoRA’s parameters but also (II) outperforms competitive PEFT methods by up to 10.88% in both NLP and CV tasks, and (III) exhibits robust performance across various rank configurations.

[427] medDreamer: Model-Based Reinforcement Learning with Latent Imagination on Complex EHRs for Clinical Decision Support

Qianyi Xu, Gousia Habib, Feng Wu, Dilruk Perera, Mengling Feng

Main category: cs.LG

TL;DR: medDreamer is a model-based RL framework for personalized treatment recommendations that handles irregular EHR data and learns optimal policies beyond historical clinical decisions.

Details

Motivation: Existing RL-based clinical decision support systems ignore missing data patterns, use coarse discretization/simple imputation, are predominantly model-free, and suffer from insufficient exploration and bias from historical data.

Method: Proposes medDreamer with: 1) World model with Adaptive Feature Integration module to simulate latent patient states from irregular data, and 2) Two-phase policy trained on hybrid of real and imagined trajectories.

Result: Outperforms model-free and model-based baselines on sepsis and mechanical ventilation treatment tasks using large-scale EHR datasets in both clinical outcomes and off-policy metrics.

Conclusion: medDreamer effectively addresses limitations of existing RL approaches for clinical decision support by handling irregular data patterns and learning optimal policies that go beyond historical clinical decisions while staying close to real data.

Abstract: Timely and personalized treatment decisions are essential across a wide range of healthcare settings where patient responses can vary significantly and evolve over time. Clinical data used to support these treatment decisions are often irregularly sampled, where missing data frequencies may implicitly convey information about the patient’s condition. Existing Reinforcement Learning (RL) based clinical decision support systems often ignore the missing patterns and distort them with coarse discretization and simple imputation. They are also predominantly model-free and largely depend on retrospective data, which could lead to insufficient exploration and bias by historical behaviors. To address these limitations, we propose medDreamer, a novel model-based reinforcement learning framework for personalized treatment recommendation. medDreamer contains a world model with an Adaptive Feature Integration module that simulates latent patient states from irregular data and a two-phase policy trained on a hybrid of real and imagined trajectories. This enables learning optimal policies that go beyond the sub-optimality of historical clinical decisions, while remaining close to real clinical data. We evaluate medDreamer on both sepsis and mechanical ventilation treatment tasks using two large-scale Electronic Health Records (EHRs) datasets. Comprehensive evaluations show that medDreamer significantly outperforms model-free and model-based baselines in both clinical outcomes and off-policy metrics.

[428] Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models

Antti Koskela, Tejas Kulkarni

Main category: cs.LG

TL;DR: Decentralized gossip-based averaging with node-level noise achieves differential privacy with Rényi DP parameter growth O(T), similar to central aggregation when using secure summation.

Details

Motivation: Decentralized ML training offers scalability and robustness benefits, but achieving differential privacy is challenging without a central aggregator and with varying trust assumptions among nodes.

Method: Novel privacy analysis of decentralized gossip-based averaging algorithms with additive node-level noise, using a linear systems formulation to characterize privacy leakage between nodes, especially considering averaging over nearest neighbors with secure summation.

Result: When gossip averaging uses secure summation, the Rényi DP parameter growth is asymptotically O(T), where T is the number of training rounds, similar to central aggregation.

Conclusion: The analytical framework accurately characterizes privacy leakage in decentralized settings, showing that secure summation enables privacy guarantees comparable to centralized approaches.

Abstract: Fully decentralized training of machine learning models offers significant advantages in scalability, robustness, and fault tolerance. However, achieving differential privacy (DP) guarantees in such settings is challenging due to the absence of a central aggregator and varying trust assumptions among nodes. We present a novel privacy analysis of decentralized gossip-based averaging algorithms with additive node-level noise, from arbitrary views of nodes in a graph and especially consider the averaging over nearest neighbors with secure summation and individual node-wise views. Our main contribution is a an analytical framework based on a linear systems formulation that accurately characterizes privacy leakage between nodes across different scenarios. In case the gossip averaging happens via secure summation, we show that the Rényi DP parameter growth is asymptotically $O(T)$, where $T$ is the number of training rounds, similarly as in the case of central aggregation.

[429] Matryoshka Model Learning for Improved Elastic Student Models

Chetan Verma, Aditya Srinivas Timmaraju, Cho-Jui Hsieh, Suyash Damle, Ngot Bui, Yang Zhang, Wen Chen, Xin Liu, Prateek Jain, Inderjit S Dhillon

Main category: cs.LG

TL;DR: MatTA is a framework for training multiple accurate Student models using a Teacher-TA-Student approach, where TA models act as intermediate larger versions that help Students better learn from Teachers, enabling multiple servable options from a single training run.

Details

Motivation: Industry ML models need to meet evolving serving constraints, requiring significant resources for development. There's a need for efficient methods to produce multiple model variants that trade off accuracy for serving costs.

Method: Proposes Teacher-TA-Student recipe: TA models are larger versions of Student models with higher capacity, helping Students better relate to Teacher models and incorporate domain expertise. Multiple accurate Student models can be extracted from a single TA model.

Result: Live A/B tests in production ML system showed 20% improvement on key metric. On GPT-2 Medium, achieved relative improvements of over 24% on SAT Math and over 10% on LAMBADA benchmark.

Conclusion: MatTA enables efficient training of multiple servable Student models from a single training run, providing practical options to trade accuracy for serving costs while maintaining strong performance.

Abstract: Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.

[430] On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity

Quentin Bertrand, Anne Gagneux, Mathurin Massias, Rémi Emonet

Main category: cs.LG

TL;DR: Flow matching generalization is not driven by noisy loss; closed-form loss performs as well or better than stochastic version.

Details

Motivation: To understand why recent generative models like diffusion and flow matching generalize so well, specifically testing whether the stochastic nature of the conditional flow matching loss is a key factor.

Method: Empirical comparison of stochastic vs closed-form flow matching losses in high-dimensional settings, using state-of-the-art flow matching models on standard image datasets.

Result: Stochastic and closed-form flow matching losses yield nearly equivalent losses in high dimensions, with comparable statistical performance, and closed-form can even improve performance.

Conclusion: The noisy nature of the loss is not a key factor driving generalization in flow matching; other factors like architectural inductive biases likely play more important roles.

Abstract: Modern deep generative models can now produce high-quality synthetic samples that are often indistinguishable from real training data. A growing body of research aims to understand why recent methods, such as diffusion and flow matching techniques, generalize so effectively. Among the proposed explanations are the inductive biases of deep learning architectures and the stochastic nature of the conditional flow matching loss. In this work, we rule out the noisy nature of the loss as a key factor driving generalization in flow matching. First, we empirically show that in high-dimensional settings, the stochastic and closed-form versions of the flow matching loss yield nearly equivalent losses. Then, using state-of-the-art flow matching models on standard image datasets, we demonstrate that both variants achieve comparable statistical performance, with the surprising observation that using the closed-form can even improve performance.

[431] Revisiting Clustering of Neural Bandits: Selective Reinitialization for Mitigating Loss of Plasticity

Zhiyuan Su, Sunhao Dai, Xiao Zhang

Main category: cs.LG

TL;DR: Proposes Selective Reinitialization (SeRe) framework to address loss of plasticity in Clustering of Neural Bandits by dynamically resetting underutilized neural units, enabling better adaptation to non-stationary environments like dynamic user preferences.

Details

Motivation: Clustering of Neural Bandits (CNB) algorithms suffer from loss of plasticity where neural network parameters become rigid over time, limiting adaptation to non-stationary environments like dynamic user preferences in recommendation systems.

Method: Selective Reinitialization (SeRe) framework uses contribution utility metric to identify and selectively reset underutilized neural units, with adaptive change detection mechanism adjusting reinitialization frequency based on degree of non-stationarity.

Result: Theoretically proven sublinear cumulative regret in piecewise-stationary environments; extensive experiments on six real-world recommendation datasets show SeRe-enhanced CNB algorithms effectively mitigate loss of plasticity with lower regrets.

Conclusion: SeRe framework successfully addresses plasticity loss in CNB algorithms, improving adaptability and robustness in dynamic recommendation settings while maintaining stable knowledge retention.

Abstract: Clustering of Bandits (CB) methods enhance sequential decision-making by grouping bandits into clusters based on similarity and incorporating cluster-level contextual information, demonstrating effectiveness and adaptability in applications like personalized streaming recommendations. However, when extending CB algorithms to their neural version (commonly referred to as Clustering of Neural Bandits, or CNB), they suffer from loss of plasticity, where neural network parameters become rigid and less adaptable over time, limiting their ability to adapt to non-stationary environments (e.g., dynamic user preferences in recommendation). To address this challenge, we propose Selective Reinitialization (SeRe), a novel bandit learning framework that dynamically preserves the adaptability of CNB algorithms in evolving environments. SeRe leverages a contribution utility metric to identify and selectively reset underutilized units, mitigating loss of plasticity while maintaining stable knowledge retention. Furthermore, when combining SeRe with CNB algorithms, the adaptive change detection mechanism adjusts the reinitialization frequency according to the degree of non-stationarity, ensuring effective adaptation without unnecessary resets. Theoretically, we prove that SeRe enables sublinear cumulative regret in piecewise-stationary environments, outperforming traditional CNB approaches in long-term performances. Extensive experiments on six real-world recommendation datasets demonstrate that SeRe-enhanced CNB algorithms can effectively mitigate the loss of plasticity with lower regrets, improving adaptability and robustness in dynamic settings.

[432] Knowledge Adaptation as Posterior Correction

Mohammad Emtiyaz Khan

Main category: cs.LG

TL;DR: The paper proposes viewing adaptation as “correction” of old posterior distributions, showing that many existing adaptation methods follow this principle across various learning settings.

Details

Motivation: Current AI models lack the fast adaptability of humans and animals, and there's limited understanding of mechanisms enabling rapid adaptation in machines.

Method: Cast adaptation as posterior correction using the dual representation of the Bayesian Learning Rule, quantifying interference between old representations and new information via natural-gradient mismatch.

Result: Shows that many existing adaptation methods (continual learning, federated learning, unlearning, model merging) follow the posterior correction principle, with more accurate posteriors enabling faster adaptation.

Conclusion: Posterior correction provides a unified framework for understanding and enabling fast adaptation in machines across diverse learning scenarios.

Abstract: Adaptation is the holy grail of intelligence, but even the best AI models lack the adaptability of toddlers. In spite of great progress, little is known about the mechanisms by which machines can learn to adapt as fast as humans and animals. Here, we cast adaptation as `correction’ of old posteriors and show that a wide-variety of existing adaptation methods follow this very principle, including those used for continual learning, federated learning, unlearning, and model merging. In all these settings, more accurate posteriors often lead to smaller corrections and can enable faster adaptation. Posterior correction is derived by using the dual representation of the Bayesian Learning Rule of Khan and Rue (2023), where the interference between the old representation and new information is quantified by using the natural-gradient mismatch. We present many examples demonstrating how machines can learn to adapt quickly by using posterior correction.

[433] Training a Scientific Reasoning Model for Chemistry

Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, Andrew D. White

Main category: cs.LG

TL;DR: A 24B parameter reasoning LLM (ether0) trained on chemistry problems outperforms general-purpose models, frontier models, and human experts on molecular design tasks with high data efficiency.

Details

Motivation: To test whether language model reasoning can generalize beyond mathematics/programming/logic to scientific domains like chemistry, and to create more data-efficient specialized models.

Method: Post-trained Mistral-Small-24B with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks, enabling natural language reasoning and chemical structure responses.

Result: ether0 exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks, while being more data-efficient than specialized models.

Conclusion: Reasoning models can be effectively post-trained for scientific domains without extensive domain pretraining, offering a data-efficient approach for specialized scientific language models.

Abstract: Reasoning models are large language models that emit a long chain-of-thought before answering, providing both higher accuracy and explicit reasoning for their response. A major question has been whether language model reasoning generalizes beyond mathematics, programming, and logic, where most previous work has focused. We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining, and require substantially less data compared to contemporary domain-specific models. We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks ranging from synthesizability, to blood-brain barrier permeability, to human receptor activity, to scent. Our model exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks. It is also more data efficient relative to specialized models. We anticipate that this method can be applied to train data-efficient language models specialized for tasks across a wide variety of scientific domains.

[434] scE2TM improves single-cell embedding interpretability and reveals cellular perturbation signatures

Hegang Chen, Yuyin Lu, Yifan Zhao, Zhiming Dai, Fu Lee Wang, Qing Li, Yanghui Rao, Yue Li

Main category: cs.LG

TL;DR: scE2TM is an external knowledge-guided embedded topic model for scRNA-seq analysis that prevents interpretation collapse through embedding clustering regularization, achieving superior clustering performance and biological interpretability across multiple datasets.

Details

Motivation: Current embedded topic models for single-cell RNA-seq analysis suffer from interpretation collapse (topics semantically collapsing into redundancy) and fail to fully capture biological variation. Additionally, the rise of single-cell foundation models creates opportunities to leverage external biological knowledge to guide embeddings.

Method: scE2TM uses embedding clustering regularization where each topic is constrained to be the center of a separately aggregated gene cluster, preventing semantic collapse. The model incorporates external biological knowledge from single-cell foundation models to guide embeddings, ensuring topics capture unique biological information.

Result: Across 20 scRNA-seq datasets, scE2TM achieves superior clustering performance compared to seven state-of-the-art methods. It produces topics with higher diversity and stronger consistency with underlying biological pathways. In interferon-stimulated PBMCs, it successfully simulates topic perturbations mirroring experimental interferon responses. In melanoma, it identifies malignant-specific topics that reveal gene programs associated with patient survival.

Conclusion: scE2TM provides a high-quality, interpretable embedding framework for scRNA-seq analysis that prevents interpretation collapse, leverages external biological knowledge, and delivers both superior clustering performance and meaningful biological insights across diverse applications.

Abstract: Single-cell RNA sequencing technologies have revolutionized our understanding of cellular heterogeneity, yet computational methods often struggle to balance performance with biological interpretability. Embedded topic models have been widely used for interpretable single-cell embedding learning. However, these models suffer from the potential problem of interpretation collapse, where topics semantically collapse towards each other, resulting in redundant topics and incomplete capture of biological variation. Furthermore, the rise of single-cell foundation models creates opportunities to harness external biological knowledge for guiding model embeddings. Here, we present scE2TM, an external knowledge-guided embedded topic model that provides a high-quality cell embedding and interpretation for scRNA-seq analysis. Through embedding clustering regularization method, each topic is constrained to be the center of a separately aggregated gene cluster, enabling it to capture unique biological information. Across 20 scRNA-seq datasets, scE2TM achieves superior clustering performance compared with seven state-of-the-art methods. A comprehensive interpretability benchmark further shows that scE2TM-learned topics exhibit higher diversity and stronger consistency with underlying biological pathways. Modeling interferon-stimulated PBMCs, scE2TM simulates topic perturbations that drive control cells toward stimulated-like transcriptional states, faithfully mirroring experimental interferon responses. In melanoma, scE2TM identifies malignant-specific topics and extrapolates them to unseen patient data, revealing gene programs associated with patient survival.

[435] Instruction-based Time Series Editing

Jiaxing Qiu, Dongliang Guo, Brynne Sullivan, Teague R. Henry, Tom Hartvigsen

Main category: cs.LG

TL;DR: InstructTime is the first instruction-based time series editor that uses natural language instructions to edit time series with controllable strength, overcoming limitations of rigid attribute-based diffusion methods.

Details

Motivation: Existing diffusion-based time series editors rely on predefined attribute vectors and all-or-nothing sampling, limiting flexibility in condition format and lacking customizable control over editing strength. There's a need for more accessible and flexible editing methods.

Method: InstructTime embeds time series and natural language instructions into a shared multi-modal representation space, then decodes to generate edited time series. It uses multi-resolution encoders to handle both local and global edits, and enables controllable editing strength through interpolation in the learned representation space.

Result: InstructTime achieves state-of-the-art performance: high-quality edits with controllable strength, generalization to unseen instructions, and easy adaptation to unseen conditions through few-shot learning on both synthetic and real datasets.

Conclusion: Instruction-based time series editing with InstructTime provides a flexible, accessible approach that overcomes limitations of existing methods, enabling natural language control and customizable editing strength for diverse time series editing tasks.

Abstract: In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient’s blood pressure, we may add a sudden early drop and observe how it impacts their future while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits using natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments, we use synthetic and real datasets and find that InstructTime is a state-of-the-art time series editor: InstructTime achieves high-quality edits with controllable strength, can generalize to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.

[436] Dynamic Feature Selection based on Rule-based Learning for Explainable Classification with Uncertainty Quantification

Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez

Main category: cs.LG

TL;DR: The paper proposes a rule-based dynamic feature selection method that addresses interpretability and uncertainty estimation issues in DFS, showing competitive performance against opaque methods.

Details

Motivation: Dynamic feature selection (DFS) adapts features per sample for better decision transparency, especially valuable in clinical settings. However, existing DFS methods use opaque models and ignore uncertainty introduced by DFS, limiting real-world applicability.

Method: The authors formalize additional uncertainty sources in DFS and provide estimation formulas. They propose a novel rule-based system as the base classifier for DFS to enhance interpretability compared to neural estimators.

Result: The rule-based DFS approach demonstrates competitive performance against established greedy and reinforcement learning methods, which are considered opaque compared to the explainable rule-based system.

Conclusion: The paper presents an interpretable DFS framework that addresses transparency and uncertainty issues, making DFS more applicable to real-life scenarios like clinical decision-making where explainability is crucial.

Abstract: Dynamic feature selection (DFS) offers a compelling alternative to traditional, static feature selection by adapting the selected features to each individual sample. This provides insights into the decision-making process for each case, which makes DFS especially significant in settings where decision transparency is key, i.e., clinical decisions. However, existing DFS methods use opaque models, which hinder their applicability in real-life scenarios. DFS also introduces new own sources of uncertainty compared to the static setting, which is also not considered in the existing literature. In this paper, we formalize the additional sources of uncertainty in DFS, and give formulas to estimate them. We also propose novel approach by leveraging a rule-based system as a base classifier for the DFS process, which enhances decision interpretability compared to neural estimators. Finally, we demonstrate the competitive performance of our rule-based DFS approach against established and state-of-the-art greedy and reinforcement learning methods, which are mostly considered opaque, compared to our explainable rulebased system.

[437] Increasing Information Extraction in Low-Signal Regimes via Multiple Instance Learning

Atakan Azakli, Bernd Stelzer

Main category: cs.LG

TL;DR: MIL can outperform single-instance learning in low-signal regimes for parameter estimation, contrary to prior belief that single-instance methods are sufficient.

Details

Motivation: Prior work suggested single-instance methods are often sufficient for MIL tasks, but this assumes enough signal for near-optimal classifiers. The authors challenge this by showing single-instance models can fail in low-signal regimes where MIL can perform better.

Method: Introduces an information-theoretic perspective on MIL for parameter estimation with i.i.d. data. Applies this to constrain Wilson coefficients of SMEFT using kinematic information from LHC particle collision events.

Result: MIL can mitigate sub-optimality of single-instance models in low-signal regimes. Under specific modeling and weak signal conditions, pooling instances increases effective Fisher information compared to single-instance approaches.

Conclusion: MIL outperforms single-instance learning in challenging low-signal regimes, providing practical benefits for applications like particle physics where signal strength is weak.

Abstract: In this work, we introduce a new information-theoretic perspective on Multiple Instance Learning (MIL) for parameter estimation with i.i.d. data, and show that MIL can outperform single-instance learners in low-signal regimes. Prior work [Nachman and Thaler, 2021] argued that single-instance methods are often sufficient, but this conclusion presumes enough single-instance signal to train near-optimal classifiers. We demonstrate that even state-of-the-art single-instance models can fail to reach optimal classifier performance in challenging low-signal regimes, whereas MIL can mitigate this sub-optimality. As a concrete application, we constrain Wilson coefficients of the Standard Model Effective Field Theory (SMEFT) using kinematic information from subatomic particle collision events at the Large Hadron Collider (LHC). In experiments, we observe that under specific modeling and weak signal conditions, pooling instances can increase the effective Fisher information compared to single-instance approaches.

[438] GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values

Songyu Ke, Chenyu Wu, Yuxuan Liang, Huiling Qin, Junbo Zhang, Yu Zheng

Main category: cs.LG

TL;DR: GeoMAE is a self-supervised spatio-temporal representation learning model that addresses missing data challenges in urban intelligence systems, achieving up to 13.20% improvement over baselines.

Details

Motivation: Missing data in urban intelligence systems (due to environmental conditions and equipment failures) challenges downstream applications like traffic forecasting and energy prediction. Existing methods focus on time-series analysis while neglecting dynamic spatial correlations, and complex missing data patterns with varying maintenance conditions challenge model generalizability.

Method: GeoMAE has three main components: 1) Input preprocessing module, 2) Attention-based spatio-temporal forecasting network (STAFN), and 3) Auxiliary learning task inspired by Masking AutoEncoders to enhance robustness of spatio-temporal representation learning.

Result: Empirical evaluations on real-world datasets show GeoMAE significantly outperforms existing benchmarks, achieving up to 13.20% relative improvement over the best baseline models.

Conclusion: GeoMAE provides an effective self-supervised approach for robust spatio-temporal learning from incomplete urban datasets, addressing both spatial and temporal correlations while handling complex missing data patterns.

Abstract: The ubiquity of missing data in urban intelligence systems, attributable to adverse environmental conditions and equipment failures, poses a significant challenge to the efficacy of downstream applications, notably in the realms of traffic forecasting and energy consumption prediction. Therefore, it is imperative to develop a robust spatio-temporal learning methodology capable of extracting meaningful insights from incomplete datasets. Despite the existence of methodologies for spatio-temporal graph forecasting in the presence of missing values, unresolved issues persist. Primarily, the majority of extant research is predicated on time-series analysis, thereby neglecting the dynamic spatial correlations inherent in sensor networks. Additionally, the complexity of missing data patterns compounds the intricacy of the problem. Furthermore, the variability in maintenance conditions results in a significant fluctuation in the ratio and pattern of missing values, thereby challenging the generalizability of predictive models. In response to these challenges, this study introduces GeoMAE, a self-supervised spatio-temporal representation learning model. The model is comprised of three principal components: an input preprocessing module, an attention-based spatio-temporal forecasting network (STAFN), and an auxiliary learning task, which draws inspiration from Masking AutoEncoders to enhance the robustness of spatio-temporal representation learning. Empirical evaluations on real-world datasets demonstrate that GeoMAE significantly outperforms existing benchmarks, achieving up to 13.20% relative improvement over the best baseline models.

[439] Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem

Soumyajit Guin, Shalabh Bhatnagar

Main category: cs.LG

TL;DR: The paper proposes novel algorithms for Stochastic Shortest Path (SSP) problems in RL, with two tabular algorithms and one function approximation algorithm, showing asymptotic convergence and superior performance.

Details

Motivation: SSP problems are important in RL because other RL cost-criteria can be formulated as SSP problems. There's a need for effective algorithms that guarantee convergence in both tabular and function approximation settings.

Method: Developed two algorithms for tabular SSP settings and one algorithm for function approximation settings. All algorithms are designed to achieve asymptotic almost-sure convergence.

Result: The tabular algorithms outperform other well-known convergent RL algorithms. The function approximation algorithm shows reliable performance compared to other algorithms in the function approximation setting. All algorithms demonstrate asymptotic almost-sure convergence.

Conclusion: The proposed algorithms provide effective solutions for SSP problems in RL with guaranteed convergence and superior performance compared to existing methods in both tabular and function approximation settings.

Abstract: In this paper we propose two algorithms in the tabular setting and an algorithm for the function approximation setting for the Stochastic Shortest Path (SSP) problem. SSP problems form an important class of problems in Reinforcement Learning (RL), as other types of cost-criteria in RL can be formulated in the setting of SSP. We show asymptotic almost-sure convergence for all our algorithms. We observe superior performance of our tabular algorithms compared to other well-known convergent RL algorithms. We further observe reliable performance of our function approximation algorithm compared to other algorithms in the function approximation setting.

[440] Implicit Hypergraph Neural Network

Akash Choudhuri, Yongjian Zhong, Bijaya Adhikari

Main category: cs.LG

TL;DR: IHNN is an implicit hypergraph neural network that captures long-range dependencies in hypergraphs while maintaining performance, outperforming existing methods in node classification tasks.

Details

Motivation: Existing hypergraph neural networks suffer from performance degradation when trying to capture long-range dependencies by increasing message-passing rounds, creating a trade-off between local information and long-range high-order dependencies.

Method: Proposes Implicit Hypergraph Neural Network (IHNN) that jointly learns fixed-point representations for both nodes and hyperedges using implicit differentiation, with a tractable projected gradient descent approach for efficient training.

Result: Extensive experiments on real-world hypergraphs for node classification show IHNN outperforms prior works in most settings, establishing new state-of-the-art in hypergraph learning.

Conclusion: IHNN successfully addresses the long-range dependency problem in hypergraph neural networks by leveraging implicit differentiation to learn fixed-point representations, achieving superior performance while capturing both local and global information.

Abstract: Hypergraphs offer a generalized framework for capturing high-order relationships between entities and have been widely applied in various domains, including healthcare, social networks, and bioinformatics. Hypergraph neural networks, which rely on message-passing between nodes over hyperedges to learn latent representations, have emerged as the method of choice for predictive tasks in many of these domains. These approaches typically perform only a small number of message-passing rounds to learn the representations, which they then utilize for predictions. The small number of message-passing rounds comes at a cost, as the representations only capture local information and forego long-range high-order dependencies. However, as we demonstrate, blindly increasing the message-passing rounds to capture long-range dependency also degrades the performance of hyper-graph neural networks. Recent works have demonstrated that implicit graph neural networks capture long-range dependencies in standard graphs while maintaining performance. Despite their popularity, prior work has not studied long-range dependency issues on hypergraph neural networks. Here, we first demonstrate that existing hypergraph neural networks lose predictive power when aggregating more information to capture long-range dependency. We then propose Implicit Hypergraph Neural Network (IHNN), a novel framework that jointly learns fixed-point representations for both nodes and hyperedges in an end-to-end manner to alleviate this issue. Leveraging implicit differentiation, we introduce a tractable projected gradient descent approach to train the model efficiently. Extensive experiments on real-world hypergraphs for node classification demonstrate that IHNN outperforms the closest prior works in most settings, establishing a new state-of-the-art in hypergraph learning.

[441] Amortized Sampling with Transferable Normalizing Flows

Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov

Main category: cs.LG

TL;DR: Prose is a 285M parameter transferable normalizing flow trained on peptide MD trajectories that enables zero-shot uncorrelated sampling across different peptide systems, achieving transferability across sequence length while maintaining efficient likelihood evaluation.

Details

Motivation: Classical molecular conformation sampling methods (MD, MCMC) lack amortization - computational cost must be paid for each system. Learned samplers have shown limited transferability across systems despite competitive performance on single systems.

Method: Prose: a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Uses importance sampling-based finetuning procedure.

Result: Prose achieves zero-shot uncorrelated proposal sampling for arbitrary peptide systems with previously intractable transferability across sequence length. Competitive performance to established methods like sequential Monte Carlo when finetuned.

Conclusion: Deep learning enables scalable and transferable samplers. Prose demonstrates practical transferable sampling for peptides, with open-sourced codebase, model weights, and training dataset to stimulate further research in amortized sampling methods.

Abstract: Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.

[442] Simulating classification models to evaluate Predict-Then-Optimize methods

Pieter Smet

Main category: cs.LG

TL;DR: The paper examines the relationship between prediction accuracy and solution quality in Predict-Then-Optimize frameworks, introduces algorithms for simulating classifier predictions, and shows the non-trivial relationship between prediction error and optimization performance.

Details

Motivation: The paper addresses the common but often unverified assumption in Predict-Then-Optimize approaches that more accurate predictions lead to better optimization solutions. This assumption is challenging to validate for complex, constrained optimization problems and is frequently overlooked in literature.

Method: The authors introduce a new algorithm for simulating predictions of multiclass classifiers, complementing existing binary classification simulation methods. They conduct computational studies to evaluate algorithm performance and apply these simulation algorithms to assess a Predict-Then-Optimize algorithm for a machine scheduling problem.

Result: The simulation algorithms can reproduce classifier performance with reasonable accuracy, though some variability is observed. More importantly, experiments reveal that the relationship between prediction error and solution quality (proximity to actual optimum) is non-trivial and complex.

Conclusion: The relationship between prediction accuracy and optimization solution quality is not straightforward, highlighting important considerations for designing and evaluating decision-making systems based on machine learning predictions. Simulation approaches provide valuable tools for analyzing these relationships without training real models.

Abstract: Uncertainty in optimization is often represented as stochastic parameters in the optimization model. In Predict-Then-Optimize approaches, predictions of a machine learning model are used as values for such parameters, effectively transforming the stochastic optimization problem into a deterministic one. This two-stage framework is built on the assumption that more accurate predictions result in solutions that are closer to the actual optimal solution. However, providing evidence for this assumption in the context of complex, constrained optimization problems is challenging and often overlooked in the literature. Simulating predictions of machine learning models offers a way to (experimentally) analyze how prediction error impacts solution quality without the need to train real models. Complementing an algorithm from the literature for simulating binary classification, we introduce a new algorithm for simulating predictions of multiclass classifiers. We conduct a computational study to evaluate the performance of these algorithms, and show that classifier performance can be simulated with reasonable accuracy, although some variability is observed. Additionally, we apply these algorithms to assess the performance of a Predict-Then-Optimize algorithm for a machine scheduling problem. The experiments demonstrate that the relationship between prediction error and how close solutions are to the actual optimum is non-trivial, highlighting important considerations for the design and evaluation of decision-making systems based on machine learning predictions.

[443] CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

Wenhui Cui, Christopher Sandino, Hadi Pouransari, Ran Liu, Juri Minxha, Ellen Zippi, Aman Verma, Anna Sedlackova, Erdrin Azemi, Behrooz Mahasseni

Main category: cs.LG

TL;DR: CPEP framework aligns EMG signals with pose representations for improved gesture classification, enabling zero-shot learning for unseen gestures.

Details

Motivation: Traditional gesture classification relies on high-quality structured data (videos, images, skeletons), but low-power biosignals like EMG enable continuous prediction on wearables. However, EMG data is weaker modality that needs alignment with structured data for better representation learning.

Method: Proposes Contrastive Pose-EMG Pre-training (CPEP) framework that aligns EMG and pose representations through contrastive learning. Learns an EMG encoder that produces high-quality, pose-informative representations by aligning with structured pose data.

Result: Outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification. Enables zero-shot classification for gestures not seen during training.

Conclusion: Aligning weak-modality EMG data with structured pose representations significantly improves gesture classification performance and enables zero-shot learning, making continuous gesture prediction on wearables more effective and generalizable.

Abstract: Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Leveraging low-power, cost-effective biosignals, e.g. surface electromyography (sEMG), allows for continuous gesture prediction on wearables. In this paper, we demonstrate that learning representations from weak-modality data that are aligned with those from structured, high-quality data can improve representation quality and enables zero-shot classification. Specifically, we propose a Contrastive Pose-EMG Pre-training (CPEP) framework to align EMG and pose representations, where we learn an EMG encoder that produces high-quality and pose-informative representations. We assess the gesture classification performance of our model through linear probing and zero-shot setups. Our model outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification.

[444] Spontaneous Kolmogorov-Arnold Geometry in Shallow MLPs

Michael H. Freedman, Michael Mulligan

Main category: cs.LG

TL;DR: The paper investigates whether Kolmogorov-Arnold (KA) geometry emerges naturally in vanilla single-hidden-layer neural networks during training, rather than being engineered into architectures like KANs.

Details

Motivation: To understand how neural networks organically learn to prepare input data for downstream processing, and to learn about KA geometry emergence to potentially accelerate learning through timely hyperparameter interventions.

Method: Quantify KA geometry through statistical properties of the exterior powers of the Jacobian matrix J(x): number of zero rows and various observables for minor statistics, measuring scale and axis alignment of J(x).

Result: KA geometry often emerges when training vanilla single hidden layer neural networks, leading to rough understanding of where KA geometry occurs in function complexity and model hyperparameter space.

Conclusion: This research represents the “flip side” of KA-Networks (KANs) - instead of engineering KA into neural networks, the paper observes KA emerging organically in shallow MLPs during training.

Abstract: The Kolmogorov-Arnold (KA) representation theorem constructs universal, but highly non-smooth inner functions (the first layer map) in a single (non-linear) hidden layer neural network. Such universal functions have a distinctive local geometry, a “texture,” which can be characterized by the inner function’s Jacobian $J({\mathbf{x}})$, as $\mathbf{x}$ varies over the data. It is natural to ask if this distinctive KA geometry emerges through conventional neural network optimization. We find that indeed KA geometry often is produced when training vanilla single hidden layer neural networks. We quantify KA geometry through the statistical properties of the exterior powers of $J(\mathbf{x})$: number of zero rows and various observables for the minor statistics of $J(\mathbf{x})$, which measure the scale and axis alignment of $J(\mathbf{x})$. This leads to a rough understanding for where KA geometry occurs in the space of function complexity and model hyperparameters. The motivation is first to understand how neural networks organically learn to prepare input data for later downstream processing and, second, to learn enough about the emergence of KA geometry to accelerate learning through a timely intervention in network hyperparameters. This research is the “flip side” of KA-Networks (KANs). We do not engineer KA into the neural network, but rather watch KA emerge in shallow MLPs.

[445] Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

Yuandong Tian

Main category: cs.LG

TL;DR: The paper proposes a mathematical framework called Li₂ that characterizes grokking (delayed generalization) in 2-layer nonlinear networks through three stages: lazy learning, independent feature learning, and interactive feature learning.

Details

Motivation: While grokking has been studied extensively, there's no mathematical framework that characterizes what features emerge, how and when it happens, and how it relates to gradient dynamics for complex structured inputs.

Method: Proposes the Li₂ framework that captures three key stages: (I) Lazy learning where top layer overfits to random hidden representation, (II) Independent feature learning where backpropagated gradient enables hidden nodes to learn representations independently following gradient ascent of energy function E, and (III) Interactive feature learning where gradient focuses on missing features.

Result: The framework explains grokking behavior, provides provable scaling laws for feature emergence, memorization and generalization, reveals why optimizers like Muon are effective, and shows how key hyperparameters (weight decay, learning rate, sample sizes) affect grokking.

Conclusion: The Li₂ framework provides a principled mathematical understanding of grokking dynamics, feature emergence, and generalization in neural networks, with extensions possible to multi-layer architectures.

Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, how and in which conditions it happens, and is closely related to the gradient dynamics of the training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li}_2$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) Lazy learning, (II) independent feature learning and (III) interactive feature learning. At the lazy learning stage, top layer overfits to random hidden representation and the model appears to memorize, and at the same time, the backpropagated gradient $G_F$ from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn their representation independently. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change on sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds lights on roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of feature emergence, memorization and generalization, and reveals why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layers. The code is available at https://github.com/yuandong-tian/understanding/tree/main/ssl/real-dataset/cogo.

[446] SimpleFold: Folding Proteins is Simpler than You Think

Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Josh Susskind, Miguel Angel Bautista

Main category: cs.LG

TL;DR: SimpleFold is a flow-matching based protein folding model using only standard transformer blocks, achieving competitive performance without domain-specific architectural designs.

Details

Motivation: To challenge the necessity of complex domain-specific architectures in protein folding models by exploring whether general-purpose transformer blocks alone can achieve state-of-the-art performance.

Method: Uses standard transformer blocks with adaptive layers, trained via generative flow-matching objective with additional structural term. Scaled to 3B parameters and trained on ~9M distilled protein structures plus experimental PDB data.

Result: SimpleFold-3B achieves competitive performance on standard folding benchmarks compared to SOTA baselines, shows strong ensemble prediction capability, and demonstrates efficient deployment on consumer hardware.

Conclusion: SimpleFold challenges the reliance on complex domain-specific architectures in protein folding, opening up alternative design space for future progress using general-purpose architectures.

Abstract: Protein folding models have achieved groundbreaking results typically via a combination of integrating domain knowledge into the architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition to build performant models. In this paper, we introduce SimpleFold, the first flow-matching based protein folding model that solely uses general purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared to state-of-the-art baselines, in addition SimpleFold demonstrates strong performance in ensemble prediction which is typically difficult for models trained via deterministic reconstruction objectives. Due to its general-purpose architecture, SimpleFold shows efficiency in deployment and inference on consumer-level hardware. SimpleFold challenges the reliance on complex domain-specific architectures designs in protein folding, opening up an alternative design space for future progress.

[447] Adversarial Diffusion for Robust Reinforcement Learning

Daniele Foffano, Alessio Russo, Alexandre Proutiere

Main category: cs.LG

TL;DR: AD-RRL uses diffusion models to train robust RL policies by generating worst-case trajectories during training, optimizing CVaR for robustness against environment uncertainties.

Details

Motivation: Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning. Existing methods struggle with compounding errors in step-by-step transition models and lack flexibility in handling uncertainty.

Method: AD-RRL leverages diffusion models to generate full trajectories “all at once” to avoid compounding errors. It uses conditional sampling to generate worst-case trajectories during training, building on the connection between Conditional Value at Risk (CVaR) optimization and robust RL to optimize the CVaR of cumulative return.

Result: Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.

Conclusion: Diffusion models provide an effective approach for training robust RL policies by enabling generation of worst-case trajectories and optimizing CVaR, addressing the challenge of robustness to modeling errors and uncertainties.

Abstract: Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning (RL). In this work, we address this challenge by leveraging diffusion models to train robust RL policies. Diffusion models have recently gained popularity in model-based RL due to their ability to generate full trajectories “all at once”, mitigating the compounding errors typical of step-by-step transition models. Moreover, they can be conditioned to sample from specific distributions, making them highly flexible. We leverage conditional sampling to learn policies that are robust to uncertainty in environment dynamics. Building on the established connection between Conditional Value at Risk (CVaR) optimization and robust RL, we introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the CVaR of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.

[448] BoreaRL: A Multi-Objective Reinforcement Learning Environment for Climate-Adaptive Boreal Forest Management

Kevin Bradley Dsouza, Enoch Ofosu, Daniel Chukwuemeka Amaogu, Jérôme Pigeon, Richard Boudreault, Pooneh Maghoul, Juan Moreno-Cruz, Yuri Leonenko

Main category: cs.LG

TL;DR: BoreaRL is a new multi-objective RL environment for boreal forest management that reveals carbon objectives are easier to optimize than permafrost preservation, with current MORL methods struggling to achieve robust climate-adaptive policies.

Details

Motivation: Boreal forests store massive amounts of carbon in permafrost soils vulnerable to climate change, but current tools can't adequately address the complex trade-offs between carbon sequestration and permafrost preservation in forest management.

Method: Developed BoreaRL, a physically-grounded simulator of coupled energy, carbon, and water fluxes with two training paradigms: site-specific mode for controlled studies and generalist mode for learning robust policies under environmental stochasticity. Evaluated multi-objective RL algorithms on this environment.

Result: Found fundamental asymmetry: carbon objectives are significantly easier to optimize than thaw (permafrost) objectives, with thaw-focused policies showing minimal learning progress. In generalist settings, standard preference-conditioned approaches fail, while naive site selection achieves superior performance. Carbon-focused policies favor aggressive high-density coniferous stands, while effective multi-objective policies balance species composition and density.

Conclusion: Robust climate-adaptive forest management remains challenging for current MORL methods, establishing BoreaRL as a valuable benchmark for developing more effective approaches. The environment is open-sourced to accelerate research in multi-objective RL for climate applications.

Abstract: Boreal forests store 30-40% of terrestrial carbon, much in climate-vulnerable permafrost soils, making their management critical for climate mitigation. However, optimizing forest management for both carbon sequestration and permafrost preservation presents complex trade-offs that current tools cannot adequately address. We introduce BoreaRL, the first multi-objective reinforcement learning environment for climate-adaptive boreal forest management, featuring a physically-grounded simulator of coupled energy, carbon, and water fluxes. BoreaRL supports two training paradigms: site-specific mode for controlled studies and generalist mode for learning robust policies under environmental stochasticity. Through evaluation of multi-objective RL algorithms, we reveal a fundamental asymmetry in learning difficulty: carbon objectives are significantly easier to optimize than thaw (permafrost preservation) objectives, with thaw-focused policies showing minimal learning progress across both paradigms. In generalist settings, standard gradient-descent based preference-conditioned approaches fail, while a naive site selection approach achieves superior performance by strategically selecting training episodes. Analysis of learned strategies reveals distinct management philosophies, where carbon-focused policies favor aggressive high-density coniferous stands, while effective multi-objective policies balance species composition and density to protect permafrost while maintaining carbon gains. Our results demonstrate that robust climate-adaptive forest management remains challenging for current MORL methods, establishing BoreaRL as a valuable benchmark for developing more effective approaches. We open-source BoreaRL to accelerate research in multi-objective RL for climate applications.

[449] ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao

Main category: cs.LG

TL;DR: LLMs struggle with genuine strategic reasoning in chess, failing to beat amateur-level chess engines, with fine-tuned models showing improvement but still lacking true reasoning capabilities.

Details

Motivation: To determine whether LLMs possess genuine complex strategic reasoning skills or are just good at pattern recognition, using chess as a testbed since it requires long-term planning, rule comprehension, and conversation memorization.

Method: Created ChessArena - a competitive framework where LLMs play against each other in four different modes, equipped with ranking algorithm and leaderboard. Evaluated 13 LLMs playing over 800 games, assessing basic understanding, move selection, and puzzle solving capabilities.

Result: Significant shortcomings revealed: no LLM could beat Maia-1100 (amateur-level chess engine), some failed to defeat random players. Fine-tuned Qwen3-8B showed substantial improvement, approaching larger state-of-the-art reasoning models.

Conclusion: Current LLMs lack genuine strategic reasoning capabilities despite strong pattern recognition skills, highlighting the need for better reasoning benchmarks and specialized training approaches.

Abstract: Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills particularly complex strategic reasoning or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other, under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard. The testbed can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline to the testbed: our fine-tuned Qwen3-8B substantially improved performance, approaching much larger state-of-the-art reasoning models.

[450] Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations

Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, Sergey Levine

Main category: cs.LG

TL;DR: Unified approach combining contrastive representations and quasimetric temporal distances for goal-conditioned RL, enabling optimal goal-reaching with suboptimal data in stochastic environments.

Details

Motivation: Current GCRL approaches use either contrastive representations (successor features) or temporal distances (quasimetric networks), each with limitations. Contrastive methods struggle with stitching tasks, while quasimetric methods struggle with noisy, high-dimensional environments. A unified approach could combine the strengths of both frameworks.

Method: Proposes a representation learning approach that unifies contrastive representations and quasimetric temporal distances. Uses quasimetric representation space structure (triangle inequality) with additional constraints to learn successor representations that enable optimal goal-reaching. Leverages quasimetric distance parameterization to learn optimal distances even with suboptimal data in stochastic environments.

Result: The unified approach improves performance on stitching tasks where contrastive learning methods struggle, and on noisy, high-dimensional environments where quasimetric network methods struggle. Combines stability and long-horizon capabilities of Monte Carlo contrastive RL with free stitching capabilities of quasimetric parameterizations.

Conclusion: The proposed unification of contrastive representations and quasimetric temporal distances provides a robust GCRL approach that addresses limitations of both individual frameworks, enabling optimal goal-reaching in challenging scenarios with suboptimal data and stochastic environments.

Abstract: Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) contrastive representations, in which methods learn “successor features” with a contrastive objective that performs inference over future outcomes, and (2) temporal distances, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a quasimetric distance parameterization to learn optimal goal-reaching distances, even with suboptimal data and in stochastic environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.

[451] GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models

Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer

Main category: cs.LG

TL;DR: GLASS Flows is a new sampling method that enables efficient reward alignment for flow/diffusion models by using ODE sampling for Markov transitions instead of slower SDE sampling.

Details

Motivation: Current reward alignment algorithms for flow matching and diffusion models are inefficient because they rely on slow SDE sampling for Markov transitions, creating a bottleneck that limits practical application.

Method: GLASS Flows introduces a “flow matching model within a flow matching model” paradigm that simulates Markov transitions using ODE sampling instead of SDE sampling. The “inner” flow matching model can be retrieved from pre-trained models without retraining.

Result: GLASS Flows eliminates the trade-off between stochastic evolution and efficiency, and when combined with Feynman-Kac Steering, achieves state-of-the-art performance in text-to-image generation.

Conclusion: GLASS Flows provides a simple, drop-in solution for efficient inference-time scaling of flow and diffusion models by combining ODE efficiency with SDE-like stochastic evolution.

Abstract: The performance of flow matching and diffusion models can be greatly improved at inference time using reward alignment algorithms, yet efficiency remains a major limitation. While several algorithms were proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many algorithms require to sample Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions. As we show in this work, this “inner” flow matching model can be retrieved from a pre-trained model without any re-training, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art performance in text-to-image generation, making it a simple, drop-in solution for inference-time scaling of flow and diffusion models.

[452] LLM-NAS: LLM-driven Hardware-Aware Neural Architecture Search

Hengyi Zhu, Grace Li Zhang, Shaoyi Huang

Main category: cs.LG

TL;DR: LLM-NAS: An LLM-driven Neural Architecture Search method that uses complexity partitioning, prompt co-evolution, and zero-cost predictors to efficiently generate diverse, high-accuracy, low-latency neural architectures, reducing search time from days to minutes.

Details

Motivation: Traditional HW-NAS methods require multiple GPU days per dataset, while LLM-driven approaches suffer from exploration bias - they repeatedly propose designs within limited search space and fail to discover architectures across different latency ranges.

Method: Three key components: 1) Complexity-driven partitioning engine to divide search space by complexity for diversity; 2) LLM-powered architecture prompt co-evolution operator where LLM updates design heuristics knowledge base and performs guided evolution; 3) Zero-cost predictor to avoid training candidates from scratch.

Result: On HW-NAS-Bench, LLM-NAS achieves higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Search cost drops from days to minutes compared with traditional supernet baselines.

Conclusion: LLM-NAS effectively addresses exploration bias in LLM-driven NAS, enabling efficient discovery of diverse, high-performance neural architectures across different latency ranges while dramatically reducing search time.

Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) requires joint optimization of accuracy and latency under device constraints. Traditional supernet-based methods require multiple GPU days per dataset. Large Language Model (LLM)-driven approaches avoid training a large supernet and can provide quick feedback, but we observe an exploration bias: the LLM repeatedly proposes neural network designs within limited search space and fails to discover architectures across different latency ranges in the entire search space. To address this issue, we propose LLM-NAS: an LLM-driven Neural Architecture Search that can generate neural networks with high accuracy and low latency with reduced search cost. Our proposed LLM-NAS has three key components: 1) a complexity-driven partitioning engine that divides the search space by complexity to enforce diversity and mitigate exploration bias; 2) an LLM-powered architecture prompt co-evolution operator, in which the LLM first updates a knowledge base of design heuristics based on results from the previous round, then performs a guided evolution algorithm on architectures with prompts that incorporate this knowledge base. Prompts and designs improve together across rounds which avoids random guesswork and improve efficiency; 3) a zero-cost predictor to avoid training a large number of candidates from scratch. Experimental results show that on HW-NAS-Bench, LLM-NAS can achieve overall higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Meanwhile, the search cost drops from days to minutes compared with traditional supernet baselines.

[453] Reference Grounded Skill Discovery

Seungeun Rho, Aaron Trinh, Danfei Xu, Sehoon Ha

Main category: cs.LG

TL;DR: RGSD uses reference data to ground skill discovery in semantically meaningful latent space, enabling both imitation of reference behaviors and discovery of related variations for high-DoF agents.

Details

Motivation: Scaling unsupervised skill discovery to high-DoF agents is challenging due to exponential exploration space growth and limited meaningful skill manifold, requiring semantic guidance for effective exploration.

Method: RGSD performs contrastive pretraining to embed motions on a unit hypersphere, clustering reference trajectories into distinct directions, grounding skill discovery in semantically meaningful latent space.

Result: RGSD successfully imitates skills (walking, running, punching, sidestepping) and discovers variations on a 359-D observation, 69-D action SMPL humanoid, outperforming imitation-learning baselines in downstream tasks.

Conclusion: Lightweight reference-grounding offers a practical path to discovering semantically rich and structured skills in high-DoF systems, enabling faithful style command satisfaction.

Abstract: Scaling unsupervised skill discovery algorithms to high-DoF agents remains challenging. As dimensionality increases, the exploration space grows exponentially, while the manifold of meaningful skills remains limited. Therefore, semantic meaningfulness becomes essential to effectively guide exploration in high-dimensional spaces. In this work, we present Reference-Grounded Skill Discovery (RGSD), a novel algorithm that grounds skill discovery in a semantically meaningful latent space using reference data. RGSD first performs contrastive pretraining to embed motions on a unit hypersphere, clustering each reference trajectory into a distinct direction. This grounding enables skill discovery to simultaneously involve both imitation of reference behaviors and the discovery of semantically related diverse behaviors. On a simulated SMPL humanoid with $359$-D observations and $69$-D actions, RGSD successfully imitates skills such as walking, running, punching, and sidestepping, while also discover variations of these behaviors. In downstream locomotion tasks, RGSD leverages the discovered skills to faithfully satisfy user-specified style commands and outperforms imitation-learning baselines, which often fail to maintain the commanded style. Overall, our results suggest that lightweight reference-grounding offers a practical path to discovering semantically rich and structured skills in high-DoF systems.

[454] Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

Tuan Nguyen, Long Tran-Thanh

Main category: cs.LG

TL;DR: A black-box framework for LLM safety alignment that uses game theory and linear programming at inference time, requiring no model retraining or internal access.

Details

Motivation: Existing alignment methods are costly, inflexible, and require model retraining or internal access, making them impractical for third-party stakeholders and resource-constrained settings.

Method: Formulates safety-helpfulness trade-off as a two-player zero-sum game, uses LLM agents with linear programming solver at inference time to compute equilibrium strategies without accessing model internals.

Result: Demonstrates feasibility of black-box safety alignment, offering scalable and accessible pathway for stakeholders to enforce safety across evolving LLM ecosystems.

Conclusion: Proposes a model-independent framework that enables third-party safety alignment without retraining or internal model access, making safety enforcement more accessible and scalable.

Abstract: Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.

[455] Multi-View Graph Learning with Graph-Tuple

Shiyu Chen, Ningyuan Huang, Soledad Villar

Main category: cs.LG

TL;DR: Multi-view graph-tuple framework partitions graphs into disjoint subgraphs to capture interactions at multiple scales, overcoming limitations of single-scale sparsification in GNNs for dense graphs.

Details

Motivation: Traditional GNNs scale with graph edges, making them inefficient for dense graphs like point clouds or molecular interactions. Common sparsification methods force an arbitrary single interaction scale and discard multi-scale information.

Method: Introduces a multi-view graph-tuple framework that partitions graphs into disjoint subgraphs capturing primary local interactions and weaker long-range connections. Uses heterogeneous message-passing architecture inspired by non-commuting operators theory.

Result: Formally proves the framework is strictly more expressive and guarantees lower oracle risk than single-graph message-passing models. Demonstrates better performance on molecular property prediction from Coulomb matrices and cosmological parameter inference from point clouds.

Conclusion: The multi-view graph-tuple approach effectively captures multi-scale interactions in dense graphs, outperforming single-graph baselines and offering a versatile solution for scientific applications with complex graph structures.

Abstract: Graph Neural Networks (GNNs) typically scale with the number of graph edges, making them well suited for sparse graphs but less efficient on dense graphs, such as point clouds or molecular interactions. A common remedy is to sparsify the graph via similarity thresholding or distance pruning, but this forces an arbitrary choice of a single interaction scale and discards crucial information from other scales. To overcome this limitation, we introduce a multi-view graph-tuple framework. Instead of a single graph, our graph-tuple framework partitions the graph into disjoint subgraphs, capturing primary local interactions and weaker, long-range connections. We then learn multi-view representations from the graph-tuple via a heterogeneous message-passing architecture inspired by the theory of non-commuting operators, which we formally prove is strictly more expressive and guarantees a lower oracle risk compared to single-graph message-passing models. We instantiate our framework on two scientific domains: molecular property prediction from feature-scarce Coulomb matrices and cosmological parameter inference from geometric point clouds. On both applications, our multi-view graph-tuple models demonstrate better performance than single-graph baselines, highlighting the power and versatility of our multi-view approach.

[456] Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

Chenwei Tang, Jingyu Xing, Lin Long, Xinyu Liu, Deng Xiong, Wei Ju, Shudong Huang, Jiancheng Lv, Ziyue Qiao

Main category: cs.LG

TL;DR: COMPASS introduces a novel test-time RL approach for LLMs that learns from unlabeled data using composite self-scoring rewards without external supervision.

Details

Motivation: Current RL methods for LLMs face scalability bottlenecks due to heavy reliance on human-curated preference data or labeled datasets for reward modeling. The paper aims to overcome this limitation by enabling RL on unlabeled data where models learn autonomously from continuous experience streams.

Method: COMPASS (Composite Path and Answer Self-Scoring) integrates two components: 1) Dual-Calibration Answer Reward (DCAR) that establishes trustworthy pseudo-labels through confidence and credibility calibration, and 2) Decisive Path Reward (DPR) that directly optimizes reasoning process quality beyond outcome supervision.

Result: Extensive experiments show COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing scalable learning from continuous experience.

Conclusion: COMPASS provides a more scalable direction for LLMs to learn from continuous experience by systematically enhancing analytical capabilities through joint reinforcement of trustworthy consensus answers and decisive reasoning chains without external supervision.

Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes the reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, the COMPASS systematically enhances the model’s analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.

[457] Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

Jane H. Lee, Baturay Saglam, Spyridon Pougkakiotis, Amin Karbasi, Dionysis Kalogerias

Main category: cs.LG

TL;DR: A framework for risk-aware constrained reinforcement learning using optimized certainty equivalents to handle tail risks and catastrophic events in high-stakes applications.

Details

Motivation: Standard constrained RL focuses on expected accumulated rewards but neglects risky tail events, which is insufficient for high-stakes applications where outlier risks are critical.

Method: Proposes a risk-aware constrained RL framework using optimized certainty equivalents (OCEs) to achieve per-stage robustness in both reward values and time, with a strong Lagrangian duality framework that can wrap around standard RL solvers like PPO.

Result: The framework ensures exact equivalence to original constrained problems under appropriate constraint qualifications, yields a simple algorithmic recipe, and establishes convergence under common assumptions.

Conclusion: The approach provides effective risk-aware properties for constrained RL, verified through numerical experiments, making it suitable for high-stakes applications where tail risks matter.

Abstract: Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed though the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework ensures an exact equivalent to the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm under common assumptions, and verify the risk-aware properties of our approach through several numerical experiments.

[458] GAPO: Robust Advantage Estimation for Real-World Code LLMs

Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao

Main category: cs.LG

TL;DR: GAPO improves RL for code-editing LLMs by using adaptive outlier-free Q-estimation instead of group means, handling skewed reward distributions better than GRPO/DAPO.

Details

Motivation: Real-world code-editing scenarios often have skewed reward distributions with unpredictable outliers, which distort advantage computation in group-relative RL methods like GRPO, leading to increased noise and suboptimal performance.

Method: GAPO adaptively finds an outlier-free highest-density interval (HDI) per prompt, then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation, making it robust to skewed distributions while remaining plug-and-play and efficient.

Result: Validated on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO.

Conclusion: GAPO effectively addresses the problem of skewed reward distributions in code-editing RL, providing a robust, plug-and-play solution that outperforms existing group-relative methods while maintaining efficiency.

Abstract: Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.

[459] Conformational Rank Conditioned Committees for Machine Learning-Assisted Directed Evolution

Mia Adler, Carrie Liang, Brian Peng, Oleg Presnyakov, Justin M. Baker, Jannelle Lauffer, Himani Sharma, Barry Merriman

Main category: cs.LG

TL;DR: RCC-MLDE framework uses ranked conformations with separate neural network committees per rank to distinguish conformational from epistemic uncertainty, improving SARS-CoV-2 antibody docking performance.

Details

Motivation: Current MLDE approaches for antibody optimization often use single conformations or single committees across all conformations, which fails to separate conformational uncertainty (multiple protein structures) from epistemic uncertainty (model uncertainty). This limitation hinders accurate navigation of antibody fitness landscapes.

Method: Introduces a rank-conditioned committee (RCC) framework that uses ranked conformations and assigns a separate deep neural network committee to each rank. This design enables principled separation between epistemic uncertainty (model uncertainty) and conformational uncertainty (structural variability).

Result: Validated on SARS-CoV-2 antibody docking, demonstrating significant improvements over baseline MLDE strategies. The approach shows enhanced performance in therapeutic antibody discovery applications.

Conclusion: RCC-MLDE provides a scalable route for therapeutic antibody discovery while directly addressing the challenge of modeling conformational uncertainty, offering a more robust framework for navigating antibody fitness landscapes.

Abstract: Machine Learning-assisted directed evolution (MLDE) is a powerful tool for efficiently navigating antibody fitness landscapes. Many structure-aware MLDE pipelines rely on a single conformation or a single committee across all conformations, limiting their ability to separate conformational uncertainty from epistemic uncertainty. Here, we introduce a rank -conditioned committee (RCC) framework that leverages ranked conformations to assign a deep neural network committee per rank. This design enables a principled separation between epistemic uncertainty and conformational uncertainty. We validate our RCC-MLDE approach on SARS-CoV-2 antibody docking, demonstrating significant improvements over baseline strategies. Our results offer a scalable route for therapeutic antibody discovery while directly addressing the challenge of modeling conformational uncertainty.

[460] TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu

Main category: cs.LG

TL;DR: TabTune is a unified library that standardizes the workflow for tabular foundation models, addressing fragmentation in preprocessing, APIs, fine-tuning, and evaluation.

Details

Motivation: Tabular foundation models face adoption barriers due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and lack of standardized evaluation for deployment metrics like calibration and fairness.

Method: TabTune provides a single interface for seven state-of-the-art models with multiple adaptation strategies (zero-shot, meta-learning, supervised fine-tuning, parameter-efficient fine-tuning), automates model-aware preprocessing, manages architectural heterogeneity, and integrates evaluation modules.

Result: The framework enables consistent benchmarking of adaptation strategies for tabular foundation models, designed for extensibility and reproducibility.

Conclusion: TabTune addresses the fragmentation in tabular foundation model workflows and facilitates standardized evaluation and deployment of these models.

Abstract: Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models.

[461] MSAD: A Deep Dive into Model Selection for Time series Anomaly Detection

Emmanouil Sylligardos, John Paparrizos, Themis Palpanas, Pierre Senellart, Paul Boniol

Main category: cs.LG

TL;DR: Time series classification methods can effectively select the best anomaly detection algorithms for diverse datasets, outperforming any single anomaly detection method while maintaining similar execution time.

Details

Motivation: No single anomaly detection method works best across heterogeneous time series datasets, and existing AutoML solutions don't work for time series anomaly detection. There's a need for model selection methods that can choose the best anomaly detection algorithm based on time series characteristics.

Method: Evaluated 234 model configurations from 16 base classifiers across 1980+ time series, conducting the first extensive experimental evaluation of time series classification as model selection for anomaly detection.

Result: Model selection methods outperform every single anomaly detection method while maintaining similar execution time. Time series classification algorithms prove accurate and efficient for anomaly detection model selection.

Conclusion: Time series classification provides an effective solution for model selection in anomaly detection, establishing a strong baseline for AutoML pipelines and addressing the challenge of heterogeneous time series datasets.

Abstract: Anomaly detection is a fundamental task for time series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmarks and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection methods to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. In total, we evaluate 234 model configurations derived from 16 base classifiers across more than 1980 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection, and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines. Preprint version of an article accepted at the VLDB Journal.

[462] Fast 3D Surrogate Modeling for Data Center Thermal Management

Soumyendu Sarkar, Antonio Guillen-Perez, Zachariah J Carmichael, Avisek Naug, Refik Mert Cam, Vineet Gundecha, Ashwin Ramesh Babu, Sahand Ghorbanpour, Ricardo Luna Gutierrez

Main category: cs.LG

TL;DR: Vision-based surrogate models for real-time 3D temperature prediction in data centers achieve 20,000x speedup over traditional CFD, enabling energy savings of 7% through better cooling control.

Details

Motivation: Data centers consume massive energy and produce carbon emissions. Real-time temperature prediction is crucial for sustainability and efficiency, but traditional CFD methods are too slow and complex for practical use.

Method: Developed vision-based surrogate modeling framework using 3D voxelized data center representations. Evaluated multiple architectures: 3D CNN U-Net variants, 3D Fourier Neural Operator, and 3D vision transformers to map thermal inputs (server workloads, fan speeds, HVAC settings) to temperature fields.

Result: Models generalize across data center configurations and achieve 20,000x speedup (from hours to milliseconds). Enable real-time hot spot detection and temperature distribution estimation, leading to 7% energy savings and reduced carbon footprint.

Conclusion: Vision-based surrogate models provide fast, accurate 3D temperature prediction for data centers, enabling real-time cooling optimization and workload redistribution to significantly reduce energy consumption and carbon emissions.

Abstract: Reducing energy consumption and carbon emissions in data centers by enabling real-time temperature prediction is critical for sustainability and operational efficiency. Achieving this requires accurate modeling of the 3D temperature field to capture airflow dynamics and thermal interactions under varying operating conditions. Traditional thermal CFD solvers, while accurate, are computationally expensive and require expert-crafted meshes and boundary conditions, making them impractical for real-time use. To address these limitations, we develop a vision-based surrogate modeling framework that operates directly on a 3D voxelized representation of the data center, incorporating server workloads, fan speeds, and HVAC temperature set points. We evaluate multiple architectures, including 3D CNN U-Net variants, a 3D Fourier Neural Operator, and 3D vision transformers, to map these thermal inputs to high-fidelity heat maps. Our results show that the surrogate models generalize across data center configurations and significantly speed up computations (20,000x), from hundreds of milliseconds to hours. This fast and accurate estimation of hot spots and temperature distribution enables real-time cooling control and workload redistribution, leading to substantial energy savings (7%) and reduced carbon footprint.

[463] GLOBE: Accurate and Generalizable PDE Surrogates using Domain-Inspired Architectures and Equivariances

Peter Sharpe

Main category: cs.LG

TL;DR: GLOBE is a novel neural surrogate for homogeneous PDEs that combines boundary-element method principles with equivariant ML, achieving dramatic accuracy improvements on aerodynamic simulations while being compact and flexible.

Details

Motivation: To create more accurate, generalizable, and practical ML-based PDE surrogates for industrial computer-aided engineering by incorporating rigorous physics- and domain-inspired inductive biases that respect PDE information flow.

Method: GLOBE represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, using multiscale branches and communication hyperlayers. The architecture is translation-, rotation-, and parity-equivariant; discretization-invariant; and units-invariant via nondimensionalization. Features include explicit far-field decay envelope, boundary-to-boundary hyperlayer communication, and all-to-all boundary-to-target evaluation for global receptive field.

Result: On AirFRANS dataset: 200x lower MSE than reference baselines, 50x lower than next-best model on “Full” split; 100x lower error on velocity/pressure and 600x lower on surface pressure than Transolver on “Scarce” split. Model is compact (117k parameters), supports arbitrary point evaluation, and works with non-watertight meshes.

Conclusion: Rigorous physics- and domain-inspired inductive biases enable substantial gains in accuracy, generalizability, and practicality for ML-based PDE surrogates in industrial CAE applications.

Abstract: We introduce GLOBE, a new neural surrogate for homogeneous PDEs that draws inductive bias from boundary-element methods and equivariant ML. GLOBE represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, composed across multiscale branches and communication hyperlayers. The architecture is translation-, rotation-, and parity-equivariant; discretization-invariant in the fine-mesh limit; and units-invariant via rigorous nondimensionalization. An explicit far-field decay envelope stabilizes extrapolation, boundary-to-boundary hyperlayer communication mediates long-range coupling, and the all-to-all boundary-to-target evaluation yields a global receptive field that respects PDE information flow, even for elliptic PDEs. On AirFRANS (steady incompressible RANS over NACA airfoils), GLOBE achieves substantial accuracy improvements. On the “Full” split, it reduces mean-squared error by roughly 200x on all fields relative to the dataset’s reference baselines, and roughly 50x relative to the next-best-performing model. In the “Scarce” split, it achieves over 100x lower error on velocity and pressure fields and over 600x lower error on surface pressure than Transolver. Qualitative results show sharp near-wall gradients, coherent wakes, and limited errors under modest extrapolation in Reynolds number and angle of attack. In addition to this accuracy, the model is quite compact (117k parameters), and fields can be evaluated at arbitrary points during inference. We also demonstrate the ability to train and predict with non-watertight meshes, which has strong practical implications. These results show that rigorous physics- and domain-inspired inductive biases can achieve large gains in accuracy, generalizability, and practicality for ML-based PDE surrogates for industrial computer-aided engineering (CAE).

[464] PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

Siqi Liang, Yudi Zhang, Yue Guo

Main category: cs.LG

TL;DR: A persona-based AI agent framework using Knowledge-Graph-enhanced RAG to personalize LLM responses by combining user history summaries with global interaction patterns.

Details

Motivation: Need for personalized AI agents that adapt to individual user preferences and maintain consistent persona-aligned behaviors.

Method: Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) that constructs LLM-derived graph index of documents, summarizes communities, and generates personalized prompts combining user history and global interaction patterns.

Result: On LaMP benchmark: 11.1% F1 improvement for news categorization, 56.1% F1 improvement for movie tagging, and 10.4% reduction in product rating MAE over prior methods.

Conclusion: The framework enables effective persona-based AI agents that maintain personalized behaviors while benefiting from collective knowledge through graph-enhanced RAG.

Abstract: We propose a novel framework for persona-based language model system, motivated by the need for personalized AI agents that adapt to individual user preferences. In our approach, the agent embodies the user’s “persona” (e.g. user profile or taste) and is powered by a large language model (LLM). To enable the agent to leverage rich contextual information, we introduce a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that constructs an LLM-derived graph index of relevant documents and summarizes communities of related information. Our framework generates personalized prompts by combining: (1) a summary of the user’s historical behaviors and preferences extracted from the knowledge graph, and (2) relevant global interaction patterns identified through graph-based community detection. This dynamic prompt engineering approach allows the agent to maintain consistent persona-aligned behaviors while benefiting from collective knowledge. On the LaMP benchmark, our method improves news categorization F1 by 11.1%, movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4% over prior methods. Our code is available at https://anonymous.4open.science/r/PersonaAgentwGraphRAG-DE6F

[465] Non-stationary and Varying-discounting Markov Decision Processes for Reinforcement Learning

Zhizuo Chen, Theodore T. Allen

Main category: cs.LG

TL;DR: NVMDP framework extends MDPs to handle non-stationary environments and varying discount rates, enabling policy shaping without changing state/action spaces or rewards.

Details

Motivation: Standard MDP algorithms struggle with non-stationary environments, and infinite-horizon formulations don't directly apply to finite-horizon tasks. There's a need for a more flexible framework that can handle time-varying dynamics and discount rates.

Method: Introduces Non-stationary and Varying-discounting MDP (NVMDP) framework with theoretical foundations including value formulations, optimality conditions, and policy improvement. Adapts dynamic programming and Q-learning algorithms to NVMDPs, and extends Policy Gradient Theorem and TRPO for function approximation.

Result: NVMDP-based algorithms successfully recover optimal trajectories in non-stationary gridworld environments under multiple reward/discount schemes, while original Q-learning fails. The framework subsumes both infinite-horizon stationary MDPs and finite-horizon MDPs as special cases.

Conclusion: NVMDPs provide a theoretically sound and practically effective framework for RL that handles non-stationarity and enables policy shaping with minimal algorithmic modifications, offering robust performance where traditional methods fail.

Abstract: Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and transitions. Infinite-horizon, stationary MDPs emerge as special cases of NVMDPs for identifying an optimal policy, and finite-horizon MDPs are also subsumed within the NVMDP formulations. Moreover, NVMDPs provide a flexible mechanism to shape optimal policies, without altering the state space, action space, or the reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation, optimality conditions, and policy improvement under finite state and action spaces. Building on these results, we adapt dynamic programming and generalized Q-learning algorithms to NVMDPs, along with formal convergence proofs. For problems requiring function approximation, we extend the Policy Gradient Theorem and the policy improvement bound in Trust Region Policy Optimization (TRPO), offering proofs in both scalar and matrix forms. Empirical evaluations in a non-stationary gridworld environment demonstrate that NVMDP-based algorithms successfully recover optimal trajectories under multiple reward and discounting schemes, whereas original Q-learning fails. These results collectively show that NVMDPs provide a theoretically sound and practically effective framework for reinforcement learning, requiring only minor algorithmic modifications while enabling robust handling of non-stationarity and explicit optimal policy shaping.

[466] AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu

Main category: cs.LG

TL;DR: AVA-VLA introduces Active Visual Attention to dynamically modulate visual processing in VLA models by leveraging recurrent states as belief state approximations, achieving SOTA performance on robotic benchmarks.

Details

Motivation: Existing VLA models process visual inputs independently at each timestep using a Markov Decision Process (MDP) approach, which is history-agnostic and suboptimal for sequential decision-making. This fails to leverage historical context for effective visual token processing.

Method: Reformulates the problem from a Partially Observable Markov Decision Process (POMDP) perspective. Introduces Active Visual Attention (AVA) module that uses recurrent states (neural approximations of belief states) to compute soft weights for dynamically processing task-relevant visual tokens based on historical context.

Result: Achieves state-of-the-art performance across popular robotic benchmarks including LIBERO and CALVIN. Real-world deployments on a dual-arm robot platform validate practical applicability and robust sim-to-real transferability.

Conclusion: The POMDP-inspired AVA-VLA framework with Active Visual Attention effectively addresses the history-agnostic limitation of existing VLA models, enabling more effective visual token processing for sequential decision-making in embodied AI tasks.

Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent’s belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework’s practical applicability and robust sim-to-real transferability.

[467] Learning Massively Multitask World Models for Continuous Control

Nicklas Hansen, Hao Su, Xiaolong Wang

Main category: cs.LG

TL;DR: Newt is a language-conditioned multitask world model trained on 200 diverse tasks with online RL, showing better performance and data-efficiency than baselines.

Details

Motivation: Current RL research focuses too much on single-task or offline regimes, creating a perception that online RL doesn't scale. The authors want to explore whether a single agent can be trained on hundreds of tasks with online interaction, inspired by the foundation model approach of large-scale pretraining followed by light RL.

Method: Introduces a new benchmark with 200 diverse tasks across domains and embodiments, each with language instructions and demonstrations. Presents Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to learn task-aware representations and action priors, then jointly optimized with online interaction across all tasks.

Result: Newt achieves better multitask performance and data-efficiency than strong baselines, exhibits strong open-loop control capabilities, and enables rapid adaptation to unseen tasks.

Conclusion: The work demonstrates that online RL can scale to hundreds of diverse tasks when combined with appropriate pretraining and multitask learning approaches, challenging the prevailing view that online RL doesn’t scale. The released benchmark and models provide resources for further research in this direction.

Abstract: General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL) we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present \emph{Newt}, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.

[468] BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

Main category: cs.LG

TL;DR: BioArc is a framework that uses Neural Architecture Search to discover optimal architectures for biological foundation models, moving beyond intuition-driven design to principled, automated discovery across biological modalities.

Details

Motivation: Existing biological foundation models directly adopt architectures from general AI domains without considering unique biological data properties, leading to suboptimal performance in capturing long-range dependencies, sparse information, and complex biological "grammars."

Method: BioArc uses Neural Architecture Search (NAS) to systematically explore architecture design space across multiple biological modalities, analyzing interplay between architecture, tokenization, and training strategies to discover novel architectures and distill design principles.

Result: The large-scale analysis identifies novel high-performance architectures, distills empirical design principles for future model development, and proposes effective architecture prediction methods for new biological tasks.

Conclusion: BioArc provides a foundational resource and principled methodology to guide creation of next-generation task-specific and foundation models for biology, moving beyond intuition-driven architecture design.

Abstract: Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying ``grammars’’ inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.

[469] DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning

Bo Han, Zhuoming Li, Xiaoyu Wang, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia

Main category: cs.LG

TL;DR: DiCaP introduces a correctness-aware pseudo-labeling framework for semi-supervised multi-label learning that weights pseudo-labels based on their correctness likelihood and uses dual-thresholding to handle confident vs ambiguous samples.

Details

Motivation: Existing SSMLL methods assign equal weights to all pseudo-labels regardless of quality, which amplifies the impact of noisy predictions and degrades performance. The authors identify that optimal pseudo-label weighting should reflect correctness likelihood.

Method: Distribution-Calibrated Pseudo-labeling (DiCaP) estimates posterior precision to calibrate pseudo-label weights based on correctness likelihood. It uses a dual-thresholding mechanism: confident samples get weighted pseudo-labels, while ambiguous samples are explored via unsupervised contrastive learning.

Result: Experiments on multiple benchmark datasets show consistent improvements, surpassing state-of-the-art methods by up to 4.27%.

Conclusion: The proposed correctness-aware weighting approach and dual-thresholding mechanism effectively improve semi-supervised multi-label learning performance by better handling pseudo-label quality and uncertainty.

Abstract: Semi-supervised multi-label learning (SSMLL) aims to address the challenge of limited labeled data in multi-label learning (MLL) by leveraging unlabeled data to improve the model’s performance. While pseudo-labeling has become a dominant strategy in SSMLL, most existing methods assign equal weights to all pseudo-labels regardless of their quality, which can amplify the impact of noisy or uncertain predictions and degrade the overall performance. In this paper, we theoretically verify that the optimal weight for a pseudo-label should reflect its correctness likelihood. Empirically, we observe that on the same dataset, the correctness likelihood distribution of unlabeled data remains stable, even as the number of labeled training samples varies. Building on this insight, we propose Distribution-Calibrated Pseudo-labeling (DiCaP), a correctness-aware framework that estimates posterior precision to calibrate pseudo-label weights. We further introduce a dual-thresholding mechanism to separate confident and ambiguous regions: confident samples are pseudo-labeled and weighted accordingly, while ambiguous ones are explored by unsupervised contrastive learning. Experiments conducted on multiple benchmark datasets verify that our method achieves consistent improvements, surpassing state-of-the-art methods by up to 4.27%.

[470] Limitations of Using Identical Distributions for Training and Testing When Learning Boolean Functions

Jordi Pérez-Guijarro

Main category: cs.LG

TL;DR: The paper shows that when learners can be optimally adapted to training distributions, matching training and test distributions is not always optimal, contrary to conventional wisdom, assuming one-way functions exist.

Details

Motivation: The paper investigates whether training on a distribution different from the test distribution can improve generalization when learners are optimally adapted to their training distributions, addressing limitations of prior work that didn't account for optimal learner selection.

Method: The authors analyze the problem in full generality, examining scenarios where learners can be optimally adapted to their training distributions, and use theoretical analysis assuming the existence of one-way functions to prove their results.

Result: Surprisingly, the paper finds that matching training and test distributions is not always optimal when learners are optimally adapted to training distributions, assuming one-way functions exist. However, with certain regularity conditions on target functions, the standard conclusion (matching is best) holds for uniform distributions.

Conclusion: The conventional wisdom that training and test distributions should match is not universally true when learners can be optimally adapted; distribution mismatch can sometimes improve generalization, though regularity conditions can restore the standard conclusion for uniform distributions.

Abstract: When the distributions of the training and test data do not coincide, the problem of understanding generalization becomes considerably more complex, prompting a variety of questions. Prior work has shown that, for some fixed learning methods, there are scenarios where training on a distribution different from the test distribution improves generalization. However, these results do not account for the possibility of choosing, for each training distribution, the optimal learning algorithm, leaving open whether the observed benefits stem from the mismatch itself or from suboptimality of the learner. In this work, we address this question in full generality. That is, we study whether it is always optimal for the training distribution to be identical to the test distribution when the learner is allowed to be optimally adapted to the training distribution. Surprisingly, assuming the existence of one-way functions, we find that the answer is no. That is, matching distributions is not always the best scenario. Nonetheless, we also show that when certain regularities are imposed on the target functions, the standard conclusion is recovered in the case of the uniform distribution.

[471] Mean-Field Limits for Two-Layer Neural Networks Trained with Consensus-Based Optimization

William De Deyn, Michael Herty, Giovanni Samaey

Main category: cs.LG

TL;DR: CBO for neural network training: hybrid CBO+Adam converges faster, reformulated CBO reduces memory in multi-task learning, mean-field analysis shows monotonic variance decrease.

Details

Motivation: To develop more efficient optimization methods for neural network training by leveraging Consensus-Based Optimization (CBO) and addressing its limitations through hybrid approaches and theoretical analysis.

Method: 1) Compare CBO vs Adam on neural networks, 2) Develop hybrid CBO+Adam approach, 3) Recast CBO for multi-task learning with reduced memory, 4) Reformulate CBO in optimal transport framework, 5) Analyze mean-field limit dynamics on Wasserstein-over-Wasserstein space.

Result: 1) Hybrid CBO+Adam converges faster than CBO alone, 2) Reformulated CBO reduces memory overhead in multi-task learning, 3) Theoretical analysis shows monotonic variance decrease in mean-field limit.

Conclusion: CBO can be effectively combined with gradient-based methods like Adam for faster convergence, and its theoretical properties can be rigorously analyzed through optimal transport and mean-field formulations, making it a promising approach for neural network optimization.

Abstract: We study Consensus-Based Optimization (CBO) for two-layer neural network training. We compare the performance of CBO against Adam on two test cases and demonstrate how a hybrid approach, combining CBO with Adam, provides faster convergence than CBO. Additionally, in the context of multi-task learning, we recast CBO into a formulation that offers less memory overhead. The CBO method allows for a mean-field limit formulation, which we couple with the mean-field limit of the neural network. To this end, we first reformulate CBO within the optimal transport framework. In the limit of infinitely many particles, we define the corresponding dynamics on the Wasserstein-over-Wasserstein space and show that the variance decreases monotonically.

[472] HBLLM: A Haar-Based Approach for Accurate Structured 1-Bit Quantized LLMs

Ningning Chen, Weicai Ye, Ying Jiang

Main category: cs.LG

TL;DR: HBLLM is a wavelet-enhanced 1-bit post-training quantization method for LLMs that uses Haar wavelet transforms and structure-aware grouping to achieve high-fidelity quantization with minimal storage overhead.

Details

Motivation: To develop an efficient 1-bit quantization method for LLMs that maintains high fidelity while minimizing storage requirements, addressing the challenge of model compression for large language models.

Method: Uses Haar wavelet transforms for frequency decomposition, frequency-aware multi-parameter intra-row grouping, ℓ₂-norm-based saliency-driven column selection, and shared mean quantization for non-salient weights across frequency bands.

Result: Achieves state-of-the-art 1-bit quantization performance with perplexity of 6.71 on LLaMA2-13B using only 1.08 bits average weight storage, outperforming existing methods.

Conclusion: HBLLM demonstrates that wavelet-enhanced quantization with structure-aware grouping can achieve high-fidelity 1-bit compression for LLMs with minimal storage overhead.

Abstract: We introduce HBLLM, a wavelet-enhanced high-fidelity $1$-bit post-training quantization method for Large Language Models (LLMs). By leveraging Haar wavelet transforms to enhance expressive capacity through frequency decomposition, HBLLM significantly improves quantization fidelity while maintaining minimal overhead. This approach features two innovative structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping and (2) $\ell_2$-norm-based saliency-driven column selection. For non-salient weights, a shared mean is employed across quantization groups within each frequency band to optimize storage efficiency. Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in $1$-bit quantization, attaining a perplexity of $6.71$ on LLaMA$2$-$13$B with an average weight storage of only $1.08$ bits. Code available at: https://github.com/Yeyke/HBLLM.

[473] Using physics-inspired Singular Learning Theory to understand grokking & other phase transitions in modern neural networks

Anish Lakkapragada

Main category: cs.LG

TL;DR: Empirical study of Singular Learning Theory (SLT) in toy neural network models, testing free energy scaling and local learning coefficients across controlled network families to understand phase transitions and theory-practice gaps.

Details

Motivation: Classical statistical inference fails to explain modern neural network success due to non-identifiability (singularity). Singular Learning Theory (SLT) offers a physics-inspired framework to bridge this theory-practice gap, but needs empirical validation.

Method: Two-part empirical study: 1) Test Arrhenius-style rate hypothesis for SLT free energy using grokking modulo-arithmetic model and Anthropic’s Toy Models of Superposition. 2) Measure local learning coefficient scaling with problem difficulty across controlled network families (polynomial regressors, low-rank linear networks, low-rank autoencoders).

Result: Experiments recover known scaling laws while others show meaningful deviations from theoretical expectations. Demonstrates SLT’s utility for understanding neural network phase transitions.

Conclusion: SLT has merits for understanding neural network phase transitions and theory-practice gaps. The paper illustrates empirical validation of SLT concepts while posing open research questions for the field.

Abstract: Classical statistical inference and learning theory often fail to explain the success of modern neural networks. A key reason is that these models are non-identifiable (singular), violating core assumptions behind PAC bounds and asymptotic normality. Singular learning theory (SLT), a physics-inspired framework grounded in algebraic geometry, has gained popularity for its ability to close this theory-practice gap. In this paper, we empirically study SLT in toy settings relevant to interpretability and phase transitions. First, we understand the SLT free energy $\mathcal{F}_n$ by testing an Arrhenius-style rate hypothesis using both a grokking modulo-arithmetic model and Anthropic’s Toy Models of Superposition. Second, we understand the local learning coefficient $λ_α$ by measuring how it scales with problem difficulty across several controlled network families (polynomial regressors, low-rank linear networks, and low-rank autoencoders). Our experiments recover known scaling laws while others yield meaningful deviations from theoretical expectations. Overall, our paper illustrates the many merits of SLT for understanding neural network phase transitions, and poses open research questions for the field.

[474] Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment

Libo Wang

Main category: cs.LG

TL;DR: Researchers developed Sigma, a VLA model for humanoid robots that creates a time-updable thought space between semantics and continuous control, achieving telepathic communication through deep semantic understanding without retraining the base model.

Details

Motivation: To address the gap in humanoid robot cognitive systems regarding the lack of a time-updable mediating thought space between semantics and continuous control, enabling better semantic alignment and intention-driven behavior.

Method: Constructed Sigma VLA model using pi05_base as foundation, preprocessed svla_so101_pickplace dataset, designed custom architecture combining deep semantic understanding and association for telepathic communication, optimized through data preprocessing, LoRA fine-tuning, and inference-stage adapter improvements.

Result: Sigma showed stable decrease in control MSE across vector, fragment, and entire trajectory timescales while maintaining telepathy norm and semantic-text alignment quality unchanged, demonstrating effective mind-responsive alignment control.

Conclusion: Mind-responsive alignment control can be quantified through an architecture combining deep semantic understanding and association without retraining the base model, providing reproducible experience for semantic alignment and intention-driven behavior in humanoid robots.

Abstract: To address the gap in humanoid robot cognitive systems regarding the lack of a time-updable mediating thought space between semantics and continuous control, this study constructs and trains a VLA model named “Sigma” that runs on a single RTX 4090. It uses the open-source pi05_base model as a foundation and preprocesses svla_so101_pickplace into a training dataset. The researcher independently designed an architecture for a vision-language-action model that combines deep semantic understanding and association to achieve telepathic communication. The training process involved repeated optimizations of data preprocessing, LoRA fine-tuning, and the inference-stage adapter. The experiment employed offline closed-loop replay, comparing Sigma with the untuned pure pi05_base model under data conditions. Results showed that Sigma exhibited a stable decrease in control MSE across vector, fragment, and entire trajectory timescales, while maintaining the telepathy norm and semantic-text alignment quality unchanged. It demonstrates that mind-responsive alignment control is quantified through an architecture that combines deep understanding of semantics and association without retraining the base model, which provides reproducible experience for semantic alignment and intention-driven behavior in humanoid robots.

[475] Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models

Shashank Landge, Abhishek Patil, Tejas kamble, Bhushan Buddhivant, Priyanka Joshi

Main category: cs.LG

TL;DR: SVD-based data-free quantization method identifies important weights via principal components, outperforming calibration-dependent methods like AWQ and SpQR on privacy-sensitive tasks.

Details

Motivation: Current PTQ methods require calibration data to identify critical weights, making them unsuitable for privacy-sensitive scenarios where data is unavailable or cannot be shared.

Method: Proposes a data-free approach using SVD to identify principal component weights, preserving top-k weights in FP32 while aggressively quantizing residual weights without needing forward passes or calibration data.

Result: SVD-based method achieves 66.06% accuracy on RTE task, outperforming AWQ and SpQR (both 65.34%), showing structural importance correlates with functional importance.

Conclusion: Intrinsic matrix structure can serve as a robust proxy for weight saliency, enabling effective quantization without calibration data, particularly valuable for privacy-sensitive deployments.

Abstract: As Large Language Models (LLMs) continue to scale in parameter count, deploying them on commodity hardware has become increasingly challenging. Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. However, uniform quantization often leads to significant performance degradation due to the presence of ``outlier features’’ – weights that, while few in number, are critical for maintaining model accuracy. Current state-of-the-art methods such as AWQ (Activation-aware Weight Quantization) and SpQR (Sparse Quantization Representations) rely on calibration data to identify these salient weights via activation magnitudes or Hessian sensitivity. In scenarios where data privacy is paramount or calibration data is unavailable, these methods are inapplicable. In this work, we propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model’s downstream performance. We introduce a novel selection heuristic that preserves the top-$k$ weights aligned with the principal components in FP32, while aggressively quantizing the residual weights. We compare our method against activation-aware (AWQ) and second-order (SpQR) methods across GLUE benchmarks (MRPC, RTE, QNLI) using a DistilBERT backbone. Our experiments reveal that structural importance is highly correlated with functional importance. On the challenging RTE task, our SVD-based method achieves an accuracy of 66.06%, outperforming both AWQ (65.34%) and SpQR (65.34%) at high protection budgets, validating that intrinsic matrix structure can serve as a robust proxy for weight saliency without the need for forward passes or calibration data.

[476] Morphling: Fast, Fused, and Flexible GNN Training at Scale

Anubhab, Rupesh Nasre

Main category: cs.LG

TL;DR: Morphling is a domain-specific code synthesizer that compiles high-level GNN specifications into optimized implementations for CPUs and GPUs, achieving 20X speedups over existing frameworks.

Details

Motivation: Current GNN frameworks like PyG and DGL fail to address the divergent execution characteristics of GNNs - irregular graph traversals fused with regular dense matrix operations. They rely on generic kernels with poor cache locality, excessive memory movement, and substantial intermediate allocations.

Method: Morphling compiles high-level GNN specifications into portable implementations targeting OpenMP, CUDA, and MPI. It uses a library of optimized, architecture-aware primitives and incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths based on input feature statistics.

Result: Morphling improves per-epoch training throughput by average 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Memory-efficient layouts reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware.

Conclusion: Specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.

Abstract: Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. The results show that Morphling improves per-epoch training throughput by an average of 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Morphling’s memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.

[477] The Active and Noise-Tolerant Strategic Perceptron

Maria-Florina Balcan, Hedyeh Beyhaghi

Main category: cs.LG

TL;DR: Active learning algorithm for classifying strategic agents that achieves exponential improvement in label complexity over passive learning, even with strategic manipulation.

Details

Motivation: Active learning is efficient for costly labeling tasks (hiring, admissions, loans), but strategic agents manipulate features for favorable outcomes, creating challenges beyond clean data learning. Need algorithms that work in strategic environments while preserving active learning efficiency.

Method: Modified version of Active Perceptron algorithm for learning linear separators in strategic setting. Works with data drawn uniformly from unit sphere, handles nonrealizable case where some labels are inconsistent with optimal classifier.

Result: Achieves excess error ε using only Õ(d ln 1/ε) label queries and incurs at most Õ(d ln 1/ε) additional mistakes relative to optimal classifier. Substantially fewer label queries than prior strategic Perceptron work, computationally efficient.

Conclusion: First active learning algorithm for strategic classification that preserves exponential improvement in label complexity over passive learning, successfully addressing both strategic manipulation and active learning efficiency simultaneously.

Abstract: We initiate the study of active learning algorithms for classifying strategic agents. Active learning is a well-established framework in machine learning in which the learner selectively queries labels, often achieving substantially higher accuracy and efficiency than classical supervised methods-especially in settings where labeling is costly or time-consuming, such as hiring, admissions, and loan decisions. Strategic classification, however, addresses scenarios where agents modify their features to obtain more favorable outcomes, resulting in observed data that is not truthful. Such manipulation introduces challenges beyond those in learning from clean data. Our goal is to design active and noise-tolerant algorithms that remain effective in strategic environments-algorithms that classify strategic agents accurately while issuing as few label requests as possible. The central difficulty is to simultaneously account for strategic manipulation and preserve the efficiency gains of active learning. Our main result is an algorithm for actively learning linear separators in the strategic setting that preserves the exponential improvement in label complexity over passive learning previously obtained only in the non-strategic case. Specifically, for data drawn uniformly from the unit sphere, we show that a modified version of the Active Perceptron algorithm [DKM05,YZ17] achieves excess error $ε$ using only $\tilde{O}(d \ln \frac{1}ε)$ label queries and incurs at most $\tilde{O}(d \ln \frac{1}ε)$ additional mistakes relative to the optimal classifier, even in the nonrealizable case, when a $\tildeΩ(ε)$ fraction of inputs have inconsistent labels with the optimal classifier. The algorithm is computationally efficient and, under these distributional assumptions, requires substantially fewer label queries than prior work on strategic Perceptron [ABBN21].

[478] Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion

Ahmed Nebli

Main category: cs.LG

TL;DR: ThermoLion is a vision-centric optimizer that dynamically modulates gradient update bitrate using local SNR gating and momentum alignment, outperforming AdamW and Lion on 12 vision datasets.

Details

Motivation: Current optimizers impose static compromises: magnitude-based methods (AdamW) assume high-fidelity gradient norms but suffer noise amplification in rugged landscapes, while sign-based methods (Lion) use radical 1-bit quantization that discards fine-grained descent information. Optimal convergence requires dynamic modulation of update bitrate rather than static priors.

Method: ThermoLion uses local Signal-to-Noise Ratio (SNR) gating to autonomously transition parameters between “low-bit” exploration and “high-precision” exploitation phases. It also introduces Momentum Alignment mechanism that detects constructive interference between historical drift and instantaneous gradients to accelerate convergence during stable trajectories.

Result: Empirical benchmarks across 12 diverse vision datasets (including CIFAR, SVHN, and GTSRB) demonstrate that ThermoLion surpasses state-of-the-art optimizers (AdamW and Lion) in both convergence speed and terminal accuracy.

Conclusion: Dynamic modulation of gradient update bitrate through SNR gating and momentum alignment provides superior optimization for deep vision models compared to static approaches, enabling better adaptation to the stochastic noise characteristics of vision training landscapes.

Abstract: The training of deep vision models is fundamentally a signal recovery problem amidst high-dimensional stochastic noise. Current optimization paradigms impose a static compromise on information channel capacity. For instance, magnitude-based methods, such as AdamW, operate on the assumption that gradient norms are high-fidelity curvature signals. While this allows for precision in smooth regimes, it leads to catastrophic noise amplification when applied to rugged, non-convex landscapes. Conversely, sign-based methods (e.g., Lion) perform a radical 1-bit quantization of the gradient, which aims to provide robust regularization at the cost of discarding fine-grained descent information. We propose that optimal convergence requires neither static prior, but rather a dynamic modulation of the update bitrate. We introduce ThermoLion, a vision-centric framework that utilizes local Signal-to-Noise Ratio (SNR) gating to autonomously transition parameters between a “low-bit” exploration phase and a “high-precision” exploitation phase. Furthermore, we introduce a Momentum Alignment mechanism that detects constructive interference between historical drift and instantaneous gradients to accelerate convergence during stable trajectories. Empirical benchmarks across 12 diverse vision datasets (including CIFAR, SVHN, and GTSRB) demonstrate that ThermoLion surpasses state-of-the-art optimizers, such as AdamW and Lion, in convergence speed and terminal accuracy.

[479] Forecasting in Offline Reinforcement Learning for Non-stationary Environments

Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop

Main category: cs.LG

TL;DR: FORL is a framework combining conditional diffusion models and zero-shot time-series foundation models to handle non-stationary environments in offline RL, improving performance when agents face unexpected time-varying offsets.

Details

Motivation: Real-world scenarios often have abrupt, time-varying offsets that create partial observability, causing agents to misperceive their true state. Existing offline RL methods assume stationarity or only consider synthetic perturbations, which fail in these non-stationary environments.

Method: FORL unifies: (1) conditional diffusion-based candidate state generation trained without assuming specific patterns of future non-stationarity, and (2) zero-shot time-series foundation models. It targets environments with unexpected, potentially non-Markovian offsets.

Result: Empirical evaluations on offline RL benchmarks augmented with real-world time-series data show FORL consistently improves performance compared to competitive baselines in non-stationary environments.

Conclusion: FORL bridges the gap between offline RL and real-world non-stationary environments by integrating zero-shot forecasting with agent experience, enabling robust performance from episode onset despite unexpected offsets.

Abstract: Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent’s experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.

cs.MA

[480] Orchestration Framework for Financial Agents: From Algorithmic Trading to Agentic Trading

Jifeng Li, Arnav Grover, Abraham Alpuerto, Yupeng Cao, Xiao-Yang Liu

Main category: cs.MA

TL;DR: An orchestration framework for financial agents that democratizes algorithmic trading by mapping traditional system components to specialized agents, achieving superior returns compared to benchmarks in both stock and cryptocurrency trading.

Details

Motivation: Financial markets are complex with temporal dynamics and low signal-to-noise ratios, making algorithmic trading systems traditionally difficult and expensive to develop. The paper aims to democratize financial intelligence by making sophisticated trading accessible to the general public through an agent-based framework.

Method: Proposes an orchestration framework that maps each component of traditional algorithmic trading systems to specialized agents: planner, orchestrator, alpha agents, risk agents, portfolio agents, backtest agents, execution agents, audit agents, and memory agent.

Result: For stock trading (hourly data, 04/2024-12/2024): achieved 20.42% return (vs S&P 500’s 15.97%), Sharpe ratio of 2.63, max drawdown of -3.59%. For BTC trading (minute data, 27/07/2025-13/08/2025): achieved 8.39% return (vs BTC’s 3.80% increase), Sharpe ratio of 0.38, max drawdown of -2.80%.

Conclusion: The agent orchestration framework successfully democratizes financial intelligence by providing an accessible, modular approach to algorithmic trading that outperforms market benchmarks in both traditional and cryptocurrency markets, with code made publicly available on GitHub.

Abstract: The financial market is a mission-critical playground for AI agents due to its temporal dynamics and low signal-to-noise ratio. Building an effective algorithmic trading system may require a professional team to develop and test over the years. In this paper, we propose an orchestration framework for financial agents, which aims to democratize financial intelligence to the general public. We map each component of the traditional algorithmic trading system to agents, including planner, orchestrator, alpha agents, risk agents, portfolio agents, backtest agents, execution agents, audit agents, and memory agent. We present two in-house trading examples. For the stock trading task (hourly data from 04/2024 to 12/2024), our approach achieved a return of $20.42%$, a Sharpe ratio of 2.63, and a maximum drawdown of $-3.59%$, while the S&P 500 index yielded a return of $15.97%$. For the BTC trading task (minute data from 27/07/2025 to 13/08/2025), our approach achieved a return of $8.39%$, a Sharpe ratio of $0.38$, and a maximum drawdown of $-2.80%$, whereas the BTC price increased by $3.80%$. Our code is available on \href{https://github.com/Open-Finance-Lab/AgenticTrading}{GitHub}.

[481] Decentralized Multi-Agent System with Trust-Aware Communication

Yepeng Ding, Ahmed Twabi, Junwei Yu, Lingfeng Zhang, Tohru Kondo, Hiroyuki Sato

Main category: cs.MA

TL;DR: A decentralized multi-agent system (DMAS) architecture using blockchain to address trust, scalability, and censorship issues in traditional centralized multi-agent systems.

Details

Motivation: Traditional centralized multi-agent systems have critical limitations: single points of failure, vulnerability to censorship, scalability issues, and trust problems. The emergence of LLMs accelerates autonomous agent development, creating need for more robust architectures.

Method: Propose a decentralized agent runtime with blockchain-based architecture. Formalize a trust-aware communication protocol using cryptographic primitives and on-chain operations to ensure security properties like verifiable interaction cycles, communication integrity, authenticity, non-repudiation, and conditional confidentiality.

Result: Comprehensive security analysis substantiates the security properties. Performance analysis validates DMAS as a scalable and efficient solution for building trustworthy multi-agent systems.

Conclusion: DMAS architecture successfully addresses fundamental problems of traditional centralized MAS by enabling trust-aware, scalable, and censorship-resistant interactions among autonomous agents, paving way for more robust Internet of Agents.

Abstract: The emergence of Large Language Models (LLMs) is rapidly accelerating the development of autonomous multi-agent systems (MAS), paving the way for the Internet of Agents. However, traditional centralized MAS architectures present significant challenges, including single points of failure, vulnerability to censorship, inherent scalability limitations, and critical trust issues. We propose a novel Decentralized Multi-Agent System (DMAS) architecture designed to overcome these fundamental problems by enabling trust-aware, scalable, and censorship-resistant interactions among autonomous agents. Our DMAS features a decentralized agent runtime underpinned by a blockchain-based architecture. We formalize a trust-aware communication protocol that leverages cryptographic primitives and on-chain operations to provide security properties: verifiable interaction cycles, communication integrity, authenticity, non-repudiation, and conditional confidentiality, which we further substantiate through a comprehensive security analysis. Our performance analysis validates the DMAS as a scalable and efficient solution for building trustworthy multi-agent systems.

[482] EZYer: A simulacrum of high school with generative agent

Jinming Yang, Zimu Ji, Weiqi Luo, Gaoxi Wang, Bin Ma, Yueling Deng

Main category: cs.MA

TL;DR: EZYer is a generative AI agent for educational content creation that automatically generates structured teaching materials, LaTeX Beamer courseware, and interactive academic notes through multi-role collaboration, with quality control systems ensuring academic rigor.

Details

Motivation: Existing educational tools suffer from incomplete services, insufficient performance, and weak interactivity in courseware generation, interactive notes, and content quality assurance, especially in the context of online education and LLM advancements.

Method: EZYer consists of three modules: 1) Teacher Module with text corpus retrieval and generation for structured teaching materials and LaTeX Beamer courseware; 2) Student Module with four collaborative roles (Teacher, Assistant, Top Student, Struggling Student) for note generation; 3) Controller with keyword filtering, content scoring, role co-validation, and dynamic correction systems for quality control.

Result: Evaluation using five-dimensional metrics (content accuracy, knowledge coverage, usability, formatting correctness, visual design) scored by five LLMs on 100 generated Beamer and Notes shows excellent quality and good application prospects.

Conclusion: EZYer demonstrates strong performance in generating high-quality educational content with academic rigor and pedagogical propriety, showing promising potential for practical applications in educational technology.

Abstract: With the rapid development of the online education and large language model, the existing educational tools still suffer from incomplete service, insufficient performance and weak interactivity in terms of courseware generation, interactive notes and quality assurance of content. In particular, the proposed generative agent EZYer : 1) Teacher Module: Integrating the Text Corpus retrieval and in-depth generation technologies, it automatically generates structured teaching materials and LaTeX Beamer courseware in line with the high school mathematics syllabus and supports user-defined image insertion. 2) Student Module: Throughout the collaborative interaction of the four roles of Teacher, Assistant, Top Student and Struggling Student, Note Taker summarizes and generates academic notes to enhance the depth and interest of learning. 3) Controller: set up keyword filtering system, content scoring system, role co-validation system, and dynamic content correction system. This ensure academic strictness and pedagogical propriety of EZYer inputs and outputs. In order to evaluate EZYer, this paper designs five-dimensional evaluation indexes of content accuracy, knowledge coverage, usability, formatting correctness and visual design and appeal, and scores 100 Beamer and Notes generated by EZYer by five large language models, separately, and the results show that the quality of EZYer-generated content is excellent and has a good application prospect.

[483] Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

Piercosma Bisconti, Marcello Galisai, Federico Pierucci, Marcantonio Bracale, Matteo Prandi

Main category: cs.MA

TL;DR: Current LLM safety mechanisms fail in multi-agent ecosystems where local compliance can lead to collective failure; need shift from model-level to system-level safety with new frameworks.

Details

Motivation: Existing safety mechanisms (prompts, fine-tuning, moderation) assume single-agent human-model interaction but don't scale to LLM-to-LLM ecosystems where outputs are recursively reused across agent chains, creating emergent systemic risks.

Method: Proposes conceptual transition to system-level safety with Emergent Systemic Risk Horizon (ESRH) framework to formalize interaction-based instability, plus taxonomy of failure modes and InstitutionalAI architecture for adaptive oversight.

Result: Identifies critical gap in current governance: individual model alignment insufficient for multi-agent systems where local compliance aggregates into collective failure through interaction dynamics.

Conclusion: Need paradigm shift from model-level to system-level safety governance, requiring new frameworks like ESRH and architectures like InstitutionalAI to manage emergent risks in LLM-to-LLM ecosystems.

Abstract: This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governance practices still rely on single-agent safety containment, prompts, fine-tuning, and moderation layers that constrain individual model behavior but leave the dynamics of multi-model interaction ungoverned. These mechanisms assume a dyadic setting: one model responding to one user under stable oversight. Yet research and industrial development are rapidly shifting toward LLM-to-LLM ecosystems, where outputs are recursively reused as inputs across chains of agents. In such systems, local compliance can aggregate into collective failure even when every model is individually aligned. We propose a conceptual transition from model-level safety to system-level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH) to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro, meso, and macro-level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi-agent systems.

[484] DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems

Junwei Yu, Yepeng Ding, Hiroyuki Sato

Main category: cs.MA

TL;DR: DynTaskMAS is a new framework for LLM-based multi-agent systems that uses dynamic task graphs to enable asynchronous parallel execution, improving efficiency and scalability.

Details

Motivation: Current LLM-based multi-agent systems face challenges in resource management, task coordination, and system efficiency, lacking sophisticated mechanisms for parallel execution and dynamic task management.

Method: Four key innovations: (1) Dynamic Task Graph Generator for intelligent task decomposition, (2) Asynchronous Parallel Execution Engine for optimized scheduling, (3) Semantic-Aware Context Management System for efficient information sharing, and (4) Adaptive Workflow Manager for dynamic performance optimization.

Result: Significant improvements: 21-33% reduction in execution time (higher gains for complex tasks), 35.4% improvement in resource utilization (65% to 88%), near-linear throughput scaling up to 16 concurrent agents (3.47X improvement for 4X agents).

Conclusion: DynTaskMAS establishes a foundation for building scalable, high-performance LLM-based multi-agent systems capable of efficiently handling complex, dynamic tasks.

Abstract: The emergence of Large Language Models (LLMs) in Multi-Agent Systems (MAS) has opened new possibilities for artificial intelligence, yet current implementations face significant challenges in resource management, task coordination, and system efficiency. While existing frameworks demonstrate the potential of LLM-based agents in collaborative problem-solving, they often lack sophisticated mechanisms for parallel execution and dynamic task management. This paper introduces DynTaskMAS, a novel framework that orchestrates asynchronous and parallel operations in LLM-based MAS through dynamic task graphs. The framework features four key innovations: (1) a Dynamic Task Graph Generator that intelligently decomposes complex tasks while maintaining logical dependencies, (2) an Asynchronous Parallel Execution Engine that optimizes resource utilization through efficient task scheduling, (3) a Semantic-Aware Context Management System that enables efficient information sharing among agents, and (4) an Adaptive Workflow Manager that dynamically optimizes system performance. Experimental evaluations demonstrate that DynTaskMAS achieves significant improvements over traditional approaches: a 21-33% reduction in execution time across task complexities (with higher gains for more complex tasks), a 35.4% improvement in resource utilization (from 65% to 88%), and near-linear throughput scaling up to 16 concurrent agents (3.47X improvement for 4X agents). Our framework establishes a foundation for building scalable, high-performance LLM-based multi-agent systems capable of handling complex, dynamic tasks efficiently.

[485] R3DM: Enabling Role Discovery and Diversity Through Dynamics Models in Multi-agent Reinforcement Learning

Harsh Goel, Mohammad Omama, Behdad Chalaki, Vaishnav Tadiparthi, Ehsan Moradi Pari, Sandeep Chinchali

Main category: cs.MA

TL;DR: R3DM is a novel role-based MARL framework that learns emergent roles by maximizing mutual information between agents’ roles, observed trajectories, and expected future behaviors, outperforming SOTA methods by up to 20% win rate improvement.

Details

Motivation: Existing role-based MARL methods derive roles only from past experience, neglecting their influence on future trajectories. Inspired by biological systems where roles shape future behavior for coordination, the paper aims to develop a framework where roles actively shape future behavior to enable effective multi-agent coordination.

Method: Proposes R3DM (Role Discovery and Diversity through Dynamics Models) that learns emergent roles by maximizing mutual information between agents’ roles, observed trajectories, and expected future behaviors. Uses contrastive learning on past trajectories to derive intermediate roles, then shapes intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model.

Result: Benchmarking on SMAC and SMACv2 environments shows R3DM outperforms state-of-the-art MARL approaches, improving multi-agent coordination to increase win rates by up to 20%.

Conclusion: R3DM successfully demonstrates that learning roles that shape future behavior leads to better multi-agent coordination, providing a novel framework that bridges the gap between past experience and future behavior in role-based MARL.

Abstract: Multi-agent reinforcement learning (MARL) has achieved significant progress in large-scale traffic control, autonomous vehicles, and robotics. Drawing inspiration from biological systems where roles naturally emerge to enable coordination, role-based MARL methods have been proposed to enhance cooperation learning for complex tasks. However, existing methods exclusively derive roles from an agent’s past experience during training, neglecting their influence on its future trajectories. This paper introduces a key insight: an agent’s role should shape its future behavior to enable effective coordination. Hence, we propose Role Discovery and Diversity through Dynamics Models (R3DM), a novel role-based MARL framework that learns emergent roles by maximizing the mutual information between agents’ roles, observed trajectories, and expected future behaviors. R3DM optimizes the proposed objective through contrastive learning on past trajectories to first derive intermediate roles that shape intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model. Benchmarking on SMAC and SMACv2 environments demonstrates that R3DM outperforms state-of-the-art MARL approaches, improving multi-agent coordination to increase win rates by up to 20%. The code is available at https://github.com/UTAustin-SwarmLab/R3DM.

cs.MM

Yijun Liu, Wu Liu, Xiaoyan Gu, Allen He, Weiping Wang, Yongdong Zhang

Main category: cs.MM

TL;DR: PopSim: A novel simulation-based paradigm using LLM-powered multi-agent social networks to predict social media popularity by simulating content propagation dynamics, outperforming existing methods by 8.82%.

Details

Motivation: Existing popularity prediction methods use static inductive models trained on historical data, which fail to capture the dynamic, interactive nature of UGC propagation with complex nonlinear evolution.

Method: PopSim uses LLM-based multi-agent social network simulation with social-mean-field agent interaction mechanism (modeling dual-channel bidirectional individual-population interactions), multi-source information aggregation module to transform heterogeneous metadata into uniform LLM format, and fusion of multimodal propagation dynamics.

Result: Extensive experiments on real-world datasets show PopSim consistently outperforms state-of-the-art methods, reducing prediction error by an average of 8.82%.

Conclusion: PopSim offers a new simulation-based paradigm for social media popularity prediction that effectively captures dynamic propagation processes, providing superior performance over traditional inductive approaches.

Abstract: Accurately predicting the popularity of user-generated content (UGC) is essential for advancing social media analytics and recommendation systems. Existing approaches typically follow an inductive paradigm, where researchers train static models on historical data for popularity prediction. However, the UGC propagation is inherently a dynamic process, and static modeling based on historical features fails to capture the complex interactions and nonlinear evolution. In this paper, we propose PopSim, a novel simulation-based paradigm for social media popularity prediction (SMPP). Unlike the inductive paradigm, PopSim leverages the large language models (LLMs)-based multi-agent social network sandbox to simulate UGC propagation dynamics for popularity prediction. Specifically, to effectively model the UGC propagation process in the network, we design a social-mean-field-based agent interaction mechanism, which models the dual-channel and bidirectional individual-population interactions, enhancing agents’ global perception and decision-making capabilities. In addition, we propose a multi-source information aggregation module that transforms heterogeneous social metadata into a uniform formulation for LLMs. Finally, propagation dynamics with multimodal information are fused to provide comprehensive popularity prediction. Extensive experiments on real-world datasets demonstrate that SimPop consistently outperforms the state-of-the-art methods, reducing prediction error by an average of 8.82%, offering a new perspective for research on the SMPP task.

[487] Stepwise Schema-Guided Prompting Framework with Parameter Efficient Instruction Tuning for Multimedia Event Extraction

Xiang Yuan, Xinrong Chen, Haochen Li, Hang Yang, Guanyu Wang, Weiping Li, Tong Mo

Main category: cs.MM

TL;DR: SSGPF uses multimodal LLMs with stepwise schema-guided prompting for multimedia event extraction, achieving SOTA results on M2E2 benchmark.

Details

Motivation: Current MEE frameworks struggle with complex multimedia event structures and lack multimodal-aligned training data for effective knowledge transfer.

Method: Stepwise Schema-Guided Prompting Framework (SSGPF) with MLLM backbone: Event Type Schema Guided Prompting for event detection, then Argument Role Schema Guided Prompting with text-bridged grounding for argument extraction. Uses weakly-aligned multimodal dataset and LoRA fine-tuning on LLaVA-v1.5-7B.

Result: SSGPF outperforms SOTA baselines by 5.8% F1 on event detection and 8.4% F1 on argument extraction on M2E2 benchmark.

Conclusion: The proposed SSGPF effectively addresses MEE challenges through adaptive structure capturing and multimodal alignment, demonstrating strong performance improvements over existing methods.

Abstract: Multimedia Event Extraction (MEE) has become an important task in information extraction research as news today increasingly prefers to contain multimedia content. Current MEE works mainly face two challenges: (1) Inadequate extraction framework modeling for handling complex and flexible multimedia event structure; (2) The absence of multimodal-aligned training data for effective knowledge transfer to MEE task. In this work, we propose a Stepwise Schema-Guided Prompting Framework (SSGPF) using Multimodal Large Language Model (MLLM) as backbone for adaptive structure capturing to solve MEE task. At the initial step of SSGPF, we design Event Type Schema Guided Prompting (ETSGP) for event detection, then we devise Argument Role Schema Guided Prompting (ARSGP) that contains multi-step prompts with text-bridged grounding technique for argument extraction. We construct a weakly-aligned multimodal event labeled dataset based on existing unimodal event annotations, then conduct parameter efficient instruction tuning with LoRA on LLaVA-v1.5-7B under SSGPF. Experiments on the M2E2 benchmark demonstrate that SSGPF significantly outperforms current SOTA baselines by 5.8 percent F1 on event detection and 8.4 percent F1 on argument extraction.

[488] TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs

Pengju Xu, Yan Wang, Shuyuan Zhang, Xuan Zhou, Xin Li, Yue Yuan, Fengzhao Li, Shunyuan Zhou, Xingyu Wang, Yi Zhang, Haiying Zhao

Main category: cs.MM

TL;DR: TCC-Bench is a bilingual Visual Question Answering benchmark for evaluating MLLMs’ understanding of traditional Chinese culture, revealing current models struggle with culturally-grounded visual content.

Details

Motivation: Current Multimodal Large Language Models (MLLMs) show limited effectiveness in non-Western cultural contexts, raising concerns about their wider applicability and cultural inclusivity.

Method: Created TCC-Bench with culturally rich visual data from museum artifacts, everyday scenes, and comics. Used semi-automated pipeline: GPT-4o generates candidate questions, followed by human curation to ensure quality and avoid data leakage. Designed to prevent language bias by not disclosing cultural concepts in question texts.

Result: Experimental evaluations across various MLLMs show current models face significant challenges when reasoning about culturally grounded visual content, highlighting performance gaps in cultural understanding.

Conclusion: There is a need for further research in developing culturally inclusive and context-aware multimodal systems. TCC-Bench provides a valuable benchmark for assessing and improving cultural understanding in MLLMs.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) have significantly enhanced the ability of artificial intelligence systems to understand and generate multimodal content. However, these models often exhibit limited effectiveness when applied to non-Western cultural contexts, which raises concerns about their wider applicability. To address this limitation, we propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench), a bilingual (i.e., Chinese and English) Visual Question Answering (VQA) benchmark specifically designed for assessing the understanding of traditional Chinese culture by MLLMs. TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage. The benchmark also avoids language bias by preventing direct disclosure of cultural concepts within question texts. Experimental evaluations across a wide range of MLLMs demonstrate that current models still face significant challenges when reasoning about culturally grounded visual content. The results highlight the need for further research in developing culturally inclusive and context-aware multimodal systems. The code and data can be found at: https://tcc-bench.github.io/.

eess.AS

[489] On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer

Main category: eess.AS

TL;DR: Parameter-efficient adaptation method for ASR to transcribe stuttered speech with dysfluency tokens, evaluated on English and German datasets, revealing English-centric biases in tokenizers.

Details

Motivation: Current ASR systems fail to accurately transcribe stuttered speech, producing non-verbatim transcriptions that lack clinical and research value. There's also a bias toward English in existing systems.

Method: Proposed parameter-efficient adaptation method to decode dysfluencies as special tokens, with multi-step fine-tuning strategy including language-adaptive pretraining to address English bias.

Result: Demonstrated effectiveness of lightweight adaptation for dysfluency-aware ASR, but exposed English-centric tokenizer bias that limits performance on German data.

Conclusion: Lightweight adaptation techniques work for dysfluency-aware ASR, but multilingual E2E systems have significant limitations due to tokenizer biases that need addressing.

Abstract: Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer’s English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.

[490] Towards Language-Independent Face-Voice Association with Multimodal Foundation Models

Aref Farhadipour, Teodora Vukovic, Volker Dellwo

Main category: eess.AS

TL;DR: UZH-CL system for FAME2026 Challenge uses dual-encoder baseline and ImageBind-LoRA for cross-modal verification in unseen languages, achieving 2nd place with 24.73% EER despite Arabic-only fine-tuning.

Details

Motivation: Address cross-modal verification under unique multilingual conditions with unseen/unheard languages, tackling data scarcity and language constraints in the FAME2026 Challenge.

Method: Two architectures: 1) Baseline dual-encoder trained from scratch with contrastive and orthogonal projection losses, 2) Foundation model approach using ImageBind with LoRA fine-tuning. External Arabic dataset curated from VoxBlink to address data scarcity.

Result: ImageBind-LoRA achieved 24.73% EER on evaluation set (English and German) despite being fine-tuned exclusively on Arabic audio, securing 2nd place in the competition.

Conclusion: ImageBind-LoRA demonstrates remarkable cross-lingual generalization capabilities, showing foundation models with LoRA adaptation can effectively handle cross-modal verification in unseen languages with limited training data.

Abstract: This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.

[491] Perceptual evaluation of Acoustic Level of Detail in Virtual Acoustic Environments

Stefan Fichna, Steven van de Par, Bernhard U. Seeber, Stephan D. Ewert

Main category: eess.AS

TL;DR: Study finds that acoustic simulations can be significantly simplified (reduced acoustic level of detail) while maintaining perceptual quality comparable to real binaural recordings, as long as diffuse late reverberation is properly represented.

Details

Motivation: Virtual acoustic environments are crucial for hearing research, but real-time applications require simplified simulations. The necessary acoustic level of detail (ALOD) for capturing perceptually relevant effects in reverberant indoor environments remains unclear.

Method: Simulated three real environments (living room with kitchen, pub, underground station) with varying ALOD by: 1) generating different numbers of image sources for early reflections, and 2) excluding geometrical room details. Perceptually evaluated using headphones compared to binaural room impulse responses measured with dummy head in real environments, or using loudspeakers. Assessed perceived overall difference for pulse stimulus, electric bass, and speech token, plus plausibility, speech intelligibility, and externalization.

Result: Strong reduction in ALOD is feasible while maintaining similar plausibility, speech intelligibility, and externalization as with dummy head recordings. Number and accuracy of early reflections appear less relevant when diffuse late reverberation is appropriately represented.

Conclusion: Acoustic simulations for virtual environments can be significantly simplified without compromising perceptual quality, particularly focusing on proper representation of diffuse late reverberation rather than detailed early reflections.

Abstract: Virtual acoustic environments enable the creation and simulation of realistic and eco-logically valid daily-life situations vital for hearing research and audiology. Reverberant indoor environments are particularly important. For real-time applications, room acous-tics simulation requires simplifications, however, the necessary acoustic level of detail (ALOD) remains unclear in order to capture all perceptually relevant effects. This study examines the impact of varying ALOD in simulations of three real environments: a living room with a coupled kitchen, a pub, and an underground station. ALOD was varied by generating different numbers of image sources for early reflections, or by excluding geo-metrical room details specific for each environment. Simulations were perceptually eval-uated using headphones in comparison to binaural room impulse responses measured with a dummy head in the corresponding real environments, or by using loudspeakers. The study assessed the perceived overall difference for a pulse stimulus, a played electric bass and a speech token. Additionally, plausibility, speech intelligibility, and externaliza-tion were evaluated. Results indicate that a strong reduction in ALOD is feasible while maintaining similar plausibility, speech intelligibility, and externalization as with dummy head recordings. The number and accuracy of early reflections appear less relevant, pro-vided diffuse late reverberation is appropriately represented.

[492] IDMap: A Pseudo-Speaker Generator Framework Based on Speaker Identity Index to Vector Mapping

Zeyan Liu, Liping Chen, Kong Aik Lee, Zhenhua Ling

Main category: eess.AS

TL;DR: IDMap framework generates unique pseudo-speakers for voice anonymization using feedforward mapping from identity index to speaker vector, improving privacy protection with lower computational cost.

Details

Motivation: Current pseudo-speaker generation methods have limitations in uniqueness (restricting privacy protection) and computational efficiency, especially problematic for large-scale scenarios requiring many pseudo-speakers.

Method: Proposes IDMap framework that establishes feedforward mapping from speaker identity index to speaker vector, implemented as two models: IDMap-MLP (multi-layer perceptron) and IDMap-Diff (diffusion-based).

Result: Small-scale evaluations on LibriSpeech show improved uniqueness of pseudo-speakers and better voice privacy protection with reduced computational cost. Large-scale evaluations on MLS and Common Voice datasets demonstrate stable privacy protection as number of pseudo-speakers increases.

Conclusion: IDMap framework effectively addresses uniqueness and computational efficiency limitations in pseudo-speaker generation for voice anonymization, providing scalable solution for voice privacy protection.

Abstract: Facilitated by the speech generation framework that disentangles speech into content, speaker, and prosody, voice anonymization is accomplished by substituting the original speaker embedding vector with that of a pseudo-speaker. In this framework, the pseudo-speaker generation forms a fundamental challenge. Current pseudo-speaker generation methods demonstrate limitations in the uniqueness of pseudo-speakers, consequently restricting their effectiveness in voice privacy protection. Besides, existing model-based methods suffer from heavy computation costs. Especially, in the large-scale scenario where a huge number of pseudo-speakers are generated, the limitations of uniqueness and computational inefficiency become more significant. To this end, this paper proposes a framework for pseudo-speaker generation, which establishes a mapping from speaker identity index to speaker vector in the feedforward architecture, termed IDMap. Specifically, the framework is specified into two models: IDMap-MLP and IDMap-Diff. Experiments were conducted on both small- and large-scale evaluation datasets. Small-scale evaluations on the LibriSpeech dataset validated the effectiveness of the proposed IDMap framework in enhancing the uniqueness of pseudo-speakers, thereby improving voice privacy protection, while at a reduced computational cost. Large-scale evaluations on the MLS and Common Voice datasets further justified the superiority of the IDMap framework regarding the stability of the voice privacy protection capability as the number of pseudo-speakers increased. Audio samples and open-source code can be found in https://github.com/VoicePrivacy/IDMap.

eess.IV

[493] Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction

Sina Raeisadigh, Myles Joshua Toledo Tan, Henning Müller, Abderrahmane Hedjoudje

Main category: eess.IV

TL;DR: 24-hour diffusion MRI outperforms baseline imaging for predicting 3-month functional outcomes in acute ischemic stroke patients when combined with clinical and lesion-volume features.

Details

Motivation: To determine whether early post-treatment diffusion MRI (24-hour) provides better prognostic value than pre-treatment baseline imaging for predicting functional outcomes in acute ischemic stroke patients.

Method: Analyzed 74 AIS patients with paired ADC scans and clinical data. Used 3D ResNet-50 embeddings fused with clinical variables, reduced via PCA (≤12 components), and classified with linear SVM using eight-fold stratified group cross-validation.

Result: J1 (24-hour) multimodal models achieved highest predictive performance (AUC = 0.923 ± 0.085), outperforming J0-based configurations (AUC ≤ 0.86). Incorporating lesion-volume features improved model stability and interpretability.

Conclusion: Early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging, and combining MRI, clinical, and lesion-volume features creates a robust, interpretable framework for predicting 3-month functional outcomes in AIS patients.

Abstract: This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four AIS patients with paired apparent diffusion coefficient (ADC) scans and clinical data were analyzed. Three-dimensional ResNet-50 embeddings were fused with structured clinical variables, reduced via principal component analysis (<=12 components), and classified using linear support vector machines with eight-fold stratified group cross-validation. J1 multimodal models achieved the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC <= 0.86). Incorporating lesion-volume features further improved model stability and interpretability. These findings demonstrate that early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging and that combining MRI, clinical, and lesion-volume features produces a robust and interpretable framework for predicting three-month functional outcomes in AIS patients.

[494] TT-Stack: A Transformer-Based Tiered-Stacking Ensemble Framework with Meta-Learning for Automated Breast Cancer Detection in Mammography

Showkat Osman, Md. Tajwar Munim Turzo, Maher Ali Rusho, Md. Makid Haider, Sazzadul Islam Sajin, Ayatullah Hasnat Behesti, Ahmed Faizul Haque Dhrubo, Md. Khurshid Jahan, Mohammad Abdul Qayum

Main category: eess.IV

TL;DR: A two-level Stack of Transformers ensemble framework using seven lightweight vision transformers achieves state-of-the-art breast cancer detection in mammograms with 99.33% accuracy and 99.97% ROC-AUC.

Details

Motivation: Breast cancer remains a leading cause of cancer deaths worldwide, with early detection crucial for improving survival rates. Traditional computer-aided diagnosis systems have limitations in feature representation and generalization across diverse mammographic images.

Method: Proposes TT-Stack ensemble framework using seven lightweight vision transformers (RepViT, DaViT, EfficientViT, MobileViT, FasterViT, MViT, PVT v2) with two-tier meta-learning. Models process single-channel grayscale mammograms using transfer learning from ImageNet. Training includes stratified splits, class-balanced upsampling, early stopping, and adaptive learning rate schedules on the Mammogram Mastery dataset.

Result: EfficientViT and PVT-v2 were top performers with 99.33% validation accuracy, 97.96% F1-score, and perfect 1.000 ROC-AUC. The final TT-Stack ensemble achieved 99.33% accuracy, 100% precision, 96% recall, 97.96% F1-score, and 99.97% ROC-AUC with robust performance due to architectural diversity.

Conclusion: The TT-Stack ensemble framework demonstrates superior performance for breast cancer detection in mammograms, offering a parameter-efficient approach suitable for clinical practice with minimal variance and strong generalization capabilities.

Abstract: Breast cancer continues to be the second most common cause of cancer-related deaths around the world, with early detection being important to improve survival rates for patients. Traditional computer-aided diagnosis systems have limitations in their ability to represent features and generalize to the range of mammographic images. We present a new two-level Stack of Transformers (TT-Stack) ensemble framework based on using heterogeneous lightweight vision transformer architectures to automatically identify breast cancer in mammograms. Specifically, we integrate seven state-of-the-art vision transformers: RepViT, DaViT, EfficientViT, MobileViT, FasterViT, MViT, and PVT v2 while also designing a two-tier meta-learning approach for the ensemble by simply taking the logits from the base model and applying logistic regression for binary classification (Cancer vs. Non-Cancer). Each of the transformer backbone models was developed to process single-channel grayscale mammograms while still taking advantage of transfer learning from pre-training on ImageNet so that they would offer a parameter-efficient approach that may reasonably be applied in clinical practice with minimal variance. The training process included stratified 80/20 splits when necessary, class-balanced upsampling, early stopping, and an adaptive learning rate schedule on the public Mammogram Mastery dataset. In separate evaluations here, it was determined that EfficientViT and PVT-v2 were the top per-forming models achieving 99.33% validation, 97.96% F1-score, and perfect 1.000:0 ROC-AUC with only small train/validation gaps. Finally, the TT-Stack ensemble model by the end of the evaluation reached 99.33% accuracy with 100% precision, 96% recall, 97.96% F1-score and a 99.97% ROC-AUC, and demonstrated robustness in performance due to the diversity of the architecture.

[495] Maintaining SUV Accuracy in Low-Count PET with PETfectior: A Deep Learning Denoising Solution

Yamila Rotstein Habarnau, Nicolás Bustos, Paola Corona, Christian González, Sonia Traverso, Federico Matorra, Francisco Funes, Juan Martín Giraut, Laura Pelegrina, Gabriel Bruno, Mauro Namías

Main category: eess.IV

TL;DR: PETfectior AI software enables high-quality PET imaging with half the radiation dose while maintaining lesion detection accuracy and quantitative performance.

Details

Motivation: To reduce patient radiation exposure and radiopharmaceutical costs in PET imaging while maintaining diagnostic quality, as current standards require high counting statistics that increase radiation dose.

Method: Prospective study of 258 patients with 18F-FDG PET/CT. Standard scans (100%) were compared to half-counting-statistics versions processed with PETfectior AI software (50%+PETfectior). Lesion detectability, SUVmax quantitative concordance, and subjective image quality were evaluated across 1649 lesions.

Result: PETfectior achieved 99.9% sensitivity for lesion detection with only 1 false positive. SUVmax measurements agreed within 12.5% (95% limits) with -1.01% bias. Image quality was rated equal to or better than standard-of-care images.

Conclusion: PETfectior can be safely used clinically at half counting statistics, providing high sensitivity/specificity, low quantitative bias, and excellent subjective image quality while reducing radiation exposure.

Abstract: Background: Diagnostic PET image quality depends on the administered activity and acquisition time. However, minimizing these variables is desirable to reduce patient radiation exposure and radiopharmaceutical costs. PETfectior is an artificial intelligence-based software that processes PET scans and increases signal-to-noise ratio, obtaining high-quality images from low-count-rate images. We perform an initial clinical validation of PETfectior on images acquired with half of the counting statistics required to meet the most recent EANM quantitative standards for 18F-FDG PET, evaluating lesions detectability, quantitative performance and image quality. Materials and methods: 258 patients referred for 18F-FDG PET/CT were prospectively included. The standard-of-care scans (100% scans) were acquired and reconstructed according to EARL standards 2. Half-counting-statistics versions were generated from list-mode data and processed with PETfecftior (50%+PETfectior scans). All oncologic lesions were segmented on both PET/CT versions, manually or automatically, and lesions detectability was evaluated. The SUVmax of the lesions was measured and the quantitative concordance of 50%+PETfectior and 100% images was evaluated. Subjective image quality was visually assessed by two experienced physicians. Results: 1649 lesions were detected in a total of 198 studies. The 50%+PETfectior images showed high sensitivity for lesion detection (99.9%) and only 1 false positive was detected. The SUVmax measured in 100% and 50%+PETfectior images agreed within 12.5% (95% limits of agreement), with a bias of -1.01%. Image quality of the 50%+PETfectior images was rated equal to or better than the standard-of-care images. Conclusion: PETfectior can safely be used in clinical practice at half counting statistics, with high sensitivity and specificity, low quantitative bias and high subjective image quality.

[496] Malaria detection using Deep Convolution Neural Network

Sumit Kumar, Harsh Vardhan, Sneha Priya, Ayush Kumar

Main category: eess.IV

TL;DR: This paper presents a deep learning approach using a 2-layer convolutional neural network to achieve state-of-the-art accuracy for malaria detection from blood cell images, aiming to reduce diagnostic burden in resource-constrained regions.

Details

Motivation: Malaria remains a major global health crisis with 219 million cases and 435,000 deaths annually, primarily affecting sub-Saharan Africa. Despite AI's potential to revolutionize malaria detection, international funding has declined, creating a need for efficient, accurate diagnostic solutions that can work in resource-constrained settings.

Method: The research uses the NIH malaria cell image dataset and implements a 2-layer convolutional neural network for malaria detection. The approach focuses on detecting and segmenting red blood cells infected with malaria parasites, aiming to demonstrate that high accuracy can be achieved with simpler network architectures.

Result: The paper claims to achieve state-of-the-art accuracy in malaria detection using only a 2-layer convolutional network, establishing a new baseline for AI-based malaria detection efforts. This demonstrates that complex deep architectures may not be necessary for effective malaria diagnosis.

Conclusion: Simple convolutional neural networks can provide accurate malaria detection, potentially revolutionizing diagnostics in resource-constrained regions by reducing the burden on microscopists and improving diagnostic accuracy with minimal computational requirements.

Abstract: The latest WHO report showed that the number of malaria cases climbed to 219 million last year, two million higher than last year. The global efforts to fight malaria have hit a plateau and the most significant underlying reason is international funding has declined. Malaria, which is spread to people through the bites of infected female mosquitoes, occurs in 91 countries but about 90% of the cases and deaths are in sub-Saharan Africa. The disease killed 4,35,000 people last year, the majority of them children under five in Africa. AI-backed technology has revolutionized malaria detection in some regions of Africa and the future impact of such work can be revolutionary. The malaria Cell Image Data-set is taken from the official NIH Website NIH data. The aim of the collection of the dataset was to reduce the burden for microscopists in resource-constrained regions and improve diagnostic accuracy using an AI-based algorithm to detect and segment the red blood cells. The goal of this work is to show that the state of the art accuracy can be obtained even by using 2 layer convolution network and show a new baseline in Malaria detection efforts using AI.

[497] ContourDiff: Unpaired Medical Image Translation with Structural Consistency

Yuwen Chen, Nicholas Konz, Hanxue Gu, Haoyu Dong, Yaqian Chen, Lin Li, Jisoo Lee, Maciej A. Mazurowski

Main category: eess.IV

TL;DR: ContourDiff with SCGD is a novel diffusion-based medical image translation framework that preserves anatomical fidelity by using domain-invariant contour representations as spatial constraints during translation.

Details

Motivation: Existing medical image translation methods prioritize perceptual quality over anatomical fidelity, which is crucial for clinical applications like segmentation where preserving anatomical structure is essential for downstream tasks.

Method: Proposes ContourDiff with Spatially Coherent Guided Diffusion (SCGD) that uses domain-invariant anatomical contour representations as precise spatial constraints. A diffusion model converts contour representations from input domains to output domain images, applying contour constraints at every diffusion sampling step to preserve anatomy.

Result: Outperforms other unpaired image translation methods on lumbar spine and hip-and-thigh CT-to-MRI translation tasks across segmentation performance metrics (models trained on translated images applied to real MRIs) and image quality metrics (foreground FID and KID). Shows zero-shot capability where models trained on one anatomical region work on unseen regions without retraining.

Conclusion: ContourDiff effectively preserves anatomical content during medical image translation without needing input domain information during training, enabling better performance on downstream clinical tasks and demonstrating strong generalization capabilities.

Abstract: Accurately translating medical images between different modalities, such as Computed Tomography (CT) to Magnetic Resonance Imaging (MRI), has numerous downstream clinical and machine learning applications. While several methods have been proposed to achieve this, they often prioritize perceptual quality with respect to output domain features over preserving anatomical fidelity. However, maintaining anatomy during translation is essential for many tasks, e.g., when leveraging masks from the input domain to develop a segmentation model with images translated to the output domain. To address these challenges, we propose ContourDiff with Spatially Coherent Guided Diffusion (SCGD), a novel framework that leverages domain-invariant anatomical contour representations of images. These representations are simple to extract from images, yet form precise spatial constraints on their anatomical content. We introduce a diffusion model that converts contour representations of images from arbitrary input domains into images in the output domain of interest. By applying the contour as a constraint at every diffusion sampling step, we ensure the preservation of anatomical content. We evaluate our method on challenging lumbar spine and hip-and-thigh CT-to-MRI translation tasks, via (1) the performance of segmentation models trained on translated images applied to real MRIs, and (2) the foreground FID and KID of translated images with respect to real MRIs. Our method outperforms other unpaired image translation methods by a significant margin across almost all metrics and scenarios. Moreover, it achieves this without the need to access any input domain information during training and we further verify its zero-shot capability, showing that a model trained on one anatomical region can be directly applied to unseen regions without retraining (GitHub: https://github.com/mazurowski-lab/ContourDiff).

[498] Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction

Jinho Kim, Marcel Dominik Nickel, Florian Knoll

Main category: eess.IV

TL;DR: Zero-shot self-supervised learning enables 14-second breath-hold MRCP with R=25 acceleration, achieving quality comparable to 338-second respiratory-triggered scans while reducing training time via partial trainable stages.

Details

Motivation: To reduce breath-hold times in MRCP scans while maintaining image quality, addressing the clinical need for faster acquisitions without compromising diagnostic utility.

Method: Used zero-shot self-supervised learning with incoherent k-space sampling (R=25), compared with parallel imaging (R=3) and compressed sensing. Introduced partial trainable stages approach where n stages are frozen (pretrained) and m stages are trainable to reduce backpropagation depth and training time.

Result: Zero-shot reconstruction significantly improved visual quality over compressed sensing, achieving comparable quality to successful respiratory-triggered acquisitions. PSNR decreased only slightly from 38.25 dB to 37.67 dB while training time decreased up to 6.7-fold with optimized frozen/trainable configurations.

Conclusion: Zero-shot learning enables high-fidelity MRCP reconstructions with dramatically reduced breath-hold times (14s vs 338s), and the partially trainable approach provides a practical solution for clinical translation by balancing reconstruction quality with computational efficiency.

Abstract: To investigate the feasibility of zero-shot self-supervised learning reconstruction for reducing breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Breath-hold MRCP was acquired from 11 healthy volunteers on 3T scanners using an incoherent k-space sampling pattern, leading to 14-second acquisition time and an acceleration factor of R=25. Zero-shot reconstruction was compared with parallel imaging of respiratory-triggered MRCP (338s, R=3) and compressed sensing reconstruction. For two volunteers, breath-hold scans (40s, R=6) were additionally acquired and retrospectively undersampled to R=25 to compute peak signal-to-noise ratio (PSNR). To address long zero-shot training time, the n+m full stages of the zero-shot learning were divided into two parts to reduce backpropagation depth during training: 1) n frozen stages initialized with n-stage pretrained network and 2) m trainable stages initialized either randomly or m-stage pretrained network. Efficiency of our approach was assessed by varying initialization strategies and the number of trainable stages using the retrospectively undersampled data. Zero-shot reconstruction significantly improved visual image quality over compressed sensing, particularly in SNR and ductal delineation, and achieved image quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Improved initializations enhanced PSNR and reduced reconstruction time. Adjusting frozen/trainable configurations demonstrated that PSNR decreased only slightly from 38.25 dB (0/13) to 37.67 dB (12/1), while training time decreased up to 6.7-fold. Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and the proposed partially trainable approach offers a practical solution for translation into time-constrained clinical workflows.

[499] Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar

Rongsheng Qian, Chi Xu, Xiaoqiang Ma, Hao Fang, Yili Jin, William I. Atlas, Jiangchuan Liu

Main category: eess.IV

TL;DR: SCOPE: Self-supervised framework for real-time sonar compression and artifact correction without clean-noise pairs, achieving 40% SSIM improvement at ≤0.0118 bpp with 80% bandwidth reduction.

Details

Motivation: Real-time imaging sonar is essential for underwater monitoring but faces limitations: low uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) affecting up to 98% of frames.

Method: SCOPE combines Adaptive Codebook Compression (ACC) for frequency-encoded latent representations tailored to sonar, and Frequency-Aware Multiscale Segmentation (FAMS) that decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing artifacts. Uses hedging training strategy with low-pass proxy pairs generated without labels.

Result: Achieves SSIM of 0.77 (40% improvement over prior self-supervised baselines) at bitrates ≤0.0118 bpp. Reduces uplink bandwidth by >80% while improving downstream detection. Real-time performance: 3.1 ms encoding on embedded GPU, 97 ms full decoding on server. Deployed for months in three Pacific Northwest rivers for salmon enumeration.

Conclusion: Learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions, supporting environmental monitoring applications.

Abstract: Real-time imaging sonar is crucial for underwater monitoring where optical sensing fails, but its use is limited by low uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) affecting up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE combines (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar, with (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A hedging training strategy further guides frequency-aware learning using low-pass proxy pairs generated without labels. Evaluated on months of in-situ ARIS sonar data, SCOPE achieves a structural similarity index (SSIM) of 0.77, representing a 40% improvement over prior self-supervised denoising baselines, at bitrates down to <= 0.0118 bpp. It reduces uplink bandwidth by more than 80% while improving downstream detection. The system runs in real time, with 3.1 ms encoding on an embedded GPU and 97 ms full multi-layer decoding on the server end. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild. Results demonstrate that learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions.

[500] Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction

Pei Liu, Luping Ji, Jiaxiang Gou, Xiangxiang Zeng

Main category: eess.IV

TL;DR: CROPKT is the first systematic study on cross-cancer prognosis knowledge transfer in Whole-Slide Images, shifting from cancer-specific models to knowledge transfer paradigms to address rare tumors and leverage multi-cancer knowledge.

Details

Motivation: Current cancer prognosis models follow a cancer-specific paradigm that struggles with rare tumors and cannot leverage knowledge from other cancers. Multi-task learning approaches are computationally demanding and require large multi-cancer datasets.

Method: The study comprises three parts: (1) curating UNI2-h-DSS dataset with 26 cancers to measure WSI prognostic knowledge transferability, (2) designing experiments to understand transferability mechanisms, and (3) proposing ROUPKT, a routing-based baseline approach to efficiently utilize knowledge from off-the-shelf models of other cancers.

Result: CROPKT establishes the foundation for cross-cancer knowledge transfer in WSI-based prognosis prediction, demonstrating the feasibility and utility of transferring prognostic knowledge across different cancers including rare tumors.

Conclusion: This work serves as an inception for a new paradigm in WSI-based cancer prognosis prediction using cross-cancer knowledge transfer, offering a more scalable approach that can handle rare tumors and leverage multi-cancer knowledge efficiently.

Abstract: Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm in which each cancer corresponds to a single model. However, this paradigm naturally struggles to scale to rare tumors and cannot leverage knowledge from other cancers. While multi-task learning frameworks have been explored recently, they often place high demands on computational resources and require extensive training on ultra-large, multi-cancer WSI datasets. To this end, this paper shifts the paradigm to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It comprises three major parts. (1) We curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors). (2) Beyond a simple evaluation merely for benchmarking, we design a range of experiments to gain deeper insights into the underlying mechanism behind transferability. (3) We further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that could often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. CROPKT could serve as an inception that lays the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.

[501] Random forest-based out-of-distribution detection for robust lung cancer segmentation

Aneesh Rangnekar, Harini Veeraraghavan

Main category: eess.IV

TL;DR: Transformer-based lung cancer segmentation model with self-supervised pretraining shows strong ID performance but poor OOD generalization. RF-Deep classifier using tumor-anchored features achieves effective OOD detection, especially on abdomen CT cohorts.

Details

Motivation: Accurate cancer lesion segmentation from CT scans is crucial for automated treatment planning, but transformer models with self-supervised pretraining often generalize poorly to out-of-distribution data, limiting clinical reliability.

Method: Used Swin Transformer encoder pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans, with randomly initialized decoder. For OOD detection, developed RF-Deep - a random forest classifier applied to contextual tumor-anchored feature representations.

Result: RF-Deep achieved FPR95 of 18.26% and 27.66% on chest CT cohorts (pulmonary embolism, negative COVID-19), and near-perfect detection (<0.1% FPR95) on abdomen CT cohorts (kidney cancers, non-cancerous pancreas), outperforming established OOD methods.

Conclusion: RF-Deep provides a simple, lightweight, and effective approach for enhancing segmentation model reliability in clinical deployment by effectively detecting out-of-distribution inputs across different anatomical regions.

Abstract: Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining have achieved strong performance on in-distribution (ID) data but often generalize poorly on out-of-distribution (OOD) inputs. We investigate this behavior for lung cancer segmentation using an encoder-decoder model. Our encoder is a Swin Transformer pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans spanning cancerous and non-cancerous conditions, and the decoder was randomly initialized. This model was evaluated on an independent ID test set and four OOD scenarios, including chest CT cohorts (pulmonary embolism and negative COVID-19) and abdomen CT cohorts (kidney cancers and non-cancerous pancreas). OOD detection was performed at the scan level using RF-Deep, a random forest classifier applied to contextual tumor-anchored feature representations. We evaluated 920 3D CTs (172,650 images) and observed that RF-Deep achieved FPR95 values of 18.26% and 27.66% on the chest CT cohorts, and near-perfect detection (less than 0.1% FPR95) on the abdomen CT cohorts, consistently outperforming established OOD methods. These results demonstrate that our RF-Deep classifier provides a simple, lightweight, and effective approach for enhancing the reliability of segmentation models in clinical deployment.

[502] MRI Super-Resolution with Deep Learning: A Comprehensive Survey

Mohammad Khateri, Serge Vasylechko, Morteza Ghahremani, Liam Timms, Deniz Kocanaogullari, Simon K. Warfield, Camilo Jaimes, Davood Karimi, Alejandra Sierra, Jussi Tohka, Sila Kurugol, Onur Afacan

Main category: eess.IV

TL;DR: Survey paper reviewing deep learning-based MRI super-resolution methods, covering theoretical foundations, architectures, datasets, and metrics, with taxonomy and open challenges.

Details

Motivation: High-resolution MRI is clinically important but expensive and technically constrained. Super-resolution offers computational solution using affordable low-resolution scans to improve diagnostic accuracy without hardware upgrades.

Method: Comprehensive survey approach: systematic taxonomy of DL-based MRI SR methods, examining from computer vision, computational imaging, inverse problems, and MR physics perspectives. Includes theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics.

Result: Provides organized review of established and emerging SR techniques for MRI, identifies unique clinical/research challenges, highlights open problems, and offers GitHub repository with resources/tools/tutorials for community.

Conclusion: DL-based MRI super-resolution is promising computational approach to overcome hardware limitations, but requires addressing specific clinical/research challenges. Community needs to work on identified open problems for practical implementation.

Abstract: High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review

[2] Deep Research: A Systematic Survey

[3] Mirror, Mirror on the Wall – Which is the Best Model of Them All?

[4] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

[5] Reversing Large Language Models for Efficient Training and Fine-Tuning

[6] Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

[7] Spoken Conversational Agents with Large Language Models

[8] Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation

[9] Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

[10] A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation

[11] Swivuriso: The South African Next Voices Multilingual Speech Dataset

[12] Lightweight Latent Reasoning for Narrative Tasks

[13] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models

[14] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

[15] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models

[16] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

[17] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering

[18] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models

[19] A Concise Review of Hallucinations in LLMs and their Mitigation

[20] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

[21] What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints

[22] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce

[23] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

[24] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

[25] Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

[26] An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation

[27] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

[28] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

[29] SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

[30] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

[31] Towards Unification of Hallucination Detection and Fact Verification for Large Language Models

[32] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

[33] TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages

[34] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

[35] A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models

[36] BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion

[37] promptolution: A Unified, Modular Framework for Prompt Optimization

[38] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

[39] Bangla Hate Speech Classification with Fine-tuned Transformer Models

[40] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

[41] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

[42] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs

[43] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

[44] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic

[45] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models

[46] On SkipGram Word Embedding Models with Negative Sampling: Unified Framework and Impact of Noise Distributions

[47] Unifying Linear-Time Attention via Latent Probabilistic Modelling

[48] OmniBench: Towards The Future of Universal Omni-Language Models

[49] Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

[50] From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models

[51] Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

[52] SkyLadder: Better and Faster Pretraining via Context Window Scheduling

[53] On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

[54] Computational Fact-Checking of Online Discourse: Scoring scientific accuracy in climate change related news articles

[55] Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models

[56] MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

[57] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

[58] Multilingual Pretraining for Pixel Language Models

[59] Do Large Language Models Think Like the Brain? Sentence-Level Evidences from Layer-Wise Embeddings and fMRI

[60] Eka-Eval: An Evaluation Framework for Low-Resource Multilingual Large Language Models

[61] XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models

[62] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

[63] LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback

[64] DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

[65] WebMall – A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]

[66] Just-in-time and distributed task representations in language models

[67] Evolution and compression in LLMs: On the emergence of human-aligned categorization

[68] Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs

[69] Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

[70] ParlAI Vote: A Web Platform for Analyzing Gender and Political Bias in Large Language Models

[71] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions

[72] Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

[73] Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

[74] Reasoning Up the Instruction Ladder for Controllable Language Models

[75] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

[76] AI Text Detectors and the Misclassification of Slightly Polished Arabic Text

[77] A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis