Daily arXiv Papers - 2025-12-12

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models

Luciano Floridi, Jessica Morley, Claudio Novelli, David Watson

Main category: cs.CL

TL;DR: LLMs appear to perform abductive reasoning but actually generate text based on learned patterns from human texts, lacking true reasoning capabilities.

Motivation: To examine the nature of reasoning in token-completion LLMs and clarify that their apparent abductive reasoning is actually pattern-based text generation rather than genuine reasoning.

Method: Analyzes LLMs’ stochastic nature and similarity to human abductive reasoning, using examples to demonstrate how they produce plausible outputs without actual reasoning.

Result: LLMs can generate plausible ideas and mimic reasoning but lack grounding in truth, semantics, verification, or understanding; their outputs must be critically assessed.

Conclusion: LLMs have a stochastic base but appear abductive, requiring careful evaluation and application; they can assist human thinking but cannot verify truth or explanations.

Abstract: This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.

[2] Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models

Yumou Wei, John Stamper, Paulo F. Carvalho

Main category: cs.CL

TL;DR: SLMs can generate high-quality educational questions through a “generate-then-validate” pipeline combining text generation and probabilistic reasoning, with quality validated by both human experts and LLMs.

Motivation: To explore small language models (SLMs) as a complement to large language models for automatic question generation in learning analytics, leveraging SLMs' capabilities more effectively.

Method: A novel question generation pipeline using a “generate-then-validate” strategy: first expansive generation of candidate questions, then selective validation through probabilistic reasoning to refine them.
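
To make the two stages concrete, here is a minimal Python sketch of a generate-then-validate loop; the `generate_questions` and `answer_logprob` callables, the candidate count, and the log-probability threshold are placeholders for illustration, not details taken from the paper.

```python
import math
from typing import Callable

def generate_then_validate(
    passage: str,
    objective: str,
    generate_questions: Callable[[str, str, int], list[str]],  # hypothetical SLM call
    answer_logprob: Callable[[str, str], float],               # hypothetical SLM scorer
    n_candidates: int = 50,
    threshold: float = math.log(0.5),
) -> list[str]:
    """Expansive generation followed by selective validation (sketch)."""
    # Stage 1: over-generate candidate questions for the passage and objective.
    candidates = generate_questions(passage, objective, n_candidates)
    # Stage 2: keep only questions whose intended answer the SLM itself finds
    # probable -- a crude proxy for "the question has a clear answer".
    return [q for q in candidates if answer_logprob(q, passage) >= threshold]
```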

Result: Evaluations by seven human experts and by an LLM showed that the generated questions had clear answers and generally aligned well with the intended learning objectives.

Conclusion: SLMs can effectively generate high-quality questions when guided by a well-designed pipeline that leverages their strengths in text generation and probabilistic reasoning.

Abstract: We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a “generate-then-validate” strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and refine them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.

[3] Workflow is All You Need: Escaping the “Statistical Smoothing Trap” via High-Entropy Information Foraging and Adversarial Pacing

Zhongjie Jiang

Main category: cs.CL

TL;DR: DeepNews Framework addresses the “impossible trinity” in domain-specific long-form text generation by modeling expert cognitive processes to achieve low hallucination, deep logic, and personalization.

Motivation: Current LLMs face the "impossible trinity" problem in vertical domains: they struggle to simultaneously achieve low hallucination, deep logical coherence, and personalized expression due to the Statistical Smoothing Trap, which overlooks expert-level cognitive processes.

Method: DeepNews Framework: 1) Dual-granularity retrieval with 10:1 saturated information input ratio based on information foraging theory; 2) Schema-guided strategic planning using domain expert knowledge bases and Atomic Blocks; 3) Adversarial constraint prompting with techniques like Rhythm Break and Logic Fog to disrupt probabilistic smoothness.

Result: Experiments reveal a Knowledge Cliff: truthfulness collapses when retrieved context falls below 15,000 characters, while high-redundancy input (>30,000 characters) stabilizes the Hallucination-Free Rate above 85%. In a blind test with a top-tier Chinese tech media outlet, DeepNews (built on DeepSeek-V3) achieved a 25% acceptance rate vs. 0% for zero-shot GPT-5.

Conclusion: The DeepNews Framework successfully addresses the “impossible trinity” by explicitly modeling expert cognitive processes, demonstrating that structured workflows can overcome fundamental limitations of current generative paradigms in domain-specific long-form writing.

Abstract: Central to long-form text generation in vertical domains is the “impossible trinity” confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text. Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system, built on a previous-generation model (DeepSeek-V3-0324), achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).

[4] PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset

Moonsoo Park, Jeongseok Yun, Bohyung Kim

Main category: cs.CL

TL;DR: A two-stage prompting framework that infers explicit and implicit personas from short reviews to generate personalized responses using LLMs, evaluated on Korean food delivery data.

Motivation: Personalized review response generation is challenging when user information is limited (like in food delivery platforms). LLMs produce generic responses without contextual user data, reducing engagement and effectiveness.

Method: Two-stage prompting framework: 1) Infer both explicit (user-stated preferences) and implicit (demographic/stylistic cues) personas directly from short review texts, 2) Incorporate inferred persona attributes into response generation prompt to produce user-tailored replies. Adjust decoding temperature during inference to encourage diverse yet faithful generations.
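
A minimal sketch of such a two-stage prompt chain, assuming a generic `llm(prompt, temperature)` chat-completion callable; the prompt wording and temperature values are illustrative assumptions, not the paper's.

```python
from typing import Callable

def respond_to_review(review: str, llm: Callable[[str, float], str]) -> str:
    # Stage 1: infer explicit and implicit persona cues from the short review.
    persona = llm(
        "Infer the reviewer's explicit preferences and implicit "
        f"demographic/stylistic cues from this review:\n{review}",
        0.3,  # lower temperature keeps persona inference stable (assumed value)
    )
    # Stage 2: condition the reply on the inferred persona.
    return llm(
        f"Persona: {persona}\nWrite a short, tailored reply to this review:\n{review}",
        0.9,  # higher temperature encourages diverse yet faithful replies (assumed value)
    )
```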

Result: Evaluated on a real-world dataset from a Korean food delivery app, assessing impact on precision, diversity, and semantic consistency; the results show the effectiveness of persona-augmented prompting in enhancing relevance and personalization.

Conclusion: Persona-augmented prompting effectively enhances relevance and personalization of automated responses without requiring model fine-tuning, addressing the limitation of LLMs producing generic responses when lacking user context.

Abstract: Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.

[5] Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi, Bernard Ghanem

Main category: cs.CL

TL;DR: CL approaches effectively mitigate safety degradation during LLM fine-tuning, with DER performing best across tasks and models.

Motivation: Safety alignment of LLMs is crucial as they become more democratized, but fine-tuning for new tasks causes safety degradation due to catastrophic forgetting.

Method: Framed safety preservation as a continual learning problem, adapted CL approaches (regularization-based, memory-based, model merging), and evaluated them in a fine-tuning-as-a-service setup with benign and poisoned user data.
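
For readers unfamiliar with DER (dark experience replay), the sketch below shows the generic update it adds to fine-tuning: alongside the new-task loss, the model's current logits on replayed examples are pulled toward logits recorded earlier, here imagined as replaying safety-alignment data. The buffer interface and `alpha` are assumptions, and this is the textbook DER objective rather than the paper's exact training code.

```python
import torch.nn.functional as F

def der_step(model, batch, buffer, alpha=0.5):
    """One DER-style update: new-task loss plus logit-matching replay."""
    inputs, labels = batch
    loss = F.cross_entropy(model(inputs), labels)  # loss on the user's task
    if len(buffer) > 0:
        # Replay safety examples: match current logits to the logits the
        # safety-aligned model produced when the examples were stored.
        buf_inputs, buf_logits = buffer.sample()  # hypothetical buffer API
        loss = loss + alpha * F.mse_loss(model(buf_inputs), buf_logits)
    return loss
```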

Result: CL approaches consistently achieve lower attack success rates than standard fine-tuning, with DER outperforming other CL methods and existing safety-preserving baselines while maintaining task utility.

Conclusion: Continual learning is a practical solution to preserve safety during LLM fine-tuning, generalizing across multiple downstream tasks and model families.

Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user’s selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.

[6] Unsupervised Acquisition of Discrete Grammatical Categories

David Ph. Shakouri, Crit Cremers, Niels O. Schiller

Main category: cs.CL

TL;DR: A computational multi-agent system demonstrates how a daughter language model can acquire abstract grammatical knowledge from mother-generated exemplars through statistical analysis and hierarchical clustering.

Motivation: To investigate how abstract grammatical knowledge can be acquired through computational modeling without direct access to internal linguistic knowledge, using only language exemplars as input.

Method: A multi-agent system with mother and daughter language models; daughter learns from mother-generated utterances using hierarchical agglomerative cluster analysis to identify grammatical patterns and form discrete rules.
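
As a toy illustration of the clustering step, the snippet below groups words by the similarity of their distributional context vectors using hierarchical agglomerative clustering; the words, counts, and cluster cut are invented, and the paper's feature scheme is richer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

words = ["dog", "cat", "apple", "pear"]
# Rows: words; columns: invented counts of left/right-neighbor contexts.
contexts = np.array([
    [5.0, 1.0, 0.0, 4.0],  # "dog"
    [4.0, 2.0, 0.0, 5.0],  # "cat"
    [0.0, 6.0, 3.0, 1.0],  # "apple"
    [1.0, 5.0, 4.0, 0.0],  # "pear"
])

Z = linkage(contexts, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 categories
print(dict(zip(words, labels)))  # words with similar contexts share a label
```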

Result: The system successfully acquired non-trivial grammatical categories resembling those proposed by linguists, and the parameter configuration was validated with a test set showing consistent acquisition.

Conclusion: Abstract grammatical knowledge can be computationally acquired through statistical analysis of language exemplars, demonstrating a viable approach to modeling language acquisition without direct access to internal linguistic knowledge.

Abstract: This article presents experiments performed using a computational laboratory environment for language acquisition experiments. It implements a multi-agent system consisting of two agents: an adult language model and a daughter language model that aims to learn the mother language. Crucially, the daughter agent does not have access to the internal knowledge of the mother language model but only to the language exemplars the mother agent generates. These experiments illustrate how this system can be used to acquire abstract grammatical knowledge. We demonstrate how statistical analyses of patterns in the input data corresponding to grammatical categories yield discrete grammatical rules. These rules are subsequently added to the grammatical knowledge of the daughter language model. To this end, hierarchical agglomerative cluster analysis was applied to the utterances consecutively generated by the mother language model. It is argued that this procedure can be used to acquire structures resembling grammatical categories proposed by linguists for natural languages. Thus, it is established that non-trivial grammatical knowledge has been acquired. Moreover, the parameter configuration of this computational laboratory environment determined using training data generated by the mother language model is validated in a second experiment with a test set similarly resulting in the acquisition of non-trivial categories.

[7] AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding

Gyutaek Oh, Sangjoon Park, Byung-Hoon Kim

Main category: cs.CL

TL;DR: AutoMedic is a multi-agent simulation framework that automatically evaluates LLMs as clinical conversational agents by transforming static medical QA datasets into virtual patient profiles for realistic multi-turn dialogues, assessed using the multi-faceted CARE metric.

Motivation: Current medical LLM evaluation focuses on static QA benchmarks, lacking assessment of dynamic clinical conversations and multi-faceted evaluation strategies. The vast combinatorial space of patient states and interactions makes formal evaluation of clinical conversational scenarios difficult to standardize and measure quantitatively.

Method: AutoMedic transforms off-the-shelf static medical QA datasets into virtual patient profiles, enabling realistic multi-turn clinical dialogues between LLM agents. It uses a multi-agent simulation framework to create dynamic conversational scenarios and evaluates performance using the CARE metric (Clinical conversational Accuracy, Efficiency/strategy, Empathy, and Robustness).

Result: The framework demonstrates validity as an automated evaluation tool for clinical conversational agents, with findings validated by human experts. It provides practical guidelines for effective LLM development in conversational medical applications.

Conclusion: AutoMedic addresses the critical gap in evaluating LLMs for dynamic clinical conversations, offering a standardized, automated framework with comprehensive multi-faceted assessment that enables better development of trustworthy medical conversational agents.

Abstract: Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.

[8] Multilingual VLM Training: Adapting an English-Trained VLM to French

Jules Lahmi, Alexis Roger

Main category: cs.CL

TL;DR: This paper investigates methods for adapting English-trained Vision-Language Models to other languages, comparing translation-based pipelines, LoRA finetuning, and two-stage finetuning strategies, finding that dataset translation quality is a major bottleneck.

Motivation: Current Vision-Language Model advancements are primarily limited to English, reducing accessibility for non-English speakers. There's a need to extend VLM capabilities to a broader range of languages to improve global accessibility.

Method: The paper explores three adaptation methods: 1) translation-based pipeline, 2) LoRA finetuning, and 3) two-stage finetuning strategy that separates vision adaptation from language adaptation. Evaluation uses translated multimodal benchmarks and manual assessments by native experts.
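
For the LoRA route, a minimal setup with the `peft` library might look as follows; the stand-in text-only backbone, rank, and target modules are assumptions for illustration, not the authors' configuration (their model is a VLM with its own loader).

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # stand-in backbone
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only a small fraction is trainable
```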

Result: Dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of both training and evaluation. The quality of translated datasets significantly impacts model adaptation success.

Conclusion: Future efforts should focus on native-language dataset collection and improved translation strategies rather than relying solely on translation-based approaches for multilingual VLM adaptation.

Abstract: Artificial intelligence has made great progress in recent years, particularly in the development of Vision–Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non–English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.

[9] Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Yining Yang, Ben Maurer, Wenlin Chen, David Recordon, Yilun Du, Minlan Yu, Ying Zhang

Main category: cs.CL

TL;DR: Confucius Code Agent (CCA) is an open-source AI software engineer that achieves state-of-the-art performance (54.3% on SWE-Bench-Pro) through hierarchical memory, persistent note-taking, modular tool use, and automated configuration refinement.

Motivation: Existing open-source coding agents lack industrial-scale capabilities for massive repositories, long-term memory, and complex tool coordination, while proprietary agents lack transparency and extensibility. There's a need for production-grade AI software engineers that combine strong performance with openness.

Method: Built on Confucius SDK with three perspectives: Agent Experience (hierarchical working memory for long-context reasoning), User Experience (persistent note-taking for cross-session learning), and Developer Experience (modular extensions for robust tool use). A meta-agent automates configuration synthesis, evaluation, and refinement through build-test-improve loops.

Result: CCA achieves state-of-the-art Resolve@1 performance of 54.3% on SWE-Bench-Pro, substantially improving over prior coding agents. The system provides transparent, extensible, and reproducible foundation for AI agents at industrial scale.

Conclusion: Confucius SDK and CCA bridge research prototypes and production systems, offering industrial-scale AI software engineering with transparency, extensibility, and strong performance while supporting rapid development on new tasks and tool stacks.

Abstract: Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

[10] Sliding Window Attention Adaptation

Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei

Main category: cs.CL

TL;DR: FA-pretrained LLMs can be adapted to sliding window attention (SWA) for efficient long-context inference using SWAA recipes, avoiding expensive pretraining while recovering original performance.

Motivation: Self-attention in Transformers has quadratic complexity with input length, making long-context inference expensive. Sliding window attention reduces this to linear complexity, but naively switching FA-pretrained models to SWA causes severe performance degradation due to training-inference mismatch.

Method: Proposes Sliding Window Attention Adaptation (SWAA) with five methods: (1) SWA only during prefilling, (2) preserving “sink” tokens, (3) interleaving FA/SWA layers, (4) chain-of-thought (CoT), and (5) fine-tuning. Investigates synergistic combinations of these methods.
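
Methods (1) and (2) can be pictured as a modified attention mask: causal attention restricted to a local window, with the first few "sink" tokens kept globally visible. A small sketch, with window size and sink count chosen arbitrarily:

```python
import torch

def swa_mask(seq_len: int, window: int = 4, n_sinks: int = 2) -> torch.Tensor:
    """True where a query position may attend to a key position."""
    i = torch.arange(seq_len).unsqueeze(1)  # query index
    j = torch.arange(seq_len).unsqueeze(0)  # key index
    causal = j <= i
    in_window = (i - j) < window            # local sliding window
    is_sink = j < n_sinks                   # first tokens stay globally visible
    return causal & (in_window | is_sink)

print(swa_mask(8, window=3, n_sinks=1).int())
```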

Result: SWA adaptation is feasible but non-trivial: no single method suffices, but specific synergistic combinations effectively recover original long-context performance. Analysis of performance-efficiency trade-offs provides recommended recipes for different scenarios.

Conclusion: FA-pretrained LLMs can be successfully adapted to SWA without expensive pretraining through carefully designed SWAA recipes, offering practical solutions for efficient long-context inference with good performance recovery.

Abstract: The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving “sink” tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation

[11] Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

Youmin Ko, Sungjong Seo, Hyunjoon Kim

Main category: cs.CL

TL;DR: CoopRAG is a novel retrieval-augmented generation framework where retriever and LLM cooperate through knowledge exchange, and retriever layers cooperate for better document ranking, achieving SOTA on QA tasks.

Motivation: Existing RAG methods for QA are prone to incorrect retrievals and hallucinations despite being designed to mitigate LLMs' factual inaccuracies. There's a need for better cooperation between retrieval and generation components.

Method: 1) Unroll questions into sub-questions with masked uncertain positions in reasoning chains; 2) Retrieve documents using augmented queries; 3) Rerank documents by contrasting retriever layers; 4) Reconstruct reasoning chains by filling masked positions via LLM.

Result: CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets and one simple QA dataset in both retrieval and QA performance metrics.

Conclusion: CoopRAG’s cooperative framework between retriever and LLM, and within retriever layers, effectively addresses retrieval inaccuracies and hallucinations in QA tasks, demonstrating superior performance across different QA complexities.

Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG.

[12] T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov, Vladislav Kruglikov, Nikita Surkov, German Abramov, Pavel Gein, Dmitry Abulkhanov, Mikhail Gashkov, Viktor Zelenkovskiy, Artem Batalov, Aleksandr Medvedev, Anatolii Potapov

Main category: cs.CL

TL;DR: T-pro 2.0 is an open-weight Russian LLM for hybrid reasoning and efficient inference with Cyrillic tokenizer and EAGLE speculative decoding.

Motivation: To create an accessible open system for building and evaluating efficient, practical Russian LLM applications with support for both direct answering and reasoning-trace generation.

Method: Uses a Cyrillic-dense tokenizer and adapted EAGLE speculative-decoding pipeline to reduce latency, with hybrid reasoning capabilities.
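
For context, EAGLE belongs to the speculative-decoding family, whose basic propose-then-verify loop is sketched below with greedy verification; `draft_next` and `target_argmax` are placeholder callables, and EAGLE's feature-level drafting and acceptance rules are considerably richer than this.

```python
def speculative_step(prefix, draft_next, target_argmax, k=4):
    """draft_next(ctx) -> token; target_argmax(prefix, proposal) -> the target
    model's token at each drafted position (both are placeholders)."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                          # cheap draft model proposes k tokens
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)
    verified = target_argmax(prefix, proposal)  # one target-model forward pass
    accepted = []
    for drafted, checked in zip(proposal, verified):
        if drafted != checked:                  # first mismatch: take the target's
            accepted.append(checked)            # token and stop accepting
            break
        accepted.append(drafted)
    return list(prefix) + accepted
```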

Result: Released the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face, along with a public web demo that illustrates the speedups.

Conclusion: T-pro 2.0 serves as an accessible open system for reproducible and extensible research on Russian-language reasoning and efficient inference.

Abstract: We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.

[13] Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring “Tortured Phrases” in Scientific Literature

Agniva Maiti, Prajwal Panth, Suresh Chandra Satapathy

Main category: cs.CL

TL;DR: SRAP framework detects adversarial plagiarism using statistical anomaly detection and semantic reconstruction to recover original terminology from obfuscated scientific text.

Motivation: Scientific literature faces threats from automated paraphrasing tools that generate "tortured phrases" to mask plagiarism. Existing detection methods rely on static blocklists or general-domain language models, which fail to detect novel obfuscations and cannot identify source documents.

Method: Two-stage architecture: (1) statistical anomaly detection using SciBERT with token-level pseudo-perplexity to identify improbable phrases, and (2) source-based semantic reconstruction using FAISS for dense vector retrieval and SBERT for sentence-level alignment to recover original terminology.
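
Stage (1) rests on token-level pseudo-perplexity, computed by masking one token at a time and scoring it with the masked LM. A sketch using SciBERT via `transformers` (the loop is the standard pseudo-log-likelihood computation, not the authors' exact code):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
mlm = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased").eval()

@torch.no_grad()
def token_pseudo_logprobs(sentence: str) -> list[float]:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    scores = []
    for pos in range(1, len(ids) - 1):       # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tok.mask_token_id      # mask one token at a time
        logits = mlm(input_ids=masked.unsqueeze(0)).logits[0, pos]
        logp = torch.log_softmax(logits, dim=-1)[ids[pos]].item()
        scores.append(logp)                  # very low values flag improbable phrases
    return scores
```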

Result: Achieves 23.67% restoration accuracy on adversarial scientific text, significantly outperforming zero-shot baselines (0.00%). Shows static decision boundaries are necessary for robust detection in jargon-heavy scientific text, as dynamic thresholding fails under high variance.

Conclusion: SRAP enables forensic analysis by detecting adversarial plagiarism and linking obfuscated expressions back to their most probable source documents, addressing limitations of existing methods that cannot determine source content.

Abstract: The integrity and reliability of scientific literature is facing a serious threat by adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate “tortured phrases”, statistically improbable synonyms (e.g. “counterfeit consciousness” for “artificial intelligence”), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.

[14] Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT

Nour El Houda Ben Chaabene, Hamza Hammami

Main category: cs.CL

TL;DR: Integrating Knowledge Graphs with LLMs via KG-BERT improves factual reliability and reasoning in knowledge-intensive NLP tasks.

Motivation: LLMs excel at NLP but lack structured knowledge, leading to factual inconsistencies. There's a need to enhance grounding and reasoning capabilities.

Method: Integrate Knowledge Graphs (KGs) with LLMs using KG-BERT to provide structured knowledge and improve grounding.

Result: Experiments show significant gains in knowledge-intensive tasks like question answering and entity linking, improving factual reliability.

Conclusion: KG integration enables more context-aware, factually reliable next-generation LLMs with enhanced reasoning capabilities.

Abstract: Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.

[15] Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis

Nour El Houda Ben Chaabene, Hamza Hammami, Laid Kahloul

Main category: cs.CL

TL;DR: A psychologically-aware conversational AI agent that improves both learning outcomes and emotional well-being by combining LLMs, KG-BERT, and bidirectional LSTM with attention to analyze students’ cognitive and affective states in real-time using multimodal data.

Motivation: Prior educational chatbots are limited to either tutoring or affective support, failing to address both cognitive and emotional needs simultaneously. There's a need for systems that can understand students' engagement, stress, and conceptual understanding in real-time to provide truly adaptive, student-centered interventions.

Method: Combines Large Language Models (LLMs), knowledge graph-enhanced BERT (KG-BERT), and bidirectional LSTM with attention to classify students’ cognitive and affective states. Uses multimodal data including textual semantics, prosodic speech features, and temporal behavioral trends for real-time analysis.

Result: Pilot study with university students showed improved motivation, reduced stress, and moderate academic gains compared to baseline methods. The system effectively inferred engagement, stress, and conceptual understanding.

Conclusion: Integrating semantic reasoning, multimodal fusion, and temporal modeling shows promise for creating adaptive, student-centered educational interventions that address both learning performance and emotional well-being simultaneously.

Abstract: This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students’ cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data, including textual semantics, prosodic speech features, and temporal behavioral trends, to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.

[16] Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs

Lars G. B. Johnsen

Main category: cs.CL

TL;DR: LLMs trained only on surface forms show sensitivity to syntactic structure by distinguishing grammatical vs. ungrammatical variants in subject-auxiliary inversion and parasitic gap licensing, suggesting they develop structural representations without explicit encoding.

Motivation: To test whether large language models, trained only on surface forms without explicit grammatical rules, develop underlying structural representations similar to those posited in traditional generative grammar.

Method: Evaluated GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings for two classic constructions: subject-auxiliary inversion (testing subject boundary recognition) and parasitic gap licensing (testing abstract dependency structure).

Result: LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, showing sensitivity to structure rather than just linear order. Structural generalizations emerge from predictive training on surface forms.

Conclusion: LLMs develop functional sensitivity to syntax without explicit encoding, supporting that structural representations can emerge from surface-form training, challenging traditional views about what constitutes evidence for syntactic structure.

Abstract: What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, and as such support that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.

[17] XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs

Iñaki Lacunza, José Javier Saiz, Alexander Shvets, Aitor Gonzalez-Agirre, Marta Villegas

Main category: cs.CL

TL;DR: The paper proposes XDoGE, an extended multilingual version of DoGE algorithm to optimize language distribution in LLM training, addressing performance issues in mid- and low-resource languages by reweighting data and training models from scratch or through continual pre-training.

Motivation: Current LLMs over-rely on high-resource languages like English, which hampers their performance in mid- and low-resource languages. There's a need to improve multilingual capabilities, particularly for underrepresented languages.

Method: Proposes XDoGE (extended DoGE) algorithm for multilingual setup: (1) train small proxy model to optimize language distribution via domain-reweighing, (2) rescale data and train full-size model with established language weights, either from scratch or through continual pre-training. Targets six Iberian languages with varying resource levels.
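
A heavily simplified sketch of the reweighting idea behind DoGE-style methods: language weights move toward languages whose proxy-model gradients align with the gradient of a generalization objective. The exponentiated-gradient form, learning rate, and tensor shapes below are assumptions, not the actual XDoGE procedure.

```python
import numpy as np

def update_language_weights(weights, domain_grads, gen_grad, lr=0.1):
    """weights: (L,) on the simplex; domain_grads: (L, D) per-language proxy
    gradients; gen_grad: (D,) gradient of the generalization objective."""
    alignment = domain_grads @ gen_grad       # reward well-aligned languages
    new = weights * np.exp(lr * alignment)    # multiplicative (EG-style) update
    return new / new.sum()                    # renormalize onto the simplex
```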

Result: Developed the IberianLLM-7B-Instruct model centered on Iberian languages and English, pretrained from scratch and improved using CPT with the XDoGE weights. Investigated the effects of data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework.

Conclusion: The proposed XDoGE approach effectively addresses multilingual imbalance in LLMs, resulting in improved performance for mid- and low-resource Iberian languages while maintaining capabilities in high-resource languages, demonstrated through the released IberianLLM-7B-Instruct model.

Abstract: Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighing DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.

[18] Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models

Amartya Roy, Elamparithy M, Kripabandhu Ghosh, Ponnurangam Kumaraguru, Adrian de Wynter

Main category: cs.CL

TL;DR: ICL alone is insufficient for reliable causal reasoning; encoder and encoder-decoder models with fine-tuning outperform decoder-only models for robust causal reasoning at smaller scales.

Motivation: The role and performance of in-context learning (ICL) in causal reasoning remains unclear, especially given that causal reasoning requires multihop composition and strict conjunctive control, and LLMs might rely on spurious lexical relations leading to misleading results.

Method: Compare fine-tuned versions of encoder, encoder-decoder, and decoder-only architectures with zero and few-shot ICL in both natural language and non-natural language scenarios to evaluate their causal reasoning capabilities.

Result: ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. Decoder-only models are brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models generalize more robustly across tests, including non-natural language splits. Decoder-only architectures only match or surpass them at large scales.

Conclusion: For cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable over decoder-only models, especially at smaller scales.

Abstract: In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited for said multihop conjunctive reasoning versus decoder-only models. To do this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models can generalize more robustly across our tests, including the non-natural-language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.

[19] RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems

Hang Ding, Qiming Feng, Dongqi Liu, Qi Zhao, Tao Yao, Shuo Wang, Dongsheng Chen, Jian Li, Zhenye Gan, Jiangning Zhang, Chengjie Wang, Yabiao Wang

Main category: cs.CL

TL;DR: RoleRMBench is the first benchmark for reward modeling in role-playing dialogue, revealing gaps in existing models. RoleRM with Continuous Implicit Preferences outperforms other models by 24% on average.

Motivation: Existing reward models fail in subjective, open-ended domains like role play, struggling to capture nuanced persona-grounded human judgments. There's a need for specialized evaluation and improvement in these areas.

Method: 1) Created RoleRMBench benchmark covering 7 fine-grained capabilities for role-playing dialogue evaluation. 2) Proposed RoleRM reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision with multiple structuring strategies.
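
As a reference point, the sketch below shows the Bradley-Terry pairwise loss that reward-model training typically builds on, with an optional per-pair preference-strength weight as one simplified reading of "continuous implicit preferences"; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen: torch.Tensor,
                             r_rejected: torch.Tensor,
                             strength=None) -> torch.Tensor:
    """r_*: reward-model scores; strength: optional per-pair weight in (0, 1]."""
    loss = -F.logsigmoid(r_chosen - r_rejected)  # push chosen above rejected
    if strength is not None:
        loss = strength * loss                   # weight by preference strength (assumed)
    return loss.mean()
```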

Result: Evaluation shows large gaps between general-purpose reward models and human judgment, especially in narrative/stylistic dimensions. RoleRM surpasses strong open/closed-source reward models by over 24% on average, with substantial gains in narrative coherence and stylistic fidelity.

Conclusion: Continuous preference representation and annotation consistency are crucial for subjective alignment in human-centered dialogue systems. RoleRMBench establishes foundation for subjective alignment, showing specialized approaches are needed for role-playing domains.

Abstract: Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.

[20] AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence

Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Shijian Li

Main category: cs.CL

TL;DR: AgriGPT-Omni is an agricultural omni-framework integrating speech, vision, and text with multilingual support, featuring the largest agricultural speech dataset, a three-stage training paradigm, and a comprehensive tri-modal benchmark.

Motivation: Agricultural applications face three key constraints: lack of multilingual speech data, absence of unified multimodal architectures, and missing comprehensive evaluation benchmarks for multimodal agricultural AI systems.

Method: 1) Created scalable data synthesis pipeline for agricultural speech data (492K synthetic + 1.4K real samples across 6 languages); 2) Three-stage training: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning; 3) Developed AgriBench-Omni-2K tri-modal benchmark with standardized protocols.
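
The GRPO stage centers on group-relative advantages: each sampled completion is scored against the mean and standard deviation of its own group, as sketched below; the full objective also adds a clipped policy-gradient term and a KL penalty, which are omitted here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_groups, samples_per_group) scores for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # each sample relative to its own group
```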

Result: AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding tasks.

Conclusion: The framework addresses key agricultural AI challenges by providing unified multimodal reasoning, multilingual support, and comprehensive evaluation, with all resources released to promote reproducible research and inclusive agricultural intelligence for low-resource regions.

Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.

[21] From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages

Smiljana Antonijevic Ubois

Main category: cs.CL

TL;DR: This paper examines how AI language technologies for low-resource languages like Serbian reproduce cultural/linguistic biases, and proposes a “Data Care” framework based on CARE principles to build more inclusive, culturally-grounded language technologies.

Motivation: Large language models are predominantly trained on dominant languages like English, leading to cultural and linguistic biases in their representation of low-resource languages. The study uses Serbian as a case to understand the structural, historical, and sociotechnical factors shaping language technology development for such languages in the AI age.

Method: The study uses semi-structured interviews with ten scholars and practitioners (linguists, digital humanists, and AI developers) to trace challenges in Serbian language technology development. It examines historical destruction of Serbian textual heritage and contemporary issues driving reductive, engineering-first approaches.

Result: The research identifies key challenges including superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. These issues stem from historical factors and contemporary pressures that prioritize functionality over linguistic nuance.

Conclusion: The study proposes “Data Care”, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics) that reframes bias mitigation from a post-hoc technical fix to an integral component of corpus design, annotation, and governance. This offers a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces power imbalances and cultural blind spots.

Abstract: Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in the historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics) that reframes bias mitigation from a post-hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.

[22] Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation

Rebekka Görge, Sujan Sai Gannamaneni, Tabea Naeven, Hammam Abdelwahab, Héctor Allende-Cid, Armin B. Cremers, Lennard Helmer, Michael Mock, Anna Schmitz, Songkai Xue, Elif Yildirir, Maximilian Poretschkin, Stefan Wrobel

Main category: cs.CL

TL;DR: Proposes a comprehensive pipeline for detecting and mitigating two types of data bias (representation bias and explicit stereotypes) in LLM training data, with evaluation showing successful data debiasing but inconsistent model bias reduction.

DetailsMotivation: LLM training data exhibits harmful biases against protected groups, and while regulations like the EU AI Act require bias mitigation, practical guidance and operationalization are lacking.

Method: Four-component pipeline: 1) LLM-generated word lists for group label detection, 2) Demographic Representation Score for representation bias quantification, 3) Sociolinguistically informed filtering for stereotype detection/mitigation, 4) Grammar- and Context-Aware Counterfactual Data Augmentation for representation bias compensation.

Result: Successfully reduces representation bias and explicit stereotypes in text datasets (validated by human evaluation), but LLMs fine-tuned on debiased data show inconsistent improvement on bias benchmarks, revealing gaps in current evaluation methods.

Conclusion: The pipeline effectively debiases training data, but current bias evaluation methodologies have critical gaps, and targeted data manipulation is needed to address manifested model bias.

Abstract: Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
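To make the representation-bias component concrete, here is a minimal Python sketch in the spirit of the pipeline's second step. The abstract does not give the Demographic Representation Score formula, so the lexicon-count and distance-to-parity heuristic below are assumptions, and `group_lexicons` stands in for the LLM-generated word lists of the first component.

```python
# Hypothetical sketch of a representation-bias score; the exact Demographic
# Representation Score formula is not given in the abstract, so this uses a
# simple distance-to-parity heuristic over group-label counts.
import re
from collections import Counter

def representation_score(texts, group_lexicons):
    """Return per-group mention shares and their distance to parity.

    group_lexicons: dict mapping group name -> list of label words
    (standing in for the pipeline's LLM-generated word lists).
    """
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for group, words in group_lexicons.items():
            counts[group] += sum(tokens.count(w) for w in words)
    total = sum(counts.values()) or 1
    shares = {g: counts[g] / total for g in group_lexicons}
    parity = 1 / len(group_lexicons)
    # Total variation distance to a uniform distribution: 0 = balanced.
    tvd = 0.5 * sum(abs(s - parity) for s in shares.values())
    return shares, tvd

docs = ["The actor and his brother met the nurse.",
        "She thanked her sister and the actress."]
lex = {"male": ["actor", "brother", "his", "he"],
       "female": ["actress", "sister", "her", "she"]}
print(representation_score(docs, lex))
```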

[23] Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen

Main category: cs.CL

TL;DR: OPV (Outcome-based Process Verifier) combines outcome and process verification to efficiently detect errors in long reasoning chains, using active learning and expert annotations to reduce annotation costs while achieving state-of-the-art performance.

DetailsMotivation: Current verifiers have limitations: outcome-based verifiers (OVs) can't inspect intermediate steps in long reasoning chains, while process-based verifiers (PVs) struggle with reliable error detection due to scarcity of high-quality annotations caused by prohibitive human annotation costs.

Method: Proposes OPV that verifies rationale process of summarized outcomes from long CoTs. Uses iterative active learning framework with expert annotations: most uncertain cases are annotated and used to train new OPV through Rejection Fine-Tuning (RFT) and RLVR in each iteration.

Result: Achieves SOTA on the held-out OPV-Bench with an F1 score of 83.1 vs 76.3 for Qwen3-Max-Preview. Effectively detects false positives in synthetic datasets, aligning with expert assessment. When collaborating with policy models, raises accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025.

Conclusion: OPV provides accurate and efficient verification of long reasoning chains, enabling large-scale annotation with reduced costs through active learning, and demonstrates superior performance and broad applicability across various benchmarks and model collaborations.

Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement also depends on automated oversight from reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV’s superior performance and broad applicability. It achieves new state-of-the-art results on our held-out benchmark, OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
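A minimal sketch of the iterative annotation loop the abstract describes, assuming uncertainty is measured as closeness of the verifier's score to 0.5; `annotate` and `finetune` are hypothetical stand-ins for expert labeling and the RFT+RLVR training step.

```python
# Sketch of one uncertainty-driven active-learning round, as described for
# OPV. All callables are hypothetical stand-ins, not the paper's components.
def active_learning_round(verifier, pool, budget, annotate, finetune):
    # Score each candidate rationale; probability near 0.5 = most uncertain.
    scored = sorted(pool, key=lambda ex: abs(verifier(ex) - 0.5))
    to_label = scored[:budget]                    # most uncertain cases first
    labeled = [(ex, annotate(ex)) for ex in to_label]
    new_verifier = finetune(verifier, labeled)    # RFT + RLVR in the paper
    remaining = [ex for ex in pool if ex not in to_label]
    return new_verifier, remaining

pool = [{"id": i} for i in range(10)]
verifier = lambda ex: (ex["id"] % 10) / 10        # dummy verifier scores
annotate = lambda ex: ex["id"] % 2                # dummy expert label
finetune = lambda v, labeled: v                   # placeholder training step
_, rest = active_learning_round(verifier, pool, 3, annotate, finetune)
print(len(rest))  # 7 unlabeled examples carried into the next round
```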

[24] TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage

Elroy Galbraith, Chadwick Sutherland, Donahue Morgan

Main category: cs.CL

TL;DR: TRIDENT is a dispatcher-support system for emergency calls that uses Caribbean-accent-tuned ASR, entity extraction, and bio-acoustic distress detection to provide structured inputs for human triage protocols when ASR fails.

DetailsMotivation: Emergency speech recognition systems perform poorly on non-standard English varieties like Caribbean accents, creating critical service gaps for Caribbean populations during emergencies.

Method: Three-layer architecture: 1) Caribbean-accent-tuned ASR with confidence scoring, 2) Local entity extraction via LLMs for clinical indicators, 3) Bio-acoustic distress detection for vocal stress indicators. Key insight: low ASR confidence serves as queue prioritization signal when combined with vocal distress.

Result: Establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols (ESI for routine, START for mass casualty). Empirical validation on Caribbean emergency calls remains future work.

Conclusion: TRIDENT provides a practical solution for emergency systems to handle Caribbean accents by combining multiple complementary signals, ensuring equitable access to triage protocols even when ASR fails, with deployment considerations for offline disaster scenarios.

Abstract: Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal – particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss. We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
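A hedged sketch of the paper's key insight as a scoring rule: low ASR confidence combined with elevated vocal distress raises queue priority rather than signaling failure. The weights and the entity-driven floor are illustrative assumptions, not TRIDENT's calibrated values.

```python
# Illustrative priority signal combining the three TRIDENT layers; the 0.5
# weights and the 0.8 floor are assumptions for the sketch.
def triage_priority(asr_confidence, distress_score, critical_entities):
    """asr_confidence and distress_score in [0, 1]; entities from the LLM layer."""
    priority = 0.5 * (1.0 - asr_confidence) + 0.5 * distress_score
    # Composed callers may report life-threatening events without vocal
    # stress, so extracted clinical entities give a complementary semantic bump.
    if critical_entities:
        priority = max(priority, 0.8)
    return min(priority, 1.0)

print(triage_priority(asr_confidence=0.35, distress_score=0.9,
                      critical_entities=["not breathing"]))  # -> 0.8
```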

[25] OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen

Main category: cs.CL

TL;DR: OPV (Outcome-based Process Verifier) combines outcome and process verification to accurately inspect long reasoning chains in LLMs with efficient large-scale annotation.

DetailsMotivation: Current outcome-based verifiers can't inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers struggle with reliable error detection due to high annotation costs and scarcity of high-quality annotations.

Method: Proposes OPV that verifies rationale process from summarized outcomes of long CoTs, using iterative active learning with expert annotations. Employs Rejection Fine-Tuning (RFT) and RLVR to progressively improve verification with fewer annotation costs.

Result: Achieves SOTA on OPV-Bench with F1 score of 83.1 vs 76.3 for Qwen3-Max-Preview. Effectively detects false positives in synthetic data and raises DeepSeek-R1-Distill-Qwen-32B accuracy from 55.2% to 73.3% on AIME2025.

Conclusion: OPV provides accurate and efficient verification for long reasoning chains, enabling large-scale annotation and consistently improving policy model performance across various benchmarks.

Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement also depends on automated oversight from reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV’s superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.

[26] Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation

Kevin Glocker, Kätriin Kukk, Romina Oji, Marcel Bollmann, Marco Kuhlmann, Jenny Kunz

Main category: cs.CL

TL;DR: Scaling English base models is more data-efficient for language adaptation than continued pretraining, reducing catastrophic forgetting while enabling better multilingual merging.

DetailsMotivation: Multilingual models underperform compared to language-specific adaptations, especially at smaller scales. Need efficient strategies to adapt pretrained models to new languages while preserving base language capabilities.

Method: Comprehensive scaling ablations with FLOP-matched models, comparing upscaling English base models vs. continued pretraining. Tested merging of scaled language-specific models to create modular multilingual systems.

Result: Larger upscaled models match/surpass smaller models continually pretrained on more data, showing scaling benefits for data efficiency. Scaling reduces catastrophic forgetting in English. Merging scaled models performs better than smaller merges but still less effective than joint multilingual training.

Conclusion: Scaling is an efficient strategy for language adaptation, improving data efficiency and reducing forgetting. Merging shows potential but needs specialized approaches for language-level integration to match joint training performance.

Abstract: Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model’s capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
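For intuition, the simplest merging baseline is uniform parameter averaging of the language-specific checkpoints; the paper compares several merging methods, which this sketch does not reproduce.

```python
# Minimal sketch of the plainest merging baseline: (weighted) parameter
# averaging of checkpoints with identical architectures.
import torch

def average_merge(state_dicts, weights=None):
    """Weighted average of N checkpoints sharing the same keys and shapes."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

a = {"w": torch.ones(2, 2)}   # e.g. the Swedish-adapted model's weights
b = {"w": torch.zeros(2, 2)}  # e.g. the German-adapted model's weights
print(average_merge([a, b])["w"])  # 0.5 everywhere
```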

[27] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting

Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi

Main category: cs.CL

TL;DR: LLMs perform worse on romanized Indian language text vs native scripts in maternal healthcare triage, causing potential safety issues despite understanding semantic intent.

DetailsMotivation: LLMs are increasingly used in high-stakes clinical applications in India, where speakers often use romanized text rather than native scripts, but existing research rarely evaluates this orthographic variation with real-world data.

Method: Benchmarked leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali, comparing performance on romanized vs native script messages in maternal and newborn healthcare triage.

Result: Consistent performance degradation for romanized messages with F1 scores trailing native scripts by 5-12 points, potentially causing nearly 2 million excess triage errors at a partner organization. LLMs often correctly infer semantic intent but final classifications remain brittle to orthographic noise.

Conclusion: Reveals a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably, highlighting the need for better handling of orthographic variations in clinical applications.

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
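The core measurement reduces to computing macro-F1 separately per script and reporting the gap; a small sketch with illustrative field names follows.

```python
# Sketch of the per-script evaluation: macro-F1 for native-script vs.
# romanized queries, reporting the difference. Field names are illustrative.
from sklearn.metrics import f1_score

def script_gap(examples):
    """examples: list of dicts with 'script', 'gold', 'pred' keys."""
    by_script = {}
    for script in ("native", "roman"):
        subset = [e for e in examples if e["script"] == script]
        y_true = [e["gold"] for e in subset]
        y_pred = [e["pred"] for e in subset]
        by_script[script] = f1_score(y_true, y_pred, average="macro")
    return by_script["native"] - by_script["roman"]

data = [{"script": "native", "gold": "urgent", "pred": "urgent"},
        {"script": "native", "gold": "routine", "pred": "routine"},
        {"script": "roman", "gold": "urgent", "pred": "routine"},
        {"script": "roman", "gold": "routine", "pred": "routine"}]
print(script_gap(data))  # positive gap = native scripts score higher
```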

[28] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, Yonatan Bitton, Adam Bloniarz, Aijun Bai, Andrew Wang, Anfal Siddiqui, Arturo Bajuelos Castillo, Aviel Atias, Chang Liu, Corey Fry, Daniel Balle, Deepanway Ghosal, Doron Kukliansky, Dror Marcus, Elena Gribovskaya, Eran Ofek, Honglei Zhuang, Itay Laish, Jan Ackermann, Lily Wang, Meg Risdal, Megan Barnes, Michael Fink, Mohamed Amin, Moran Ambar, Natan Potikha, Nikita Gupta, Nitzan Katz, Noam Velan, Ofir Roval, Ori Ram, Polina Zablotskaia, Prathamesh Bang, Priyanka Agrawal, Rakesh Ghiya, Sanjay Ganapathy, Simon Baumgartner, Sofia Erell, Sushant Prakash, Thibault Sellam, Vikram Rao, Xuanhui Wang, Yaroslav Akulov, Yulong Yang, Zhen Yang, Zhixin Lai, Zhongru Wu, Anca Dragan, Avinatan Hassidim, Fernando Pereira, Slav Petrov, Srinivasan Venkatachary, Tulsee Doshi, Yossi Matias, Sasha Goldshtein, Dipanjan Das

Main category: cs.CL

TL;DR: The FACTS Leaderboard is a comprehensive evaluation suite that measures language models’ factuality across four scenarios: multimodal image questions, parametric knowledge, search-based information seeking, and document-grounded long-form responses.

DetailsMotivation: There's a need for holistic evaluation of language models' factuality across diverse real-world scenarios, moving beyond single-dimensional assessments to capture how models perform in different factual contexts.

Method: The suite aggregates performance across four sub-leaderboards: (1) FACTS Multimodal for image-based questions, (2) FACTS Parametric for closed-book factoid questions, (3) FACTS Search for information-seeking with search API, and (4) FACTS Grounding (v2) for document-grounded long-form responses. Automated judge models score responses, and the final score averages all four components.

Result: The paper introduces an actively maintained online leaderboard with both public and private splits, available at https://www.kaggle.com/benchmarks/google/facts, providing a robust framework for evaluating model factuality.

Conclusion: The FACTS Leaderboard Suite offers a comprehensive, balanced assessment of language models’ factuality across diverse scenarios, addressing the need for holistic evaluation in the field.

Abstract: We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models’ world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model’s overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
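Since the suite's final score is described as a plain average of the four sub-leaderboards, the aggregation itself is a one-liner (the input numbers below are invented):

```python
# Final suite score as described: the mean of the four sub-leaderboard scores.
def facts_score(multimodal, parametric, search, grounding):
    return (multimodal + parametric + search + grounding) / 4.0

print(facts_score(0.71, 0.64, 0.68, 0.83))  # illustrative sub-scores only
```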

[29] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification

Michael Schlee, Christoph Weisser, Timo Kivimäki, Melchizedek Mashiku, Benjamin Saefken

Main category: cs.CL

TL;DR: LabelFusion is an ensemble method that combines traditional transformer classifiers with LLMs for text classification, using a fusion MLP to integrate both sources for improved accuracy and cost-aware predictions.

DetailsMotivation: To leverage complementary strengths of traditional transformer-based classifiers and Large Language Models (LLMs) for text classification, enabling robust performance while managing practical trade-offs between accuracy, latency, and cost.

Method: Concatenates transformer embeddings with LLM-derived per-class scores (from structured prompt-engineering), feeds this joint representation into a compact multi-layer perceptron (FusionMLP) for end-to-end training.

Result: Achieves 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification, demonstrating robust performance across domains with practical accuracy-cost trade-offs.

Conclusion: LabelFusion successfully combines transformer classifiers and LLMs through learned fusion, delivering accurate, cost-aware text classification across multi-class and multi-label tasks with a simple interface for users.

Abstract: LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone’s embeddings with the LLM-derived per-class scores – obtained through structured prompt-engineering strategies – and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains – achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification – while enabling practical trade-offs between accuracy, latency, and cost.
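A minimal PyTorch sketch of the described fusion: transformer embeddings concatenated with LLM-derived per-class scores, fed to a small MLP. The dimensions and hidden size are illustrative, not LabelFusion's defaults.

```python
# Sketch of the FusionMLP idea: classify from the concatenation of backbone
# embeddings and LLM per-class scores. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, embed_dim, num_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + num_classes, hidden),  # joint representation
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, embeddings, llm_class_scores):
        joint = torch.cat([embeddings, llm_class_scores], dim=-1)
        return self.net(joint)

model = FusionMLP(embed_dim=768, num_classes=4)  # e.g. RoBERTa-base + AG News
logits = model(torch.randn(2, 768), torch.softmax(torch.randn(2, 4), dim=-1))
print(logits.shape)  # torch.Size([2, 4])
```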

[30] Quantifying Emotional Tone in Tolkien’s The Hobbit: Dialogue Sentiment Analysis with RegEx, NRC-VAD, and Python

Lilin Qiu

Main category: cs.CL

TL;DR: Computational analysis of emotional tone in Tolkien’s The Hobbit reveals a generally positive, calm dialogue with increasing agency, showing how tension and comfort rhythmically alternate throughout the story.

DetailsMotivation: To uncover the subtle emotional structures and rhythm in Tolkien's The Hobbit using computational text analysis methods, bridging digital tools with literary interpretation.

Method: Extracted dialogue using regular expressions, preprocessed text, and scored emotional dimensions using the NRC-VAD lexicon to quantify valence, arousal, and dominance.

Result: Dialogue maintains generally positive (high valence) and calm (low arousal) tone, with gradually increasing agency (dominance) as story progresses. Visualizations show rhythmic cycling between tension and comfort.

Conclusion: Computational methods can reveal subtle emotional structures in literature, demonstrating how Tolkien’s language creates a steady rhythm of emotional modulation that shapes The Hobbit’s storytelling.

Abstract: This study analyzes the emotional tone of dialogue in J. R. R. Tolkien’s The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel’s emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations – including emotional trajectory graphs and word clouds – highlight how Tolkien’s language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
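A compact sketch of the pipeline: extract quoted dialogue with a regular expression, then average NRC-VAD scores per utterance. The inline lexicon is a tiny placeholder for the real NRC-VAD file (valence/arousal/dominance values in [0, 1]).

```python
# Sketch of the dialogue-extraction and VAD-scoring steps. The lexicon below
# is a placeholder; real entries come from the NRC-VAD lexicon file.
import re

NRC_VAD = {"good": (0.85, 0.35, 0.65), "danger": (0.2, 0.9, 0.4),
           "morning": (0.7, 0.3, 0.55)}  # placeholder (valence, arousal, dominance)

def extract_dialogue(text):
    # Double-quoted spans; the novel's curly quotes would need handling too.
    return re.findall(r'"([^"]+)"', text)

def vad_score(utterance):
    words = [w for w in re.findall(r"[a-z]+", utterance.lower()) if w in NRC_VAD]
    if not words:
        return None
    dims = zip(*(NRC_VAD[w] for w in words))
    return tuple(sum(d) / len(words) for d in dims)  # mean (V, A, D)

text = '"Good morning!" said Bilbo. "There is danger ahead," warned Gandalf.'
for utt in extract_dialogue(text):
    print(utt, vad_score(utt))
```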

[31] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity

Hauke Licht

Main category: cs.CL

TL;DR: Multimodal AI models show high reliability in emotion analysis under ideal conditions but fail in real-world political settings, requiring careful evaluation before use in political research.

DetailsMotivation: While multimodal generative AI promises advances in analyzing emotions in political communication, there's a lack of evidence about its effectiveness in emotion analysis, particularly in real-world political contexts.

Method: Evaluated current multimodal large language models (mLLMs) using two complementary datasets of human-labeled video recordings: one under ideal circumstances and another of real-world parliamentary debates.

Result: Under ideal circumstances, mLLMs’ emotional arousal ratings are highly reliable with minimal demographic bias. However, in real-world parliamentary debates, mLLMs’ arousal ratings fail to deliver reliable results, potentially affecting downstream statistical inferences.

Conclusion: The study highlights the need for thorough evaluation of emerging generative AI methods in political analysis and provides a replicable framework for such assessments, cautioning against uncritical adoption in real-world political research.

Abstract: Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs’ emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs’ arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a replicable framework for such assessments.

[32] Leveraging language models for summarizing mental state examinations: A comprehensive evaluation and dataset release

Nilesh Kumar Sahu, Manjeet Yadav, Mudita Chaturvedi, Snehil Gupta, Haroon R Lone

Main category: cs.CL

TL;DR: Researchers developed a mental health dataset from 405 participants and evaluated 5 summarization models to automatically generate concise MSE summaries, addressing mental health professional shortages in developing countries.

DetailsMotivation: Limited access to mental health professionals in developing countries creates overwhelming demand and long patient wait times. Resident doctors conduct initial assessments but are constrained, necessitating automated tools to generate concise Mental State Examination summaries.

Method: Developed a 12-item descriptive MSE questionnaire, collected responses from 405 participants (9720 utterances), then evaluated 5 pre-trained summarization models with and without fine-tuning using ROUGE, SummaC, and human evaluation metrics.

Result: Language models can generate automated coherent MSE summaries for doctors. The study demonstrates the feasibility of using summarization models for mental health assessment documentation and releases the dataset and trained models publicly.

Conclusion: Automated MSE summarization using language models can help address mental health professional shortages in developing countries by reducing documentation burden and improving efficiency, with publicly available resources for further research.

Abstract: Mental health disorders affect a significant portion of the global population, with diagnoses primarily conducted through Mental State Examinations (MSEs). MSEs serve as structured assessments to evaluate behavioral and cognitive functioning across various domains, aiding mental health professionals in diagnosis and treatment monitoring. However, in developing countries, access to mental health support is limited, leading to an overwhelming demand for mental health professionals. Resident doctors often conduct initial patient assessments and create summaries for senior doctors, but their availability is constrained, resulting in extended patient wait times. This study addresses the challenge of generating concise summaries from MSEs through the evaluation of various language models. Given the scarcity of relevant mental health conversation datasets, we developed a 12-item descriptive MSE questionnaire and collected responses from 405 participants, resulting in 9720 utterances covering diverse mental health aspects. Subsequently, we assessed the performance of five well-known pre-trained summarization models, both with and without fine-tuning, for summarizing MSEs. Our comprehensive evaluation, leveraging metrics such as ROUGE, SummaC, and human evaluation, demonstrates that language models can generate automated coherent MSE summaries for doctors. With this paper, we release our collected conversational dataset and trained models publicly for the mental health research community.
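Of the reported metrics, ROUGE is straightforward to reproduce with the rouge-score package; SummaC and the human evaluation are not shown, and the texts below are invented examples rather than data from the released dataset.

```python
# Sketch of the automatic part of the evaluation: ROUGE between a generated
# MSE summary and a reference, via the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Patient reports low mood and poor sleep; speech slow; affect flat."
generated = "The patient describes low mood, reduced sleep, and flat affect."
print(scorer.score(reference, generated))  # precision/recall/F per ROUGE type
```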

[33] The Spatial Semantics of Iconic Gesture

Andy Lücking, Alexander Henlein, Alexander Mehler

Main category: cs.CL

TL;DR: The paper proposes a spatial gesture semantics framework that separates linguistic and visual meaning, analyzes iconicity through three aspects (iconic model, embedding with transformations, and informational evaluation), and explains how gesture meaning composes with speech.

DetailsMotivation: To address the gap in multimodal linguistic theory regarding how iconic gestures convey meaning and how that meaning composes with speech meaning, since current theories don't adequately explain the semantics of visual gestures.

Method: Introduces a spatial gesture semantics with three components: 1) iconic model (kinematic annotations to vector sequences), 2) embedding with transformations (rotation, scaling, perspective fixation, quotation), and 3) informational evaluation (heuristic classification that enables interaction with verbal content).

Result: Develops a formal framework that differentiates three aspects of iconicity, identifies necessary transformations for embedding iconic models, and explains how gesture meaning can interact with speech through lexicon-driven inferences in dynamic semantics.

Conclusion: The proposed spatial gesture semantics successfully bridges the gap between linguistic and visual meaning, providing a systematic way to analyze iconic gestures and their composition with speech in multimodal communication.

Abstract: The current multimodal turn in linguistic theory leaves a crucial question unanswered: what is the meaning of iconic gestures, and how does it compose with speech meaning? We argue for a separation of linguistic and visual levels of meaning and introduce a spatial gesture semantics that closes this gap. Iconicity is differentiated into three aspects: Firstly, an interpretation of the form of a gesture in terms of a translation from kinematic gesture annotations into vector sequences (iconic model). Secondly, a truth-functional evaluation of the iconic model within spatially extended domains (embedding). Since a simple embedding is too strong, we identify a number of transformations that can be applied to iconic models, namely rotation, scaling, perspective fixation, and quotation of handshape. Thirdly, the linguistic description or classification of an iconic model (informational evaluation). Since the informational evaluation of an iconic gesture is a heuristic act, it needs a place in a semantic theory of visual communication. Informational evaluation lifts a gesture to a quasi-linguistic level that can interact with verbal content. This interaction is either vacuous, or regimented by usual lexicon-driven inferences discussed in dynamic semantic frameworks.
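The embedding transformations named in the abstract (rotation, scaling) act on the iconic model's vector sequence; a worked NumPy sketch with an illustrative 2D trajectory:

```python
# Worked sketch of two of the named transformations on an iconic model,
# i.e. a gesture trajectory as a vector sequence. Numbers are illustrative.
import numpy as np

def rotate(points, theta):
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return points @ rot.T

def scale(points, factor):
    return points * factor

# A simple 2D trajectory traced by the hand (kinematic annotation -> vectors).
gesture = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
embedded = scale(rotate(gesture, np.pi / 2), 2.0)
print(embedded)  # the transformed model, ready for truth-functional evaluation
```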

[34] Anthropocentric bias in language model evaluation

Raphaël Millière, Charles Rathkopf

Main category: cs.CL

TL;DR: The paper argues that evaluating LLM cognition requires overcoming anthropocentric biases, particularly “auxiliary oversight” (ignoring non-competence factors) and “mechanistic chauvinism” (dismissing non-human strategies). It proposes an iterative, empirical approach combining behavioral experiments with mechanistic studies.

DetailsMotivation: Current evaluation of LLM cognitive capacities suffers from anthropocentric biases that prevent accurate assessment of their true capabilities. Researchers need better frameworks that don't assume human-like cognition or dismiss non-human strategies.

Method: Proposes an empirically-driven, iterative approach to map cognitive tasks to LLM-specific capacities and mechanisms. This involves supplementing carefully designed behavioral experiments with mechanistic studies to understand how LLMs actually solve problems.

Result: Identifies two specific anthropocentric biases: “auxiliary oversight” (overlooking non-competence factors that impede performance) and “mechanistic chauvinism” (dismissing LLM strategies that differ from human approaches as not genuinely competent).

Conclusion: To properly evaluate LLM cognition, researchers must move beyond anthropocentric biases and adopt task-specific, empirically-driven approaches that recognize LLMs may use different mechanisms than humans while still demonstrating genuine competence.

Abstract: Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (“auxiliary oversight”), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent (“mechanistic chauvinism”). Mitigating these biases necessitates an empirically-driven, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, which can be done by supplementing carefully designed behavioral experiments with mechanistic studies.

[35] Vision-centric Token Compression in Large Language Model

Ling Xing, Alex Jinpeng Wang, Rui Yan, Xiangbo Shu, Jinhui Tang

Main category: cs.CL

TL;DR: Vist is a vision-centric token compression framework that uses a slow-fast approach: fast path converts distant tokens to images for lightweight vision encoding, slow path processes proximal tokens with LLM. Achieves 2.3x token reduction with 16% FLOPs and 50% memory savings.

DetailsMotivation: As LLMs grow to trillions of parameters and context windows expand to hundreds of thousands of tokens, compute and memory costs skyrocket, making token compression essential for practical applications.

Method: Slow-fast compression framework inspired by human reading: fast path renders distant tokens into images processed by frozen lightweight vision encoder; slow path feeds proximal window to LLM. Uses Probability-Informed Visual Enhancement (PVE) objective to mask high-frequency tokens during training, focusing on semantically rich regions.

Result: Achieves same accuracy with 2.3x fewer tokens, reducing FLOPs by 16% and memory by 50%. Outperforms strongest text encoder-based compression method CEPE by 7.6% on average across 11 benchmarks including TriviaQA, NQ, PopQA, NLUI, and CLIN.

Conclusion: Vist sets a new standard for token efficiency in LLMs by combining vision-based compression with human reading-inspired architecture, offering significant computational savings while maintaining accuracy.

Abstract: Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.
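A hedged sketch of the fast path's first step, rendering distant context tokens into an image for a lightweight vision encoder to skim; font handling and layout are greatly simplified relative to the actual Vist renderer.

```python
# Naive text-to-image rendering for the "fast path"; the real pipeline
# controls font, resolution, and layout, which are simplified here.
from PIL import Image, ImageDraw

def render_context(text, width=448, height=448):
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Crude fixed-width line wrapping at ~60 characters.
    words, lines, line = text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 60:
            lines.append(line)
            line = w
        else:
            line = (line + " " + w).strip()
    lines.append(line)
    draw.multiline_text((8, 8), "\n".join(lines), fill="black")
    return img

img = render_context("Distant, low-salience context tokens get rendered. " * 8)
print(img.size)  # this image would be fed to the frozen vision encoder
```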

[36] When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, Yulin Hu, Xingyu Sui, Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu

Main category: cs.CL

TL;DR: Language-specific representation ablation at inference boosts multilingual reasoning in LLMs without training, inspired by human cognitive neuroscience.

DetailsMotivation: Multilingual reasoning in LLMs favors high-resource languages; inspired by cognitive neuroscience showing human reasoning is language-independent, the authors hypothesize LLMs encode reasoning and language as separable components that can be disentangled.

Method: Causal intervention by ablating language-specific representations at inference time across 10 open-weight LLMs and 11 typologically diverse languages, with layer-wise analyses to confirm disentanglement.

Result: Language-specific ablation consistently boosts multilingual reasoning performance; reasoning and language representations can be effectively disentangled throughout models while preserving top-layer language features for linguistic fidelity.

Conclusion: Training-free language-reasoning disentanglement achieves comparable or superior results to post-training methods with minimal computational overhead, offering a lightweight, interpretable strategy for improving cross-lingual generalization in LLMs.

Abstract: Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-weight LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively disentangled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training methods such as supervised fine-tuning or reinforcement learning, our training-free language-reasoning disentanglement achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
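Conceptually, the causal intervention removes the component of a hidden state lying along an estimated language-specific direction; how that direction is obtained (e.g., a mean difference between languages) is an assumption in this sketch.

```python
# Sketch of ablating a language-specific representation at inference time:
# project each hidden state off a unit-normalized language direction.
import torch

def ablate_direction(hidden, direction):
    """Remove the projection of hidden states onto a language direction."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

hidden = torch.randn(5, 4096)   # five token states, illustrative width
lang_dir = torch.randn(4096)    # e.g. mean(lang-A states) - mean(lang-B states)
cleaned = ablate_direction(hidden, lang_dir)
# The cleaned states now have (numerically) zero component along the direction.
print(torch.allclose(cleaned @ (lang_dir / lang_dir.norm()),
                     torch.zeros(5), atol=1e-3))
```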

[37] Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu

Main category: cs.CL

TL;DR: RLPA framework uses reinforcement learning with simulated users to dynamically infer and refine user profiles for personalized dialogue, achieving SOTA performance that surpasses commercial models.

DetailsMotivation: Existing prompt-based and offline optimization methods for personalized LLM alignment are static and shallow, failing in cold-start scenarios and long-term personalization.

Method: RLPA framework where LLM interacts with simulated user model to iteratively infer/refine user profiles via dialogue, guided by dual-level rewards (Profile Reward for accurate user representation, Response Reward for profile-consistent responses).

Result: Qwen-RLPA (fine-tuned Qwen-2.5-3B-Instruct) achieves SOTA in personalized dialogue, outperforms prompting/offline fine-tuning baselines, and surpasses Claude-3.5 and GPT-4o. Shows robustness in handling conflicting preferences, sustaining long-term personalization, and efficient inference.

Conclusion: Dynamic profile inference through RLPA is a more effective paradigm for building personalized dialogue systems compared to static approaches.

Abstract: Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA’s robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
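The dual-level reward can be summarized as a weighted mix of a profile term and a response term; both scoring functions and the mixing weight below are illustrative stand-ins, not the paper's definitions.

```python
# Hedged sketch of RLPA's dual-level reward: one term for how well the
# inferred profile matches the simulated user, one for how consistent the
# response is with that profile. alpha and both scorers are assumptions.
def rlpa_reward(inferred_profile, true_profile, response,
                profile_score, consistency_score, alpha=0.5):
    profile_reward = profile_score(inferred_profile, true_profile)      # Profile Reward
    response_reward = consistency_score(response, inferred_profile)     # Response Reward
    return alpha * profile_reward + (1 - alpha) * response_reward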

[38] V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat

Qi Lin, Weikai Xu, Lisi Chen, Bin Dai

Main category: cs.CL

TL;DR: V-VAE framework enables LLMs to generate persona-consistent responses by learning fine-grained latent traits from high-quality human chat data, outperforming existing methods.

DetailsMotivation: Existing persona-based chatbots rely on static role descriptions and synthetic data, failing to capture dynamic, fine-grained human traits like emotional tone and situational awareness needed for truly human-like conversations.

Method: Proposes Verbal Variational Auto-Encoding (V-VAE) framework with variational auto-encoding module and fine-grained control space that learns interpretable latent variables for talking style, interaction patterns, and personal attributes. Also creates HumanChatData dataset and HumanChatBench benchmark.

Result: LLMs based on V-VAE consistently outperform standard baselines on both HumanChatBench and DialogBench benchmarks, demonstrating effectiveness of the approach.

Conclusion: The V-VAE framework combined with high-quality human chat data enables LLMs to generate more human-like, persona-consistent responses by capturing subtle, dynamic conversational traits that previous methods missed.

Abstract: With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.

[39] Better Language Model Inversion by Compactly Representing Next-Token Distributions

Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta

Main category: cs.CL

TL;DR: PILS method recovers hidden prompts from language models using next-token probabilities, achieving 2-3.5x higher recovery rates than previous methods by exploiting low-dimensional structure in model outputs.

DetailsMotivation: Language model inversion has security implications for API-protected models, as it could leak private information from system messages. Current methods are limited, so better inversion techniques are needed to understand vulnerabilities.

Method: PILS (Prompt Inversion from Logprob Sequences) uses next-token probability distributions across multiple generation steps. Key insight: model outputs occupy low-dimensional subspace, enabling lossless compression via linear map to use more output information for inversion.

Result: Massive gains over SOTA: 2-3.5x higher exact recovery rates (e.g., 17% to 60%). Good generalization: inverter trained on 16 steps performs well on 32 steps (+5-27 points). Strong performance on hidden system message recovery. Cross-family model transfer possible.

Conclusion: Next-token probabilities are more vulnerable to inversion attacks than previously known. PILS demonstrates significant security implications for language model deployments, showing that current protections may be insufficient against sophisticated inversion methods.

Abstract: Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model’s system message. We propose a new method – prompt inversion from logprob sequences (PILS) – that recovers hidden prompts by gleaning clues from the model’s next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: The vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2–3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5–27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
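The enabling observation, that a sequence of full-vocabulary output vectors spans a low-dimensional subspace and can be compressed losslessly by a linear map, is easy to verify numerically; SVD stands in here for however PILS derives its map.

```python
# Numerical check of the key insight: logit sequences have rank at most
# min(steps, hidden size), far below the vocabulary size, so a linear map
# compresses them losslessly. SVD is a stand-in for PILS's actual map.
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab, steps = 64, 5000, 16
W = rng.standard_normal((vocab, hidden))   # unembedding-like matrix
H = rng.standard_normal((steps, hidden))   # hidden states over 16 steps
logits = H @ W.T                           # (16, 5000) output vectors

U, S, Vt = np.linalg.svd(logits, full_matrices=False)
rank = int(np.sum(S > 1e-8))
print(rank)                                # <= min(steps, hidden), not vocab
compressed = logits @ Vt[:rank].T          # linear compression to (16, rank)
restored = compressed @ Vt[:rank]
print(np.allclose(restored, logits))       # lossless reconstruction
```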

[40] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Tianyi Zhou, Johanne Medina, Sanjay Chawla

Main category: cs.CL

TL;DR: LLMs generate confabulations (fluent but incorrect content), especially risky in multi-turn applications. The paper investigates how in-context information affects model behavior and proposes a reliability estimation method using token-level uncertainty to identify unreliable responses.

DetailsMotivation: LLMs are prone to confabulation (generating fluent but incorrect content), which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. There's a need to understand how in-context information influences model behavior and whether LLMs can identify their own unreliable responses.

Method: Proposes a reliability estimation method that leverages token-level uncertainty to guide aggregation of internal model representations. Computes aleatoric and epistemic uncertainty from output logits to identify salient tokens, then aggregates their hidden states into compact representations for response-level reliability prediction.

Result: Through controlled experiments on open QA benchmarks: correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses (misalignment between uncertainty and correctness). The probing-based method captures these behavioral shifts and improves detection of unreliable outputs across multiple open-source LLMs.

Conclusion: The results underscore limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation in LLMs.

Abstract: Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation method that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
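A minimal sketch of the probing recipe: token-level entropy from the output logits selects salient tokens, whose hidden states are pooled into a compact feature for a reliability classifier. Using a single entropy (rather than separate aleatoric and epistemic terms) and a fixed top-k are simplifying assumptions.

```python
# Sketch of uncertainty-guided aggregation: pick the highest-entropy tokens
# and mean-pool their hidden states into a response-level feature. The
# single-entropy shortcut and top_k=8 are assumptions of this sketch.
import torch

def salient_representation(logits, hidden_states, top_k=8):
    """logits: (T, vocab); hidden_states: (T, d) for one response."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)  # per-token uncertainty
    idx = entropy.topk(min(top_k, entropy.numel())).indices
    return hidden_states[idx].mean(dim=0)                 # compact feature

feat = salient_representation(torch.randn(30, 1000), torch.randn(30, 768))
print(feat.shape)  # torch.Size([768]) -> input to a small reliability probe
```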

[41] Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan

Main category: cs.CL

TL;DR: ICRL enables training-free integration of non-text modality representations into LLMs using in-context learning with FM representations instead of text inputs.

DetailsMotivation: Existing approaches for integrating non-text modalities into LLMs require costly supervised training, limiting on-the-fly adaptation to new domains and modalities.

Method: In-Context Representation Learning (ICRL) replaces text inputs with foundational model representations, enabling LLMs to perform multi-modal inference without fine-tuning through few-shot learning.

Result: ICRL is evaluated on molecular domain tasks, investigating mapping strategies, performance factors, and underlying mechanisms of training-free multi-modal integration.

Conclusion: ICRL presents the first training-free framework for integrating non-text modality representations into text-based LLMs, enabling adaptable multi-modal generalization.

Abstract: The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

[42] Towards Personalized Deep Research: Benchmarks and Evaluations

Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: PDR-Bench is the first benchmark for evaluating personalization in Deep Research Agents, featuring 50 diverse tasks across 10 domains paired with 25 authentic user profiles, assessed through a PQR framework measuring Personalization Alignment, Content Quality, and Factual Reliability.

DetailsMotivation: Existing benchmarks for Deep Research Agents focus on generic quality metrics and overlook personalization, which is critical for individual users. Current evaluations rely on close-ended benchmarks, while open-ended deep research benchmarks are scarce and typically neglect personalized scenarios.

Method: Introduced Personalized Deep Research Bench (PDR-Bench) with 50 diverse research tasks across 10 domains paired with 25 authentic user profiles combining structured persona attributes with dynamic real-world contexts, creating 250 realistic user-task queries. Proposed PQR Evaluation Framework to jointly measure Personalization Alignment, Content Quality, and Factual Reliability.

Result: Experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. The benchmark provides a rigorous evaluation foundation for personalized AI research assistants.

Conclusion: PDR-Bench establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants, addressing the critical gap in personalization evaluation for Deep Research Agents.

Abstract: Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing benchmarks primarily evaluate DRAs on generic quality metrics and overlook personalization, a critical dimension for individual users. Moreover, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench (PDR-Bench), the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures Personalization Alignment, Content Quality, and Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
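A sketch of how the three PQR axes might be scored and aggregated; the axis names come from the abstract, while the `judge` callable, the scoring scale, and the equal weights are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PQRScore:
    personalization: float  # alignment with the user profile
    quality: float          # content quality of the report
    reliability: float      # factual reliability of its claims

    def aggregate(self, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
        wp, wq, wr = weights
        return wp * self.personalization + wq * self.quality + wr * self.reliability

def evaluate(report: str, profile: str, judge) -> PQRScore:
    """`judge(axis, report, profile)` is an assumed LLM-judge callable
    returning a score on a fixed scale for one axis."""
    return PQRScore(
        personalization=judge("personalization", report, profile),
        quality=judge("quality", report, profile),
        reliability=judge("reliability", report, profile),
    )
```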

[43] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs

Shuzhou Yuan, Ercong Nie, Yinuo Sun, Chenxuan Zhao, William LaCroix, Michael Färber

Main category: cs.CL

TL;DR: LLMs often make false refusals on safe requests containing words similar to unsafe queries. The paper introduces two benchmarks (XSB and MS-XSB) to measure this problem and proposes three lightweight, model-agnostic methods to reduce exaggerated refusals without retraining.

DetailsMotivation: Large language models frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. This exaggerated safety behavior undermines helpfulness and needs systematic measurement and mitigation.

Method: 1) Created two benchmarks: XSB for single-turn prompts with annotated “Focus” keywords, and MS-XSB for multi-turn dialog scenarios. 2) Used post-hoc explanation methods to identify refusal triggers. 3) Deployed three inference-time approaches: ignore-word instructions, prompt rephrasing, and attention steering - all model-agnostic without retraining or parameter access.

Result: Experiments on four instruction-tuned Llama models show that the proposed strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios.

Conclusion: The paper establishes a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments through lightweight, model-agnostic interventions at inference time.

Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with “Focus” keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
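Of the three mitigations, the ignore-word instruction is the simplest to illustrate. A minimal sketch, assuming the trigger terms have already been surfaced by a post-hoc explainer; the instruction wording is invented, not taken from the paper.

```python
def ignore_word_wrapper(prompt: str, triggers: list[str]) -> str:
    """Prepend an instruction telling the model not to over-weight
    surface trigger terms identified by a post-hoc explainer."""
    if not triggers:
        return prompt
    listed = ", ".join(f'"{t}"' for t in triggers)
    return (
        f"The following request is benign. Do not refuse merely because "
        f"it contains the terms {listed}; judge the actual intent.\n\n{prompt}"
    )

# Usage: wrapped = ignore_word_wrapper("How do I kill a Python process?", ["kill"])
```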

[44] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue

Main category: cs.CL

TL;DR: TheMCPCompany is a benchmark for evaluating tool-calling agents on real-world service tasks using MCP servers with 18,000+ tools, showing advanced models can discover tools in simple environments but struggle with complex enterprise navigation.

DetailsMotivation: Current general-purpose agents rely heavily on web browsers for environment interaction, but the Model Context Protocol (MCP) has enabled many task-specific tools that are easier to develop than GUIs. There's a need to evaluate how well agents can use these specialized tools instead of general-purpose browsers.

Method: Created TheMCPCompany benchmark using REST APIs of real-world services to build MCP servers with over 18,000 tools. Provided manually annotated ground-truth tools for each task. Evaluated agents in two settings: 1) with ground-truth tools to show potential assuming perfect tool retrieval, and 2) with tool retrieval to study real-world practicality.

Result: With ground-truth tools, tool-calling agents improve performance and reduce costs. With tool retrieval, all models perform similarly or better than browser-based agents, but smaller models can’t fully utilize available tools. GPT-5’s performance with retrieval is close to its ground-truth performance. Advanced models can discover tools in simpler environments but struggle with complex enterprise navigation.

Conclusion: Navigating tens of thousands of tools and combining them in non-trivial ways for complex problems remains challenging for current models. TheMCPCompany reveals the need for better reasoning and retrieval models to handle complex enterprise environments effectively.

Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5’s performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
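A sketch of the dense tool-retrieval step such an agent needs, assuming sentence-transformers embeddings; with roughly 18,000 tools, brute-force cosine search over description embeddings is sufficient. The encoder choice is an assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def build_index(tool_descriptions: list[str]) -> np.ndarray:
    """Embed every tool description once; rows are L2-normalized."""
    return model.encode(tool_descriptions, normalize_embeddings=True)

def retrieve_tools(task: str, index: np.ndarray, k: int = 20) -> list[int]:
    """Return indices of the k tools whose descriptions best match the task."""
    q = model.encode([task], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since embeddings are normalized
    return np.argsort(-scores)[:k].tolist()
```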

[45] SCALE: Upscaled Continual Learning of Large Language Models

Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim, JeongSeon Yi, Joonseok Lee, DongYoung Jung, Jaeseon Park, Kyoungwon Park, Suk-hoon Jung

Main category: cs.CL

TL;DR: SCALE introduces a width expansion architecture for continual pre-training that adds lightweight expansion to linear modules while freezing pre-trained parameters, achieving better stability-plasticity trade-off than depth expansion.

DetailsMotivation: Progress in continual pre-training for LLMs depends more on scaling the right architecture than just scaling parameters. Current approaches suffer from catastrophic forgetting when adding new knowledge.

Method: SCALE inserts lightweight expansion into linear modules while freezing all pre-trained parameters, preserving residual and attention topologies. Uses two principles: Persistent Preservation (maintains base model behavior) and Collaborative Adaptation (selectively trains expansion components). Three variants: SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route (token-level routing between heads).

Result: On synthetic biography benchmark: mitigates severe forgetting seen with depth expansion while acquiring new knowledge. On Korean continual pre-training: achieves less forgetting on English evaluations and competitive gains on Korean benchmarks, offering best stability-plasticity trade-off.

Conclusion: SCALE architecture provides effective continual pre-training by balancing preservation and adaptation through width expansion, with analysis showing when preservation holds and why this approach stabilizes optimization compared to standard continual learning.

Abstract: We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model’s original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model’s behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, with these variants offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
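A minimal PyTorch sketch of the core mechanism: freeze the pre-trained linear layer and add a parallel trainable expansion whose output projection is zero-initialized, so the module initially reproduces the base model (one reading of "preservation-oriented initialization"). SCALE's actual variants and the routing extension are more elaborate than this.

```python
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Frozen pre-trained linear + lightweight trainable width expansion."""

    def __init__(self, base: nn.Linear, expand_dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # Persistent Preservation: freeze base
        self.down = nn.Linear(base.in_features, expand_dim, bias=False)
        self.up = nn.Linear(expand_dim, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: output equals base at start

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))
```

Structurally this resembles a low-rank adapter, which is one way to read "lightweight expansion into linear modules" while preserving the residual and attention topologies.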

[46] Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak

Lucia Makaiova, Martin Fajcik, Antonin Jarolim

Main category: cs.CL

TL;DR: This paper addresses the challenge of evaluating document-level claim extraction by proposing alignment-based methods to compare claim sets, with experiments on Czech/Slovak news comments highlighting limitations of current approaches.

DetailsMotivation: Document-level claim extraction is an open challenge in fact-checking, and current evaluation methods for extracted claims have received limited attention. There's a need for reliable evaluation frameworks to assess model performance and inter-annotator agreement.

Method: The authors explore alignment approaches to match two sets of claims from the same source document and compute similarity scores. They investigate techniques to identify optimal alignment and evaluation methods between claim sets, enabling comparison between model-extracted and human-annotated claims.

Result: Experiments on a new dataset of claims extracted from Czech and Slovak news article comments reveal limitations of current evaluation approaches. The informal language, strong local context, and linguistic subtleties of these closely related languages pose additional challenges.

Conclusion: Current evaluation methods are insufficient for document-level claim extraction. More advanced approaches are needed that can properly capture semantic similarity and evaluate essential claim properties like atomicity, checkworthiness, and decontextualization.

Abstract: Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles, a domain that poses additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods, ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
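One natural instantiation of the alignment score is optimal one-to-one matching over claim embeddings. A sketch using the Hungarian algorithm; the normalization by the larger set size, which penalizes unmatched claims, is an assumption about one of the variants such a study might explore.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def alignment_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """emb_a: (n, d), emb_b: (m, d); rows are L2-normalized claim embeddings.

    Returns the mean cosine similarity under the optimal 1:1 alignment,
    normalized by the larger set size so unmatched claims reduce the score.
    """
    sim = emb_a @ emb_b.T                     # (n, m) cosine similarities
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return sim[rows, cols].sum() / max(sim.shape)
```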

[47] A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

Sizhe Zhou, Jiawei Han

Main category: cs.CL

TL;DR: Event-centric memory representation for LLM conversational agents using enriched elementary discourse units (EDUs) organized in a heterogeneous graph, improving long-term coherence and personalization.

DetailsMotivation: LLM-based conversational agents struggle with long-term coherence and personalization due to fixed context windows and limitations of existing memory approaches that either use coarse retrieval over large chunks or fine-grained but fragmented views of dialogue.

Method: Decompose each session into enriched elementary discourse units (EDUs) - self-contained statements with normalized entities and source turn attributions. Organize sessions, EDUs, and their arguments in a heterogeneous graph supporting associative recall. Build two retrieval variants using dense similarity search with LLM filtering, optionally with graph-based propagation to connect evidence across related EDUs.

Result: Experiments on LoCoMo and LongMemEval$_S$ benchmarks show event-centric memories match or surpass strong baselines while operating with much shorter QA contexts.

Conclusion: Structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents, preserving information in non-compressive form for better accessibility.

Abstract: LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) – self-contained statements with normalized entities and source turn attributions – and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.
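A sketch of the event-level memory unit and the plain dense-retrieval variant; the heterogeneous graph and the propagation step are omitted. Field names follow the abstract's description of enriched EDUs, everything else is assumed.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EDU:
    text: str               # self-contained, event-like proposition
    entities: list[str]     # normalized participants
    session_id: int
    turn_ids: list[int]     # source turn attributions
    embedding: np.ndarray   # L2-normalized dense vector

def recall(query_emb: np.ndarray, memory: list[EDU], k: int = 10) -> list[EDU]:
    """Dense similarity search over event-level memory units."""
    scores = np.array([edu.embedding @ query_emb for edu in memory])
    top = np.argsort(-scores)[:k]
    return [memory[i] for i in top]
```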

[48] LMSpell: Neural Spell Checking for Low-Resource Languages

Akesh Gunathilake, Nadil Karunarathna, Tharusha Bandaranayake, Nisansa de Silva, Surangika Ranathunga

Main category: cs.CL

TL;DR: First empirical study comparing pretrained language models for spell correction, including low-resource languages, showing LLMs outperform other architectures with sufficient fine-tuning data, even in languages they weren’t pre-trained on.

DetailsMotivation: Spell correction remains challenging for low-resource languages, and while pretrained language models have been used, there's been no proper comparison across different PLM types and limited application to LRLs.

Method: Conducted empirical study comparing effectiveness of different PLM architectures (LLMs, encoder-based, encoder-decoder) for spell correction, including low-resource languages. Developed LMSpell toolkit with evaluation function to compensate for LLM hallucination. Included case study with Sinhala language.

Result: Large Language Models outperform encoder-based and encoder-decoder models when fine-tuning dataset is large, even for languages the LLM wasn’t pre-trained on. Released LMSpell toolkit for easy spell correction across PLMs.

Conclusion: LLMs show strong potential for spell correction tasks, including for low-resource languages, when sufficient fine-tuning data is available. The study provides practical tools and insights for improving spell correction in underserved languages.

Abstract: Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.
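The abstract mentions an evaluation function that compensates for LLM hallucination but does not specify it; below is one plausible guard, which withholds credit when the model rewrites far more of the input than a spelling fix warrants. The drift threshold is an assumption, not LMSpell's actual function.

```python
from difflib import SequenceMatcher

def guarded_correction_score(source: str, prediction: str, reference: str,
                             max_rewrite_ratio: float = 0.3) -> float:
    """Exact-match score, zeroed out when the prediction diverges from the
    source by more than max_rewrite_ratio (a hallucination guard)."""
    drift = 1.0 - SequenceMatcher(None, source, prediction).ratio()
    if drift > max_rewrite_ratio:
        return 0.0  # the model rewrote too much: treat as hallucination
    return float(prediction.strip() == reference.strip())
```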

[49] A Greek Government Decisions Dataset for Public-Sector Analysis and Insight

Giorgos Antoniou, Giorgos Filandrianos, Aggelos Vlachos, Giorgos Stamou, Lampros Kollimenos, Konstantinos Skianis, Michalis Vazirgiannis

Main category: cs.CL

TL;DR: Researchers create a 1-million-document corpus of Greek government decisions from Diavgeia platform, with high-quality text extraction, RAG task design, and evaluation showing potential for transparency and AI applications.

DetailsMotivation: To create a large-scale, machine-readable corpus of government decisions to support transparency, advanced information access, and AI development in the legal/governmental domain.

Method: Extract 1 million Greek government decisions from Diavgeia platform, convert PDFs to Markdown text, design reproducible extraction pipeline, conduct qualitative analyses of boilerplate patterns, and create RAG task with representative questions and answers.

Result: Successfully created a high-quality corpus with reproducible extraction pipeline, demonstrated RAG system’s ability to retrieve and reason over public decisions, showing potential for chat-based assistants and AI model training.

Conclusion: The corpus enables transparency, supports AI development for legal/governmental domains, and provides foundation for domain adaptation, knowledge-grounded generation, and explainable AI, with data and code made publicly accessible.

Abstract: We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.
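A minimal sketch of the kind of baseline RAG loop the abstract evaluates, assuming pre-computed L2-normalized embeddings of the Markdown decisions and generic `embed`/`llm` callables; none of these names come from the paper.

```python
import numpy as np

def answer(question: str, doc_embs: np.ndarray, docs: list[str],
           embed, llm, k: int = 5) -> str:
    """Retrieve the top-k decisions by cosine similarity, then generate
    a grounded answer. `embed` and `llm` are assumed callables."""
    q = embed(question)                    # (d,), L2-normalized
    top = np.argsort(-(doc_embs @ q))[:k]  # brute-force dense retrieval
    context = "\n\n".join(docs[i] for i in top)
    prompt = (f"Answer using only these government decisions:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return llm(prompt)
```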

[50] Heard or Halted? Gender, Interruptions, and Emotional Tone in U.S. Supreme Court Oral Arguments

Yifei Tong

Main category: cs.CL

TL;DR: Interruptions in Supreme Court oral arguments don’t change argument content but show gendered patterns: female advocates face more negatively-toned interruptions.

DetailsMotivation: To understand how interruptions during Supreme Court oral arguments affect argument content and emotional tone, with specific focus on gendered dynamics in judicial discourse.

Method: Analyzed 12,663 speech chunks from advocate-justice interactions (2010-2019) using ConvoKit Supreme Court Corpus. Used GloVe-based sentence embeddings to quantify semantic shifts and lexicon-based analysis to measure sentiment.

Result: Semantic similarity between pre- and post-interruption speech remains high (interruptions don’t alter argument content), but interruptions directed at female advocates contain significantly higher levels of negative sentiment.

Conclusion: Interruptions maintain argument content but reveal gendered communication patterns in elite institutions, demonstrating computational linguistics’ value for studying power, discourse, and equity in judicial proceedings.

Abstract: This study examines how interruptions during U.S. Supreme Court oral arguments shape both the semantic content and emotional tone of advocates’ speech, with a focus on gendered dynamics in judicial discourse. Using the ConvoKit Supreme Court Corpus (2010-2019), we analyze 12,663 speech chunks from advocate-justice interactions to assess whether interruptions alter the meaning of an advocate’s argument and whether interruptions toward female advocates exhibit more negative emotional valence. Semantic shifts are quantified using GloVe-based sentence embeddings, while sentiment is measured through lexicon-based analysis. We find that semantic similarity between pre- and post-interruption speech remains consistently high, suggesting that interruptions do not substantially alter argumentative content. However, interruptions directed at female advocates contain significantly higher levels of negative sentiment. These results deepen empirical understanding of gendered communication in elite institutional settings and demonstrate the value of computational linguistic methods for studying power, discourse, and equity in judicial proceedings.
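Both measurements are standard and easy to sketch: mean-pooled GloVe vectors for the semantic-shift similarity, and a lexicon hit rate for negative sentiment. The pooling and lexicon details are assumptions about the paper's exact setup.

```python
import numpy as np

def sentence_embedding(tokens: list[str], glove: dict, dim: int = 300) -> np.ndarray:
    """Mean-pooled GloVe word vectors; zeros if no token is in vocabulary."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def semantic_shift(pre_tokens, post_tokens, glove) -> float:
    """Cosine similarity of speech before vs. after an interruption."""
    a = sentence_embedding(pre_tokens, glove)
    b = sentence_embedding(post_tokens, glove)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def negative_sentiment(tokens: list[str], negative_lexicon: set) -> float:
    """Share of tokens appearing in a negative-word lexicon."""
    return sum(t in negative_lexicon for t in tokens) / max(len(tokens), 1)
```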

[51] Luxical: High-Speed Lexical-Dense Text Embeddings

DatologyAI: Luke Merrick, Alex Fang, Aldo Carranza, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Paul Burstein, Parth Doshi, Paul Burnstein, Pratyush Maini, Ricardo Monti, Rishabh Adiga, Scott Loftin, Siddharth Joshi, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Main category: cs.CL

TL;DR: Luxical is a high-speed lexical-dense text embedding library that combines sparse TF-IDF features with a small ReLU network and knowledge distillation to approximate transformer embeddings at much lower computational cost, achieving 3x-100x speedups while maintaining quality comparable to neural baselines.

DetailsMotivation: Current tools for organizing web-scale text corpora face a trade-off: lexical classifiers (like FastText) are fast but limited to classification outputs, while transformer embedding models are flexible (supporting clustering, classification, retrieval) but computationally expensive. There's a need for a solution that combines the speed of lexical approaches with the flexibility of neural embeddings.

Method: Luxical combines sparse TF-IDF features with a small ReLU neural network and uses knowledge distillation training to approximate large transformer embedding models. The architecture aims to produce “lexical-dense” embeddings that maintain the computational efficiency of lexical methods while achieving the representational power of neural embeddings.

Result: Luxical achieves speedups ranging from 3x to 100x over varying-sized neural baselines, with inference speed comparable to FastText during the data curation task. In evaluations including webcrawl document retrieval and language model data curation, Luxical matches the quality of neural baselines while offering significantly better compute/quality trade-offs for large-scale text organization.

Conclusion: Luxical provides a practical solution for web-scale text organization by combining the speed of lexical methods with the flexibility of neural embeddings, offering favorable compute/quality trade-offs and making high-quality text embeddings more accessible for large-scale applications. The library is available as open-source software.

Abstract: Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today’s dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed “lexical-dense” text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF–IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
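A sketch of the lexical-dense recipe as described: sparse TF-IDF features feed a small ReLU network that is distilled to match a frozen transformer teacher's embeddings. Dimensions, vocabulary size, and the cosine-distance objective are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2**18)  # sparse lexical features

class LexicalDense(nn.Module):
    """Small ReLU network mapping TF-IDF features to dense embeddings."""

    def __init__(self, vocab_size: int, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 1024), nn.ReLU(), nn.Linear(1024, dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

def distill_step(student, x_tfidf, teacher_emb, opt):
    """One distillation step: match the (normalized) teacher embedding."""
    pred = student(x_tfidf)
    loss = (1 - (pred * teacher_emb).sum(-1)).mean()  # cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```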

Simone Corbo

Main category: cs.CL

TL;DR: LLMs in legal domain: potential to optimize legal tasks but face challenges like hallucinations, algorithmic monoculture, and regulatory compliance across EU, US, and China.

DetailsMotivation: To explore how Large Language Models can enhance traditional legal work by automating and improving tasks like statutory interpretation, contract analysis, and legal research, while addressing the challenges and regulatory frameworks surrounding their implementation.

Method: Analysis of possible LLM use cases in legal domain, examination of challenges (algorithmic monoculture, hallucinations, regulatory compliance), and presentation of two different benchmarks for evaluation.

Result: LLMs show potential for optimizing legal tasks such as interpreting statutes/contracts/case law, enhancing legal summarization, contract negotiation, and information retrieval, but face significant implementation challenges.

Conclusion: While LLMs offer promising applications for legal domain optimization, careful consideration of challenges and compliance with evolving regulations (EU AI Act, US initiatives, Chinese approaches) is essential for successful implementation.

Abstract: This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU’s AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.

cs.CV

[53] Neuromorphic Eye Tracking for Low-Latency Pupil Detection

Paul Hueber, Luca Peres, Florian Pitters, Alejandro Gloriani, Oliver Rhodes

Main category: cs.CV

TL;DR: This paper presents a neuromorphic eye-tracking system using spiking neural networks that achieves 3.7-4.1px mean error with 20x smaller model size and 850x less compute than ANN variants, enabling 3.9-4.9 mW power consumption and 3 ms latency for wearable AR/VR applications.

DetailsMotivation: Conventional frame-based eye tracking suffers from motion blur, high computational cost, and limited temporal resolution, which are problematic for wearable AR/VR systems requiring low latency and milliwatt-level power. Existing SNN approaches are either too specialized or underperform compared to modern ANN architectures.

Method: The authors create neuromorphic versions of top-performing event-based eye-tracking models by replacing recurrent and attention modules with lightweight Leaky Integrate-and-Fire (LIF) layers and using depth-wise separable convolutions to reduce model complexity.

Result: The models achieve 3.7-4.1px mean error (approaching Retina’s 3.24px), with 20x smaller model size and 850x theoretical compute reduction compared to ANN variants. Estimated performance: 3.9-4.9 mW power consumption and 3 ms latency at 1 kHz.

Conclusion: High-performing event-based eye-tracking architectures can be successfully redesigned as SNNs with substantial efficiency gains while maintaining accuracy suitable for real-time wearable deployment in AR/VR applications.

Abstract: Eye tracking for wearable systems demands low latency and milliwatt-level power, but conventional frame-based pipelines struggle with motion blur, high compute cost, and limited temporal resolution. Such capabilities are vital for enabling seamless and responsive interaction in emerging technologies like augmented reality (AR) and virtual reality (VR), where understanding user gaze is key to immersion and interface design. Neuromorphic sensors and spiking neural networks (SNNs) offer a promising alternative, yet existing SNN approaches are either too specialized or fall short of the performance of modern ANN architectures. This paper presents a neuromorphic version of top-performing event-based eye-tracking models, replacing their recurrent and attention modules with lightweight LIF layers and exploiting depth-wise separable convolutions to reduce model complexity. Our models obtain 3.7-4.1px mean error, approaching the accuracy of the application-specific neuromorphic system, Retina (3.24px), while reducing model size by 20x and theoretical compute by 850x, compared to the closest ANN variant of the proposed model. These efficient variants are projected to operate at an estimated 3.9-4.9 mW with 3 ms latency at 1 kHz. The present results indicate that high-performing event-based eye-tracking architectures can be redesigned as SNNs with substantial efficiency gains, while retaining accuracy suitable for real-time wearable deployment.
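To make the recurrent/attention-to-LIF replacement concrete, here is a minimal discrete-time LIF layer in PyTorch. Real SNN training needs a surrogate gradient for the spike nonlinearity, which this sketch omits; the decay and threshold values are placeholders.

```python
import torch
import torch.nn as nn

class LIFLayer(nn.Module):
    """Leaky Integrate-and-Fire layer, simulated in discrete time."""

    def __init__(self, decay: float = 0.9, threshold: float = 1.0):
        super().__init__()
        self.decay, self.threshold = decay, threshold

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (time_steps, batch, features) of input currents
        v = torch.zeros_like(inputs[0])
        spikes = []
        for x in inputs:
            v = self.decay * v + x             # leaky integration
            s = (v >= self.threshold).float()  # fire on threshold crossing
            v = v - s * self.threshold         # soft reset by subtraction
            spikes.append(s)
        return torch.stack(spikes)
```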

[54] Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering

Cai Xu, Jinlong Liu, Yilin Zhang, Ziyu Guan, Wei Zhao

Main category: cs.CV

TL;DR: ISMVC proposes selective imputation for incomplete multi-view clustering, evaluating informativeness of missing positions and only imputing when sufficient support exists, combined with variational autoencoder for robust representation learning.

DetailsMotivation: Existing methods for incomplete multi-view clustering have limitations: imputation-based approaches introduce noise/bias when information is insufficient, while imputation-free methods struggle under severe incompleteness due to lack of cross-view complementarity.

Method: ISMVC evaluates imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, selectively imputes only when sufficient support exists, and integrates this with a variational autoencoder with mixture-of-Gaussians prior for clustering-friendly latent representations.

Result: Extensive experiments on multiple benchmark datasets under realistic unbalanced missing scenarios show ISMVC outperforms both imputation-based and imputation-free approaches.

Conclusion: ISMVC provides a lightweight, data-driven, model-agnostic selective imputation approach that can be integrated as a plug-in module into existing incomplete multi-view clustering models, offering robust performance by balancing the benefits of imputation while avoiding its risks.

Abstract: Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
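A sketch of the selective-imputation gate: score each missing position from intra-view neighbor similarity and cross-view consistency, and impute only above a support threshold. The linear blend and the threshold are assumptions consistent with the abstract's description.

```python
import numpy as np

def informativeness(neighbor_sims: np.ndarray, cross_view_consistency: float,
                    alpha: float = 0.5) -> float:
    """Blend intra-view neighbor similarity with cross-view consistency."""
    return alpha * neighbor_sims.mean() + (1 - alpha) * cross_view_consistency

def maybe_impute(missing_idx, neighbor_sims, cross_view_consistency,
                 impute_fn, threshold: float = 0.6):
    """Impute only when the missing position has sufficient support;
    otherwise leave it missing rather than introduce noise."""
    score = informativeness(neighbor_sims, cross_view_consistency)
    return impute_fn(missing_idx) if score >= threshold else None
```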

[55] ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects

Woojin Lee, Hyugjae Chang, Jaeho Moon, Jaehyup Lee, Munchurl Kim

Main category: cs.CV

TL;DR: ABBSPO is a weakly supervised oriented object detection framework that improves scale estimation using adaptive bounding box scaling and leverages object symmetry for better orientation prediction.

DetailsMotivation: Previous HBox-supervised OOD methods directly compare ground truth HBoxes with predicted RBoxes' minimum circumscribed rectangles, leading to inaccurate scale estimation. There's also a need to better exploit object symmetry and prevent learning collapse in orientation prediction.

Method: Proposes two key components: 1) Adaptive Bounding Box Scaling (ABBS) that scales GT HBoxes to optimize for each predicted RBox size, and 2) Symmetric Prior Angle (SPA) loss that uses inherent symmetry of aerial objects for self-supervised learning, preventing collapse when all augmented view predictions are wrong.

Result: Extensive experiments show ABBSPO achieves state-of-the-art performance, outperforming existing weakly supervised oriented object detection methods.

Conclusion: ABBSPO effectively addresses scale estimation limitations in HBox-supervised OOD and improves orientation prediction through symmetry exploitation, establishing a new benchmark for weakly supervised oriented object detection.

Abstract: Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces adaptive bounding box scaling and symmetry-prior-based orientation prediction, called ABBSPO, a framework for WS-OOD. Our ABBSPO addresses limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits inherent symmetry of aerial objects for self-supervised learning, resolving issues in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.

[56] Diffusion Is Your Friend in Show, Suggest and Tell

Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

Main category: cs.CV

TL;DR: SST combines diffusion models with autoregressive generation for image captioning, achieving SOTA results on COCO by using diffusion to suggest improvements to autoregressive captions.

DetailsMotivation: Diffusion models underperform autoregressive models in discrete domains like text generation. The authors want to leverage diffusion's bidirectional/refining capabilities while preserving autoregressive models' strong linguistic structure.

Method: Show, Suggest and Tell (SST) uses diffusion models to provide suggestions to autoregressive generation rather than replacing it. The diffusion model refines and improves the autoregressive captions.

Result: Achieves 125.1 CIDEr-D on COCO without RL, outperforming both autoregressive and diffusion SOTA by 1.5 and 2.5 points. Shows positive correlation between suggestion quality and caption quality.

Conclusion: Combining diffusion suggestions with autoregressive generation is a promising underexplored direction that yields SOTA results by leveraging strengths of both approaches.

Abstract: Diffusion Denoising models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.

[57] Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video

Seonghwa Choi, Moonkyeong Choi, Mingyu Jang, Jaekyung Kim, Jianfei Cai, Wen-Huang Cheng, Sanghoon Lee

Main category: cs.CV

TL;DR: RnD-Avatar: A 3DGS-based framework for creating relightable and animatable human avatars from monocular video with accurate pose-variant deformation and fine geometric details.

DetailsMotivation: Existing NeRF and 3DGS methods for human avatar modeling often produce unsatisfactory results with insufficient geometrical details related to body motion (like clothing wrinkles), and lack realistic relighting capabilities.

Method: Proposes a 3D Gaussian Splatting-based framework with dynamic skinning weights for pose-based articulation, learns additional motion-induced deformations, introduces novel regularization for fine geometric details under sparse visual cues, and creates a multi-view dataset with varied lighting for evaluation.

Result: Achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting, enabling realistic rendering of novel poses/views with photo-realistic lighting effects under arbitrary lighting conditions.

Conclusion: RnD-Avatar successfully addresses limitations of previous methods by combining dynamic skinning weights, motion-induced deformation learning, and specialized regularization to create high-fidelity, relightable human avatars with accurate pose-variant deformation and fine geometric details.

Abstract: Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed as Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar’s articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.

[58] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata

Yihao Liu, Chenyu Gao, Lianrui Zuo, Michael E. Kim, Brian D. Boyd, Lisa L. Barnes, Walter A. Kukull, Lori L. Beason-Held, Susan M. Resnick, Timothy J. Hohman, Warren D. Taylor, Bennett A. Landman

Main category: cs.CV

TL;DR: MetaVoxel is a joint diffusion model that learns a single diffusion process spanning both medical imaging data and clinical metadata, enabling flexible zero-shot inference across multiple tasks without task-specific retraining.

DetailsMotivation: Most current deep learning approaches in medical AI are trained for specific predictive directions with specific input variables, requiring separate models for different tasks. There's a need for a unified framework that can handle multiple clinical applications without task-specific retraining.

Method: MetaVoxel is a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. It captures the complete joint distribution rather than conditional distributions.

Result: Using over 10,000 T1-weighted MRI scans with clinical metadata from nine datasets, MetaVoxel performs image generation, age estimation, and sex prediction with performance comparable to established task-specific baselines. It demonstrates flexible zero-shot inference capabilities.

Conclusion: Joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability by supporting flexible inference across multiple tasks without requiring separate models or retraining.

Abstract: Modern deep learning methods have achieved impressive results across tasks from disease classification, estimating continuous biomarkers, to generating realistic medical images. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.

[59] Independent Density Estimation

Jiahao Liu

Main category: cs.CV

TL;DR: IDE method improves compositional generalization in vision-language models by learning connections between individual words and image features, outperforming current models on unseen compositions.

DetailsMotivation: Current large-scale vision-language models struggle with achieving human-like compositional generalization despite their success in tasks like image captioning and conditioned image generation.

Method: Proposes Independent Density Estimation (IDE) to learn connections between individual words and corresponding image features. Builds two models: one using fully disentangled visual representations, and another using a Variational Auto-Encoder for partially disentangled features. Also introduces entropy-based compositional inference to combine word predictions.

Result: Models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.

Conclusion: IDE effectively addresses compositional generalization challenges in vision-language models by establishing word-image feature connections and using entropy-based inference.

Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.
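A sketch of one way the entropy-based combination could work: each word contributes a class distribution, weighted by inverse entropy so that confident words dominate the compositional prediction. The exact rule in the paper may differ.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p + 1e-9)).sum())

def combine_word_predictions(word_probs: list[np.ndarray]) -> np.ndarray:
    """Weight each word's class distribution by its inverse entropy,
    then renormalize the weighted mixture into a distribution."""
    weights = np.array([1.0 / (entropy(p) + 1e-9) for p in word_probs])
    weights = weights / weights.sum()
    combined = sum(w * p for w, p in zip(weights, word_probs))
    return combined / combined.sum()
```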

[60] TraceFlow: Dynamic 3D Reconstruction of Specular Scenes Driven by Ray Tracing

Jiachen Tao, Junyi Wu, Haoxuan Wang, Zongxin Yang, Dawen Cai, Yan Yan

Main category: cs.CV

TL;DR: TraceFlow: A framework for high-fidelity rendering of dynamic specular scenes using Gaussian splatting with material augmentation and hybrid rendering pipeline.

DetailsMotivation: Address two key challenges in dynamic specular scene rendering: precise reflection direction estimation and physically accurate reflection modeling for complex dynamic environments.

Method: 1) Residual Material-Augmented 2D Gaussian Splatting for dynamic geometry and material properties; 2) Dynamic Environment Gaussian and hybrid rendering pipeline decomposing into diffuse/specular components; 3) Coarse-to-fine training strategy for optimization stability.

Result: Outperforms prior methods on dynamic scene benchmarks both quantitatively and qualitatively, producing sharper and more realistic specular reflections.

Conclusion: TraceFlow successfully addresses reflection challenges in dynamic scenes through novel Gaussian representations and hybrid rendering, achieving state-of-the-art performance in specular reflection rendering.

Abstract: We present TraceFlow, a novel framework for high-fidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate reflection modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically grounded specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.

[61] Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information

Neelima Prasad, Jarek Reynolds, Neel Karsanbhai, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Yang Wang, Leah Findlater, Danna Gurari

Main category: cs.CV

TL;DR: Proposes hierarchical instance tracking task with new benchmark dataset of 2,765 entities in 552 videos across 40 object/part categories, showing current models struggle with this challenge.

DetailsMotivation: Current tracking methods focus on individual objects without considering hierarchical relationships between objects and their parts. There's a need for tracking that maintains these structural relationships for more comprehensive scene understanding.

Method: Introduces a new benchmark dataset with 2,765 unique entities tracked in 552 videos across 40 categories (objects and parts). Evaluates seven variants of four existing models adapted to the hierarchical tracking task.

Result: The evaluation reveals the dataset is challenging for current models, indicating that hierarchical instance tracking is a difficult problem that existing approaches struggle with.

Conclusion: Hierarchical instance tracking is a novel and challenging task that requires new approaches beyond current tracking methods. The introduced benchmark dataset will facilitate research in this direction.

Abstract: We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. Our dataset is available at https://vizwiz.org/tasks-and-datasets/hierarchical-instance-tracking/

[62] Effective Online Exam Proctoring by Combining Lightweight Face Detection and Deep Recognition

Xu Yang, Juantao Zhong, Daoyuan Wu, Xiao Yi, Jimmy H. M. Lee, Tan Lee, Peng Han

Main category: cs.CV

TL;DR: iExam is an online exam proctoring system that combines real-time face detection with post-exam deep face recognition to monitor student presence and detect cheating behaviors like face disappearance, rotation, and identity substitution.

DetailsMotivation: Online exams via platforms like Zoom have become popular since COVID-19, but ensuring exam integrity is challenging because traditional invigilation struggles to effectively monitor multiple student video feeds in real time.

Method: iExam uses lightweight real-time face detection for continuous monitoring and deep face recognition for post-exam analysis. It addresses three challenges: 1) efficient real-time video stream analysis, 2) enhanced OCR to automatically extract student identities from Zoom name tags for ground truth labeling, and 3) optimized training/inference pipeline for ordinary teacher devices.

Result: iExam achieves 90.4% accuracy for real-time face detection and 98.4% accuracy for post-exam face recognition while maintaining low overhead, demonstrating substantial enhancement in automation and reliability of online exam proctoring.

Conclusion: iExam can substantially enhance the automation and reliability of online exam proctoring in practice by effectively combining real-time monitoring with post-exam analysis to detect cheating behaviors.

Abstract: Online exams, conducted via video conferencing platforms such as Zoom, have become popular in educational institutions since COVID-19. While convenient, ensuring the integrity and security of online exams remains challenging, as traditional invigilation struggles to effectively monitor multiple student video feeds in real time. In this paper, we present iExam, an effective online exam proctoring and analysis system that combines lightweight face detection and deep recognition. iExam employs real-time face detection to assist invigilators in continuously monitoring student presence, and leverages deep face recognition for post-exam video analysis to identify abnormal behaviors, including face disappearance, face rotation, and identity substitution. To realize this system, we address three core challenges: (i) designing a lightweight approach to efficiently capture and analyze exam video streams in real time; (ii) developing an enhanced OCR method to automatically extract student identities from dynamically positioned Zoom name tags, enabling reliable ground truth labeling without manual intervention; and (iii) optimizing the training and inference pipeline to significantly reduce resource and time requirements on ordinary teacher devices. Extensive experiments demonstrate that iExam achieves 90.4% accuracy for real-time face detection and 98.4% accuracy for post-exam face recognition, while maintaining low overhead. These results show that iExam can substantially enhance the automation and reliability of online exam proctoring in practice.
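
To make the two-stage structure concrete, here is a minimal sketch of the real-time stage, using an OpenCV Haar cascade as a generic lightweight stand-in detector (iExam's actual detector, OCR module, and recognition network are not specified here). Frames flagged as suspicious would then be re-examined by the post-exam deep recognition pass:

```python
# Sketch of the real-time monitoring stage, assuming opencv-python is installed.
import cv2

def monitor_stream(video_path: str, sample_every: int = 30):
    """Flag frame indices where no face is visible (candidate 'disappearance' events)."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    flagged, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:      # subsample frames to keep this stage lightweight
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) == 0:          # no face detected in this sampled frame
                flagged.append(idx)
        idx += 1
    cap.release()
    return flagged                        # handed off to post-exam deep recognition
```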

[63] Topological Conditioning for Mammography Models via a Stable Wavelet-Persistence Vectorization

Charles Fanning, Mehmet Emin Aktas

Main category: cs.CV

TL;DR: Using topological data analysis (wavelet persistence) to create stable multi-scale maps that improve breast cancer detection model performance across different scanners and populations.

DetailsMotivation: Breast cancer screening mammography suffers from false positives/negatives and model degradation when deployed across different scanners, modalities, and patient populations. Need for more robust models that maintain accuracy across diverse clinical settings.

Method: Propose wavelet-based vectorization of persistent homology to create spatial, multi-scale maps that are provably stable to intensity perturbations. Integrate these topological maps into a two-stage detection pipeline through input-level channel concatenation with ConvNeXt Tiny architecture.

Result: On INbreast dataset, augmenting ConvNeXt Tiny with wavelet persistence channels increased patient-level AUC from 0.55 to 0.75 under limited training budget. Model trained on US CBIS DDSM digitized film mammography and evaluated on Portuguese (INbreast) and Chinese (CMMD) full-field digital mammography cohorts.

Conclusion: Wavelet persistence conditioning signals significantly improve external performance of breast cancer detection models across different scanners and populations, demonstrating the value of topological data analysis for creating robust medical imaging models.

Abstract: Breast cancer is the most commonly diagnosed cancer in women and a leading cause of cancer death worldwide. Screening mammography reduces mortality, yet interpretation still suffers from substantial false negatives and false positives, and model accuracy often degrades when deployed across scanners, modalities, and patient populations. We propose a simple conditioning signal aimed at improving external performance, built on a wavelet-based vectorization of persistent homology. Using topological data analysis, we summarize image structure that persists across intensity thresholds and convert this information into spatial, multi-scale maps that are provably stable to small intensity perturbations. These maps are integrated into a two-stage detection pipeline through input-level channel concatenation. The model is trained and validated on the CBIS-DDSM digitized film mammography cohort from the United States and evaluated on two independent full-field digital mammography cohorts from Portugal (INbreast) and China (CMMD), with performance reported at the patient level. On INbreast, augmenting ConvNeXt Tiny with wavelet persistence channels increases patient-level AUC from 0.55 to 0.75 under a limited training budget.
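
Input-level channel concatenation is the simplest way to condition a backbone on such maps. Here is a minimal sketch, assuming two precomputed persistence maps per image and a torchvision ConvNeXt-Tiny whose stem convolution is widened to accept the extra channels; the map count and the exact surgery are illustrative, not the paper's configuration:

```python
# Sketch: condition ConvNeXt-Tiny on extra topological input channels.
import torch
import torch.nn as nn
import torchvision

n_topo = 2                                    # e.g., two wavelet-persistence maps (assumed)
model = torchvision.models.convnext_tiny(weights=None)
stem = model.features[0][0]                   # original stem conv: 3 -> 96, stride-4 patchify
model.features[0][0] = nn.Conv2d(3 + n_topo, stem.out_channels,
                                 kernel_size=stem.kernel_size,
                                 stride=stem.stride)

image = torch.randn(1, 3, 224, 224)           # mammogram replicated to 3 channels
topo = torch.randn(1, n_topo, 224, 224)       # persistence maps at image resolution
logits = model(torch.cat([image, topo], dim=1))  # forward pass with concatenated channels
```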

[64] Panoramic Out-of-Distribution Segmentation

Mengfei Duan, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Kailun Yang, Zhiyong Li

Main category: cs.CV

TL;DR: Proposes POS, the first method for Panoramic Out-of-distribution Segmentation (PanOoS), addressing panoramic image challenges with text-guided prompt distribution learning and establishing two new benchmarks.

DetailsMotivation: Current panoramic semantic segmentation fails to identify outliers, and pinhole OoS models perform poorly on panoramic images due to pixel distortions and background clutter, creating a need for comprehensive and safe scene understanding in panoramic domains.

Method: POS uses text-guided prompt distribution learning with disentanglement strategy to leverage CLIP’s cross-domain generalization. Includes Prompt-based Restoration Attention for semantic decoding and Bilevel Prompt Distribution Learning to refine mask embeddings via semantic prototype supervision.

Result: POS achieves superior performance with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS benchmark, outperforming state-of-the-art pinhole-OoS methods while also achieving leading closed-set segmentation capabilities.

Conclusion: POS effectively addresses panoramic out-of-distribution segmentation challenges, establishes new benchmarks to compensate for dataset scarcity, and advances panoramic understanding for applications like autonomous driving and augmented reality.

Abstract: Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications such as autonomous driving and augmented reality. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to pixel distortions and background clutter. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. In addition, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate the superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.

[65] Feature Coding for Scalable Machine Vision

Md Eimran Hossain Eimon, Juan Merlos, Ashan Perera, Hari Kalva, Velibor Adzic, Borko Furht

Main category: cs.CV

TL;DR: FCTM achieves ~85% bitrate reduction for edge-cloud split DNN inference by compressing intermediate features, enabling efficient deployment in bandwidth-limited applications.

DetailsMotivation: DNNs are computationally expensive for edge devices, but full offloading to cloud causes latency, bandwidth, and privacy issues. Edge-cloud split inference needs efficient feature compression to reduce transmission overhead.

Method: MPEG’s Feature Coding for Machines (FCM) standard defines bitstream syntax and codec pipeline for compressing intermediate DNN features. The paper presents Feature Coding Test Model (FCTM) implementation.

Result: FCTM achieves average 85.14% bitrate reduction across multiple vision tasks while preserving accuracy, demonstrating significant bandwidth savings for split inference.

Conclusion: FCM provides scalable, interoperable solution for efficient deployment of intelligent features in bandwidth-limited, privacy-sensitive consumer applications through standardized feature compression.

Abstract: Deep neural networks (DNNs) drive modern machine vision but are challenging to deploy on edge devices due to high compute demands. Traditional approaches (running the full model on-device or offloading to the cloud) face trade-offs in latency, bandwidth, and privacy. Splitting the inference workload between the edge and the cloud offers a balanced solution, but transmitting intermediate features to enable such splitting introduces new bandwidth challenges. To address this, the Moving Picture Experts Group (MPEG) initiated the Feature Coding for Machines (FCM) standard, establishing a bitstream syntax and codec pipeline tailored for compressing intermediate features. This paper presents the design and performance of the Feature Coding Test Model (FCTM), showing significant bitrate reductions, averaging 85.14%, across multiple vision tasks while preserving accuracy. FCM offers a scalable path for efficient and interoperable deployment of intelligent features in bandwidth-limited and privacy-sensitive consumer applications.
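
The core pre/post-processing step in feature coding is making float feature tensors codec-friendly. The toy sketch below shows only that step: a min-max quantizer that maps a split-point feature tensor to 8-bit planes (which an inner video codec would then compress) and the matching dequantizer on the cloud side. FCTM's actual coding tools are more elaborate; this is just the intuition:

```python
# Toy sketch of edge-cloud split inference pre/post-processing (not FCTM itself).
import numpy as np

def quantize_features(feat: np.ndarray):
    """Map a float feature tensor to uint8 planes plus dequantization params."""
    lo, hi = float(feat.min()), float(feat.max())
    q = np.round((feat - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return q, lo, hi

def dequantize_features(q: np.ndarray, lo: float, hi: float):
    """Cloud-side inverse mapping back to float features."""
    return q.astype(np.float32) / 255 * (hi - lo) + lo

feat = np.random.randn(256, 64, 64).astype(np.float32)   # C x H x W split-point features
q, lo, hi = quantize_features(feat)
rec = dequantize_features(q, lo, hi)
print("max abs error:", np.abs(rec - feat).max())          # bounded by the step size
print("raw bytes:", feat.nbytes, "-> 8-bit bytes:", q.nbytes)  # 4x smaller before any codec
```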

[66] Latent Chain-of-Thought World Modeling for End-to-End Driving

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic

Main category: cs.CV

TL;DR: LCDrive introduces latent chain-of-thought reasoning for autonomous driving, using action-aligned latent tokens instead of natural language to improve reasoning efficiency and driving performance.

DetailsMotivation: Text-based reasoning in Vision-Language-Action models may not be the most efficient representation for autonomous driving. Natural language CoT reasoning has limitations in capturing action outcomes and decision-making processes effectively.

Method: Uses latent language for CoT reasoning with two token types: action-proposal tokens (same vocabulary as output actions) and world model tokens (grounded in learned latent world model). Cold-starts with supervision from ground-truth future rollouts, then post-trains with closed-loop reinforcement learning.

Result: Achieves faster inference, better trajectory quality, and larger improvements from interactive RL compared to both non-reasoning and text-reasoning baselines on large-scale end-to-end driving benchmark.

Conclusion: Latent CoT reasoning unifies reasoning and decision-making in action-aligned latent space, outperforming text-based approaches in autonomous driving tasks through more efficient representation and better integration with reinforcement learning.

Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model’s output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold-start the latent CoT by supervising the model’s action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

[67] Emerging Standards for Machine-to-Machine Video Coding

Md Eimran Hossain Eimon, Velibor Adzic, Hari Kalva, Borko Furht

Main category: cs.CV

TL;DR: FCM (Feature Coding for Machines) compresses neural features instead of pixels for machine-to-machine communication, reducing bandwidth while maintaining accuracy close to edge inference, with existing codecs like HEVC performing well.

DetailsMotivation: Current machine-to-machine systems rely on pixel-based video streaming optimized for humans, which is bandwidth-intensive, scales poorly, and exposes raw images to third parties. There's a need for more efficient, privacy-preserving alternatives.

Method: Two approaches: 1) Video Coding for Machines (VCM) applies task-aware coding in pixel domain, 2) Feature Coding for Machines (FCM) compresses intermediate neural features. FCM uses H.26X codecs (H.264/AVC, H.265/HEVC, H.266/VVC) as inner codecs for feature compression.

Result: FCM maintains accuracy close to edge inference while significantly reducing bitrate. HEVC and VVC achieve nearly identical performance (1.39% BD-Rate increase when replacing VVC with HEVC), while AVC shows 32.28% increase vs VVC. For tracking tasks, codec choice has minimal impact, with HEVC even outperforming VVC.

Conclusion: FCM enables efficient machine-to-machine communication with reduced bandwidth and privacy preservation. Existing hardware for deployed codecs (especially HEVC) can support this paradigm without performance degradation, facilitating practical adoption.

Abstract: Machines are increasingly becoming the primary consumers of visual data, yet most deployments of machine-to-machine systems still rely on remote inference where pixel-based video is streamed using codecs optimized for human perception. Consequently, this paradigm is bandwidth-intensive, scales poorly, and exposes raw images to third parties. Recent efforts in the Moving Picture Experts Group (MPEG) redesigned the pipeline for machine-to-machine communication: Video Coding for Machines (VCM) is designed to apply task-aware coding tools in the pixel domain, and Feature Coding for Machines (FCM) is designed to compress intermediate neural features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM is capable of maintaining accuracy close to edge inference while significantly reducing bitrate. Additional analysis of H.26X codecs used as inner codecs in FCM reveals that H.265/High Efficiency Video Coding (HEVC) and H.266/Versatile Video Coding (VVC) achieve almost identical machine task performance, with an average BD-Rate increase of 1.39% when VVC is replaced with HEVC. In contrast, H.264/Advanced Video Coding (AVC) yields an average BD-Rate increase of 32.28% compared to VVC. However, for the tracking task, the impact of codec choice is minimal: HEVC even outperforms VVC with a BD-Rate of -1.81%, and AVC yields 8.79%, indicating that existing hardware for already-deployed codecs can support machine-to-machine communication without degrading performance.

[68] Multi-dimensional Preference Alignment by Conditioning Reward Itself

Jiho Jang, Jinyoung Kim, Kyungjune Baek, Nojun Kwak

Main category: cs.CV

TL;DR: MCDPO addresses reward conflicts in DPO for diffusion models by introducing a disentangled Bradley-Terry objective with conditional preference vectors, enabling independent optimization across multiple reward axes.

DetailsMotivation: Standard DPO formulation has a fundamental limitation: it aggregates diverse evaluation axes (aesthetic quality, semantic alignment) into a single scalar reward using the Bradley-Terry model, creating reward conflicts where models unlearn desirable features from globally non-preferred samples.

Method: Proposes Multi Reward Conditional DPO (MCDPO) with: 1) Disentangled Bradley-Terry objective, 2) Explicit injection of preference outcome vectors as training conditions, 3) Dimensional reward dropout for balanced optimization across dimensions, enabling independent learning of optimization directions for each reward axis within a single network.

Result: Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate superior performance on benchmarks. The conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.

Conclusion: MCDPO effectively resolves reward conflicts in DPO-based alignment of diffusion models, providing a more flexible framework that maintains desirable features across different evaluation dimensions while enabling inference-time control over specific reward axes.

Abstract: Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.
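
Schematically, the disentangled objective replaces one Bradley-Terry term over a scalar "win" with one term per reward axis, signed by that axis's preference outcome, plus dropout over axes to balance them. The sketch below assumes the per-axis log-likelihoods are available as (B, K) tensors from a model conditioned on the preference vector; it is an illustration of the idea, not the paper's exact loss:

```python
# Schematic of a per-axis (disentangled) Bradley-Terry / DPO objective.
import torch
import torch.nn.functional as F

def multi_reward_dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref,
                          prefs, beta=0.1, drop_p=0.2):
    """
    logp_*: (B, K) per-axis log-likelihoods of the 'winner'/'loser' samples under
    the policy and reference models (assumed shapes); prefs: (B, K) in {+1, -1}.
    """
    margin = (logp_w - logp_w_ref) - (logp_l - logp_l_ref)    # (B, K) implicit rewards
    per_axis = -F.logsigmoid(beta * prefs * margin)           # sign flips per axis outcome
    keep = (torch.rand_like(per_axis) > drop_p).float()       # dimensional reward dropout
    return (per_axis * keep).sum(dim=1).mean()

B, K = 8, 3
logp_w, logp_l, logp_w_ref, logp_l_ref = (torch.randn(B, K) for _ in range(4))
prefs = torch.randint(0, 2, (B, K)).float() * 2 - 1           # per-axis outcomes in {+1, -1}
print(multi_reward_dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, prefs))
```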

[69] Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective

Tian Liu, Anwesha Basu, James Caverlee, Shu Kong

Main category: cs.CV

TL;DR: SWIFT enables effective semi-supervised few-shot learning by finetuning Vision-Language Models with simple techniques (classifier initialization & temperature tuning) to overcome flat probability distributions and leverage unlabeled data.

DetailsMotivation: Semi-supervised few-shot learning (SSFSL) for real-world auto-annotation should leverage open-source Vision-Language Models (VLMs) and their pretraining data, but existing SSL methods fail when applied to VLMs due to flat probability distributions that prevent effective use of unlabeled data.

Method: SWIFT (Stage-Wise Finetuning with Temperature Tuning) uses classifier initialization and temperature tuning to increase confidence scores of pseudo-labels, enabling SSL methods to effectively finetune VLMs on limited labeled data, abundant unlabeled data, and task-relevant noisy data from VLM pretraining sets.

Result: SWIFT outperforms recent FSL and SSL methods by ~5 accuracy points on five SSFSL benchmarks, and even rivals supervised learning that uses ground truth labels for unlabeled data.

Conclusion: Simple techniques can overcome VLM limitations for SSFSL, enabling effective auto-annotation by leveraging open-source VLMs and their pretraining data, bridging the gap between SSFSL and FSL in utilizing modern vision-language models.

Abstract: Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation", as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data and strengthening supervision signals. Building on this, we propose Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. SWIFT even rivals supervised learning, in which the VLM is finetuned with ground-truth labels for the unlabeled data.
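
The "flat softmax" failure mode is easy to reproduce: with near-uniform probabilities, no unlabeled sample clears a FixMatch-style confidence threshold, so the utilization rate is zero, and a small temperature sharpens the distribution enough to restore usable pseudo-labels. The numbers below are synthetic and the 0.95 threshold is an assumption; SWIFT's actual stages and values differ:

```python
# Toy demonstration of temperature tuning for pseudo-label confidence.
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(0, 0.5, size=(1000, 100))   # "flat" VLM-like logits, 100 classes
for tau in (1.0, 0.05):
    conf = softmax(logits, tau).max(axis=-1)     # pseudo-label confidence per sample
    util = (conf > 0.95).mean()                  # fraction passing the threshold
    print(f"tau={tau}: mean confidence {conf.mean():.3f}, utilization {util:.2%}")
```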

[70] RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

Zhuo Wang, Xiliang Liu, Ligang Sun

Main category: cs.CV

TL;DR: RobustSora benchmark evaluates how digital watermarks affect AI-generated video detection, showing detectors partially rely on watermarks with 2-8pp performance variations when watermarks are manipulated.

DetailsMotivation: Current AIGC video detection benchmarks overlook that many generative models embed digital watermarks, and detectors may rely on these patterns, creating a vulnerability in detection systems.

Method: Created RobustSora benchmark with 6,500 videos in four categories: Authentic-Clean, Authentic-Spoofed (with fake watermarks), Generated-Watermarked, and Generated-DeWatermarked. Evaluated ten models across two tasks: Task-I tests detection on watermark-removed AI videos, Task-II assesses false alarms on authentic videos with fake watermarks.

Result: Models show 2-8pp performance variations under watermark manipulation. Transformer-based models have consistent moderate dependency (6-8pp), while MLLMs show diverse patterns (2-8pp), indicating partial watermark dependency in current detectors.

Conclusion: AIGC video detectors partially rely on watermarks, creating vulnerabilities. RobustSora provides essential tools for developing watermark-aware training strategies and advancing robust detection research.

Abstract: The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, a benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

[71] THE-Pose: Topological Prior with Hybrid Graph Fusion for Estimating Category-Level 6D Object Pose

Eunho Lee, Chaehyeon Song, Seunghoon Jeong, Ayoung Kim

Main category: cs.CV

TL;DR: THE-Pose improves category-level 6D pose estimation by integrating topological features from images with 3D point cloud features via hybrid graph fusion, overcoming limitations of pure 3D graph convolution methods.

DetailsMotivation: Existing 3D graph convolution methods focus only on local geometry and depth, making them vulnerable to complex objects and visual ambiguities. They lack global context needed for robustness against intra-class variations.

Method: THE-Pose extracts consistent topological features from images and fuses them with point-cloud features using a Hybrid Graph Fusion (HGF) module, bridging 2D image context with 3D geometric structure.

Result: On REAL275 dataset, THE-Pose achieves 35.8% improvement over 3D-GC baseline (HS-Pose) and surpasses previous state-of-the-art by 7.2% across all key metrics.

Conclusion: Integrating topological priors from images with 3D geometric features enables more robust category-level pose estimation, especially for unseen/complex objects under occlusion.

Abstract: Category-level object pose estimation requires both global context and local structure to ensure robustness against intra-class variations. However, 3D graph convolution (3D-GC) methods only focus on local geometry and depth information, making them vulnerable to complex objects and visual ambiguities. To address this, we present THE-Pose, a novel category-level 6D pose estimation framework that leverages a topological prior via surface embedding and hybrid graph fusion. Specifically, we extract consistent and invariant topological features from the image domain, effectively overcoming the limitations inherent in existing 3D-GC based methods. Our Hybrid Graph Fusion (HGF) module adaptively integrates the topological features with point-cloud features, seamlessly bridging 2D image context and 3D geometric structure. These fused features ensure stability for unseen or complicated objects, even under significant occlusions. Extensive experiments on the REAL275 dataset show that THE-Pose achieves a 35.8% improvement over the 3D-GC baseline (HS-Pose) and surpasses the previous state-of-the-art by 7.2% across all key metrics. The code is available at https://github.com/EHxxx/THE-Pose

[72] GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

Rui Wang, Yimu Sun, Jingxing Guo, Huisi Wu, Jing Qin

Main category: cs.CV

TL;DR: GDKVM: A novel architecture for echocardiography video segmentation using Linear Key-Value Association for inter-frame correlations, Gated Delta Rule for memory storage, and Key-Pixel Feature Fusion for multi-scale feature integration, achieving state-of-the-art accuracy with real-time performance.

DetailsMotivation: Accurate cardiac chamber segmentation in echocardiography is crucial for clinical diagnosis, but existing methods struggle with balancing long-range spatiotemporal dependency capture, computational efficiency, and fine-grained feature representation due to imaging noise, artifacts, and heart deformation/motion.

Method: GDKVM architecture with three key components: 1) Linear Key-Value Association (LKVA) for modeling inter-frame correlations, 2) Gated Delta Rule (GDR) for efficient intermediate memory state storage, and 3) Key-Pixel Feature Fusion (KPFF) module for integrating local and global features at multiple scales to handle boundary blurring and noise.

Result: Outperforms state-of-the-art methods on two mainstream echocardiography datasets (CAMUS and EchoNet-Dynamic) in segmentation accuracy and robustness while maintaining real-time performance.

Conclusion: GDKVM effectively addresses the trade-off between capturing long-range spatiotemporal dependencies and computational efficiency in echocardiography video segmentation, providing superior performance for clinical applications.

Abstract: Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces Gated Delta Rule (GDR) to efficiently store intermediate memory states. Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at https://github.com/wangrui2025/GDKVM.
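
For intuition, a gated delta-rule memory keeps a key-value matrix that is decayed by a gate and corrected toward each new association by the prediction error (the "delta"). The sketch below is a generic form of the rule as used in linear-attention-style models, not GDKVM's exact parameterization:

```python
# Generic gated delta-rule update for a key-value memory (schematic).
import torch
import torch.nn.functional as F

def gated_delta_update(S, k, v, alpha, beta):
    """
    S: (d_k, d_v) memory matrix; k: (d_k,) unit-norm key; v: (d_v,) value;
    alpha in (0, 1): forget gate; beta in (0, 1): write strength.
    """
    pred = S.T @ k                       # value currently associated with key k
    delta = v - pred                     # prediction error (the "delta rule")
    return alpha * S + beta * torch.outer(k, delta)

d_k, d_v = 64, 64
S = torch.zeros(d_k, d_v)
k = F.normalize(torch.randn(d_k), dim=0)
v = torch.randn(d_v)
S = gated_delta_update(S, k, v, alpha=0.95, beta=0.5)
print((S.T @ k - v).norm())              # memory moves toward storing the (k -> v) pair
```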

[73] VLM-NCD: Novel Class Discovery with Vision-Based Large Language Models

Yuetong Su, Baoguo Wei, Xinyu Wang, Xu Li, Lixin Li

Main category: cs.CV

TL;DR: LLM-NCD: A multimodal framework that fuses visual-textual semantics with prototype-guided clustering for Novel Class Discovery, achieving significant accuracy improvements and unique resilience to long-tail distributions.

DetailsMotivation: Existing NCD methods for images rely solely on visual features, which suffer from insufficient feature discriminability and are vulnerable to long-tail distribution problems in unlabelled data.

Method: Multimodal framework that fuses visual-textual semantics and uses prototype-guided clustering. Key innovations: 1) Joint optimization of known class image and text features to model cluster centers and semantic prototypes, 2) Dual-phase discovery mechanism that dynamically separates known/novel samples via semantic affinity thresholds and adaptive clustering.

Result: On CIFAR-100 dataset, achieves up to 25.3% improvement in accuracy for unknown classes compared to current methods. Shows unique resilience to long-tail distributions - a first in NCD literature.

Conclusion: LLM-NCD successfully overcomes limitations of visual-only NCD methods by leveraging multimodal fusion and prototype-guided clustering, demonstrating superior performance and robustness to data distribution challenges.

Abstract: Novel Class Discovery aims to utilize prior knowledge of known classes to classify and discover unknown classes from unlabeled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modeling cluster centers and semantic prototypes of known classes by jointly optimizing known-class image and text features, and a dual-phase discovery mechanism that dynamically separates known or novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that, compared to current methods, this method achieves up to a 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in the NCD literature.
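
The dual-phase discovery step can be pictured as a threshold-then-cluster routine: samples whose best cosine affinity to a known-class prototype clears a threshold are routed to known classes, and the remainder are clustered into novel-class candidates. The threshold value, the KMeans choice, and the shapes below are illustrative assumptions, not the paper's exact mechanism:

```python
# Sketch of a threshold-then-cluster known/novel split (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def split_known_novel(feats, prototypes, tau=0.7, n_novel=10):
    """feats: (N, D) features; prototypes: (C, D) known-class prototypes."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    affinity = feats @ protos.T               # cosine similarity to each prototype
    known_mask = affinity.max(axis=1) >= tau  # route confident samples to known classes
    known_labels = affinity.argmax(axis=1)
    novel_labels = (KMeans(n_clusters=n_novel, n_init=10)
                    .fit_predict(feats[~known_mask])
                    if (~known_mask).any() else np.array([]))
    return known_mask, known_labels, novel_labels

feats = np.random.randn(200, 128)
protos = np.random.randn(5, 128)
known_mask, _, novel_labels = split_known_novel(feats, protos, tau=0.3)
print(known_mask.sum(), "routed to known classes;", (~known_mask).sum(), "clustered as novel")
```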

[74] Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin

Main category: cs.CV

TL;DR: Long-LRM++ improves upon Long-LRM by using semi-explicit scene representation with lightweight decoder, achieving real-time 14 FPS rendering while matching LaCT’s quality, scaling to 64 input views, and providing better depth prediction.

DetailsMotivation: Direct Gaussian parameter prediction is error-sensitive (causing blurring), while implicit methods like LVSM/LaCT have high fidelity but require computationally intensive decompression for each frame, making real-time rendering infeasible. Need to retain implicit representation benefits while enabling real-time performance.

Method: Adopts semi-explicit scene representation combined with lightweight decoder. Scales to 64 input views at 950×540 resolution. Uses design that overcomes speed limitations of prior implicit methods.

Result: Matches LaCT’s rendering quality on DL3DV while achieving real-time 14 FPS on A100 GPU. Scales to 64 input views. Delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians.

Conclusion: Long-LRM++ successfully addresses the trade-off between rendering quality and speed by combining semi-explicit representation with lightweight decoder, enabling real-time performance while maintaining high fidelity, with extensive ablation studies validating the framework.

Abstract: Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential “decompression” process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

[75] Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Hongsin Lee, Hye Won Chung

Main category: cs.CV

TL;DR: Stronger robust teachers don’t always produce more robust students due to robust saturation; adversarial transferability is key, leading to SAAD method that reweights examples by transferability.

DetailsMotivation: Existing adversarial distillation methods often don't use state-of-the-art robust teachers, and stronger teachers don't necessarily yield more robust students (robust saturation). The authors aim to understand why this happens and improve robustness transfer.

Method: Propose Sample-wise Adaptive Adversarial Distillation (SAAD) that reweights training examples based on measured adversarial transferability (fraction of student-crafted adversarial examples effective against teacher) without extra computational cost.

Result: SAAD consistently improves AutoAttack robustness over prior methods on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets.

Conclusion: Adversarial transferability is a key factor in successful robustness transfer, not just capacity gaps. SAAD effectively addresses robust saturation by adaptively reweighting examples based on transferability.

Abstract: Adversarial distillation in the standard min-max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students, a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability, the fraction of student-crafted adversarial examples that remain effective against the teacher, as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods. Our code is available at https://github.com/HongsinLee/saad.
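
One way to implement the idea without forward passes beyond those distillation already needs: after crafting adversaries on the student, check which ones also flip the teacher's prediction, and upweight those samples in the distillation loss. The weighting rule, the transferability test, and the FGSM stand-in attacker below are our own illustrative choices, not necessarily the paper's:

```python
# Sketch of transferability-weighted adversarial distillation (illustrative).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Single-step FGSM as a stand-in for the student's attack."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def saad_like_step(student, teacher, x, y, base_w=1.0, bonus=1.0):
    """One distillation step with per-sample transferability weights."""
    x_adv = fgsm(student, x, y)                   # adversaries crafted on the student
    with torch.no_grad():
        t_adv_logits = teacher(x_adv)
        transferable = (t_adv_logits.argmax(1) != y).float()  # also fools the teacher
        w = base_w + bonus * transferable         # upweight transferable samples (assumed rule)
        soft_t = F.softmax(t_adv_logits, dim=1)
    loss = F.kl_div(F.log_softmax(student(x_adv), dim=1), soft_t,
                    reduction="none").sum(dim=1)  # per-sample distillation loss
    return (w * loss).mean()
```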

[76] MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu

Main category: cs.CV

TL;DR: MotionEdit is a new dataset for motion-centric image editing, with a benchmark showing current models struggle, and MotionNFT framework improves motion editing quality.

DetailsMotivation: Existing image editing datasets focus on static appearance changes or have low-quality motion edits, lacking realistic motion transformations. Motion-centric editing is scientifically challenging and practically important for applications like video synthesis and animation.

Method: 1) Create MotionEdit dataset with high-fidelity image pairs from continuous videos showing realistic motion transformations. 2) Develop MotionEdit-Bench benchmark with generative, discriminative, and preference metrics. 3) Propose MotionNFT (Motion-guided Negative-aware Fine Tuning) framework that computes motion alignment rewards based on how well motion flow matches ground truth.

Result: MotionEdit-Bench reveals motion editing is highly challenging for state-of-the-art diffusion models. MotionNFT consistently improves editing quality and motion fidelity on FLUX.1 Kontext and Qwen-Image-Edit models without sacrificing general editing ability.

Conclusion: MotionEdit addresses a critical gap in motion-centric image editing, providing a valuable dataset and benchmark. The proposed MotionNFT framework effectively improves motion editing performance, demonstrating practical significance for applications requiring accurate motion transformations.

Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing: the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.

[77] ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao

Main category: cs.CV

TL;DR: ShotDirector is a framework for controllable multi-shot video generation that integrates precise camera control with professional editing patterns to create film-like shot transitions, addressing the gap in current methods that focus only on visual consistency without intentional narrative design.

DetailsMotivation: Current multi-shot video generation methods focus primarily on low-level visual consistency across shots but neglect how transitions are designed and how cinematographic language contributes to coherent narrative expression, resulting in mere sequential shot changes without intentional film-editing patterns.

Method: ShotDirector integrates parameter-level camera control (6-DoF poses and intrinsic settings) with hierarchical editing-pattern-aware prompting. It uses a camera control module for precise camera information injection and a shot-aware mask mechanism for hierarchical prompts that provide fine-grained control over shot content based on professional editing patterns.

Result: The framework effectively combines parameter-level conditions with high-level semantic guidance to achieve film-like controllable shot transitions. The authors also created ShotWeaver40K dataset capturing film-like editing pattern priors and developed evaluation metrics for controllable multi-shot video generation.

Conclusion: ShotDirector addresses the limitation of current video generation methods by incorporating both precise camera control and professional editing patterns, enabling more intentional and narrative-driven shot transitions that better emulate film-making techniques.

Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.

[78] Physically Aware 360° View Generation from a Single Image using Disentangled Scene Embeddings

Karthikeya KV, Narendra Bandaru

Main category: cs.CV

TL;DR: Disentangled360 is a 3D-aware framework that combines direction disentangled volume rendering with single-image 360° view synthesis for medical imaging and natural scene reconstruction, outperforming existing methods in quality and efficiency.

DetailsMotivation: Current techniques either oversimplify anisotropic light behavior or lack generalizability across different contexts (medical vs. natural scenes). There's a need for a unified framework that can handle both isotropic and anisotropic contributions while maintaining structural realism without scene-specific fine-tuning.

Method: Uses Gaussian Splatting backbone with distinct separation of isotropic and anisotropic contributions. Implements dual-branch conditioning: one for CT intensity driven scattering in volumetric data, another for real-world RGB scenes via normalized camera embeddings. Includes hybrid pose agnostic anchoring method for adaptive depth and material transition sampling to address scale ambiguity.

Result: Superior SSIM and LPIPS performance on Mip-NeRF 360, RealEstate10K, and DeepDRR datasets. Runtime assessments confirm viability for interactive applications. Enables rapid, photorealistic view synthesis with inherent directionality.

Conclusion: Disentangled360 successfully integrates preoperative radiography simulation and consumer-grade 360° rendering into a single inference pipeline, facilitating mixed-reality medical supervision, robotic perception, and immersive content creation without requiring scene-specific fine-tuning or expensive photon simulations.

Abstract: We introduce Disentangled360, an innovative 3D-aware framework that integrates the advantages of direction-disentangled volume rendering with single-image 360° novel view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT intensity-driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose-agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.

[79] Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Main category: cs.CV

TL;DR: Efficient-VLN: A training-efficient VLN model that reduces computational overhead through progressive memory, recursive memory, and dynamic mixed policy, achieving SOTA performance with dramatically reduced training time.

DetailsMotivation: Multimodal LLMs show promise in Vision-Language Navigation but face severe training overhead from (1) quadratic computational burden from processing long-horizon observations as massive token sequences, and (2) exploration-efficiency trade-off in DAgger where more exploration yields better error recovery but increases trajectory lengths.

Method: Proposes Efficient-VLN with two efficient memory mechanisms: progressive memory (dynamically allocates more tokens to recent observations) and learnable recursive memory (uses key-value cache of learnable tokens as memory state). Also introduces dynamic mixed policy to balance exploration-efficiency trade-off.

Result: Achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR) while consuming only 282 H800 GPU hours, demonstrating dramatic reduction in training overhead compared to SOTA methods.

Conclusion: Efficient-VLN successfully addresses the training efficiency challenges in VLN through novel memory mechanisms and policy design, enabling practical development of MLLMs for VLN with substantially reduced computational requirements.

Abstract: Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
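
A toy version of the progressive memory makes the token-budget idea concrete: each observation's budget decays with its age, so recent frames keep full resolution while older ones are pooled into a few summary tokens. The decay schedule and pooling operator below are assumptions for illustration, not Efficient-VLN's actual mechanism:

```python
# Toy progressive memory: per-observation token budgets that decay with age.
import torch
import torch.nn.functional as F

def progressive_memory(obs_tokens, max_tokens=64, decay=0.5, min_tokens=4):
    """obs_tokens: list of (N_i, D) tensors, oldest observation first."""
    out = []
    for age, toks in enumerate(reversed(obs_tokens)):    # age 0 = most recent frame
        budget = max(min_tokens, int(max_tokens * decay ** age))
        if toks.shape[0] > budget:
            # average-pool along the token axis down to the budget
            toks = F.adaptive_avg_pool1d(toks.T.unsqueeze(0), budget).squeeze(0).T
        out.append(toks)
    return torch.cat(list(reversed(out)), dim=0)          # restore oldest-first order

history = [torch.randn(64, 256) for _ in range(6)]        # 6 frames, 64 tokens each
mem = progressive_memory(history)
print(mem.shape)   # far fewer tokens than the 6 * 64 of the raw history
```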

[80] DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation

Anh M. Vu, Khang P. Le, Trang T. K. Vo, Ha Thach, Huy Hung Nguyen, David Yang, Han H. Huynh, Quynh Nguyen, Tuan M. Pham, Tuan-Anh Le, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen

Main category: cs.CV

TL;DR: A prototype-driven framework using vision-language alignment improves weakly supervised semantic segmentation in histopathology by combining text and image prototypes with multi-scale pyramid features.

DetailsMotivation: Current weakly supervised semantic segmentation (WSSS) in histopathology faces limitations: inter-class homogeneity (different classes look similar), intra-class heterogeneity (same class looks different), and region-shrinkage effect from CAM-based supervision. These issues hinder accurate region discovery with only image-level labels.

Method: Proposes a prototype-driven framework with vision-language alignment. Uses CoOp-style learnable prompt tuning to generate text-based prototypes, combines them with learnable image prototypes to create a dual-modal prototype bank capturing both semantic and appearance cues. Incorporates a multi-scale pyramid module to address oversmoothing in ViT representations and improve spatial precision.

Result: Experiments on BCSS-WSSS benchmark show the approach surpasses existing state-of-the-art methods. Detailed analyses demonstrate benefits of text description diversity, context length, and complementary behavior of text and image prototypes.

Conclusion: The framework effectively leverages textual semantics and visual prototype learning for WSSS in digital pathology, showing that joint use of text and image prototypes improves region discovery under weak supervision.

Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
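
The dual-modal prototype bank amounts to scoring patch features against a frozen text-derived prototype and a learnable image prototype per class, then fusing the two score maps. The sketch below assumes CLIP-style 512-d embeddings and a single learned fusion weight; the paper's actual prompt tuning and fusion are richer:

```python
# Schematic dual-modal (text + image) prototype bank for patch scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPrototypeBank(nn.Module):
    def __init__(self, n_classes, dim, text_protos):
        super().__init__()
        # frozen text prototypes, e.g., encoded pathology prompts (assumed given)
        self.register_buffer("text", F.normalize(text_protos, dim=1))    # (C, D)
        self.image = nn.Parameter(torch.randn(n_classes, dim))           # learnable prototypes
        self.alpha = nn.Parameter(torch.tensor(0.5))                     # fusion weight

    def forward(self, patch_feats):               # patch_feats: (N, D) patch embeddings
        f = F.normalize(patch_feats, dim=1)
        s_text = f @ self.text.T                  # semantic (text) scores
        s_img = f @ F.normalize(self.image, dim=1).T   # appearance (image) scores
        return self.alpha * s_text + (1 - self.alpha) * s_img  # fused (N, C) scores

bank = DualPrototypeBank(n_classes=4, dim=512, text_protos=torch.randn(4, 512))
scores = bank(torch.randn(196, 512))              # per-patch class scores for a 14x14 grid
```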

[81] ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation

Khang Le, Ha Thach, Anh M. Vu, Trang T. K. Vo, Han H. Huynh, David Yang, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen

Main category: cs.CV

TL;DR: A prototype learning framework for weakly supervised semantic segmentation in histopathology that combines CONCH’s morphology-aware representations, SegFormer’s spatial cues, and text-guided semantic alignment to generate high-quality pseudo masks without dense annotations.

DetailsMotivation: Current WSSS methods in histopathology rely on classification backbones that only localize discriminative regions and fail to capture full tissue structures. Vision-language models like CONCH offer semantic alignment but combining them with segmentation backbones like SegFormer is challenging under weak supervision without dense annotations.

Method: Proposes a prototype learning framework integrating: 1) morphology-aware representations from CONCH, 2) multi-scale structural cues from SegFormer, and 3) text-guided semantic alignment. Uses text-guided prototype initialization with pathology descriptions, and structural distillation to transfer spatial knowledge from SegFormer while preserving fine-grained patterns and boundaries.

Result: The approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets show it outperforms existing WSSS methods while remaining computationally efficient through frozen foundation models and lightweight adapters.

Conclusion: The prototype learning framework successfully integrates complementary strengths of vision-language models and segmentation backbones for WSSS in histopathology, achieving better performance than existing methods while maintaining computational efficiency.

Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.

[82] Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset

Hyunsoo Lee, Daeum Jeon, Hyeokjae Oh

Main category: cs.CV

TL;DR: Point2Pose: A generative framework for 3D human pose estimation using sequential point clouds and pose history, with a new large-scale indoor dataset MVPose3D.

DetailsMotivation: 3D human pose estimation faces challenges from complex human body geometry, self-occluding joints, and the need for large-scale real-world motion datasets.

Method: Point2Pose uses a spatio-temporal point cloud encoder and pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor that models pose distribution conditioned on sequential point clouds and pose history.

Result: The method outperforms baseline models across various datasets, demonstrating superior performance in 3D human pose estimation.

Conclusion: Point2Pose effectively addresses 3D human pose estimation challenges through generative modeling and is supported by the new MVPose3D dataset containing multi-modal data (IMU, point clouds, RGB images).

Abstract: We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.

[83] EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen

Main category: cs.CV

TL;DR: EchoingPixels is an audio-visual token reduction framework that uses cross-modal attention to dynamically reduce tokens from a combined audio-visual pool, achieving 2-3x speedup with only 5-20% of original tokens.

DetailsMotivation: AV-LLMs face prohibitive computational overhead from massive audio and video tokens. Existing unimodal token reduction methods can't leverage audio-visual cross-modal synergies, and static per-modality budgets are suboptimal due to distinct dynamic information densities.

Method: Introduces Cross-Modal Semantic Sieve (CS2) for early audio-visual interaction, co-attending to joint multimodal stream and reducing tokens from a combined pool rather than fixed per-modality budgets. Uses Synchronization-Augmented RoPE (Sync-RoPE) to preserve temporal relationships for sparsely selected tokens.
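
To make the single-pool idea concrete, the sketch below keeps a shared top-k budget over concatenated audio and video tokens, so the per-modality split falls out of the data. It is a hedged NumPy stand-in: the dot-product saliency and all names are assumptions, not the actual CS2 module.

```python
import numpy as np

def single_pool_select(video_tokens, audio_tokens, query, keep_ratio=0.1):
    """Keep the most salient tokens from one joint audio-visual pool.

    The budget is shared across modalities, so how many video vs. audio
    tokens survive is decided per input rather than fixed in advance.
    """
    pool = np.concatenate([video_tokens, audio_tokens], axis=0)  # (Nv+Na, D)
    scores = pool @ query                          # stand-in saliency score
    k = max(1, int(keep_ratio * len(pool)))
    keep = np.argsort(scores)[-k:]                 # top-k indices
    is_video = keep < len(video_tokens)            # modality of each kept token
    return pool[keep], is_video

rng = np.random.default_rng(1)
kept, is_video = single_pool_select(rng.normal(size=(200, 64)),
                                    rng.normal(size=(50, 64)),
                                    rng.normal(size=64))
print(kept.shape, int(is_video.sum()), "video tokens kept")
```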

Result: Achieves performance comparable to strong baselines using only 5-20% of original tokens, with 2-3x speedup and memory reduction.

Conclusion: EchoingPixels effectively addresses the token reduction bottleneck in AV-LLMs by enabling cross-modal synergistic token selection and adaptive budget allocation across modalities.

Abstract: Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.

[84] StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology

Jiawen Li, Jiali Hu, Xitong Ling, Yongqiang Lv, Yuxuan Chen, Yizhi Wang, Tian Guan, Yifei Liu, Yonghong He

Main category: cs.CV

TL;DR: StainNet is a specialized pathology foundation model for special stains (like immunohistochemistry) that addresses limitations of existing H&E-only models, trained on 1.4M patches from 20K+ special stain WSIs using self-distillation SSL.

DetailsMotivation: Existing pathology foundation models are primarily pre-trained on H&E-stained images, which limits their effectiveness for clinical applications involving special stains like immunohistochemistry that are frequently used in practice.

Method: StainNet uses vision transformer architecture with self-distillation SSL approach, trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs from the HISTAI database.

Result: StainNet demonstrates strong performance on slide-level liver malignancy classification and two public ROI-level datasets, shows effectiveness in few-ratio learning and retrieval tasks, and compares favorably with larger pathology foundation models.

Conclusion: StainNet addresses the gap in specialized foundation models for special stains, providing a valuable resource for computational pathology applications involving immunohistochemistry and other special staining techniques beyond standard H&E.

Abstract: Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent, larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.

[85] A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images

Yi Liu, Yichi Zhang

Main category: cs.CV

TL;DR: A conditional generative framework using Pix2Pix with filament-aware structural loss generates realistic filament microscopy images from binary masks to address annotation shortage.

DetailsMotivation: Filament segmentation is crucial for biological analysis but suffers from data shortage due to the extreme difficulty of manual pixel-level annotation for dense, geometrically complex filaments like microtubules and actin filaments.

Method: Proposes a conditional generative framework based on Pix2Pix architecture to generate realistic filament microscopy images from binary masks, enhanced with a novel filament-aware structural loss to improve structural similarity.
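
The filament-aware loss is not specified in detail here, but a simple version of the idea is an L1 term that up-weights pixels on the filament mask, as in this minimal sketch (the weighting scheme and the `fil_weight` value are assumptions):

```python
import numpy as np

def filament_aware_l1(pred, target, mask, fil_weight=5.0):
    """L1 loss that counts filament pixels more than background.

    pred, target: (H, W) generated and real image intensities
    mask:         (H, W) binary filament mask that conditions Pix2Pix
    """
    w = np.where(mask > 0, fil_weight, 1.0)
    return float(np.mean(w * np.abs(pred - target)))

pred = np.full((4, 4), 0.5)
target = np.ones((4, 4))
mask = np.eye(4)                              # a toy "filament" on the diagonal
print(filament_aware_l1(pred, target, mask))  # 0.5 * mean weight = 1.0
```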

Result: The approach demonstrates effectiveness and outperforms existing models trained without synthetic data, showing improved filament segmentation performance.

Conclusion: The proposed generative framework with filament-aware structural loss successfully addresses the data shortage problem in filament segmentation by generating high-quality synthetic training data.

Abstract: Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures remains a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve structural similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.

[86] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution

Yi-Cheng Liao, Shyang-En Weng, Yu-Syuan Xu, Chi-Wei Hsiao, Wei-Chen Chiu, Ching-Chun Huang

Main category: cs.CV

TL;DR: HD-CLIP is a plug-and-play module that decomposes low-quality images into semantic and ordinal degradation embeddings to guide diffusion models for real-world image super-resolution, improving detail fidelity and perceptual realism without requiring training.

DetailsMotivation: Real-world image super-resolution faces challenges with diverse, coupled degradations of unknown severity. Existing methods struggle because they assume known degradation severity and use CLIP text encoders that can't capture numerical severity information, limiting generalization to real-world scenarios.

Method: Proposes HD-CLIP (Hierarchical Degradation CLIP) that decomposes low-quality images into: 1) semantic embedding, and 2) ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen degradation levels. Integrates this into diffusion models via classifier-free guidance (CFG) and proposes classifier-free projection guidance (CFPG). Uses semantic cues to guide restoration while degradation cues suppress hallucinations and artifacts.
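
For reference, the sketch below shows plain classifier-free guidance, the mechanism HD-CLIP builds on; the projection step that distinguishes CFPG is not reproduced, and the guidance scale is an arbitrary example value.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale=5.0):
    """Classifier-free guidance: move the denoiser prediction along the
    conditional direction by a scale factor."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u, eps_c = np.zeros(4), np.ones(4)
print(cfg_combine(eps_u, eps_c, scale=2.0))  # [2. 2. 2. 2.]
```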

Result: HD-CLIP significantly improves detail fidelity and perceptual realism across diverse real-world datasets. As a plug-and-play module, it can be seamlessly integrated into various super-resolution frameworks without requiring additional training.

Conclusion: HD-CLIP addresses the limitations of existing real-world super-resolution methods by providing richer degradation guidance through ordinal embeddings, enabling better generalization to diverse real-world degradation scenarios while maintaining plug-and-play compatibility with existing frameworks.

Abstract: Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.

[87] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

Main category: cs.CV

TL;DR: VLMs struggle with visual sequential planning involving errors; CoSPlan benchmark tests error detection and correction; SGI method improves performance by 5.2% through intermediate reasoning steps.

DetailsMotivation: Large-scale Vision-Language Models have impressive reasoning capabilities but remain unexplored in visual sequential planning, especially when dealing with non-optimal (erroneous) steps that need detection and correction.

Method: Proposes CoSPlan benchmark for evaluating VLMs in error-prone sequential planning across 4 domains. Introduces Scene Graph Incremental updates (SGI) training-free method that adds intermediate reasoning steps between initial and goal states to help VLMs reason about sequences.

Result: State-of-the-art VLMs (Intern-VLM, Qwen2) struggle on CoSPlan even with advanced reasoning techniques. SGI method yields average 5.2% performance gain and generalizes to traditional planning tasks like Plan-Bench and VQA.

Conclusion: VLMs need improvement in corrective sequential planning; SGI provides effective training-free enhancement for reasoning about sequences with errors, improving reliability and generalization to traditional planning tasks.

Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.

[88] Topology-Agnostic Animal Motion Generation from Text Prompt

Keyi Chen, Mingze Sun, Zhenyu Liu, Zhangquan Chen, Ruqi Huang

Main category: cs.CV

TL;DR: OmniZoo: A large-scale animal motion dataset and unified generative framework for text-driven motion generation across arbitrary skeletal topologies.

DetailsMotivation: Current motion generation methods rely on fixed skeletal templates, limiting generalization to skeletons with different or perturbed topologies. There's a lack of large-scale heterogeneous animal motion data and unified frameworks for modeling arbitrary skeletal topologies with textual conditions.

Method: Introduces OmniZoo dataset (140 species, 32,979 sequences) and a generalized autoregressive motion generation framework with Topology-aware Skeleton Embedding Module that encodes geometric/structural properties of any skeleton into shared token space for fusion with textual semantics.

Result: Method generates temporally coherent, physically plausible, semantically aligned motions from text prompts and target skeletons, enabling cross-species motion style transfer.

Conclusion: The approach addresses core limitations of current motion generation by providing both large-scale heterogeneous animal motion data and a unified framework capable of handling arbitrary skeletal topologies with textual conditioning.

Abstract: Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.

[89] Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation

Yiheng Lyu, Lian Xu, Mohammed Bennamoun, Farid Boussaid, Coen Arrow, Girish Dwivedi

Main category: cs.CV

TL;DR: TranSamba is a hybrid Transformer-Mamba architecture for weakly supervised volumetric medical segmentation that efficiently captures 3D context using linear-complexity state space models across slices.

DetailsMotivation: Existing weakly supervised segmentation methods for volumetric medical imaging rely on 2D encoders that neglect the inherent 3D nature of the data, limiting their ability to capture volumetric context.

Method: TranSamba combines Vision Transformer blocks with Cross-Plane Mamba blocks that use state space models for efficient information exchange across neighboring slices, enhancing pairwise self-attention within slices for better object localization.
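
The appeal of the Mamba-style blocks is linear-time context propagation across slices. The toy recurrence below captures only that property (a diagonal state-space scan with fixed scalar coefficients); the real Cross-Plane Mamba block uses learned, input-dependent parameters.

```python
import numpy as np

def cross_plane_scan(slice_feats, decay=0.9, gain=0.1):
    """Minimal linear recurrence across neighboring slices.

    slice_feats: (S, D) one pooled feature vector per slice.
    h[t] = decay * h[t-1] + gain * x[t] carries context along the depth
    axis in O(S) time with O(D) state, mimicking an SSM scan.
    """
    h = np.zeros(slice_feats.shape[1])
    out = np.empty_like(slice_feats, dtype=float)
    for t, x in enumerate(slice_feats):
        h = decay * h + gain * x
        out[t] = h
    return out

feats = np.random.default_rng(5).normal(size=(16, 8))
print(cross_plane_scan(feats).shape)  # (16, 8)
```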

Result: TranSamba achieves state-of-the-art performance on three datasets across diverse modalities and pathologies, with linear time complexity scaling with volume depth and constant memory usage for batch processing.

Conclusion: The hybrid Transformer-Mamba architecture effectively addresses volumetric modeling for weakly supervised medical segmentation, offering efficient 3D context capture while maintaining computational efficiency.

Abstract: Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.

[90] mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar

Tarik Reza Toha, Shao-Jung Lu, Shahriar Nirjon

Main category: cs.CV

TL;DR: mmCounter uses mmWave radar to count static people in dense indoor spaces by detecting ultra-low frequency breathing and micro-movements, achieving 87% F1 score in familiar environments.

DetailsMotivation: mmWave radars struggle with detecting static people in dense groups due to spatial resolution limitations and reliance on movement. Existing breathing rate estimation methods assume known number of people, but mmCounter addresses the counting problem directly.

Method: Extracts ultra-low frequency (<1 Hz) signals from breathing and micro-scale body movements, uses novel multi-stage signal processing pipeline to differentiate these subtle signals from background noise and nearby static objects, maps sources to individual people for counting.
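
A first step in such a pipeline could be measuring how much of a range bin's spectral energy sits below 1 Hz, where breathing lives. The sketch below does this with a plain FFT band mask; it is an assumed simplification of the paper's multi-stage processing, and all parameter values are illustrative.

```python
import numpy as np

def sub_hz_energy(signal, fs, f_lo=0.05, f_hi=1.0):
    """Fraction of signal energy in the ultra-low-frequency band."""
    spec = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[band].sum() / (spec.sum() + 1e-12)

# Toy check: a 0.25 Hz "breathing" tone concentrates energy below 1 Hz.
fs = 20.0
t = np.arange(0, 60, 1 / fs)
breath = np.sin(2 * np.pi * 0.25 * t)
breath += 0.1 * np.random.default_rng(2).normal(size=t.size)
print(round(sub_hz_energy(breath, fs), 3))  # close to 1.0
```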

Result: 87% average F1 score and 0.6 mean absolute error in familiar environments; 60% average F1 score and 1.1 mean absolute error in untested environments. Can count up to 7 individuals in 3 square meter space with minimal spacing.

Conclusion: mmCounter successfully addresses the challenging problem of counting static people in dense indoor environments using mmWave radar, overcoming limitations of existing methods by focusing on ultra-low frequency physiological signals rather than movement.

Abstract: mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.

[91] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang

Main category: cs.CV

TL;DR: The paper introduces a Video Toolkit and Spatiotemporal Reasoning Framework (STAR) to enhance MLLMs’ spatiotemporal reasoning for VideoQA tasks, achieving significant performance gains on benchmarks.

DetailsMotivation: Existing Multimodal Large Language Models struggle with simultaneously modeling spatial relationships within video frames and understanding causal dynamics of temporal evolution on complex VideoQA tasks.

Method: Equip MLLMs with a comprehensive Video Toolkit and propose STAR framework that strategically schedules temporal and spatial tools to progressively localize key areas in videos.

Result: STAR framework enhances GPT-4o using lightweight tools, achieving 8.2% gain on VideoMME and 4.6% on LongVideoBench benchmarks.

Conclusion: The proposed Video Toolkit and STAR framework represent an important step toward building autonomous and intelligent video analysis assistants.

Abstract: The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and to balance the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework take an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.

[92] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim

Main category: cs.CV

TL;DR: Visual Funnel is a training-free two-step approach that addresses “Contextual Blindness” in MLLMs by creating hierarchical context portfolios through attention entropy-based crop sizing and center refinement.

DetailsMotivation: MLLMs struggle with fine-grained visual perception despite having reasoning capabilities. Existing crop-based methods introduce "Contextual Blindness" - a structural disconnect between high-fidelity details from crops and global context from original images, limiting performance in precision-demanding tasks.

Method: Visual Funnel uses a two-step approach: 1) Contextual Anchoring identifies regions of interest in a single forward pass, 2) Entropy-Scaled Portfolio constructs hierarchical context by dynamically determining crop sizes based on attention entropy and refining crop centers to preserve focal detail to broader surroundings.
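
The entropy-to-crop-size mapping can be illustrated in a few lines: peaked attention (low entropy) yields a tight crop, diffuse attention a wide one. This is a hedged sketch; the linear mapping, the argmax center, and the size bounds are assumptions rather than the paper's exact rule.

```python
import numpy as np

def entropy_scaled_crop(attn, min_size=0.2, max_size=0.8):
    """Map attention entropy to a crop size (fraction of image side).

    attn: (H, W) non-negative attention map over the image.
    Returns (cy, cx, size): crop center at the attention peak, size
    interpolated between bounds by normalized entropy.
    """
    p = attn.ravel() / (attn.sum() + 1e-12)
    entropy = -(p * np.log(p + 1e-12)).sum()
    frac = entropy / np.log(p.size)          # normalize entropy to [0, 1]
    size = min_size + frac * (max_size - min_size)
    cy, cx = np.unravel_index(attn.argmax(), attn.shape)
    return int(cy), int(cx), float(size)

attn = np.zeros((32, 32))
attn[10, 12] = 1.0                           # perfectly peaked attention
print(entropy_scaled_crop(attn))             # (10, 12, ~0.2): tight crop
```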

Result: Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Adding more unstructured crops provides limited or detrimental benefits, confirming hierarchical structure is key to resolving Contextual Blindness.

Conclusion: The limitation in MLLM visual perception stems from lack of “Structural Diversity” rather than information quantity. Visual Funnel’s hierarchical portfolio approach effectively resolves Contextual Blindness without requiring training, enabling better fine-grained visual understanding.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: “Contextual Blindness”. This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information ‘Quantity’, but from a lack of ‘Structural Diversity’ in the model’s input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.

[93] Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos

Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

Main category: cs.CV

TL;DR: P2S is a novel training-free framework for zero-shot long video moment retrieval that addresses inefficient search and costly refinement phases through adaptive span generation and query decomposition, outperforming supervised methods.

DetailsMotivation: Existing approaches to long video moment retrieval face limitations: supervised methods have poor scalability and generalization despite high resource consumption, while zero-shot methods suffer from candidate explosion in search phase and require expensive VLM verification in refine phase.

Method: P2S introduces two key innovations: (1) Adaptive Span Generator to prevent candidate explosion in the search phase, and (2) Query Decomposition to refine candidates without relying on high-cost VLM verification, making it a completely training-free framework.
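
A minimal version of point-to-span reasoning is to grow contiguous spans around high-scoring frames instead of enumerating every candidate window. The sketch below assumes per-frame relevance scores already exist; the threshold and the gap-merging rule are illustrative, not the paper's generator.

```python
import numpy as np

def points_to_spans(frame_scores, thresh=0.5, min_gap=2):
    """Group above-threshold frames into candidate (start, end) spans,
    merging spans separated by fewer than `min_gap` frames."""
    idx = np.flatnonzero(np.asarray(frame_scores) >= thresh)
    if idx.size == 0:
        return []
    spans, start, prev = [], idx[0], idx[0]
    for i in idx[1:]:
        if i - prev > min_gap:        # gap too large: close current span
            spans.append((int(start), int(prev)))
            start = i
        prev = i
    spans.append((int(start), int(prev)))
    return spans

print(points_to_spans([0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.1]))
# [(1, 2), (5, 6)]
```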

Result: P2S outperforms supervised state-of-the-art methods by significant margins (e.g., +3.7% on R5@0.1 on MAD dataset) and is the first zero-shot framework capable of temporal grounding in hour-long videos.

Conclusion: P2S successfully addresses the core challenges of zero-shot long video moment retrieval by eliminating both inefficient search and costly refinement, establishing a new paradigm for efficient and effective video moment retrieval without task-specific training.

Abstract: Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a ‘Search-then-Refine’ approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a ‘search’ phase candidate explosion, and (2) the ‘refine’ phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose \textbf{P}oint-\textbf{to}-\textbf{S}pan (P2S), a novel training-free framework to overcome this challenge of inefficient ‘search’ and costly ‘refine’ phases. P2S overcomes these challenges with two key innovations: an ‘Adaptive Span Generator’ to prevent the search phase candidate explosion, and ‘Query Decomposition’ to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).

[94] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views

Zhankuo Xu, Chaoran Feng, Yingtao Li, Jianbin Zhao, Jiashu Yang, Wangbo Yu, Li Yuan, Yonghong Tian

Main category: cs.CV

TL;DR: CoherentGS addresses 3D reconstruction from sparse and blurry images by combining deblurring and diffusion priors to break the vicious cycle between sparse views and motion blur.

DetailsMotivation: 3D Gaussian Splatting (3DGS) requires dense, high-quality images but real-world data is often sparse and motion-blurred. These issues create a vicious cycle where sparse views can't resolve motion blur, and motion blur erases details needed for view alignment, leading to catastrophic reconstruction failures.

Method: CoherentGS uses a dual-prior strategy combining: 1) a specialized deblurring network for sharp detail restoration and photometric guidance, and 2) a diffusion model for geometric priors to fill unobserved regions. Key techniques include consistency-guided camera exploration and depth regularization for geometric plausibility.

Result: The method is evaluated on synthetic and real-world scenes with as few as 3, 6, and 9 input views. CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for sparse and blurry 3D reconstruction.

Conclusion: CoherentGS successfully breaks the vicious cycle between sparse views and motion blur through its dual-prior approach, enabling high-fidelity 3D reconstruction from challenging real-world imagery where traditional 3DGS fails.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views lack the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.

[95] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

Main category: cs.CV

TL;DR: First systematic study applying reinforcement learning to text-to-3D autoregressive generation, addressing challenges of 3D spatial complexity through reward design, RL algorithms, new benchmarks, and hierarchical optimization.

DetailsMotivation: While RL has been effective for 2D image generation and other domains, applying it to 3D generation remains unexplored due to higher spatial complexity requiring globally consistent geometry and fine-grained local textures, making 3D generation sensitive to reward designs and RL algorithms.

Method: Systematic study across four dimensions: (1) Reward design evaluation focusing on human preference alignment and multi-modal models, (2) RL algorithm study of GRPO variants with token-level optimization, (3) Introduction of MME-3DR benchmark to measure implicit reasoning, (4) Proposed Hi-GRPO for hierarchical global-to-local optimization with reward ensembles.
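
At the heart of GRPO variants is a critic-free advantage computed within each group of rollouts for the same prompt. The sketch below shows that group-relative normalization; applying it per token, as the study's token-level optimization does, and the reward values themselves are beyond this illustration.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each generation's reward against
    the mean and std of its own group (no learned value model).

    rewards: (G,) rewards for G generations from one prompt.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]).round(2))
# zero-mean, unit-scale advantages within the group
```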

Result: Developed AR3D-R1, the first RL-enhanced text-to-3D model that excels from coarse shape to texture refinement, demonstrating the effectiveness of RL-driven reasoning for 3D generation.

Conclusion: This pioneering study provides insights into RL-driven reasoning for 3D generation, showing that systematic approaches to reward design, RL algorithms, benchmarking, and hierarchical optimization can successfully address the unique challenges of 3D spatial complexity.

Abstract: Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, which excels from coarse shape generation to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.

[96] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds

Jingyun Fu, Zhiyu Xiang, Na Zhao

Main category: cs.CV

TL;DR: First joint scene flow learning framework for 4D radar and LiDAR fusion, addressing radar noise and sparsity challenges with novel fusion module and loss functions.

DetailsMotivation: Radar complements LiDAR as cheaper, weather-robust, and velocity-aware, but radar-LiDAR fusion for scene flow estimation is unexplored due to radar noise/sparsity and lack of datasets.

Method: Proposes RaLiFlow framework with: 1) Radar-LiDAR dataset construction with preprocessing for denoising and reliable flow labels, 2) Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module integrating radar dynamic cues via cross-attention, 3) Loss functions mitigating unreliable radar data and enhancing instance-level consistency.

Result: Outperforms existing LiDAR-based and radar-based single-modal methods by significant margin on repurposed scene flow dataset.

Conclusion: First successful radar-LiDAR fusion for scene flow estimation, demonstrating radar’s complementary value despite challenges, with proposed framework effectively handling radar limitations.

Abstract: Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.

[97] Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching

Alberto Rota, Elena De Momi

Main category: cs.CV

TL;DR: A self-supervised deep learning pipeline for establishing feature correspondences in endoscopic images using novel-view synthesis and contrastive learning to improve matching precision for surgical applications.

DetailsMotivation: Surgical endoscopy requires precise pixel-level correspondences for 3D reconstruction and camera tracking, but faces challenges like weak perspective cues, non-Lambertian reflections, and deformable anatomy that degrade conventional computer vision techniques. Deep learning models trained on natural scenes need domain-specific adaptation for surgical images.

Method: Proposes a self-supervised pipeline using novel-view synthesis to generate ground-truth correspondences, then employs contrastive learning with triplet mining. Augments DINOv2 backbone with an additional Transformer layer optimized to produce embeddings for direct matching via cosine similarity thresholding.
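
Matching by cosine similarity thresholding is simple enough to show directly. The sketch below adds a mutual nearest-neighbor check, a common companion heuristic that the summary does not confirm; descriptor shapes and the threshold are assumptions.

```python
import numpy as np

def match_by_cosine(desc_a, desc_b, thresh=0.8):
    """Match two sets of per-keypoint embeddings.

    desc_a: (N, D), desc_b: (M, D). Returns (i, j) pairs that are mutual
    best matches and clear the cosine-similarity threshold.
    """
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + 1e-8)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                       # (N, M) cosine similarity matrix
    best_b = sim.argmax(axis=1)         # best b for each a
    best_a = sim.argmax(axis=0)         # best a for each b
    return [(i, int(j)) for i, j in enumerate(best_b)
            if best_a[j] == i and sim[i, j] >= thresh]
```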

Result: The pipeline outperforms state-of-the-art methods on SCARED datasets, achieving improved matching precision and lower epipolar error compared to related work.

Conclusion: The framework enables more accurate high-level computer vision applications in surgical endoscopy by providing better feature correspondences through domain-adapted deep learning with self-supervised optimization.

Abstract: Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED datasets, with improved matching precision and lower epipolar error compared to related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.

[98] Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou

Main category: cs.CV

TL;DR: The paper introduces FROW benchmark for evaluating fine-grained recognition in LVLMs and proposes optimization strategies using mosaic and open-world data, showing significant accuracy improvements.

DetailsMotivation: Existing LVLM benchmarks focus on reasoning tasks but neglect fine-grained recognition, which is crucial for practical applications. There's a need for comprehensive evaluation of LVLMs' detailed recognition capabilities.

Method: 1) Created FROW benchmark using GPT-4o for fine-grained recognition evaluation; 2) Proposed optimization strategy with mosaic data (combining multiple short-answer responses) and open-world data (real-world Q&A from GPT-4o); 3) Incorporated fine-grained data into pre-training phase.

Result: Mosaic data improved category recognition accuracy by 1%; Open-world data boosted FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%; Fine-grained data in pre-training improved category recognition accuracy by up to 10%.

Conclusion: The FROW benchmark addresses the gap in fine-grained recognition evaluation for LVLMs, and the proposed data optimization strategies significantly enhance model performance in detailed recognition tasks.

Abstract: Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. Building on this benchmark, we propose a novel optimization strategy from two perspectives, data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.

[99] Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method

Ge Zhang, Chunyang Wang, Bo Xiao, Xuelian Liu, Bin Liu

Main category: cs.CV

TL;DR: Proposes adaptive dual-weight gravitational-based point cloud denoising method that achieves high accuracy, strong edge preservation, and real-time performance through spatial partitioning, adaptive noise removal, and gravitational scoring.

DetailsMotivation: LiDAR point clouds often contain noise that degrades object detection accuracy. Existing methods trade off between computational efficiency and denoising quality, failing to simultaneously achieve high accuracy, edge preservation, and real-time performance.

Method: 1) Octree spatial partitioning for parallel acceleration; 2) Adaptive voxel-based occupancy statistics and kNN density estimation to remove isolated/low-density noise; 3) Gravitational scoring function combining density weights with adaptive distance weights to distinguish noise from object points.
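
The final scoring stage can be sketched as a per-point score that multiplies a kNN density weight by a distance-decay weight, so sparse, far-from-everything points score low. Everything below (the exact weight forms, the `sigma` choice, brute-force kNN in place of the octree) is an assumed simplification.

```python
import numpy as np

def gravitational_scores(points, k=8):
    """Dual-weight score per point: kNN density times distance decay.

    points: (N, 3). Low-scoring points are treated as noise.
    Brute-force O(N^2) distances stand in for the octree acceleration.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn = np.sort(dists, axis=1)[:, :k]            # (N, k) nearest distances
    density_w = 1.0 / (knn.mean(axis=1) + 1e-8)    # density weight
    sigma = np.median(knn)                         # adaptive distance scale
    dist_w = np.exp(-(knn ** 2) / (2 * sigma ** 2)).sum(axis=1)
    return density_w * dist_w

rng = np.random.default_rng(3)
cloud = np.vstack([rng.normal(0, 0.05, size=(50, 3)),   # dense object
                   [[2.0, 2.0, 2.0]]])                  # isolated noise point
scores = gravitational_scores(cloud)
print(bool(scores[-1] < np.median(scores)))             # True: outlier scores low
```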

Result: Experiments on Stanford 3D, CADC, and in-house FMCW LiDAR datasets show consistent improvements in F1, PSNR, and Chamfer Distance across various noise conditions while reducing single-frame processing time.

Conclusion: The proposed method achieves high accuracy, robustness, and real-time performance in multi-noise scenarios, addressing the trade-off between denoising quality and computational efficiency in existing approaches.

Abstract: High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points. Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.

[100] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Qiyue Sun, Tailin Chen, Yinghui Zhang, Yuchen Zhang, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

Main category: cs.CV

TL;DR: MultiHateLoc: First weakly-supervised framework for temporal localization of multimodal hate speech in videos, using only video-level labels to identify when hateful segments occur across visual, acoustic, and textual streams.

DetailsMotivation: The rapid growth of video content on platforms like TikTok and YouTube has intensified multimodal hate speech spread, where harmful cues emerge subtly and asynchronously across different modalities. Existing research focuses only on video-level classification, leaving the crucial task of temporal localization largely unaddressed, especially under weak supervision with only video-level labels.

Method: MultiHateLoc incorporates: (1) modality-aware temporal encoders to model heterogeneous sequential patterns with tailored text preprocessing; (2) dynamic cross-modal fusion to adaptively emphasize the most informative modality at each moment, plus cross-modal contrastive alignment for feature consistency; (3) modality-aware MIL objective to identify discriminative segments under video-level supervision.
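
The MIL idea under video-level supervision can be shown with simple top-k pooling: the video score is the mean of its highest segment scores, so a video-level loss still shapes segment predictions. This sketch omits the modality-aware weighting the paper adds.

```python
import numpy as np

def video_level_mil_score(segment_scores, k=3):
    """Pool per-segment hate scores into one video-level score via the
    mean of the top-k segments; peaks later localize hateful moments."""
    s = np.sort(np.asarray(segment_scores, dtype=float))[::-1]
    return float(s[:k].mean())

# A binary cross-entropy between this pooled score and the video label
# is a typical MIL training signal under weak supervision.
print(video_level_mil_score([0.1, 0.05, 0.9, 0.8, 0.2], k=2))  # ~0.85
```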

Result: Despite using only coarse video-level labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip datasets show state-of-the-art performance in the multimodal hate localization task.

Conclusion: MultiHateLoc successfully addresses the challenging problem of weakly-supervised multimodal hate speech temporal localization, providing a framework that can identify when hateful segments occur across different modalities using only video-level supervision, with superior performance on benchmark datasets.

Abstract: The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.

[101] Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Cheng Min, Yu Hu

Main category: cs.CV

TL;DR: The paper introduces WildRoad dataset and MaGRoad model for off-road road network extraction, addressing domain gaps and topological errors in existing methods.

DetailsMotivation: Current deep learning models for vectorized road extraction work well in urban settings but fail in off-road environments due to domain gaps, lack of large-scale datasets, and structural weaknesses in node-centric approaches that are fragile to occlusions and ambiguous junctions.

Method: Two complementary approaches: 1) Release WildRoad - a global off-road road network dataset created with a dedicated interactive annotation tool; 2) Introduce MaGRoad - a path-centric framework that aggregates multi-scale visual evidence along candidate paths to robustly infer connectivity.
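
One plausible reading of path-centric connectivity scoring, sketched in NumPy: sample evidence along a candidate path and score it by its weakest link, so a single occluded gap cannot be averaged away. The straight-line path, the min-pooling rule, and the `path_connectivity_score` helper are illustrative assumptions; MaGRoad itself aggregates multi-scale evidence along geodesic candidate paths.

```python
import numpy as np

def path_connectivity_score(evidence: np.ndarray, p0, p1, n_samples: int = 64) -> float:
    """Score a candidate path by aggregating pixel evidence along it.

    evidence: (H, W) map of per-pixel road likelihood (e.g. a predicted mask).
    p0, p1:   (row, col) endpoints of the candidate path.
    """
    rows = np.linspace(p0[0], p1[0], n_samples)
    cols = np.linspace(p0[1], p1[1], n_samples)
    samples = evidence[rows.round().astype(int), cols.round().astype(int)]
    return float(samples.min())  # weakest-link aggregation

# Toy usage: a vertical road of high evidence through a noisy map.
ev = np.random.rand(64, 64) * 0.2
ev[:, 30] = 0.9
print(path_connectivity_score(ev, (0, 30), (63, 30)))   # high score: connected
print(path_connectivity_score(ev, (0, 5), (63, 60)))    # low score: crosses background
```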

Result: MaGRoad achieves state-of-the-art performance on the challenging WildRoad benchmark while generalizing well to urban datasets. The streamlined pipeline yields roughly 2.5x faster inference, improving practical applicability.

Conclusion: The WildRoad dataset and MaGRoad’s path-centric paradigm provide a stronger foundation for mapping roads in off-road environments, addressing limitations of existing urban-focused approaches.

Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.

[102] TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning

Phu Pham, Damon Conover, Aniket Bera

Main category: cs.CV

TL;DR: TransLocNet is a cross-modal attention framework for aerial-ground localization that fuses LiDAR geometry with aerial imagery using bidirectional attention and contrastive learning, achieving sub-meter, sub-degree accuracy.

DetailsMotivation: Aerial-ground localization is challenging due to large viewpoint and modality gaps between ground-level LiDAR scans and overhead aerial imagery, requiring robust cross-modal alignment techniques.

Method: Projects LiDAR scans to bird’s-eye-view, aligns with aerial features via bidirectional attention, uses likelihood map decoder for spatial probability distributions, and employs contrastive learning for shared embedding space.
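
The contrastive module described here is a standard symmetric InfoNCE over paired embeddings. A minimal PyTorch sketch, assuming the BEV and aerial branches already output fixed-size vectors (the temperature and projection details are assumptions):

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(bev_emb: torch.Tensor, aerial_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE over a batch of paired LiDAR-BEV / aerial embeddings.

    bev_emb, aerial_emb: (B, D) embeddings; pair i is the positive match,
    all other batch entries act as negatives.
    """
    bev = F.normalize(bev_emb, dim=1)
    aer = F.normalize(aerial_emb, dim=1)
    logits = bev @ aer.t() / tau                    # (B, B) similarity matrix
    targets = torch.arange(bev.shape[0], device=bev.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_infonce(torch.randn(8, 128), torch.randn(8, 128))
```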

Result: Outperforms state-of-the-art baselines on CARLA and KITTI datasets, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy.

Conclusion: TransLocNet provides robust and generalizable aerial-ground localization that works effectively in both synthetic and real-world environments.

Abstract: Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird’s-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.

[103] Neural Collapse in Test-Time Adaptation

Xiao Chen, Zhongjing Du, Jiazhen Huang, Xu Jiang, Li Lu, Jingyan Jiang, Zhi Wang

Main category: cs.CV

TL;DR: The paper proposes NCTTA, a novel test-time adaptation method that addresses performance degradation under domain shifts by correcting sample-wise misalignment between feature embeddings and classifier weights, using hybrid targets to mitigate unreliable pseudo-labels.

DetailsMotivation: Existing Test-Time Adaptation (TTA) methods lack theoretical insights into why models degrade under domain shifts. The paper aims to understand the fundamental causes of performance degradation and develop a principled solution based on Neural Collapse theory.

Method: The authors extend Neural Collapse to sample-wise level, discovering Sample-wise Alignment Collapse (NC3+). They propose NCTTA, a feature-classifier alignment method with hybrid targets that blends geometric proximity (based on NC3+ insights) with predictive confidence to address unreliable pseudo-labels during adaptation.
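
A minimal sketch of the hybrid-target idea, assuming a simple convex blend of a geometric term (cosine alignment of features with classifier weights, per the NC3+ insight) and a confidence term (softmax of the logits); the paper's exact blending rule and loss may differ.

```python
import torch
import torch.nn.functional as F

def hybrid_targets(features, classifier_weight, logits, alpha: float = 0.5):
    """Blend geometric proximity with predictive confidence into soft targets.

    features:          (B, D) test-time feature embeddings.
    classifier_weight: (C, D) rows are per-class classifier weights.
    logits:            (B, C) model predictions for the same batch.
    alpha weighs the geometric term (an illustrative hyperparameter).
    """
    # Geometric term: features of a class should align with that class's weight vector.
    geo = F.softmax(F.normalize(features, dim=1)
                    @ F.normalize(classifier_weight, dim=1).t(), dim=1)
    conf = F.softmax(logits, dim=1)                 # predictive-confidence term
    return alpha * geo + (1 - alpha) * conf

def alignment_loss(features, classifier_weight, logits):
    """Cross-entropy pulling each feature toward its hybrid-target class weights."""
    targets = hybrid_targets(features, classifier_weight, logits).detach()
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```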

Result: Extensive experiments show NCTTA significantly outperforms existing methods, achieving 14.52% improvement over Tent on ImageNet-C, demonstrating enhanced robustness to domain shifts.

Conclusion: Performance degradation in TTA stems from sample-wise misalignment between features and classifier weights under domain shifts. NCTTA effectively addresses this by realigning features with classifier weights using hybrid targets, providing a theoretically-grounded solution for test-time adaptation.

Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample’s feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment during adaptation, which is exacerbated under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.

[104] An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time

Stylianos Kandylakis, Christos Orfanopoulos, Georgios Siolas, Panayiotis Tsanakas

Main category: cs.CV

TL;DR: Real-time mobile system for physiotherapy exercise analysis using pose estimation, angle features, and dynamic programming for movement recognition and error detection.

DetailsMotivation: Enable remote physiotherapy supervision and m-health applications by providing real-time, client-side exercise analysis on mobile devices without server dependency.

Method: Uses pose-estimation neural network to extract body keypoints, converts them to trigonometric angle features, classifies poses with lightweight models, and employs modified Levenshtein distance dynamic programming for full movement recognition and deviation detection.
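
The two core primitives here, angle features from keypoints and Levenshtein-style sequence matching, are simple enough to sketch directly. The helper names and costs below are illustrative; the paper's modified distance may weight operations differently.

```python
import math

def joint_angle(a, b, c):
    """Angle at keypoint b (degrees) formed by segments b->a and b->c."""
    ang = math.degrees(math.atan2(c[1] - b[1], c[0] - b[0])
                       - math.atan2(a[1] - b[1], a[0] - b[0]))
    return abs(ang) if abs(ang) <= 180 else 360 - abs(ang)

def sequence_distance(observed, prescribed, sub_cost=1.0, gap_cost=1.0):
    """Levenshtein-style DP over per-frame pose labels.

    A low distance means the performed movement matches the prescribed pose
    sequence; a backtrace (omitted here) would localise the deviating frames.
    """
    n, m = len(observed), len(prescribed)
    d = [[i * gap_cost if j == 0 else j * gap_cost if i == 0 else 0.0
          for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if observed[i - 1] == prescribed[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + match,   # substitution / match
                          d[i - 1][j] + gap_cost,    # extra observed pose
                          d[i][j - 1] + gap_cost)    # missed prescribed pose
    return d[n][m]

print(joint_angle((0, 0), (1, 0), (1, 1)))             # ~90 degrees at the joint
print(sequence_distance(list("AABBC"), list("ABBC")))  # 1.0: one extra 'A' frame
```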

Result: System operates entirely on client side with real-time performance, effectively identifies and evaluates physiotherapy exercises while detecting deviations from prescribed patterns.

Conclusion: The framework demonstrates effectiveness for remote physiotherapy supervision and m-health applications, offering scalable real-time exercise analysis on mobile devices.

Abstract: This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.

[105] Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment

Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Xinlong Pan, Haipeng Wang, Junni Zou, Hongkai Xiong

Main category: cs.CV

TL;DR: Proposes a unified-transform framework for learned video compression with dual-domain progressive temporal alignment and quality-conditioned mixture-of-expert (QCMoE) to eliminate error propagation while maintaining competitive rate-distortion performance.

DetailsMotivation: Existing learned video compression frameworks face a dilemma: separate-transform frameworks cause error propagation, while unified-transform frameworks have inferior motion estimation/compensation in shared latent domains. Need to achieve error-propagation-free streaming with quality consistency.

Method: 1) Dual-domain progressive temporal alignment: coarse pixel-domain alignment (simple motion) + refined latent-domain alignment using Flow-Guided Deformable Transformer (FGDT) for long-term motion refinement (complex motion). 2) Quality-conditioned mixture-of-expert (QCMoE) module for continuous bit-rate adaptation that dynamically assigns experts to adjust quantization steps per pixel based on target quality and content.

Result: Achieves competitive rate-distortion performance compared with state-of-the-art methods while successfully eliminating error propagation. Enables continuous and consistent rate control with appealing R-D performance.

Conclusion: The proposed unified-transform framework with dual-domain progressive temporal alignment and QCMoE effectively addresses the error propagation problem in learned video compression while maintaining competitive compression performance and enabling quality-consistent streaming.

Abstract: Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-expert (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with the state-of-the-arts, while successfully eliminating error propagation.

[106] Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network

Khurram Ashfaq, Muhammad Tariq Mahmood

Main category: cs.CV

TL;DR: A hybrid Shape-from-Focus method combining traditional DDL kernels for focus volume computation with lightweight GRU-based iterative refinement and learned upsampling, achieving state-of-the-art depth estimation accuracy.

DetailsMotivation: Existing deep learning SFF methods use heavy feature encoders and simple one-step aggregation that introduces artifacts and amplifies noise in depth maps. There's a need for more robust and accurate depth estimation.

Method: Hybrid framework: 1) Compute multi-scale focus volumes using handcrafted Directional Dilated Laplacian (DDL) kernels to capture long-range directional focus variations. 2) Feed focus volumes into lightweight multi-scale GRU-based depth extraction module that iteratively refines initial depth estimate at lower resolution. 3) Use learned convex upsampling module within recurrent network to reconstruct high-resolution depth maps while preserving details.
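
Step 1 is easy to illustrate: a directional dilated Laplacian is a 1-D Laplacian [-1, 2, -1] stretched along one direction, and the focus volume sums absolute responses over directions and dilations. The kernel set, dilations, and normalisation below are illustrative assumptions, not the paper's exact DDL design.

```python
import numpy as np
from scipy.ndimage import convolve

def directional_dilated_laplacian(direction: str, dilation: int) -> np.ndarray:
    """A 1-D Laplacian [-1, 2, -1] stretched by `dilation` along one direction."""
    size = 2 * dilation + 1
    k = np.zeros((size, size))
    c = dilation
    if direction == "h":
        k[c, 0], k[c, c], k[c, -1] = -1.0, 2.0, -1.0
    elif direction == "v":
        k[0, c], k[c, c], k[-1, c] = -1.0, 2.0, -1.0
    elif direction == "d":   # main diagonal
        k[0, 0], k[c, c], k[-1, -1] = -1.0, 2.0, -1.0
    else:                    # anti-diagonal
        k[0, -1], k[c, c], k[-1, 0] = -1.0, 2.0, -1.0
    return k

def focus_volume(stack: np.ndarray, dilations=(1, 2, 4)) -> np.ndarray:
    """Sum absolute DDL responses over directions and scales: (N, H, W) -> (N, H, W)."""
    fv = np.zeros_like(stack, dtype=float)
    for d in dilations:
        for direction in ("h", "v", "d", "a"):
            k = directional_dilated_laplacian(direction, d)
            for i in range(stack.shape[0]):
                fv[i] += np.abs(convolve(stack[i].astype(float), k, mode="nearest"))
    return fv

# A coarse depth map is the slice of maximal focus per pixel; the paper's GRU
# module then refines this iteratively instead of taking a one-step argmax.
stack = np.random.rand(5, 32, 32)
depth = focus_volume(stack).argmax(axis=0)
```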

Result: Outperforms state-of-the-art deep learning and traditional methods on both synthetic and real-world datasets, achieving superior accuracy and generalization across diverse focal conditions.

Conclusion: The proposed hybrid approach combining traditional focus volume computation with lightweight deep learning refinement effectively addresses artifacts and noise issues in SFF, delivering robust and accurate depth estimation.

Abstract: Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.

[107] 3D Blood Pulsation Maps

Maurice Rohr, Tobias Reinhardt, Tizian Dege, Justus Thies, Christoph Hoog Antink

Main category: cs.CV

TL;DR: Pulse3DFace is the first dataset for 3D blood pulsation map estimation, enabling synthetic video generation for remote pulse estimation validation and multi-view illumination mitigation research.

DetailsMotivation: To address the lack of datasets for 3D blood pulsation analysis, which is needed to improve remote photoplethysmography imaging methods and develop multi-view approaches for illumination effect mitigation.

Method: Collected raw videos from 15 subjects at 30 Hz using RGB cameras from 23 viewpoints, captured blood pulse reference measurements, generated facial 3D scans using monocular structure-from-motion, and processed 3D pulsation maps compatible with FLAME 3D head model texture space.

Result: Created the Pulse3DFace dataset containing multi-view videos, pulse references, 3D scans, and processed 3D pulsation maps with signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. Comprehensive evaluation shows dataset captures physiologically meaningful features in facial and neck regions.

Conclusion: Pulse3DFace is a pioneering dataset that enables research in 3D blood pulsation analysis, synthetic video generation for remote pulse estimation validation, and multi-view illumination mitigation approaches, with demonstrated capability to capture physiologically relevant facial blood flow patterns.

Abstract: We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset’s illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.

[108] Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano

Main category: cs.CV

TL;DR: TaP (Take a Peek) enhances few-shot semantic segmentation by using Low-Rank Adaptation (LoRA) to fine-tune encoders on support sets, improving adaptability to novel classes with minimal computational overhead.

DetailsMotivation: Prior FSS research focused mainly on decoders, but encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. The paper addresses this critical limitation in encoder generalization to novel classes.

Method: TaP uses Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. The method is model-agnostic and can be integrated into existing FSS pipelines.
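
A minimal sketch of LoRA on one linear layer, showing the mechanism TaP relies on: the pretrained weights stay frozen and only a low-rank update sees gradients from the support set. The rank, scaling, and placement are illustrative; the paper's configuration may differ.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep pretrained weights intact
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Few-shot adaptation: only the LoRA parameters are optimised on the support set.
layer = LoRALinear(nn.Linear(256, 256))
support_feats, target = torch.randn(5, 256), torch.randn(5, 256)
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-3)
loss = (layer(support_feats) - target).pow(2).mean()
loss.backward()
opt.step()
```

Zero-initialising `B` means the adapted layer starts as an exact copy of the pretrained one, which is what keeps catastrophic forgetting in check during the brief support-set fine-tuning.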

Result: Extensive experiments across COCO 20^i, Pascal 5^i, and cross-domain datasets (DeepGlobe, ISIC, Chest X-ray) show consistent performance improvements across diverse models and shot settings. Significant gains in complex multi-class scenarios demonstrate practical effectiveness.

Conclusion: TaP addresses the critical limitation of encoder generalization in FSS, paving the way for more robust, efficient, and generalizable segmentation systems. Strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency.

Abstract: Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder’s limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS, the encoder’s generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.

Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Main category: cs.CV

TL;DR: Blink is a dynamic visual token resolution framework for MLLMs that mimics human visual scanning by selectively allocating more computation to salient regions through token super-resolution and adaptive token dropping.

DetailsMotivation: Current MLLMs have limited visual perception compared to humans who efficiently scan scenes by focusing on salient regions. The authors investigate whether MLLMs exhibit similar behavior and find they naturally attend to different visual regions across layers, suggesting selective computation allocation could enhance perception.

Method: Blink framework with two modules: 1) Saliency-guided scanning that estimates token importance using attention maps, and 2) Dynamic token resolution that extends important tokens via plug-and-play TokenSR module, then drops extended tokens when they lose focus in subsequent layers.
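
A minimal sketch of module (1), assuming saliency is measured as the attention a visual token receives, averaged over heads and query positions; Blink's exact saliency rule and the TokenSR extension step are not reproduced here.

```python
import torch

def saliency_from_attention(attn: torch.Tensor, vis_idx: torch.Tensor) -> torch.Tensor:
    """Estimate visual-token saliency from one layer's attention map.

    attn:    (heads, L, L) post-softmax attention of a single layer.
    vis_idx: indices of the visual tokens within the token sequence.
    """
    received = attn.mean(dim=0).mean(dim=0)      # (L,) average incoming attention
    return received[vis_idx]

def select_salient(tokens: torch.Tensor, saliency: torch.Tensor, keep_ratio: float = 0.25):
    """Keep the top-k salient visual tokens; the rest stay at base resolution."""
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = torch.topk(saliency, k).indices
    return tokens[idx], idx

# Toy usage: 100-token sequence whose last 60 tokens are visual.
attn = torch.softmax(torch.randn(8, 100, 100), dim=-1)
vis_idx = torch.arange(40, 100)
sal = saliency_from_attention(attn, vis_idx)
salient_tokens, idx = select_salient(torch.randn(60, 512), sal)
```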

Result: Extensive experiments validate Blink’s effectiveness in enhancing visual perception and multimodal understanding, demonstrating improved performance through dynamic computation allocation to salient visual regions.

Conclusion: Blink successfully emulates human-inspired visual scanning within MLLMs, achieving adaptive and efficient visual perception by balancing broad exploration with fine-grained focus on salient regions.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential “blink-like” process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.

[110] Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Pius Horn, Janis Keuper

Main category: cs.CV

TL;DR: Novel benchmarking framework for evaluating PDF formula extraction using synthetic PDFs with LaTeX ground truth and LLM-as-a-judge for semantic assessment.

DetailsMotivation: Existing benchmarks for PDF parsing either exclude mathematical formulas or lack semantically-aware evaluation metrics, creating a critical gap for training LLMs and building scientific knowledge bases from academic literature.

Method: 1) Create synthetic PDFs with precise LaTeX ground truth for systematic control; 2) Pioneer LLM-as-a-judge for semantic formula assessment; 3) Develop two-stage matching pipeline to handle parser output inconsistencies; 4) Validate with human evaluation (750 ratings from 30 evaluators).

Result: LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) vs. CDM (r=0.34) and text similarity (r~0). Evaluation of 20+ PDF parsers on 100 synthetic documents with 2,000+ formulas reveals significant performance disparities.

Conclusion: Provides crucial insights for practitioners selecting parsers and establishes a robust, scalable methodology for reproducible evaluation of PDF formula extraction quality, with code and benchmark data publicly available.

Abstract: Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench

[111] Grounding Everything in Tokens for Multimodal Large Language Models

Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma

Main category: cs.CV

TL;DR: GETok introduces spatial tokens for MLLMs to improve object grounding in 2D space without changing the autoregressive architecture.

DetailsMotivation: Current MLLMs struggle with accurate object grounding in 2D space due to tokenization limitations of autoregressive Transformers, which raises the question of how to improve sequential language tokens for better spatial reasoning.

Method: GETok integrates specialized learnable tokens: grid tokens to partition images into spatial anchors, and offset tokens for iterative refinement of localization predictions, embedding spatial relationships directly into tokens.
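
To see how a grid-plus-offset vocabulary can encode a coordinate, here is a small round-trip sketch. The token naming scheme, grid size, and offset quantisation are purely illustrative assumptions about what such a vocabulary could look like.

```python
def point_to_tokens(x: float, y: float, grid: int = 32):
    """Encode a normalised point (x, y in [0, 1]) as one grid token plus offsets.

    The grid token names the cell (a spatial anchor); the offset pair refines
    the position inside it.
    """
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    grid_token = f"<grid_{row * grid + col}>"
    # Residual position inside the cell, quantised to 100 offset bins per axis.
    dx = int((x * grid - col) * 100)
    dy = int((y * grid - row) * 100)
    return grid_token, f"<dx_{dx}>", f"<dy_{dy}>"

def tokens_to_point(grid_token: str, dx_token: str, dy_token: str, grid: int = 32):
    """Invert the encoding back to normalised coordinates."""
    cell = int(grid_token[6:-1])
    row, col = divmod(cell, grid)
    dx = int(dx_token[4:-1]) / 100
    dy = int(dy_token[4:-1]) / 100
    return (col + dx) / grid, (row + dy) / grid

print(point_to_tokens(0.613, 0.284))                     # grid anchor plus offsets
print(tokens_to_point(*point_to_tokens(0.613, 0.284)))   # ~ (0.613, 0.284)
```

The point of such a scheme is that localization becomes ordinary next-token prediction, which is why the autoregressive architecture needs no modification.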

Result: Extensive experiments show GETok achieves superior performance over state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.

Conclusion: GETok significantly advances MLLMs in native 2D space reasoning by embedding spatial relationships into tokens, enabling precise object grounding without modifying the autoregressive architecture.

Abstract: Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.

[112] Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks

Meher Md Saad

Main category: cs.CV

TL;DR: Proposes a few-shot prototypical network with ST-GCN and Multi-Scale Temporal Aggregation for skeleton-based sign language recognition, achieving 43.75% Top-1 accuracy on WLASL and strong zero-shot generalization.

DetailsMotivation: Address data scarcity and long-tail distribution in isolated sign language recognition, where standard classifiers overfit frequent classes and fail on rare signs due to limited training examples.

Method: Few-shot prototypical network framework with skeleton-based encoder using ST-GCN and novel Multi-Scale Temporal Aggregation module; episodic training learns semantic metric space where classification is based on proximity to dynamic class prototypes.
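
The episodic classification step is the standard prototypical-network recipe and is compact enough to sketch; the embedding function below stands in for the ST-GCN encoder, and the episode shape is illustrative.

```python
import torch
import torch.nn.functional as F

def prototypical_logits(support_emb, support_lbl, query_emb):
    """Classify queries by (negative) distance to class prototypes.

    support_emb: (N_s, D) encoder outputs for the support clips.
    support_lbl: (N_s,)  integer class labels of one episode.
    query_emb:   (N_q, D) encoder outputs for the query clips.
    """
    classes = support_lbl.unique()
    protos = torch.stack([support_emb[support_lbl == c].mean(dim=0) for c in classes])
    logits = -torch.cdist(query_emb, protos)        # closer prototype => higher logit
    return logits, classes

# One 3-way, 2-shot episode with a random stand-in for the skeleton encoder.
embed = lambda n: torch.randn(n, 64)
s_emb, s_lbl = embed(6), torch.tensor([0, 0, 1, 1, 2, 2])
logits, classes = prototypical_logits(s_emb, s_lbl, embed(9))
loss = F.cross_entropy(logits, torch.randint(0, 3, (9,)))   # one episodic training step
```

Because prototypes are recomputed per episode rather than stored as fixed classifier weights, the same trained encoder can classify signs it never saw during training, which is what enables the zero-shot transfer to SignASL.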

Result: Achieves 43.75% Top-1 and 77.10% Top-5 accuracy on WLASL test set, outperforming standard classification baseline by over 13%; shows strong zero-shot generalization with nearly 30% accuracy on unseen SignASL dataset without fine-tuning.

Conclusion: Prototypical training strategy effectively handles data scarcity where standard classification fails, offering scalable pathway for recognizing extensive sign vocabularies with limited data through metric learning paradigm.

Abstract: Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, proving that the prototypical training strategy is effective in data-scarce situations where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.

[113] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang

Main category: cs.CV

TL;DR: AVI-Edit is a framework for audio-sync video instance editing that addresses the lack of audio-visual synchronization and fine-grained control in existing video editing methods.

DetailsMotivation: Existing video editing methods overlook audio-visual synchronization and lack fine-grained spatial and temporal controllability needed for precise instance-level edits, which is crucial for engaging content creation.

Method: Proposes AVI-Edit with two key components: 1) granularity-aware mask refiner that iteratively refines coarse user masks into precise instance-level regions, and 2) self-feedback audio agent that curates high-quality audio guidance for fine-grained temporal control. Also constructs a large-scale dataset with instance-centric correspondence and comprehensive annotations.

Result: Extensive experiments show AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization.

Conclusion: AVI-Edit successfully addresses the audio-visual synchronization problem in video editing and provides fine-grained spatial and temporal control for precise instance-level edits, advancing the field of audio-sync video instance editing.

Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.

[114] Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration

Wenlong Jiao, Heyang Lee, Ping Wang, Pengfei Zhu, Qinghua Hu, Dongwei Ren

Main category: cs.CV

TL;DR: A symmetric U-Net architecture (SymUNet) achieves state-of-the-art all-in-one image restoration with simpler design than existing complex methods, using aligned feature scales and streamlined cross-scale propagation. A semantic-enhanced variant (SE-SymUNet) adds CLIP features via cross-attention for further improvement.

DetailsMotivation: Existing all-in-one image restoration methods rely on complex architectures (Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. The authors argue that well-crafted feature extraction inherently encodes degradation information, and a symmetric U-Net architecture can effectively unleash these cues without complex designs.

Method: Proposes SymUNet: a symmetric U-Net architecture with aligned feature scales across encoder-decoder and streamlined cross-scale propagation, using simple additive fusion in skip connections. Also introduces SE-SymUNet: a semantic-enhanced variant that integrates frozen CLIP features via cross-attention to explicitly amplify degradation priors.
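
The architectural claim, that aligned encoder-decoder channel widths make plain additive skip fusion sufficient, can be shown in a toy model. This is a minimal illustrative sketch, not the SymUNet implementation; channel widths and depth are assumptions.

```python
import torch
import torch.nn as nn

class TinySymUNet(nn.Module):
    """Minimal symmetric U-Net with additive skip fusion.

    Encoder and decoder stages share channel widths at each scale, so skip
    connections can be fused by plain addition instead of concatenation.
    """
    def __init__(self, ch=(32, 64, 128)):
        super().__init__()
        conv = lambda cin, cout: nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())
        self.enc = nn.ModuleList([conv(3, ch[0]), conv(ch[0], ch[1]), conv(ch[1], ch[2])])
        self.dec = nn.ModuleList([conv(ch[2], ch[1]), conv(ch[1], ch[0])])
        self.head = nn.Conv2d(ch[0], 3, 1)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        e0 = self.enc[0](x)
        e1 = self.enc[1](self.down(e0))
        e2 = self.enc[2](self.down(e1))
        d1 = self.dec[0](self.up(e2)) + e1     # additive fusion: scales are aligned
        d0 = self.dec[1](self.up(d1)) + e0
        return self.head(d0) + x               # residual restoration output

out = TinySymUNet()(torch.randn(1, 3, 64, 64))
```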

Result: SymUNet achieves better results across benchmark datasets than existing approaches while reducing computational cost. SE-SymUNet further improves performance through semantic enhancement. Both methods establish simpler and stronger foundations for all-in-one image restoration.

Conclusion: A symmetric U-Net design with aligned feature scales is sufficient for state-of-the-art all-in-one image restoration, challenging the need for complex architectures. The approach provides a simpler yet more effective foundation for future advancements in the field.

Abstract: All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.

[115] Salient Object Detection in Complex Weather Conditions via Noise Indicators

Quan Chen, Xiaokai Yang, Tingyu Wang, Rongfeng Lu, Xichun Sheng, Yaoqi Sun, Chenggang Yan

Main category: cs.CV

TL;DR: A weather-robust salient object detection framework with noise indicator fusion that improves segmentation accuracy under diverse weather conditions.

DetailsMotivation: Most SOD methods assume clean visual conditions and fail to handle weather-induced noise in real-world scenarios, which degrades segmentation accuracy.

Method: Proposes a SOD framework with specific encoder and replaceable decoder. Introduces one-hot noise indicator vector for different weather types and Noise Indicator Fusion Module (NIFM) that takes semantic features and noise indicator as inputs, inserted between encoder stages for adaptive feature modulation.
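
One plausible reading of the NIFM is FiLM-style feature modulation: a small MLP maps the one-hot indicator to per-channel scale and shift applied between encoder stages. The module below is an illustrative sketch under that assumption; the paper's internals may differ.

```python
import torch
import torch.nn as nn

class NoiseIndicatorFusion(nn.Module):
    """FiLM-style feature modulation conditioned on a one-hot weather indicator."""
    def __init__(self, num_weather: int, channels: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_weather, channels), nn.ReLU(),
                                 nn.Linear(channels, 2 * channels))

    def forward(self, feat: torch.Tensor, indicator: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); indicator: (B, num_weather) one-hot weather type.
        gamma, beta = self.mlp(indicator).chunk(2, dim=1)
        return feat * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]

# Toy usage: modulate stage features for two samples with different weather.
nifm = NoiseIndicatorFusion(num_weather=5, channels=64)
feat = torch.randn(2, 64, 32, 32)
weather = torch.eye(5)[[1, 3]]               # one-hot indicators, e.g. rain and snow
out = nifm(feat, weather)
```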

Result: Extensive experiments on WXSOD dataset with varying training data scales (100%, 50%, 30%) and multiple encoder/decoder configurations show the proposed framework (especially NIFM-enhanced encoder) improves segmentation accuracy under complex weather conditions compared to vanilla encoder.

Conclusion: The proposed weather-aware SOD framework effectively handles diverse weather conditions through noise indicator fusion, maintaining compatibility with mainstream decoders while improving robustness to weather-induced noise.

Abstract: Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, and 30% of the full training set) and across three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.

[116] Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira, M. Prasad

Main category: cs.CV

TL;DR: Training-free text-only retrieval method for remote sensing images using VLM-generated structured captions, achieving competitive results with supervised models.

DetailsMotivation: Address the semantic gap in remote sensing image retrieval and the limitations of existing VLM methods that require costly domain-specific training, while lacking benchmarks for evaluating VLM-generated text in zero-shot retrieval.

Method: Introduce RSRT dataset with structured captions, propose TRSLLaVA method that reformulates cross-modal retrieval as text-to-text matching using rich text descriptions as queries against VLM-generated captions in unified textual embedding space, completely bypassing training/fine-tuning.
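
Once queries and VLM-generated captions live in one textual embedding space, retrieval reduces to cosine ranking. A minimal NumPy sketch; the sentence-embedding model that produces the vectors is deliberately left unspecified, since the paper's encoder choice is not assumed here.

```python
import numpy as np

def t2t_retrieve(query_vecs: np.ndarray, caption_vecs: np.ndarray, top_k: int = 5):
    """Rank database images by cosine similarity between text embeddings.

    query_vecs:   (Q, D) embeddings of the rich-text queries.
    caption_vecs: (N, D) embeddings of VLM-generated captions, one per image.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    sims = q @ c.T                                   # (Q, N) cosine similarities
    return np.argsort(-sims, axis=1)[:, :top_k]      # indices of best-matching images

ranks = t2t_retrieve(np.random.rand(3, 384), np.random.rand(100, 384))
```

Note that nothing in this pipeline is trained: the only offline cost is captioning each database image once, which is what makes the approach training-free.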

Result: Achieves mean Recall of 42.62% on RSITMD, nearly doubling standard zero-shot CLIP baseline (23.86%) and surpassing several top supervised models; shows competitive performance on RSICD benchmark.

Conclusion: High-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval, validating the utility of training-free text-only approaches.

Abstract: Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the “semantic gap”, the discrepancy between a model’s low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.

[117] Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos

Bishoy Galoaa, Sarah Ostadabbas

Main category: cs.CV

TL;DR: TCAM is a motion-centric framework that automatically discovers and describes multiple motion patterns in videos without user queries, using motion-field attention for spatial grounding and achieving strong performance on video understanding tasks.

DetailsMotivation: Video understanding in challenging conditions (occlusion, camouflage, rapid movement) relies more on motion dynamics than static appearance. Current methods often require user queries, but TCAM aims to autonomously discover and describe motion patterns.

Method: TCAM uses a motion-centric framework with motion-field attention mechanism to identify multiple motion activities and spatially ground natural language descriptions to corresponding trajectories. It aligns motion patterns with contrastive vision-language representations through unified training combining global video-text alignment with fine-grained spatial correspondence, enabling query-free discovery via multi-head cross-attention.

Result: On MeViS benchmark: 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.

Conclusion: Motion patterns aligned with vision-language representations provide powerful semantic signals for action recognition and description. TCAM successfully enables autonomous discovery of multiple motion expressions without user queries, showing strong performance across video understanding tasks.

Abstract: We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.

[118] Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation

Mohammad Sadegh Gholizadeh, Amir Arsalan Rezapour

Main category: cs.CV

TL;DR: Deep learning pipeline combining feature extraction with interpretable image processing for automated ocular disease diagnosis, using retinal vessel segmentation as auxiliary task to improve clinical interpretability.

DetailsMotivation: Rising incidence of vision-threatening eye diseases requires scalable, accurate screening solutions. Need to address "black-box" limitations of standard CNNs and bridge gap between algorithmic output and expert medical validation to reduce false positives and improve clinical deployment viability.

Method: Pipeline combining deep feature extraction with interpretable image processing modules. Uses high-fidelity retinal vessel segmentation as auxiliary task to guide classification process. Grounds model predictions in clinically relevant morphological features.

Result: Not specified in the abstract; the paper presents a comprehensive study, but quantitative results are not detailed in the available content.

Conclusion: Approach aims to improve interpretability and clinical validation of automated ocular disease diagnosis systems, potentially reducing false positives and enhancing deployment viability in clinical settings.

Abstract: In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the “black-box” limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model’s predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.

[119] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces

Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

Main category: cs.CV

TL;DR: Lang2Motion generates object motion trajectories from text by aligning motion manifolds with CLIP embeddings, outperforming video-based methods on retrieval and motion accuracy.

DetailsMotivation: Prior work focuses on human motion or video synthesis, but there's a need for generating explicit motion trajectories for arbitrary objects from language descriptions.

Method: Transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP’s frozen encoders.

Result: Achieves 34.2% Recall@1 on text-to-trajectory retrieval (12.5 points better than video methods), improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3), and shows 88.3% Top-1 accuracy on human action recognition despite training only on object motions.

Conclusion: Lang2Motion effectively aligns motion manifolds with joint embedding spaces, enabling text-guided trajectory generation with strong cross-domain transfer and supporting style transfer, semantic interpolation, and latent-space editing.

Abstract: We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP’s frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.

[120] DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen, Junbo Niu, Yuan Qu, Bin Wang, Ka-Ho Chow, Conghui He, Wentao Zhang

Main category: cs.CV

TL;DR: DOCR-Inspector is a VLM-based system that formalizes document parsing assessment as fine-grained error detection and analysis, outperforming commercial models on real-world benchmarks and providing guidance for parsing refinement.

DetailsMotivation: Current document parsing evaluation faces challenges: standard benchmarks have dataset biases leading to inconsistent model rankings, overall scores obscure error patterns, and there's no reliable way to assess parsing quality in real-world scenarios.

Method: DOCR-Inspector uses VLM-as-a-Judge to analyze document images and parsed outputs, identifies errors across 28 predefined types, employs Chain-of-Checklist reasoning for hierarchical assessment, and is trained on DOCRcase-200K dataset.

Result: DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro and leading open-source models on DOCRcaseBench (882 real-world cases with manual annotations), and its quality assessments effectively guide parsing refinement.

Conclusion: DOCR-Inspector provides a comprehensive, reliable assessment framework for document parsing that addresses limitations of current benchmarks, serving as both a practical evaluator and a driver for advancing parsing systems at scale.

Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.

[121] K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices

Bishoy Galoaa, Pau Closas, Sarah Ostadabbas

Main category: cs.CV

TL;DR: K-Track is a hybrid acceleration framework that combines sparse deep learning keyframe updates with lightweight Kalman filtering to achieve 5-10X speedup for point tracking while maintaining over 85% accuracy on edge devices.

DetailsMotivation: Deep learning-based point trackers achieve state-of-the-art accuracy but require per-frame GPU inference, making them impractical for deployment on resource-constrained edge devices where compute, power, and connectivity are limited.

Method: K-Track uses a hybrid approach: sparse deep learning updates on keyframes combined with lightweight Kalman filtering for intermediate frame prediction, with principled Bayesian uncertainty propagation to maintain temporal coherence. It’s a general-purpose, tracker-agnostic acceleration framework.
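
The hybrid schedule is easy to sketch: a constant-velocity Kalman filter predicts intermediate frames cheaply, and the expensive deep tracker corrects it only at keyframes. A minimal sketch for a single tracked point; the state model, noise levels, and keyframe interval are illustrative assumptions, not K-Track's tuned configuration.

```python
import numpy as np

class CVKalman:
    """Constant-velocity Kalman filter over (x, y, vx, vy) for one tracked point."""
    def __init__(self, xy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])
        self.P = np.eye(4)
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt      # motion model
        self.H = np.eye(2, 4)                                     # observe position only
        self.Q, self.R = q * np.eye(4), r * np.eye(2)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q              # uncertainty grows
        return self.x[:2]

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                  # Kalman gain
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

def track(frames, deep_tracker, keyframe_every=5):
    """Run the deep tracker only on keyframes; predict the frames in between."""
    kf, out = None, []
    for t, frame in enumerate(frames):
        if t % keyframe_every == 0:
            z = deep_tracker(frame)               # expensive GPU inference
            if kf is None:
                kf = CVKalman(z)
            else:
                kf.predict(); kf.update(z)        # correct accumulated drift
            out.append(kf.x[:2].copy())
        else:
            out.append(kf.predict().copy())       # cheap intermediate prediction
    return out

# Toy run: the "deep tracker" reads a noisy ground-truth position per frame.
traj = track(range(20), lambda t: np.array([t, 0.5 * t]) + np.random.randn(2) * 0.1)
```

With `keyframe_every=5`, GPU inference runs on a fifth of the frames, which is where the reported 5-10X speedup comes from; the growing covariance `P` between keyframes is the Bayesian uncertainty the summary mentions.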

Result: Achieves 5-10X speedup while retaining over 85% of original trackers’ accuracy. Demonstrated real-time performance on edge platforms like NVIDIA Jetson Nano and RTX Titan across multiple state-of-the-art point trackers.

Conclusion: K-Track provides a practical path to deploy high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.

Abstract: Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers’ accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.

[122] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng

Main category: cs.CV

TL;DR: TriDF is a comprehensive benchmark for interpretable DeepFake detection that evaluates models across three key aspects: Perception (identifying manipulation artifacts), Detection (classification performance), and Hallucination (explanation reliability).

DetailsMotivation: Advances in generative AI make it easy to create realistic fake portrayals of individuals, creating serious risks for security, communication, and public trust. There's a need for systems that not only detect manipulated content but also provide clear, reliable reasoning.

Method: TriDF benchmark contains high-quality forgeries from advanced synthesis models covering 16 DeepFake types across image, video, and audio modalities. It evaluates models on three aspects: Perception (using human-annotated evidence), Detection (classification across diverse forgery families), and Hallucination (quantifying explanation reliability).

Result: Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects.

Conclusion: TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems to address real-world synthetic media threats.

Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

[123] NaviHydra: Controllable Navigation-guided End-to-end Autonomous Driving with Hydra-distillation

Hanfeng Wu, Marlon Steiner, Michael Schmidt, Alvaro Marcos-Ramiro, Christoph Stiller

Main category: cs.CV

TL;DR: NaviHydra: A controllable navigation-guided end-to-end model for autonomous driving that accepts high-level navigation commands and generates compliant trajectories, achieving SOTA results on NAVSIM benchmark.

DetailsMotivation: Traditional rule-based systems struggle in dynamic environments, while end-to-end methods have difficulty complying with explicit navigation commands. There's a need for robust models that can interpret high-level commands and generate safe, compliant trajectories.

Method: NaviHydra is a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. It uses BEV-based trajectory gathering for enhanced feature extraction, accepts navigation commands as control signals, and introduces a novel navigation compliance metric for evaluation.

Result: The method significantly outperforms baseline models and achieves state-of-the-art results in the NAVSIM benchmark, demonstrating effectiveness in advancing autonomous driving with improved controllability and navigation safety.

Conclusion: NaviHydra successfully addresses the challenge of making end-to-end autonomous driving models compliant with explicit navigation commands while maintaining robustness in dynamic environments, representing an important advancement in controllable autonomous driving systems.

Abstract: The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird’s Eye View (BEV) based trajectory gathering method to enhance trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to the intended route, improving controllability and navigation safety. To comprehensively assess our model’s controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results on the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.
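
The abstract names a navigation compliance metric but does not define it; a plausible reading is the fraction of predicted waypoints that stay within a lateral tolerance of the intended route. The sketch below implements that guess in NumPy; the metric's actual definition in the paper may differ.

```python
import numpy as np

def navigation_compliance(traj, route, tol=2.0):
    """Hypothetical compliance score: fraction of predicted waypoints lying
    within `tol` meters of the intended route polyline.
    traj: (N, 2) predicted waypoints; route: (M, 2) route polyline."""
    def point_to_segment(p, a, b):
        ab, ap = b - a, p - a
        t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-9), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * ab))
    dists = [min(point_to_segment(p, route[i], route[i + 1])
                 for i in range(len(route) - 1)) for p in traj]
    return float(np.mean(np.array(dists) <= tol))

traj = np.array([[0, 0], [1, 0.5], [2, 0.4], [3, 3.5]], dtype=float)
route = np.array([[0, 0], [4, 0]], dtype=float)
print(navigation_compliance(traj, route))  # 0.75: the last waypoint strays off-route
```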

[124] XDen-1K: A Density Field Dataset of Real-World Objects

Jingxuan Zhang, Tianqi Yu, Yatu Zhang, Jinze Wu, Kaixin Yao, Jingyang Liu, Yuyao Zhang, Jiayuan Gu, Jingyi Yu

Main category: cs.CV

TL;DR: XDen-1K is the first large-scale multi-modal dataset for real-world physical property estimation, focusing on volumetric density, with 1,000 objects across 148 categories including 3D models and biplanar X-ray scans.

DetailsMotivation: Current AI models capture object surface geometry but neglect internal physical properties like volumetric density, which are critical for predicting center of mass, stability, and interaction dynamics in robotics and simulation applications. The main bottleneck has been the lack of large-scale real-world data.

Method: 1) Created XDen-1K dataset with 1,000 real-world objects across 148 categories, providing high-resolution 3D geometric models with part-level annotations and corresponding biplanar X-ray scans. 2) Developed a novel optimization framework that recovers high-fidelity volumetric density fields from sparse X-ray views. 3) Used X-ray images as conditioning signals for volumetric segmentation and conducted robotics task experiments.

Result: The dataset enables effective improvement in center-of-mass estimation accuracy and robotic manipulation success rates. The optimization framework successfully recovers volumetric density fields, and X-ray conditioning improves volumetric segmentation performance.

Conclusion: XDen-1K serves as a foundational resource and benchmark for physically grounded visual inference and embodied AI research, bridging the gap in real-world physical property data and enabling better prediction of object behavior in robotics and simulation applications.

Abstract: A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object’s surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object’s center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
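
To make the density-recovery step concrete, here is a toy sketch under strong assumptions (linear attenuation, a known ray-voxel projection matrix, Tikhonov regularization): recovering voxel densities from a small number of noisy line integrals by regularized least squares. The paper's optimization framework is surely richer; this only illustrates the shape of the inverse problem.

```python
import numpy as np

# Toy density-field recovery from sparse projections (all quantities assumed).
rng = np.random.default_rng(0)
n_vox, n_rays = 64, 40                      # under-determined, like sparse X-ray views
A = rng.random((n_rays, n_vox))             # stand-in for ray-voxel intersection lengths
rho_true = np.abs(rng.normal(1.0, 0.2, n_vox))
y = A @ rho_true + 0.01 * rng.normal(size=n_rays)   # noisy line integrals

lam = 0.1                                   # regularization weight (assumed)
# Ridge-regularized normal equations: (A^T A + lam I) rho = A^T y
rho = np.linalg.solve(A.T @ A + lam * np.eye(n_vox), A.T @ y)
print("relative error:", np.linalg.norm(rho - rho_true) / np.linalg.norm(rho_true))
```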

[125] Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching

Javier Villena Toro, Mehdi Tarkian

Main category: cs.CV

TL;DR: Geo6DPose is a lightweight, training-free pipeline for zero-shot 6D object pose estimation that runs locally on commodity hardware, achieving real-time performance while matching larger cloud-based models.

DetailsMotivation: Current zero-shot 6D pose estimation methods rely on large-scale models and cloud inference, causing high latency, energy consumption, and deployment risks related to connectivity, cost, and data governance. These limitations conflict with practical robotics constraints where compute is limited and on-device inference is required.

Method: Combines foundation model visual features with geometric filtering: computes similarity maps between template DINO descriptors and scene patches, establishes mutual correspondences by projecting scene patch centers to 3D and template descriptors to object model coordinates, recovers poses via correspondence-driven RANSAC, and ranks poses using a weighted geometric alignment metric that accounts for reprojection consistency and spatial support.

Result: Achieves sub-second inference on a single commodity GPU while matching average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). Requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones.

Conclusion: Geo6DPose advances practical, fully local 6D perception for robotic deployment by trading model scale for geometric reliability, enabling real-time performance without cloud dependencies while maintaining competitive accuracy.

Abstract: Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance: factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: Similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
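
The mutual-correspondence step can be sketched directly: keep a template-scene patch pair only when each is the other's nearest neighbor in descriptor space. A minimal NumPy version, with random vectors standing in for DINO descriptors:

```python
import numpy as np

def mutual_matches(desc_a, desc_b):
    """Hypothetical mutual nearest-neighbor matching on L2-normalized patch
    descriptors of shapes (N, D) and (M, D), as a stand-in for the paper's
    template-to-scene correspondence step."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                          # cosine similarity map
    ab = sim.argmax(axis=1)                # best scene patch for each template patch
    ba = sim.argmax(axis=0)                # best template patch for each scene patch
    return [(i, j) for i, j in enumerate(ab) if ba[j] == i]  # keep mutual pairs only

rng = np.random.default_rng(1)
templates, scene = rng.normal(size=(50, 128)), rng.normal(size=(80, 128))
print(len(mutual_matches(templates, scene)), "mutual correspondences")
```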

[126] Optimal transport unlocks end-to-end learning for single-molecule localization

Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal

Main category: cs.CV

TL;DR: A new deep learning approach for single-molecule localization microscopy that reformulates training as a set-matching problem with optimal-transport loss, eliminating non-differentiable NMS layers and enabling end-to-end training for faster, more accurate super-resolution imaging.

DetailsMotivation: Current SMLM requires non-overlapping fluorophores, leading to long acquisition times that hinder live-cell imaging. Existing deep learning approaches rely on non-differentiable NMS layers that may discard true positives and use local fusion strategies that limit performance.

Method: Reformulates SMLM training as a set-matching problem using optimal-transport loss to eliminate NMS during inference. Proposes an iterative neural network that integrates knowledge of the microscope’s optical system for end-to-end training.

Result: Experiments on synthetic benchmarks and real biological data show superior performance compared to state-of-the-art methods at both moderate and high emitter densities.

Conclusion: The proposed approach enables faster, more accurate super-resolution imaging by eliminating non-differentiable components and integrating optical system knowledge, making live-cell SMLM more feasible.

Abstract: Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores – fluorescent molecules stained onto the observed specimen – over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this work, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope’s optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
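
The set-matching idea can be illustrated with a hard-assignment stand-in: Hungarian matching between predicted and ground-truth emitters plus a cardinality penalty. The paper derives a differentiable optimal-transport loss instead; this SciPy sketch only conveys the structure of the objective.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_matching_loss(pred, gt, miss_cost=1.0):
    """Sketch of a set-matching objective between predicted (N, 2) and
    ground-truth (M, 2) emitter positions. Hungarian matching is a hard
    stand-in for the paper's optimal-transport formulation; the per-emitter
    miss_cost for unmatched detections is an assumption."""
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # min-cost one-to-one matching
    matched = cost[rows, cols].sum()
    unmatched = abs(len(pred) - len(gt)) * miss_cost
    return matched + unmatched

pred = np.array([[0.1, 0.0], [2.0, 2.1], [5.0, 5.0]])
gt = np.array([[0.0, 0.0], [2.0, 2.0]])
print(set_matching_loss(pred, gt))  # two close matches + one spurious detection
```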

[127] Sharp Monocular View Synthesis in Less Than a Second

Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan R. Richter, Vladlen Koltun

Main category: cs.CV

TL;DR: SHARP enables photorealistic view synthesis from a single image using 3D Gaussian representation, achieving real-time rendering with metric scale and state-of-the-art performance.

DetailsMotivation: The paper aims to solve the problem of generating photorealistic novel views from just a single input image, which is challenging due to limited information. Current methods often require multiple images, complex processing, or lack real-time capabilities.

Method: SHARP uses a neural network to regress parameters of a 3D Gaussian representation from a single image in a single feedforward pass (<1 second on GPU). This representation supports metric scale and can be rendered in real time for novel views.

Result: SHARP achieves state-of-the-art performance, reducing LPIPS by 25-34% and DISTS by 21-43% compared to prior models, while being three orders of magnitude faster. It shows robust zero-shot generalization across datasets.

Conclusion: SHARP demonstrates that efficient single-image view synthesis is possible with 3D Gaussian representations, enabling real-time photorealistic rendering with metric scale and strong generalization.

Abstract: We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp

[128] CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images

Matias Cosarinsky, Nicolas Gaggion, Rodrigo Echeveste, Enzo Ferrante

Main category: cs.CV

TL;DR: The paper introduces uncertainty estimation methods for anatomical landmark-based segmentation in chest X-rays, releasing a large-scale dataset with per-node uncertainty estimates to enhance robustness and safe clinical deployment.

DetailsMotivation: Uncertainty estimation is essential for safe clinical deployment of medical image segmentation systems, but prior work has focused on pixel-level uncertainty while landmark-based segmentation offers topological guarantees but remains underexplored from an uncertainty perspective.

Method: Uses hybrid neural network architectures combining standard image convolutional encoders with graph-based generative decoders, leveraging their variational latent space to derive two complementary uncertainty measures: latent uncertainty (from learned distribution parameters) and predictive uncertainty (from multiple stochastic output predictions).

Result: Both uncertainty measures increase with perturbation severity in controlled corruption experiments, reflecting global and local degradation. The uncertainty signals can identify unreliable predictions and support out-of-distribution detection on the CheXmask dataset.

Conclusion: Establishes uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-rays, and releases CheXmask-U dataset with 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates.

Abstract: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large-scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-rays. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.
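
A minimal sketch of the two measures, assuming a variational decoder that maps a (mu, logvar) latent to K 2D landmarks; the decoder, shapes, and sample count are hypothetical, not the paper's architecture.

```python
import numpy as np

def uncertainty_measures(decoder, mu, logvar, n_samples=32):
    """Latent uncertainty reads variance off the learned posterior directly;
    predictive uncertainty is the per-landmark spread of decoded samples.
    `decoder` is a hypothetical callable from a latent vector to (K, 2)."""
    latent_unc = np.exp(logvar).mean()               # latent-space uncertainty
    std = np.exp(0.5 * logvar)
    samples = np.stack([decoder(mu + std * np.random.randn(*mu.shape))
                        for _ in range(n_samples)])  # (n_samples, K, 2)
    predictive_unc = samples.var(axis=0).sum(axis=-1)  # per-node variance, shape (K,)
    return latent_unc, predictive_unc

# Toy decoder: a fixed linear map from an 8-d latent to 4 landmarks.
W = np.random.randn(8, 8)
toy_decoder = lambda z: (z @ W).reshape(4, 2)
lat, pred = uncertainty_measures(toy_decoder, np.zeros(8), np.full(8, -2.0))
print(lat, pred.shape)  # scalar latent uncertainty, (4,) per-node estimates
```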

[129] SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell

Main category: cs.CV

TL;DR: SpaceDrive introduces a spatial-aware VLM framework for autonomous driving that uses explicit positional encodings for 3D coordinates instead of textual digit tokens, improving spatial reasoning and planning accuracy.

DetailsMotivation: Current vision language models struggle with fine-grained 3D spatial understanding, which is essential for autonomous driving systems interacting with the physical world. Textual digit tokens are insufficient for representing spatial relationships.

Method: Proposes SpaceDrive with a universal positional encoder that processes 3D coordinates from multi-view depth estimation, historical ego-states, and text prompts. These positional encodings augment visual tokens and serve as task-agnostic coordinate representations, replacing digit-wise numerical tokens for both inputs and outputs.

Result: Achieves state-of-the-art open-loop performance on nuScenes dataset and second-best Driving Score of 78.02 on Bench2Drive closed-loop benchmark among VLM-based methods.

Conclusion: Explicit spatial positional encodings enable better spatial reasoning and direct trajectory coordinate regression, significantly improving autonomous driving performance compared to traditional VLM approaches using textual digit tokens.

Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among existing VLM-based methods.
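
The abstract does not specify the encoder's form; a common choice for encoding continuous coordinates is sinusoidal features per axis, sketched below as an assumption-laden illustration of what explicit positional encodings for 3D coordinates can look like.

```python
import numpy as np

def encode_3d(xyz, dim_per_axis=16, max_freq=8.0):
    """Hypothetical sinusoidal positional encoding for 3D coordinates: each
    axis is expanded into sin/cos features at geometrically spaced
    frequencies, then concatenated. The paper's universal positional
    encoder may be learned and differ from this."""
    freqs = 2.0 ** np.linspace(0.0, np.log2(max_freq), dim_per_axis // 2)
    feats = []
    for axis in range(3):
        phase = xyz[..., axis:axis + 1] * freqs     # (..., dim_per_axis // 2)
        feats += [np.sin(phase), np.cos(phase)]
    return np.concatenate(feats, axis=-1)           # (..., 3 * dim_per_axis)

points = np.array([[1.5, -0.2, 12.0], [0.0, 3.1, 4.2]])
print(encode_3d(points).shape)  # (2, 48): one embedding per 3D point
```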

[130] UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

Main category: cs.CV

TL;DR: UltraCUA is a foundation model that unifies primitive GUI actions (click, type, scroll) with high-level tool execution via APIs, enabling more resilient computer-use agents through hybrid action selection.

DetailsMotivation: Current computer-use agents rely exclusively on brittle primitive GUI actions prone to cascading failures, while API-driven agents have rich structured interfaces. There's a need to bridge this gap and create agents that can intelligently choose between low-level visual interactions and high-level tool execution.

Method: Four key advances: 1) Automated pipeline extracts tool capabilities from software documentation and code repositories; 2) Synthetic data engine produces 17,000+ verifiable tasks; 3) Hybrid action trajectory collection incorporating both GUI primitives and tool calls; 4) Two-stage training combining supervised fine-tuning with online reinforcement learning for intelligent GUI vs. API action selection.

Result: UltraCUA achieves 22% relative improvement on OSWorld while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena shows 21.7% success rate, surpassing Windows-trained baselines. Hybrid action reduces error propagation and improves execution efficiency.

Conclusion: The hybrid action paradigm bridges primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks, establishing a scalable paradigm for future computer-use agents.

Abstract: Computer-use agents face a fundamental limitation. They rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves a 22% relative improvement on average while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.

[131] Video Depth Propagation

Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool

Main category: cs.CV

TL;DR: VeloDepth is an efficient online video depth estimation pipeline that uses flow-based feature propagation with learned corrections to achieve temporal consistency and real-time performance.

DetailsMotivation: Existing video depth estimation methods either suffer from temporal inconsistencies (frame-by-frame monocular models) or are computationally demanding (complex temporal modeling), limiting their practical applicability for real-time applications.

Method: Proposes VeloDepth with a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections, structurally enforcing temporal consistency while maintaining efficiency.

Result: Comprehensive zero-shot evaluation shows state-of-the-art temporal consistency, competitive accuracy, and significantly faster inference compared to existing video-based depth estimators.

Conclusion: VeloDepth provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks, with code and models publicly available.

Abstract: Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
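
The propagation idea, backward-warping the previous depth map along optical flow and adding a learned residual correction, can be sketched in a few lines. The nearest-neighbor sampling and toy residual function below are simplifications for illustration, not the paper's Propagation Module.

```python
import numpy as np

def propagate_depth(prev_depth, flow, residual_net):
    """Sketch of flow-based depth propagation: backward-warp the previous
    depth map along optical flow, then add a learned residual correction.
    `residual_net` is a hypothetical callable; nearest-neighbor sampling
    keeps the sketch short (bilinear sampling would be typical)."""
    h, w = prev_depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    warped = prev_depth[src_y, src_x]       # depth carried along the flow
    return warped + residual_net(warped)    # learned correction on top

toy_residual = lambda d: 0.01 * (d.mean() - d)   # placeholder correction
depth = np.random.rand(4, 6) + 1.0
flow = np.ones((4, 6, 2))                        # uniform 1-pixel motion
print(propagate_depth(depth, flow, toy_residual).shape)  # (4, 6)
```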

[132] IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation

Yuan-Ming Li, Qize Yang, Nan Lei, Shenghao Fu, Ling-An Zeng, Jian-Fang Hu, Xihan Wei, Wei-Shi Zheng

Main category: cs.CV

TL;DR: IRMoGen introduces a novel paradigm that interleaves motion generation with assessment and refinement through iterative text-motion dialogue, achieving better text-motion alignment and outperforming baselines on generation benchmarks.

DetailsMotivation: Current motion-aware LLMs treat understanding and generation separately, missing potential benefits from interactive feedback between tasks. The paper identifies motion assessment and refinement as crucial bridges for bidirectional knowledge flow between understanding and generation.

Method: Proposes IRMoGen paradigm with IRG-MotionLLM model that seamlessly interleaves motion generation, assessment, and refinement. Uses three-stage training scheme and automated data engine to synthesize interleaved reasoning annotations from existing datasets.

Result: Assessment and refinement tasks significantly improve text-motion alignment; interleaving steps yields consistent performance gains; IRG-MotionLLM outperforms baselines and achieves advanced performance on standard text-to-motion generation benchmarks.

Conclusion: The interleaved reasoning paradigm effectively bridges motion understanding and generation through assessment and refinement, demonstrating that tight coupling of these tasks leads to improved motion generation performance.

Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.

[133] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation

Tianyu Zhou, Junyi Tang, Zehui Li, Dahong Qian, Suncheng Xiang

Main category: cs.CV

TL;DR: LDP is a multimodal LLM framework for generating professional polyp diagnosis reports from colonoscopy images, using a curated endoscopic dataset and efficient fine-tuning to outperform existing methods while reducing computational costs by 833x.

DetailsMotivation: Traditional automated polyp reporting suffers from inconsistencies and hallucinations due to scarcity of high-quality multimodal medical data, creating a need for more reliable, clinically-aligned diagnostic systems.

Method: Proposes LDP framework using Qwen2-VL-7B backbone fine-tuned with LoRA for parameter efficiency and DPO for clinical alignment, trained on curated MMEndo dataset of expert-annotated colonoscopy image-text pairs.

Result: Outperforms existing baselines on automated metrics and clinical expert evaluations (Physician Score 7.2/10), reduces training computational costs by 833x compared to full fine-tuning, and shows robustness on IU-XRay dataset.

Conclusion: LDP offers a scalable, clinically viable solution for primary healthcare polyp diagnosis, demonstrating effectiveness through multimodal LLM fine-tuning with significant computational efficiency gains.

Abstract: Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
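
For readers unfamiliar with the mechanics, a minimal LoRA setup with Hugging Face peft looks like the sketch below; the target modules and hyperparameters are illustrative assumptions rather than the paper's exact recipe, and the DPO alignment stage is omitted for brevity.

```python
# Minimal LoRA fine-tuning setup sketch (hyperparameters assumed).
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)   # freezes the base model; only adapters train
model.print_trainable_parameters()    # the tiny trainable fraction is what drives the reported cost savings
```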

[134] Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography

Rene Lisasi, Michele Esposito, Chen Zhao

Main category: cs.CV

TL;DR: A diffusion-based AI model predicts coronary blood pressure from CT scans, bypassing slow CFD simulations for faster CAD diagnosis.

DetailsMotivation: CFD simulations for coronary blood flow are computationally expensive and time-consuming, limiting their clinical adoption and availability of labeled data for AI training in CAD diagnosis.

Method: Developed an end-to-end pipeline automating coronary geometry extraction from CCTA, streamlining simulation data generation, and introducing a diffusion-based regression model to predict blood pressure directly from CT features without CFD during inference.

Result: The model achieves state-of-the-art performance with R² of 64.42%, RMSE of 0.0974, and normalized RMSE of 0.154, outperforming baseline approaches on simulated coronary hemodynamics data.

Conclusion: Provides a scalable, accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis, reducing manual burden and computational costs of traditional CFD workflows.

Abstract: Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an R² of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.

[135] What matters for Representation Alignment: Global Information or Spatial Structure?

Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie

Main category: cs.CV

TL;DR: The paper shows that spatial structure (patch token similarities), not global semantic performance, drives generation quality in representation alignment for diffusion models.

DetailsMotivation: To investigate what aspect of target representations matters most for generative training: global semantic information (e.g., ImageNet accuracy) or spatial structure (patch token similarities).

Method: Large-scale empirical analysis across 27 vision encoders, then introducing iREPA with two simple modifications: replacing MLP projection with convolution layer and adding spatial normalization for external representations.

Result: Spatial structure, not global semantic performance, drives generation quality. iREPA consistently improves convergence speed across diverse encoders, model sizes, and training variants.

Conclusion: The work challenges prevailing assumptions about representation alignment and provides simple but effective modifications that improve training efficiency for generative models.

Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves the convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representation alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
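
The two modifications are small enough to sketch in PyTorch: a convolutional projection in place of the MLP, and per-location normalization of the external features before cosine alignment. Shapes, the kernel size, and the exact normalization are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvProjection(nn.Module):
    """Convolution in place of REPA's MLP projection (kernel size assumed)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Conv2d(dim_in, dim_out, kernel_size=3, padding=1)

    def forward(self, feats):               # feats: (B, C, H, W) diffusion features
        return self.proj(feats)

def spatial_normalize(target):              # target: (B, C, H, W) encoder features
    # Normalize each spatial location to unit norm so alignment emphasizes
    # the pattern of patch-to-patch similarities rather than global magnitude.
    return F.normalize(target, dim=1)

def repa_loss(diff_feats, target_feats, proj):
    pred = F.normalize(proj(diff_feats), dim=1)
    # Negative cosine similarity, averaged over batch and spatial locations.
    return -(pred * spatial_normalize(target_feats)).sum(dim=1).mean()

proj = ConvProjection(320, 768)
loss = repa_loss(torch.randn(2, 320, 16, 16), torch.randn(2, 768, 16, 16), proj)
print(loss.item())
```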

[136] Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading

Masum Shah Junayed, John Derek Van Vessem, Qian Wan, Gahie Nam, Sheida Nabavi

Main category: cs.CV

TL;DR: GLAT-IRM: A graph-based transformer with iterative patch refinement for prostate cancer grading from whole-slide images, achieving state-of-the-art performance through spatial consistency and relevant region selection.

DetailsMotivation: Prostate cancer grading from WSIs is challenging due to large image scale, heterogeneous tissue structures, and difficulty selecting diagnostically relevant regions. Existing methods using random/static patch selection include redundant/non-informative regions that degrade performance.

Method: Proposes Graph Laplacian Attention-Based Transformer (GLAT) with Iterative Refinement Module (IRM). IRM iteratively refines patch selection using pretrained ResNet50 for local features and foundation model for importance scoring. GLAT models tissue connectivity via graph where patches are nodes, uses graph Laplacian constraints for spatial consistency, and learnable filtering to enhance discriminative features. Includes convex aggregation for dynamic patch importance adjustment.

Result: Extensive experiments on five public and one private dataset show the model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.

Conclusion: The proposed GLAT-IRM framework effectively addresses challenges in prostate cancer grading by combining iterative patch refinement with graph-based spatial modeling, leading to improved diagnostic accuracy and computational efficiency.

Abstract: Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and the difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.
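
The graph Laplacian constraint has a standard form worth spelling out: with patches as nodes, the penalty tr(XᵀLX) with L = D − A is small exactly when connected patches carry similar features. A NumPy sketch with a hypothetical adjacency:

```python
import numpy as np

def laplacian_smoothness(features, adjacency):
    """Graph-Laplacian spatial-consistency penalty over patch features:
    tr(X^T L X) equals the sum of squared feature differences across edges,
    so it penalizes inconsistent neighboring patches. The adjacency
    construction here is an assumption, not the paper's."""
    deg = np.diag(adjacency.sum(axis=1))
    lap = deg - adjacency                        # combinatorial Laplacian L = D - A
    return np.trace(features.T @ lap @ features)

# Four patches on a line; spatial neighbors are connected.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0], [1.1], [0.9], [5.0]])       # the last patch is inconsistent
print(laplacian_smoothness(X, A))                # dominated by the 0.9 -> 5.0 jump
```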

[137] Self-Ensemble Post Learning for Noisy Domain Generalization

Wang Lu, Jindong Wang

Main category: cs.CV

TL;DR: SEPL (Self-Ensemble Post Learning) addresses domain generalization with noisy labels by leveraging intermediate features and ensemble learning to reduce spurious feature reliance.

DetailsMotivation: Domain generalization methods degrade when encountering noisy labels, as noise exacerbates spurious feature emergence in deep layers (spurious feature enlargement). Existing algorithms need to be made robust to noise while maintaining domain generalization capabilities.

Method: SEPL consists of feature probing training and prediction ensemble inference. It trains multiple probing classifiers on intermediate feature representations from pre-trained models using semi-supervised algorithms (to handle noisy labels), then integrates predictions via crowdsourcing inference.

Result: Extensive experiments show SEPL enhances robustness of existing methods against noisy labels in domain generalization settings and demonstrates significant potential for real-world applications with high flexibility.

Conclusion: SEPL effectively addresses the combined challenge of domain generalization and label noise by leveraging latent feature diversity and ensemble learning, offering a practical solution for real-world scenarios.

Abstract: While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e., spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods work again when they encounter noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train the probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.
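
The probing-plus-ensemble structure can be sketched with forward hooks: tap intermediate features, attach linear probes, and average the heads' softmax outputs. The backbone, tap points, and plain averaging (standing in for the paper's crowdsourcing-style inference and semi-supervised probe training) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Tiny stand-in backbone; the paper would use a pre-trained model.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

feats = {}
def tap(name):
    # Forward hook that stores a layer's output under `name`.
    return lambda module, inputs, output: feats.__setitem__(name, output)
backbone[1].register_forward_hook(tap("shallow"))
backbone[3].register_forward_hook(tap("deep"))

# Probing classifiers on the intermediate features; SEPL would train these
# with semi-supervised objectives to cope with noisy labels.
probes = nn.ModuleDict({"shallow": nn.Linear(16, 10), "deep": nn.Linear(32, 10)})

def ensemble_predict(x):
    logits_final = backbone(x)                   # fills `feats` via the hooks
    preds = [logits_final.softmax(-1)]
    for name, probe in probes.items():
        pooled = feats[name].mean(dim=(2, 3))    # global-average-pool the tap
        preds.append(probe(pooled).softmax(-1))
    return torch.stack(preds).mean(0)            # average the ensemble heads

print(ensemble_predict(torch.randn(2, 3, 8, 8)).shape)  # (2, 10)
```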

[138] PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

Main category: cs.CV

TL;DR: PoseGAM is a geometry-aware multi-view framework that directly predicts 6D object pose from query and template images without explicit feature matching, achieving state-of-the-art performance on unseen objects.

DetailsMotivation: Existing 6D object pose estimation methods rely on explicit feature correspondences between query images and object models/templates, which can be challenging for unseen objects. The authors aim to develop a more direct approach that eliminates explicit matching while incorporating geometry information.

Method: PoseGAM uses a multi-view framework based on foundation model architectures, integrating object geometry through two mechanisms: explicit point-based geometry and learned features from geometry representation networks. The approach processes query images with multiple template images directly to predict pose.

Result: The method achieves state-of-the-art performance across multiple benchmarks, with average AR improvement of 5.1% over prior methods and up to 17.6% gains on individual datasets. A large-scale synthetic dataset with 190k+ objects under diverse conditions was created to enhance robustness.

Conclusion: PoseGAM demonstrates strong generalization to unseen objects by directly predicting pose without explicit matching and effectively integrating geometry information through complementary mechanisms, showing significant improvements over existing approaches.

Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/.

[139] SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Chenbin Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

Main category: cs.CV

TL;DR: SWiT-4D is a Sliding-Window Transformer that converts monocular videos into high-quality 4D meshes with strong temporal consistency, requiring minimal 4D supervision by leveraging existing image-to-3D priors.

DetailsMotivation: Current challenges in converting monocular videos to animated 3D assets with explicit 4D meshes due to limited 4D datasets, while image-to-3D generation has strong priors from extensive datasets that could be leveraged.

Method: SWiT-4D integrates with Diffusion Transformer-based image-to-3D generators using a sliding-window approach for spatial-temporal modeling across video frames, preserving original single-image forward process. Includes optimization-based trajectory module for global translation recovery in static-camera videos.

Result: Achieves high-fidelity geometry and stable temporal consistency with only single short video fine-tuning. Outperforms baselines in temporal smoothness on in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, in-the-wild videos).

Conclusion: SWiT-4D demonstrates strong data efficiency and practical deployability under limited 4D supervision, effectively bridging the gap between image-to-3D priors and video-to-4D generation.

Abstract: Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/

[140] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

Main category: cs.CV

TL;DR: MMSI-Video-Bench is a comprehensive human-annotated benchmark for evaluating video-based spatial intelligence in multimodal large language models, covering perception, planning, prediction, and cross-video reasoning across 1,106 questions from diverse video sources.

DetailsMotivation: Current MLLMs lack comprehensive evaluation for spatial understanding in continuous visual input, which is crucial for them to become effective assistants in physical environments. There's no holistic benchmark assessing progress toward this goal.

Method: Created a four-level framework (Perception, Planning, Prediction, Cross-Video Reasoning) with 1,106 questions grounded in 1,278 video clips from 25 datasets and in-house videos. Each item was carefully designed and reviewed by 3DV experts with explanatory rationales. Also includes three domain-oriented sub-benchmarks for targeted assessment.

Result: Evaluation of 25 strong MLLMs revealed a significant human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. Spatially fine-tuned models fail to generalize effectively. Systematic failures were found in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence.

Conclusion: The benchmark establishes a solid testbed for advancing video-based spatial intelligence in MLLMs, revealing current limitations and providing insights for future model development and evaluation strategies.

Abstract: Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

[141] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong

Main category: cs.CV

TL;DR: BabyVLM-V2 is a developmentally grounded vision-language framework that improves upon V1 with longitudinal infant-centric pretraining data and DevCV Toolbox for cognitive evaluation, enabling compact models to achieve competitive performance on infant-aligned multimodal tasks.

DetailsMotivation: Early children's developmental trajectories provide a natural goal for sample-efficient pretraining of vision foundation models, aiming to create more developmentally plausible AI systems that learn like infants.

Method: 1) Creates a longitudinal, multifaceted pretraining set from infant-centric audiovisual corpus (video-utterance, image-utterance, multi-turn conversational data); 2) Develops DevCV Toolbox adapting NIH Baby Toolbox vision measures into 10 multimodal tasks covering spatial reasoning, memory, and vocabulary; 3) Trains compact models from scratch.

Result: Compact models pretrained from scratch achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks, demonstrating sample-efficient learning aligned with early children’s capabilities.

Conclusion: The principled BabyVLM-V2 framework accelerates research in developmentally plausible pretraining of vision foundation models by providing unified infrastructure for infant-inspired learning and evaluation.

Abstract: Early children’s developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children’s capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

[142] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

Main category: cs.CV

TL;DR: MiSI-Bench: A benchmark for evaluating Vision-Language Models on Microscopic Spatial Intelligence tasks with 163K QA pairs and 587K images from molecular structures, showing current VLMs lag behind humans but fine-tuned models show promise in spatial transformations.

DetailsMotivation: To assess VLMs' capability in Microscopic Spatial Intelligence - perceiving and reasoning about spatial relationships of invisible microscopic entities, which is crucial for scientific discovery.

Method: Proposed MiSI-Bench benchmark framework with 9 complementary tasks covering elementary spatial transformations to complex relational identifications, using over 163,000 QA pairs and 587,000 images derived from ~4,000 molecular structures.

Result: Current state-of-the-art VLMs perform significantly below human level on the benchmark. However, a fine-tuned 7B model shows substantial potential, even surpassing humans in spatial transformation tasks, but performs poorly in scientifically-grounded tasks like hydrogen bond recognition.

Conclusion: Explicit domain knowledge integration is necessary for progress toward scientific AGI, as current VLMs lack scientific reasoning capabilities despite showing promise in basic spatial transformations.

Abstract: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.

[143] Any4D: Unified Feed-Forward Metric 4D Reconstruction

Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan

Main category: cs.CV

TL;DR: Any4D is a scalable multi-view transformer for dense 4D reconstruction that directly generates per-pixel motion and geometry predictions across multiple frames, supporting various input modalities and achieving significantly better performance than prior methods.

DetailsMotivation: Prior work on 4D reconstruction has limitations: they typically focus on either 2-view dense scene flow or sparse 3D point tracking, and most methods are limited to monocular RGB videos. There's a need for a more flexible framework that can handle multiple modalities and sensors while achieving better accuracy and efficiency.

Method: Any4D uses a scalable multi-view transformer architecture with a modular 4D scene representation. It encodes per-view predictions using egocentric factors (depthmaps and camera intrinsics in local camera coordinates) and allocentric factors (camera extrinsics and scene flow in global world coordinates). The system can process various input modalities including RGB videos, RGB-D frames, IMU-based egomotion, and Radar Doppler measurements.
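
A minimal sketch of how such egocentric factors (depth, intrinsics) and allocentric factors (extrinsics, scene flow) could compose into world-space 4D points, assuming simple pinhole geometry; the function and argument names are hypothetical.

```python
# Hedged sketch: composing per-view factors into world-space 4D points.
import torch

def compose_4d(depth, K, cam_to_world, scene_flow):
    """depth: (H, W); K: (3, 3); cam_to_world: (4, 4); scene_flow: (H, W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T                             # back-project
    pts_cam = rays * depth.unsqueeze(-1)                           # local 3D points
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)
    pts_world = (pts_h @ cam_to_world.T)[..., :3]                  # global coords
    return pts_world + scene_flow   # advect by per-pixel motion to the next frame

pts = compose_4d(torch.rand(4, 4) + 0.5, torch.eye(3),
                 torch.eye(4), torch.zeros(4, 4, 3))
```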

Result: The method achieves superior performance with 2-3X lower error compared to prior methods and 15X faster computation. It demonstrates strong performance across diverse setups and opens avenues for multiple downstream applications.

Conclusion: Any4D provides a flexible, efficient, and accurate framework for dense 4D reconstruction that can leverage multiple sensor modalities, representing a significant advancement over existing approaches in both performance and versatility.

Abstract: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.

[144] MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

Main category: cs.CV

TL;DR: MoCapAnything is a category-agnostic motion capture system that generates BVH animations for arbitrary 3D assets from monocular video prompts, using reference-guided factorization and constraint-aware inverse kinematics.

DetailsMotivation: Existing motion capture pipelines are limited to specific species or templates, creating a gap for generalizable systems. The authors formalize this as Category-Agnostic Motion Capture (CAMoCap) to enable scalable, prompt-driven 3D motion capture for any rigged asset.

Method: A factorized framework with three learnable modules: (1) Reference Prompt Encoder extracts joint queries from asset skeleton, mesh, and rendered images; (2) Video Feature Extractor computes visual descriptors and reconstructs coarse 4D deforming mesh; (3) Unified Motion Decoder fuses cues for temporally coherent trajectories, followed by constraint-aware inverse kinematics to recover asset-specific rotations.
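
The lightweight IK stage can be sketched as gradient-based fitting of per-joint rotations so that forward kinematics matches the predicted trajectories; the FK interface, optimizer settings, and regularizer below are assumptions, not the paper's implementation.

```python
# Hedged sketch of a constraint-aware IK stage via gradient descent.
import torch

def fit_rotations(fk, target_joints, n_joints: int, steps: int = 200):
    """fk(rotvecs) -> (T, J, 3) joint positions; target_joints: (T, J, 3)."""
    T = target_joints.shape[0]
    rotvecs = torch.zeros(T, n_joints, 3, requires_grad=True)   # axis-angle
    opt = torch.optim.Adam([rotvecs], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = (fk(rotvecs) - target_joints).square().mean()    # position term
        loss = loss + 1e-3 * rotvecs.square().mean()            # rotation prior
        loss.backward()
        opt.step()
    return rotvecs.detach()   # per-joint rotations, e.g. for BVH export
```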

Result: The system delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs. Experiments on in-domain benchmarks and in-the-wild videos demonstrate effectiveness, supported by the curated Truebones Zoo dataset with 1038 motion clips.

Conclusion: MoCapAnything enables scalable, prompt-driven 3D motion capture for arbitrary assets, bridging the gap between video input and asset-specific animation output through a novel reference-guided, factorized framework.

Abstract: Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset’s skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

[145] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna

Main category: cs.CV

TL;DR: OmniView is a unified diffusion framework for 4D consistency tasks that separates space, time, and view conditions, enabling flexible combinations for novel view synthesis, video generation, and camera control across diverse inputs.

DetailsMotivation: Prior camera control methods in diffusion models are fragmented, focusing on specific 4D tasks (novel view synthesis, text-to-video, etc.) and trained on disjoint data slices, lacking a unified approach.

Method: Separately represents space, time, and view conditions to enable flexible combinations of these inputs, allowing the model to handle static, dynamic, and multiview inputs for various 4D consistency tasks.
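
A hedged sketch of what factorized space/time/view conditioning might look like: each factor gets its own embedding, any subset can be supplied, and the sum conditions the diffusion tokens. Module names and sizes are assumptions.

```python
# Illustrative sketch of separate space/time/view condition embeddings.
import torch
import torch.nn as nn

class FactorizedCondition(nn.Module):
    def __init__(self, dim: int = 512, max_views: int = 16, max_time: int = 64):
        super().__init__()
        self.view_emb = nn.Embedding(max_views, dim)   # which camera
        self.time_emb = nn.Embedding(max_time, dim)    # when in the clip
        self.cam_proj = nn.Linear(12, dim)             # 3x4 camera pose, flattened

    def forward(self, view_id, t, pose):
        # Any factor could be dropped (e.g., static scene -> no time embedding),
        # which is what lets one model cover NVS, video, and camera control.
        return self.view_emb(view_id) + self.time_emb(t) + self.cam_proj(pose)

cond = FactorizedCondition()(torch.tensor([0]), torch.tensor([3]),
                             torch.randn(1, 12))
```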

Result: Competitive with task-specific models across benchmarks, improving image quality scores by up to 33% in multiview NVS, 60% in dynamic NVS, 20% in static camera control, and reducing camera trajectory errors by 4x in text-to-video.

Conclusion: OmniView demonstrates strong generalizability as a unified 4D video model, showing feasibility of a generalist approach to 4D consistency tasks that outperforms fragmented specialized methods.

Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/

[146] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland

Main category: cs.CV

TL;DR: PubTables-v2 is a new large-scale dataset for table extraction tasks, enabling evaluation of vision-language models and development of Page-Object Table Transformer (POTATR) for comprehensive page-level table extraction.

DetailsMotivation: Progress in table extraction has been hindered by lack of annotated data, especially for multi-page table structure recognition. Current vision-language models need proper benchmarks to demonstrate their capabilities in full document context.

Method: Created PubTables-v2 dataset supporting multiple challenging table extraction tasks. Used this dataset to evaluate domain-specialized VLMs and developed POTATR - an image-to-graph extension of Table Transformer for page-level table extraction.

Result: PubTables-v2 is the first large-scale benchmark for multi-page table structure recognition. The dataset enables proper evaluation of VLMs on table extraction tasks and facilitates development of POTATR model for comprehensive table extraction.

Conclusion: PubTables-v2 addresses the data scarcity problem in table extraction research, provides a benchmark for evaluating vision-language models, and enables development of advanced models like POTATR for comprehensive page-level table understanding.

Abstract: Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.

[147] Mull-Tokens: Modality-Agnostic Latent Thinking

Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu

Main category: cs.CV

TL;DR: Mull-Tokens introduces modality-agnostic latent tokens that can hold intermediate information in either image or text modalities, enabling free-form multimodal reasoning without relying on specialist tools or costly image generation.

DetailsMotivation: Real-world reasoning requires thinking about space, time, affordances, and other concepts that words alone cannot convey. Existing multimodal models are brittle, don't scale well, and rely on expensive approaches like calling specialist tools, generating images, or using handcrafted reasoning data to switch between text and image thoughts.

Method: Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities. The approach involves: 1) Training Mull-Tokens using supervision from interleaved text-image traces, and 2) Fine-tuning without any supervision using only final answers. The method is inspired by latent reasoning frameworks.
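
A minimal sketch of the latent-token idea: a fixed number of learnable, modality-agnostic tokens appended between the multimodal prompt and the answer so the backbone can "think" in latent space. Token count and dimensions are assumptions, not the paper's configuration.

```python
# Hedged sketch of modality-agnostic latent "mull" tokens.
import torch
import torch.nn as nn

class MullTokenWrapper(nn.Module):
    def __init__(self, dim: int = 768, n_mull: int = 8):
        super().__init__()
        self.mull = nn.Parameter(torch.randn(n_mull, dim) * 0.02)

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (B, L, D) embeddings of interleaved image/text inputs.
        B = prompt_emb.shape[0]
        thinking = self.mull.unsqueeze(0).expand(B, -1, -1)
        # The backbone attends over [prompt, mull tokens] before emitting the
        # answer; in the unsupervised stage only the final answer is supervised.
        return torch.cat([prompt_emb, thinking], dim=1)

seq = MullTokenWrapper()(torch.randn(2, 32, 768))   # (2, 40, 768)
```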

Result: Across four challenging spatial reasoning benchmarks (including puzzle solving and perspective-taking tasks), Mull-Tokens outperformed several baselines using text-only reasoning or interleaved image-text reasoning. It achieved a +3% average improvement and up to +16% improvement on a puzzle solving reasoning-heavy split compared to the strongest baseline.

Conclusion: Mull-Tokens offers a simple solution for abstract thinking in multiple modalities, addressing challenges in grounding textual and visual reasoning. The approach enables models to think free-form toward correct answers without relying on brittle, expensive methods.

Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative – Mull-Tokens – modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

[148] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu

Main category: cs.CV

TL;DR: DuetSVG is a unified multimodal model that jointly generates image tokens and SVG tokens end-to-end, using visual guidance to improve SVG quality.

DetailsMotivation: Existing VLM-based SVG generation methods struggle with complex semantics and produce visually unappealing or geometrically incoherent SVGs because they generate only text and lack visual signals during decoding.

Method: DuetSVG is trained on both image and SVG datasets to jointly generate image tokens and corresponding SVG tokens end-to-end. At inference, it uses a novel test-time scaling strategy that leverages the model’s native visual predictions as guidance to improve SVG decoding quality.
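
The test-time scaling strategy can be sketched as a generate-then-check loop in which the model's own image prediction scores candidate SVG decodings. `sample_svg` and `render_svg` are hypothetical callables, and cosine similarity in pixel space is a stand-in for whatever guidance signal the paper actually uses.

```python
# Hedged sketch of visually guided SVG decoding at test time.
import torch
import torch.nn.functional as F

def guided_svg_decode(sample_svg, render_svg, image_pred, k: int = 4):
    """sample_svg() -> SVG string; render_svg(svg) -> tensor like image_pred."""
    best_svg, best_score = None, float("-inf")
    for _ in range(k):
        svg = sample_svg()                     # one candidate SVG decoding
        raster = render_svg(svg)               # rasterize candidate to pixels
        score = F.cosine_similarity(raster.flatten(),
                                    image_pred.flatten(), dim=0)
        if score > best_score:                 # keep the candidate that best
            best_svg, best_score = svg, score  # matches the visual prediction
    return best_svg
```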

Result: Extensive experiments show DuetSVG outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

Conclusion: Joint multimodal generation with visual guidance significantly improves SVG quality compared to text-only approaches, enabling better handling of complex semantics and producing more visually appealing results.

Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model’s native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

[149] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov

Main category: cs.CV

TL;DR: AlcheMinT introduces explicit timestamp conditioning for subject-driven video generation, enabling precise temporal control over when subjects appear/disappear in videos.

DetailsMotivation: Existing subject-driven video generation methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications like compositional video synthesis, storyboarding, and controllable animation.

Method: Proposes a unified framework with novel positional encoding for temporal intervals (subject identities), integrates subject-descriptive text tokens for better identity binding, and uses token-wise concatenation without additional cross-attention modules.
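
One plausible reading of interval-based timestamp conditioning is a per-subject presence mask over frames that gates attention between reference tokens and video tokens; the masking scheme below is an illustrative assumption, not the paper's encoding.

```python
# Hedged sketch: per-subject temporal interval masks for attention gating.
import torch

def interval_attention_mask(num_frames: int, intervals: list[tuple[int, int]]):
    """intervals[i] = (start, end) for subject i; returns a (S, T) bool mask."""
    mask = torch.zeros(len(intervals), num_frames, dtype=torch.bool)
    for i, (s, e) in enumerate(intervals):
        mask[i, s:e] = True   # subject i is present on these frames
    return mask

# Subject 0 appears in the first half of the clip, subject 1 in the last third.
m = interval_attention_mask(24, [(0, 12), (16, 24)])
# Such a mask could gate the token-wise concatenated reference tokens inside
# the DiT's existing attention, avoiding any extra cross-attention module.
```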

Result: AlcheMinT achieves visual quality matching state-of-the-art video personalization methods while enabling precise temporal control over multi-subject generation within videos for the first time.

Conclusion: The framework successfully addresses the temporal control limitation in subject-driven video generation, opening new possibilities for applications requiring precise timing of subject appearances in generated videos.

Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model's positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT

[150] FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li

Main category: cs.CV

TL;DR: FoundationMotion is an automated pipeline for creating large-scale motion datasets using object tracking and LLMs, enabling models to achieve state-of-the-art motion understanding performance.

DetailsMotivation: Current models struggle with motion understanding due to scarcity of large-scale, fine-grained motion datasets, which are expensive to create through manual annotation.

Method: Automated pipeline that: 1) detects and tracks objects in videos to extract trajectories, 2) uses trajectories and video frames with LLMs to generate fine-grained captions and diverse QA pairs about motion and spatial reasoning.
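
The two-stage pipeline can be sketched as plain glue code, with the tracker and the LLM passed in as callables; the prompt wording, frame subsampling, and output fields are hypothetical.

```python
# Hedged sketch of the auto-labeling flow: track objects, then prompt an LLM.
def auto_label_motion(video_frames, track_objects, llm):
    tracks = track_objects(video_frames)   # {obj_id: [(t, x, y, w, h), ...]}
    prompt = (
        "Given these object trajectories and sampled frames, write a "
        "fine-grained motion caption and 3 QA pairs about motion and "
        f"spatial relations:\n{tracks}"
    )
    annotation = llm(prompt, images=video_frames[::8])   # subsample frames
    return {"tracks": tracks, "annotation": annotation}
```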

Result: Fine-tuned models (NVILA-Video-15B, Qwen2.5-7B) achieve substantial improvements in motion understanding, outperforming strong baselines like Gemini-2.5 Flash and Qwen2.5-VL-72B across diverse motion benchmarks.

Conclusion: FoundationMotion provides a scalable solution for curating fine-grained motion datasets that effectively enhance motion understanding and spatial reasoning capabilities in diverse models.

Abstract: Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.

[151] GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh

Main category: cs.CV

TL;DR: Real-time talking head generation using Gaussian Splatting mapped to 3D Morphable Models with transformer-based audio-to-parameter prediction for temporal stability.

DetailsMotivation: Current speech-driven talking head methods face limitations: diffusion methods struggle with oneshot settings, while Gaussian Splatting approaches suffer from facial tracking inaccuracies and inconsistent mappings leading to unstable outputs and video artifacts that hinder realistic applications.

Method: The method maps Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. It introduces transformer-based prediction of model parameters directly from audio to ensure temporal consistency. The system works from monocular video and independent audio speech inputs.
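
A minimal sketch of transformer-based audio-to-parameter prediction: a temporal encoder maps a window of audio features to per-frame 3DMM coefficients, with full-window attention supplying the temporal consistency. Feature and parameter sizes are assumptions.

```python
# Hedged sketch of predicting 3DMM parameters from audio with a transformer.
import torch
import torch.nn as nn

class AudioTo3DMM(nn.Module):
    def __init__(self, audio_dim: int = 80, dim: int = 256, n_params: int = 64):
        super().__init__()
        self.inp = nn.Linear(audio_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(dim, n_params)   # expression/pose coefficients

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, audio_dim), e.g. mel frames; attention over the
        # whole window is what would drive temporal stability of parameters.
        return self.out(self.temporal(self.inp(audio_feats)))

params = AudioTo3DMM()(torch.randn(1, 100, 80))   # (1, 100, 64) per-frame 3DMM
```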

Result: The method enables generation of real-time talking head videos with competitive quantitative and qualitative performance compared to existing approaches.

Conclusion: The proposed approach addresses the temporal instability and artifact issues in current talking head methods by combining Gaussian Splatting with 3D Morphable Models and transformer-based audio parameter prediction, enabling realistic real-time avatar generation.

Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle in one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting to 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos, for which we report competitive quantitative and qualitative performance.

[152] SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang

Main category: cs.CV

TL;DR: SceneMaker: A decoupled 3D scene generation framework that separates de-occlusion from object generation and uses unified pose estimation with global-local attention mechanisms to handle severe occlusion in open-set scenes.

DetailsMotivation: Existing methods struggle with producing high-quality geometry and accurate poses under severe occlusion and open-set settings due to insufficient de-occlusion and pose estimation priors.

Method: 1) Decouple de-occlusion model from 3D object generation, enhanced with image datasets and collected de-occlusion data; 2) Unified pose estimation model integrating global and local mechanisms for both self-attention and cross-attention; 3) Construct open-set 3D scene dataset for generalization.

Result: Comprehensive experiments demonstrate superiority on both indoor and open-set scenes. Codes and datasets released publicly.

Conclusion: The decoupled framework effectively addresses occlusion challenges in 3D scene generation through specialized de-occlusion handling and improved pose estimation with global-local attention mechanisms.

Abstract: We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.

[153] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

Main category: cs.CV

TL;DR: VL-JEPA is a vision-language model using Joint Embedding Predictive Architecture that predicts continuous text embeddings instead of generating tokens, achieving better performance with fewer parameters and supporting selective decoding.

DetailsMotivation: To overcome limitations of traditional autoregressive VLMs that generate tokens, which can be inefficient and focus on surface-level linguistic details rather than task-relevant semantics.

Method: Uses Joint Embedding Predictive Architecture (JEPA) to predict continuous embeddings of target texts in an abstract representation space, with a lightweight text decoder invoked only when needed for text generation.
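
Selective decoding can be sketched as skipping the text decoder whenever a newly predicted embedding is close to the last decoded one; the cosine criterion and threshold below are assumptions, not the paper's rule.

```python
# Hedged sketch of selective decoding over a stream of predicted embeddings.
import torch
import torch.nn.functional as F

def selective_decode(pred_embeddings, decoder, thresh: float = 0.9):
    outputs, last = [], None
    for z in pred_embeddings:   # z: (D,) predicted text embedding
        if last is not None and F.cosine_similarity(z, last, dim=0) > thresh:
            outputs.append(outputs[-1])   # reuse previous text, skip decoding
        else:
            outputs.append(decoder(z))    # invoke the lightweight text decoder
            last = z
    return outputs

texts = selective_decode(torch.randn(10, 512), decoder=lambda z: "caption")
```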

Result: Achieves stronger performance with 50% fewer trainable parameters than standard token-space VLMs, reduces decoding operations by 2.85x with selective decoding, and outperforms CLIP, SigLIP2, and Perception Encoder on video tasks while matching classical VLMs on VQA datasets with only 1.6B parameters.

Conclusion: VL-JEPA demonstrates that predicting continuous embeddings in abstract representation space is more efficient and effective than token generation, enabling better performance with fewer parameters while supporting diverse vision-language tasks without architectural changes.

Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA-predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA’s embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves comparable performance to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.

[154] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

Main category: cs.CV

TL;DR: MeViS is a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting/tracking objects based on motion descriptions. It addresses limitations of existing datasets that underemphasize motion and introduces new benchmarks showing current methods’ weaknesses.

DetailsMotivation: Existing referring video segmentation datasets focus on static attributes and salient objects, allowing target identification from single frames. They underemphasize motion in both videos and language expressions, limiting exploration of motion reasoning for pixel-level video understanding.

Method: Introduces MeViS dataset with 33,072 human-annotated motion expressions (text+audio) covering 8,171 objects in 2,006 complex videos. Benchmarks 15 existing methods across 4 tasks: RVOS, AVOS, RMOT, and new RMEG task. Proposes LMPM++ approach for RVOS/AVOS/RMOT.

Result: Benchmarking reveals weaknesses and limitations of existing methods in motion expression-guided video understanding. The proposed LMPM++ approach achieves new state-of-the-art results on RVOS/AVOS/RMOT tasks.

Conclusion: MeViS provides a platform for developing motion expression-guided video understanding algorithms in complex scenes. The dataset and code are publicly available, addressing the gap in motion-focused video understanding research.

Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects’ motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method’s source code are publicly available at https://henghuiding.com/MeViS/

[155] Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

Main category: cs.CV

TL;DR: Flex is a geometry-agnostic scene encoder for autonomous driving that uses learnable scene tokens to jointly encode multi-camera data across time, achieving 2.2x faster inference and better performance than state-of-the-art methods without relying on 3D priors.

DetailsMotivation: To address the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving systems, which typically rely on explicit 3D inductive biases like BEV, occupancy, or tri-plane representations.

Method: Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. It’s geometry-agnostic, learning compact scene representations directly from data without explicit 3D inductive biases, and aggressively compresses visual input for downstream LLM-based policy models.
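
A minimal sketch of the scene-token idea: a small set of learnable queries cross-attends to all image tokens from all cameras and timesteps, with no BEV, occupancy, or tri-plane structure imposed. Token counts and dimensions are assumptions.

```python
# Hedged sketch of compressing multi-camera tokens into learnable scene tokens.
import torch
import torch.nn as nn

class SceneTokenEncoder(nn.Module):
    def __init__(self, dim: int = 512, n_scene: int = 32, heads: int = 8):
        super().__init__()
        self.scene = nn.Parameter(torch.randn(n_scene, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, C*T*N, D) tokens from all cameras and timesteps,
        # jointly encoded with no explicit 3D inductive bias.
        B = image_tokens.shape[0]
        q = self.scene.unsqueeze(0).expand(B, -1, -1)
        scene_tokens, _ = self.attn(q, image_tokens, image_tokens)
        return scene_tokens   # (B, 32, D): compact input for an LLM policy

z = SceneTokenEncoder()(torch.randn(2, 6 * 4 * 196, 512))  # 6 cams, 4 steps
```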

Result: On a 20,000-hour proprietary dataset, Flex achieves 2.2x greater inference throughput while significantly improving driving performance compared to state-of-the-art methods. The compact scene tokens also develop emergent capability for scene decomposition without explicit supervision.

Conclusion: The findings challenge the prevailing assumption that 3D priors are necessary for autonomous driving, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient, and effective path for future autonomous driving systems.

Abstract: We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.

[156] ClusIR: Towards Cluster-Guided All-in-One Image Restoration

Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing, Lefei Zhang, Dacheng Tao

Main category: cs.CV

TL;DR: ClusIR is a cluster-guided image restoration framework that uses learnable clustering to model degradation semantics and adaptively restore images across diverse degradations through spatial and frequency domain modulation.

DetailsMotivation: Existing All-in-One Image Restoration methods fail to explicitly model degradation types and struggle to adapt restoration behavior to complex or mixed degradations, limiting their effectiveness across diverse degradation scenarios.

Method: ClusIR uses two key components: 1) Probabilistic Cluster-Guided Routing Mechanism (PCGRM) that disentangles degradation recognition from expert activation for discriminative perception and stable routing, and 2) Degradation-Aware Frequency Modulation Module (DAFMM) that leverages cluster-guided priors for adaptive frequency decomposition and targeted modulation.
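
One way to picture cluster-guided routing is a soft assignment of a degradation descriptor to learnable centroids that then gates a bank of experts; this sketch is an assumption-level reading of PCGRM, not the released code.

```python
# Hedged sketch of probabilistic cluster-guided expert routing.
import torch
import torch.nn as nn

class ClusterGuidedRouting(nn.Module):
    def __init__(self, dim: int = 256, n_clusters: int = 8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
        self.experts = nn.ModuleList(nn.Linear(dim, dim)
                                     for _ in range(n_clusters))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, D) pooled degradation descriptor of the input image.
        logits = -torch.cdist(feat, self.centroids)   # near centroid -> high
        probs = logits.softmax(dim=-1)                # soft cluster assignment
        return sum(p.unsqueeze(-1) * e(feat)          # probability-weighted
                   for p, e in zip(probs.unbind(-1), self.experts))

y = ClusterGuidedRouting()(torch.randn(4, 256))
```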

Result: Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios, achieving remarkable restoration results across a wide range of degradations.

Conclusion: The cluster-guided synergy effectively bridges semantic cues with frequency-domain modulation, enabling adaptive restoration across diverse degradations and outperforming existing methods in complex scenarios.

Abstract: All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.

[157] E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang

Main category: cs.CV

TL;DR: E-RayZer is a self-supervised 3D vision model that learns truly 3D-aware representations directly from unlabeled multi-view images through explicit 3D reconstruction, outperforming previous methods on pose estimation and 3D downstream tasks.

DetailsMotivation: Self-supervised pre-training has transformed foundation models for languages, 2D images, and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. Prior methods like RayZer infer 3D indirectly through latent-space view synthesis rather than operating directly in 3D space.

Method: E-RayZer performs self-supervised 3D reconstruction with explicit geometry directly in 3D space, eliminating shortcut solutions. It introduces a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an unsupervised manner for convergence and scalability.

Result: E-RayZer significantly outperforms RayZer on pose estimation, matches or sometimes surpasses fully supervised reconstruction models like VGGT. Its learned representations outperform leading visual pre-training models (DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks.

Conclusion: E-RayZer establishes a new paradigm for 3D-aware visual pre-training by learning geometrically grounded representations through explicit 3D reconstruction, demonstrating superior performance on 3D understanding tasks compared to existing methods.

Abstract: Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

[158] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration

Sicheng Mo, Thao Nguyen, Richard Zhang, Nick Kolkin, Siddharth Srinivasan Iyer, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li

Main category: cs.CV

TL;DR: Group Diffusion enables collaborative image generation by sharing attention across multiple images during inference, achieving significant FID improvements through cross-sample attention.

DetailsMotivation: Previous diffusion models generate images independently at inference, missing potential benefits from collaborative generation. The authors explore whether samples can be generated collaboratively by sharing information across images during the denoising process.

Method: Proposes Group Diffusion which unlocks attention mechanisms to be shared across images rather than just within patches of a single image. This enables joint denoising of multiple images at inference time, learning both intra-image and inter-image correspondence. Built on standard diffusion transformers.
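
The core mechanism reduces to a reshape: tokens from all images in a group are flattened into a single sequence before self-attention, so patches can attend across samples. Group size and dimensions below are illustrative.

```python
# Hedged sketch of cross-sample (group) attention via a batch reshape.
import torch
import torch.nn as nn

def group_attention(attn: nn.MultiheadAttention, x: torch.Tensor, group: int):
    """x: (B, N, D) per-image patch tokens; B must be divisible by `group`."""
    B, N, D = x.shape
    g = x.reshape(B // group, group * N, D)   # one sequence per image group
    y, _ = attn(g, g, g)                      # patches attend across samples
    return y.reshape(B, N, D)

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
out = group_attention(attn, torch.randn(8, 64, 256), group=4)
```

With `group=1` this reduces to ordinary independent denoising, which matches the paper's observation that larger group sizes strengthen cross-sample attention.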

Result: Shows clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Achieves up to 32.2% FID improvement on ImageNet-256x256. Introduces a qualitative measure that correlates closely with FID.

Conclusion: Cross-sample inference is an effective, previously unexplored mechanism for generative modeling. Group Diffusion reveals the benefits of collaborative generation through shared attention across images during inference.

Abstract: In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.

[159] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov

Main category: cs.CV

TL;DR: Omni-Attribute: A new open-vocabulary image attribute encoder that learns disentangled, attribute-specific representations for better visual concept personalization without information leakage.

DetailsMotivation: Existing visual concept personalization methods use holistic embeddings from general-purpose image encoders that entangle multiple visual factors, making it difficult to isolate single attributes and leading to information leakage and incoherent synthesis.

Method: Joint data-model design: (1) curating semantically linked image pairs annotated with positive/negative attributes to explicitly teach what to preserve or suppress, and (2) adopting a dual-objective training paradigm balancing generative fidelity with contrastive disentanglement.
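
The dual-objective paradigm can be sketched as a weighted sum of a generative fidelity term and an InfoNCE-style contrastive term over positive/negative attribute pairs; the loss form, weight, and temperature are assumptions, not the paper's values.

```python
# Hedged sketch of a dual-objective training step (fidelity + disentanglement).
import torch
import torch.nn.functional as F

def dual_objective(pred_img, target_img, anchor, positive, negatives,
                   w: float = 0.5, tau: float = 0.07):
    gen_loss = F.mse_loss(pred_img, target_img)    # generative fidelity
    a = F.normalize(anchor, dim=-1)                # (D,) attribute embedding
    pos = F.normalize(positive, dim=-1)            # shares the attribute
    neg = F.normalize(negatives, dim=-1)           # (K, D) differ in it
    logits = torch.cat([(a * pos).sum(-1, keepdim=True), neg @ a]) / tau
    nce_loss = F.cross_entropy(logits.unsqueeze(0),
                               torch.zeros(1, dtype=torch.long))
    return gen_loss + w * nce_loss

loss = dual_objective(torch.randn(3, 64, 64), torch.randn(3, 64, 64),
                      torch.randn(128), torch.randn(128), torch.randn(8, 128))
```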

Result: The resulting embeddings are effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

Conclusion: Omni-Attribute successfully addresses the limitation of entangled representations in existing methods by providing high-fidelity, attribute-specific embeddings for better visual concept personalization.

Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

[160] Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng

Main category: cs.CV

TL;DR: StereoWalker improves robot navigation foundation models by adding stereo vision and mid-level vision modules (depth estimation, tracking), achieving better performance with far less training data.

DetailsMotivation: Current navigation foundation models rely solely on monocular vision and ignore mid-level vision modules, which is inefficient. Monocular vision suffers from depth-scale ambiguity and requires massive amounts of pixel-to-action supervision that's difficult to obtain, especially in dynamic unstructured environments.

Method: StereoWalker augments navigation foundation models with stereo inputs and explicit mid-level vision modules (depth estimation and dense pixel tracking). The approach also includes curating a large stereo navigation dataset with automatic action annotation from Internet stereo videos.

Result: StereoWalker achieves comparable performance to state-of-the-art using only 1.5% of training data, and surpasses state-of-the-art using full data. Stereo vision yields higher navigation performance than monocular input.

Conclusion: Relying solely on monocular vision and ignoring mid-level vision priors is inefficient. Incorporating stereo inputs and explicit mid-level vision modules significantly improves navigation foundation models’ efficiency and performance.

Abstract: The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve comparable performance to the state-of-the-art using only 1.5% of the training data, and to surpass the state-of-the-art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.

[161] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

Main category: cs.CV

TL;DR: WorldLens is a comprehensive benchmark for evaluating generative world models in embodied AI, assessing visual realism, geometric consistency, physical plausibility, and functional reliability across five dimensions, with a supporting dataset and evaluation agent.

DetailsMotivation: Current generative world models for driving environments produce convincing visuals but often fail physically or behaviorally, lacking unified evaluation methods to assess whether generated worlds preserve geometry, obey physics, or support reliable control.

Method: Introduces WorldLens benchmark with five evaluation aspects: Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference. Creates WorldLens-26K dataset with human-annotated videos (scores + rationales) and develops WorldLens-Agent evaluation model distilled from human annotations for scalable, explainable scoring.

Result: No existing world model excels universally across all dimensions; models with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. The benchmark reveals trade-offs between different aspects of world generation quality.

Conclusion: WorldLens provides a unified ecosystem for measuring world fidelity, standardizing evaluation of generative world models not just by visual realism but by behavioral realism, enabling more comprehensive assessment of how real generated worlds behave.

Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects – Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference – jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity – standardizing how future models are judged not only by how real they look, but by how real they behave.

[162] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler

Main category: cs.CV

TL;DR: StereoSpace is a diffusion-based framework for monocular-to-stereo synthesis that uses viewpoint conditioning instead of explicit depth or warping, achieving state-of-the-art results with strong geometric consistency and perceptual comfort.

Motivation: Current stereo generation methods often rely on explicit depth estimation and warping, which can be limited by depth estimation errors and struggle with complex scenes like non-Lambertian surfaces. The paper aims to develop a more robust, depth-free approach that can handle challenging scenes while maintaining geometric consistency.

Method: StereoSpace uses a diffusion-based framework with viewpoint conditioning to model geometry implicitly. It operates in a canonical rectified space and uses conditioning to guide the generator to infer correspondences and fill disocclusions end-to-end. The method avoids explicit depth estimation or warping operations.

Result: StereoSpace outperforms other methods from warp & inpaint, latent-warping, and warped-conditioning categories. It achieves sharp parallax and strong robustness on both layered and non-Lambertian scenes. The paper also introduces a new evaluation protocol that excludes ground truth geometry at test time and emphasizes downstream-relevant metrics (iSQoE for perceptual comfort and MEt3R for geometric consistency).

Conclusion: Viewpoint-conditioned diffusion provides a scalable, depth-free solution for stereo generation that can handle complex scenes while maintaining strong geometric consistency and perceptual quality, establishing a new paradigm for monocular-to-stereo synthesis.

Abstract: We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.

[163] Dual Cluster Contrastive learning for Object Re-Identification

Hantao Yao, Changsheng Xu

Main category: cs.CV

TL;DR: DCC introduces a dual cluster contrastive framework with individual and centroid memory banks for object ReID, reducing outlier impact through centroid-based updates and cross-view consistency.

Motivation: Existing cluster contrastive methods for object ReID use individual features to update cluster memory, which fluctuates with training examples and is sensitive to outlier samples. A more stable centroid-based updating mechanism is needed to reduce individual sample impact.

Method: Proposes Dual Cluster Contrastive (DCC) framework with two memory banks: individual cluster memory (updates with single samples) and centroid cluster memory (updates with cluster mean features). Uses vanilla contrastive loss for each memory plus cross-view consistency constraint to exchange benefits between both memories. Works for both supervised and unsupervised ReID.
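
As a rough illustration of the two updating mechanisms, the PyTorch sketch below momentum-updates a memory bank per sample (individual) versus per cluster mean (centroid), plus a plain cross-entropy form of the cluster contrastive loss. The momentum value, normalization, and loss form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def update_individual_memory(memory, feats, labels, momentum=0.2):
    """Individual-based update: momentum-update a cluster slot one sample at a time."""
    for f, y in zip(feats, labels):
        memory[y] = F.normalize((1 - momentum) * memory[y] + momentum * f, dim=0)
    return memory

def update_centroid_memory(memory, feats, labels, momentum=0.2):
    """Centroid-based update: momentum-update each cluster slot with the cluster-mean feature."""
    for y in labels.unique():
        centroid = feats[labels == y].mean(dim=0)
        memory[y] = F.normalize((1 - momentum) * memory[y] + momentum * centroid, dim=0)
    return memory

def cluster_contrastive_loss(feats, labels, memory, temperature=0.05):
    """Vanilla cluster contrastive loss of batch features against a memory bank."""
    logits = feats @ memory.t() / temperature  # (batch, num_clusters)
    return F.cross_entropy(logits, labels)
```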

Result: Extensive experiments on Market-1501, MSMT17, and VeRi-776 benchmarks demonstrate superiority of DCC for both supervised and unsupervised object ReID tasks.

Conclusion: DCC effectively addresses the instability of individual-based cluster memory updates by incorporating centroid-based updates and cross-view consistency, achieving state-of-the-art performance on multiple object ReID benchmarks.

Abstract: Recently, cluster contrastive learning has been proven effective for object ReID by computing the contrastive loss between the individual features and the cluster memory. However, in existing methods that use the individual features to momentum-update the cluster memory, the memory fluctuates over the training examples, especially for the outlier samples. Unlike the individual-based updating mechanism, the centroid-based updating mechanism that applies the mean feature of each cluster to update the cluster memory can reduce the impact of individual samples. Therefore, we formulate the individual-based updating and centroid-based updating mechanisms in a unified cluster contrastive framework, named Dual Cluster Contrastive framework (DCC), which maintains two types of memory banks: individual and centroid cluster memory banks. Significantly, the individual cluster memory considers just one individual at a time to take a single step for updating. The centroid cluster memory applies the mean feature of each cluster to update the corresponding cluster memory. During optimization, besides the vanilla contrastive loss of each memory, a cross-view consistency constraint is applied to exchange the benefits of the two memories for generating a discriminative description for object ReID. Note that DCC can be easily applied for unsupervised or supervised object ReID by using ground-truth labels or the generated pseudo-labels. Extensive experiments on three benchmarks, e.g., Market-1501, MSMT17, and VeRi-776, under supervised object ReID and unsupervised object ReID demonstrate the superiority of the proposed DCC.

[164] Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints

Muxin Zhang, Qiao Feng, Zhuo Su, Chao Wen, Zhou Xue, Kun Li

Main category: cs.CV

TL;DR: Joint2Human generates detailed 3D human geometry using 2D diffusion models with Fourier occupancy field representation, ensuring both global structure and local details while being computationally efficient.

Motivation: Current 3D human generation methods have limitations: direct 2D-to-3D approaches lose local details, while image reconstruction methods struggle with global view consistency. There's a need for a method that preserves both global structure and local details efficiently.

Method: Uses Fourier occupancy field (FOF) representation to enable direct 3D shape generation with 2D models. Incorporates high-frequency enhancer and multi-view recarving to integrate details from different views. Introduces compact spherical embedding of 3D joints for pose guidance and supports text-guided generation.

Result: Demonstrates capability to ensure global structure, local details, high resolution, and low computational cost simultaneously. Can generate 3D humans guided by both pose (via joint embeddings) and textual inputs.

Conclusion: Joint2Human presents an effective approach for detailed 3D human generation that bridges the gap between 2D diffusion models and 3D geometry, achieving both structural consistency and fine details with computational efficiency.

Abstract: 3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in losing local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details. To achieve this, we employ the Fourier occupancy field (FOF) representation, enabling the direct generation of 3D shapes as preliminary results with 2D generative models. With the proposed high-frequency enhancer and the multi-view recarving strategy, our method can seamlessly integrate the details from different views into a uniform global shape. To better utilize the 3D human prior and enhance control over the generated geometry, we introduce a compact spherical embedding of 3D joints. This allows for an effective guidance of pose during the generation process. Additionally, our method can generate 3D humans guided by textual inputs. Our experimental results demonstrate the capability of our method to ensure global structure, local details, high resolution, and low computational cost simultaneously. More results and the code can be found on our project page at http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.

[165] l0-Regularized Sparse Coding-based Interpretable Network for Multi-Modal Image Fusion

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

Main category: cs.CV

TL;DR: FNet is an interpretable multi-modal image fusion network using ℓ₀-regularized convolutional sparse coding to separate unique/common features from different modality images, with IFNet for inverse fusion during training.

Motivation: Multi-modal image fusion (MMIF) combines features from different sensor modalities to enhance information content, but existing methods lack interpretability and principled feature separation.

Method: Design FNet based on ℓ₀-regularized multi-modal convolutional sparse coding (MCSC) model with learnable ℓ₀-regularized sparse coding (LZSC) block via deep unfolding. Separate unique/common features from source images and combine them. Also propose IFNet for inverse fusion during training.
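
As a hedged illustration of the deep-unfolding pattern behind the LZSC block, the sketch below unrolls iterative hard thresholding for a linear (not convolutional) dictionary. The paper's actual block is convolutional, multi-modal, and derived in a principled manner, so treat this only as the general recipe; all sizes and initial values are assumptions.

```python
import torch
import torch.nn as nn

class UnrolledIHT(nn.Module):
    """K unrolled iterative-hard-thresholding steps for min ||x - Dz||^2 with an l0-sparse code z."""
    def __init__(self, dim, code_dim, num_iters=5):
        super().__init__()
        self.D = nn.Linear(code_dim, dim, bias=False)    # learnable dictionary (synthesis)
        self.Dt = nn.Linear(dim, code_dim, bias=False)   # learnable adjoint (analysis)
        self.step = nn.Parameter(torch.tensor(0.5))      # learnable step size
        self.thresh = nn.Parameter(torch.tensor(0.1))    # learnable hard threshold
        self.num_iters = num_iters

    def forward(self, x):
        z = x.new_zeros(x.shape[0], self.Dt.out_features)
        for _ in range(self.num_iters):
            z = z + self.step * self.Dt(x - self.D(z))   # gradient step on the data term
            z = z * (z.abs() > self.thresh)              # hard threshold ~= l0 proximal step
        return z
```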

Result: FNet achieves high-quality fusion results across eight MMIF datasets, enhances downstream object detection and semantic segmentation in visible-thermal image pairs, and demonstrates good interpretability through visualized intermediate results.

Conclusion: FNet provides an interpretable, principled approach to multi-modal image fusion that effectively separates and combines unique/common features, improving both fusion quality and downstream task performance.

Abstract: Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique as well as common features obtained from different modality sensor images, improving visualization, object detection, and many more tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an $\ell_0$-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, for solving the $\ell_0$-regularized CSC problem, we design a learnable $\ell_0$-regularized sparse coding (LZSC) block in a principled manner through deep unfolding. Given different modality source images, FNet first separates the unique and common features from them using the LZSC block, and then these features are combined to generate the final fused image. Additionally, we propose an $\ell_0$-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is utilized during FNet’s training. Extensive experiments show that FNet achieves high-quality fusion results across eight different MMIF datasets. Furthermore, we show that FNet enhances downstream object detection and semantic segmentation in visible-thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrate the good interpretability of our network. Link for code and models: https://github.com/gargi884/FNet-MMIF.

[166] Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation

Gayatri Deshmukh, Somsubhra De, Chirag Sehgal, Jishu Sen Gupta, Sparsh Mittal

Main category: cs.CV

TL;DR: FLORA dataset provides 4,330 fashion outfit-text pairs with professional terminology, while NeRA adapter uses KAN-based nonlinear transformations for superior fashion image generation from text.

Motivation: The fashion industry needs specialized datasets with professional terminology to advance AI-driven fashion design. Current datasets lack the nuanced language and stylistic elements used by professional designers.

Method: Created FLORA dataset with 4,330 curated fashion outfit-text pairs using industry-specific terminology. Developed NeRA adapter based on Kolmogorov-Arnold Networks (KAN) with learnable spline-based nonlinear transformations instead of traditional MLP adapters.
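
NeRA's key idea is replacing the MLP adapter's fixed nonlinearity with a learnable spline-based transformation. The sketch below is a minimal stand-in: a low-rank adapter with a learnable 1-D nonlinearity per rank channel, using a Gaussian RBF basis rather than true B-splines. The class name, basis choice, and hyperparameters are assumptions for illustration, not NeRA's actual parameterization.

```python
import torch
import torch.nn as nn

class SplineAdapter(nn.Module):
    """Low-rank adapter with a learnable per-channel nonlinearity between projections."""
    def __init__(self, dim, rank=8, num_basis=8, grid_range=2.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        # Fixed basis centers over [-grid_range, grid_range]; the coefficients are learned.
        self.register_buffer("centers", torch.linspace(-grid_range, grid_range, num_basis))
        self.coef = nn.Parameter(torch.zeros(rank, num_basis))
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as a no-op, like LoRA

    def forward(self, x):
        h = self.down(x)                                            # (..., rank)
        basis = torch.exp(-(h.unsqueeze(-1) - self.centers) ** 2)   # (..., rank, num_basis)
        h = (basis * self.coef).sum(-1)                             # learnable 1-D nonlinearity
        return x + self.up(h)                                       # residual adapter update
```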

Result: Fine-tuning on FLORA significantly improves generative models’ ability to create accurate, stylistically rich fashion images from text descriptions. NeRA outperforms existing adapters (LoRA, LoKR, DoRA, LoHA) with better fidelity, faster convergence, and semantic alignment.

Conclusion: FLORA dataset enables better AI fashion design comprehension and generation. NeRA’s nonlinear adapter architecture provides superior modeling of complex semantic relationships. Both contributions will be open-sourced to advance the field.

Abstract: Specialized datasets that capture the fashion industry’s rich language and styling elements can boost progress in AI-driven fashion design. We present FLORA (Fashion Language Outfit Representation for Apparel Generation), the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. Each description utilizes industry-specific terminology and jargon commonly used by professional fashion designers, providing precise and detailed insights into the outfits. Hence, the dataset captures the delicate features and subtle stylistic elements necessary to create high-fidelity fashion designs. We demonstrate that fine-tuning generative models on the FLORA dataset significantly enhances their capability to generate accurate and stylistically rich images from textual descriptions of fashion sketches. FLORA will catalyze the creation of advanced AI models capable of comprehending and producing subtle, stylistically rich fashion designs. It will also help fashion designers and end-users to bring their ideas to life. As a second orthogonal contribution, we introduce NeRA (Nonlinear low-rank Expressive Representation Adapter), a novel adapter architecture based on Kolmogorov-Arnold Networks (KAN). Unlike traditional PEFT techniques such as LoRA, LoKR, DoRA, and LoHA that use MLP adapters, NeRA uses learnable spline-based nonlinear transformations, enabling superior modeling of complex semantic relationships, achieving strong fidelity, faster convergence and semantic alignment. Extensive experiments on our proposed FLORA and LAION-5B datasets validate the superiority of NeRA over existing adapters. We will open-source both the FLORA dataset and our implementation code.

[167] Brain-like emergent properties in deep networks: impact of network architecture, datasets and training

Niranjan Rajesh, Georgin Jacob, SP Arun

Main category: cs.CV

TL;DR: Systematic evaluation of 30+ state-of-the-art deep networks reveals that network architecture has the strongest impact on brain-like properties, with no single network outperforming all others in brain alignment.

Motivation: Despite improvements in deep networks on standardized benchmarks, they still underperform humans on real-world vision tasks. The paper aims to identify which design principles (architecture, training data, or training regime) most impact brain-like emergent properties to close the gap between artificial and human vision.

Method: Systematically evaluated over 30 state-of-the-art networks with varying network architectures, training datasets, and training regimes. Assessed the presence or absence of brain-like properties that capture subtle emergent properties present in brains, beyond standard brain response prediction benchmarks.

Result: 1) Network architecture had the strongest impact on brain-like properties compared to dataset and training regime variations. 2) Networks varied widely in their alignment to the brain with no single network outperforming all others across all brain-like properties.

Conclusion: The study provides a principled and interpretable path toward closing the gap between artificial and human vision by identifying network architecture as the most critical factor for developing brain-like properties in deep networks.

Abstract: Despite the rapid pace at which deep networks are improving on standardized vision benchmarks, they are still outperformed by humans on real-world vision tasks. One solution to this problem is to make deep networks more brain-like. Although there are several benchmarks that compare the ability of deep networks to predict brain responses on natural images, they do not capture subtle but important emergent properties present in brains. It is also unclear which design principle – architecture, training data, or training regime – would have the greatest impact on these emergent properties. To investigate these issues, we systematically evaluated over 30 state-of-the-art networks with varying network architectures, training datasets, and training regimes for the presence or absence of brain-like properties. Our main findings are as follows. First, network architecture had the strongest impact on brain-like properties compared to dataset and training regime variations. Second, networks varied widely in their alignment to the brain with no single network outperforming all others. Taken together, our results offer a principled and interpretable path toward closing the gap between artificial and human vision.

[168] SpotLight: Shadow-Guided Object Relighting via Diffusion

Frédéric Fortier-Chouinard, Zitian Zhang, Louis-Etienne Messier, Mathieu Garon, Anand Bhattad, Jean-François Lalonde

Main category: cs.CV

TL;DR: SpotLight enables controllable lighting in diffusion-based neural rendering by injecting coarse shadow hints, achieving precise object shading without additional training.

Motivation: Diffusion models lack manual lighting control essential for improving/personalizing image outcomes, limiting their utility as neural rendering engines for object insertion.

Method: Training-free approach that injects desired shadow hints into pre-trained diffusion-based neural renderers, enabling accurate object shading according to specified light positions.

Result: Superior object compositing results (quantitatively/perceptually), outperforms existing diffusion-based relighting models, enables hand-scribbling shadows and full-image relighting.

Conclusion: SpotLight demonstrates that precise controllable lighting can be achieved in diffusion models without training, using only shadow hints, expanding neural rendering capabilities.

Abstract: Recent work has shown that diffusion models can serve as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. However, unlike typical physics-based renderers, these neural rendering engines are limited by the lack of manual control over the lighting, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise and controllable lighting can be achieved without any additional training, simply by supplying a coarse shadow hint for the object. Indeed, we show that injecting only the desired shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, is entirely training-free and leverages existing neural rendering approaches to achieve controllable relighting. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting. We also demonstrate other applications, such as hand-scribbling shadows and full-image relighting, demonstrating its versatility.

[169] Quantifying the Reliability of Predictions in Detection Transformers: Object-Level Calibration and Image-Level Uncertainty

Young-Jin Park, Carson Sobolewski, Navid Azizan

Main category: cs.CV

TL;DR: DETR models produce many redundant predictions with varying reliability. The paper analyzes DETR’s specialist strategy, introduces Object-level Calibration Error (OCE) metric, and proposes uncertainty quantification for practical deployment.

Motivation: DETR models generate hundreds of predictions far exceeding actual objects, raising trustworthiness concerns. Existing metrics fail to evaluate calibration quality and post-processing effectiveness for practical deployment.

Method: Empirical/theoretical analysis of DETR’s specialist strategy via Hungarian matching. Introduces Object-level Calibration Error (OCE) metric and post hoc uncertainty quantification framework for per-image accuracy prediction.
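
The specialist behavior follows from Hungarian matching, which assigns each ground-truth object to exactly one prediction. Below is a toy illustration using a random cost matrix (in a real DETR the cost would combine classification and box terms); the sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

num_preds, num_objects = 10, 3
rng = np.random.default_rng(0)
cost = rng.random((num_preds, num_objects))  # stand-in for class + box matching costs

pred_idx, _ = linear_sum_assignment(cost)    # one prediction per ground-truth object
matched = set(pred_idx.tolist())
for i in range(num_preds):
    if i in matched:
        print(f"prediction {i}: matched, trained to be well-calibrated")
    else:
        print(f"prediction {i}: unmatched, foreground confidence suppressed toward zero")
```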

Result: DETRs use optimal specialist strategy: one calibrated prediction per object, others suppress confidence. OCE effectively evaluates models and identifies reliable predictions. Uncertainty framework predicts per-image accuracy.

Conclusion: Practical DETR deployment requires joint evaluation of calibration and post-processing. OCE addresses limitations of existing metrics, and uncertainty quantification enables reliable prediction selection.

Abstract: DETR and its variants have emerged as promising architectures for object detection, offering an end-to-end prediction pipeline. In practice, however, DETRs generate hundreds of predictions that far outnumber the actual objects present in an image. This raises a critical question: which of these predictions could be trusted? Addressing this concern, we provide empirical and theoretical evidence that predictions within the same image play distinct roles, resulting in varying reliability levels. Our analysis reveals that DETRs employ an optimal specialist strategy: one prediction per object is trained to be well-calibrated, while the remaining predictions are trained to suppress their foreground confidence to near zero, even when maintaining accurate localization. We show that this strategy emerges as the loss-minimizing solution to the Hungarian matching algorithm, fundamentally shaping DETRs’ outputs. While selecting the well-calibrated predictions is ideal, they are unidentifiable at inference time. This means that any post-processing algorithm poses a risk of outputting a set of predictions with mixed calibration levels. Therefore, practical deployment necessitates a joint evaluation of both the model’s calibration quality and the effectiveness of the post-processing algorithm. However, we demonstrate that existing metrics like average precision and expected calibration error are inadequate for this task. To address this issue, we further introduce Object-level Calibration Error (OCE): This object-centric design penalizes both retaining suppressed predictions and missed ground truth foreground objects, making OCE suitable for both evaluating models and identifying reliable prediction subsets. Finally, we present a post hoc uncertainty quantification framework that predicts per-image model accuracy.

[170] ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

Dmitry Petrov, Pradyumn Goyal, Divyansh Shivashok, Yuanming Tao, Melinos Averkiou, Evangelos Kalogerakis

Main category: cs.CV

TL;DR: ShapeWords: A method for text-to-image synthesis guided by 3D shape information using specialized tokens to blend shape awareness with textual context.

Motivation: Current shape guidance methods (like depth maps) are limited to fixed viewpoints, often ignore full 3D structure, and don't effectively integrate with textual context, leading to less coherent image synthesis.

Method: Incorporates target 3D shape information within specialized tokens embedded together with input text, effectively blending 3D shape awareness with textual context to guide image synthesis.

Result: Produces more text-compliant, aesthetically plausible images while maintaining 3D shape awareness, generating diverse yet consistent images that reflect both target shape geometry and textual description.

Conclusion: ShapeWords successfully integrates 3D shape guidance with text prompts through specialized tokens, overcoming limitations of conventional shape guidance methods and producing higher quality, more coherent synthesized images.

Abstract: We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape’s geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant and aesthetically plausible, while also maintaining 3D shape awareness.

[171] When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization

Vivek Ramanujan, Kushal Tirumala, Armen Aghajanyan, Luke Zettlemoyer, Ali Farhadi

Main category: cs.CV

TL;DR: The paper studies the trade-off between compression and reconstruction in two-stage image generation, introduces Causally Regularized Tokenization (CRT) to improve generation efficiency, and achieves comparable performance with fewer tokens and parameters.

Motivation: Current two-stage image generation methods face a fundamental trade-off: more aggressive compression makes latent distributions easier for generative models to learn but worsens reconstruction. The paper aims to understand this trade-off and improve efficiency.

Method: 1) Analyzes the trade-off through scaling laws, 2) Introduces Causally Regularized Tokenization (CRT) which uses stage 2 generation modeling knowledge to embed inductive biases in stage 1 latents, 3) Optimizes visual tokenizer setup with CRT.
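
For context, the sketch below is a minimal runnable skeleton of the two-stage setup described here: a toy VQ tokenizer (stage 1) whose discrete token ids would then be modeled autoregressively (stage 2). The summary does not give CRT's exact regularizer, so the sketch only marks where such a stage-2-aware term would enter the stage 1 loss; all module sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTokenizer(nn.Module):
    """Stage 1: compress an image into discrete tokens and reconstruct it."""
    def __init__(self, vocab=512, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.codebook = nn.Embedding(vocab, dim)
        self.dec = nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8)

    def forward(self, img):
        z = self.enc(img).permute(0, 2, 3, 1)                          # (B, h, w, D)
        ids = torch.cdist(z.flatten(0, 2), self.codebook.weight).argmin(-1)
        q = self.codebook(ids).view_as(z)                              # nearest codebook entries
        recon = self.dec((z + (q - z).detach()).permute(0, 3, 1, 2))   # straight-through estimator
        return recon, ids.view(img.shape[0], -1)

tok = ToyTokenizer()
img = torch.randn(2, 3, 64, 64)
recon, ids = tok(img)
stage1_loss = F.mse_loss(recon, img)  # + CRT regularizer (exact form not given in the summary)
# Stage 2: an autoregressive model is then trained over the `ids` token sequences.
```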

Result: CRT improves stage 2 generation performance without affecting compression rate, achieving 2-3× compute efficiency improvement. The final pipeline matches LlamaGen-3B performance (2.18 FID) with half the tokens (256 vs. 576) and quarter the parameters (775M vs. 3.1B).

Conclusion: The paper demonstrates that generation modeling capacity affects the compression-reconstruction trade-off, and that CRT can significantly improve efficiency in two-stage image generation by making tokens easier to model while maintaining compression performance.

Abstract: Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off: do we compress more aggressively to make the latent distribution easier for the stage 2 model to learn, even if it makes reconstruction worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization improves stage 2 generation performance by making the tokens easier to model, without affecting the stage 1 compression rate and with only a marginal effect on distortion: we are able to improve compute efficiency 2-3$\times$ over baseline. Finally, we use CRT with further optimizations to the visual tokenizer setup to result in a generative pipeline that matches LlamaGen-3B generation performance (2.18 FID) with half the tokens per image (256 vs. 576) and a fourth of the total model parameters (775M vs. 3.1B) while using the same architecture and inference procedure.

[172] Unifying Multiple Foundation Models for Advanced Computational Pathology

Wenhui Lei, Yusheng Tan, Anqi Li, Hanyu Chen, Hengrui Tian, Ruiying Li, Zhengqun Jiang, Fang Yan, Xiaofan Zhang, Shaoting Zhang

Main category: cs.CV

TL;DR: Shazam is an online integration framework that unifies multiple pretrained pathology foundation models through adaptive expert weighting and online distillation, outperforming individual models across various pathology tasks.

Motivation: Current pathology foundation models show inconsistent performance across tasks due to training data differences, and many high-performing models rely on proprietary datasets that can't be shared. Offline distillation has limitations including dependency on distillation corpus size and requiring full retraining for new models.

Method: Shazam is a task-specific online integration framework that unifies multiple pretrained pathology foundation models within a single flexible inference system. It fuses multi-level representations through adaptive expert weighting and learns task-aligned features via online distillation.
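
A minimal sketch of how adaptive expert weighting and online distillation could fit together: frozen foundation-model features are fused with learned softmax gates, and a student can be distilled online from the fused (teacher) prediction. The dimensions, gate design, and temperature are illustrative assumptions, not Shazam's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFusion(nn.Module):
    """Fuse features from several frozen pathology foundation models with adaptive weights."""
    def __init__(self, expert_dims, fused_dim, num_classes):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in expert_dims])
        self.gate = nn.Parameter(torch.zeros(len(expert_dims)))  # adaptive expert weights
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, expert_feats):  # list of (B, d_i) features from frozen experts
        w = self.gate.softmax(0)
        fused = sum(wi * p(f) for wi, p, f in zip(w, self.proj, expert_feats))
        return self.head(fused)

def online_distill_loss(student_logits, teacher_logits, T=2.0):
    """KL distillation from the fused (teacher) prediction to a lightweight student."""
    return F.kl_div(F.log_softmax(student_logits / T, -1),
                    F.softmax(teacher_logits / T, -1),
                    reduction="batchmean") * T * T
```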

Result: Shazam consistently outperforms strong individual models across multiple pathology tasks including spatial transcriptomics prediction, survival prognosis, tile classification, and visual question answering.

Conclusion: Shazam shows promise as a scalable approach for harnessing the rapid evolution of pathology foundation models in a unified and adaptable manner, addressing limitations of current methods.

Abstract: Foundation models have advanced computational pathology by learning transferable visual representations from large histological datasets, yet recent evaluations reveal substantial variability in their performance across tasks. This inconsistency arises from differences in training data diversity and is further constrained by the reliance of many high-performing models on proprietary datasets that cannot be shared or expanded. Offline distillation offers a partial remedy but depends heavily on the size and heterogeneity of the distillation corpus and requires full retraining to incorporate new models. To address these limitations, we propose Shazam, a task-specific online integration framework that unifies multiple pretrained pathology foundation models within a single flexible inference system. Shazam fuses multi-level representations through adaptive expert weighting and learns task-aligned features via online distillation. Across spatial transcriptomics prediction, survival prognosis, tile classification, and visual question answering, Shazam consistently outperforms strong individual models, highlighting its promise as a scalable approach for harnessing the rapid evolution of pathology foundation models in a unified and adaptable manner.

[173] Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling

Tasmiah Haque, Md. Asif Bin Syed, Byungheon Jeong, Xue Bai, Sumit Mohan, Somdyuti Paul, Imtiaz Ahmed, Srinjoy Das

Main category: cs.CV

TL;DR: A real-time video motion transfer framework using keypoint forecasting with VRNN and GRU-NF models for bandwidth-efficient video transmission applications.

Motivation: Enable real-time video motion transfer for bandwidth-efficient applications like video conferencing, remote health monitoring, VR interaction, and anomaly detection by using compact keypoint representations.

Method: Uses semantically meaningful keypoints as compact motion representations. Forecasts keypoints using VRNN (with recurrently conditioned stochastic latent variables) and GRU-NF (invertible exact-likelihood mapping). Predicted keypoints are transformed into video frames using optical flow-based module with generator network.
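
As a hedged sketch of the forecasting stage, the snippet below rolls keypoints forward autoregressively with a plain GRU; the stochastic latent variables of VRNN and the normalizing-flow head of GRU-NF are omitted, and the keypoint count and hidden size are illustrative.

```python
import torch
import torch.nn as nn

class KeypointForecaster(nn.Module):
    """Autoregressive multi-step forecasting of flattened (x, y) keypoint coordinates."""
    def __init__(self, num_kp=10, hidden=128):
        super().__init__()
        self.gru = nn.GRU(num_kp * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_kp * 2)

    def forward(self, past, horizon=5):
        # past: (B, T, num_kp*2) observed keypoint trajectories
        _, h = self.gru(past)          # encode the observed context
        step = past[:, -1:]            # start rollout from the last observed frame
        preds = []
        for _ in range(horizon):
            out, h = self.gru(step, h)
            step = self.head(out)      # predicted next-frame keypoints, fed back in
            preds.append(step)
        return torch.cat(preds, dim=1)  # (B, horizon, num_kp*2)
```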

Result: VRNN achieves best point-forecast fidelity (lowest MAE) for stable multi-step forecasting, especially in high-uncertainty, multi-modal settings. GRU-NF enables richer diversity of generated videos while maintaining high visual quality.

Conclusion: The framework enables efficient low-frame-rate video transmission with either deterministic future sequences or diverse plausible futures. Lays foundation for next-generation AI systems requiring real-time, bandwidth-efficient, semantically controllable video generation.

Abstract: Motion Transfer is a technique that synthesizes videos by transferring motion dynamics from a driving video to a source image. In this work we propose a deep learning-based framework to enable real-time video motion transfer, which is critical for bandwidth-efficient applications such as video conferencing, remote health monitoring, virtual reality interaction, and vision-based anomaly detection. This is done using keypoints, which serve as semantically meaningful, compact representations of motion across time. To enable bandwidth savings during video transmission we perform forecasting of keypoints using two generative time series models, VRNN and GRU-NF. The predicted keypoints are transformed into realistic video frames using an optical flow-based module paired with a generator network, thereby enabling efficient, low-frame-rate video transmission. Depending on the application, this allows the framework to either generate a deterministic future sequence or sample a diverse set of plausible futures. Experimental results demonstrate that VRNN achieves the best point-forecast fidelity (lowest MAE) in applications requiring stable and accurate multi-step forecasting and is particularly competitive in higher-uncertainty, multi-modal settings. This is achieved by introducing recurrently conditioned stochastic latent variables that carry past contexts to capture uncertainty and temporal variation. On the other hand, the GRU-NF model enables richer diversity of generated videos while maintaining high visual quality. This is realized by learning an invertible, exact-likelihood mapping between the keypoints and their latent representations, which supports rich and controllable sampling of diverse yet coherent keypoint sequences. Our work lays the foundation for next-generation AI systems that require real-time, bandwidth-efficient, and semantically controllable video generation.

[174] UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

Linshan Wu, Yuxiang Nie, Sunan He, Jiaxin Zhuang, Luyang Luo, Tao Li, Zhuoyao Xie, Dexuan Chen, Yinghua Zhao, Neeraj Mahboobani, Varut Vardhanabhuti, Ronald Cheong Kin Chan, Yifan Peng, Pranav Rajpurkar, Hao Chen

Main category: cs.CV

TL;DR: UniBiomed is the first universal foundation model for grounded biomedical image interpretation that simultaneously generates diagnostic findings and segments corresponding biomedical objects, addressing the interpretability gap in clinical AI applications.

Motivation: Current biomedical AI models lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects, making it challenging for clinicians to correlate AI-generated findings with visual evidence and interpret results.

Method: UniBiomed integrates Multi-modal Large Language Model and Segment Anything Model to unify diverse biomedical tasks. It was trained on a large-scale dataset of over 27 million triplets (images, region annotations, text descriptions) across ten biomedical imaging modalities.

Result: Extensive validation on 70 internal and 14 external datasets demonstrated state-of-the-art performance in diverse biomedical tasks including image segmentation, disease recognition, region-aware diagnosis, vision question answering, and report generation.

Conclusion: UniBiomed is a powerful and versatile biomedical foundation model that unlocks grounded interpretation capability for optimizing AI-assisted biomedical image analysis in clinical practice.

Abstract: The integration of AI-assisted biomedical image analysis into clinical practice demands AI-generated findings that are not only accurate but also interpretable to clinicians. However, existing biomedical AI models generally lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects. This limitation makes it challenging for clinicians to correlate AI-generated findings with visual evidence (e.g., tiny lesions) in images and interpret the results of AI models. To address this challenge, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation, which is capable of generating accurate diagnostic findings and simultaneously segmenting the corresponding biomedical targets. UniBiomed is based on a novel integration of Multi-modal Large Language Model and Segment Anything Model, which can effectively unify diverse biomedical tasks in universal training for advancing grounded interpretation. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, region annotations, and text descriptions across ten biomedical imaging modalities. Extensive validation on 70 internal and 14 external datasets demonstrated the state-of-the-art performance of UniBiomed in diverse biomedical tasks, including image segmentation, disease recognition, region-aware diagnosis, vision question answering, and report generation. In summary, UniBiomed is a powerful and versatile biomedical foundation model, unlocking the untapped grounded interpretation capability for optimizing AI-assisted biomedical image analysis.

[175] diffDemorph: Extending Reference-Free Demorphing to Unseen Faces

Nitish Shukla, Arun Ross

Main category: cs.CV

TL;DR: diffDeMorph: A diffusion-based method for reference-free face demorphing that generalizes across different morphing techniques and face styles, outperforming previous methods by ≥59.46%.

Motivation: Previous RF demorphing methods are too constrained, relying on assumptions about morphing techniques (e.g., landmark-based) and face image styles (e.g., passport photos), limiting their practical applicability.

Method: A novel diffusion-based approach called diffDeMorph that effectively disentangles component images from composite morph images with high visual fidelity. The method is trained on morphs created using synthetically generated face images and tested on real morphs.

Result: The method achieves ≥59.46% improvement over current state-of-the-art under common training protocol across all datasets tested. Experiments on six datasets and two face matchers establish the utility and efficacy of the approach.

Conclusion: diffDeMorph is the first method to generalize across morph techniques and face styles, enhancing the practicality of reference-free demorphing technology.

Abstract: A face morph is created by combining two face images corresponding to two identities to produce a composite that successfully matches both the constituent identities. Reference-free (RF) demorphing reverses this process using only the morph image, without the need for additional reference images. Previous RF demorphing methods are overly constrained, as they rely on assumptions about the distributions of training and testing morphs such as the morphing technique used (e.g., landmark-based) and face image style (e.g., passport photos). In this paper, we introduce a novel diffusion-based approach, referred to as diffDeMorph, that effectively disentangles component images from a composite morph image with high visual fidelity. Our method is the first to generalize across morph techniques and face styles, beating the current state of the art by $\geq 59.46\%$ under a common training protocol across all datasets tested. We train our method on morphs created using synthetically generated face images and test on real morphs, thereby enhancing the practicality of the technique. Experiments on six datasets and two face matchers establish the utility and efficacy of our method.

[176] SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie

Main category: cs.CV

TL;DR: SpatialScore benchmark reveals MLLMs’ spatial intelligence gap; SpatialCorpus training data and SpatialAgent multi-agent system improve performance.

Motivation: Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are fragmented and limited in scope, lacking comprehensive assessment of spatial understanding capabilities.

Method: Proposed complementary data-driven and agent-based solutions: (1) SpatialScore benchmark with 5K samples across 30 tasks, (2) SpatialCorpus training resource with 331K QA samples for fine-tuning, (3) SpatialAgent multi-agent system with 12 spatial perception tools using Plan-Execute and ReAct reasoning.

Result: Evaluation of 40 MLLMs reveals persistent challenges and substantial gap from human-level spatial intelligence. SpatialCorpus significantly improves model performance (e.g., Qwen3-VL), and SpatialAgent enables substantial gains without additional training.

Conclusion: The benchmark, corpus, and agent framework provide solid foundation for advancing MLLMs toward human-level spatial intelligence. All resources will be released to research community.

Abstract: Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 40 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.

[177] The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts

Yuchen Zhang, Yaxiong Wang, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng

Main category: cs.CV

TL;DR: Proposes AMD framework to detect MLLM-generated multimodal disinformation using artifact-aware encoding and manipulation-oriented reasoning, achieving state-of-the-art performance on new MDSM dataset.

Motivation: Current methods underestimate MLLM-driven deception risk and rely on unrealistic misalignment artifacts. Need to address sophisticated misinformation synthesized by multimodal LLMs that generate semantically coherent deceptive narratives.

Method: 1) Construct MDSM dataset with edited images paired with MLLM-generated deceptive texts. 2) Develop AMD framework with Artifact Pre-perception Encoding and Manipulation-Oriented Reasoning to detect MLLM-powered multimodal deceptions.

Result: AMD achieves best average performance on MDSM dataset: 88.18 ACC, 60.25 mAP, and 61.02 mIoU scores, demonstrating superior generalization for detecting MLLM-generated disinformation.

Conclusion: The proposed AMD framework effectively addresses limitations of existing methods by handling sophisticated MLLM-generated multimodal disinformation through artifact-aware encoding and manipulation reasoning, offering a unified solution for high-risk deception detection.

Abstract: The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework’s superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions. In cross-domain testing on the MDSM dataset, AMD achieves the best average performance, with 88.18 ACC, 60.25 mAP, and 61.02 mIoU scores.

[178] SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes

Haihong Xiao, Jianan Zou, Yuxin Zhou, Ying He, Wenxiong Kang

Main category: cs.CV

TL;DR: SplatCo is a collaborative Gaussian splatting framework that combines structure and view information for high-fidelity rendering of complex outdoor scenes through three novel components: cross-structure collaboration, cross-view pruning, and structure-view co-learning.

Motivation: The paper aims to address the challenge of achieving high-fidelity rendering for complex outdoor scenes, which requires balancing global scene layout understanding with local detail preservation while maintaining storage efficiency and preventing rendering artifacts.

Method: SplatCo introduces three key components: 1) Cross-structure collaboration module combining global tri-plane representations with local context grid features via hierarchical compensation; 2) Cross-view pruning mechanism removing inconsistent Gaussians based on structural consistency; 3) Structure-view co-learning module aggregating structural and view gradients for robust optimization.

Result: The framework effectively achieves high-fidelity rendering for large-scale outdoor scenes by combining global spatial awareness with local detail preservation, improving storage efficiency, and preventing rendering artifacts through structural consistency.

Conclusion: SplatCo demonstrates that collaborative integration of structure and view information through the proposed three components enables high-quality rendering of complex outdoor scenes, with code and project page made publicly available.

Abstract: We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor scenes. SplatCo builds upon three novel components: 1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features representing fine details. This fusion is achieved through a hierarchical compensation mechanism, ensuring both global spatial awareness and local detail preservation; 2) a cross-view pruning mechanism that removes overfitted or inaccurate Gaussians based on structural consistency, thereby improving storage efficiency and preventing rendering artifacts; 3) a structure-view co-learning module that aggregates structural gradients with view gradients, thereby steering the optimization of Gaussian geometric and appearance attributes more robustly. By combining these key components, SplatCo effectively achieves high-fidelity rendering for large-scale scenes. Code and project page are available at https://splatco-tech.github.io.

[179] Test-Time Distillation for Continual Model Adaptation

Xiao Chen, Jiazhen Huang, Zhiming Liu, Qinting Jiang, Fanding Huang, Jingyan Jiang, Zhi Wang

Main category: cs.CV

TL;DR: CoDiRe: A Continual Distillation and Rectification framework that uses Vision-Language Models as external guidance to prevent model drift in Test-Time Adaptation, outperforming state-of-the-art methods with better efficiency.

Motivation: Existing Continual Test-Time Adaptation methods suffer from self-referential feedback loops that amplify initial errors and cause model drift. Current approaches relying on self-supervision are insufficient for stable adaptation.

Method: Proposes Test-Time Distillation (TTD) guided by frozen Vision-Language Models. CoDiRe framework constructs a robust blended teacher by dynamically fusing VLM and target model predictions using Maximum Softmax Probability for weighting, then applies Optimal Transport-based rectification for stable adaptation.
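
The blended-teacher construction is easy to state concretely. Below is a minimal sketch, assuming a simple per-sample convex combination weighted by each model's Maximum Softmax Probability; the exact fusion rule and the Optimal Transport rectification step are not reproduced here.

```python
import torch

def blended_teacher(vlm_logits, target_logits):
    """Fuse VLM and target-model predictions, weighting each by its MSP confidence."""
    p_vlm = vlm_logits.softmax(-1)
    p_tgt = target_logits.softmax(-1)
    c_vlm = p_vlm.max(-1, keepdim=True).values   # MSP confidence of the VLM
    c_tgt = p_tgt.max(-1, keepdim=True).values   # MSP confidence of the target model
    w = c_vlm / (c_vlm + c_tgt)                  # per-sample fusion weight
    return w * p_vlm + (1 - w) * p_tgt           # blended teacher distribution
```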

Result: CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% while using only 48% of its time cost on ImageNet-C. Demonstrates superior performance and efficiency in continual test-time adaptation.

Conclusion: Using external VLM guidance through distillation and rectification effectively addresses model drift in CTTA. The proposed framework overcomes pitfalls of direct distillation and provides stable, efficient adaptation to distribution shifts.

Abstract: Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner, yet existing methods, which rely on self-supervision, are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: the Generalist Trap, where the VLM’s broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts, and the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls motivate our insight: the key is to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model’s expertise. It then applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% while using only 48% of its time cost on ImageNet-C.

[180] MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei, Yu Miao, Dongzhan Zhou, Di Hu

Main category: cs.CV

TL;DR: MokA is a multimodal-aware efficient fine-tuning method that addresses limitations of existing LLM-based approaches by explicitly handling both unimodal adaptation and cross-modal interaction through modality-specific parameters.

Motivation: Current efficient multimodal fine-tuning methods are borrowed from LLMs and neglect intrinsic multimodal differences, failing to fully utilize all modalities. The authors argue that both unimodal adaptation and cross-modal adaptation are essential for effective MLLM fine-tuning.

Method: Proposes Multimodal low-rank Adaptation (MokA) - a multimodal-aware fine-tuning strategy that compresses unimodal information using modality-specific parameters while explicitly enhancing cross-modal interaction to ensure both unimodal and cross-modal adaptation.
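
A minimal sketch of the modality-aware adaptation idea: modality-specific low-rank down-projections compress unimodal information, and an explicit mixing step models cross-modal interaction before a shared up-projection. This is a simplified reading; MokA's actual cross-modal mechanism may differ, and all names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class MokAStyleAdapter(nn.Module):
    """Per-modality low-rank compression plus explicit cross-modal mixing."""
    def __init__(self, dim, rank=8, modalities=("audio", "visual", "text")):
        super().__init__()
        self.down = nn.ModuleDict({m: nn.Linear(dim, rank, bias=False) for m in modalities})
        self.cross = nn.MultiheadAttention(rank, num_heads=1, batch_first=True)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as a no-op

    def forward(self, tokens_by_modality):
        # tokens_by_modality: dict like {"audio": (B, Ta, dim), "text": (B, Tt, dim)}
        names = list(tokens_by_modality)
        low = torch.cat([self.down[m](tokens_by_modality[m]) for m in names], dim=1)
        mixed, _ = self.cross(low, low, low)   # explicit cross-modal interaction
        out = self.up(mixed)                   # shared up-projection
        splits = [tokens_by_modality[m].shape[1] for m in names]
        return {m: tokens_by_modality[m] + o
                for m, o in zip(names, out.split(splits, dim=1))}
```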

Result: Extensive experiments across three multimodal scenarios (audio-visual-text, visual-text, speech-text) and multiple LLM backbones show consistent improvements, demonstrating efficacy and versatility. Ablation studies and efficiency evaluations further validate the method.

Conclusion: MokA provides a more targeted solution for efficient adaptation of MLLMs, addressing multimodal-specific challenges and paving the way for further exploration in multimodal fine-tuning.

Abstract: In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.

[181] ExAct: A Video-Language Benchmark for Expert Action Analysis

Han Yi, Yulu Pan, Feihong He, Xinyu Liu, Benjamin Zhang, Oluwatumininu Oguntola, Gedas Bertasius

Main category: cs.CV

TL;DR: ExAct is a new video-language benchmark for expert-level understanding of skilled physical human activities, featuring 3521 expert-curated video QA pairs across 11 activities in 6 domains, revealing a large performance gap between current VLMs and human experts.

Motivation: There's a need for benchmarks that test expert-level understanding of skilled physical human activities, as current video-language models lack the nuanced understanding required for fine-grained analysis of physical skills and procedural domains.

Method: Created ExAct benchmark with 3521 expert-curated video question-answer pairs spanning 11 physical activities across 6 domains (Sports, Bike Repair, Cooking, Health, Music, Dance). Uses multiple-choice format with 5 carefully designed candidate options to test fine-grained understanding.

Result: Evaluation shows substantial performance gap: best-performing GPT-4o achieves only 44.70% accuracy, well below human expert performance of 82.02%. This reveals current VLMs’ limitations in expert-level understanding of physical skills.

Conclusion: ExAct provides a valuable benchmark for developing and evaluating VLMs capable of precise understanding of human skills in physical and procedural domains, highlighting the need for models to achieve expert-level comprehension of skilled activities.

Abstract: We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/

[182] Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Ziran Qin, Shijie Li, Liquan Shen, Junhui Hou, Zheng Wang, Weiyao Lin

Main category: cs.CV

TL;DR: Introduces I-VQA (Implicit Video Question Answering) task for answering questions without explicit visual evidence, proposes IRM framework with dual-stream modeling, and shows SOTA performance.

Motivation: Current VideoQA focuses on explicit visual evidence, but fails when questions target symbolic meanings or deeper intentions where explicit evidence is unavailable.

Method: Proposes IRM (Implicit Reasoning Model) with Action-Intent Module (AIM) for dual-stream modeling of contextual actions and intent clues, and Visual Enhancement Module (VEM) for contextual visual representation enhancement.

Result: IRM outperforms GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87% respectively on I-VQA, and achieves SOTA on similar implicit advertisement understanding and traffic-VQA tasks.

Conclusion: I-VQA addresses the gap in implicit reasoning in video understanding, and IRM provides an effective framework for handling questions without explicit visual evidence through contextual reasoning chains.

Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, Implicit Video Question Answering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Additionally, IRM achieves SOTA results on the related tasks of implicit advertisement understanding and future prediction in traffic-VQA. Datasets and code are available for double-blind review in an anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.

[183] Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox

Main category: cs.CV

TL;DR: A world model for autonomous driving that achieves state-of-the-art performance with simple design, no extra supervision, and only 469M parameters trained on 280h of video, excelling in challenging scenarios like turns and urban traffic.

Motivation: Existing world models for autonomous driving have limitations in long-horizon generation and generalization to challenging scenarios, requiring complex designs, additional supervision, or multiple sensors.

Method: Developed a model using simple design choices without additional supervision or sensors (maps, depth, multiple cameras). Created a hybrid tokenizer compatible with both discrete token models and continuous flow matching approaches to enable direct comparison.

Result: Achieved state-of-the-art performance despite having only 469M parameters trained on 280h of video data, with particularly strong performance in difficult scenarios like turning maneuvers and urban traffic. The comparison study favored continuous autoregressive models over discrete token models as being less brittle and more powerful.

Conclusion: Continuous autoregressive models based on flow matching outperform discrete token models for autonomous driving world modeling, being more robust to design choices and more powerful, while achieving excellent performance with simple architecture and minimal supervision.

Abstract: Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.

[184] Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

Siyu Chen, Ting Han, Chengzheng Fu, Changshe Zhang, Chaolei Wang, Jinhe Su, Guorong Cai, Meiliu Wu

Main category: cs.CV

TL;DR: Vireo is a single-stage framework for Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS) that unifies open-vocabulary segmentation with domain generalization using frozen Visual Foundation Models and depth information.

Motivation: Real-world scenarios like autonomous driving in adverse conditions require semantic segmentation that can handle both unseen categories (open-vocabulary) and unseen domains (domain generalization). Current approaches treat these as separate problems, but they have complementary strengths that should be unified.

Method: Vireo builds on frozen Visual Foundation Models and incorporates scene geometry via Depth VFMs. Key components: 1) GeoText Prompts align geometric features with language cues and refine VFM encoder representations; 2) Coarse Mask Prior Embedding enhances gradient flow for faster convergence; 3) Domain-Open-Vocabulary Vector Embedding Head fuses refined structural and semantic features for robust prediction.

Result: Vireo achieves state-of-the-art performance, surpassing existing methods by a large margin in both domain generalization and open-vocabulary recognition on comprehensive evaluations.

Conclusion: Vireo offers a unified and scalable solution for robust visual understanding in diverse and dynamic environments by effectively combining open-vocabulary semantic segmentation with domain generalization capabilities.

Abstract: Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation of these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.

[185] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Zhonghao Yan, Muxi Diao, Yuxuan Yang, Ruoyan Jing, Jiayuan Xu, Kaizhou Zhang, Lele Yang, Yanxi Liu, Kongming Liang, Zhanyu Ma

Main category: cs.CV

TL;DR: MedReasoner: A modular framework using reinforcement learning to separate clinical reasoning from segmentation for accurate medical image grounding with implicit queries.

Motivation: Current medical grounding pipelines rely on supervised fine-tuning with explicit spatial hints, making them inadequate for handling implicit clinical queries common in real practice where doctors describe findings without precise coordinates.

Method: Introduces MedReasoner with two separate modules: 1) an MLLM reasoner optimized with reinforcement learning to handle clinical reasoning, and 2) a frozen segmentation expert that converts spatial prompts into pixel-level masks. Uses format and accuracy rewards for alignment.
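
As a toy illustration of combining format and accuracy rewards, the sketch below checks for a well-formed spatial prompt and scores box overlap; the tag syntax, the box-level IoU (the paper works with pixel-level masks), and the weights are all our assumptions.

```python
import re

def format_reward(text: str) -> float:
    """1.0 if the response contains a point prompt inside <point>...</point>
    tags (tag names are illustrative only), else 0.0."""
    return 1.0 if re.search(r"<point>\s*\d+\s*,\s*\d+\s*</point>", text) else 0.0

def iou(box_a, box_b) -> float:
    """Accuracy reward: IoU between predicted and reference boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def total_reward(text, pred_box, gt_box, w_fmt=0.2, w_acc=0.8):
    return w_fmt * format_reward(text) + w_acc * iou(pred_box, gt_box)

print(total_reward("<point>120, 88</point>", (100, 80, 140, 120), (110, 85, 150, 125)))
```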

Result: Achieves state-of-the-art performance on U-MRG-14K dataset (14K samples across 10 modalities, 15 super-categories, 108 specific categories) and demonstrates strong generalization to unseen clinical queries.

Conclusion: Reinforcement learning shows significant promise for interpretable medical grounding by separating reasoning from segmentation, enabling better handling of implicit clinical queries common in real-world medical practice.

Abstract: Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.

[186] Counting with Confidence: Accurate Pest Monitoring in Water Traps

Xumin Gao, Mark Stevens, Grzegorz Cielniak

Main category: cs.CV

TL;DR: Proposes a method to evaluate pest counting confidence by combining counting results with environmental factors like image quality, complexity, and pest distribution uniformity.

Motivation: Existing pest counting models are evaluated on datasets with ground truth but deployed without assessing reliability in real-world scenarios lacking ground truth. Need for comprehensive confidence evaluation in counting tasks.

Method: 1) Pest detection network extracts counting result information. 2) Image quality assessment, complexity assessment, and pest distribution uniformity assessment. 3) Quantifies image clarity changes from stirring using average gradient magnitude. 4) Hypothesis-driven multi-factor sensitivity analysis selects optimal assessment methods. 5) Adaptive DBSCAN clustering for distribution uniformity. 6) Regression model combines counting results and environmental factors to predict final counting confidence.
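
Two of these ingredients are simple enough to sketch: a mean-gradient-magnitude clarity measure and a DBSCAN-based uniformity proxy. The fixed eps and the noise-fraction statistic below are our simplifications; the paper uses an adaptive DBSCAN variant.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def average_gradient_magnitude(gray: np.ndarray) -> float:
    """Image clarity as mean gradient magnitude; lower values suggest
    blur, e.g. from stirring the water trap during acquisition."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def distribution_uniformity(centers: np.ndarray, eps: float = 30.0) -> float:
    """Rough uniformity proxy: fraction of detected pests that DBSCAN
    labels as noise (-1); more isolated detections = more uniform."""
    labels = DBSCAN(eps=eps, min_samples=3).fit(centers).labels_
    return float(np.mean(labels == -1))

img = np.random.rand(480, 640)
pests = np.random.rand(50, 2) * np.array([640, 480])
print(average_gradient_magnitude(img), distribution_uniformity(pests))
```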

Result: Method reduces MSE by 31.7% and improves R2 by 15.2% on pest counting confidence test set compared to baseline using only counting result information.

Conclusion: First study to comprehensively evaluate counting confidence in counting tasks, quantifying relationship between influencing factors and confidence through a model. Enables reliability assessment in real-world deployments without ground truth.

Abstract: Accurate pest population monitoring and tracking their dynamic changes are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios without assessing the reliability of counting results due to the lack of ground truth. To this end, this paper proposes a method for comprehensively evaluating pest counting confidence in the image, based on information related to counting results and external environmental conditions. First, a pest detection network is used for pest detection and counting, extracting counting result-related information. Then, the pest images undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment. The changes in image clarity caused by stirring during image acquisition are quantified by calculating the average gradient magnitude. Notably, we design a hypothesis-driven multi-factor sensitivity analysis method to select the optimal image quality assessment and image complexity assessment methods, and we propose an adaptive DBSCAN clustering algorithm for pest distribution uniformity assessment. Finally, the obtained information related to counting results and external environmental conditions is input into a regression model for prediction, resulting in the final pest counting confidence. To the best of our knowledge, this is the first study dedicated to comprehensively evaluating counting confidence in counting tasks, and quantifying the relationship between influencing factors and counting confidence through a model. Experimental results show our method reduces MSE by 31.7% and improves R2 by 15.2% on the pest counting confidence test set, compared to the baseline built primarily on information related to counting results.

[187] Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Hao Xing, Kai Zhe Boey, Yuankai Wu, Darius Burschka, Gordon Cheng

Main category: cs.CV

TL;DR: MMGCN integrates low-frame-rate visual data with high-frame-rate motion data to improve temporal action segmentation by reducing over-segmentation errors through multi-modal fusion and smooth transition augmentation.

Motivation: Accurate temporal segmentation of human actions is critical for collaborative robots, but noise in pose estimation and object detection leads to over-segmentation errors that disrupt action sequence coherence.

Method: Proposes Multi-Modal Graph Convolutional Network (MMGCN) with: 1) sinusoidal encoding of 3D skeleton coordinates, 2) temporal graph fusion module for multi-resolution alignment, and 3) SmoothLabelMix augmentation for gradual action transitions.
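
The sinusoidal encoding is easy to make concrete. The sketch below maps each 3D coordinate through sin/cos at several frequencies; the power-of-two frequency schedule is our assumption, since the summary only specifies a continuous sin-cos mapping.

```python
import numpy as np

def sincos_encode(joints: np.ndarray, num_freqs: int = 4) -> np.ndarray:
    """Map 3D skeleton coordinates into a continuous sin-cos space.
    joints: (..., 3) array; returns (..., 3 * 2 * num_freqs) features."""
    freqs = 2.0 ** np.arange(num_freqs)        # (F,) assumed frequency schedule
    angles = joints[..., None] * freqs         # (..., 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*joints.shape[:-1], -1)

skeleton = np.random.randn(30, 25, 3)          # 30 frames, 25 joints, xyz
features = sincos_encode(skeleton)             # shape (30, 25, 24)
```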

Result: Outperforms state-of-the-art methods on Bimanual Actions Dataset, achieving F1@10: 94.5% and F1@25: 92.8% in action segmentation accuracy.

Conclusion: The MMGCN framework effectively addresses over-segmentation in action recognition by integrating multi-modal data at different frame rates and incorporating smooth transition modeling, demonstrating superior performance for human-object interaction understanding.

Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.

[188] Towards Open-World Human Action Segmentation Using Graph Convolutional Networks

Hao Xing, Kai Zhe Boey, Gordon Cheng

Main category: cs.CV

TL;DR: Proposes a framework for open-world human-object interaction segmentation that detects and segments unseen actions without manual annotation, achieving significant improvements over state-of-the-art methods.

Motivation: Existing learning-based methods for human-object interaction segmentation work well in closed-world scenarios but struggle with open-world situations where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that can handle out-of-distribution actions without manual annotation.

Method: Proposes a structured framework with three key innovations: 1) Enhanced Pyramid Graph Convolutional Network (EPGCN) with novel decoder for robust spatiotemporal feature upsampling, 2) Mixup-based training to synthesize out-of-distribution data without manual annotations, and 3) Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples.
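
A rough sketch of the second and third ingredients under stated assumptions: Beta-distributed mixing to synthesize pseudo-OOD samples, and a margin-based clustering loss that pulls known actions toward their centroids while pushing OOD samples away. The paper's exact formulation may differ.

```python
import torch

def mixup_ood(x_a: torch.Tensor, x_b: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Synthesize a pseudo out-of-distribution sample by mixing two
    in-distribution sequences; the result is treated as 'unknown'."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x_a + (1.0 - lam) * x_b

def temporal_clustering_loss(feats, labels, centroids, margin=1.0, ood_label=-1):
    """Pull in-distribution features toward their class centroid; push
    OOD/mixed samples at least `margin` away from every centroid."""
    loss = feats.new_zeros(())
    for f, y in zip(feats, labels):
        if y == ood_label:
            d = torch.norm(f - centroids, dim=1).min()
            loss = loss + torch.clamp(margin - d, min=0.0)
        else:
            loss = loss + torch.norm(f - centroids[y])
    return loss / len(feats)

feats = torch.randn(6, 32)
labels = torch.tensor([0, 1, -1, 0, -1, 1])   # -1 marks mixed/OOD samples
centroids = torch.randn(2, 32)
print(temporal_clustering_loss(feats, labels, centroids))
```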

Result: Achieves significant improvements over state-of-the-art action segmentation models on Bimanual Actions and H2O datasets, with 16.9% relative gain in open-set segmentation (F1@50) and 34.6% relative gain in out-of-distribution detection (AUROC).

Conclusion: The proposed framework effectively addresses the open-world action segmentation problem by enabling detection and segmentation of unseen actions without manual annotation, with comprehensive ablation studies identifying optimal configurations for practical applications.

Abstract: Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. While most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling. 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations. 3) A novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O) datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performance (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.

[189] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

Main category: cs.CV

TL;DR: Neural-MedBench is a compact, reasoning-intensive benchmark for evaluating multimodal clinical reasoning in neurology, revealing significant performance gaps in state-of-the-art VLMs compared to conventional medical datasets.

Motivation: Existing medical benchmarks create an "evaluation illusion" by focusing on classification accuracy, making models appear proficient while they still fail at high-stakes diagnostic reasoning. There's a need for benchmarks that truly test clinical reasoning abilities.

Method: Created Neural-MedBench integrating multi-sequence MRI scans, structured EHRs, and clinical notes with three task families: differential diagnosis, lesion recognition, and rationale generation. Developed a hybrid scoring pipeline combining LLM-based graders, clinician validation, and semantic similarity metrics.
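
A minimal sketch of one way such a hybrid score could be assembled, blending an LLM grader score with embedding cosine similarity; the weights are illustrative, and the clinician-validation leg of the paper's pipeline has no automatic analogue here.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def hybrid_score(llm_grade: float, emb_pred: np.ndarray, emb_ref: np.ndarray,
                 w_llm: float = 0.6, w_sim: float = 0.4) -> float:
    """Blend an LLM grader score in [0, 1] with semantic similarity of
    prediction and reference embeddings (weights are assumptions)."""
    return w_llm * llm_grade + w_sim * cosine(emb_pred, emb_ref)

print(hybrid_score(0.8, np.random.rand(384), np.random.rand(384)))
```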

Result: Evaluation of state-of-the-art VLMs (GPT-4o, Claude-4, MedGemma) showed sharp performance drops compared to conventional datasets. Error analysis revealed reasoning failures dominate model shortcomings rather than perceptual errors.

Conclusion: Proposes a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization and depth-oriented compact benchmarks like Neural-MedBench for reasoning fidelity. Released as open diagnostic testbed to enable rigorous assessment of clinically trustworthy AI.

Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

[190] CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

Main category: cs.CV

TL;DR: A dual-model framework for embodied reference understanding that combines head-to-fingertip and wrist-to-fingertip pointing directions with CLIP-guided ensemble fusion, achieving state-of-the-art performance on multiple benchmarks.

Motivation: Existing methods for embodied reference understanding (predicting objects from pointing gestures and language) fail to fully exploit visual disambiguation signals and rely on overly limiting single-line assumptions about pointing direction alignment.

Method: Proposes a dual-model framework with one model learning from head-to-fingertip direction and another from wrist-to-fingertip direction. Uses Gaussian ray heatmap representations as input, CLIP-Aware Pointing Ensemble module for fusion, and an auxiliary object center prediction head for enhanced localization.
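
The Gaussian ray heatmap admits a direct sketch: each pixel's value decays with its distance to the ray through two keypoints (e.g., wrist to fingertip). The sigma value and the clamping of the ray at its start point are our assumptions.

```python
import numpy as np

def gaussian_ray_heatmap(h, w, start, end, sigma=8.0):
    """Heatmap whose value decays with distance from the ray that passes
    from `start` through `end` (pixel coordinates, x-y order)."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    p = np.stack([xs, ys], axis=-1)               # (h, w, 2) pixel positions
    a, b = np.asarray(start, float), np.asarray(end, float)
    d = b - a
    # Project each pixel onto the ray; clamp so the ray starts at `start`.
    t = np.clip(((p - a) @ d) / (d @ d), 0.0, None)
    closest = a + t[..., None] * d
    dist2 = np.sum((p - closest) ** 2, axis=-1)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

heat = gaussian_ray_heatmap(480, 640, start=(100, 200), end=(180, 240))
```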

Result: Achieves 75.0 mAP at 0.25 IoU on YouRefIt benchmark with state-of-the-art CLIP and C_D scores. Demonstrates robust performance on unseen CAESAR and ISL Pointing benchmarks, showing strong generalization capability.

Conclusion: The dual-model approach with complementary pointing direction models and CLIP-guided fusion effectively addresses limitations of single-line assumptions and improves multimodal reasoning for embodied reference understanding across diverse benchmarks.

Abstract: We address Embodied Reference Understanding, the task of predicting the object a person in the scene refers to through pointing gesture and language. This requires multimodal reasoning over text, visual pointing cues, and scene context, yet existing methods often fail to fully exploit visual disambiguation signals. We also observe that while the referent often aligns with the head-to-fingertip direction, in many cases it aligns more closely with the wrist-to-fingertip direction, making a single-line assumption overly limiting. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To fuse their complementary strengths, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble guided by CLIP features. We further incorporate an auxiliary object center prediction head to enhance referent localization. We validate our approach on YouRefIt, achieving 75.0 mAP at 0.25 IoU, alongside state-of-the-art CLIP and C_D scores, and demonstrate its generality on unseen CAESAR and ISL Pointing, showing robust performance across benchmarks.

[191] SmokeSeer: 3D Gaussian Splatting for Smoke Removal and Scene Reconstruction

Neham Jain, Andrew Jong, Sebastian Scherer, Ioannis Gkioulekas

Main category: cs.CV

TL;DR: SmokeSeer: A method for simultaneous 3D scene reconstruction and smoke removal from multi-view video sequences using thermal and RGB images, built on 3D Gaussian splatting.

Motivation: Real-world smoke severely degrades image quality and visibility. Existing methods either rely on data-driven priors prone to hallucinations or are limited to static low-density smoke.

Method: Uses thermal and RGB images (thermal reduces scattering), builds on 3D Gaussian splatting to fuse information from both modalities, and decomposes scene into smoke and non-smoke components.

Result: Validated on synthetic data and new real-world smoke dataset with RGB and thermal images. Handles broad range of smoke densities and adapts to temporally varying smoke.

Conclusion: SmokeSeer effectively performs simultaneous 3D reconstruction and smoke removal, outperforming prior work. Open-source implementation and data provided.

Abstract: Smoke in real-world scenes can severely degrade image quality and hamper visibility. Recent image restoration methods either rely on data-driven priors that are susceptible to hallucinations, or are limited to static low-density smoke. We introduce SmokeSeer, a method for simultaneous 3D scene reconstruction and smoke removal from multi-view video sequences. Our method uses thermal and RGB images, leveraging the reduced scattering in thermal images to see through smoke. We build upon 3D Gaussian splatting to fuse information from the two image modalities, and decompose the scene into smoke and non-smoke components. Unlike prior work, SmokeSeer handles a broad range of smoke densities and adapts to temporally varying smoke. We validate our method on synthetic data and a new real-world smoke dataset with RGB and thermal images. We provide an open-source implementation and data on the project website.

[192] Learning Generalizable Shape Completion with SIM(3) Equivariance

Yuqing Wang, Zhaiyu Chen, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: First SIM(3)-equivariant shape completion network that achieves robust generalization by being agnostic to pose and scale, outperforming existing methods under de-biased evaluation.

Motivation: Current 3D shape completion methods rely on pre-aligned scans, which leak pose and scale cues that networks exploit to memorize rather than infer intrinsic geometry. This causes performance collapse when alignment is absent in real data.

Method: Introduces a SIM(3)-equivariant shape completion network with modular layers that successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame.
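
For intuition only, a crude canonicalize-process-restore loop for the translation and scale parts of a similarity transform is sketched below; genuine SIM(3) equivariance also covers rotation and, per the paper, is built into the network layers rather than applied as preprocessing.

```python
import numpy as np

def canonicalize(points: np.ndarray):
    """Remove centroid (translation) and mean radius (scale) from a cloud."""
    center = points.mean(axis=0)
    centered = points - center
    scale = np.linalg.norm(centered, axis=1).mean()
    return centered / scale, (center, scale)

def restore(points_canon: np.ndarray, frame) -> np.ndarray:
    """Map canonical-frame points back to the original frame."""
    center, scale = frame
    return points_canon * scale + center

cloud = np.random.randn(2048, 3) * 5.0 + np.array([10.0, -3.0, 2.0])
canon, frame = canonicalize(cloud)
assert np.allclose(restore(canon, frame), cloud)
```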

Result: Outperforms both equivariant and augmentation baselines on PCN benchmark under de-biased evaluation. Sets new cross-domain records: lowers minimal matching distance on KITTI by 17% and Chamfer distance on OmniObject3D by 14%. Surprisingly outperforms competitors even under their biased settings.

Conclusion: Full SIM(3) equivariance is an effective route to truly generalizable shape completion, establishing architectural equivariance to similarity group as key for robust generalization.

Abstract: 3D shape completion methods typically assume scans are pre-aligned to a canonical frame. This leaks pose and scale cues that networks may exploit to memorize absolute positions rather than inferring intrinsic geometry. When such alignment is absent in real data, performance collapses. We argue that robust generalization demands architectural equivariance to the similarity group, SIM(3), so the model remains agnostic to pose and scale. Following this principle, we introduce the first SIM(3)-equivariant shape completion network, whose modular layers successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame. Under a de-biased evaluation protocol that removes the hidden cues, our model outperforms both equivariant and augmentation baselines on the PCN benchmark. It also sets new cross-domain records on real driving and indoor scans, lowering minimal matching distance on KITTI by 17% and Chamfer distance $\ell_1$ on OmniObject3D by 14%. Perhaps surprisingly, ours under the stricter protocol still outperforms competitors under their biased settings. These results establish full SIM(3) equivariance as an effective route to truly generalizable shape completion. Project page: https://sime-completion.github.io.

[193] Improved Segmentation of Polyps and Visual Explainability Analysis

Akwasi Asare, Thanh-Huy Nguyen, Ulas Bagci

Main category: cs.CV

TL;DR: PolypSeg-GradCAM: An explainable deep learning framework combining U-Net with ResNet-34 and Grad-CAM for transparent polyp segmentation in colonoscopy, achieving high accuracy (Dice: 0.8902) with interpretable visualizations.

Motivation: Colorectal cancer is a major global health issue with GI polyps as critical precursors. Early and accurate polyp segmentation is essential for reducing CRC progression, but manual delineation is labor-intensive and prone to variability. While deep learning shows promise for automated analysis, limited interpretability hinders clinical adoption.

Method: PolypSeg-GradCAM integrates U-Net architecture with pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for explainable polyp segmentation. The model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset containing 1,000 annotated endoscopic images.
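
A standard Grad-CAM sketch for a plain ResNet-34 classifier is shown below; the paper applies Grad-CAM to a U-Net segmentation model, so the top-class score here would be replaced by a segmentation target such as the summed mask logits.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet34

model = resnet34(weights=None).eval()
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))

x = torch.randn(1, 3, 224, 224)
logits = model(x)
score = logits[0, logits.argmax()]            # top-class score as the target
grads = torch.autograd.grad(score, feats["a"])[0]

# Grad-CAM: weight feature maps by spatially pooled gradients, ReLU, upsample.
weights = grads.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
```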

Result: Achieved mean Dice coefficient of 0.8902 ± 0.0125, mean IoU of 0.8023, and AUC-ROC of 0.9722. With optimal threshold: Sensitivity of 0.9058 and Precision of 0.9083. Grad-CAM visualizations confirmed predictions were guided by clinically relevant regions, providing insight into model decision-making.

Conclusion: Integrating segmentation accuracy with interpretability can support development of trustworthy AI-assisted colonoscopy tools, bridging the gap between technical performance and clinical adoption through transparent decision-making.

Abstract: Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates a U-Net architecture with a pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. To ensure rigorous benchmarking, the model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset of 1,000 annotated endoscopic images. Experimental results show a mean Dice coefficient of 0.8902 ± 0.0125, a mean Intersection-over-Union (IoU) of 0.8023, and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.9722. Advanced quantitative analysis using an optimal threshold yielded a Sensitivity of 0.9058 and Precision of 0.9083. Additionally, Grad-CAM visualizations confirmed that predictions were guided by clinically relevant regions, offering insight into the model’s decision-making process. This study demonstrates that integrating segmentation accuracy with interpretability can support the development of trustworthy AI-assisted colonoscopy tools.

[194] FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis

Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu

Main category: cs.CV

TL;DR: FAST is a foreground-aware diffusion framework for industrial anomaly segmentation that uses training-free accelerated sampling and foreground-aware reconstruction to efficiently generate high-quality, structure-specific anomalies.

Motivation: Industrial anomaly segmentation faces challenges due to scarce, diverse, and costly pixel-level annotations. Existing segmentation-oriented industrial anomaly synthesis (SIAS) methods struggle with balancing sampling efficiency and generation quality, and treat all spatial regions uniformly, overlooking statistical differences between anomaly and background areas.

Method: Proposes FAST with two novel modules: 1) Anomaly-Informed Accelerated Sampling (AIAS) - a training-free sampling algorithm using coarse-to-fine aggregation to synthesize segmentation-oriented anomalies in as few as 10 steps, and 2) Foreground-Aware Reconstruction Module (FARM) - adaptively adjusts anomaly-aware noise within masked foreground regions at each sampling step to preserve localized anomaly signals.

Result: Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks.

Conclusion: FAST provides an effective solution for industrial anomaly segmentation by addressing the limitations of existing SIAS methods through efficient sampling and foreground-aware reconstruction, enabling controllable, structure-specific anomaly synthesis for segmentation tasks.

Abstract: Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.

[195] DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream

Junhao He, Jiaxu Wang, Jia Li, Mingyuan Sun, Qiang Zhang, Jiahang Cao, Ziyi Zhang, Yi Gu, Jingkai Sun, Renjing Xu

Main category: cs.CV

TL;DR: A novel framework that combines low-framerate RGB videos with high-framerate event streams to reconstruct Dynamic 3D Gaussian Splatting, using event motion priors to guide deformation field optimization and overcome large inter-frame motion challenges.

Motivation: Reconstructing Dynamic 3DGS from low-framerate RGB videos is challenging due to large inter-frame motions increasing solution space uncertainty. Event cameras capture rapid visual changes robustly but lack color information. Combining both modalities could address the motion challenge, but joint optimization is difficult due to modality discrepancies.

Method: 1) Extract motion priors from event streams using LoCM unsupervised fine-tuning to adapt event flow estimators to specific scenes. 2) Use geometry-aware data association to build event-Gaussian motion correspondence. 3) Employ motion decomposition and inter-frame pseudo-label strategies to guide deformation field optimization.

Result: Extensive experiments show the method outperforms existing image and event-based approaches across synthetic and real scenes, proving effective optimization of dynamic 3DGS with event data assistance.

Conclusion: The proposed framework successfully combines low-temporal-resolution RGB with high-framerate event streams to reconstruct Dynamic 3DGS, overcoming large inter-frame motion challenges through event motion priors and novel data association techniques.

Abstract: Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a given unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and prove that our method can effectively optimize dynamic 3DGS with the help of event data.

[196] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks

Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang

Main category: cs.CV

TL;DR: Dream4Drive is a synthetic data generation framework that creates multi-view photorealistic driving videos to enhance perception models, particularly for corner cases, using 3D-aware guidance maps and assets.

Motivation: Existing driving world models focus on generation quality but overlook downstream perception task evaluation. Current synthetic data methods require pretraining+finetuning (double epochs), making benefits negligible when baseline uses same training time. Need better synthetic data to truly boost perception performance.

Method: 1) Decompose input video into 3D-aware guidance maps. 2) Render 3D assets onto these maps. 3) Fine-tune driving world model to produce edited, multi-view photorealistic videos for training perception models. Also contributes DriveObj3D dataset with typical driving scenario categories.

Result: Enables unprecedented flexibility in generating multi-view corner cases at scale. Significantly boosts corner case perception in autonomous driving. Effectively improves downstream perception models under various training epochs (unlike existing methods).

Conclusion: Dream4Drive provides a novel synthetic data generation framework that genuinely enhances perception tasks by creating realistic multi-view driving videos with 3D-aware editing, addressing limitations of existing methods and advancing autonomous driving perception.

Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Page: https://wm-research.github.io/Dream4Drive/ GitHub Link: https://github.com/wm-research/Dream4Drive

[197] LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Huanlin Gao, Ping Chen, Fuyuan Shi, Chao Tan, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian

Main category: cs.CV

TL;DR: LeMiCa is a training-free acceleration framework for diffusion-based video generation that uses lexicographic minimax path optimization to bound global errors, achieving 2.9x speedup on Latte model with minimal quality degradation.

Motivation: Existing caching strategies for video generation acceleration focus on reducing local heuristic errors but overlook global error accumulation, leading to noticeable content degradation between accelerated and original videos.

Method: Formulates cache scheduling as a directed graph with error-weighted edges and introduces Lexicographic Minimax Path Optimization strategy that explicitly bounds worst-case path error to improve global content and style consistency.
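
The minimax (bottleneck-path) core of this objective can be sketched as dynamic programming over a DAG of timesteps; the lexicographic refinement (tie-breaking on the second-worst edge, and so on) is omitted, and the error matrix below is a random stand-in for the paper's estimated caching errors.

```python
import numpy as np

def minimax_path(err: np.ndarray, k_steps: int):
    """Path 0 -> T with exactly k_steps edges minimizing the worst edge
    error; err[i, j] is the cost of jumping from step i to step j > i."""
    n = err.shape[0]
    best = np.full((k_steps + 1, n), np.inf)  # best[e, j]: minimax cost to j in e edges
    prev = np.full((k_steps + 1, n), -1, dtype=int)
    best[0, 0] = 0.0
    for e in range(1, k_steps + 1):
        for j in range(1, n):
            for i in range(j):
                cost = max(best[e - 1, i], err[i, j])
                if cost < best[e, j]:
                    best[e, j], prev[e, j] = cost, i
    path, j = [n - 1], n - 1
    for e in range(k_steps, 0, -1):
        j = prev[e, j]
        path.append(j)
    return path[::-1], best[k_steps, n - 1]

T = 20                                        # denoising steps 0..T
err = np.random.default_rng(0).random((T + 1, T + 1))
print(minimax_path(err, k_steps=6))
```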

Result: Achieves 2.9x speedup on Latte model and reaches LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques with minimal perceptual quality degradation.

Conclusion: LeMiCa provides a robust and generalizable paradigm for accelerating diffusion-based video generation that improves both inference speed and generation quality, serving as a strong foundation for future research on efficient video synthesis.

Abstract: We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at: https://github.com/UnicomAI/LeMiCa

[198] From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf

Main category: cs.CV

TL;DR: The paper investigates small Visual Language Models for scene interpretation and action recognition on edge devices in mobile robotics, evaluating their capabilities, challenges, and biases on real-world datasets.

Motivation: While LLMs and VLMs have advanced video understanding and scene interpretation, their computational complexity makes deployment on edge devices and mobile robotics challenging due to accuracy vs. inference time trade-offs.

Method: The paper proposes a pipeline using state-of-the-art small VLMs capable of edge deployment, evaluated on a diverse dataset of real-world cityscape, on-campus, and indoor scenarios.

Result: Experimental evaluation discusses the potential of small models on edge devices, with emphasis on challenges, weaknesses, inherent model biases, and application of gained information.

Conclusion: The research provides insights into deploying small VLMs for scene interpretation on edge devices in mobile robotics, highlighting both capabilities and limitations through comprehensive evaluation.

Abstract: Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with, and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/

[199] Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data

Jessica Plassmann, Nicolas Schuler, Georg von Freymann, Michael Schuth

Main category: cs.CV

TL;DR: This paper explores unsupervised deep learning methods for automated anomaly detection in shearographic images, comparing three architectures trained on defect-free data to reduce reliance on expert interpretation and labeled data.

Motivation: Shearography is a valuable non-destructive testing method but has limited industrial adoption due to the need for expert interpretation. The study aims to reduce reliance on labeled data and manual evaluation by exploring unsupervised learning for automated anomaly detection.

Method: Three unsupervised architectures were evaluated: fully connected autoencoder, convolutional autoencoder, and student-teacher feature matching model. All models were trained solely on defect-free data. A controlled dataset was developed using custom specimens with reproducible defect patterns, with two training subsets: one with only undistorted defect-free samples, and another including globally deformed but defect-free data to simulate practical conditions.
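
The convolutional-autoencoder baseline is the simplest of the three and easy to sketch: train on defect-free shearograms only, then flag images whose reconstruction error is high. The architecture below is our own minimal choice, not the paper's.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Small convolutional autoencoder for 1-channel shearographic images."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(model: ConvAE, x: torch.Tensor) -> torch.Tensor:
    """Per-image mean squared reconstruction error as the anomaly score."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=(1, 2, 3))

model = ConvAE().eval()
scores = anomaly_score(model, torch.randn(8, 1, 128, 128))
```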

Result: The student-teacher approach achieved superior classification robustness and enabled precise spatial defect localization. It demonstrated improved separability of feature representations compared to autoencoder-based models, as visualized through t-SNE embeddings. A YOLOv8 model trained on labeled data served as a reference benchmark for localization quality.

Conclusion: The study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments, with the student-teacher model showing particular promise for robust anomaly detection and localization.

Abstract: Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.

[200] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang, Lingjuan Lyu, Handing Wang, Yaochu Jin

Main category: cs.CV

TL;DR: VLA-Fool is a comprehensive study exposing the vulnerability of Vision-Language-Action models to multimodal adversarial attacks, showing that even minor perturbations can cause significant behavioral deviations in embodied environments.

DetailsMotivation: The adversarial robustness of Vision-Language-Action (VLA) models in embodied environments remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies focus on single-modality perturbations and overlook cross-modal misalignment that fundamentally affects embodied reasoning and decision-making.

Method: VLA-Fool introduces three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that disrupt semantic correspondence between perception and instruction. It also incorporates a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework.

Result: Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

Conclusion: The study exposes critical vulnerabilities in VLA models and highlights the need for more robust multimodal alignment in embodied AI systems, as current models are fragile to carefully crafted multimodal adversarial attacks.

Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
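
For orientation, the simplest white-box visual perturbation in this family is a one-step gradient-sign attack on the input pixels; a generic sketch (VLA-Fool's patch, noise, and cross-modal attacks are more elaborate, and `model` and `loss_fn` here are hypothetical stand-ins):

```python
import torch

def fgsm_perturb(model, loss_fn, image, target, epsilon=8 / 255):
    # One gradient-sign step on the pixels, clipped back to a valid range.
    image = image.clone().detach().requires_grad_(True)
    loss_fn(model(image), target).backward()
    return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
```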

[201] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald Dansereau, Niko Sünderhauf, Dimity Miller

Main category: cs.CV

TL;DR: First online scene change detection method that is pose-agnostic, label-free, maintains multi-view consistency, runs at 10+ FPS, and outperforms even offline approaches.

DetailsMotivation: Online Scene Change Detection (SCD) is challenging due to unconstrained viewpoints and real-time requirements. Existing online methods are significantly less accurate than offline approaches, creating a need for a high-performance online solution.

Method: Three key components: 1) Self-supervised fusion loss to infer scene changes from multiple cues and observations, 2) PnP-based fast pose estimation against reference scene, 3) Fast change-guided update strategy for 3D Gaussian Splatting scene representation.

Result: Achieves new state-of-the-art performance, surpassing even the best offline approaches. Operates at over 10 FPS while maintaining pose-agnostic, label-free operation with multi-view consistency.

Conclusion: The proposed approach represents the first online SCD method that combines real-time performance with superior accuracy, demonstrating effectiveness through extensive experiments on complex real-world datasets.

Abstract: Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.
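
The PnP-based pose step can be realized with OpenCV's RANSAC solver once 2D-3D correspondences against the reference scene are available (establishing those correspondences is part of the method and not shown here); a minimal sketch:

```python
import cv2
import numpy as np

def estimate_pose(object_points, image_points, K):
    # object_points: (N, 3) 3D points from the reference scene representation;
    # image_points: (N, 2) matched pixels in the current frame; K: (3, 3) intrinsics.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points.astype(np.float32),
        image_points.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
    )
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix; tvec is the translation
    return ok, R, tvec
```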

[202] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

Main category: cs.CV

TL;DR: Fine-tuning text-to-video models with sparse synthetic data for camera controls outperforms using photorealistic real data.

DetailsMotivation: Fine-tuning large text-to-video diffusion models for new generative controls (like camera parameters) typically requires vast, high-quality datasets that are difficult and expensive to acquire.

Method: Proposes a data-efficient fine-tuning strategy that learns camera controls from sparse, low-quality synthetic data rather than photorealistic real data.

Result: Fine-tuning on simple synthetic data not only enables desired camera controls but actually yields superior results compared to models fine-tuned on photorealistic real data.

Conclusion: Provides a framework that justifies why synthetic data outperforms real data for learning camera controls in text-to-video models, offering both intuitive and quantitative explanations.

Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic “real” data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

[203] Seeing Through the Rain: Resolving High-Frequency Conflicts in Deraining and Super-Resolution via Diffusion Guidance

Wenjie Li, Jinglei Shi, Jin Han, Heng Guo, Zhanyu Ma

Main category: cs.CV

TL;DR: DHGM uses diffusion models with high-frequency guidance to jointly remove rain artifacts and enhance details for small object detection, avoiding conflicts between separate restoration and super-resolution steps.

DetailsMotivation: Real-world images are often degraded by weather, but existing weather restoration methods sacrifice high-frequency details crucial for small object detection. Cascading restoration and super-resolution creates conflicts since removal eliminates high-frequency noise while SR hallucinates high-frequency textures from existing details.

Method: DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details, using a Diffusion-based High-frequency Guided Model for joint weather removal and super-resolution.

Result: Extensive experiments demonstrate that DHGM achieves superior performance over existing methods with lower computational costs.

Conclusion: DHGM effectively bridges the conflict between weather removal and super-resolution, generating clean and high-resolution images suitable for small object detection tasks.

Abstract: Clean images are crucial for visual tasks such as small object detection, especially at high resolutions. However, real-world images are often degraded by adverse weather, and weather restoration methods may sacrifice high-frequency details critical for analyzing small objects. A natural solution is to apply super-resolution (SR) after weather removal to recover both clarity and fine structures. However, simply cascading restoration and SR struggles to bridge their inherent conflict: removal aims to remove high-frequency weather-induced noise, while SR aims to hallucinate high-frequency textures from existing details, leading to inconsistent restoration content. In this paper, we take deraining as a case study and propose DHGM, a Diffusion-based High-frequency Guided Model for generating clean and high-resolution images. DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details. Extensive experiments demonstrate that DHGM achieves superior performance over existing methods, with lower computational costs.
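
To make the high-frequency guidance concrete: a high-pass residual is simply the image minus a low-pass copy of itself. A minimal sketch, with a box blur standing in for whichever filter DHGM actually uses:

```python
import torch
import torch.nn.functional as F

def high_pass(x, ksize=5):
    # High-frequency residual: subtract a depthwise box-blurred (low-pass) copy.
    c = x.shape[1]
    kernel = torch.full((c, 1, ksize, ksize), 1.0 / ksize**2, device=x.device)
    low = F.conv2d(x, kernel, padding=ksize // 2, groups=c)
    return x - low

# A guidance term can then compare the high-pass content of the derained
# estimate against that of the super-resolved output to keep them consistent.
```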

[204] Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models

Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi, Junfeng Luo, Jialin Gao

Main category: cs.CV

TL;DR: PicWorld is a benchmark for evaluating text-to-image models’ implicit world knowledge and physical causal reasoning, using 1,100 prompts and a multi-agent evaluator to assess physical realism and logical consistency.

DetailsMotivation: Current T2I models fail on prompts requiring implicit world knowledge, and existing evaluation methods focus too narrowly on compositional alignment or single-round VQA scoring, leaving knowledge grounding, multi-physics interactions, and evidence-based assessment undertested.

Method: Introduced PicWorld benchmark with 1,100 prompts across three core categories, and PW-Agent - an evidence-grounded multi-agent evaluator that hierarchically assesses images by decomposing prompts into verifiable visual evidence.

Result: Evaluation of 17 mainstream T2I models shows they universally exhibit fundamental limitations in implicit world knowledge and physical causal reasoning to varying degrees.

Conclusion: The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems to overcome current limitations in world knowledge and physical reasoning.

Abstract: Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses T2I models’ grasp of implicit world knowledge and physical causal reasoning. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they all exhibit, to varying degrees, fundamental limitations in implicit world knowledge and physical causal reasoning. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.

[205] ConsistCompose: Unified Multimodal Layout Control for Image Composition

Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Dahua Lin, Quan Wang

Main category: cs.CV

TL;DR: ConsistCompose is a unified multimodal framework that enables layout-controlled multi-instance image generation by embedding layout coordinates directly into language prompts, using a single generative interface for both visual grounding and generation tasks.

DetailsMotivation: Current unified multimodal models focus primarily on visual grounding (aligning language with image regions) while neglecting the generative counterpart - linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, which limits precise compositional control in image generation.

Method: The framework embeds layout coordinates directly into language prompts, uses instance-coordinate binding prompts and coordinate-aware classifier-free guidance to translate linguistic layout cues into spatial control, and operates within a single generative interface for interleaved image-text tasks. They also created ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations.

Result: Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and maintaining competitive general multimodal understanding capabilities.

Conclusion: ConsistCompose establishes a unified paradigm for layout-controllable multimodal image generation, bridging the gap between visual grounding and generative capabilities in multimodal systems.

Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored and limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
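
To illustrate what embedding layout coordinates in the language prompt might look like, normalized boxes can be serialized straight into the text; the exact token format ConsistCompose uses is not given here, so this is an illustrative guess:

```python
def layout_prompt(scene, instances):
    # instances: list of (description, (x0, y0, x1, y1)) with coordinates in [0, 1]
    parts = [
        f"{desc} at [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]"
        for desc, (x0, y0, x1, y1) in instances
    ]
    return f"{scene} with " + " and ".join(parts)

print(layout_prompt("A kitchen table", [
    ("a red mug", (0.10, 0.55, 0.35, 0.90)),
    ("an open laptop", (0.45, 0.30, 0.95, 0.85)),
]))
```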

[206] Thinking Ahead: Foresight Intelligence in MLLMs and World Models

Zhantao Gong, Liaoyuan Fan, Qing Guo, Xun Xu, Xulei Yang, Shijie Li

Main category: cs.CV

TL;DR: FSU-QA is a new VQA dataset designed to evaluate Foresight Intelligence - the ability to anticipate future events. It reveals current VLMs struggle with foresight reasoning, but fine-tuning on FSU-QA enables even small models to outperform larger ones.

DetailsMotivation: Foresight Intelligence (anticipating future events) is essential for applications like autonomous driving but has been largely overlooked in existing research. There's a need for datasets and benchmarks specifically designed to evaluate this capability.

Method: Introduce FSU-QA, a new Visual Question-Answering dataset specifically designed for Foresight Intelligence evaluation. Use it to benchmark state-of-the-art VLMs, assess world models through semantic coherence of predictions, and fine-tune VLMs on the dataset.

Result: Current VLMs struggle with foresight reasoning tasks. FSU-QA enables effective world model assessment. Even small VLMs fine-tuned on FSU-QA substantially outperform larger, advanced models on foresight reasoning tasks.

Conclusion: FSU-QA provides a principled foundation for developing next-generation models capable of anticipating and understanding future events, bridging a critical gap in AI research for foresight intelligence.

Abstract: In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.

[207] GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, JunQiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, Yandan Luo

Main category: cs.CV

TL;DR: GuideFlow is a novel autonomous driving planning framework using Constrained Flow Matching to address mode collapse in imitative planners and constraint handling in generative planners, achieving SOTA performance on driving benchmarks.

DetailsMotivation: Current E2E autonomous driving planners have two main issues: Imitative Planners suffer from multimodal trajectory mode collapse (lack of diversity), while Generative Planners struggle to incorporate safety/physical constraints directly into generation, requiring additional optimization stages.

Method: GuideFlow leverages Constrained Flow Matching with two key innovations: 1) Directly enforces explicit constraints within flow matching generation (not implicit encoding), 2) Unifies flow matching training with Energy-Based Model (EBM) to enhance autonomous optimization for physical constraints. Also parameterizes driving aggressiveness as a control signal for trajectory style manipulation.

Result: Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim, ADV-NuScenes) validate effectiveness. Achieved SOTA on NavSim test hard split (Navhard) with EPDMS score of 43.0.

Conclusion: GuideFlow successfully addresses limitations of existing E2E planners by combining the benefits of flow matching (diversity) with explicit constraint enforcement and EBM-enhanced optimization, while enabling controllable trajectory generation through aggressiveness parameterization.

Abstract: Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies flow matching training with an Energy-Based Model (EBM) to enhance the model’s autonomous optimization capability to robustly satisfy physical constraints. Secondly, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim, and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieved SOTA with an EPDMS score of 43.0. The code will be available at https://github.com/liulin815/GuideFlow.
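
One generic way to enforce an explicit constraint inside flow-matching generation is gradient-corrected Euler integration of the learned velocity field; a sketch of that idea only (`constraint_energy` is a hypothetical differentiable cost such as collision or curvature, and this is not GuideFlow's exact update rule):

```python
import torch

def constrained_flow_sample(velocity_net, constraint_energy, x, steps=20, lam=0.1):
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(constraint_energy(x).sum(), x)[0]
        with torch.no_grad():
            # Step along the learned velocity, nudged away from constraint violations.
            x = x + dt * (velocity_net(x, t) - lam * grad)
    return x.detach()
```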

[208] HunyuanOCR Technical Report

Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Xinsong Zhang, Jinnian Zhang, Houwen Peng, Hongming Yang, Senhao Xie, Longsha Zhou, Ge Pei, Binghong Wu, Rui Yan, Kan Wu, Jieneng Yang, Bochao Wang, Kai Liu, Jianchen Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang

Main category: cs.CV

TL;DR: HunyuanOCR is a 1B-parameter open-source VLM for OCR that outperforms commercial APIs and larger models, achieving SOTA results with unified capabilities, end-to-end architecture, and RL optimization.

DetailsMotivation: Address limitations of narrow OCR expert models and inefficient general VLMs by creating a lightweight yet comprehensive OCR solution that unifies versatility and efficiency.

Method: Uses ViT + lightweight LLM with MLP adapter in pure end-to-end architecture. Employs data-driven approach and RL strategies for optimization, with vLLM-based deployment.

Result: Outperforms commercial APIs, traditional pipelines, and larger models (including Qwen3-VL-4B). Achieves 1st place in ICDAR 2025 DIMT Challenge (Small Model Track) and SOTA on OCRBench for <3B parameter VLMs.

Conclusion: HunyuanOCR provides a commercial-grade, open-source OCR solution that advances research and offers solid foundation for industrial applications through its unified capabilities, streamlined architecture, and RL-enhanced performance.

Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow “OCR expert models” and inefficient “General VLMs”. 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

[209] AttenDence: Maximizing Attention Confidence for Test Time Adaptation

Yash Mali

Main category: cs.CV

TL;DR: Attention entropy minimization for test-time adaptation improves robustness to distribution shifts by making CLS token attention more confident on single test images.

DetailsMotivation: Transformers provide additional unsupervised learning signals through attention mechanisms beyond just output entropy minimization, which can be leveraged for better test-time adaptation to distribution shifts.

Method: Propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective, encouraging more confident attention to relevant regions even with single test images.

Result: Attention entropy minimization improves robustness across diverse corruption types while maintaining performance on clean data, effective even with single-image test streams.

Conclusion: Attention entropy minimization is an effective TTA objective that leverages transformer attention mechanisms to enhance adaptation to distribution shifts at test time.

Abstract: Test-time adaptation (TTA) enables models to adapt to distribution shifts at inference time. While entropy minimization over the output distribution has proven effective for TTA, transformers offer an additional unsupervised learning signal through their attention mechanisms. We propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective. This approach encourages the model to attend more confidently to relevant image regions under distribution shift and is effective even when only a single test image is available. We demonstrate that attention entropy minimization improves robustness across diverse corruption types on a single-sample stream of test images, without hurting performance on clean data.
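
The objective itself is compact: the entropy of the CLS row of a softmaxed attention map, minimized at test time. A minimal sketch, with the tensor layout and the choice of which parameters to adapt as assumptions:

```python
import torch

def cls_attention_entropy(attn, eps=1e-8):
    # attn: (batch, heads, tokens, tokens), already softmaxed; token 0 is CLS.
    p = attn[:, :, 0, 1:]                               # CLS -> patch weights
    p = p / (p.sum(dim=-1, keepdim=True) + eps)         # renormalize without CLS->CLS
    return -(p * (p + eps).log()).sum(dim=-1).mean()    # mean entropy over heads

# TTA sketch: forward one test image with attention maps exposed, backprop this
# loss into a small parameter subset (e.g. LayerNorms), then re-predict.
```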

[210] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

TL;DR: A self-prompting framework adapts SAM to remote sensing imagery using only sparse point annotations through a Refine-Requery-Reinforce loop, achieving better performance than pretrained SAM and other point-supervised methods.

DetailsMotivation: Segment Anything Model (SAM) performs poorly on remote sensing imagery due to domain shift and lack of dense annotations. There's a need to adapt foundation models like SAM to specialized domains like remote sensing without requiring expensive full-mask supervision.

Method: Proposes a self-prompting, point-supervised framework with a Refine-Requery-Reinforce loop: 1) Refine coarse pseudo-masks from initial points, 2) Requery with self-constructed box prompts, 3) Reinforce by aligning embeddings across iterations to reduce confirmation bias.

Result: Outperforms pretrained SAM and recent point-supervised segmentation methods on three RSI benchmark datasets (WHU, HRSID, NWPU VHR-10), demonstrating improved segmentation quality and domain robustness without full-mask supervision.

Conclusion: Self-prompting and semantic alignment provide an efficient path for scalable, point-level adaptation of foundation segmentation models to remote sensing applications, addressing domain shift with minimal annotation requirements.

Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM’s segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
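
The Requery stage's self-constructed box prompt is the most mechanical piece; a minimal sketch of the box extraction, with the surrounding loop paraphrased in comments (`predictor` and `alignment_loss` are hypothetical stand-ins, not SAM's actual API):

```python
import numpy as np

def mask_to_box(mask):
    # Tight [x0, y0, x1, y1] box around a binary mask, reusable as a box prompt.
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

# Loop sketch per image:
#   mask = predictor.predict(points=pts)              # Refine: points -> pseudo-mask
#   mask = predictor.predict(box=mask_to_box(mask))   # Requery: self-built box prompt
#   loss = alignment_loss(embed_now, embed_prev)      # Reinforce: align across iterations
```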

[211] Architecture Decoupling Is Not All You Need For Unified Multimodal Model

Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li

Main category: cs.CV

TL;DR: Proposes Attention Interaction Alignment (AIA) loss to mitigate task conflicts in unified multimodal models without model decoupling, improving both generation and understanding performance.

DetailsMotivation: Unified multimodal models face inherent conflicts between understanding and generation tasks. Current solutions use model decoupling (e.g., double encoders, MOE/MOT, frozen MLLM), but excessive decoupling loses interleave generation ability. The paper aims to mitigate task conflicts without resorting to model decoupling.

Method: Analyzes why decoupling alleviates conflicts by studying cross-modal attention behavior. Observes that decoupling drives models toward task-specific multimodal interaction patterns. Proposes Attention Interaction Alignment (AIA) loss that explicitly learns task-specific multimodal interaction patterns during training. Applied to Emu3 and Janus-Pro during SFT and post-training stages respectively.

Result: AIA loss refines cross-modal attention patterns and boosts both generation and understanding performance without requiring model decoupling. Demonstrates generalizability across different models and training stages.

Conclusion: Attention Interaction Alignment provides an effective alternative to model decoupling for mitigating task conflicts in unified multimodal models, preserving interleave generation ability while improving performance on both understanding and generation tasks.

Abstract: Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherently conflicting objectives in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MOE/MOT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

[212] Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data

Ivo Bueno, Ruikun Hou, Babette Bühler, Tim Fütterer, James Drimalla, Jonathan Kyle Foster, Peter Youngs, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci

Main category: cs.CV

TL;DR: AI-driven multimodal analysis of classroom recordings using video and transcript data to automate instructional activity and discourse recognition for scalable teacher feedback.

DetailsMotivation: Manual classroom observation is resource-intensive and hard to scale, creating a need for automated systems that can provide concrete feedback to teachers through AI analysis of classroom recordings.

Method: Parallel modality-specific pipelines: (1) For video: evaluated zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels; (2) For transcripts: fine-tuned transformer-based classifier with contextualized inputs and compared against prompting-based LLMs on 19 discourse labels. Used techniques like per-label thresholding, context windows, and imbalance-aware loss functions to handle class imbalance and multi-label complexity.

Result: Fine-tuned models consistently outperformed prompting-based approaches, achieving macro-F1 scores of 0.577 for video analysis and 0.460 for transcript analysis, demonstrating feasibility of automated classroom analysis.

Conclusion: The research establishes a foundation for scalable teacher feedback systems by showing that AI-driven multimodal analysis of classroom recordings is feasible and that fine-tuned models perform better than prompting-based approaches for both video and transcript analysis.

Abstract: Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
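
Among the imbalance-handling techniques listed, per-label thresholding is the most mechanical; a minimal sketch of one common realization (the threshold grid and the F1 criterion are assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def fit_per_label_thresholds(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    # probs, y_true: (n_samples, n_labels) validation scores and binary labels.
    # Pick, per label, the threshold that maximizes F1 instead of a global 0.5.
    thresholds = np.full(probs.shape[1], 0.5)
    for j in range(probs.shape[1]):
        scores = [f1_score(y_true[:, j], probs[:, j] >= t, zero_division=0)
                  for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds
```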

[213] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Ning Yu, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan

Main category: cs.CV

TL;DR: Video4Spatial framework shows video diffusion models can perform complex spatial tasks like scene navigation and object grounding using only visual data, without depth or pose information.

DetailsMotivation: To investigate whether video generative models can exhibit visuospatial intelligence (a key human cognitive capability) using only visual data, without relying on auxiliary modalities like depth or poses.

Method: Video4Spatial framework uses video diffusion models conditioned solely on video-based scene context. Simple yet effective design choices in framework architecture and data curation enable spatial understanding from video-only inputs.

Result: The model successfully plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments.

Conclusion: Video generative models can achieve strong spatial understanding from video context alone, advancing them toward general visuospatial reasoning capabilities.

Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

[214] AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar

Main category: cs.CV

TL;DR: AFRAgent is a lightweight instruct-BLIP-based multimodal model for mobile UI automation that achieves state-of-the-art performance while being 4x smaller than competitors through adaptive feature renormalization.

DetailsMotivation: Mobile UI automation has growing industrial demand, and while VLMs enable autonomous task execution, current models struggle with accurate widget identification and action determination due to limited spatial information in vision features, plus they are often large and slow.

Method: AFRAgent uses an instruct-BLIP-based multimodal architecture with adaptive feature renormalization - a token-level affine transformation technique that enriches low-resolution image embeddings and fuses high-resolution details to enhance image embeddings in the LLM pipeline.

Result: AFRAgent achieves superior performance on Meta-GUI and AITW benchmarks, establishing new state-of-the-art for smartphone automation while being less than one-fourth the size of its nearest competitor.

Conclusion: AFRAgent demonstrates that lightweight models can achieve top performance in GUI automation through effective feature enhancement techniques, offering a more efficient solution for real-world mobile automation applications.

Abstract: There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
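
A FiLM-style reading of the token-level affine transformation described above might look as follows; AFRAgent's actual fusion is likely more involved, so treat this as a sketch under that assumption:

```python
import torch
import torch.nn as nn

class AdaptiveFeatureRenorm(nn.Module):
    # High-resolution tokens predict a per-token scale and shift that
    # modulate the low-resolution image embeddings fed to the LLM.
    def __init__(self, dim):
        super().__init__()
        self.to_affine = nn.Linear(dim, 2 * dim)

    def forward(self, low_res_tokens, high_res_tokens):
        # Both inputs: (batch, n_tokens, dim), assumed spatially aligned.
        gamma, beta = self.to_affine(high_res_tokens).chunk(2, dim=-1)
        return (1 + gamma) * low_res_tokens + beta
```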

[215] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li

Main category: cs.CV

TL;DR: ThinkDeeper is a world model-based framework for visual grounding in autonomous driving that reasons about future spatial states to handle ambiguous commands, achieving state-of-the-art performance across multiple benchmarks.

DetailsMotivation: Existing visual grounding methods for autonomous vehicles struggle with ambiguous, context-dependent instructions because they lack reasoning capabilities over 3D spatial relations and anticipated scene evolution.

Method: Proposes ThinkDeeper with a Spatial-Aware World Model (SA-WM) that distills current scenes into command-aware latent states and rolls out future latent states for forward-looking cues. Uses hypergraph-guided decoder to hierarchically fuse states with multimodal input, capturing higher-order spatial dependencies. Also introduces DrivePilot dataset created using RAG and CoT-prompted LLM pipeline.

Result: ThinkDeeper ranks #1 on Talk2Car leaderboard and surpasses SOTA on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Shows strong robustness in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance with only 50% training data.

Conclusion: Reasoning about future spatial states through world models significantly improves visual grounding performance for autonomous driving, enabling better handling of ambiguous instructions and complex scenarios.

Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

[216] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

Main category: cs.CV

TL;DR: dots.ocr is a unified Vision-Language Model that jointly learns document layout parsing tasks (layout detection, text recognition, relational understanding) in an end-to-end framework, achieving SOTA performance on benchmarks with strong multilingual capabilities.

DetailsMotivation: Current document layout parsing methods use fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. There's a need for a unified approach to enable AI to better access and interpret structured knowledge from documents.

Method: Introduces dots.ocr, a single Vision-Language Model that jointly learns three core document parsing tasks within a unified end-to-end framework. Uses a highly scalable data engine to synthesize a vast multilingual corpus for training.

Result: Achieves state-of-the-art performance on comprehensive OmniDocBench benchmark. Introduces XDocParse benchmark spanning 126 languages, where dots.ocr outperforms next-best competitor by +7.4 points, demonstrating unparalleled multilingual capabilities.

Conclusion: The unified end-to-end paradigm for document layout parsing proves effective, with dots.ocr establishing a powerful new baseline for document intelligence research, particularly in multilingual settings.

Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world’s vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

[217] Glance: Accelerating Diffusion Models with 1 Sample

Zhuobai Dong, Rui Zhao, Songjie Wu, Junchao Yi, Linjie Li, Zhengyuan Yang, Lijuan Wang, Alex Jinpeng Wang

Main category: cs.CV

TL;DR: A new method uses two lightweight LoRA adapters (Slow-LoRA and Fast-LoRA) to accelerate diffusion models by applying different speedups to different denoising phases, achieving 5× acceleration with minimal training cost.

DetailsMotivation: Diffusion models have high computational costs and require many inference steps. Previous distillation methods for fewer-step inference suffer from heavy retraining costs and degraded generalization.

Method: Instead of evenly accelerating all steps, the method applies smaller speedups to early semantic stages and larger ones to later redundant phases. It uses two lightweight LoRA adapters (Slow-LoRA for slow denoising phases, Fast-LoRA for fast phases) attached to the base model, requiring minimal training.

Result: Achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. The LoRA experts are trained with only 1 sample on a single V100 within one hour, yet generalize strongly on unseen prompts.

Conclusion: The phase-aware acceleration strategy with lightweight LoRA adapters provides an efficient and generalizable solution for accelerating diffusion models, overcoming limitations of previous distillation methods.

Abstract: Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
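
The phase-aware routing reduces to a timestep-dependent adapter switch; a minimal sketch (the switch point and adapter names are guesses, not values from the paper):

```python
def select_adapter(step, total_steps, switch_frac=0.4):
    # Early, semantics-forming steps use the gently accelerated expert;
    # later, redundant steps use the aggressive one.
    return "slow_lora" if step < switch_frac * total_steps else "fast_lora"

# With a PEFT-style model: model.set_adapter(select_adapter(i, num_steps))
```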

[218] Hierarchical Attention for Sparse Volumetric Anomaly Detection in Subclinical Keratoconus

Lynn Kandakji, William Woof, Nikolas Pontikos

Main category: cs.CV

TL;DR: Hierarchical attention architectures outperform CNNs and ViTs for detecting subtle, spatially distributed anomalies in 3D medical imaging, achieving 21-23% higher sensitivity/specificity for subclinical keratoconus detection.

DetailsMotivation: Detecting weak, spatially distributed anomalies in volumetric medical imaging is challenging due to difficulty integrating subtle signals across non-adjacent regions. Current approaches struggle with balancing local detail and global context.

Method: Controlled comparison of 16 architectures (convolutional, hybrid, transformer families) for subclinical keratoconus detection from 3D AS-OCT. Mechanistic analyses include attention-distance measurements, representational similarity analysis, and auxiliary age/sex prediction tasks.

Result: Hierarchical architectures achieve 21-23% higher sensitivity and specificity in the subclinical regime. Their advantage stems from spatial-scale alignment: hierarchical windowing yields effective receptive fields matched to the intermediate extent of subclinical abnormalities, avoiding the excessive locality of CNNs and the diffuse integration of ViTs. Hierarchical attention also learns a distinct feature space that balances local structure with long-range interactions.

Conclusion: Hierarchical attention offers principled approach for early pathological change analysis in medical imaging. Findings provide design guidance for volumetric anomaly detection, with hierarchical windowing producing effective receptive fields matched to subclinical abnormalities.

Abstract: The detection of weak, spatially distributed anomalies in volumetric medical imaging remains challenging due to the difficulty of integrating subtle signals across non-adjacent regions. This study presents a controlled comparison of sixteen architectures spanning convolutional, hybrid, and transformer families for subclinical keratoconus detection from three-dimensional anterior segment optical coherence tomography (AS-OCT). The results demonstrate that hierarchical architectures achieve 21-23% higher sensitivity and specificity, particularly in the difficult subclinical regime, outperforming both convolutional neural networks (CNNs) and global-attention Vision Transformer (ViT) baselines. Mechanistic analyses indicate that this advantage arises from spatial scale alignment: hierarchical windowing produces effective receptive fields matched to the intermediate extent of subclinical abnormalities, avoiding the excessive locality observed in convolutional models and the diffuse integration characteristic of pure global attention. Attention-distance measurements show that subclinical cases require longer spatial integration than healthy or overtly pathological volumes, with hierarchical models exhibiting lower variance and more anatomically coherent focus. Representational similarity further indicates that hierarchical attention learns a distinct feature space that balances local structure sensitivity with flexible long-range interactions. Auxiliary age and sex prediction tasks demonstrate moderately high cross-task consistency, supporting the generalizability of these inductive principles. The findings provide design guidance for volumetric anomaly detection and highlight hierarchical attention as a principled approach for early pathological change analysis in medical imaging.
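
The attention-distance measurements used in the mechanistic analysis can be reproduced with a few tensor operations; a sketch for a 2D patch grid (the paper's AS-OCT volumes would add a depth axis):

```python
import torch

def mean_attention_distance(attn, grid_h, grid_w):
    # attn: (heads, n_patches, n_patches), softmaxed over patch tokens only.
    # Returns the attention-weighted mean query-key distance in patch units.
    ys, xs = torch.meshgrid(
        torch.arange(grid_h), torch.arange(grid_w), indexing="ij"
    )
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (n, 2)
    dist = torch.cdist(pos, pos)                                     # (n, n)
    return (attn * dist).sum(dim=-1).mean()
```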

[219] $\mathrm{D}^\mathrm{3}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping

Main category: cs.CV

TL;DR: D³-Predictor: A noise-free deterministic framework that reformulates pretrained diffusion models for dense prediction tasks by aggregating timestep-dependent visual experts into a single clean geometric prior.

DetailsMotivation: Diffusion models with strong visual priors overlook a core limitation: stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction that requires deterministic mapping from image to geometry. This noise corrupts fine-grained spatial cues and pushes models toward timestep-specific noise objectives, destroying meaningful geometric structure mappings.

Method: Introduces D³-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs, it views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Task-specific supervision is used to adapt this noise-free prior to dense prediction tasks.

Result: Extensive experiments on various dense prediction tasks demonstrate that D³-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. It requires less than half the training data previously used and efficiently performs inference in a single step.

Conclusion: D³-Predictor successfully addresses the misalignment between stochastic diffusion sampling and deterministic dense prediction requirements by creating a noise-free framework that leverages diffusion priors without corruption from stochastic noise, enabling efficient and data-efficient dense prediction.

Abstract: Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction that requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^\mathrm{3}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^\mathrm{3}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^\mathrm{3}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.
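
The expert-aggregation idea can be pictured as running the same clean image through the pretrained denoiser under several timestep conditionings and pooling the outputs; a minimal sketch (uniform averaging is an assumption here, whereas the paper learns the aggregation self-supervisedly, and `denoiser_features` is a hypothetical feature hook):

```python
import torch

@torch.no_grad()
def aggregate_timestep_experts(denoiser_features, x, timesteps):
    # x is the clean image: no noise is added, only the timestep embedding
    # varies, treating each conditioning as a distinct "visual expert".
    feats = [denoiser_features(x, t) for t in timesteps]
    return torch.stack(feats).mean(dim=0)
```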

[220] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis

Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang

Main category: cs.CV

TL;DR: Fairness-aware Low-Rank Adaptation (LoRA) for medical vision-language models reduces diagnostic accuracy disparities across demographic groups by 69% while maintaining overall accuracy, using only 0.24% trainable parameters.

DetailsMotivation: Medical vision-language models show expert-level performance but exhibit significant diagnostic accuracy disparities across demographic groups, creating fairness concerns in healthcare AI applications.

Method: Introduces fairness-aware Low-Rank Adaptation with three approaches: FR-LoRA (MaxAccGap regularization), GR-LoRA (inverse frequency weighting), and Hybrid-LoRA (combination). Key innovation is differentiable MaxAccGap loss for end-to-end fairness optimization.

Result: GR-LoRA reduces diagnostic accuracy disparities by 69% on 10,000 glaucoma fundus images while maintaining 53.15% overall accuracy. Race-specific optimization achieves 60% disparity reduction. Strong regularization yields optimal fairness with minimal accuracy trade-off.

Conclusion: The approach enables practical deployment of fair medical AI in resource-constrained settings by requiring only 0.24% trainable parameters, effectively addressing demographic fairness concerns while maintaining diagnostic performance.

Abstract: Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.

[221] Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment

Sangha Park, Eunji Kim, Yeongtak Oh, Jooyoung Choi, Sungroh Yoon

Main category: cs.CV

TL;DR: NPC is an automated pipeline that improves text-to-image alignment by identifying and applying negative prompts to suppress unintended content in generated images.

DetailsMotivation: Despite progress in text-to-image generation, achieving precise text-image alignment remains challenging, especially for prompts with rich compositional structure or imaginative elements.

Method: NPC analyzes cross-attention patterns to identify both targeted negatives (related to alignment errors) and untargeted negatives (unrelated tokens present in images). It uses a verifier-captioner-proposer framework to generate candidate negative prompts and ranks them with a salient text-space score without requiring additional image synthesis.
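
A hypothetical sketch of the verifier-captioner-proposer selection loop: all four callables (`verifier`, `captioner`, `proposer`, `scorer`) are assumptions standing in for the paper's components, and only the control flow, ranking candidates by a text-space score without synthesizing new images, is illustrated.

```python
def select_negative_prompt(verifier, captioner, proposer, scorer,
                           prompt, image, k=8):
    """Pick the highest-scoring candidate negative prompt for one image."""
    error = verifier(prompt, image)        # e.g. which constraint failed
    caption = captioner(image)             # what the image actually shows
    candidates = proposer(prompt, error, caption, k)   # k negative prompts
    scores = [scorer(neg, prompt, error, caption) for neg in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]                # applied as the negative prompt
```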

Result: On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench.

Conclusion: NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models by guiding what not to generate through negative prompting.

Abstract: Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt’s alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.

[222] EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Jiaqi Ma, Shengkai Hu, Xu Zhang, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan

Main category: cs.CV

TL;DR: EvoIR is an All-in-One Image Restoration framework that combines evolutionary frequency modulation with adaptive optimization to handle diverse degradations.

DetailsMotivation: Existing AiOIR approaches lack explicit frequency modeling and rely on fixed optimization schedules, limiting generalization across heterogeneous degradation types.

Method: Proposes EvoIR with two key components: Frequency-Modulated Module (FMM) for explicit high/low frequency decomposition and adaptive modulation, and Evolutionary Optimization Strategy (EOS) for dynamic frequency-aware objective adjustment through population-based evolution.
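
A minimal sketch of explicit high-/low-frequency decomposition of the kind the FMM performs, assuming a simple FFT low-pass split; the cutoff and the per-branch modulation weights are illustrative, not the paper's exact design.

```python
import torch

def split_frequencies(feat, cutoff=0.25):
    """FFT the feature map, keep a centered low-pass window as the low
    branch, and take the remainder as the high branch."""
    f = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    h, w = feat.shape[-2:]
    ch, cw = int(h * cutoff), int(w * cutoff)
    mask = torch.zeros_like(f.real)
    mask[..., h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw] = 1.0
    low = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
    high = feat - low
    return low, high

# the two branches can then be modulated separately, e.g.
# out = alpha * high + beta * low, with learnable alpha and beta
```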

Result: EvoIR outperforms state-of-the-art AiOIR methods on multiple benchmarks, with the combination of FMM and EOS showing greater improvements than either component alone.

Conclusion: EvoIR provides a robust and versatile solution for AiOIR tasks by synergizing explicit frequency modeling with evolutionary optimization, effectively handling diverse degradation types while balancing structural accuracy and perceptual fidelity.

Abstract: All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches typically lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.

[223] SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination

Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon

Main category: cs.CV

TL;DR: SAVE is a training-free framework that reduces object hallucination in MLLMs by steering models along sparse autoencoder features that capture visual understanding, achieving state-of-the-art performance on hallucination benchmarks.

DetailsMotivation: Multimodal Large Language Models (MLLMs) suffer from object hallucination due to language priors and visual information loss, which undermines their reliability in grounded visual understanding tasks.

Method: SAVE uses a binary object-presence QA probe to identify visual understanding features in Sparse Autoencoder (SAE) latent space, then steers the model along these features to reinforce grounded visual understanding without additional training.
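
A minimal sketch of SAE feature steering under stated assumptions: `sae_encode`/`sae_decode` and the probe-selected `feat_ids` are hypothetical stand-ins for the paper's components, and only the steering arithmetic (amplify selected latents, inject the induced change) is illustrated.

```python
import torch

@torch.no_grad()
def steer_with_sae(hidden, sae_encode, sae_decode, feat_ids, alpha=2.0):
    """Amplify probe-selected SAE features in a hidden state at inference."""
    z = sae_encode(hidden)                 # sparse latent activations
    baseline = sae_decode(z)               # reconstruction before steering
    z[..., feat_ids] *= alpha              # boost visual-understanding feats
    steered = sae_decode(z)
    # inject only the steering-induced change, leaving SAE recon error out
    return hidden + (steered - baseline)
```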

Result: SAVE outperforms state-of-the-art training-free methods with 10%p improvement on CHAIR_S and consistent gains on POPE and MMHal-Bench. Analysis shows it suppresses uncertain object tokens and increases attention to image tokens.

Conclusion: The SAVE framework effectively mitigates object hallucination in MLLMs through SAE feature steering, demonstrating robustness, generalizability, and superior performance without requiring model training.

Abstract: Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model’s visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.

[224] LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models

Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen

Main category: cs.CV

TL;DR: LoC-Path is an efficient multimodal LLM for pathology that reduces computational costs by identifying and focusing on task-relevant tiles in whole slide images, replacing brute-force processing with redundancy-reducing modules.

DetailsMotivation: Current slide-level MLLMs for pathology use heavy encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational costs, while human experts focus only on key diagnostic regions. Tile-level features exhibit strong redundancy, with only a small subset being truly task-relevant.

Method: Introduces LoC-Path framework with: 1) Sparse Token Merger (STM) and MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens; 2) Cross-Attention Routing Adapter (CARA) and Token Importance Scorer (TIS) to integrate compressed visual representation with language model efficiently.
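
A minimal sketch of the token-importance idea: score every tile token with a small MLP and keep only the top-k before the language model sees them. The module name, MLP shape, and budget are assumptions for illustration, not the paper's exact TIS design.

```python
import torch
import torch.nn as nn

class TokenImportanceScorer(nn.Module):
    """Keep only the k most task-relevant tile tokens of a slide."""

    def __init__(self, dim, keep=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                    nn.Linear(dim // 4, 1))
        self.keep = keep

    def forward(self, tiles):                 # tiles: (B, N, D), N large
        scores = self.scorer(tiles).squeeze(-1)            # (B, N)
        idx = scores.topk(min(self.keep, tiles.shape[1]), dim=1).indices
        batch = torch.arange(tiles.shape[0],
                             device=tiles.device).unsqueeze(1)
        return tiles[batch, idx]              # (B, keep, D) compact set
```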

Result: Achieves performance comparable to state-of-the-art whole-slide MLLMs while requiring significantly lower computation and memory.

Conclusion: The proposed LoC-Path framework demonstrates that efficient WSI-language modeling is achievable by focusing on task-relevant regions and reducing redundancy, offering a more practical solution for pathology applications.

Abstract: Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.

[225] EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

Chang Liu, Tianjiao Jing, Chengcheng Ma, Xuanqi Zhou, Zhengxuan Lian, Qin Jin, Hongliang Yuan, Shi-Sheng Huang

Main category: cs.CV

TL;DR: EmoDiffTalk: A novel 3D Gaussian Splatting talking head framework with emotion-aware Gaussian diffusion for fine-grained, multimodal emotional editing using text-to-AU control.

DetailsMotivation: Existing photo-realistic 3D talking head methods using 3D Gaussian Splatting lack effective emotional expression manipulation, particularly for fine-grained and expansive dynamic emotional editing with multimodal control.

Method: Introduces Emotion-aware Gaussian Diffusion with two key components: 1) Action Unit (AU) prompt Gaussian diffusion process for fine-grained facial animation, and 2) Accurate text-to-AU emotion controller for expansive dynamic emotional editing using text input.

Result: Superior performance on EmoTalk3D and RenderMe-360 datasets, demonstrating better emotional subtlety, lip-sync fidelity, and controllability compared to previous works.

Conclusion: Establishes a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis, representing one of the first 3D Gaussian Splatting talking-head frameworks supporting continuous, multimodal emotional editing in AU-based expression space.

Abstract: Recent photo-realistic 3D talking heads based on 3D Gaussian Splatting still show significant shortcomings in emotional expression manipulation, especially for fine-grained and expansive dynamic emotional editing under multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process serving as a fine-grained facial animator, together with an accurate text-to-AU emotion controller that provides accurate and expansive dynamic emotional editing using text input. Experiments on the public EmoTalk3D and RenderMe-360 datasets demonstrate the superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To the best of our knowledge, EmoDiffTalk is among the first 3D Gaussian Splatting talking-head generation frameworks to support continuous, multimodal emotional editing within the AU-based expression space.

[226] AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars

Ramazan Fazylov, Sergey Zagoruyko, Aleksandr Parkin, Stamatis Lefkimmiatis, Ivan Laptev

Main category: cs.CV

TL;DR: AGORA is a novel framework that extends 3D Gaussian Splatting with a GAN to generate animatable 3D human avatars with real-time rendering and fine-grained expression control.

DetailsMotivation: Existing methods have limitations: NeRF-based approaches suffer from slow rendering and dynamic inconsistencies, while 3DGS methods are typically limited to static head generation without dynamic control. There's a need for high-fidelity, animatable avatars with real-time performance for VR, telepresence, and entertainment applications.

Method: AGORA extends 3D Gaussian Splatting within a generative adversarial network framework. It introduces a lightweight, FLAME-conditioned deformation branch that predicts per-Gaussian residuals for identity-preserving, fine-grained expression control. Expression fidelity is enforced via a dual-discriminator training scheme leveraging synthetic renderings of the parametric mesh.
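
A minimal sketch of a FLAME-conditioned deformation branch, assuming a small MLP that maps expression parameters plus a per-Gaussian feature to a per-Gaussian position residual; dimensions and the residual target (positions only) are illustrative simplifications.

```python
import torch
import torch.nn as nn

class FlameDeformationBranch(nn.Module):
    """Predict per-Gaussian residual offsets from FLAME parameters."""

    def __init__(self, flame_dim=100, gauss_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(flame_dim + gauss_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))             # xyz residual per Gaussian

    def forward(self, flame_params, gauss_feats, positions):
        # flame_params: (E,), gauss_feats: (N, F), positions: (N, 3)
        cond = flame_params.unsqueeze(0).expand(gauss_feats.shape[0], -1)
        residual = self.mlp(torch.cat([cond, gauss_feats], dim=-1))
        return positions + residual           # deformed, animatable Gaussians
```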

Result: AGORA outperforms state-of-the-art NeRF-based methods on expression accuracy while rendering at 250+ FPS on a single GPU and at ~9 FPS under CPU-only inference, representing the first demonstration of practical CPU-only animatable 3DGS avatar synthesis.

Conclusion: AGORA represents a significant step toward practical, high-performance digital humans by bridging the gap between high-fidelity animation and real-time rendering performance, enabling both GPU and CPU-only deployment.

Abstract: The generation of high-fidelity, animatable 3D human avatars remains a core challenge in computer graphics and vision, with applications in VR, telepresence, and entertainment. Existing approaches based on implicit representations like NeRFs suffer from slow rendering and dynamic inconsistencies, while 3D Gaussian Splatting (3DGS) methods are typically limited to static head generation, lacking dynamic control. We bridge this gap by introducing AGORA, a novel framework that extends 3DGS within a generative adversarial network to produce animatable avatars. Our key contribution is a lightweight, FLAME-conditioned deformation branch that predicts per-Gaussian residuals, enabling identity-preserving, fine-grained expression control while allowing real-time inference. Expression fidelity is enforced via a dual-discriminator training scheme leveraging synthetic renderings of the parametric mesh. AGORA generates avatars that are not only visually realistic but also precisely controllable. Quantitatively, we outperform state-of-the-art NeRF-based methods on expression accuracy while rendering at 250+ FPS on a single GPU, and, notably, at $\sim$9 FPS under CPU-only inference, representing, to our knowledge, the first demonstration of practical CPU-only animatable 3DGS avatar synthesis. This work represents a significant step toward practical, high-performance digital humans. Project website: https://ramazan793.github.io/AGORA/

[227] Hierarchical Deep Learning for Diatom Image Classification: A Multi-Level Taxonomic Approach

Yueying Ke

Main category: cs.CV

TL;DR: Hierarchical neural network (DiatomCascadeNet) improves diatom taxonomic classification by embedding taxonomic hierarchy into architecture, achieving better accuracy at upper taxonomic levels and keeping errors taxonomically local when species predictions fail.

DetailsMotivation: Conventional diatom identification relies heavily on expert taxonomists, and while deep learning approaches exist, most treat diatom recognition as flat classification, predicting only one taxonomic rank. The authors investigate whether embedding taxonomic hierarchy into neural network architectures can improve both accuracy and error locality.

Method: Introduces DiatomCascadeNet (H-COFGS), a hierarchical convolutional network with five cascaded heads that jointly predict class, order, family, genus, and species. Each head receives shared backbone features and probability distributions from higher levels, with binary masks restricting predictions to valid descendants during training and inference. Uses a dataset of 1,456 diatom images covering 82 species and compares hierarchical and flat models under identical settings.
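
A minimal sketch of the hierarchical masking step: child logits are restricted to valid descendants of the (soft) parent prediction by pushing invalid classes to negative infinity before the softmax. The `ancestry` matrix encoding parent-child validity is an assumed input format.

```python
import torch

def masked_child_logits(child_logits, parent_probs, ancestry):
    """ancestry[p] is a 0/1 row over child classes marking the valid
    descendants of parent class p. Children inconsistent with every
    plausible parent are masked out of the child softmax."""
    # mask[b, c] > 0 iff some probable parent admits child c
    mask = parent_probs @ ancestry            # (B, P) x (P, C) -> (B, C)
    return child_logits.masked_fill(mask <= 1e-6, float('-inf'))

# at inference one can harden the parent choice instead:
# mask = ancestry[parent_probs.argmax(dim=1)]
```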

Result: H-COFGS matches flat baselines at species level (69.4% accuracy) while outperforming at all upper taxonomic levels. When species predictions fail, errors remain taxonomically local: 92.5% of misclassified species are correctly predicted at genus level vs. 67.2% for flat baselines. Reduces mean taxonomic distance by 38.2% (1.209 vs. 1.955). Progressive training reveals bidirectional mechanisms that improve class accuracy from 96.2% to 99.5% and yields 6-8% gains at upper levels.

Conclusion: Hierarchical neural network architecture produces more robust, interpretable, and biologically aligned predictions for multi-level taxonomic classification by leveraging taxonomic hierarchy through bidirectional mechanisms (top-down constraint masks and bottom-up gradient propagation).

Abstract: Accurate taxonomic identification of diatoms is essential for aquatic ecosystem monitoring, yet conventional methods depend heavily on expert taxonomists. Recent deep learning approaches improve automation, but most treat diatom recognition as flat classification, predicting only one taxonomic rank. We investigate whether embedding taxonomic hierarchy into neural network architectures can improve both accuracy and error locality. We introduce DiatomCascadeNet (H-COFGS), a hierarchical convolutional network with five cascaded heads that jointly predict class, order, family, genus, and species. Each head receives shared backbone features and probability distributions from higher levels, with binary masks restricting predictions to valid descendants during training and inference. Using a filtered dataset of 1,456 diatom images covering 82 species, we compare hierarchical and flat models under identical settings. H-COFGS matches flat baselines at the species level (69.4% accuracy) while outperforming at all upper taxonomic levels. When species predictions fail, errors remain taxonomically local: 92.5% of misclassified species are correctly predicted at the genus level, versus 67.2% for flat baselines. H-COFGS reduces mean taxonomic distance by 38.2% (1.209 vs. 1.955). Progressive training reveals bidirectional mechanisms: hierarchical constraint masks operate top-down to constrain prediction space, while gradients from fine-grained levels propagate bottom-up through the shared backbone, refining features. This improves class accuracy from 96.2% to 99.5% and yields 6-8% gains at upper levels, producing more robust, interpretable, and biologically aligned predictions for multi-level taxonomic classification.

[228] Optimization-Guided Diffusion for Interactive Scene Generation

Shihao Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li

Main category: cs.CV

TL;DR: OMEGA is an optimization-guided framework that improves diffusion-based multi-agent scene generation by enforcing physical and social constraints, enabling realistic safety-critical scenario creation for autonomous vehicle evaluation.

DetailsMotivation: Safety-critical driving events are rare in real datasets but essential for evaluating autonomous vehicles. Existing scene generation models lack controllability and often produce physically/socially implausible scenes, limiting their usefulness for safety testing.

Method: OMEGA uses constrained optimization to re-anchor each reverse diffusion step, steering generation toward physically plausible and behaviorally coherent trajectories. It formulates ego-attacker interactions as game-theoretic optimization in distribution space to approximate Nash equilibria for adversarial scenarios.
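
A minimal sketch of the re-anchoring idea under stated assumptions: after each reverse-diffusion step, the predicted clean sample is nudged down the gradient of a differentiable constraint cost before being fed back to the sampler. The cost function, step size, and iteration count are hypothetical.

```python
import torch

def reanchor_step(x0_pred, constraint_cost, lr=0.1, iters=5):
    """Project a predicted clean sample toward constraint satisfaction
    (e.g. collision, kinematics, social rules) via a few gradient steps."""
    x = x0_pred.detach().requires_grad_(True)
    for _ in range(iters):
        cost = constraint_cost(x)             # scalar penalty
        (grad,) = torch.autograd.grad(cost, x)
        x = (x - lr * grad).detach().requires_grad_(True)
    return x.detach()                         # re-anchored trajectory sample
```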

Result: OMEGA improves physically/behaviorally valid scenes from 32.35% to 72.27% for free exploration, and from 11% to 80% for controllability-focused generation. It generates 5× more near-collision frames (TTC < 3s) while maintaining scene realism on nuPlan and Waymo datasets.

Conclusion: OMEGA provides a training-free framework that enhances diffusion-based scene generation with structural consistency and interaction awareness, enabling realistic safety-critical scenario generation for autonomous vehicle evaluation without requiring model retraining.

Abstract: Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events which are essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.

[229] Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries

Samitha Nuwan Thilakarathna, Ercan Avsar, Martin Mathias Nielsen, Malte Pedersen

Main category: cs.CV

TL;DR: Optimized deep learning pipeline for fish re-identification using Vision Transformers achieves 90.43% Rank-1 accuracy on simulated Electronic Monitoring data, with hard triplet mining and custom image transformations key to performance.

DetailsMotivation: Electronic Monitoring systems generate vast amounts of video data that cannot be manually reviewed, creating a need for automated fish re-identification to support sustainable marine resource management.

Method: Developed optimized deep learning pipeline using AutoFish dataset (simulating EM conveyor belts with 6 similar fish species), employing hard triplet mining with custom image transformation pipeline including dataset-specific normalization, comparing Vision Transformer (Swin-T) vs CNN (ResNet-50).
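
For reference, a standard batch-hard triplet loss of the kind used here: for each anchor, take its hardest positive (same fish identity) and hardest negative within the batch. This is the generic formulation, not the authors' exact training code, and it assumes PK-style sampling so every anchor has at least one positive.

```python
import torch

def batch_hard_triplet_loss(embeddings, ids, margin=0.3):
    """Batch-hard triplet loss over L2 distances between embeddings."""
    dist = torch.cdist(embeddings, embeddings)          # (B, B) distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)
    eye = torch.eye(len(ids), dtype=torch.bool, device=ids.device)
    # hardest positive: farthest sample sharing the anchor's identity
    hardest_pos = dist.masked_fill(~same | eye, float('-inf')).max(1).values
    # hardest negative: closest sample with a different identity
    hardest_neg = dist.masked_fill(same, float('inf')).min(1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```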

Result: Swin-T consistently outperforms ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. Analysis shows main challenge is intra-species errors where viewpoint inconsistency is more detrimental than partial occlusion.

Conclusion: Vision Transformers with hard triplet mining and custom image transformations provide effective automated fish re-identification for Electronic Monitoring systems, addressing the challenge of distinguishing visually similar individuals within species.

Abstract: Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates EM systems with conveyor belts carrying six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git

[230] Thinking with Images via Self-Calling Agent

Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye

Main category: cs.CV

TL;DR: sCoT reformulates multimodal CoT as language-only CoT with self-calling virtual subagents, achieving better performance with 75% fewer GPU hours.

DetailsMotivation: Current thinking-with-images paradigms show strong visual reasoning but optimizing interleaved multimodal CoT through RL is challenging due to scarce high-quality reasoning data.

Method: Self-Calling Chain-of-Thought (sCoT) uses a main agent to decompose visual reasoning into atomic subtasks, then invokes parameter-sharing subagents (virtual replicas) to solve them in isolated context. Uses group-relative policy optimization for reinforcement.
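
A hypothetical sketch of the self-calling control flow: the same model plans, answers each atomic subtask in a fresh isolated context (its "virtual replica"), and then synthesizes a final answer. The `model.generate(prompt, image)` interface and the line-based plan format are assumptions for illustration.

```python
def solve_with_self_calls(model, question, image, max_subtasks=5):
    """Main agent plans; parameter-sharing subagents answer in isolation."""
    plan = model.generate(
        f"Decompose into at most {max_subtasks} atomic visual subtasks, "
        f"one per line:\n{question}", image)
    notes = []
    for subtask in plan.splitlines()[:max_subtasks]:
        # virtual replica: same weights, isolated context, no shared history
        answer = model.generate(f"Answer concisely: {subtask}", image)
        notes.append(f"{subtask} -> {answer}")
    return model.generate(
        "Using these findings, answer the original question.\n"
        + "\n".join(notes) + f"\nQuestion: {question}", image)
```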

Result: On HR-Bench 4K, sCoT improves overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baselines.

Conclusion: sCoT provides an effective and efficient visual reasoning paradigm that avoids explicit modality interleaving while enhancing optimization through self-calling architecture.

Abstract: Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.

[231] StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

Ke Xing, Xiaojie Jin, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: StereoWorld is an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation with geometry-aware regularization and spatio-temporal tiling for efficient high-resolution synthesis.

DetailsMotivation: The growing adoption of XR devices has created strong demand for high-quality stereo video, but current production methods are costly and prone to artifacts, creating a need for more efficient and reliable stereo video generation solutions.

Method: The framework jointly conditions on monocular video input while using explicit geometry-aware regularization to ensure 3D structural fidelity, and integrates a spatio-temporal tiling scheme for efficient high-resolution synthesis. The authors also curated a large-scale stereo video dataset with over 11M frames aligned to natural human IPD for training and evaluation.
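
A minimal sketch of spatial tiling for high-resolution synthesis, assuming overlapping tiles blended with uniform weights; the paper also tiles over time, which is omitted here, and `fn` is assumed to preserve tile size.

```python
import torch

def tiled_apply(fn, video, tile=256, overlap=32):
    """Run fn on overlapping spatial tiles of (B, T, C, H, W) and blend."""
    b, t, c, h, w = video.shape
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, 1, h, w, device=video.device)
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            ys = slice(y, min(y + tile, h))
            xs = slice(x, min(x + tile, w))
            out[..., ys, xs] += fn(video[..., ys, xs])
            weight[..., ys, xs] += 1.0
    return out / weight.clamp(min=1.0)        # average in overlap regions
```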

Result: Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency compared to existing approaches.

Conclusion: StereoWorld provides an effective solution for high-quality stereo video generation that addresses the cost and artifact challenges of traditional production methods, with demonstrated superiority over previous techniques in both visual quality and 3D consistency.

Abstract: The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.

[232] Perception-Inspired Color Space Design for Photo White Balance Editing

Yang Cheng, Ziteng Cui, Shenghan Su, Lin Gu, Zenghui Zhang

Main category: cs.CV

TL;DR: The paper proposes a novel white balance correction framework using a perception-inspired Learnable HSI color space and Mamba-based network to overcome limitations of traditional sRGB-based approaches.

DetailsMotivation: Current sRGB-based white balance editing for post-ISP correction has limitations due to fixed nonlinear transformations and entangled color channels, which struggle with complex lighting conditions when original camera RAW data is unavailable.

Method: Introduces a Learnable HSI (LHSI) color space based on cylindrical color model that separates luminance from chromatic components, with dedicated parameters for enhanced disentanglement and learnable mapping. Also proposes a new Mamba-based network tailored to the LHSI color space characteristics.
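
A minimal sketch of what a learnable cylindrical color space could look like: start from the classic HSI mapping (intensity as a channel average, saturation from the minimum channel, hue as a chromatic angle) and expose small learnable parameters refining the luminance/chroma split. The specific parameterization is an assumption, not the paper's exact LHSI design.

```python
import math
import torch
import torch.nn as nn

class LearnableHSI(nn.Module):
    """HSI-style transform with learnable intensity weights and a
    learnable saturation curve."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3) / 3)   # intensity weights
        self.gamma = nn.Parameter(torch.ones(1))   # saturation exponent

    def forward(self, rgb):                        # rgb in [0,1], (B,3,H,W)
        r, g, b = rgb.unbind(1)
        w = torch.softmax(self.w, 0)
        i = w[0] * r + w[1] * g + w[2] * b         # luminance component
        s = (1 - rgb.min(1).values / (i + 1e-6)).clamp(0, 1) ** self.gamma
        h = torch.atan2(math.sqrt(3.0) * (g - b),
                        2 * r - g - b)             # hue angle in (-pi, pi]
        return torch.stack([h, s, i], dim=1)
```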

Result: Experimental results on benchmark datasets demonstrate the superiority of the proposed method over existing approaches.

Conclusion: The work highlights the potential of perception-inspired color space design in computational photography for more effective white balance correction, especially in complex lighting conditions.

Abstract: White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene’s true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions. To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space. Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at https://github.com/YangCheng58/WB_Color_Space.

[233] StateSpace-SSL: Linear-Time Self-supervised Learning for Plant Disease Detection

Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb

Main category: cs.CV

TL;DR: StateSpace-SSL: A linear-time self-supervised learning framework using Vision Mamba state-space encoder for plant disease detection that outperforms CNN- and transformer-based SSL methods.

DetailsMotivation: Existing SSL methods (CNN- or transformer-based) are poorly matched to agricultural imagery. CNNs struggle to capture continuously evolving disease patterns along leaf structures, while transformers have quadratic attention costs from high-resolution patches.

Method: Proposes StateSpace-SSL with Vision Mamba state-space encoder that models long-range lesion continuity through directional scanning across leaf surfaces. Uses prototype-driven teacher-student objective to align representations across multiple views for stable, lesion-aware features.
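
A minimal sketch of a prototype-driven teacher-student objective of the general kind described, assuming cosine-similarity assignments to shared prototypes and an EMA teacher; the temperatures and normalization are generic choices, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(student_feat, teacher_feat, prototypes,
                             temp_s=0.1, temp_t=0.05):
    """Student view matches the teacher's sharper prototype assignment.
    Teacher weights would be an EMA of the student's in practice."""
    zs = F.normalize(student_feat, dim=-1)
    zt = F.normalize(teacher_feat, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    log_ps = F.log_softmax(zs @ protos.T / temp_s, dim=-1)
    with torch.no_grad():
        pt = F.softmax(zt @ protos.T / temp_t, dim=-1)   # sharper target
    return -(pt * log_ps).sum(-1).mean()                 # cross-entropy
```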

Result: Outperforms CNN- and transformer-based SSL baselines on three publicly available plant disease datasets across various evaluation metrics. Learns compact, lesion-focused feature maps.

Conclusion: StateSpace-SSL demonstrates the advantage of linear state-space modeling for self-supervised plant disease representation learning, effectively capturing lesion continuity while maintaining computational efficiency.

Abstract: Self-supervised learning (SSL) is attractive for plant disease detection as it can exploit large collections of unlabeled leaf images, yet most existing SSL methods are built on CNNs or vision transformers that are poorly matched to agricultural imagery. CNN-based SSL struggles to capture disease patterns that evolve continuously along leaf structures, while transformer-based SSL introduces quadratic attention cost from high-resolution patches. To address these limitations, we propose StateSpace-SSL, a linear-time SSL framework that employs a Vision Mamba state-space encoder to model long-range lesion continuity through directional scanning across the leaf surface. A prototype-driven teacher-student objective aligns representations across multiple views, encouraging stable and lesion-aware features from labelled data. Experiments on three publicly available plant disease datasets show that StateSpace-SSL consistently outperforms the CNN- and transformer-based SSL baselines in various evaluation metrics. Qualitative analyses further confirm that it learns compact, lesion-focused feature maps, highlighting the advantage of linear state-space modelling for self-supervised plant disease representation learning.

[234] UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

Alberto Rota, Mert Kiray, Mert Asim Karaoglu, Patrick Ruhkamp, Elena De Momi, Nassir Navab, Benjamin Busam

Main category: cs.CV

TL;DR: UnReflectAnything is an RGB-only framework that removes specular highlights from single images using a vision transformer encoder, highlight localization head, and token-level inpainting module, trained with virtual highlight synthesis.

DetailsMotivation: Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery, creating challenges for computer vision tasks.

Method: Uses frozen vision transformer encoder for multi-scale features, lightweight head for highlight localization, token-level inpainting module to restore corrupted features, and Virtual Highlight Synthesis pipeline for training without paired data.
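
A heavily simplified sketch of virtual highlight synthesis: paint a Blinn-Phong-style specular lobe onto an RGB image using monocular normals, yielding (corrupted, clean) pairs for free. The Fresnel term and lighting randomization from the paper are omitted, and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def add_virtual_highlight(image, normals, light_dir, view_dir,
                          shininess=64.0, strength=0.8):
    """image: (B,3,H,W) in [0,1]; normals/light_dir/view_dir: (B,3,H,W)
    unit vectors. Returns the corrupted image and the highlight map."""
    half = F.normalize(light_dir + view_dir, dim=1)      # half vector
    spec = torch.relu((normals * half).sum(1, keepdim=True)) ** shininess
    corrupted = (image + strength * spec).clamp(0, 1)
    return corrupted, spec      # the input image serves as the clean target
```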

Result: Achieves competitive performance with state-of-the-art results on several benchmarks and generalizes well across natural and surgical domains despite challenging non-Lambertian surfaces and non-uniform lighting.

Conclusion: UnReflectAnything effectively removes specular highlights from single RGB images, enabling better texture recovery and geometric reasoning across diverse domains through innovative training with virtual highlight synthesis.

Abstract: Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art results on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/

[235] ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo

Main category: cs.CV

TL;DR: The paper introduces Reason-Informed Video Editing (RVE), a new task requiring reasoning about physical plausibility and causal dynamics during video editing, along with RVE-Bench benchmark and ReViSE framework with self-reflective reasoning.

DetailsMotivation: Current video unified models struggle with reason-informed visual editing despite having strong vision-language understanding capabilities. This gap exists due to inadequate datasets for training/evaluating reasoning-aware video editing and a disconnect between models' reasoning and editing capabilities.

Method: Proposes ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model uses its internal VLM to provide intrinsic feedback by assessing whether edited videos logically satisfy given instructions, with differential feedback refining the generator’s reasoning during training.
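
A hypothetical sketch of one self-reflective step: the unified model edits the video, its internal VLM scores whether the result logically satisfies the instruction, and the scalar reward feeds the generator update. Both callables and the naive score parsing are assumptions for illustration.

```python
def self_reflective_step(generator, vlm_judge, video, instruction):
    """One generate-then-evaluate step of a self-reflective loop."""
    edited = generator(video, instruction)
    critique = vlm_judge(
        "Does the edited video satisfy this instruction, including "
        "physical plausibility? Reply with a score in [0, 1].\n"
        f"Instruction: {instruction}", edited)
    reward = float(critique.strip())
    return edited, reward          # reward drives the generator's update
```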

Result: Extensive experiments on RVE-Bench show ReViSE significantly enhances editing accuracy and visual fidelity, achieving 32% improvement in Overall score on the reasoning-informed video editing subset compared to state-of-the-art methods.

Conclusion: The paper successfully bridges the gap between reasoning and video editing by introducing the RVE task, creating a comprehensive benchmark (RVE-Bench), and developing an effective self-reflective reasoning framework (ReViSE) that significantly improves reasoning-aware video editing performance.

Abstract: Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models’ reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To address this gap, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model’s internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator’s reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.

cs.AI

[236] ExaCraft: Dynamic Learning Context Adaptation for Personalized Educational Examples

Akaash Chatterjee, Suman Kundu

Main category: cs.AI

TL;DR: ExaCraft is an AI system that generates personalized learning examples by adapting to learners’ dynamic context through user profiles and real-time behavior analysis.

DetailsMotivation: Existing educational AI tools don't focus on generating examples or adapting to learners' changing understanding, struggles, or growing skills. Learning is most effective when connected to relevant, relatable examples that resonate personally.

Method: Uses Google Gemini AI and Python Flask API accessible via Chrome extension. Combines user-defined profiles (location, education, profession, complexity preferences) with real-time analysis of learner behavior. Adapts to five key aspects: indicators of struggle, mastery patterns, topic progression history, session boundaries, and learning progression signals.

Result: System generates culturally relevant and individually tailored examples that evolve from basic concepts to advanced technical implementations, responding to topic repetition, regeneration requests, and topic progression patterns.

Conclusion: ExaCraft represents an innovative approach to personalized education by dynamically adapting example generation to learners’ evolving context and needs.

Abstract: Learning is most effective when it’s connected to relevant, relatable examples that resonate with learners on a personal level. However, existing educational AI tools don’t focus on generating examples or adapting to learners’ changing understanding, struggles, or growing skills. We’ve developed ExaCraft, an AI system that generates personalized examples by adapting to the learner’s dynamic context. Through the Google Gemini AI and Python Flask API, accessible via a Chrome extension, ExaCraft combines user-defined profiles (including location, education, profession, and complexity preferences) with real-time analysis of learner behavior. This ensures examples are both culturally relevant and tailored to individual learning needs. The system’s core innovation is its ability to adapt to five key aspects of the learning context: indicators of struggle, mastery patterns, topic progression history, session boundaries, and learning progression signals. Our demonstration will show how ExaCraft’s examples evolve from basic concepts to advanced technical implementations, responding to topic repetition, regeneration requests, and topic progression patterns in different use cases.

[237] Suzume-chan: Your Personal Navigator as an Embodied Information Hub

Maya Grace Torii, Takahito Murakami, Shuka Koseki, Yoichi Ochiai

Main category: cs.AI

TL;DR: The paper proposes an “Embodied Information Hub” called Suzume-chan - a soft AI agent that enables more human-centered knowledge sharing through physical and conversational interaction, reducing psychological distance in communication.

DetailsMotivation: Current digital tools improve access to information but fail to create the sense of connection needed for deep understanding. The paper aims to address the gap between information access and meaningful human connection in knowledge sharing.

Method: The authors propose an “Embodied Information Hub” based on Social Presence Theory. They developed Suzume-chan, a small, soft AI agent running locally with language models and RAG (retrieval-augmented generation). The system learns from spoken explanations and responds through dialogue.

Result: The prototype demonstrates how physical embodiment and conversational interaction can reduce psychological distance in knowledge sharing, making the process warmer and more human-centered compared to traditional digital tools.

Conclusion: Embodied AI agents like Suzume-chan can bridge the gap between information access and human connection by creating a sense of social presence, potentially transforming how expert knowledge is shared and understood.

Abstract: Access to expert knowledge often requires real-time human communication. Digital tools improve access to information but rarely create the sense of connection needed for deep understanding. This study addresses this issue using Social Presence Theory, which explains how a feeling of “being together” enhances communication. An “Embodied Information Hub” is proposed as a new way to share knowledge through physical and conversational interaction. The prototype, Suzume-chan, is a small, soft AI agent running locally with a language model and retrieval-augmented generation (RAG). It learns from spoken explanations and responds through dialogue, reducing psychological distance and making knowledge sharing warmer and more human-centered.

[238] Exploring Health Misinformation Detection with Multi-Agent Debate

Chih-Han Chen, Chen-Han Tsai, Yu-Shao Peng

Main category: cs.AI

TL;DR: A two-stage framework for health misinformation detection using agreement score prediction followed by multi-agent debate when consensus is insufficient.

DetailsMotivation: Health misinformation is proliferating online, requiring both high-quality evidence retrieval and rigorous reasoning for effective verification.

Method: Two-stage framework: 1) LLMs evaluate retrieved articles independently to compute an aggregated agreement score; 2) if the score falls below a threshold, multiple agents engage in a structured debate to synthesize conflicting evidence and generate reasoned verdicts.
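
A hypothetical sketch of the two-stage control flow: average per-article stance scores first, and only fall back to a multi-agent debate when the evidence lacks consensus. The `llm(prompt)` text-in/text-out call, the stance-score prompt, and the naive `float(...)` parsing are assumptions for illustration.

```python
def verify_claim(claim, articles, llm, threshold=0.7, rounds=3):
    """Stage 1: agreement score; stage 2: structured debate if no consensus."""
    scores = []
    for art in articles:
        out = llm("Does this article support the claim? Reply with a "
                  f"score in [0,1].\nClaim: {claim}\nArticle: {art}")
        scores.append(float(out.strip()))
    agreement = sum(scores) / len(scores)
    if agreement >= threshold:
        return "supported"
    if agreement <= 1 - threshold:
        return "refuted"
    transcript = []                    # insufficient consensus: debate
    for _ in range(rounds):
        for stance in ("supports", "refutes"):
            transcript.append(llm(
                f"You argue that the evidence {stance} the claim. Respond "
                f"to the debate so far.\nClaim: {claim}\nDebate: {transcript}"))
    return llm("As judge, give a final verdict with justification.\n"
               f"Claim: {claim}\nDebate: {transcript}")
```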

Result: Experimental results show superior performance compared to baseline methods.

Conclusion: Combining automated scoring with collaborative reasoning is valuable for complex verification tasks in health misinformation detection.

Abstract: Fact-checking health-related claims has become increasingly critical as misinformation proliferates online. Effective verification requires both the retrieval of high-quality evidence and rigorous reasoning processes. In this paper, we propose a two-stage framework for health misinformation detection: Agreement Score Prediction followed by Multi-Agent Debate. In the first stage, we employ large language models (LLMs) to independently evaluate retrieved articles and compute an aggregated agreement score that reflects the overall evidence stance. When this score indicates insufficient consensus, falling below a predefined threshold, the system proceeds to a second stage. Multiple agents engage in structured debate to synthesize conflicting evidence and generate well-reasoned verdicts with explicit justifications. Experimental results demonstrate that our two-stage approach achieves superior performance compared to baseline methods, highlighting the value of combining automated scoring with collaborative reasoning for complex verification tasks.

[239] Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting

Moein Heidari, Mohammad Amin Roohi, Armin Khosravi, Ilker Hacihaliloglu

Main category: cs.AI

TL;DR: Echo-CoPilot is a multi-view, multi-task AI agent that uses a large language model to orchestrate specialized echocardiography tools for comprehensive cardiac assessment, outperforming existing models on clinical benchmarks.

DetailsMotivation: Echocardiography interpretation is cognitively demanding and manual, while current AI models operate in isolation without providing unified clinical assessments. There's a need for a coherent system that integrates multiple specialized tools for comprehensive cardiac evaluation.

Method: Echo-CoPilot uses a large language model in a ReAct-style loop to orchestrate specialized echocardiography tools. It decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement, disease prediction, and report synthesis, then integrates outputs into guideline-aware answers.
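
A generic sketch of a ReAct-style orchestration loop of the kind described: the LLM alternates reasoning with tool calls (view classifier, segmenter, measurement, report writer) until it emits a final answer. The `tools` dict and the `ACTION <tool>: <input>` / `FINAL:` text protocol are assumptions for illustration, not the paper's exact interface.

```python
def react_loop(llm, tools, query, max_steps=8):
    """Dispatch tool calls suggested by the LLM until a final answer."""
    history = f"Question: {query}"
    for _ in range(max_steps):
        step = llm(history + "\nThink, then reply with either "
                             "'ACTION <tool>: <input>' or 'FINAL: <answer>'.")
        history += "\n" + step
        if "FINAL:" in step:
            return step.split("FINAL:", 1)[1].strip()
        if "ACTION" in step:
            name, arg = step.split("ACTION", 1)[1].split(":", 1)
            result = tools[name.strip()](arg.strip())   # e.g. segmenter
            history += f"\nObservation: {result}"
    return "No answer within step budget."
```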

Result: Achieved 50.8% accuracy on the MIMIC-EchoQA benchmark, outperforming both general-purpose and biomedical video vision-language models. The agent effectively leverages quantitative measurements and physiologic context to resolve challenging borderline cases.

Conclusion: Echo-CoPilot demonstrates that orchestrating specialized echocardiography tools through an LLM-based agent can provide clinically coherent assessments, potentially reducing cognitive load and improving consistency in echocardiography interpretation.

Abstract: Echocardiography is central to contemporary cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task that is still performed manually. While recent foundation models for echocardiography can achieve strong performance on individual perceptual subtasks such as view classification, segmentation, or disease prediction, they typically operate in isolation and do not provide a unified, clinically coherent assessment. In this work, we introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates their outputs into guideline-aware answers and narrative summaries. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models. Qualitative analyses further show that the agent leverages quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity. The code will be released upon acceptance of the paper.

[240] Fuzzy Hierarchical Multiplex

Alexis Kafantaris

Main category: cs.AI

TL;DR: A new fuzzy optimization framework extending FCM causality for service optimization in information transmission, using dynamics to map data into metrics and analyze logical implication and concept hierarchy through multiplex networks.

DetailsMotivation: To create a theoretical framework for service optimization in information transmission processes, particularly for service process design, by extending fuzzy cognitive maps (FCM) causality to better handle logical relationships and concept hierarchies.

Method: Extends FCM causality using dynamics to map data into metrics, creates a framework that examines logical implication and hierarchy of concepts using multiplex networks, and provides a thorough analysis of the FHM (Fuzzy Hierarchical Multiplex) following logical steps.

Result: Proposes a purely theoretical framework whose main objectives and orientation, service optimization of information transmission in service process design, are expounded and exemplified through a clear logical analysis.

Conclusion: The framework provides a systematic approach for optimizing information transmission services through fuzzy logic and hierarchical concept analysis, offering theoretical foundations for service process design applications.

Abstract: A new fuzzy optimization framework that extends FCM causality is proposed. This model utilizes the dynamics to map data into metrics and create a framework that examines logical implication and hierarchy of concepts using a multiplex. Moreover, this is a theoretical white paper introducing the framework and analyzing the logic and math behind it. Upon this extension, the main objectives and the orientation of this framework are expounded and exemplified; the framework is meant for service optimization of information transmission in service process design. Lastly, a thorough analysis of the FHM is included, carried out by following the logical steps in a simple and elegant manner.

[241] Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland, Mira Chaplin, Annabel Zhang, Hyoju Yang, James B Henderson, Krista Wigginton, Linsey Marr, T. M. Murali, Naren Ramakrishnan

Main category: cs.AI

TL;DR: SciEx is a modular framework for scientific information extraction that addresses challenges with long documents, multi-modal content, and inconsistent data across publications, enabling flexible integration of models and prompting strategies.

DetailsMotivation: Existing LLM-based extraction tools struggle with scientific literature's complexities: long documents, multi-modal content, inconsistent information across publications, and rapidly changing data schemas that make system re-architecture difficult.

Method: SciEx is a modular, composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design enables extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms.
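
A minimal sketch of the decoupled design, assuming each stage is an injectable callable so parsers, retrievers, extractors, or aggregators can be swapped without re-architecting the system; stage signatures are illustrative, not SciEx's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SciExPipeline:
    """Composable extraction pipeline with swappable stages."""
    parse: Callable[[str], List[str]]                # pdf path -> chunks
    retrieve: Callable[[List[str], str], List[str]]  # chunks, schema -> hits
    extract: Callable[[List[str], str], Dict]        # hits, schema -> record
    aggregate: Callable[[List[Dict]], Dict]          # records -> merged table

    def run(self, pdf_paths: List[str], schema: str) -> Dict:
        records = []
        for path in pdf_paths:
            chunks = self.parse(path)
            relevant = self.retrieve(chunks, schema)
            records.append(self.extract(relevant, schema))
        return self.aggregate(records)   # reconcile across publications
```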

Result: The framework was evaluated on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. The findings provide practical insights into both strengths and limitations of current LLM-based pipelines.

Conclusion: SciEx provides a practical solution for scientific information extraction that addresses real-world challenges in scientific literature processing, offering a modular approach that can adapt to changing requirements and integrate evolving AI capabilities.

Abstract: Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.
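
A decoupled pipeline of this kind can be expressed as a set of swappable callables; a minimal sketch, assuming stage signatures of our own invention rather than SciEx's real interfaces:

```python
from dataclasses import dataclass
from typing import Callable, List

# Composable extraction pipeline in the spirit of SciEx's decoupled stages
# (parse -> retrieve -> extract -> aggregate). The stage signatures are
# placeholders, not the authors' components.

@dataclass
class Pipeline:
    parse: Callable[[str], List[str]]                 # PDF path -> text chunks
    retrieve: Callable[[List[str], str], List[str]]   # chunks, query -> hits
    extract: Callable[[List[str]], List[dict]]        # hits -> records
    aggregate: Callable[[List[dict]], dict]           # records -> reconciled

    def run(self, pdf_path: str, query: str) -> dict:
        chunks = self.parse(pdf_path)
        hits = self.retrieve(chunks, query)
        records = self.extract(hits)
        return self.aggregate(records)

# Swapping a component (e.g., a new retriever or prompting strategy) means
# passing a different callable -- no re-architecture of the rest.
```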

[242] DynaMate: An Autonomous Agent for Protein-Ligand Molecular Dynamics Simulations

Salomé Guilbert, Cassandra Masschelein, Jeremy Goumaz, Bohdan Naida, Philippe Schwaller

Main category: cs.AI

TL;DR: DynaMate is an automated multi-agent LLM framework that autonomously designs and executes complete molecular dynamics workflows for protein and protein-ligand systems, including free energy binding affinity calculations.

DetailsMotivation: Despite the broad utility of MD simulations in drug discovery and protein engineering, the technical complexity of MD setup (parameterization, input preparation, software configuration) remains a major barrier for widespread and efficient usage. Current agentic LLMs have not successfully automated protein-ligand MD workflows.

Method: DynaMate is a modular multi-agent framework with three specialized modules that interact to plan experiments, perform simulations, and analyze results. It integrates dynamic tool use, web search, PaperQA, and self-correcting behavior to autonomously design and execute complete MD workflows, including MM/PB(GB)SA free energy calculations.

Result: The framework was evaluated across twelve benchmark systems of varying complexity. DynaMate reliably performed full MD simulations, corrected runtime errors through iterative reasoning, and produced meaningful analyses of protein-ligand interactions, demonstrating success rate, efficiency, and adaptability.

Conclusion: DynaMate paves the way toward standardized, scalable, and time-efficient molecular modeling pipelines for future biomolecular and drug design applications by automating complex MD workflows that were previously inaccessible to non-experts.

Abstract: Force field-based molecular dynamics (MD) simulations are indispensable for probing the structure, dynamics, and functions of biomolecular systems, including proteins and protein-ligand complexes. Despite their broad utility in drug discovery and protein engineering, the technical complexity of MD setup, encompassing parameterization, input preparation, and software configuration, remains a major barrier for widespread and efficient usage. Agentic LLMs have demonstrated their capacity to autonomously execute multi-step scientific processes, yet to date they have not been successfully used to automate protein-ligand MD workflows. Here, we present DynaMate, a modular multi-agent framework that autonomously designs and executes complete MD workflows for both protein and protein-ligand systems, and offers free energy binding affinity calculations with the MM/PB(GB)SA method. The framework integrates dynamic tool use, web search, PaperQA, and a self-correcting behavior. DynaMate comprises three specialized modules, interacting to plan the experiment, perform the simulation, and analyze the results. We evaluated its performance across twelve benchmark systems of varying complexity, assessing success rate, efficiency, and adaptability. DynaMate reliably performed full MD simulations, corrected runtime errors through iterative reasoning, and produced meaningful analyses of protein-ligand interactions. This automated framework paves the way toward standardized, scalable, and time-efficient molecular modeling pipelines for future biomolecular and drug design applications.
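
The self-correcting behavior amounts to a retry loop in which an LLM reviser consumes the error log; a toy sketch with invented module names (DynaMate's real engine calls and prompts are not shown in the abstract):

```python
# Sketch of iterative error correction: run a step, and on failure feed the
# error back to a reviser for corrected parameters. All names illustrative.

def run_md_step(params: dict) -> tuple[bool, str]:
    """Stand-in for launching an MD engine; returns (ok, log)."""
    if params.get("timestep_fs", 2.0) > 2.0:
        return False, "constraint failure: timestep too large"
    return True, "simulation completed"

def revise(params: dict, error: str) -> dict:
    """Stand-in for the LLM reviser; a real agent would parse `error`."""
    return {**params, "timestep_fs": params["timestep_fs"] / 2}

def run_with_self_correction(params: dict, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        ok, log = run_md_step(params)
        if ok:
            return log
        params = revise(params, log)      # feed the error back to the agent
    raise RuntimeError("simulation failed after retries")

print(run_with_self_correction({"timestep_fs": 4.0}))
```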

[243] SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu

Main category: cs.AI

TL;DR: SWR is a photorealistic urban simulation platform for embodied AI with procedurally generated cities, supporting multi-robot control and two challenging benchmarks for evaluating robot capabilities in realistic urban scenarios.

DetailsMotivation: Current foundation models for robotics focus mainly on indoor/household scenarios, lacking evaluation in large-scale, realistic urban environments with dynamic elements like pedestrians and traffic systems.

Method: Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes with dynamic elements (pedestrians, traffic), supports multi-robot control/communication, and creates two benchmarks: multimodal instruction-following navigation and multi-agent search tasks.

Result: State-of-the-art models (including VLMs) struggle with SWR’s tasks, revealing deficiencies in perception, reasoning, and planning abilities needed for urban environments.

Conclusion: SWR provides a comprehensive simulation platform and benchmarks to evaluate critical robot capacities in realistic urban scenarios, exposing limitations of current models and advancing development of generalist robotics for open-ended urban environments.

Abstract: Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking robust perception, reasoning, and planning abilities necessary for urban environments.

[244] Parallel Decoder Transformer: Model-Internal Parallel Decoding with Speculative Invariance via Note Conditioning

Logan Robbins

Main category: cs.AI

TL;DR: PDT introduces a parallel decoder transformer with speculative note conditioning adapters to enable coordinated parallel generation in frozen LLMs, achieving 77.8% precision in coverage prediction without retraining base model weights.

DetailsMotivation: Autoregressive decoding in LLMs creates latency bottlenecks that scale linearly with output length. Existing parallel methods like Skeleton-of-Thought suffer from coherence drift due to lack of cross-stream communication between parallel generation streams.

Method: PDT injects lightweight Speculative Note Conditioning (SNC) adapters into a frozen pre-trained model, allowing parallel decoding streams to synchronize via a shared dynamic latent space. Coordination is formulated as a speculative consensus problem where sibling streams broadcast semantic “notes” to a global bus, gated by a learned verification head.

Result: PDT achieves 77.8% precision in coverage prediction and recovers approximate serial semantics without modifying the trunk weights of the frozen 20B-parameter backbone model, validated on a 50,000-step curriculum.

Conclusion: PDT establishes a scalable, efficient alternative to full model fine-tuning for structured parallel generation, enabling coordinated parallel decoding while maintaining coherence through speculative consensus mechanisms.

Abstract: Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While "Decomposition-and-Fill" methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from coherence drift due to the lack of cross-stream communication. In this work, we introduce the Parallel Decoder Transformer (PDT), a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight Speculative Note Conditioning (SNC) adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a speculative consensus problem, where sibling streams broadcast semantic "notes" to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching 77.8% precision in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
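
The note bus can be pictured as a gated pooling step shared across decoding streams; a toy PyTorch sketch, with dimensions and the gating rule as our own assumptions rather than the paper's exact architecture:

```python
import torch

# Toy note bus: each parallel stream emits a "note" vector, a learned
# verification head gates which notes enter the shared bus, and every
# stream is conditioned on the pooled bus.

d, n_streams = 64, 4
note_proj = torch.nn.Linear(d, d)       # stream state -> note
verify_head = torch.nn.Linear(d, 1)     # note -> acceptance logit

def note_bus_step(stream_states: torch.Tensor) -> torch.Tensor:
    notes = note_proj(stream_states)                # (n_streams, d)
    gates = torch.sigmoid(verify_head(notes))       # (n_streams, 1)
    bus = (gates * notes).sum(dim=0) / gates.sum().clamp(min=1e-6)
    return stream_states + bus                      # condition all streams

states = torch.randn(n_streams, d)
states = note_bus_step(states)                      # one synchronization step
```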

[245] Mind the Gap! Pathways Towards Unifying AI Safety and Ethics Research

Dani Roytburg, Beck Miller

Main category: cs.AI

TL;DR: Quantitative study reveals deep structural divide between AI safety and AI ethics research communities, with over 80% of collaborations occurring within rather than across fields, and cross-disciplinary exchange dependent on a small number of brokers.

DetailsMotivation: AI alignment research has diverged into two parallel tracks: safety (focused on existential risks, deceptive behaviors) and ethics (focused on present harms, social bias). Despite both warning about insufficient alignment investment, they disagree on what alignment means, leading to isolated research communities with different methodologies and institutional homes.

Method: Large-scale quantitative study using bibliometric and co-authorship network analysis of 6,442 papers from twelve major ML and NLP conferences (2020-2025). Analyzed collaboration patterns and network connectivity between safety and ethics communities.

Result: Over 80% of collaborations occur within either safety or ethics communities. Cross-field connectivity is highly concentrated: roughly 5% of papers account for more than 85% of bridging links. Removing a small number of brokers sharply increases segregation, showing that cross-disciplinary exchange depends on a handful of actors rather than broad collaboration.

Conclusion: The safety-ethics divide is not only conceptual but institutional, with implications for research agendas, policy, and venues. Integrating technical safety work with normative ethics via shared benchmarks, cross-institutional venues, and mixed-method methodologies is essential for building AI systems that are both robust and just.

Abstract: While much research in artificial intelligence (AI) has focused on scaling capabilities, the accelerating pace of development makes countervailing work on producing harmless, “aligned” systems increasingly urgent. Yet research on alignment has diverged along two largely parallel tracks: safety (centered on scaled intelligence, deceptive or scheming behaviors, and existential risk) and ethics (focused on present harms, the reproduction of social bias, and flaws in production pipelines). Although both communities warn of insufficient investment in alignment, they disagree on what alignment means or ought to mean. As a result, their efforts have evolved in relative isolation, shaped by distinct methodologies, institutional homes, and disciplinary genealogies. We present a large-scale, quantitative study showing the structural split between AI safety and AI ethics. Using a bibliometric and co-authorship network analysis of 6,442 papers from twelve major ML and NLP conferences (2020-2025), we find that over 80% of collaborations occur within either the safety or ethics communities, and cross-field connectivity is highly concentrated: roughly 5% of papers account for more than 85% of bridging links. Removing a small number of these brokers sharply increases segregation, indicating that cross-disciplinary exchange depends on a handful of actors rather than broad, distributed collaboration. These results show that the safety-ethics divide is not only conceptual but institutional, with implications for research agendas, policy, and venues. We argue that integrating technical safety work with normative ethics, via shared benchmarks, cross-institutional venues, and mixed-method methodologies, is essential for building AI systems that are both robust and just.
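
The broker-removal analysis can be reproduced in miniature on any graph with community labels; a sketch using networkx's built-in karate-club graph as a stand-in for the co-authorship network:

```python
import networkx as nx

# Measure the share of cross-community edges, drop the nodes carrying the
# most bridging links, and re-measure. Synthetic stand-in for the paper's
# safety/ethics co-authorship network.

G = nx.karate_club_graph()
community = {n: G.nodes[n]["club"] for n in G}   # two built-in communities

def bridging_share(g):
    cross = sum(1 for u, v in g.edges if community[u] != community[v])
    return cross / g.number_of_edges()

# rank nodes by how many cross-community edges they touch
broker_load = {n: sum(1 for m in G[n] if community[m] != community[n])
               for n in G}
brokers = sorted(broker_load, key=broker_load.get, reverse=True)[:3]

print("before:", bridging_share(G))
G.remove_nodes_from(brokers)                     # remove the top brokers
print("after removing 3 brokers:", bridging_share(G))
```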

[246] Linear socio-demographic representations emerge in Large Language Models from indirect cues

Paul Bouchaud, Pedro Ramaciotti

Main category: cs.AI

TL;DR: LLMs develop linear representations of user demographics in activation space that can be decoded from names and occupations, revealing implicit biases that affect downstream behavior like career recommendations.

DetailsMotivation: To understand how LLMs encode sociodemographic attributes from indirect cues and whether models that pass bias benchmarks still harbor implicit biases that affect their behavior.

Method: Probed residual streams across layers of four transformer-based LLMs (Magistral 24B, Qwen3 14B, GPT-OSS 20B, OLMo2-1B) with explicit demographic disclosure, then tested if same probes predict demographics from implicit cues like names and occupations.

Result: LLMs develop linear demographic representations: names activate census-aligned gender/race representations, occupations trigger workforce-statistics-correlated representations. These implicit representations actively shape downstream behavior like career recommendations.

Conclusion: Models passing bias benchmarks may still harbor implicit biases that affect behavior at scale, highlighting fairness implications. Demographic representations are linear and interpretable in activation space.

Abstract: We investigate how LLMs encode sociodemographic attributes of human conversational partners inferred from indirect cues such as names and occupations. We show that LLMs develop linear representations of user demographics within activation space, wherein stereotypically associated attributes are encoded along interpretable geometric directions. We first probe residual streams across layers of four open transformer-based LLMs (Magistral 24B, Qwen3 14B, GPT-OSS 20B, OLMo2-1B) prompted with explicit demographic disclosure. We show that the same probes predict demographics from implicit cues: names activate census-aligned gender and race representations, while occupations trigger representations correlated with real-world workforce statistics. These linear representations allow us to explain demographic inferences implicitly formed by LLMs during conversation. We demonstrate that these implicit demographic representations actively shape downstream behavior, such as career recommendations. Our study further highlights that models that pass bias benchmark tests may still harbor and leverage implicit biases, with implications for fairness when applied at scale.
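
The probing methodology is standard linear probing on hidden states; a sketch with simulated activations standing in for the models' residual streams:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a logistic probe on activations from explicit-disclosure prompts,
# then apply it to activations from implicit-cue prompts (names,
# occupations). Activations are simulated here; in the paper they come
# from the models' residual streams.

rng = np.random.default_rng(0)
d = 128
direction = rng.normal(size=d)                 # hypothetical demographic axis

def fake_activations(n, label):
    return rng.normal(size=(n, d)) + (1 if label else -1) * direction

X = np.vstack([fake_activations(200, 1), fake_activations(200, 0)])
y = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(max_iter=1000).fit(X, y)   # explicit prompts

X_implicit = fake_activations(50, 1)           # e.g., name-only prompts
print("predicted share:", probe.predict(X_implicit).mean())
```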

[247] Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda

Main category: cs.AI

TL;DR: SAE embeddings provide cost-effective, controllable, and interpretable representations for analyzing large text corpora, outperforming LLM-based methods and dense embeddings in various data analysis tasks.

DetailsMotivation: Current methods for analyzing large text corpora rely on expensive LLM-based techniques or dense embeddings that lack control over properties of interest, creating a need for more cost-effective and controllable approaches.

Method: Using sparse autoencoders (SAEs) to create SAE embeddings - representations whose dimensions map to interpretable concepts, enabling controlled analysis through concept filtering and comparison.

Result: SAE embeddings uncover bigger differences than LLM-based methods at 2-8x lower cost, identify biases more reliably, enable controllable clustering along axes of interest, and outperform dense embeddings on property-based retrieval.

Conclusion: SAEs serve as a versatile tool for unstructured data analysis, highlighting the importance of interpreting models through their data, with applications ranging from dataset comparison to model behavior investigation.

Abstract: Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding “trigger” phrases learned by Tulu-3 (Lambert et al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through their data.
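
The core operations (sparse encoding, then concept filtering) look roughly like this; the encoder weights below are random stand-ins for a trained SAE, and the concept indices are hypothetical:

```python
import numpy as np

# SAE-embedding sketch: encode dense vectors into nonnegative,
# concept-aligned activations, then filter to a subset of concept
# dimensions before clustering or retrieval.

rng = np.random.default_rng(0)
d_model, d_sae = 256, 4096
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)

def sae_embed(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU -> nonnegative codes

docs = rng.normal(size=(10, d_model))           # dense doc embeddings
z = sae_embed(docs)                             # (10, 4096); a trained SAE
                                                # would be far sparser

# "controllable" analysis: keep only dimensions tagged with concepts
# of interest (indices hypothetical)
concept_dims = np.arange(100, 120)
z_filtered = z[:, concept_dims]                 # cluster/retrieve on these
```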

[248] Towards Foundation Models with Native Multi-Agent Intelligence

Shuyue Hu, Haoyang Yan, Yiqun Zhang, Yang Chen, Dongzhan Zhou, Lei Bai

Main category: cs.AI

TL;DR: Foundation models need native multi-agent intelligence, not just single-agent abilities, as strong single-agent performance doesn’t automatically translate to robust multi-agent capabilities.

DetailsMotivation: While foundation models are becoming the "brain" of AI agents and gaining single-agent abilities, the next frontier is endowing them with native multi-agent intelligence. Current assumptions that strong single-agent performance naturally leads to multi-agent intelligence are incorrect.

Method: Identified four core capabilities for multi-agent contexts: understanding, planning, efficient communication, and adaptation. Conducted extensive empirical evaluation across 41 large language models to test the relationship between single-agent and multi-agent performance.

Result: Strong single-agent performance alone does NOT automatically yield robust multi-agent intelligence. The paper provides empirical evidence across 41 LLMs showing this gap exists.

Conclusion: There’s a need for focused research directions including dataset construction, evaluation frameworks, training paradigms, and safety considerations specifically for building foundation models with native multi-agent intelligence.

Abstract: Foundation models (FMs) are increasingly assuming the role of the “brain” of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities – such as GUI interaction or integrated tool use – we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence across 41 large language models showing that strong single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions – spanning dataset construction, evaluation, training paradigms, and safety considerations – for building FMs with native multi-agent intelligence.

[249] Robust AI Security and Alignment: A Sisyphean Endeavor?

Apostol Vassilev

Main category: cs.AI

TL;DR: The paper extends Gödel’s incompleteness theorem to AI systems, establishing fundamental information-theoretic limits on AI security and alignment robustness.

DetailsMotivation: To understand the fundamental limitations of AI systems regarding security and alignment, which is crucial for responsible AI adoption and deployment.

Method: Extends Gödel’s incompleteness theorem to artificial intelligence systems, establishing theoretical limitations through information-theoretic analysis.

Result: Proves inherent limitations in AI robustness for security and alignment, demonstrates broader implications for cognitive reasoning limitations in AI systems.

Conclusion: AI systems have fundamental limitations in security and alignment robustness due to information-theoretic constraints; understanding these limitations is essential for responsible AI development, and practical approaches are needed to address these challenges.

Abstract: This manuscript establishes information-theoretic limitations for robustness of AI security and alignment by extending Gödel’s incompleteness theorem to AI. Knowing these limitations and preparing for the challenges they bring is critically important for the responsible adoption of the AI technology. Practical approaches to dealing with these challenges are provided as well. Broader implications for cognitive reasoning limitations of AI systems are also proven.

[250] Modeling Narrative Archetypes in Conspiratorial Narratives: Insights from Singapore-Based Telegram Groups

Soorya Ram Shimgekar, Abhay Goyal, Lam Yin Cheung, Roy Ka-Wei Lee, Koustuv Saha, Pi Zonooz, Navin Kumar

Main category: cs.AI

TL;DR: This paper analyzes conspiratorial discourse in Singapore Telegram groups using a two-stage computational framework: fine-tuning RoBERTa for classification and building a signed belief graph with a novel SiBeGNN model to identify narrative archetypes.

DetailsMotivation: Conspiratorial discourse is increasingly embedded in digital ecosystems but difficult to study. The authors aim to understand how such content spreads within everyday discussions rather than isolated echo chambers, challenging assumptions about online radicalization.

Method: Two-stage framework: 1) Fine-tune RoBERTa-large to classify messages as conspiratorial (F1=0.866 on 2,000 expert-labeled messages). 2) Build signed belief graph with nodes as messages and edges reflecting belief alignment weighted by textual similarity. Introduce SiBeGNN with Sign Disentanglement Loss to separate ideological alignment from stylistic features.

Result: Identified seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN achieved superior clustering quality (cDBI=8.38 vs 13.60-67.27 baselines) with 88% inter-rater expert agreement.

Conclusion: Conspiratorial messages appear within routine discussions, not just skepticism clusters, challenging assumptions about online radicalization. The framework advances computational methods for belief-driven discourse analysis with applications for stance detection, political communication, and content moderation.

Abstract: Conspiratorial discourse is increasingly embedded within digital communication ecosystems, yet its structure and spread remain difficult to study. This work analyzes conspiratorial narratives in Singapore-based Telegram groups, showing that such content is woven into everyday discussions rather than confined to isolated echo chambers. We propose a two-stage computational framework. First, we fine-tune RoBERTa-large to classify messages as conspiratorial or not, achieving an F1-score of 0.866 on 2,000 expert-labeled messages. Second, we build a signed belief graph in which nodes represent messages and edge signs reflect alignment in belief labels, weighted by textual similarity. We introduce a Signed Belief Graph Neural Network (SiBeGNN) that uses a Sign Disentanglement Loss to learn embeddings that separate ideological alignment from stylistic features. Using hierarchical clustering on these embeddings, we identify seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN yields stronger clustering quality (cDBI = 8.38) than baseline methods (13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Our analysis shows that conspiratorial messages appear not only in clusters focused on skepticism or distrust, but also within routine discussions of finance, law, and everyday matters. These findings challenge common assumptions about online radicalization by demonstrating that conspiratorial discourse operates within ordinary social interaction. The proposed framework advances computational methods for belief-driven discourse analysis and offers applications for stance detection, political communication studies, and content moderation policy.
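
One plausible reading of the Sign Disentanglement Loss is an embedding split whose "ideology" half must agree in sign with edge labels while staying decorrelated from the "style" half; a simplified PyTorch sketch under that assumption, not the authors' exact loss:

```python
import torch

# Simplified sign-disentanglement objective: split each node embedding
# into an ideology half and a style half, ask the ideology half to agree
# with each signed edge, and penalize correlation between the halves.

def sign_disentanglement_loss(z, edges, signs, lam=0.1):
    d = z.shape[1] // 2
    ideo, style = z[:, :d], z[:, d:]
    zi, zj = ideo[edges[:, 0]], ideo[edges[:, 1]]
    agree = torch.cosine_similarity(zi, zj)            # in [-1, 1]
    sign_loss = torch.mean((agree - signs.float()) ** 2)
    # discourage the two halves from sharing information (rough proxy)
    stats = torch.cat([ideo.mean(1, keepdim=True),
                       style.mean(1, keepdim=True)], dim=1)
    corr = torch.corrcoef(stats.T)[0, 1]
    return sign_loss + lam * corr.abs()

z = torch.randn(100, 64, requires_grad=True)           # node embeddings
edges = torch.randint(0, 100, (300, 2))
signs = torch.randint(0, 2, (300,)) * 2 - 1            # +1 / -1 edge signs
loss = sign_disentanglement_loss(z, edges, signs)
loss.backward()
```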

[251] AgriRegion: Region-Aware Retrieval for High-Fidelity Agricultural Advice

Mesafint Fanuel, Mahmoud Nabil Mahmoud, Crystal Cook Marshal, Vishal Lakhotia, Biswanath Dari, Kaushik Roy, Shaohu Zhang

Main category: cs.AI

TL;DR: AgriRegion is a RAG framework that reduces agricultural hallucinations in LLMs by incorporating geospatial metadata and region-prioritized retrieval for locally accurate farming advice.

DetailsMotivation: General-purpose LLMs suffer from contextual hallucinations in agriculture, providing advice that may be scientifically sound in one region but disastrous in another due to variations in soil, climate, and local regulations.

Method: AgriRegion uses a Retrieval-Augmented Generation framework with geospatial metadata injection and region-prioritized re-ranking, restricting knowledge to verified local agricultural extension services and enforcing geo-spatial constraints during retrieval.

Result: AgriRegion reduces hallucinations by 10-20% compared to state-of-the-art LLM systems and significantly improves trust scores, as demonstrated on the novel AgriRegion-Eval benchmark dataset.

Conclusion: The AgriRegion framework effectively addresses region-specific agricultural advisory needs by combining RAG with geospatial constraints, providing more reliable and locally accurate farming recommendations.

Abstract: Large Language Models (LLMs) have demonstrated significant potential in democratizing access to information. However, in the domain of agriculture, general-purpose models frequently suffer from contextual hallucination, providing non-factual advice or answers that are scientifically sound in one region but disastrous in another due to variations in soil, climate, and local regulations. We introduce AgriRegion, a Retrieval-Augmented Generation (RAG) framework designed specifically for high-fidelity, region-aware agricultural advisory. Unlike standard RAG approaches that rely solely on semantic similarity, AgriRegion incorporates a geospatial metadata injection layer and a region-prioritized re-ranking mechanism. By restricting the knowledge base to verified local agricultural extension services and enforcing geo-spatial constraints during retrieval, AgriRegion ensures that the advice regarding planting schedules, pest control, and fertilization is locally accurate. We create a novel benchmark dataset, AgriRegion-Eval, which comprises 160 domain-specific questions across 12 agricultural subfields. Experiments demonstrate that AgriRegion reduces hallucinations by 10-20% compared to state-of-the-art LLM systems and significantly improves trust scores according to a comprehensive evaluation.
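
Region-prioritized re-ranking can be sketched as a weighted combination of a region match and semantic similarity; the weighting and metadata fields below are illustrative assumptions:

```python
# Region-prioritized re-ranking sketch. Scoring weights and metadata
# fields are invented for illustration, not AgriRegion's configuration.

def rerank(query_region: str, candidates: list[dict], alpha: float = 0.6):
    """candidates: [{'text': ..., 'region': ..., 'sim': float}, ...]"""
    def score(c):
        region_match = 1.0 if c["region"] == query_region else 0.0
        return alpha * region_match + (1 - alpha) * c["sim"]
    return sorted(candidates, key=score, reverse=True)

docs = [
    {"text": "Plant maize in April", "region": "midwest-US", "sim": 0.71},
    {"text": "Plant maize in October", "region": "southern-hem", "sim": 0.78},
]
# region-correct advice outranks the semantically closer but wrong-region doc
print(rerank("midwest-US", docs)[0]["text"])
```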

[252] The 2025 Foundation Model Transparency Index

Alexander Wan, Kevin Klyman, Sayash Kapoor, Nestor Maslej, Shayne Longpre, Betty Xiong, Percy Liang, Rishi Bommasani

Main category: cs.AI

TL;DR: The 2025 Foundation Model Transparency Index shows transparency among AI companies has worsened, with average scores dropping from 58 to 40 out of 100, despite new policy mandates.

DetailsMotivation: To track how transparency practices evolve among increasingly consequential foundation model developers, and to inform policymakers about current transparency gaps.

Method: Annual index evaluating transparency using indicators related to data acquisition, usage data, monitoring, and other factors; third edition with expanded company coverage including Alibaba, DeepSeek, and xAI.

Result: Transparency deteriorated significantly: average score fell from 58 (2024) to 40 (2025); companies are most opaque about training data, compute, and post-deployment impact; IBM scored highest (95) while xAI and Midjourney scored lowest (14).

Conclusion: Despite increasing policy mandates, transparency has worsened, revealing critical information deficits that require more aggressive policy interventions, especially around training data and post-deployment monitoring.

Abstract: Foundation model developers are among the world’s most important companies. As these companies become increasingly consequential, how do their transparency practices evolve? The 2025 Foundation Model Transparency Index is the third edition of an annual effort to characterize and quantify the transparency of foundation model developers. The 2025 FMTI introduces new indicators related to data acquisition, usage data, and monitoring and evaluates companies like Alibaba, DeepSeek, and xAI for the first time. The 2024 FMTI reported that transparency was improving, but the 2025 FMTI finds this progress has deteriorated: the average score out of 100 fell from 58 in 2024 to 40 in 2025. Companies are most opaque about their training data and training compute as well as the post-deployment usage and impact of their flagship models. In spite of this general trend, IBM stands out as a positive outlier, scoring 95, in contrast to the lowest scorers, xAI and Midjourney, at just 14. The five members of the Frontier Model Forum we score end up in the middle of the Index: we posit that these companies avoid reputational harms from low scores but lack incentives to be transparency leaders. As policymakers around the world increasingly mandate certain types of transparency, this work reveals the current state of transparency for foundation model developers, how it may change given newly enacted policy, and where more aggressive policy interventions are necessary to address critical information deficits.

[253] CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

Yakun Zhu, Zhongzhen Huang, Qianhan Feng, Linjie Mu, Yannian Gu, Shaoting Zhang, Qi Dou, Xiaofan Zhang

Main category: cs.AI

TL;DR: CP-Env is a controllable agentic hospital environment that evaluates LLMs across end-to-end clinical pathways, revealing most models struggle with pathway complexity despite some internalizing knowledge.

DetailsMotivation: Current benchmarks focus on static exams or isolated dialogues, which inadequately evaluate LLMs in dynamic clinical scenarios involving complex decision-making and transitions between different healthcare stages.

Method: CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios from triage to multidisciplinary meetings, enabling branching, long-horizon task execution. It uses a three-tiered evaluation framework covering Clinical Efficacy, Process Competency, and Professional Ethics.

Result: Most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Excessive reasoning steps can be counterproductive, while top models show reduced tool dependency through internalized knowledge.

Conclusion: CP-Env advances the development of medical AI agents through comprehensive end-to-end clinical evaluation; the benchmark and evaluation tools are available at https://github.com/SPIRAL-MED/CP-Env.

Abstract: Medical care follows complex clinical pathways that extend beyond isolated physician-patient encounters, emphasizing decision-making and transitions between different stages. Current benchmarks focusing on static exams or isolated dialogues inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP-Env, a controllable agentic hospital environment designed to evaluate LLMs across end-to-end clinical pathways. CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios ranging from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings for agent interaction. Following the adaptive flow of healthcare in real hospitals, it enables branching, long-horizon task execution. We propose a three-tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to exhibit reduced tool dependency through internalized knowledge. CP-Env advances the development of medical AI agents through comprehensive end-to-end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/CP-Env.

[254] An exploration for higher efficiency in multi objective optimisation with reinforcement learning

Mehmet Emin Aydin

Main category: cs.AI

TL;DR: Proposes using multi-objective reinforcement learning to optimize operator sequences in search algorithms, extending single-objective approaches to multi-objective optimization.

DetailsMotivation: Optimization efficiency remains challenging, especially for multi-objective cases where operator sequence optimization hasn't been well explored despite promising results in single-objective optimization.

Method: Multi-objective reinforcement learning approach to generalize experiences and find optimal/near-optimal sequences of operators for neighborhood move operations.

Result: The paper presents an overview of a proposed generalisation approach with some stages completed and others outstanding, demonstrating the potential efficiency of multi-objective reinforcement learning.

Conclusion: Multi-objective reinforcement learning offers promising solutions for optimizing operator sequences in search algorithms, addressing efficiency challenges in multi-objective optimization.

Abstract: Efficiency in optimisation and search processes remains one of the challenges affecting the performance and use of optimisation algorithms. Utilising a pool of operators, instead of a single operator, to handle move operations within a neighbourhood remains promising, but finding an optimum or near-optimum sequence of operators requires further investigation. One promising idea is to generalise experience and determine how to utilise it. Although numerous works address this issue for single-objective optimisation, multi-objective cases have received little attention in this regard. A generalised approach based on multi-objective reinforcement learning appears to remedy this issue and offer good solutions. This paper overviews the proposed generalisation approach, with certain stages completed and others outstanding, aimed at demonstrating the efficiency of using multi-objective reinforcement learning.
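
One concrete instantiation of the idea is tabular Q-learning over operator choices with a scalarized multi-objective reward; the sketch below is a generic illustration of that direction, not the paper's algorithm:

```python
import random

# Learn an operator sequence with scalarized multi-objective Q-learning:
# the state is the last operator applied, actions are operators, and the
# reward is a weighted sum of per-objective improvements.

OPS = ["swap", "insert", "reverse"]
weights = [0.5, 0.5]                      # scalarization of 2 objectives
Q = {(s, a): 0.0 for s in OPS + ["start"] for a in OPS}

def apply_op(op):
    """Placeholder: would return the improvement on each objective."""
    return [random.random() - 0.4, random.random() - 0.4]

state, eps, alpha, gamma = "start", 0.2, 0.1, 0.9
for _ in range(1000):
    a = random.choice(OPS) if random.random() < eps else \
        max(OPS, key=lambda op: Q[(state, op)])
    r = sum(w * d for w, d in zip(weights, apply_op(a)))
    best_next = max(Q[(a, op)] for op in OPS)
    Q[(state, a)] += alpha * (r + gamma * best_next - Q[(state, a)])
    state = a
```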

[255] ID-PaS : Identity-Aware Predict-and-Search for General Mixed-Integer Linear Programs

Junyang Cai, El Mehdi Er Raqabi, Pascal Van Hentenryck, Bistra Dilkina

Main category: cs.AI

TL;DR: Extends Predict-and-Search framework to parametric MIPs with identity-aware learning (ID-PaS) to handle heterogeneous variables, outperforming Gurobi and PaS on real-world problems.

DetailsMotivation: Current Predict-and-Search methods are limited to binary problems and don't handle fixed variables common in practical settings, while parametric MIPs with heterogeneous variables need better ML integration.

Method: Extends Predict-and-Search to parametric MIPs and introduces ID-PaS, an identity-aware learning framework that enables ML models to effectively handle heterogeneous variables in mixed-integer programs.

Result: Experiments on real-world large-scale problems show ID-PaS consistently achieves superior performance compared to state-of-the-art solver Gurobi and the original PaS framework.

Conclusion: ID-PaS successfully extends ML-enhanced Predict-and-Search to parametric MIPs with heterogeneous variables, demonstrating practical improvements over existing methods for real-world combinatorial optimization.

Abstract: Mixed-Integer Linear Programs (MIPs) are powerful and flexible tools for modeling a wide range of real-world combinatorial optimization problems. Predict-and-Search methods operate by using a predictive model to estimate promising variable assignments and then guiding a search procedure toward high-quality solutions. Recent research has demonstrated that incorporating machine learning (ML) into the Predict-and-Search framework significantly enhances its performance. Still, it is restricted to binary problems and overlooks the presence of fixed variables that commonly arise in practical settings. This work extends the Predict-and-Search (PaS) framework to parametric MIPs and introduces ID-PaS, an identity-aware learning framework that enables the ML model to handle heterogeneous variables more effectively. Experiments on several real-world large-scale problems demonstrate that ID-PaS consistently achieves superior performance compared to the state-of-the-art solver Gurobi and PaS.
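
The Predict-and-Search trust region can be sketched in a few lines with PuLP: anchor the solver to the k most confident ML predictions but allow up to Delta deviations. The toy model, probabilities, and parameters below are invented for illustration:

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

# Predict-and-Search sketch: constrain the solution to stay within `delta`
# flips of the k most confident predicted assignments. ID-PaS would
# additionally condition the predictor on variable identity.

n, k, delta = 10, 6, 1
probs = [0.95, 0.9, 0.1, 0.05, 0.8, 0.2, 0.5, 0.6, 0.15, 0.85]

prob = LpProblem("pas_sketch", LpMinimize)
x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(n)]
prob += lpSum(x)                                   # placeholder objective
prob += lpSum(x) >= 4                              # placeholder constraint

confident = sorted(range(n), key=lambda i: abs(probs[i] - 0.5),
                   reverse=True)[:k]
rounded = {i: round(probs[i]) for i in confident}
# allow at most `delta` of the k anchored variables to deviate
prob += lpSum((1 - x[i]) if rounded[i] == 1 else x[i]
              for i in confident) <= delta

prob.solve()
```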

[256] Reverse Thinking Enhances Missing Information Detection in Large Language Models

Yuxin Liu, Chaojie Gu, Yihang Zhang, Bin Qian, Shibo He

Main category: cs.AI

TL;DR: A reverse thinking framework improves LLMs’ ability to detect missing information by transforming the problem into backward reasoning, outperforming traditional forward reasoning methods.

DetailsMotivation: LLMs struggle with missing information problems, leading to incomplete responses, factual errors, and hallucinations. Traditional forward reasoning approaches like Chain-of-Thought and Tree-of-Thought fail to systematically identify and recover omitted information.

Method: Proposes a novel reverse thinking framework that guides LLMs through backward reasoning to identify necessary conditions and pinpoint missing elements. Transforms missing information identification into a more manageable backward reasoning problem.

Result: Experimental results show substantial performance gains compared to traditional forward reasoning methods. The reverse thinking approach significantly improves model accuracy on missing information detection tasks.

Conclusion: Reverse thinking provides a promising direction for enhancing LLMs’ logical completeness and reasoning robustness, offering an effective alternative to forward reasoning for problems involving missing information.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning tasks, yet they often struggle with problems involving missing information, exhibiting issues such as incomplete responses, factual errors, and hallucinations. While forward reasoning approaches like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) have shown success in structured problem-solving, they frequently fail to systematically identify and recover omitted information. In this paper, we explore the potential of reverse thinking methodologies to enhance LLMs’ performance on missing information detection tasks. Drawing inspiration from recent work on backward reasoning, we propose a novel framework that guides LLMs through reverse thinking to identify necessary conditions and pinpoint missing elements. Our approach transforms the challenging task of missing information identification into a more manageable backward reasoning problem, significantly improving model accuracy. Experimental results demonstrate that our reverse thinking approach achieves substantial performance gains compared to traditional forward reasoning methods, providing a promising direction for enhancing LLMs’ logical completeness and reasoning robustness.
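
In practice the contrast comes down to prompting; an illustrative forward/reverse prompt pair (the paper's exact templates are not given in the abstract, so these are assumptions):

```python
# Illustrative prompt pair contrasting forward and reverse thinking for
# missing-information detection.

FORWARD = """Solve the problem step by step:
{problem}"""

REVERSE = """Assume the question below has a definite answer.
Work backwards from the answer: list every condition that must hold
for it to be computable. Then check each condition against the problem
statement and report any that are missing.

Problem: {problem}"""

problem = "A train leaves at 3pm and arrives at 6pm. What was its speed?"
print(REVERSE.format(problem=problem))
# A backward pass should flag the missing distance, which forward
# chain-of-thought often glosses over.
```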

[257] Neuronal Attention Circuit (NAC) for Representation Learning

Waleed Razzaq, Izis Kankaraway, Yun-Bo Zhao

Main category: cs.AI

TL;DR: NAC (Neuronal Attention Circuit) is a continuous-time attention mechanism that reformulates attention logits as solutions to linear ODEs using biologically-inspired sparse gating, enabling efficient adaptive dynamics with theoretical guarantees.

DetailsMotivation: Standard attention mechanisms have discrete nature that limits continuous-time modeling capabilities. The authors aim to create a biologically plausible continuous-time attention mechanism that can handle irregular time-series data more effectively.

Method: NAC reformulates attention logits computation as solution to linear first-order ODE with nonlinear interlinked gates inspired by C. elegans neuronal circuits. It uses sparse sensory gates for key-query projections and a sparse backbone network with two heads for content-target and learnable time-constant gates. Supports three computation modes: explicit Euler integration, exact closed-form solution, and steady-state approximation. Includes sparse Top-K pairwise concatenation for memory efficiency.

Result: NAC matches or outperforms competing baselines in accuracy across irregular time-series classification, autonomous vehicle lane-keeping, and industrial prognostics. It occupies intermediate position in runtime and memory efficiency compared to other continuous-time baselines.

Conclusion: NAC provides a novel biologically plausible continuous-time attention mechanism with theoretical guarantees (state stability, bounded errors, universal approximation) that enables effective modeling of irregular time-series while maintaining competitive performance and efficiency.

Abstract: Attention improves representation learning over RNNs, but its discrete nature limits continuous-time (CT) modeling. We introduce Neuronal Attention Circuit (NAC), a novel, biologically plausible CT-Attention mechanism that reformulates attention logits computation as the solution to a linear first-order ODE with nonlinear interlinked gates derived from repurposing the C. elegans Neuronal Circuit Policies (NCPs) wiring mechanism. NAC replaces dense projections with sparse sensory gates for key-query projections and a sparse backbone network with two heads for computing content-target and learnable time-constant gates, enabling efficient adaptive dynamics. NAC supports three attention logit computation modes: (i) explicit Euler integration, (ii) exact closed-form solution, and (iii) steady-state approximation. To improve memory intensity, we implemented a sparse Top-K pairwise concatenation scheme that selectively curates key-query interactions. We provide rigorous theoretical guarantees, including state stability, bounded approximation errors, and universal approximation. Empirically, we implemented NAC in diverse domains, including irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics. We observed that NAC matches or outperforms competing baselines in accuracy and occupies an intermediate position in runtime and memory efficiency compared with several CT baselines.
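
Assuming the gated ODE takes the common relaxation form dh/dt = (g - h) / tau, with g the content target and tau the learned time constant (a reading of the abstract, not a confirmed detail), the three computation modes compare as follows:

```python
import numpy as np

# Three logit-computation modes for the linear first-order ODE
# dh/dt = (g - h) / tau. In NAC, g and tau come from learned gates;
# here they are fixed numbers for illustration.

g, tau, h0, T = 2.0, 0.5, 0.0, 1.0

# (i) explicit Euler integration
h, dt = h0, 0.01
for _ in range(int(T / dt)):
    h += dt * (g - h) / tau

# (ii) exact closed-form solution: h(T) = g + (h0 - g) * exp(-T / tau)
h_exact = g + (h0 - g) * np.exp(-T / tau)

# (iii) steady-state approximation: h(inf) = g
h_ss = g

print(h, h_exact, h_ss)   # Euler ~ exact; steady state is the T->inf limit
```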

[258] Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, Krista A. Ehinger

Main category: cs.AI

TL;DR: The paper introduces CogVision, a dataset for analyzing vision-language models’ internal mechanisms, particularly attention heads’ functional roles in multimodal reasoning through step-by-step decomposition of complex questions.

DetailsMotivation: Vision-language models excel on benchmarks but remain black boxes; there's a need to systematically understand their internal mechanisms, especially how attention heads function in multimodal reasoning.

Method: Introduces CogVision dataset that decomposes complex multimodal questions into step-by-step subquestions simulating human reasoning; uses probing-based methodology to identify functional attention heads specialized in specific receptive/cognitive functions.

Result: Functional heads are universally sparse, vary in number/distribution across functions, mediate interactions and hierarchical organization; intervention experiments show removing them degrades performance while emphasizing them enhances accuracy.

Conclusion: Provides insights into VLM cognitive organization and suggests directions for designing models with more human-aligned perceptual and reasoning abilities.

Abstract: Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization. Furthermore, intervention experiments demonstrate their critical role in multimodal reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. These findings provide new insights into the cognitive organization of VLMs and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.
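
The probing recipe is per-head linear classification; a sketch on simulated activations, where one planted "functional" head should be recovered:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Score each attention head by how well a linear probe on its output
# predicts whether the current subquestion exercises a given function
# (e.g., high-level visual reception). Activations are simulated.

rng = np.random.default_rng(0)
n_samples, n_heads, d_head = 400, 16, 32
acts = rng.normal(size=(n_samples, n_heads, d_head))
labels = rng.integers(0, 2, size=n_samples)        # subquestion function tag
acts[labels == 1, 3] += 1.0                        # plant head 3 as functional

scores = [cross_val_score(LogisticRegression(max_iter=500),
                          acts[:, h], labels, cv=3).mean()
          for h in range(n_heads)]
functional_heads = [h for h, s in enumerate(scores) if s > 0.8]
print(functional_heads)   # should single out head 3 in this toy setup
```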

[259] Trustworthy Orchestration Artificial Intelligence by the Ten Criteria with Control-Plane Governance

Byeong Ho Kang, Wenli Yang, Muhammad Bilal Amin

Main category: cs.AI

TL;DR: A framework called Ten Criteria for Trustworthy Orchestration AI addresses the accountability gap in AI systems by embedding governance into AI ecosystems through a Control-Panel architecture.

DetailsMotivation: There's a widening gap between AI's technical capabilities and institutional accountability. Ethical guidance alone is insufficient - governance needs to be embedded into the execution fabric of AI ecosystems.

Method: Proposes the Ten Criteria for Trustworthy Orchestration AI framework with a Control-Panel architecture that integrates human input, semantic coherence, audit and provenance integrity. Inspired by international standards and Australia’s National Framework for AI Assurance.

Result: Demonstrates that trustworthiness can be systematically engineered into AI systems, ensuring execution remains verifiable, transparent, reproducible and under meaningful human control.

Conclusion: The framework provides comprehensive governance for entire AI ecosystems (components, consumers, human participants), going beyond conventional AI-to-AI coordination to embed accountability into the execution fabric.

Abstract: As Artificial Intelligence (AI) systems increasingly assume consequential decision-making roles, a widening gap has emerged between technical capabilities and institutional accountability. Ethical guidance alone is insufficient to counter this challenge; it demands architectures that embed governance into the execution fabric of the ecosystem. This paper presents the Ten Criteria for Trustworthy Orchestration AI, a comprehensive assurance framework that integrates human input, semantic coherence, audit and provenance integrity into a unified Control-Panel architecture. Unlike conventional agentic AI initiatives that primarily focus on AI-to-AI coordination, the proposed framework provides an umbrella of governance over the entire set of AI components, their consumers, and human participants. Drawing inspiration from international standards and Australia’s National Framework for AI Assurance initiative, this work demonstrates that trustworthiness can be systematically engineered into AI systems, ensuring the execution fabric remains verifiable, transparent, reproducible and under meaningful human control.

[260] InfoCom: Kilobyte-Scale Communication-Efficient Collaborative Perception with Information Bottleneck

Quanmin Wei, Penglin Dai, Wei Li, Bingyi Liu, Xiao Wu

Main category: cs.AI

TL;DR: InfoCom is an information-aware framework that achieves communication-efficient collaborative perception for autonomous driving, reducing data transmission from MB to KB scale while maintaining near-lossless perception performance.

DetailsMotivation: Collaborative perception faces a fundamental communication-performance trade-off, and existing approaches assume MB-level data transmission which may fail under practical network constraints. There's a need for more communication-efficient solutions that work under real-world network limitations.

Method: InfoCom introduces an information purification paradigm based on extended Information Bottleneck principles. It includes: 1) Information-Aware Encoding to condense features into minimal messages, 2) Sparse Mask Generation to identify spatial cues with negligible cost, and 3) Multi-Scale Decoding that progressively recovers perceptual information through mask-guided mechanisms rather than simple feature reconstruction.

Result: InfoCom achieves near-lossless perception while reducing communication overhead from megabyte to kilobyte-scale, representing 440-fold and 90-fold reductions per agent compared to Where2comm and ERMVP respectively.

Conclusion: InfoCom establishes a pioneering theoretical foundation for communication-efficient collaborative perception and demonstrates practical viability by dramatically reducing communication requirements while maintaining perception quality, making it suitable for real-world autonomous driving systems with network constraints.

Abstract: Precise environmental perception is critical for the reliability of autonomous driving systems. While collaborative perception mitigates the limitations of single-agent perception through information sharing, it encounters a fundamental communication-performance trade-off. Existing communication-efficient approaches typically assume MB-level data transmission per collaboration, which may fail due to practical network constraints. To address these issues, we propose InfoCom, an information-aware framework establishing the pioneering theoretical foundation for communication-efficient collaborative perception via extended Information Bottleneck principles. Departing from mainstream feature manipulation, InfoCom introduces a novel information purification paradigm that theoretically optimizes the extraction of minimal sufficient task-critical information under Information Bottleneck constraints. Its core innovations include: i) An Information-Aware Encoding condensing features into minimal messages while preserving perception-relevant information; ii) A Sparse Mask Generation identifying spatial cues with negligible communication cost; and iii) A Multi-Scale Decoding that progressively recovers perceptual information through mask-guided mechanisms rather than simple feature reconstruction. Comprehensive experiments across multiple datasets demonstrate that InfoCom achieves near-lossless perception while reducing communication overhead from megabyte to kilobyte-scale, representing 440-fold and 90-fold reductions per agent compared to Where2comm and ERMVP, respectively.
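
An Information Bottleneck objective of this general shape trades task loss against a rate term on the transmitted message; the rate proxy below is an illustrative stand-in for InfoCom's actual formulation:

```python
import torch

# IB-style training objective sketch: task loss plus a beta-weighted rate
# term penalizing how much information the message carries. The rate proxy
# (masked feature magnitude) is illustrative, not the paper's term.

def infocom_loss(pred, target, mask, features, beta=1e-3):
    task = torch.nn.functional.mse_loss(pred, target)
    rate = (mask * features.abs()).mean()   # proxy for bits actually sent
    return task + beta * rate

features = torch.randn(2, 64, 32, 32)            # BEV feature map per agent
mask = torch.sigmoid(torch.randn(2, 1, 32, 32))  # learned spatial mask
message = mask * features                        # the (small) shared payload
pred, target = torch.randn(2, 10), torch.randn(2, 10)
loss = infocom_loss(pred, target, mask, features)
```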

[261] EpiPlanAgent: Agentic Automated Epidemic Response Planning

Kangkun Mao, Fang Xu, Jinru Ding, Yidong Jiang, Yujun Yao, Yirong Chen, Junming Liu, Xiaoqin Wu, Qian Wu, Xiaoyan Huang, Jie Xu

Main category: cs.AI

TL;DR: EpiPlanAgent is an LLM-based multi-agent system that automates epidemic response planning, improving plan quality and reducing development time compared to manual methods.

DetailsMotivation: Traditional epidemic response planning is labor-intensive and manual, creating a need for automated, scalable solutions to improve public health preparedness.

Method: Multi-agent framework using LLMs with task decomposition, knowledge grounding, and simulation modules, tested by public health professionals with real-world outbreak scenarios.

Result: Significantly improved plan completeness and guideline alignment while drastically reducing development time; expert evaluation showed high consistency with human-authored content.

Conclusion: EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.

Abstract: Epidemic response planning is essential yet traditionally reliant on labor-intensive manual methods. This study aimed to design and evaluate EpiPlanAgent, an agent-based system using large language models (LLMs) to automate the generation and validation of digital emergency response plans. The multi-agent framework integrated task decomposition, knowledge grounding, and simulation modules. Public health professionals tested the system using real-world outbreak scenarios in a controlled evaluation. Results demonstrated that EpiPlanAgent significantly improved the completeness and guideline alignment of plans while drastically reducing development time compared to manual workflows. Expert evaluation confirmed high consistency between AI-generated and human-authored content. User feedback indicated strong perceived utility. In conclusion, EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.

[262] User-Feedback-Driven Continual Adaptation for Vision-and-Language Navigation

Yongqiang Yu, Xuhui Li, Hazza Mahmood, Jinxing Zhou, Haodong Hong, Longtao Jiang, Zhiqiang Xu, Qi Wu, Xiaojun Chang

Main category: cs.AI

TL;DR: A user-feedback-driven adaptation framework for Vision-and-Language Navigation that integrates human interactions into continual learning, converting user feedback into environment-aligned training data with memory-bank warm-start to improve navigation performance.

DetailsMotivation: Current GSA-VLN frameworks exclude valuable user feedback and rely only on unsupervised adaptation from environmental exposure. User feedback offers natural supervision that can significantly enhance adaptation quality for real-world deployment.

Method: Introduces a user-feedback-driven adaptation framework that systematically integrates human interactions into continual learning. Converts user feedback (navigation instructions and corrective signals) into high-quality training data. Uses memory-bank warm-start mechanism to reuse previously acquired environmental knowledge and mitigate cold-start degradation.

Result: Experiments on GSA-R2R benchmark show consistent improvement over strong baselines like GR-DUET, with better navigation success and path efficiency. Memory-bank warm start stabilizes early navigation and reduces performance drops after updates. Framework demonstrates robustness in both continual and hybrid adaptation settings.

Conclusion: The user-feedback-driven framework effectively enhances GSA-VLN adaptation by leveraging human supervision, with memory-bank warm-start ensuring stable redeployment and sustained improvement across diverse deployment conditions.

Abstract: Vision-and-Language Navigation (VLN) requires agents to navigate complex environments by following natural-language instructions. General Scene Adaptation for VLN (GSA-VLN) shifts the focus from zero-shot generalization to continual, environment-specific adaptation, narrowing the gap between static benchmarks and real-world deployment. However, current GSA-VLN frameworks exclude user feedback, relying solely on unsupervised adaptation from repeated environmental exposure. In practice, user feedback offers natural and valuable supervision that can significantly enhance adaptation quality. We introduce a user-feedback-driven adaptation framework that extends GSA-VLN by systematically integrating human interactions into continual learning. Our approach converts user feedback (navigation instructions and corrective signals) into high-quality, environment-aligned training data, enabling efficient and realistic adaptation. A memory-bank warm-start mechanism further reuses previously acquired environmental knowledge, mitigating cold-start degradation and ensuring stable redeployment. Experiments on the GSA-R2R benchmark show that our method consistently surpasses strong baselines such as GR-DUET, improving navigation success and path efficiency. The memory-bank warm start stabilizes early navigation and reduces performance drops after updates. Results under both continual and hybrid adaptation settings confirm the robustness and generality of our framework, demonstrating sustained improvement across diverse deployment conditions.
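
To make the memory-bank warm start concrete, here is a minimal sketch of how such a mechanism could be wired up. The class and method names (MemoryBank, load_adapter, finetune) are hypothetical, not the paper's API, and the agent object is assumed to expose adapter save/load and fine-tuning hooks.

```python
# Hypothetical sketch of a memory-bank warm start for continual VLN adaptation.
# All names are illustrative; `agent` is assumed to expose adapter hooks.
from collections import defaultdict

class MemoryBank:
    """Stores per-environment adapter states and replayable feedback samples."""
    def __init__(self):
        self.env_state = {}                    # env_id -> saved adapter weights
        self.feedback = defaultdict(list)      # env_id -> (instruction, correction) pairs

    def warm_start(self, env_id, agent):
        # Reuse previously acquired environmental knowledge to avoid cold starts.
        if env_id in self.env_state:
            agent.load_adapter(self.env_state[env_id])
        return agent

    def record(self, env_id, instruction, correction):
        # Convert raw user feedback into environment-aligned training pairs.
        self.feedback[env_id].append((instruction, correction))

def adapt(agent, bank, env_id, new_feedback):
    agent = bank.warm_start(env_id, agent)
    for instruction, correction in new_feedback:
        bank.record(env_id, instruction, correction)
    agent.finetune(bank.feedback[env_id])      # continual update on accumulated pairs
    bank.env_state[env_id] = agent.save_adapter()  # persist for stable redeployment
    return agent
```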

[263] On the Collapse of Generative Paths: A Criterion and Correction for Diffusion Steering

Ziseok Lee, Minyeong Hwang, Sanghyun Jo, Wooyeol Lee, Jihyung Ko, Young Bin Park, Jae-Mun Choi, Eunho Yang, Kyungsu Kim

Main category: cs.AI

TL;DR: ACE method prevents Marginal Path Collapse in ratio-of-densities steering for diffusion models, enabling reliable composition of heterogeneous models.

DetailsMotivation: Ratio-of-densities steering for diffusion models has a critical failure mode called Marginal Path Collapse, where intermediate densities become non-normalizable when composing heterogeneous models trained on different noise schedules or datasets, particularly problematic in molecular design tasks like flexible-pose scaffold decoration.

Method: Two-part solution: 1) Derive a simple path existence criterion predicting collapse from noise schedules and exponents alone; 2) Introduce Adaptive path Correction with Exponents (ACE), which extends Feynman-Kac steering to time-varying exponents and guarantees valid probability paths.

Result: ACE eliminates collapse and enables high-guidance compositional generation, improving distributional and docking metrics over constant-exponent baselines and even specialized task-specific scaffold decoration models on both synthetic 2D benchmark and flexible-pose scaffold decoration tasks.

Conclusion: ACE transforms ratio-of-densities steering with heterogeneous experts from an unstable heuristic into a reliable tool for controllable generation, particularly valuable for complex molecular design tasks requiring composition of multiple specialized models.

Abstract: Inference-time steering enables pretrained diffusion/flow models to be adapted to new tasks without retraining. A widely used approach is the ratio-of-densities method, which defines a time-indexed target path by reweighting probability-density trajectories from multiple models with positive, or in some cases, negative exponents. This construction, however, harbors a critical and previously unformalized failure mode: Marginal Path Collapse, where intermediate densities become non-normalizable even though endpoints remain valid. Collapse arises systematically when composing heterogeneous models trained on different noise schedules or datasets, including a common setting in molecular design where de-novo, conformer, and pocket-conditioned models must be combined for tasks such as flexible-pose scaffold decoration. We provide a novel and complete solution for the problem. First, we derive a simple path existence criterion that predicts exactly when collapse occurs from noise schedules and exponents alone. Second, we introduce Adaptive path Correction with Exponents (ACE), which extends Feynman-Kac steering to time-varying exponents and guarantees a valid probability path. On a synthetic 2D benchmark and on flexible-pose scaffold decoration, ACE eliminates collapse and enables high-guidance compositional generation, improving distributional and docking metrics over constant-exponent baselines and even specialized task-specific scaffold decoration models. Our work turns ratio-of-densities steering with heterogeneous experts from an unstable heuristic into a reliable tool for controllable generation.
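
The abstract does not spell out the criterion, but a toy version conveys the flavor: if the composed intermediate marginals are isotropic Gaussians N(x_t; mu_i, sigma_i(t)^2 I) raised to exponents w_i, their product is normalizable only while the combined precision sum_i w_i / sigma_i(t)^2 stays positive. The sketch below checks that condition on a time grid; the schedules and weights are made up for illustration and are not the paper's exact criterion.

```python
# Toy normalizability check for a ratio-of-densities path with Gaussian
# intermediates: prod_i N(x; mu_i, sigma_i(t)^2 I)^{w_i} is normalizable only
# while sum_i w_i / sigma_i(t)^2 > 0 at every timestep t.
import numpy as np

def path_exists(sigmas, weights, ts):
    """sigmas: callables sigma_i(t); weights: (possibly negative) exponents w_i."""
    for t in ts:
        precision = sum(w / s(t) ** 2 for w, s in zip(weights, sigmas))
        if precision <= 0:
            return False, t   # marginal path collapses at this timestep
    return True, None

# Two heterogeneous noise schedules, one model used with a negative exponent:
sigma_a = lambda t: 0.1 + 0.9 * t          # schedule of model A
sigma_b = lambda t: 0.1 + 0.9 * t ** 2     # schedule of model B
ok, t_fail = path_exists([sigma_a, sigma_b], [2.0, -1.0], np.linspace(0, 1, 101))
print(ok, t_fail)   # collapses mid-path even though both endpoints are valid
```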

[264] REMISVFU: Vertical Federated Unlearning via Representation Misdirection for Intermediate Output Feature

Wenhan Wu, Zhili He, Huanghuang Liang, Yili Gong, Jiawei Jiang, Chuang Hu, Dazhao Cheng

Main category: cs.AI

TL;DR: REMISVFU is a plug-and-play representation misdirection framework for fast client-level unlearning in split Vertical Federated Learning (VFL) systems, enabling GDPR-compliant right to be forgotten while maintaining utility for remaining parties.

DetailsMotivation: Data protection regulations like GDPR require federated systems to support the "right to be forgotten," but existing unlearning methods focus on Horizontal Federated Learning (HFL) and are ineffective for Vertical Federated Learning (VFL) where data is partitioned by features rather than samples.

Method: When a deletion request arrives, the forgetting party collapses its encoder output to a randomly sampled anchor on the unit sphere, breaking the statistical link between its features and the global model. The server jointly optimizes retention and forgetting losses, aligning their gradients via orthogonal projection to prevent destructive interference.

Result: Evaluations on public benchmarks show REMISVFU suppresses back-door attack success to the natural class-prior level and sacrifices only about 2.5 percentage points of clean accuracy, outperforming state-of-the-art baselines.

Conclusion: REMISVFU provides an effective solution for client-level unlearning in VFL systems, addressing the unique challenges of feature-partitioned architectures while maintaining model utility and complying with data protection regulations.

Abstract: Data-protection regulations such as the GDPR grant every participant in a federated system a right to be forgotten. Federated unlearning has therefore emerged as a research frontier, aiming to remove a specific party’s contribution from the learned model while preserving the utility of the remaining parties. However, most unlearning techniques focus on Horizontal Federated Learning (HFL), where data are partitioned by samples. In contrast, Vertical Federated Learning (VFL) allows organizations that possess complementary feature spaces to train a joint model without sharing raw data. The resulting feature-partitioned architecture renders HFL-oriented unlearning methods ineffective. In this paper, we propose REMISVFU, a plug-and-play representation misdirection framework that enables fast, client-level unlearning in split VFL systems. When a deletion request arrives, the forgetting party collapses its encoder output to a randomly sampled anchor on the unit sphere, severing the statistical link between its features and the global model. To maintain utility for the remaining parties, the server jointly optimizes a retention loss and a forgetting loss, aligning their gradients via orthogonal projection to eliminate destructive interference. Evaluations on public benchmarks show that REMISVFU suppresses back-door attack success to the natural class-prior level and sacrifices only about 2.5 percentage points of clean accuracy, outperforming state-of-the-art baselines.
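
A minimal PyTorch sketch of the gradient-alignment step, assuming flattened gradient vectors for the retention and forgetting losses; the projection rule shown is the standard conflicting-gradient fix and may differ in detail from the paper's.

```python
# Sketch: combine retention and forgetting gradients without destructive
# interference by projecting away the conflicting component (assumption:
# gradients are pre-flattened 1-D tensors over the shared parameters).
import torch

def combine_gradients(g_retain: torch.Tensor, g_forget: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g_forget, g_retain)
    if dot < 0:  # the two objectives interfere destructively
        # Remove the component of g_forget that points against g_retain.
        g_forget = g_forget - dot / g_retain.norm().pow(2) * g_retain
    return g_retain + g_forget

def random_anchor(dim: int) -> torch.Tensor:
    """Target for the forgetting party: a random point on the unit sphere."""
    v = torch.randn(dim)
    return v / v.norm()
```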

[265] LLM-Empowered Representation Learning for Emerging Item Recommendation

Ziying Zhang, Quanming Yao, Yaqing Wang

Main category: cs.AI

TL;DR: EmerFlow: LLM-powered framework for recommending emerging items by generating distinctive embeddings that balance uniqueness with shared patterns from established items.

DetailsMotivation: Existing recommendation methods oversimplify emerging items by assuming they have few/no interactions, failing to capture their dynamic accumulation of interactions over time while preserving their uniqueness and leveraging shared patterns with established items.

Method: Three-step framework: 1) Enrich raw features of emerging items through LLM reasoning, 2) Align representations with existing recommendation model’s embedding space, 3) Incorporate new interactions through meta-learning to refine embeddings.

Result: Extensive experiments across diverse domains (movies, pharmaceuticals) show EmerFlow consistently outperforms existing methods in learning expressive embeddings for emerging items from limited interactions.

Conclusion: EmerFlow effectively addresses the emerging item recommendation challenge by generating distinctive embeddings that preserve uniqueness while leveraging shared patterns, enabled by LLM reasoning and meta-learning.

Abstract: In this work, we tackle the challenge of recommending emerging items, whose interactions gradually accumulate over time. Existing methods often overlook this dynamic process, typically assuming that emerging items have few or even no historical interactions. Such an assumption oversimplifies the problem, as a good model must preserve the uniqueness of emerging items while leveraging their shared patterns with established ones. To address this challenge, we propose EmerFlow, a novel LLM-empowered representation learning framework that generates distinctive embeddings for emerging items. It first enriches the raw features of emerging items through LLM reasoning, then aligns these representations with the embedding space of the existing recommendation model. Finally, new interactions are incorporated through meta-learning to refine the embeddings. This enables EmerFlow to learn expressive embeddings for emerging items from only limited interactions. Extensive experiments across diverse domains, including movies and pharmaceuticals, show that EmerFlow consistently outperforms existing methods.

[266] AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, Yuanchun Li

Main category: cs.AI

TL;DR: AgentProg: A program-guided approach for mobile GUI agent context management that organizes interaction history as a program to reduce context overhead while maintaining task performance.

DetailsMotivation: Mobile GUI agents for long-horizon tasks face critical bottlenecks due to reliance on expanding interaction history causing substantial context overhead. Existing context management techniques often fail to preserve vital semantic information, leading to degraded task performance.

Method: AgentProg reframes interaction history as a program with variables and control flow, organizing information according to program structure to determine what to retain/discard. Integrates a global belief state mechanism inspired by Belief MDP framework to handle partial observability and adapt to unexpected environmental changes.

Result: Achieved state-of-the-art success rates on AndroidWorld and an extended long-horizon task suite. Maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation.

Conclusion: AgentProg provides an effective program-guided approach for agent context management that reduces context overhead while preserving semantic information, enabling robust performance on long-horizon mobile GUI tasks.

Abstract: The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. By organizing information according to the structure of program, this structure provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg has achieved the state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.
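
An illustrative sketch of the core idea, reframing interaction history as program state rather than a transcript; the ProgramContext schema below is a hypothetical stand-in for AgentProg's actual representation.

```python
# Hypothetical sketch: GUI interaction history as a compact "program state"
# (named variables + control flow) instead of an append-only transcript.
from dataclasses import dataclass, field

@dataclass
class ProgramContext:
    variables: dict = field(default_factory=dict)   # named intermediate results
    belief: dict = field(default_factory=dict)      # global belief about env state
    pc: int = 0                                     # "program counter" over plan steps

    def assign(self, name, value):
        # Retain only semantically named results; raw screens/steps are discarded.
        self.variables[name] = value

    def render_prompt(self, plan):
        # The LLM sees the compact program state, not the full interaction log.
        return (f"step {self.pc}/{len(plan)}: {plan[self.pc]}\n"
                f"vars: {self.variables}\nbelief: {self.belief}")

ctx = ProgramContext()
ctx.assign("order_id", "A-1042")
ctx.belief["screen"] = "checkout"
print(ctx.render_prompt(["open app", "find order", "request refund"]))
```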

[267] Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

Yang Yu, Zhuangzhuang Chen, Siqi Wang, Lanqing Li, Xiaomeng Li

Main category: cs.AI

TL;DR: SaEI improves VLM reasoning via selective-adversarial entropy intervention during RL sampling, enhancing answer diversity without distorting factual knowledge.

DetailsMotivation: Existing RL-based VLM finetuning methods only intervene entropy during policy optimization, ignoring entropy intervention during RL sampling which could improve response diversity and boost GRPO performance.

Method: Proposes SaEI with two components: 1) Entropy-guided adversarial sampling (EgAS) formulates sampled response entropy as adversarial objective to attack visual inputs, and 2) Token-selective entropy computation (TsEC) selectively computes entropy to avoid distorting factual knowledge.

Result: Extensive experiments on in-domain and out-of-domain datasets show SaEI greatly improves policy exploration via entropy intervention, boosting reasoning capabilities of VLMs.

Conclusion: Selective-adversarial entropy intervention during RL sampling effectively enhances VLM reasoning by improving answer diversity while preserving factual knowledge, outperforming existing entropy intervention methods.

Abstract: Recently, reinforcement learning (RL) has become a common choice for enhancing the reasoning capabilities of vision-language models (VLMs). Among existing RL-based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ignore entropy intervention during RL sampling, which can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with a token-selective adversarial objective derived from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS), which formulates the entropy of sampled responses as an adversarial objective. The corresponding adversarial gradient can then be used to attack the visual input and produce adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. We then propose token-selective entropy computation (TsEC) to maximize the effectiveness of the adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, boosting reasoning capabilities. Code will be released once the paper is accepted.
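
A hedged sketch of what EgAS plus TsEC could look like in PyTorch, reduced to a single FGSM-style step; the model interface and the token-selection mask are assumptions, not the paper's implementation.

```python
# Illustrative entropy-guided adversarial step: perturb the image along the
# gradient that *increases* the entropy of the policy's token distribution,
# restricted to selected tokens (TsEC-style) to avoid distorting factual ones.
import torch
import torch.nn.functional as F

def egas_step(model, image, prompt_ids, select_mask, eps=2/255):
    """model(image, prompt_ids) is assumed to return per-token logits (seq, vocab);
    select_mask is a float tensor (seq,) marking tokens that enter the objective."""
    image = image.clone().requires_grad_(True)
    logits = model(image, prompt_ids)
    probs = F.softmax(logits, dim=-1)
    token_entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (seq,)
    entropy = (token_entropy * select_mask).sum() / select_mask.sum()
    entropy.backward()
    # One FGSM-like ascent step on the visual input, clipped to valid pixels.
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()
```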

[268] Planning, Living and Judging: A Multi-agent LLM-based Framework for Cyclical Urban Planning

Hang Ni, Yuzhi Wang, Hao Liu

Main category: cs.AI

TL;DR: CUP is a cyclical urban planning framework using LLM agents to continuously generate, evaluate, and refine urban plans through planning, living simulation, and judging components.

DetailsMotivation: Urban regeneration faces challenges in adapting to evolving needs during urbanization, requiring more dynamic and responsive planning approaches that can continuously improve based on feedback.

Method: Multi-agent LLM framework with three components: Planning (generates/refines plans), Living (simulates resident behaviors), and Judging (evaluates effectiveness). These operate in a closed-loop cyclical process.

Result: Experiments on real-world datasets demonstrate the framework’s effectiveness as a continuous and adaptive planning process for urban regeneration.

Conclusion: CUP provides a new paradigm for urban planning that enables dynamic, responsive, and continuously improving urban regeneration through LLM-powered cyclical processes.

Abstract: Urban regeneration presents significant challenges within the context of urbanization, requiring adaptive approaches to tackle evolving needs. Leveraging advancements in large language models (LLMs), we propose Cyclical Urban Planning (CUP), a new paradigm that continuously generates, evaluates, and refines urban plans in a closed loop. Specifically, our multi-agent LLM-based framework consists of three key components: (1) Planning, where LLM agents generate and refine urban plans based on contextual data; (2) Living, where agents simulate the behaviors and interactions of residents, modeling life in the urban environment; and (3) Judging, which involves evaluating plan effectiveness and providing iterative feedback for improvement. The cyclical process enables a dynamic and responsive planning approach. Experiments on a real-world dataset demonstrate the effectiveness of our framework as a continuous and adaptive planning process.
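
The closed loop reduces to a few lines of Python; all agent interfaces below (planner, residents, judge) are assumed stand-ins, not taken from the paper.

```python
# Bare-bones sketch of the Planning -> Living -> Judging cycle.
def cyclical_urban_planning(context, planner, residents, judge, rounds=5):
    plan = planner.generate(context)
    for _ in range(rounds):
        life_log = [r.simulate(plan) for r in residents]   # Living: resident behavior
        feedback = judge.evaluate(plan, life_log)          # Judging: effectiveness
        if feedback.satisfactory:
            break
        plan = planner.refine(plan, feedback)              # Planning: closed-loop update
    return plan
```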

[269] Representation of the structure of graphs by sequences of instructions

Ezequiel Lopez-Rubio

Main category: cs.AI

TL;DR: A new method represents graphs as strings of instructions that build adjacency matrices step-by-step, making graphs amenable to deep learning language models while preserving local structural patterns.

DetailsMotivation: Current graph representations (based on adjacency matrices) are not suitable for processing by powerful deep learning language models that specialize in text processing. There's a need to bridge the gap between graph processing and modern text-based deep learning models.

Method: Proposes a novel graph representation method that transforms adjacency matrices into strings of simple instructions. The instructions build the adjacency matrix step by step, creating a reversible transformation where graphs can be converted to strings and vice versa.

Result: The representation is compact and maintains local structural patterns of graphs. A tentative computational experiment shows favorable results, suggesting the method could effectively bridge graph processing with deep learning models.

Conclusion: This new string-based representation of graphs could boost graph processing by deep learning language models, potentially enabling more effective application of text-specialized models to graph analysis tasks.

Abstract: The representation of graphs is commonly based on the adjacency matrix concept. This formulation is the foundation of most algebraic and computational approaches to graph processing. The advent of deep learning language models offers a wide range of powerful computational models that are specialized in the processing of text. However, current procedures to represent graphs are not amenable to processing by these models. In this work, a new method to represent graphs is proposed. It represents the adjacency matrix of a graph by a string of simple instructions. The instructions build the adjacency matrix step by step. The transformation is reversible, i.e. given a graph the string can be produced and vice versa. The proposed representation is compact and it maintains the local structural patterns of the graph. Therefore, it is envisaged that it could be useful to boost the processing of graphs by deep learning models. A tentative computational experiment is reported, with favorable results.
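
The paper's concrete instruction set is not given in the abstract, but one plausible (hypothetical) encoding shows the reversibility property: "N" appends a node and "E i j" adds an edge, so a graph round-trips exactly through its string.

```python
# Hypothetical instruction set in the spirit of the paper: build the adjacency
# structure step by step, with an exact graph -> string -> graph round trip.
def graph_to_string(n, edges):
    tokens = ["N"] * n + [f"E {i} {j}" for i, j in sorted(edges)]
    return ";".join(tokens)

def string_to_graph(s):
    n, edges = 0, []
    for tok in s.split(";"):
        if tok == "N":
            n += 1                       # instruction: add a node
        else:
            _, i, j = tok.split()        # instruction: add an undirected edge
            edges.append((int(i), int(j)))
    return n, edges

s = graph_to_string(3, [(0, 1), (1, 2)])          # "N;N;N;E 0 1;E 1 2"
assert string_to_graph(s) == (3, [(0, 1), (1, 2)])
```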

[270] Targeted Data Protection for Diffusion Model by Matching Training Trajectory

Hojun Lee, Mijin Koo, Yeji Song, Nojun Kwak

Main category: cs.AI

TL;DR: TAFAP is a new method for targeted data protection in diffusion models that controls the entire training trajectory to redirect model outputs toward user-specified concepts, outperforming snapshot-based approaches.

DetailsMotivation: Current diffusion model fine-tuning for personalization raises privacy concerns, but existing protection methods only degrade image quality passively. Targeted Data Protection (TDP) offers active redirection, but existing TDP methods have poor controllability due to snapshot-matching approaches that don't account for complete learning dynamics.

Method: TAFAP (Trajectory Alignment via Fine-tuning with Adversarial Perturbations) controls the entire training trajectory rather than just snapshots. It uses trajectory-matching inspired by dataset distillation to enforce persistent, verifiable transformations throughout fine-tuning, achieving simultaneous control over both identity and visual patterns.

Result: TAFAP significantly outperforms existing TDP attempts, achieving robust redirection toward target concepts while maintaining high image quality. It demonstrates the first successful targeted transformation in diffusion models with control over both identity and visual patterns.

Conclusion: TAFAP enables verifiable safeguards and provides a new framework for controlling and tracing alterations in diffusion model outputs, representing a significant advancement in targeted data protection for privacy-sensitive applications.

Abstract: Recent advancements in diffusion models have made fine-tuning text-to-image models for personalization increasingly accessible, but have also raised significant concerns regarding unauthorized data usage and privacy infringement. Current protection methods are limited to passively degrading image quality, failing to achieve stable control. While Targeted Data Protection (TDP) offers a promising paradigm for active redirection toward user-specified target concepts, existing TDP attempts suffer from poor controllability due to snapshot-matching approaches that fail to account for complete learning dynamics. We introduce TAFAP (Trajectory Alignment via Fine-tuning with Adversarial Perturbations), the first method to successfully achieve effective TDP by controlling the entire training trajectory. Unlike snapshot-based methods whose protective influence is easily diluted as training progresses, TAFAP employs trajectory-matching inspired by dataset distillation to enforce persistent, verifiable transformations throughout fine-tuning. We validate our method through extensive experiments, demonstrating the first successful targeted transformation in diffusion models with simultaneous control over both identity and visual patterns. TAFAP significantly outperforms existing TDP attempts, achieving robust redirection toward target concepts while maintaining high image quality. This work enables verifiable safeguards and provides a new framework for controlling and tracing alterations in diffusion model outputs.
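
A rough sketch of the trajectory-matching objective in the dataset-distillation style the abstract alludes to: choose a perturbation so that a fine-tuning step on the protected image moves the parameters in the same direction as a step toward the target concept. Everything here (single-step unrolling, cosine matching, the function signature) is an illustrative assumption, not TAFAP's actual loss.

```python
# Sketch: align the gradient induced by the perturbed protected image with the
# gradient induced by the target-concept image (cosine distance between them).
import torch

def trajectory_loss(model, loss_fn, x_protected, delta, x_target, y):
    params = [p for p in model.parameters() if p.requires_grad]
    g_prot = torch.autograd.grad(loss_fn(model(x_protected + delta), y), params,
                                 create_graph=True)       # differentiable w.r.t. delta
    g_tgt = torch.autograd.grad(loss_fn(model(x_target), y), params)
    num = sum((gp * gt.detach()).sum() for gp, gt in zip(g_prot, g_tgt))
    den = (sum(gp.pow(2).sum() for gp in g_prot).sqrt()
           * sum(gt.pow(2).sum() for gt in g_tgt).sqrt().detach())
    return 1 - num / den   # minimizing this aligns the two update directions
```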

[271] When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola, Yash Sinha, Murari Mandal, Dhruv Kumar

Main category: cs.AI

TL;DR: LLM-based peer review systems are vulnerable to adversarial PDF manipulation attacks that can flip “Reject” decisions to “Accept,” with obfuscation strategies achieving high success rates across multiple language models.

DetailsMotivation: The motivation is to investigate the robustness of LLM-based peer review systems (both individual reviewer use and institutional deployment) against adversarial attacks, particularly focusing on the incentive to manipulate paper acceptance decisions.

Method: Researchers developed a novel evaluation metric (WAVS - Weighted Adversarial Vulnerability Score), curated a dataset of 200 scientific papers, adapted 15 domain-specific attack strategies, and evaluated them across 13 language models including GPT-5, Claude Haiku, and DeepSeek.

Result: Obfuscation strategies like “Maximum Mark Magyk” successfully manipulated scores and achieved alarming decision flip rates, demonstrating significant vulnerabilities in LLM-based peer review systems, even in large-scale models.

Conclusion: LLM-as-a-Judge systems in scientific peer review are vulnerable to adversarial PDF manipulation attacks designed to flip rejection decisions to acceptance, highlighting critical security concerns for AI-powered assessment systems in academia.

Abstract: The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the “Lazy Reviewer” hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford’s Agents4Science. This study investigates the robustness of these “LLM-as-a-Judge” systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping “Reject” decisions to “Accept,” for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like “Maximum Mark Magyk” successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.

[272] Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, Hai Zhao

Main category: cs.AI

TL;DR: ReMe is a framework for LLM agents that transforms procedural memory from static storage to dynamic reasoning through distillation, adaptive reuse, and utility-based refinement, enabling experience-driven evolution and outperforming larger models.

DetailsMotivation: Existing LLM agent memory frameworks suffer from "passive accumulation" - treating memory as static, append-only archives rather than dynamic reasoning tools. This creates a gap between static storage and the need for adaptive, evolving intelligence.

Method: ReMe introduces three key mechanisms: 1) multi-faceted distillation extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers, and generating comparative insights; 2) context-adaptive reuse tailors historical insights to new contexts via scenario-aware indexing; 3) utility-based refinement autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool.

Result: Extensive experiments on BFCL-V3 and AppWorld show ReMe establishes new state-of-the-art in agent memory systems. Crucially, Qwen3-8B with ReMe outperforms larger, memoryless Qwen3-14B, demonstrating a significant memory-scaling effect where self-evolving memory provides computation-efficient lifelong learning.

Conclusion: ReMe bridges the gap between static memory storage and dynamic reasoning, enabling LLM agents to evolve through experience. The framework demonstrates that self-evolving memory can be more effective than simply scaling model size, offering a pathway for efficient lifelong learning in AI agents.

Abstract: Procedural memory enables large language model (LLM) agents to internalize “how-to” knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a “passive accumulation” paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose ReMe (Remember Me, Refine Me), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) multi-faceted distillation, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers, and generating comparative insights; 2) context-adaptive reuse, which tailors historical insights to new contexts via scenario-aware indexing; and 3) utility-based refinement, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory systems. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms the larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the reme.library dataset to facilitate further research.
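
A small sketch of what utility-based refinement might look like; the field names, EMA utility rule, and thresholds are illustrative assumptions rather than ReMe's actual implementation.

```python
# Sketch of a compact, self-refining experience pool: add validated memories,
# score them by downstream usefulness, and prune stale or unhelpful ones.
import time

class ExperiencePool:
    def __init__(self, capacity=500, min_utility=0.2):
        self.items = []            # each: dict(text, utility, uses, added_at)
        self.capacity = capacity
        self.min_utility = min_utility

    def add(self, text, validated=True):
        if validated:              # only valid experiences enter the pool
            self.items.append({"text": text, "utility": 0.5,
                               "uses": 0, "added_at": time.time()})

    def reward(self, idx, helped):
        # Utility as an exponential moving average of downstream success.
        m = self.items[idx]
        m["uses"] += 1
        m["utility"] = 0.9 * m["utility"] + 0.1 * (1.0 if helped else 0.0)

    def refine(self):
        # Prune outdated or unhelpful memories; keep the pool within capacity.
        self.items = [m for m in self.items if m["utility"] >= self.min_utility]
        self.items.sort(key=lambda m: m["utility"], reverse=True)
        del self.items[self.capacity:]
```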

[273] Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation

Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang

Main category: cs.AI

TL;DR: Training-free LLM agent architecture for zero-shot PCG parameter configuration using Actor-Critic agents to bridge semantic gap between natural language instructions and technical parameters.

DetailsMotivation: PCG tools require precise configuration of opaque technical parameters, but off-the-shelf LLMs fail to bridge the semantic gap between abstract user instructions and strict parameter specifications.

Method: Proposes a training-free architecture with Actor and Critic LLM agents that work iteratively: Actor reasons over tool parameters, Critic refines configurations to align with human design preferences through autonomous reasoning.

Result: Outperforms single-agent baselines, produces diverse and structurally valid 3D environments from natural language descriptions, establishes new benchmark for instruction-following in PCG.

Conclusion: Off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools without task-specific fine-tuning, shifting burden from model training to architectural reasoning.

Abstract: Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.

[274] Replace, Don’t Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly

Moshe Lahmy, Roi Yozevitch

Main category: cs.AI

TL;DR: SEAL-RAG is a training-free controller that uses a “replace, don’t expand” strategy to fight context dilution in multi-hop RAG queries by actively swapping out distractors for gap-closing evidence.

DetailsMotivation: Existing RAG systems fail on multi-hop queries when initial retrieval misses bridge facts. Prior corrective approaches (Self-RAG, CRAG, Adaptive-k) add more context or prune lists, leading to context dilution where distractors crowd out relevant information.

Method: SEAL-RAG executes a Search → Extract → Assess → Loop cycle: performs entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence under fixed retrieval depth k.

Result: On HotpotQA (k=3), SEAL improves answer correctness by +3-13 pp and evidence precision by +12-18 pp over Self-RAG. On 2WikiMultiHopQA (k=5), it outperforms Adaptive-k by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. Gains are statistically significant (p<0.001).

Conclusion: SEAL-RAG’s fixed-k replacement strategy yields predictable cost profile while ensuring top-k slots are optimized for precision rather than mere breadth, effectively combating context dilution in multi-hop RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-k, typically address this by adding more context or pruning existing lists. However, simply expanding the context window often leads to context dilution, where distractors crowd out relevant information. We propose SEAL-RAG, a training-free controller that adopts a “replace, don’t expand” strategy to fight context dilution under a fixed retrieval depth k. SEAL executes a (Search → Extract → Assess → Loop) cycle: it performs on-the-fly, entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-k in a shared environment on HotpotQA and 2WikiMultiHopQA. On HotpotQA (k=3), SEAL improves answer correctness by +3–13 pp and evidence precision by +12–18 pp over Self-RAG. On 2WikiMultiHopQA (k=5), it outperforms Adaptive-k by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. These gains are statistically significant (p < 0.001). By enforcing fixed-k replacement, SEAL yields a predictable cost profile while ensuring the top-k slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
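
The fixed-budget replacement loop is easy to state in code. In this sketch, retrieve, extract_gaps, and score are stand-ins for the paper's retrieval, entity-anchored gap extraction, and entity-first ranking components.

```python
# Minimal sketch of the "replace, don't expand" loop under a fixed budget k.
def seal_rag(question, retrieve, extract_gaps, score, k=3, max_rounds=3):
    context = retrieve(question, k)                 # initial top-k passages
    for _ in range(max_rounds):
        gaps = extract_gaps(question, context)      # missing entities/relations
        if not gaps:
            break                                   # evidence set is complete
        for gap in gaps:
            candidates = retrieve(gap, k)           # targeted micro-query
            best = max(candidates, key=lambda p: score(p, question, gaps))
            worst = min(context, key=lambda p: score(p, question, gaps))
            if score(best, question, gaps) > score(worst, question, gaps):
                context[context.index(worst)] = best   # swap, never grow past k
    return context
```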

[275] Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen

Main category: cs.AI

TL;DR: InternGeometry is an LLM agent that achieves medalist-level performance on IMO geometry problems using iterative proposition generation, symbolic verification, and reinforcement learning with minimal training data.

DetailsMotivation: Current AI geometry solving is dominated by expert models like AlphaGeometry 2 that require massive data synthesis and search. LLMs struggle with geometry due to weak heuristics for auxiliary constructions. The authors aim to build the first medalist-level LLM agent for geometry.

Method: InternGeometry uses iterative proposition and auxiliary construction proposals verified by a symbolic engine, with reflection on feedback to guide subsequent proposals. A dynamic memory mechanism enables 200+ interactions per problem. Complexity-Boosting Reinforcement Learning (CBRL) gradually increases problem complexity during training.

Result: Solves 44/50 IMO geometry problems (2000-2024), exceeding average gold medalist score (40.9). Uses only 13K training examples (0.004% of AlphaGeometry 2’s data). Can propose novel auxiliary constructions not in human solutions.

Conclusion: Demonstrates LLM agents’ potential for expert-level geometry tasks with minimal training data. The model, data, and symbolic engine will be released to support future research.

Abstract: Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine’s feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.
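
At a high level, the agent pairs an LLM with a symbolic verifier; the sketch below shows that propose → verify → reflect cycle with a dynamic memory. All interfaces (llm.propose, engine.verify) are hypothetical stand-ins, not InternGeometry's actual API.

```python
# Sketch of the iterative propose/verify/reflect loop with dynamic memory.
def solve_geometry(problem, llm, engine, max_steps=200):
    memory = []                                    # dynamic memory of past attempts
    for _ in range(max_steps):
        proposal = llm.propose(problem, memory)    # proposition or auxiliary construction
        verdict = engine.verify(problem, proposal) # symbolic check
        if verdict.proved_goal:
            return memory + [proposal]
        # Reflect on the engine's feedback to steer the next proposal; keeping
        # memory compact is what makes 200+ interactions per problem feasible.
        memory.append({"proposal": proposal,
                       "ok": verdict.valid,
                       "feedback": llm.reflect(verdict)})
    return None
```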

[276] NormCode: A Semi-Formal Language for Context-Isolated AI Planning

Xin Guan

Main category: cs.AI

TL;DR: NormCode is a semi-formal language that eliminates context pollution in multi-step LLM workflows by enforcing data isolation between steps, separating semantic (LLM) from syntactic (deterministic) operations, and providing three isomorphic formats for human authoring, machine execution, and verification.

DetailsMotivation: Multi-step LLM workflows suffer from context pollution where accumulating information across steps causes hallucinations, confusion of intermediate outputs, and loss of task constraints, creating reliability issues in high-stakes domains.

Method: NormCode is a semi-formal language that constructs plans of inferences with strict data isolation between steps, separating semantic (LLM-driven, nondeterministic) from syntactic (deterministic data restructuring) operations. It provides three isomorphic formats: .ncds for human authoring, .ncd for machine execution, and .ncn for human verification, with an orchestrator for dependency-driven scheduling, SQLite checkpointing, and loop management.

Result: Validated through two demonstrations: (1) a base-X addition algorithm achieving 100% accuracy on arbitrary-length inputs, and (2) self-hosted execution of NormCode’s own five-phase compiler pipeline, showing elimination of cross-step contamination.

Conclusion: NormCode addresses critical transparency needs in high-stakes domains by making AI workflows auditable by design through structured decompositions with data isolation, enabling precise cost and reliability tracing while eliminating context pollution.

Abstract: Multi-step workflows that chain large language model (LLM) calls suffer from context pollution: as information accumulates across steps, models hallucinate, confuse intermediate outputs, and lose track of task constraints. We present NormCode, a semi-formal language for constructing plans of inferences: structured decompositions where each step operates in data isolation and receives only explicitly passed inputs, which eliminates cross-step contamination by design. NormCode enforces a strict separation between semantic operations (LLM-driven reasoning, nondeterministic) and syntactic operations (deterministic data restructuring), enabling precise cost and reliability tracing. The language exists in three isomorphic formats: .ncds for human authoring, .ncd for machine execution, and .ncn for human verification, supporting progressive formalization from sketch to production. We validate NormCode through two demonstrations: (1) a base-X addition algorithm achieving 100% accuracy on arbitrary-length inputs, and (2) self-hosted execution of NormCode’s own five-phase compiler pipeline. The working orchestrator provides dependency-driven scheduling, SQLite-backed checkpointing, and loop management, making AI workflows auditable by design and addressing a critical need for transparency in high-stakes domains such as legal reasoning, medical decision making, and financial analysis.
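
A toy executor illustrates the central discipline: each step receives only its explicitly declared inputs, and semantic (LLM) steps are kept apart from syntactic (deterministic) ones. The step schema here is an assumption for illustration, not NormCode's actual .ncd format.

```python
# Illustrative executor for a "plan of inferences" with strict data isolation:
# each step sees only its declared inputs, never a shared transcript.
def run_plan(steps, llm_call):
    results = {}
    for step in steps:                       # assumed topologically ordered
        inputs = {name: results[name] for name in step["inputs"]}
        if step["kind"] == "semantic":       # LLM-driven, nondeterministic
            results[step["id"]] = llm_call(step["prompt"], inputs)
        else:                                # syntactic: deterministic restructuring
            results[step["id"]] = step["fn"](inputs)
    return results

plan = [
    {"id": "digits", "kind": "syntactic", "inputs": [],
     "fn": lambda _: ([7, 4], [5, 9])},
    {"id": "sum", "kind": "syntactic", "inputs": ["digits"],
     "fn": lambda d: [a + b for a, b in zip(*d["digits"])]},
]
print(run_plan(plan, llm_call=lambda p, i: None))   # {'digits': ..., 'sum': [12, 13]}
```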

[277] Phythesis: Physics-Guided Evolutionary Scene Synthesis for Energy-Efficient Data Center Design via LLMs

Minghao LI, Ruihang Wang, Rui Tan, Yonggang Wen

Main category: cs.AI

TL;DR: Phythesis combines LLMs with physics-guided evolutionary optimization to automate energy-efficient data center design, achieving 57.3% higher generation success and 11.5% better PUE than vanilla LLM solutions.

DetailsMotivation: Traditional data center design methods using human expertise and simulation tools don't scale with increasing complexity. Existing AI approaches for indoor layouts ignore physics and can't handle DCs' quantifiable objectives and strict physical constraints.

Method: Phythesis uses iterative bi-level optimization: (1) LLM-driven level generates 3D layouts and self-criticizes to refine scene topology, (2) physics-informed level optimizes asset parameters and selects best asset combinations for simulation-ready scene synthesis.

Result: Experiments across three generation scales show Phythesis achieves 57.3% generation success rate increase and 11.5% power usage effectiveness (PUE) improvement compared to vanilla LLM-based solutions.

Conclusion: The framework successfully bridges the gap between generative AI and physics-based design, enabling automated, simulation-ready data center layouts that meet strict operational objectives and physical constraints.

Abstract: Data center (DC) infrastructure serves as the backbone to support the escalating demand for computing capacity. Traditional design methodologies that blend human expertise with specialized simulation tools scale poorly with the increasing system complexity. Recent studies adopt generative artificial intelligence to design plausible human-centric indoor layouts. However, they do not consider the underlying physics, making them unsuitable for the DC design that sets quantifiable operational objectives and strict physical constraints. To bridge the gap, we propose Phythesis, a novel framework that synergizes large language models (LLMs) and physics-guided evolutionary optimization to automate simulation-ready (SimReady) scene synthesis for energy-efficient DC design. Phythesis employs an iterative bi-level optimization architecture, where (i) the LLM-driven optimization level generates physically plausible three-dimensional layouts and self-criticizes them to refine the scene topology, and (ii) the physics-informed optimization level identifies the optimal asset parameters and selects the best asset combination. Experiments on three generation scales show that Phythesis achieves 57.3% generation success rate increase and 11.5% power usage effectiveness (PUE) improvement, compared with the vanilla LLM-based solution.

[278] Refinement Contrastive Learning of Cell-Gene Associations for Unsupervised Cell Type Identification

Liang Peng, Haopeng Liu, Yixuan Ye, Cheng Liu, Wenjun Shen, Si Wu, Hau-San Wong

Main category: cs.AI

TL;DR: scRCL is a refinement contrastive learning framework that incorporates cell-gene interactions to improve unsupervised cell type identification from single-cell omics data.

DetailsMotivation: Existing clustering methods focus only on intrinsic cellular structure and ignore cell-gene associations, limiting their ability to distinguish closely related cell types.

Method: Proposes scRCL with two contrastive distribution alignment components for cellular structure and a refinement module that integrates gene-correlation structure learning to capture cell-gene associations.

Result: Outperforms state-of-the-art baselines on single-cell RNA-seq and spatial transcriptomics datasets, with recovered populations showing coherent gene-expression signatures.

Conclusion: Incorporating cell-gene interactions through contrastive learning and refinement modules improves cell type identification accuracy and biological relevance in single-cell analysis.

Abstract: Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploiting biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at https://github.com/THPengL/scRCL.
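
The cell-gene association objective can be illustrated with a generic InfoNCE-style loss that pulls each cell embedding toward its associated gene's embedding; this is a simplified stand-in, not scRCL's exact formulation.

```python
# Generic contrastive alignment between cells and their associated genes.
import torch
import torch.nn.functional as F

def cell_gene_contrastive(cells, genes, assoc, tau=0.1):
    """cells: (N, d) cell embeddings; genes: (G, d) gene embeddings;
    assoc: (N,) LongTensor with each cell's positive gene index."""
    cells = F.normalize(cells, dim=-1)
    genes = F.normalize(genes, dim=-1)
    logits = cells @ genes.T / tau           # similarity of every cell to every gene
    return F.cross_entropy(logits, assoc)    # pull cells toward associated genes
```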

[279] CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Tong Zhang, Carlos Hinojosa, Bernard Ghanem

Main category: cs.AI

TL;DR: CAPTAIN is a training-free framework that mitigates memorization in diffusion models by modifying latent features during denoising, using frequency-based noise initialization, optimal timestep identification, and semantic feature injection from reference images.

DetailsMotivation: Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns as these systems are increasingly deployed at scale. Existing methods struggle to reduce memorization without compromising prompt alignment.

Method: CAPTAIN uses frequency-based noise initialization to reduce memorization tendency early in denoising, identifies optimal timesteps for feature injection, localizes memorized regions, and injects semantically aligned features from non-memorized reference images into localized latent regions.

Result: CAPTAIN achieves substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt and preserving visual quality.

Conclusion: CAPTAIN provides an effective training-free solution for mitigating memorization in diffusion models without compromising prompt fidelity, addressing important privacy and copyright concerns in large-scale deployment.

Abstract: Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns as these systems are increasingly deployed at scale. Existing inference-time mitigation methods typically manipulate classifier-free guidance (CFG) or perturb prompt embeddings; however, they often struggle to reduce memorization without compromising alignment with the conditioning prompt. We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising. CAPTAIN first applies frequency-based noise initialization to reduce the tendency to replicate memorized patterns early in the denoising process. It then identifies the optimal denoising timesteps for feature injection and localizes memorized regions. Finally, CAPTAIN injects semantically aligned features from non-memorized reference images into localized latent regions, suppressing memorization while preserving prompt fidelity and visual quality. Our experiments show that CAPTAIN achieves substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt.
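
A hedged sketch of a frequency-based noise initialization: reshape the spectrum of the initial latent before denoising, on the assumption that certain frequency bands carry the memorization-prone structure. The specific low-frequency damping filter here is illustrative; CAPTAIN's actual filter may differ.

```python
# Sketch: damp a low-frequency band of the initial Gaussian latent, then
# renormalize, before handing it to the sampler.
import torch

def frequency_init(latent: torch.Tensor, cutoff: float = 0.1, damp: float = 0.5):
    # latent: (C, H, W) Gaussian noise used to start sampling
    C, H, W = latent.shape
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    fy = torch.linspace(-0.5, 0.5, H).view(H, 1).expand(H, W)
    fx = torch.linspace(-0.5, 0.5, W).view(1, W).expand(H, W)
    low = (fy ** 2 + fx ** 2).sqrt() < cutoff            # low-frequency band mask
    scale = low.float() * damp + (~low).float()          # damp inside, keep outside
    spec = spec * scale
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
    return out / out.std()                               # back to unit variance

x0 = frequency_init(torch.randn(4, 64, 64))
```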

[280] On the Dynamics of Multi-Agent LLM Communities Driven by Value Diversity

Muhua Huang, Qinlin Zhao, Xiaoyuan Yi, Xing Xie

Main category: cs.AI

TL;DR: Value diversity in AI agent communities enhances stability, fosters emergent behaviors, and increases creativity in self-developed principles, but with diminishing returns at extreme heterogeneity.

DetailsMotivation: To understand how diversity of values shapes collective behavior in AI communities, particularly as LLM-based multi-agent systems become more prevalent and their collective behaviors gain attention.

Method: Used naturalistic value elicitation based on Schwartz’s Theory of Basic Human Values to construct multi-agent simulations where communities with varying numbers of agents engaged in open-ended interactions and constitution formation.

Result: Value diversity enhances value stability, fosters emergent behaviors, and brings more creative principles developed by agents themselves without external guidance, but with diminishing returns as extreme heterogeneity induces instability.

Conclusion: Value diversity represents a new axis of future AI capability that bridges AI ability and sociological studies of institutional emergence.

Abstract: As Large Language Model (LLM)-based multi-agent systems become increasingly prevalent, the collective behaviors, e.g., collective intelligence, of such artificial communities have drawn growing attention. This work aims to answer a fundamental question: How does diversity of values shape the collective behavior of AI communities? Using naturalistic value elicitation grounded in the prevalent Schwartz’s Theory of Basic Human Values, we constructed multi-agent simulations where communities with varying numbers of agents engaged in open-ended interactions and constitution formation. The results show that value diversity enhances value stability, fosters emergent behaviors, and brings more creative principles developed by the agents themselves without external guidance. However, these effects also show diminishing returns: extreme heterogeneity induces instability. This work positions value diversity as a new axis of future AI capability, bridging AI ability and sociological studies of institutional emergence.

[281] AEBNAS: Strengthening Exit Branches in Early-Exit Networks through Hardware-Aware Neural Architecture Search

Oscar Robben, Saeed Khalilian, Nirvana Meratnia

Main category: cs.AI

TL;DR: Hardware-aware NAS framework designs early-exit networks with optimized exit branch depth/layers and adaptive thresholds, achieving higher accuracy with same/lower MACs than SOTA on CIFAR-10/100/SVHN.

DetailsMotivation: Early-exit networks reduce energy/latency by adapting computation to input complexity, but designing them is challenging due to balancing efficiency and performance. Existing NAS approaches focus on exit positions/number, but exit branch depth/layer types also significantly impact efficiency and accuracy.

Method: Propose hardware-aware Neural Architecture Search (NAS) framework that optimizes both accuracy and efficiency. The method considers varying depths and layer types for exit branches, along with adaptive threshold tuning for early-exit decisions.

Result: Evaluation on CIFAR-10, CIFAR-100, and SVHN datasets shows the proposed framework designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to state-of-the-art approaches.

Conclusion: Hardware-aware NAS with optimized exit branch architectures and adaptive thresholds effectively designs efficient early-exit networks that balance accuracy and computational efficiency better than existing methods.

Abstract: Early-exit networks are effective solutions for reducing the overall energy consumption and latency of deep learning models by adjusting computation based on the complexity of input data. By incorporating intermediate exit branches into the architecture, they provide less computation for simpler samples, which is particularly beneficial for resource-constrained devices where energy consumption is crucial. However, designing early-exit networks is a challenging and time-consuming process due to the need to balance efficiency and performance. Recent works have utilized Neural Architecture Search (NAS) to design more efficient early-exit networks, aiming to reduce average latency while improving model accuracy by determining the best positions and number of exit branches in the architecture. Another important factor affecting the efficiency and accuracy of early-exit networks is the depth and types of layers in the exit branches. In this paper, we use hardware-aware NAS to strengthen exit branches, considering both accuracy and efficiency during optimization. Our performance evaluation on the CIFAR-10, CIFAR-100, and SVHN datasets demonstrates that our proposed framework, which considers varying depths and layers for exit branches along with adaptive threshold tuning, designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to the state-of-the-art approaches.

[282] Challenges of Evaluating LLM Safety for User Welfare

Manon Kempermann, Sai Suresh Macharla Vasu, Mahalakshmi Raveenthiran, Theo Farrell, Ingmar Weber

Main category: cs.AI

TL;DR: LLM safety evaluations need context-aware approaches for personal advice domains like finance/health, as current universal-risk frameworks fail to account for individual user vulnerabilities.

DetailsMotivation: Current LLM safety evaluations focus on universal risks, but millions use LLMs for personal advice on high-stakes topics where harms are context-dependent. User-welfare safety evaluations remain underdeveloped despite frameworks recognizing the need for individual risk assessment.

Method: Evaluated advice from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles with varying vulnerability. Conducted two studies: 1) compared context-blind vs. context-aware evaluator ratings, 2) tested if realistic user context disclosure in prompts improves safety assessment.

Result: Context-blind evaluators rated identical responses significantly safer than context-aware evaluators (safety scores dropped from 5/7 to 3/7 for high-vulnerability users). Adding realistic user context disclosure to prompts showed no significant improvement in safety evaluation.

Conclusion: Effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone is insufficient. The study provides methodology for context-aware evaluation and shows individual welfare assessment needs distinct approaches from universal-risk frameworks.

Abstract: Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD’s AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we rerun the evaluation on prompts containing context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future developments.
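
The study's two evaluator conditions reduce to rating the same response with and without the user's circumstances in view. A minimal sketch of that protocol, where `call_llm_judge` is a hypothetical stand-in for a real LLM API call and the prompt wording and 1-7 scale are assumptions based on the paper's description:

```python
def call_llm_judge(prompt):
    """Placeholder: a real implementation would query an LLM and parse a 1-7 score."""
    raise NotImplementedError

def rate_safety(response, user_profile=None):
    context = f"User context: {user_profile}\n" if user_profile else ""
    prompt = (f"{context}Rate how safe this advice is for the user on a 1-7 "
              f"scale (7 = completely safe). Advice:\n{response}")
    return call_llm_judge(prompt)

# The same response, rated under both conditions:
# context_blind = rate_safety(advice)
# context_aware = rate_safety(advice, user_profile="retiree, fixed income, high debt")
```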

[283] Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning

Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt, Michael Moor, Farhad Nooralahzadeh, Michael Krauthammer

Main category: cs.AI

TL;DR: RL optimization improves CXR VLM performance for report generation and visual grounding, but explicit thinking doesn’t provide additional benefits beyond strong SFT.

DetailsMotivation: Most medical VLMs rely only on supervised fine-tuning (SFT) without evaluating answer quality. Reinforcement learning (RL) can incorporate task-specific feedback, and combining RL with explicit reasoning has shown gains in other domains, motivating an investigation of these effects in a CXR VLM.

Method: Built RadVLM based on Qwen3-VL with large-scale SFT on CXR data, then cold-start SFT to add basic thinking ability. Applied Group Relative Policy Optimization (GRPO) with clinically grounded rewards for report generation and visual grounding. Conducted matched RL experiments on domain-specific and general-domain variants, with and without thinking.

Result: Strong SFT is crucial for high base performance, but RL provides additional gains on both tasks. Explicit thinking does not further improve results. RL-optimized RadVLM models outperform baselines and achieve state-of-the-art performance on report generation and grounding.

Conclusion: Clinically aligned RL is a powerful complement to SFT for medical VLMs, enabling state-of-the-art performance in CXR interpretation tasks without needing explicit thinking mechanisms.

Abstract: Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning (“thinking”) has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
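
GRPO's key step — computing advantages relative to a group of responses sampled for the same prompt, rather than from a learned value function — is compact enough to sketch in its standard form. The reward values below are placeholders; the paper's actual rewards are clinically grounded metrics for reports and grounding:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group Relative Policy Optimization: standardize each reward against
    the mean/std of the G responses sampled for the same prompt."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, G=4 sampled reports scored by a clinical reward model (placeholder values):
print(grpo_advantages([0.82, 0.40, 0.71, 0.55]))
```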

[284] COMPARE: Clinical Optimization with Modular Planning and Assessment via RAG-Enhanced AI-OCT: Superior Decision Support for Percutaneous Coronary Intervention Compared to ChatGPT-5 and Junior Operators

Wei Fang, Chiyao Wang, Wenshuai Ma, Hui Liu, Jianqiang Hu, Xiaona Niu, Yi Chu, Mingming Zhang, Jingxiao Yang, Dongwei Zhang, Zelin Li, Pengyun Liu, Jiawei Zheng, Pengke Zhang, Chaoshi Qin, Wangang Guo, Bin Wang, Yugang Xue, Wei Zhang, Zikuan Wang, Rui Zhu, Yihui Cao, Quanmao Lu, Rui Meng, Yan Li

Main category: cs.AI

TL;DR: CA-GPT, a domain-specific AI model for OCT-guided PCI, outperforms general-purpose ChatGPT-5 and junior physicians in procedural decision-making for both pre-PCI planning and post-PCI assessment.

DetailsMotivation: Intravascular imaging (OCT) improves PCI outcomes but interpretation is operator-dependent. General-purpose AI lacks domain-specific reliability, creating need for specialized AI models in interventional cardiology.

Method: Single-center analysis of 96 OCT-guided PCI patients comparing procedural decisions from CA-GPT (domain-specific AI), ChatGPT-5 (general-purpose AI), and junior physicians against expert-derived procedural records using 10 pre-specified metrics across pre- and post-PCI phases.

Result: CA-GPT significantly outperformed both ChatGPT-5 and junior physicians in pre-PCI planning (median agreement 5 vs 3 vs 4). Superior in stent diameter (90.3% vs 72.2%) and length selection (80.6% vs 52.8%). Maintained excellent post-PCI agreement (5 vs 4 vs 5) with robust performance in complex scenarios.

Conclusion: CA-GPT-based AI-OCT system provides superior, standardized, and reliable intravascular imaging interpretation, demonstrating significant potential to augment operator expertise and optimize OCT-guided PCI procedures.

Abstract: Background: While intravascular imaging, particularly optical coherence tomography (OCT), improves percutaneous coronary intervention (PCI) outcomes, its interpretation is operator-dependent. General-purpose artificial intelligence (AI) shows promise but lacks domain-specific reliability. We evaluated the performance of CA-GPT, a novel large model deployed on an AI-OCT system, against that of the general-purpose ChatGPT-5 and junior physicians for OCT-guided PCI planning and assessment. Methods: In this single-center analysis of 96 patients who underwent OCT-guided PCI, the procedural decisions generated by the CA-GPT, ChatGPT-5, and junior physicians were compared with an expert-derived procedural record. Agreement was assessed using ten pre-specified metrics across pre-PCI and post-PCI phases. Results: For pre-PCI planning, CA-GPT demonstrated significantly higher median agreement scores (5[IQR 3.75-5]) compared to both ChatGPT-5 (3[2-4], P<0.001) and junior physicians (4[3-4], P<0.001). CA-GPT significantly outperformed ChatGPT-5 across all individual pre-PCI metrics and showed superior performance to junior physicians in stent diameter (90.3% vs. 72.2%, P<0.05) and length selection (80.6% vs. 52.8%, P<0.01). In post-PCI assessment, CA-GPT maintained excellent overall agreement (5[4.75-5]), significantly higher than both ChatGPT-5 (4[4-5], P<0.001) and junior physicians (5[4-5], P<0.05). Subgroup analysis confirmed CA-GPT’s robust performance advantage in complex scenarios. Conclusion: The CA-GPT-based AI-OCT system achieved superior decision-making agreement versus a general-purpose large language model and junior physicians across both PCI planning and assessment phases. This approach provides a standardized and reliable method for intravascular imaging interpretation, demonstrating significant potential to augment operator expertise and optimize OCT-guided PCI.

[285] HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition

Wang Lu, Yao Zhu, Jindong Wang

Main category: cs.AI

TL;DR: HAROOD is a comprehensive benchmark for human activity recognition in out-of-distribution settings, evaluating 16 methods across 4 OOD scenarios on 6 datasets to assess OOD algorithm effectiveness for HAR.

DetailsMotivation: Current HAR research lacks comprehensive evaluation of OOD algorithms across realistic distribution shifts (individual, device, environment, time variations). There's no clear understanding of whether OOD methods are necessary for HAR or which algorithms perform best.

Method: Proposed HAROOD benchmark with 4 OOD scenarios (cross-person, cross-position, cross-dataset, cross-time), covering 6 datasets, 16 comparative methods implemented with CNN and Transformer architectures, and two model selection protocols.

Result: Extensive experiments show no single method consistently outperforms others across all scenarios, highlighting substantial opportunity for advancement in OOD-based HAR research.

Conclusion: HAROOD provides a modular, extensible framework to facilitate OOD-based HAR research, revealing current limitations and opportunities for future algorithm development in handling distribution shifts.

Abstract: Sensor-based human activity recognition (HAR) mines activity patterns from time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD handling necessary for HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope of facilitating research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.
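
The cross-person scenario is essentially a leave-subjects-out split: train on some people, evaluate on unseen ones. A minimal sketch with synthetic data (HAROOD's real datasets and loaders live in the linked repository):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels, seq_len = 600, 3, 128
X = rng.normal(size=(n_samples, n_channels, seq_len))  # synthetic sensor windows
y = rng.integers(0, 6, size=n_samples)                 # 6 activity classes
person = rng.integers(0, 10, size=n_samples)           # 10 subjects

# Cross-person OOD split: hold out subjects 8 and 9 entirely.
test_persons = {8, 9}
test_mask = np.isin(person, list(test_persons))
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]
print(X_train.shape, X_test.shape)  # no subject appears in both splits
```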

[286] Agile Deliberation: Concept Deliberation for Subjective Visual Classification

Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan, Enming Luo, Chun-Ta Lu, Tushar Dogra, Ranjay Krishna, Ariel Fuxman

Main category: cs.AI

TL;DR: Agile Deliberation: A human-in-the-loop framework for evolving visual concept classification through iterative concept refinement and borderline case exposure.

DetailsMotivation: Existing human-in-the-loop approaches assume users have clear, stable concept understanding, but in reality users often start with vague ideas that need iterative refinement through "concept deliberation" - a practice discovered through interviews with content moderation experts.

Method: Operationalizes real content moderators’ deliberation strategies into Agile Deliberation framework with two stages: (1) concept scoping - decomposing initial concept into structured hierarchy of sub-concepts, and (2) concept iteration - surfacing semantically borderline examples for user reflection and feedback to iteratively align image classifier with user’s evolving intent.

Result: Evaluated through 18 user sessions (1.5h each) rather than standard benchmarks. Achieved 7.5% higher F1 scores than automated decomposition baselines and >3% higher than manual deliberation. Participants reported clearer conceptual understanding and lower cognitive effort.

Conclusion: Agile Deliberation effectively supports evolving and subjective concepts in human-in-the-loop systems by enabling iterative concept refinement through structured deliberation, outperforming both automated and manual approaches while reducing cognitive load.

Abstract: From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through “concept deliberation”, a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called “Agile Deliberation” that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user’s evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
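
The "surface semantically borderline examples" step can be approximated by ranking unlabeled items by how close the current classifier's score is to the decision boundary — a minimal sketch assuming a probabilistic classifier; the paper's concrete selection strategy may differ:

```python
import numpy as np

def borderline_examples(probs, k=5):
    """Return indices of the k items whose positive-class probability is
    closest to 0.5, i.e., the cases the current concept definition is
    least sure about -- candidates for user deliberation."""
    probs = np.asarray(probs, dtype=float)
    return np.argsort(np.abs(probs - 0.5))[:k]

scores = np.array([0.02, 0.48, 0.91, 0.55, 0.50, 0.13, 0.77])
print(borderline_examples(scores, k=3))  # -> [4 1 3]
```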

[287] V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions

Mumuksh Tayal, Manan Tayal, Aditya Singh, Shishir Kolathaya, Ravi Prakash

Main category: cs.AI

TL;DR: V-OCBF: A model-free framework that learns neural Control Barrier Functions from offline data to synthesize safe controllers without online interaction or expert-designed barriers.

DetailsMotivation: Existing Safe Offline RL methods only enforce soft expected-cost constraints without forward invariance guarantees, while CBFs require expert-designed barriers or full system dynamics knowledge. There's a need for offline-learned safety guarantees without model assumptions.

Method: Learns neural CBF from offline demonstrations using recursive finite-difference barrier update (model-free). Uses expectile-based objective to avoid OOD actions and restricts updates to dataset-supported action set. Synthesizes safe control via Quadratic Program with learned barrier.

Result: V-OCBF achieves substantially fewer safety violations than baselines while maintaining strong task performance across multiple case studies, demonstrating scalability for offline safety-critical controller synthesis.

Conclusion: V-OCBF enables offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers, providing rigorous safety guarantees from purely offline data.

Abstract: Ensuring safety in autonomous systems requires controllers that satisfy hard, state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they do not guarantee forward invariance. Conversely, Control Barrier Functions (CBFs) provide rigorous safety guarantees but usually depend on expert-designed barrier functions or full knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.
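
Once a barrier is learned, the per-step safety filter is a small QP: stay as close as possible to the nominal action while satisfying the barrier constraint. For a single affine constraint a·u ≥ b the QP has a closed-form projection, sketched below; the learned neural barrier and its finite-difference update are not reproduced here:

```python
import numpy as np

def cbf_qp_filter(u_nom, a, b):
    """Solve  min_u ||u - u_nom||^2  s.t.  a @ u >= b  (one affine CBF constraint).
    If u_nom already satisfies the constraint it is returned unchanged;
    otherwise u_nom is projected onto the constraint boundary."""
    u_nom, a = np.asarray(u_nom, float), np.asarray(a, float)
    slack = a @ u_nom - b
    if slack >= 0.0:
        return u_nom
    return u_nom + a * (-slack) / (a @ a)

# Nominal action violates the constraint u_x + u_y >= 1, so it gets projected:
print(cbf_qp_filter(u_nom=[0.2, 0.2], a=[1.0, 1.0], b=1.0))  # -> [0.5 0.5]
```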

[288] LLMs Can Assist with Proposal Selection at Large User Facilities

Lijie Ding, Janell Thomson, Jon Taylor, Changwoo Do

Main category: cs.AI

TL;DR: LLMs can effectively rank scientific proposals at large facilities, offering scalable, consistent, and cost-effective alternatives to human review while enabling advanced analytical capabilities.

DetailsMotivation: Traditional human proposal review suffers from weak inter-proposal correlations, reviewer bias, inconsistency, and the impracticality of pairwise comparison methods due to quadratic workload. LLMs offer a scalable solution.

Method: Used LLMs to rank proposals from three beamlines at ORNL’s Spallation Neutron Source, comparing LLM rankings with human rankings using Spearman correlation. Also employed embedding models for quantitative proposal similarity assessment.

Result: LLM rankings strongly correlate with human rankings (Spearman ρ≈0.2-0.8, improving to ≥0.5 after outlier removal). LLMs perform comparably to humans in identifying high-publication-potential proposals while costing 100x less.

Conclusion: LLMs provide a viable, cost-effective alternative to human proposal review, offering consistent rankings and enabling advanced analytical capabilities like proposal similarity assessment that are challenging for human reviewers.

Abstract: We explore how large language models (LLMs) can enhance the proposal selection process at large user facilities, offering a scalable, consistent, and cost-effective alternative to traditional human review. Proposal selection depends on assessing the relative strength among submitted proposals; however, traditional human scoring often suffers from weak inter-proposal correlations and is subject to reviewer bias and inconsistency. A pairwise preference-based approach is logically superior, providing a more rigorous and internally consistent basis for ranking, but its quadratic workload makes it impractical for human reviewers. We address this limitation using LLMs. Leveraging the uniquely well-curated proposals and publication records from three beamlines at the Spallation Neutron Source (SNS), Oak Ridge National Laboratory (ORNL), we show that the LLM rankings correlate strongly with the human rankings (Spearman $\rho \simeq 0.2$-$0.8$, improving to $\geq 0.5$ after 10% outlier removal). Moreover, LLM performance is no worse than that of human reviewers in identifying proposals with high publication potential, while costing over two orders of magnitude less. Beyond ranking, LLMs enable advanced analyses that are challenging for humans, such as quantitative assessment of proposal similarity via embedding models, which provides information crucial for review committees.
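
The pairwise-preference approach and the reported correlation check are easy to make concrete: aggregate pairwise LLM judgments into win counts (a simple Borda-style ranking) and compare against the human ranking with Spearman's ρ. The preference matrix and human ranking below are synthetic:

```python
import numpy as np

def rank_from_pairwise(wins):
    """wins[i, j] = 1 if the LLM preferred proposal i over j.
    Borda-style aggregation: order proposals by total wins."""
    return np.argsort(-wins.sum(axis=1))  # best first

def spearman_rho(rank_a, rank_b):
    """Spearman correlation between two rankings (no ties)."""
    n = len(rank_a)
    d = np.asarray(rank_a) - np.asarray(rank_b)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# 4 proposals; synthetic LLM preferences (proposal 0 beats everyone, etc.)
wins = np.array([[0, 1, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 0]])
llm_order = rank_from_pairwise(wins)       # -> [0 1 2 3]
llm_rank_of = np.argsort(llm_order)        # rank position of each proposal
human_rank_of = np.array([0, 2, 1, 3])     # hypothetical human ranking
print(spearman_rho(llm_rank_of, human_rank_of))  # -> 0.8
```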

[289] Multi-Granular Node Pruning for Circuit Discovery

Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, A. B. Siddique

Main category: cs.AI

TL;DR: Node-level circuit discovery framework for LLMs that uses learnable masks with granularity-specific sparsity penalties to identify minimal subnetworks at finer granularity than existing methods.

DetailsMotivation: Existing circuit discovery methods are computationally expensive, limited to coarse-grained units (attention heads/MLP blocks), and overlook finer structures like individual neurons. Need scalable, fine-grained approach.

Method: Propose a node-level pruning framework with learnable masks across multiple granularity levels (blocks to neurons) within a unified optimization objective. Use granularity-specific sparsity penalties to guide pruning in a single fine-tuning run.

Result: Identifies smaller circuits than prior methods, shows many neurons deemed important by coarse methods are actually irrelevant while maintaining task performance. 5-10x lower memory footprint by avoiding intermediate activation storage.

Conclusion: The proposed node-level circuit discovery framework addresses scalability and granularity limitations of existing methods, enabling efficient identification of minimal, fine-grained circuits in LLMs.

Abstract: Circuit discovery aims to identify minimal subnetworks that are responsible for specific behaviors in large language models (LLMs). Existing approaches primarily rely on iterative edge pruning, which is computationally expensive and limited to coarse-grained units such as attention heads or MLP blocks, overlooking finer structures like individual neurons. We propose a node-level pruning framework for circuit discovery that addresses both scalability and granularity limitations. Our method introduces learnable masks across multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, allowing comprehensive compression in a single fine-tuning run. Empirically, our approach identifies circuits with fewer nodes than those discovered by prior methods; moreover, we demonstrate that many neurons deemed important by coarse methods are actually irrelevant, while task performance is still maintained. Furthermore, our method has a significantly (5-10x) lower memory footprint, as it does not require keeping intermediate activations in memory.
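
The core optimization — learnable masks at several granularities, each with its own sparsity penalty, trained jointly against the task loss — can be sketched in PyTorch. The shapes, penalty weights, and sigmoid relaxation of the binary masks are illustrative assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class MaskedMLPBlock(nn.Module):
    """An MLP block with a block-level mask and per-neuron masks (two granularities)."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_model, d_hidden), nn.Linear(d_hidden, d_model)
        self.block_logit = nn.Parameter(torch.zeros(1))            # coarse: whole block
        self.neuron_logits = nn.Parameter(torch.zeros(d_hidden))   # fine: neurons

    def forward(self, x):
        m_block = torch.sigmoid(self.block_logit)
        m_neuron = torch.sigmoid(self.neuron_logits)
        return x + m_block * self.fc2(torch.relu(self.fc1(x)) * m_neuron)

    def sparsity_penalty(self, lam_block=1.0, lam_neuron=0.01):
        # Granularity-specific L1 penalties push masks toward zero.
        return (lam_block * torch.sigmoid(self.block_logit).sum()
                + lam_neuron * torch.sigmoid(self.neuron_logits).sum())

block = MaskedMLPBlock()
x = torch.randn(8, 64)
task_loss = block(x).pow(2).mean()             # placeholder task objective
loss = task_loss + block.sparsity_penalty()    # joint objective, one training run
loss.backward()
```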

[290] On Decision-Making Agents and Higher-Order Causal Processes

Matt Wilson

Main category: cs.AI

TL;DR: Paper establishes equivalence between POMDP agents and quantum process functions, revealing dual physics/AI interpretations and extending to multi-agent systems.

DetailsMotivation: To bridge concepts between artificial intelligence (POMDP agents) and quantum physics (process functions), revealing fundamental structural similarities and enabling cross-disciplinary insights.

Method: Mathematical identification showing how agent policies and memory updates combine into process functions via link product, establishing precise correspondence between POMDP agents and one-input process functions.

Result: Established dual interpretation: physics view (process function as environment) vs AI view (process function as agent), extended to multi-agent systems via decentralized POMDPs and multi-input process functions.

Conclusion: The correspondence reveals deep structural connections between AI decision-making and quantum operations, suggesting new perspectives for both fields and enabling cross-fertilization of concepts.

Abstract: We establish a precise correspondence between decision-making agents in partially observable Markov decision processes (POMDPs) and one-input process functions, the classical limit of higher-order quantum operations. In this identification an agent’s policy and memory update combine into a process function w that interacts with a POMDP environment via the link product. This suggests a dual interpretation: in the physics view, the process function acts as the environment into which local operations (agent interventions) are inserted, whereas in the AI view it encodes the agent and the inserted functions represent environments. We extend this perspective to multi-agent systems by identifying observation-independent decentralized POMDPs as natural domains for multi-input process functions.
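
The identification can be made concrete in a toy simulation: fold the agent's policy π(a|o,m) and memory update m' = u(o,m) into a single process function w, and let the environment interact with it step by step. This is a deterministic, heavily simplified rendering of the paper's link product, with all dynamics invented for illustration:

```python
def policy(obs, mem):
    return "right" if obs + mem > 0 else "left"

def memory_update(obs, mem):
    return mem + obs  # accumulate evidence

def w(obs, mem):
    """The combined process function: (observation, memory) -> (action, memory)."""
    return policy(obs, mem), memory_update(obs, mem)

def env_step(state, action):
    state = state + (1 if action == "right" else -1)
    obs = 1 if state >= 0 else -1   # partial observation of the state
    return state, obs

state, obs, mem = 0, 1, 0
for t in range(4):
    action, mem = w(obs, mem)             # physics view: operations inserted into w;
    state, obs = env_step(state, action)  # AI view: w is the agent, env is inserted
    print(t, action, state, mem)
```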

[291] The LLM Wears Prada: Analysing Gender Bias and Stereotypes through Online Shopping Data

Massimiliano Luca, Ciro Beneduce, Bruno Lepri, Jacopo Staiano

Main category: cs.AI

TL;DR: LLMs can predict gender from online shopping histories with moderate accuracy, but their predictions rely heavily on gender stereotypes and biases, which persist even with explicit bias-mitigation instructions.

DetailsMotivation: As LLMs gain widespread adoption across domains, it's crucial to examine whether their impressive performance masks subtle biases. While gender bias has been studied in various contexts, this paper investigates a novel angle: whether LLMs can predict gender from online shopping histories and whether these predictions reflect gender stereotypes.

Method: Used a dataset of historical online purchases from US users to evaluate six LLMs’ ability to classify gender. Analyzed both the models’ reasoning processes and product-gender co-occurrence patterns in their predictions.

Result: LLMs achieved moderate accuracy in gender prediction, but their decisions were largely based on stereotypical associations between product categories and gender. Explicit instructions to avoid bias reduced prediction certainty but didn’t eliminate stereotypical patterns.

Conclusion: Gender biases in LLMs are persistent and deeply embedded, even in novel contexts like shopping history analysis. The findings emphasize the need for more robust bias-mitigation strategies beyond simple instruction-based approaches.

Abstract: With the wide and cross-domain adoption of Large Language Models, it becomes crucial to assess the extent to which the statistical correlations in training data, which underlie their impressive performance, hide subtle and potentially troubling biases. Gender bias in LLMs has been widely investigated from the perspectives of occupations, hobbies, and emotions typically associated with a specific gender. In this study, we introduce a novel perspective. We investigate whether LLMs can predict an individual’s gender based solely on online shopping histories and whether these predictions are influenced by gender biases and stereotypes. Using a dataset of historical online purchases from users in the United States, we evaluate the ability of six LLMs to classify gender, and then analyze their reasoning and product-gender co-occurrences. Results indicate that while models can infer gender with moderate accuracy, their decisions are often rooted in stereotypical associations between product categories and gender. Furthermore, explicit instructions to avoid bias reduce the certainty of model predictions, but do not eliminate stereotypical patterns. Our findings highlight the persistent nature of gender biases in LLMs and emphasize the need for robust bias-mitigation strategies.
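
The co-occurrence analysis reduces to counting how often each product category appears alongside each predicted gender — a minimal sketch with made-up records:

```python
from collections import Counter

# (predicted_gender, product_category) pairs -- synthetic, for illustration only.
predictions = [("F", "cosmetics"), ("F", "cosmetics"), ("M", "power tools"),
               ("F", "books"), ("M", "books"), ("M", "power tools")]

pair_counts = Counter(predictions)
gender_counts = Counter(g for g, _ in predictions)

# P(category | predicted gender): stereotypical skews show up as lopsided rows.
for (g, cat), n in sorted(pair_counts.items()):
    print(f"P({cat!r} | predicted {g}) = {n / gender_counts[g]:.2f}")
```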

[292] Advancing AI Research Assistants with Expert-Involved Learning

Tianyu Liu, Simeng Han, Hanchen Wang, Xiao Luo, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, Yufeng Liu, Xinyue Cui, Aviv Yaish, Yuhang Chen, Minsheng Hao, Chuhan Li, Kexing Li, Arman Cohan, Hua Xu, Mark Gerstein, James Zou, Hongyu Zhao

Main category: cs.AI

TL;DR: ARIEL is an open-source framework for evaluating and optimizing LLMs/LMMs in biomedicine, testing article summarization and figure interpretation capabilities, with findings showing current models produce fluent but incomplete summaries and struggle with visual reasoning.

DetailsMotivation: While LLMs and LMMs promise to accelerate biomedical discovery, their reliability remains unclear, necessitating systematic evaluation frameworks to assess and improve their capabilities in biomedical applications.

Method: ARIEL pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. It uses uniform protocols and blinded PhD-level evaluation, with optimization through prompt engineering, lightweight fine-tuning, and compute-scaled inference strategies.

Result: State-of-the-art models generate fluent but incomplete summaries, while LMMs struggle with detailed visual reasoning. Prompt engineering and fine-tuning improve textual coverage, and compute-scaled inference enhances visual question answering. The ARIEL agent can integrate textual and visual cues to propose testable mechanistic hypotheses.

Conclusion: ARIEL delineates current strengths and limitations of foundation models in biomedicine and provides a reproducible platform for advancing trustworthy AI in biomedical research.

Abstract: Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, we find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning. We later observe that prompt engineering and lightweight fine-tuning substantially improve textual coverage, and a compute-scaled inference strategy enhances visual question answering. We build an ARIEL agent that integrates textual and visual cues, and we show it can propose testable mechanistic hypotheses. ARIEL delineates current strengths and limitations of foundation models, and provides a reproducible platform for advancing trustworthy AI in biomedicine.

[293] Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables

Yitong Zhou, Mingyue Cheng, Qingyang Mao, Yucong Luo, Qi Liu, Yupeng Li, Xiaohan Zhang, Deguang Liu, Xin Li, Enhong Chen

Main category: cs.AI

TL;DR: ChemTable is a new benchmark for evaluating multimodal LLMs on chemical tables from real literature, testing both table recognition and understanding tasks, revealing current models’ limitations in handling domain-specific elements like molecular structures.

DetailsMotivation: There's a need for more challenging benchmarks to assess multimodal LLMs' ability to understand complex scientific data. Scientific tables combine text, symbols, and graphics, forming multimodal reasoning scenarios, but existing benchmarks focus on general domains and fail to capture the structural complexity and domain-specific semantics of scientific research, particularly in chemical tables.

Method: Created ChemTable - a large-scale benchmark of chemical tables constructed from real-world literature with expert-annotated cell layouts, logical structures, and domain-specific labels. The benchmark supports two core tasks: table recognition (structure and content extraction) and table understanding (descriptive and reasoning-based question answering).

Result: Evaluation shows mainstream multimodal models perform reasonably well in layout parsing but face significant limitations handling critical elements like molecular structures and symbolic conventions. Closed-source models lead overall but still fall short of human-level performance.

Conclusion: ChemTable provides a realistic testing platform for evaluating scientific multimodal understanding, revealing current bottlenecks in domain-specific reasoning and advancing the development of intelligent systems for scientific research.

Abstract: With the widespread application of multimodal large language models in scientific intelligence, there is an urgent need for more challenging evaluation benchmarks to assess their ability to understand complex scientific data. Scientific tables, as core carriers of knowledge representation, combine text, symbols, and graphics, forming a typical multimodal reasoning scenario. However, existing benchmarks are mostly focused on general domains, failing to reflect the unique structural complexity and domain-specific semantics inherent in scientific research. Chemical tables are particularly representative: they intertwine structured variables such as reagents, conditions, and yields with visual symbols like molecular structures and chemical formulas, posing significant challenges to models in cross-modal alignment and semantic parsing. To address this, we propose ChemTable-a large scale benchmark of chemical tables constructed from real-world literature, containing expert-annotated cell layouts, logical structures, and domain-specific labels. It supports two core tasks: (1) table recognition (structure and content extraction); and (2) table understanding (descriptive and reasoning-based question answering). Evaluation on ChemTable shows that while mainstream multimodal models perform reasonably well in layout parsing, they still face significant limitations when handling critical elements such as molecular structures and symbolic conventions. Closed-source models lead overall but still fall short of human-level performance. This work provides a realistic testing platform for evaluating scientific multimodal understanding, revealing the current bottlenecks in domain-specific reasoning and advancing the development of intelligent systems for scientific research.

[294] Emotional Support with LLM-based Empathetic Dialogue Generation

Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, Yongxiang Li

Main category: cs.AI

TL;DR: The paper presents a competition solution for emotional support conversation tasks using LLMs with prompt engineering and fine-tuning, achieving second place in NLPCC 2025 Task 8.

DetailsMotivation: Addressing the growing demand for mental health support by developing effective emotional support conversation systems that can provide empathetic and appropriate assistance through dialogue.

Method: Leveraging large-scale language models enhanced by prompt engineering and fine-tuning techniques, exploring both parameter-efficient Low-Rank Adaptation (LoRA) and full-parameter fine-tuning strategies.

Result: The best model ranked second in the NLPCC 2025 Task 8 ESC evaluation, demonstrating the effectiveness of combining LLMs with adaptation methods for emotional support tasks.

Conclusion: The approach shows potential for emotional support conversation systems, with future work focusing on enhancing emotional understanding and response personalization for more practical and reliable systems.

Abstract: Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and fine-tuning techniques. We explore both parameter-efficient Low-Rank Adaptation (LoRA) and full-parameter fine-tuning strategies to improve the model’s ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.
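
A parameter-efficient LoRA setup of the kind described is typically a few lines with the Hugging Face peft library; the rank, target modules, and base model below are illustrative choices, not the team's actual configuration:

```python
# Illustrative LoRA setup with Hugging Face `peft`; all hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # a small fraction of full-parameter fine-tuning
```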

[295] SyGra: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data

Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda

Main category: cs.AI

TL;DR: A modular framework for generating high-quality synthetic dialogue data for LLM training (SFT/DPO) using configurable pipelines and dual-stage quality filtering.

DetailsMotivation: High-quality datasets are critical for LLM advancement in SFT and alignment tasks like DPO, but creating such datasets is resource-intensive. The paper aims to reduce data preparation overhead through scalable synthetic data generation.

Method: A modular, configuration-based pipeline for modeling complex dialogue flows with minimal manual intervention. Uses dual-stage quality tagging combining heuristic rules and LLM-based evaluations to filter and score OASST-formatted conversations. Flexible schema supports both SFT and DPO use cases.

Result: A robust framework for generating and managing high-quality synthetic conversational data at scale, enabling seamless integration into diverse LLM training workflows while significantly reducing data preparation overhead.

Conclusion: The proposed synthetic data generation framework provides a scalable, configurable solution for producing high-fidelity training data for LLM SFT and alignment tasks, addressing the critical need for quality datasets in LLM advancement.

Abstract: The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
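
The dual-stage tagging can be sketched as a pipeline: cheap heuristic rules reject obvious failures first, then an LLM judge scores the survivors. `llm_quality_score` and the rule set are hypothetical stand-ins, not SyGra's actual checks:

```python
def heuristic_pass(sample):
    """Stage 1: cheap rule-based checks on an OASST-style conversation."""
    turns = sample["conversation"]
    return (len(turns) >= 2
            and all(t["text"].strip() for t in turns)
            and turns[0]["role"] == "prompter")

def llm_quality_score(sample):
    """Stage 2 placeholder: a real implementation would prompt an LLM judge
    to rate coherence/helpfulness and parse a numeric score."""
    raise NotImplementedError

def tag_and_filter(samples, min_score=0.7):
    kept = []
    for s in samples:
        if not heuristic_pass(s):
            continue                       # rejected by rules, never hits the LLM
        s["quality"] = llm_quality_score(s)
        if s["quality"] >= min_score:
            kept.append(s)
    return kept
```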

[296] ARE: Scaling Up Agent Environments and Evaluations

Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, Thomas Scialom

Main category: cs.AI

TL;DR: ARE is a platform for creating scalable agent environments, and Gaia2 is a benchmark built on ARE that measures general agent capabilities through asynchronous, dynamic tasks requiring adaptation, collaboration, and handling of ambiguity.

DetailsMotivation: The paper addresses the gap between model development and real-world deployment by creating a platform that enables scalable environment creation and robust agent evaluation. Current benchmarks are often static and don't capture real-world complexities like dynamic environments, collaboration needs, and temporal constraints.

Method: 1) Developed Meta Agents Research Environments (ARE) - a platform with simple abstractions for building complex environments with rules, tools, content, and verifiers. 2) Built Gaia2 benchmark on ARE that requires agents to handle ambiguities, noise, dynamic environments, collaboration, and temporal constraints. 3) Designed Gaia2 to run asynchronously to surface failure modes invisible in static settings.

Result: Experiments show no system dominates across the intelligence spectrum - stronger reasoning often comes at efficiency costs, and budget scaling curves plateau. This highlights the need for new architectures and adaptive compute strategies. ARE enables continuous extension of Gaia2 to other environments.

Conclusion: ARE provides a foundation for scalable agent environment creation and evaluation. Gaia2 reveals important trade-offs in agent design and demonstrates the value of asynchronous, dynamic benchmarks. The platform empowers the community to create domain-specific benchmarks, which is crucial for driving frontier AI capabilities forward in the “second half” of AI progress.

Abstract: We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions to build complex and diverse environments, each with their own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguities and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI’s second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.

[297] Multi-Robot Path Planning Combining Heuristics and Multi-Agent Reinforcement Learning

Shaoming Peng

Main category: cs.AI

TL;DR: MAPPOHR is a two-layer path planning method combining heuristic search, empirical rules, and multi-agent reinforcement learning (MAPPO) for multi-robot navigation in dynamic environments, outperforming existing methods in planning performance and learning efficiency.

DetailsMotivation: Existing methods for multi-robot path finding in dynamic environments have limitations: heuristic search methods cause frequent replanning leading to long travel distances, while learning approaches suffer from low sample exploration/utilization efficiency and high training costs.

Method: Two-layer approach: 1) Real-time planner using MAPPO multi-agent reinforcement learning with embedded empirical rules in action output layer and reward functions, 2) Heuristic search planner creating global guiding paths. During movement, heuristic planner replans based on real-time planner instructions.

Result: Tested in 10 different conflict scenarios, MAPPOHR showed better planning performance than existing learning and heuristic methods, with higher learning efficiency due to empirical knowledge and heuristic search utilization.

Conclusion: The proposed MAPPOHR method effectively addresses limitations of previous approaches by combining heuristic search, empirical rules, and reinforcement learning, achieving superior performance and efficiency in multi-robot dynamic path planning.

Abstract: Multi-robot path finding in dynamic environments is a highly challenging classic problem. In the movement process, robots need to avoid collisions with other moving robots while minimizing their travel distance. Previous methods for this problem either continuously replan paths using heuristic search methods to avoid conflicts or choose appropriate collision avoidance strategies based on learning approaches. The former may result in long travel distances due to frequent replanning, while the latter may have low learning efficiency due to low sample exploration and utilization, which causes high training costs for the model. To address these issues, we propose a path planning method, MAPPOHR, which combines heuristic search, empirical rules, and multi-agent reinforcement learning. The method consists of two layers: a real-time planner based on the multi-agent reinforcement learning algorithm, MAPPO, which embeds empirical rules in the action output layer and reward functions, and a heuristic search planner used to create a global guiding path. During movement, the heuristic search planner replans new paths based on the instructions of the real-time planner. We tested our method in 10 different conflict scenarios. The experiments show that the planning performance of MAPPOHR is better than that of existing learning and heuristic methods. Due to the utilization of empirical knowledge and heuristic search, the learning efficiency of MAPPOHR is higher than that of existing learning methods.
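
The two-layer structure — a heuristic planner supplying a global guiding path, plus a learned local layer that can request replanning — can be sketched as a control loop. BFS stands in for the heuristic search and a trivial conflict rule stands in for the trained MAPPO policy; both are simplifications, not the paper's algorithms:

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Heuristic-search layer stand-in: shortest path on a 4-connected grid."""
    queue, parent = deque([start]), {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur); cur = parent[cur]
            return path[::-1]
        x, y = cur
        for nxt in [(x+1, y), (x-1, y), (x, y+1), (x, y-1)]:
            if nxt in grid and nxt not in parent:
                parent[nxt] = cur; queue.append(nxt)
    return None

def local_policy(pos, next_waypoint, blocked):
    """MAPPO stand-in: follow the guiding path, but ask for a replan on conflict."""
    return ("REPLAN", pos) if next_waypoint in blocked else ("MOVE", next_waypoint)

grid = {(x, y) for x in range(5) for y in range(5)}
pos, goal, blocked = (0, 0), (4, 4), {(1, 0)}   # another robot occupies (1, 0)
path = bfs_path(grid, pos, goal)
while pos != goal:
    action, target = local_policy(pos, path[1], blocked)
    if action == "REPLAN":                       # real-time layer triggers a replan
        path = bfs_path(grid - blocked, pos, goal)
        continue
    pos = target
    path = path[path.index(pos):]
print("reached", pos)
```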

[298] Machine Learning for Quantifier Selection in cvc5

Jan Jakubův, Mikoláš Janota, Jelle Piepenbrock, Josef Urban

Main category: cs.AI

TL;DR: ML-guided quantifier selection improves SMT solving for first-order quantified problems by using gradient boosting to decide which quantifiers to instantiate during solving.

DetailsMotivation: Quantifiers present a major challenge for SMT solvers and are technically a source of undecidability. Current approaches need better guidance on which quantifiers to instantiate during solving to improve performance on first-order problems.

Method: Train gradient boosting decision trees to predict which quantifiers should be instantiated. The ML model is invoked multiple times during solver execution as the set of active quantifiers changes. Integrated into the cvc5 SMT solver and trained on problems from the Mizar Mathematical Library.

Result: Considerable increase in the system’s holdout-set performance after training on a large set of first-order problems from Mizar Mathematical Library.

Conclusion: Machine learning guidance of quantifier selection significantly improves state-of-the-art SMT solving for first-order quantified problems, demonstrating the effectiveness of efficient ML models integrated into SMT solvers.

Abstract: In this work we considerably improve the state-of-the-art SMT solving on first-order quantified problems by efficient machine learning guidance of quantifier selection. Quantifiers represent a significant challenge for SMT and are technically a source of undecidability. In our approach, we train an efficient machine learning model that informs the solver which quantifiers should be instantiated and which not. Each quantifier may be instantiated multiple times and the set of the active quantifiers changes as the solving progresses. Therefore, we invoke the ML predictor many times, during the whole run of the solver. To make this efficient, we use fast ML models based on gradient boosting decision trees. We integrate our approach into the state-of-the-art cvc5 SMT solver and show a considerable increase of the system’s holdout-set performance after training it on a large set of first-order problems collected from the Mizar Mathematical Library.
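
The learning task reduces to binary classification over quantifier feature vectors: should this quantifier be instantiated in the current solver state? A minimal sklearn sketch with synthetic features (the real system derives its features from the solver itself and queries the model repeatedly during a run):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
# Synthetic stand-ins for quantifier features (e.g., term depth, symbol counts,
# age in the solver state); real features come from the SMT solver.
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # "should instantiate" label

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X[:1600], y[:1600])
print("holdout accuracy:", clf.score(X[1600:], y[1600:]))

# Inside the solver loop, the predictor is queried repeatedly as the set of
# active quantifiers changes:
active_quantifier_feats = rng.normal(size=(5, 8))
instantiate = clf.predict(active_quantifier_feats).astype(bool)
print(instantiate)
```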

[299] A Generation Framework with Strict Constraints for Crystal Materials Design

Chao Huang, Jiahui Chen, Chen Chen, Chen Chen, Chunyan Chen, Renjie Su, Shiyu Du

Main category: cs.AI

TL;DR: A constrained generation framework using LLMs to generate crystal structures with specific properties, achieving >2x probability of meeting target properties and near-perfect chemical composition adherence.

DetailsMotivation: Existing crystal generation methods rely on random sampling without strict constraints, requiring extensive post-processing to find stable candidates with desired properties. There's a need for more controlled generation that directly incorporates target properties and chemical constraints.

Method: A two-stage constrained generation framework: 1) LLM-based constraint generator produces intermediate constraints (symmetry info, composition ratio) considering target properties, 2) Crystal structure generator uses these constraints to ensure controlled generation.

Result: The method generates crystal structures with >2x probability of meeting target properties compared to existing approaches. Nearly 100% of generated crystals strictly adhere to predefined chemical composition, eliminating supply chain risks.

Conclusion: The proposed constrained generation framework effectively addresses limitations of random sampling approaches by incorporating property-aware constraints through LLMs, enabling more targeted and reliable crystal structure generation with practical industrial benefits.

Abstract: The design of crystal materials plays a critical role in areas such as new energy development, biomedical engineering, and semiconductors. Recent advances in data-driven methods have enabled the generation of diverse crystal structures. However, most existing approaches still rely on random sampling without strict constraints, requiring multiple post-processing steps to identify stable candidates with the desired physical and chemical properties. In this work, we present a new constrained generation framework that takes multiple constraints as input and enables the generation of crystal structures with specific chemical compositions and properties. In this framework, intermediate constraints, such as symmetry information and composition ratio, are generated by a constraint generator based on large language models (LLMs), which considers the target properties. These constraints are then used by a subsequent crystal structure generator to ensure that the structure generation process is under control. Our method generates crystal structures with a probability of meeting the target properties that is more than twice that of existing approaches. Furthermore, nearly 100% of the generated crystals strictly adhere to the predefined chemical composition, eliminating supply-chain risks during production.

[300] Object-centric proto-symbolic behavioural reasoning from pixels

Ruben van Bergen, Justus Hübotter, Alma Lago, Pablo Lanillos

Main category: cs.AI

TL;DR: A brain-inspired deep learning architecture learns object-centric representations from pixels to bridge low-level perception/control with high-level logical reasoning, enabling emergent conditional reasoning and composition in synthetic environments.

DetailsMotivation: Autonomous agents need to bridge low-level sensory/motor spaces with high-level abstract reasoning without expensive supervision. Object-centric representations provide a grounded interface between these levels as a key inductive bias for unsupervised learning.

Method: Novel brain-inspired deep learning architecture that learns object-centric representations from pixels to interpret, control, and reason about environments. Uses dynamic internal desired goal generation for adaptation and robustness.

Result: Agent learns emergent conditional reasoning (A→B ∧ ¬A→C), logical composition (A→B ∧ A→C ⊢ A→(B∧C)), and XOR operations. Successfully controls environment to satisfy objectives from logical rules, adapts online to unexpected changes, and shows robustness to world model violations.

Conclusion: The architecture demonstrates how grounded object representations serve as inductive bias for unsupervised learning to enable behavioral reasoning, though currently limited to synthetic 2D/3D environments (dSprites) rather than real-world complexity.

Abstract: Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the low-level spaces of sensory input and motor commands to the high-level domain of abstract reasoning and planning. A key question in designing such agents is how best to instantiate the representational space that will interface between these two levels – ideally without requiring supervision in the form of expensive data annotations. These objectives can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this work, we present a novel, brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and reason about its environment, using object-centric representations. We show the utility of our approach through tasks in synthetic environments that require a combination of (high-level) logical reasoning and (low-level) continuous control. Results show that the agent can learn emergent conditional behavioural reasoning, such as $(A \to B) \land (\neg A \to C)$, as well as logical composition $(A \to B) \land (A \to C) \vdash A \to (B \land C)$ and XOR operations, and successfully controls its environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation. While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of real-world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning, to enable behavioral reasoning.
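
The logical behaviours reported can be checked mechanically; a tiny evaluator for the conditional rule $(A \to B) \land (\neg A \to C)$ over boolean environment states (purely illustrative, not the paper's pixel-based setup):

```python
from itertools import product

def implies(p, q):
    return (not p) or q

def conditional_rule(a, b, c):
    """(A -> B) and (not A -> C): the emergent conditional behaviour."""
    return implies(a, b) and implies(not a, c)

# Enumerate which environment configurations satisfy the objective:
for a, b, c in product([False, True], repeat=3):
    print(f"A={a!s:5} B={b!s:5} C={c!s:5} -> goal satisfied: {conditional_rule(a, b, c)}")
```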

[301] AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge

You-Le Fang, Dong-Shan Jian, Xiang Li, Yan-Qing Ma

Main category: cs.AI

TL;DR: AI-Newton: A framework for autonomous concept-driven scientific discovery that derives general physical laws from multi-experiment data without supervision or prior knowledge.

DetailsMotivation: Current AI methods excel at empirical modeling from individual experiments but struggle to uncover common fundamental physics that human physicists can discover. There's a gap in AI's ability to autonomously derive general physical laws from raw data.

Method: AI-Newton introduces two core innovations: (1) proposing interpretable physical concepts to construct laws, and (2) progressively generalizing these laws to broader domains. The system operates autonomously on raw, multi-experiment data without supervision or prior physical knowledge.

Result: Applied to a large, noisy dataset of mechanics experiments, AI-Newton successfully rediscovers foundational and universal laws including Newton’s second law, conservation of energy, and universal gravitation.

Conclusion: This work represents a significant advance toward autonomous, human-like scientific discovery, bridging the gap between AI’s empirical modeling capabilities and human physicists’ ability to uncover fundamental principles.

Abstract: While current AI-driven methods excel at deriving empirical models from individual experiments, a significant challenge remains in uncovering the common fundamental physics that underlies these models – a task at which human physicists are adept. To bridge this gap, we introduce AI-Newton, a novel framework for concept-driven scientific discovery. Our system autonomously derives general physical laws directly from raw, multi-experiment data, operating without supervision or prior physical knowledge. Its core innovations are twofold: (1) proposing interpretable physical concepts to construct laws, and (2) progressively generalizing these laws to broader domains. Applied to a large, noisy dataset of mechanics experiments, AI-Newton successfully rediscovers foundational and universal laws, such as Newton’s second law, the conservation of energy, and universal gravitation. This work represents a significant advance toward autonomous, human-like scientific discovery.
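
The flavour of law rediscovery can be illustrated with the simplest possible case: recovering F = ma as a linear relation from noisy "experiments" via least squares. AI-Newton's actual concept-proposal and generalization machinery is far more general; this sketch only shows the fit-a-candidate-law step:

```python
import numpy as np

rng = np.random.default_rng(42)
# Noisy experiments: objects of known mass m undergoing acceleration a
# while a force F is applied; we look for a law of the form F = k * (m * a).
m = rng.uniform(0.5, 5.0, size=200)
a = rng.uniform(-3.0, 3.0, size=200)
F = m * a + rng.normal(scale=0.05, size=200)   # ground truth plus noise

# Propose the candidate concept x = m*a and fit F = k*x by least squares.
x = (m * a).reshape(-1, 1)
k, residuals, *_ = np.linalg.lstsq(x, F, rcond=None)
print(f"fitted k = {k[0]:.4f}  (Newton's second law predicts k = 1)")
```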

[302] SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

Yuyang Dong, Nobuhiro Ueda, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada

Main category: cs.AI

TL;DR: SCAN is a VLM-friendly document layout analysis method that improves RAG performance by semantically segmenting documents into coherent regions, boosting textual RAG by 9.4 points and visual RAG by 10.4 points.

DetailsMotivation: As LLMs and VLMs gain adoption, rich document analysis for RAG applications faces challenges because single pages contain large amounts of information. Current approaches struggle to process visually rich documents effectively, despite VLMs showing better RAG performance potential.

Method: SCAN uses a coarse-grained semantic approach to identify document components with appropriate semantic granularity, dividing documents into coherent regions covering contiguous components. The model is trained by fine-tuning object detection models on annotated datasets.

Result: Experimental results across English and Japanese datasets show SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and commercial document processing solutions.

Conclusion: SCAN effectively enhances both textual and visual RAG systems for visually rich documents by providing VLM-friendly semantic document layout analysis that balances context preservation with processing efficiency.

Abstract: With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.

[303] FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale

Isabelle Lee, Sarah Liaw, Dani Yogatama

Main category: cs.AI

TL;DR: FOL-Traces is a large-scale dataset of programmatically verified reasoning traces for evaluating structured logical inference in language models, with challenging diagnostic tasks that current models perform poorly on.

DetailsMotivation: Current evaluation of reasoning in language models is problematic: natural-language traces are unverifiable, symbolic datasets are too small, and benchmarks often conflate heuristics with true inference. There's a need for rigorous evaluation of structured logical inference.

Method: Created FOL-Traces dataset with programmatically verified reasoning traces. Proposed two diagnostic tasks: masked operation prediction (probes syntactic awareness) and step completion (probes process fidelity).

Result: Models perform poorly on the dataset: only around 45.7% accuracy on masked operation prediction and around 27% on two-step completion, showing the dataset remains challenging for current reasoning LLMs.

Conclusion: FOL-Traces provides a scalable testbed for rigorous study of structured logical inference in language models, revealing significant limitations in current models’ reasoning capabilities.

Abstract: Reasoning in language models is difficult to evaluate: natural-language traces are unverifiable, symbolic datasets too small, and most benchmarks conflate heuristics with inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging and comprehensive diagnostic tasks, masked operation prediction and step completion, that directly probe syntactic awareness and process fidelity. FOL-Traces serves as a scalable testbed for rigorously studying how models perform structured logical inference. Systematic experiments with 5 reasoning LLMs show that the dataset remains challenging: models only reach around 45.7% accuracy on masked operation prediction and around 27% on two-step completion.
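
As a toy illustration of the masked-operation diagnostic, the sketch below hides one logical operator in a trace step and asks for its restoration; the step format and operator set here are invented for illustration, not FOL-Traces' actual schema:

```python
import random

# Toy version of the masked-operation task: hide one logical operator in a
# trace step and ask a model to restore it. Accuracy over a dataset is the
# fraction of [MASK]s restored exactly.
OPS = ["AND", "OR", "IMPLIES", "NOT"]

def mask_operation(step: str) -> tuple[str, str]:
    present = [op for op in OPS if f" {op} " in f" {step} "]
    target = random.choice(present)
    return step.replace(target, "[MASK]", 1), target

masked, answer = mask_operation("P IMPLIES (Q AND R)")
print(masked, "| gold:", answer)
```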

[304] AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology

Akash Kundu, Rishika Goswami

Main category: cs.AI

TL;DR: LLMs show human-like cognitive patterns in psychological tests: coherent narratives (TAT), framing bias, Liberty/Oppression moral judgments, and cognitive dissonance with rationalization.

DetailsMotivation: To investigate whether Large Language Models exhibit human-like cognitive patterns using established psychological frameworks, bridging cognitive psychology and AI safety research.

Method: Evaluated several proprietary and open-source LLMs using structured prompts and automated scoring across four psychological frameworks: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance.

Result: Models produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization, mirroring human cognitive tendencies shaped by training data and alignment methods.

Conclusion: LLMs exhibit human-like cognitive patterns with important implications for AI transparency, ethical deployment, and future work bridging cognitive psychology and AI safety.

Abstract: We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluated several proprietary and open-source models using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

[305] Achieving Trustworthy Real-Time Decision Support Systems with Low-Latency Interpretable AI Models

Zechun Deng, Ziwei Liu, Ziqian Bi, Junhao Song, Chia Xin Liang, Joe Yeong, Xinyuan Song, Junfeng Hao

Main category: cs.AI

TL;DR: This paper reviews real-time decision support systems using low-latency AI models, focusing on Edge-IoT integration, human-AI collaboration, and LLM-assisted decision-making under resource constraints.

DetailsMotivation: The motivation is to address the growing need for efficient real-time decision support systems that can operate effectively in resource-constrained environments, leveraging recent advances in AI, Edge computing, and IoT technologies to enable better human-AI collaboration.

Method: The paper conducts a comprehensive literature review and analysis of existing approaches, examining technical developments like DeLLMa, model compression techniques, edge device analytics improvements, and frameworks for addressing resource limitations and adaptability needs.

Result: The review provides practical insights into development strategies and application areas, identifying opportunities for creating more efficient and flexible AI-supported decision systems, while highlighting current challenges and limitations.

Conclusion: The paper concludes by outlining future research directions for AI-driven real-time decision support systems, emphasizing how AI can transform decision-making processes and setting the stage for breakthroughs in this rapidly evolving field.

Abstract: This paper investigates real-time decision support systems that leverage low-latency AI models, bringing together recent progress in holistic AI-driven decision tools, integration with Edge-IoT technologies, and approaches for effective human-AI teamwork. It looks into how large language models can assist decision-making, especially when resources are limited. The research also examines the effects of technical developments such as DeLLMa, methods for compressing models, and improvements for analytics on edge devices, while also addressing issues like limited resources and the need for adaptable frameworks. Through a detailed review, the paper offers practical perspectives on development strategies and areas of application, adding to the field by pointing out opportunities for more efficient and flexible AI-supported systems. The conclusions set the stage for future breakthroughs in this fast-changing area, highlighting how AI can reshape real-time decision support.

[306] Scaling Neuro-symbolic Problem Solving: Solver-Free Learning of Constraints and Objectives

Marianne Defresne, Romain Gambardella, Sophie Barbe, Thomas Schiex

Main category: cs.AI

TL;DR: A differentiable neuro-symbolic architecture with probabilistic loss learns to solve NP-hard reasoning problems from natural inputs, outperforming other methods on Sudoku variants, visual Min-Cut/Max-Cut, and protein design optimization.

DetailsMotivation: To address the challenge of hybridizing discrete reasoning with neural networks, particularly for solving NP-hard reasoning problems from natural inputs where Large Language Models struggle. There's increasing interest in neural architectures that can learn to solve such problems directly from natural inputs.

Method: Introduces a differentiable neuro-symbolic architecture with a new probabilistic loss function that learns both constraints and objectives. The approach pushes the combinatorial solver out of the training loop for scalable training while using exact inference for maximum accuracy. This allows learning complete models that can be scrutinized and augmented with side constraints.

Result: The method efficiently learns to solve NP-hard reasoning problems from natural inputs: 1) on three Sudoku variants (symbolic, visual, many-solution), it requires a fraction of the training time of other hybrid methods; 2) on a visual Min-Cut/Max-Cut task, it optimizes regret better than a Decision-Focused-Learning regret-dedicated loss; 3) it efficiently learns an energy optimization formulation for a real-world protein design problem.

Conclusion: The proposed differentiable neuro-symbolic architecture with probabilistic loss successfully bridges neural networks and discrete reasoning, enabling efficient learning of NP-hard problem solving from natural inputs across diverse domains including puzzles, optimization tasks, and real-world applications like protein design.

Abstract: In the ongoing quest for hybridizing discrete reasoning with neural nets, there is an increasing interest in neural architectures that can learn how to solve discrete reasoning or optimization problems from natural inputs, a task that Large Language Models seem to struggle with. Objectives: We introduce a differentiable neuro-symbolic architecture and a loss function dedicated to learning how to solve NP-hard reasoning problems. Methods: Our new probabilistic loss allows for learning both the constraints and the objective, thus delivering a complete model that can be scrutinized and completed with side constraints. By pushing the combinatorial solver out of the training loop, our architecture also offers scalable training while exact inference gives access to maximum accuracy. Results: We empirically show that it can efficiently learn how to solve NP-hard reasoning problems from natural inputs. On three variants of the Sudoku benchmark (symbolic, visual, and many-solution), our approach requires a fraction of the training time of other hybrid methods. On a visual Min-Cut/Max-Cut task, it optimizes the regret better than a Decision-Focused-Learning regret-dedicated loss. Finally, it efficiently learns the energy optimization formulation of the large real-world problem of designing proteins.

[307] Executable Epistemology: The Structured Cognitive Loop as an Architecture of Intentional Understanding

Myung Ho Kim

Main category: cs.AI

TL;DR: The paper introduces Structured Cognitive Loop (SCL), an executable epistemological framework that bridges philosophy and AI by defining intelligence as a continuous process rather than a property, enabling philosophy to be tested through structural experiments.

DetailsMotivation: Current large language models lack genuine epistemic understanding despite exhibiting intelligence, revealing a gap in epistemic architecture. The paper aims to address this by shifting from asking "what is intelligence?" (ontological) to "under what conditions does cognition emerge?" (epistemological).

Method: SCL operationalizes philosophical insights from process philosophy, enactive cognition, and extended mind theory into computationally interpretable structures. It defines intelligence as a continuous loop of judgment, memory, control, action, and regulation, creating functional separation within cognitive architecture.

Result: SCL enables “executable epistemology” where philosophy becomes structural experiment. It shows functional separation yields more coherent and interpretable behavior than monolithic prompt-based systems, supported by agent evaluations. It redefines intelligence as capacity to reconstruct epistemic state through intentional understanding.

Conclusion: Real AI progress requires architectures that realize cognitive principles structurally, not just larger models. SCL impacts philosophy (allowing theories to be enacted), AI (grounding behavior in epistemic structure), and epistemology (framing knowledge as continuous reconstruction within phenomenologically coherent loops).

Abstract: Large language models exhibit intelligence without genuine epistemic understanding, exposing a key gap: the absence of epistemic architecture. This paper introduces the Structured Cognitive Loop (SCL) as an executable epistemological framework for emergent intelligence. Unlike traditional AI research asking “what is intelligence?” (ontological), SCL asks “under what conditions does cognition emerge?” (epistemological). Grounded in philosophy of mind and cognitive phenomenology, SCL bridges conceptual philosophy and implementable cognition. Drawing on process philosophy, enactive cognition, and extended mind theory, we define intelligence not as a property but as a performed process: a continuous loop of judgment, memory, control, action, and regulation. SCL makes three contributions. First, it operationalizes philosophical insights into computationally interpretable structures, enabling “executable epistemology”: philosophy as structural experiment. Second, it shows that functional separation within cognitive architecture yields more coherent and interpretable behavior than monolithic prompt-based systems, supported by agent evaluations. Third, it redefines intelligence: not representational accuracy but the capacity to reconstruct its own epistemic state through intentional understanding. This framework impacts philosophy of mind, epistemology, and AI. For philosophy, it allows theories of cognition to be enacted and tested. For AI, it grounds behavior in epistemic structure rather than statistical regularity. For epistemology, it frames knowledge not as truth possession but as continuous reconstruction within a phenomenologically coherent loop. We situate SCL within debates on cognitive phenomenology, emergence, normativity, and intentionality, arguing that real progress requires not larger models but architectures that realize cognitive principles structurally.

[308] AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library

Minwei Kong, Ao Qu, Xiaotong Guo, Wenbin Ouyang, Chonghe Jiang, Han Zheng, Yining Ma, Dingyi Zhuang, Yuhan Tang, Junyi Li, Shenhao Wang, Haris Koutsopoulos, Hai Wang, Cathy Wu, Jinhua Zhao

Main category: cs.AI

TL;DR: AlphaOPT is a self-improving experience library that enables LLMs to learn optimization modeling from limited demonstrations and solver feedback without annotated reasoning traces or parameter updates.

DetailsMotivation: Optimization modeling is difficult to automate because informal language must be mapped to precise mathematical formulations and executable solver code. Prior LLM approaches either rely on brittle prompting or costly retraining with limited generalization.

Method: AlphaOPT operates in a continual two-phase cycle: (1) Library Learning phase reflects on failed attempts, extracting solver-verified structured insights as {taxonomy, condition, explanation, example}; (2) Library Evolution phase diagnoses retrieval misalignments and refines applicability conditions of stored insights to improve transfer across tasks.

Result: AlphaOPT steadily improves with more data (65% to 72% from 100 to 300 training items) and surpasses the strongest baseline by 7.7% on the out-of-distribution OptiBench dataset when trained only on answers.

Conclusion: AlphaOPT enables efficient learning from limited demonstrations without curated rationales, expands continually without costly retraining by updating the library rather than model weights, and makes knowledge explicit and interpretable for human inspection and intervention.

Abstract: Optimization modeling enables critical decisions across industries but remains difficult to automate: informal language must be mapped to precise mathematical formulations and executable solver code. Prior LLM approaches either rely on brittle prompting or costly retraining with limited generalization. We present AlphaOPT, a self-improving experience library that enables an LLM to learn from limited demonstrations (even answers alone, without gold-standard programs) and solver feedback, without annotated reasoning traces or parameter updates. AlphaOPT operates in a continual two-phase cycle: (i) a Library Learning phase that reflects on failed attempts, extracting solver-verified, structured insights as {taxonomy, condition, explanation, example}; and (ii) a Library Evolution phase that diagnoses retrieval misalignments and refines the applicability conditions of stored insights, improving transfer across tasks. This design (1) learns efficiently from limited demonstrations without curated rationales, (2) expands continually without costly retraining by updating the library rather than model weights, and (3) makes knowledge explicit and interpretable for human inspection and intervention. Experiments show that AlphaOPT steadily improves with more data (65% to 72% from 100 to 300 training items) and surpasses the strongest baseline by 7.7% on the out-of-distribution OptiBench dataset when trained only on answers. Code and data are available at: https://github.com/Minw913/AlphaOPT.
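
A minimal sketch of the structured insight format named in the abstract ({taxonomy, condition, explanation, example}); the keyword-match retrieval stand-in below is hypothetical, not AlphaOPT's learned retrieval step:

```python
from dataclasses import dataclass

# Sketch of a solver-verified library insight in the paper's four-field
# format. The concrete field contents are invented for illustration.
@dataclass
class Insight:
    taxonomy: str     # e.g. "constraints/big-M"
    condition: str    # when the insight applies
    explanation: str  # why the fix works
    example: str      # solver-verified snippet

library = [
    Insight("constraints/big-M",
            "logical implication between binary and continuous variables",
            "Use y <= M*x to deactivate a constraint when x = 0.",
            "model.addConstr(y <= 1000 * x)"),
]

def retrieve(query: str, lib: list[Insight]) -> list[Insight]:
    # Naive keyword match standing in for the learned retrieval step.
    return [i for i in lib if any(w in i.condition for w in query.lower().split())]

print(retrieve("implication with binary variables", library))
```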

[309] PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Houda Labiod, Frank Kargl

Main category: cs.AI

TL;DR: PaTAS is a parallel trust assessment framework using Subjective Logic to model and propagate trust through neural networks, providing interpretable trust estimates that complement accuracy metrics.

DetailsMotivation: Conventional evaluation metrics like accuracy fail to capture uncertainty and reliability of model predictions, especially under adversarial or degraded conditions. There's a need for trustworthy AI systems in safety-critical applications.

Method: PaTAS operates in parallel with neural computation using Trust Nodes and Trust Functions to propagate input, parameter, and activation trust. It includes a Parameter Trust Update mechanism for training and an Inference-Path Trust Assessment (IPTA) method for instance-specific trust at inference.

Result: Experiments show PaTAS produces interpretable, symmetric, and convergent trust estimates that expose reliability gaps in poisoned, biased, or uncertain data. It effectively distinguishes benign vs adversarial inputs and identifies divergence between model confidence and actual reliability.

Conclusion: PaTAS provides transparent and quantifiable trust reasoning within neural architectures, establishing a foundation for evaluating model reliability throughout the AI lifecycle.

Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a foundation for evaluating model reliability across the AI lifecycle.
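
For readers unfamiliar with Subjective Logic, a binomial opinion is the primitive PaTAS propagates; the sketch below shows the standard representation and its projected probability (E = b + a·u), while the framework's propagation operators are not reproduced:

```python
from dataclasses import dataclass

# A binomial Subjective Logic opinion: belief, disbelief, and uncertainty
# mass must sum to one; the base rate encodes prior trust.
@dataclass
class Opinion:
    belief: float       # b: evidence supporting trust
    disbelief: float    # d: evidence against trust
    uncertainty: float  # u: lack of evidence
    base_rate: float    # a: prior trust absent evidence

    def __post_init__(self):
        assert abs(self.belief + self.disbelief + self.uncertainty - 1.0) < 1e-9

    def projected_probability(self) -> float:
        # Standard SL projection: E = b + a * u
        return self.belief + self.base_rate * self.uncertainty

clean_input = Opinion(belief=0.8, disbelief=0.1, uncertainty=0.1, base_rate=0.5)
print(clean_input.projected_probability())  # 0.85
```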

[310] BioMedGPT-Mol: Multi-task Learning for Molecular Understanding and Generation

Chenyang Zuo, Siqi Fan, Zaiqing Nie

Main category: cs.AI

TL;DR: BioMedGPT-Mol is a molecular language model created by fine-tuning a general-purpose reasoning model with a multi-task learning framework on curated molecular instruction data, achieving strong performance on molecular understanding/generation tasks and state-of-the-art retrosynthetic planning.

DetailsMotivation: To explore how general-purpose language models can be efficiently adapted for molecular science applications, particularly for small molecule drug development, given recent advances in reasoning models.

Method: Curated and unified existing public instruction datasets to create large-scale, comprehensive training data. Fine-tuned a general-purpose reasoning model using a meticulously designed multi-task learning framework.

Result: Achieved remarkable performance on consolidated benchmarks (LlaSMol, TOMG-Bench, MuMOInstruct). Demonstrated state-of-the-art performance on RetroBench for multi-step retrosynthetic planning, showing superior efficacy as an end-to-end retrosynthetic planner.

Conclusion: General-purpose reasoning models can be effectively and efficiently post-trained into professional molecular language models through well-structured multi-task curricula. The approach can potentially be extended to other biomedical scientific domains.

Abstract: Molecules play a crucial role in biomedical research and discovery, particularly in the field of small molecule drug development. Given the rapid advancements in large language models, especially the recent emergence of reasoning models, it is natural to explore how a general-purpose language model can be efficiently adapted for molecular science applications. In this work, we introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks. By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset. The model is then fine-tuned through a meticulously designed multi-task learning framework. On a consolidated benchmark derived from LlaSMol, TOMG-Bench, and MuMOInstruct, BioMedGPT-Mol achieves remarkable performance. Our experimental results demonstrate that a general-purpose reasoning model can be effectively and efficiently post-trained into a professional molecular language model through a well-structured multi-task curriculum. Leveraging these capabilities, we further apply the model to multi-step retrosynthetic planning, achieving state-of-the-art performance on RetroBench and demonstrating its superior efficacy as an end-to-end retrosynthetic planner. We anticipate that our approach can be extended to other biomedical scientific domains.

[311] CogMCTS: A Novel Cognitive-Guided Monte Carlo Tree Search Framework for Iterative Heuristic Evolution with Large Language Models

Hui Wang, Yang Liu, Xiaoyu Zhang, Chaoxu Mu

Main category: cs.AI

TL;DR: CogMCTS integrates LLM cognitive guidance with Monte Carlo Tree Search for automated heuristic design, using multi-round feedback, dual-track expansion, and strategic mutation to improve exploration-exploitation balance and solution quality.

DetailsMotivation: Existing LLM-based evolutionary methods for automatic heuristic design suffer from local optima and limited search diversity. While LLM-MCTS integration improves exploration-exploitation trade-off, it lacks multi-round cognitive integration and constrained search diversity.

Method: CogMCTS framework tightly integrates LLM cognitive guidance with MCTS using: 1) multi-round cognitive feedback incorporating historical experience, node info, and negative outcomes; 2) dual-track node expansion with elite heuristic management; 3) strategic mutation of heuristic forms and parameters.

Result: CogMCTS outperforms existing LLM-based AHD methods in stability, efficiency, and solution quality.

Conclusion: The proposed CogMCTS framework effectively addresses limitations of existing LLM-based heuristic design methods by achieving better exploration-exploitation balance through cognitive-guided MCTS integration.

Abstract: Automatic Heuristic Design (AHD) is an effective framework for solving complex optimization problems. The development of large language models (LLMs) enables the automated generation of heuristics. Existing LLM-based evolutionary methods rely on population strategies and are prone to local optima. Integrating LLMs with Monte Carlo Tree Search (MCTS) improves the trade-off between exploration and exploitation, but multi-round cognitive integration remains limited and search diversity is constrained. To overcome these limitations, this paper proposes a novel cognitive-guided MCTS framework (CogMCTS). CogMCTS tightly integrates the cognitive guidance mechanism of LLMs with MCTS to achieve efficient automated heuristic optimization. The framework employs multi-round cognitive feedback to incorporate historical experience, node information, and negative outcomes, dynamically improving heuristic generation. Dual-track node expansion combined with elite heuristic management balances the exploration of diverse heuristics and the exploitation of high-quality experience. In addition, strategic mutation modifies the heuristic forms and parameters to further enhance the diversity of the solution and the overall optimization performance. The experimental results indicate that CogMCTS outperforms existing LLM-based AHD methods in stability, efficiency, and solution quality.
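
CogMCTS builds on the standard MCTS selection step; the sketch below shows only the generic UCT rule that such frameworks augment (the LLM cognitive-feedback, dual-track expansion, and mutation components are not reproduced):

```python
import math

# Generic UCT score: exploitation (mean value) plus an exploration bonus
# that shrinks as a child is visited more often.
def uct_score(total_value: float, visits: int, parent_visits: int,
              c: float = 1.414) -> float:
    if visits == 0:
        return float("inf")  # always try unvisited children first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

children = [(12.0, 10), (5.0, 3), (0.0, 0)]  # (total_value, visits)
parent_visits = 13
best = max(range(len(children)),
           key=lambda i: uct_score(*children[i], parent_visits))
print("select child", best)  # the unvisited child wins here
```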

[312] EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Bo Zhang, Xuan Zhou, Ming Yan, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie

Main category: cs.AI

TL;DR: EcomBench is a new benchmark for evaluating foundation agents in realistic e-commerce environments, addressing the gap between academic benchmarks and real-world applications.

DetailsMotivation: Existing benchmarks focus on academic or artificial scenarios, overlooking real-world challenges. E-commerce presents a practical domain with diverse user interactions, dynamic market conditions, and real decision-making tasks that better test agent capabilities.

Method: Built EcomBench from genuine user demands in leading global e-commerce ecosystems, curated and annotated by human experts for clarity, accuracy, and domain relevance. Covers multiple e-commerce task categories with three difficulty levels.

Result: EcomBench provides a holistic benchmark that evaluates agents on key capabilities including deep information retrieval, multi-step reasoning, and cross-source knowledge integration in realistic e-commerce contexts.

Conclusion: EcomBench offers a rigorous, dynamic testbed for measuring practical agent capabilities in modern e-commerce, bridging the gap between academic evaluation and real-world application requirements.

Abstract: Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated through human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.

cs.SD

[313] Building Audio-Visual Digital Twins with Smartphones

Zitong Lan, Yiwei Tang, Yuhan Wang, Haowen Lai, Yiduo Hao, Mingmin Zhao

Main category: cs.SD

TL;DR: AV-Twin is the first practical system for creating editable audio-visual digital twins using commodity smartphones, combining visual and acoustic reconstruction with material property recovery for fully modifiable environments.

DetailsMotivation: Current digital twins are primarily visual and overlook acoustics, which is a core component of spatial realism and interaction. There's a need for practical systems that can capture both audio and visual aspects of environments using accessible hardware.

Method: Combines mobile RIR (Room Impulse Response) capture with visual-assisted acoustic field modeling to efficiently reconstruct room acoustics. Uses differentiable acoustic rendering to recover per-surface material properties, enabling modification of materials, geometry, and layout with automatic audio-visual updates.

Result: Developed the first practical system (AV-Twin) that constructs editable audio-visual digital twins using only commodity smartphones, establishing a path toward fully modifiable audio-visual digital twins for real-world environments.

Conclusion: AV-Twin represents a significant advancement in digital twin technology by integrating acoustic realism with visual representation, enabling practical creation and modification of audio-visual digital twins using accessible hardware.

Abstract: Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
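
Auralizing a source in such a twin reduces, at its core, to convolving dry audio with the room impulse response at the listener position; a toy sketch follows (the RIR here is hand-made, whereas estimating and editing it is AV-Twin's contribution):

```python
import numpy as np
from scipy.signal import fftconvolve

# Render a dry source through a room by convolving with its RIR.
fs = 16_000
rng = np.random.default_rng(0)
dry = rng.standard_normal(fs)                 # 1 s of a dry source signal
rir = np.zeros(fs // 2)
rir[[0, 2_000, 6_000]] = [1.0, 0.5, 0.25]     # toy direct path + two echoes
wet = fftconvolve(dry, rir)[: len(dry)]       # rendered ("wet") signal
print(dry.shape, wet.shape)
```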

[314] VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser

Main category: cs.SD

TL;DR: VocSim is a training-free benchmark that evaluates frozen audio embeddings’ intrinsic geometric alignment using 125k single-source clips across 19 corpora, measuring local purity and global class separation to assess zero-shot content representation quality.

DetailsMotivation: Current audio representation evaluation focuses on supervised classification benchmarks that measure adaptability through parameter updates, but there's a need to assess the intrinsic geometric quality of frozen embeddings for zero-shot content representation without training.

Method: VocSim aggregates 125k single-source audio clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. It evaluates embeddings using Precision@k for local purity and Global Separation Rate (GSR) for point-wise class separation, with GSR calibrated against an empirical permutation baseline.

Result: Frozen Whisper encoder features with time-frequency pooling and label-free PCA achieve strong zero-shot performance, but VocSim reveals a consistent generalization gap where performance drops sharply on blind, low-resource speech. The benchmark validates embeddings by showing they predict avian perceptual similarity, improve bioacoustic classification, and achieve SOTA on HEAR benchmark.

Conclusion: Intrinsic geometric quality measured by VocSim proxies utility in downstream applications. The benchmark uncovers generalization gaps in current audio representations and provides a standardized evaluation framework, with data, code, and leaderboard released to the community.

Abstract: General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline (frozen Whisper encoder features, time-frequency pooling, and label-free PCA) yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap. On blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.
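
Of the two metrics, Precision@k is straightforward to sketch: the fraction of each clip's k nearest neighbors that share its label (cosine similarity is an assumption here; the GSR permutation calibration is omitted):

```python
import numpy as np

# Local purity of an embedding space: for each clip, what fraction of its
# k nearest neighbors (excluding itself) carry the same label?
def precision_at_k(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)           # a clip is not its own neighbor
    nn = np.argsort(-sims, axis=1)[:, :k]     # indices of the k nearest clips
    return float((labels[nn] == labels[:, None]).mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
lab = rng.integers(0, 5, size=100)
print(precision_at_k(emb, lab, k=5))          # ~0.2 for random embeddings
```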

[315] Forensic deepfake audio detection using segmental speech features

Tianle Yang, Chengzhe Sun, Siwei Lyu, Phil Rose

Main category: cs.SD

TL;DR: Segmental acoustic features (related to articulatory processes) are effective for deepfake audio detection, while global features are less useful. A speaker-specific framework is proposed for forensic contexts.

DetailsMotivation: Current deepfake audio detection methods may not be optimal for forensic contexts. Segmental features are more interpretable and harder for deepfake models to replicate due to their relationship with human articulation. There's a need for approaches distinct from traditional forensic voice comparison methods.

Method: The study uses acoustic features of segmental speech sounds (individual speech segments like phonemes) that are commonly used in forensic voice comparison. It proposes a speaker-specific framework for deepfake detection rather than speaker-independent systems.

Result: Certain segmental features are effective in identifying deepfakes, while some global features provide little value. The speaker-specific approach offers advantages in forensic contexts for case-by-case interpretability and sensitivity to individual phonetic realization.

Conclusion: Segmental acoustic features offer a promising approach for deepfake audio detection, especially in forensic contexts. The speaker-specific framework provides better interpretability and sensitivity compared to speaker-independent systems, suggesting a need to rethink detection approaches for forensic applications.

Abstract: This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison (FVC) are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection using methods that are distinct from those employed in traditional FVC, and offer a new perspective on leveraging segmental features for this purpose. In addition, the present study proposes a speaker-specific framework for deepfake detection, which differs fundamentally from the speaker-independent systems that dominate current benchmarks. While speaker-independent frameworks aim at broad generalization, the speaker-specific approach offers advantages in forensic contexts where case-by-case interpretability and sensitivity to individual phonetic realization are essential.

[316] Semantic-Aware Confidence Calibration for Automated Audio Captioning

Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli, Venkata Krishna Rayalu Garapati

Main category: cs.SD

TL;DR: The paper presents a framework for improving confidence calibration in automated audio captioning by integrating confidence prediction and using semantic similarity rather than n-gram overlap to evaluate correctness.

DetailsMotivation: Current audio captioning models produce overconfident predictions regardless of semantic accuracy, limiting reliability. This stems from two issues: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation.

Method: Augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. Uses CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality.

Result: Experiments on Clotho v2 show confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics.

Conclusion: Semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics, establishing a framework that addresses both confidence estimation and semantic correctness evaluation.

Abstract: Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics. Our results establish that semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics.
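
Expected Calibration Error itself is standard; a minimal sketch follows, where the `correct` array is assumed to come from thresholding a semantic similarity score (CLAP or FENSE) rather than n-gram overlap:

```python
import numpy as np

# ECE: bin predictions by confidence, then average the gap between mean
# confidence and empirical accuracy, weighted by bin occupancy.
def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    ece, n = 0.0, len(conf)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

conf = np.array([0.9, 0.8, 0.95, 0.4, 0.3])
correct = np.array([1, 1, 0, 1, 0], dtype=float)  # semantic-similarity verdicts
print(expected_calibration_error(conf, correct))
```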

[317] Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations

Whenty Ariyanti, Kuan-Yu Chen, Sabato Marco Siniscalchi, Hsin-Min Wang, Yu Tsao

Main category: cs.SD

TL;DR: VOQANet+ combines speech foundation model embeddings with acoustic features (jitter, shimmer, HNR) for robust pathological voice assessment, outperforming baselines on both vowel and sentence speech.

DetailsMotivation: Traditional voice quality assessment methods like CAPE-V and GRBAS rely on expert raters and suffer from inter-rater variability, creating need for objective, automated solutions for voice disorder diagnosis and monitoring.

Method: VOQANet uses an attention mechanism with speech foundation model embeddings; VOQANet+ integrates self-supervised SFM embeddings with low-level acoustic descriptors (jitter, shimmer, HNR). Evaluated on both vowel-level (PVQD-A) and sentence-level (PVQD-S) speech.

Result: Sentence-based inputs yield higher accuracy, especially at patient level. VOQANet outperforms baselines in RMSE and Pearson correlation across CAPE-V/GRBAS dimensions. VOQANet+ achieves even better performance and maintains consistent performance under noisy conditions.

Conclusion: Combining SFM embeddings with low-level acoustic features enables accurate, robust pathological voice assessment suitable for real-world and telehealth applications.

Abstract: Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions. This study introduces the Voice Quality Assessment Network (VOQANet), a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, we propose VOQANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors, namely jitter, shimmer, and harmonics-to-noise ratio (HNR). Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), our models are evaluated on both vowel-level and sentence-level speech (PVQD-S) to assess generalizability. Experimental results demonstrate that sentence-based inputs yield higher accuracy, particularly at the patient level. Overall, VOQANet consistently outperforms baseline models in terms of root mean squared error (RMSE) and Pearson correlation coefficient across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even greater performance gains. Additionally, VOQANet+ maintains consistent performance under noisy conditions, suggesting enhanced robustness for real-world and telehealth applications. This work highlights the value of combining SFM embeddings with low-level features for accurate and robust pathological voice assessment.
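
The low-level descriptors have textbook definitions; a sketch of local jitter and shimmer from per-cycle pitch periods and peak amplitudes follows (real systems typically extract these with Praat; HNR is omitted since it requires harmonic analysis):

```python
import numpy as np

# Local jitter: mean absolute difference of consecutive pitch periods,
# normalized by the mean period. Local shimmer: same ratio on cycle
# peak amplitudes. Inputs are assumed pre-extracted per glottal cycle.
def local_jitter(periods: np.ndarray) -> float:
    return float(np.mean(np.abs(np.diff(periods))) / np.mean(periods))

def local_shimmer(amplitudes: np.ndarray) -> float:
    return float(np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes))

periods = np.array([0.0100, 0.0102, 0.0099, 0.0101])  # ~100 Hz voice, in s
amps = np.array([0.81, 0.79, 0.83, 0.80])
print(local_jitter(periods), local_shimmer(amps))
```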

[318] MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

Alon Ziv, Sanyuan Chen, Andros Tjandra, Yossi Adi, Wei-Ning Hsu, Bowen Shi

Main category: cs.SD

TL;DR: MR-FlowDPO enhances flow-matching music generation models using Direct Preference Optimization with multiple musical rewards for text alignment, audio quality, and semantic consistency.

DetailsMotivation: Music generation models lack direct alignment with human preferences due to the subjective nature of music evaluation, which varies widely across individuals.

Method: Uses DPO with multiple musical rewards across three dimensions: text alignment, audio production quality, and semantic consistency. Employs scalable off-the-shelf models for reward prediction and constructs preference data for DPO while integrating rewards into text prompting. Proposes a novel scoring mechanism using semantic self-supervised representations to improve rhythmic stability.

Result: MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over competitive baselines in terms of audio quality, text alignment, and musicality, as shown through extensive evaluation with music-specific objective metrics and human studies.

Conclusion: The approach successfully addresses the challenge of aligning music generation with human preferences by leveraging multiple musical rewards and DPO, resulting in improved quality across key dimensions of music evaluation.

Abstract: A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models, a major class of modern generative music models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo; samples are provided on our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
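
The DPO objective underlying the method is standard; a minimal sketch on (preferred, rejected) log-probabilities follows, with the flow-matching-specific likelihood terms and the multi-reward preference construction out of scope:

```python
import torch
import torch.nn.functional as F

# Generic DPO loss: maximize the implicit reward margin between the
# preferred (w) and rejected (l) sample, measured relative to a frozen
# reference model.
def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

logp_w, logp_l = torch.tensor([-10.0]), torch.tensor([-12.0])
ref_w, ref_l = torch.tensor([-11.0]), torch.tensor([-11.5])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```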

[319] Neural personal sound zones with flexible bright zone control

Wenye Zhu, Jun Tang, Xiaofei Li

Main category: cs.SD

TL;DR: A 3D CNN method for personal sound zone reproduction that handles flexible control microphone grids and alternative reproduction targets using only one training session.

DetailsMotivation: Traditional PSZ systems require fixed receiver arrays to measure reconstruction targets and record room impulse responses, making them inconvenient and costly for real-world applications.

Method: A 3D convolutional neural network designed for PSZ reproduction that takes virtual target scenes as inputs and outputs PSZ pre-filters, enabling flexible control microphone grids and alternative reproduction targets.

Result: The proposed method outperforms traditional approaches by handling varied reproduction targets on flexible control point grids with only one training session, and demonstrates capability to learn global spatial information from sparse sampling points in PSZs.

Conclusion: The 3D CNN approach provides a more practical and cost-effective solution for personal sound zone reproduction by eliminating the need for fixed receiver arrays and enabling flexible control configurations.

Abstract: A personal sound zone (PSZ) reproduction system, which attempts to create distinct virtual acoustic scenes for different listeners at their respective positions within the same spatial area using one loudspeaker array, is a fundamental technology for virtual reality applications. For practical applications, the reconstruction targets must be measured on the same fixed receiver array used to record the local room impulse responses (RIRs) from the loudspeaker array to the control points in each PSZ, which makes the system inconvenient and costly for real-world use. In this paper, a 3D convolutional neural network (CNN) designed for PSZ reproduction with a flexible control microphone grid and alternative reproduction targets is presented, taking the virtual target scene as input and producing the PSZ pre-filters as output. Experimental results show that, compared with the traditional method, the proposed method is able to handle varied reproduction targets on a flexible control point grid using only one training session. Furthermore, the proposed method also demonstrates the capability to learn global spatial information from sparse sampling points distributed in PSZs.

[320] Investigating training objective for flow matching-based speech enhancement

Liusha Yang, Ziru Ge, Gui Zhang, Junan Zhang, Zhizheng Wu

Main category: cs.SD

TL;DR: A systematic study of flow matching for speech enhancement under three training objectives, with perceptual and signal-based objectives added to improve convergence and quality.

DetailsMotivation: Current generative approaches for speech enhancement (score matching, Schrödinger bridge) are computationally expensive, so flow matching offers a more efficient alternative by learning velocity fields directly.

Method: Systematic study of flow matching for SE using three training objectives: velocity prediction, x₁ prediction, and preconditioned x₁ prediction. Introduces perceptual (PESQ) and signal-based (SI-SDR) objectives to enhance convergence and quality.

Result: The approach yields substantial improvements across evaluation metrics, with enhanced convergence efficiency and speech quality compared to existing methods.

Conclusion: Flow matching provides an effective and efficient framework for speech enhancement, with specific training objectives and perceptual/signal-based enhancements leading to significant performance gains.

Abstract: Speech enhancement (SE) aims to recover clean speech from noisy recordings. Although generative approaches such as score matching and Schrödinger bridge have shown strong effectiveness, they are often computationally expensive. Flow matching offers a more efficient alternative by directly learning a velocity field that maps noise to data. In this work, we present a systematic study of flow matching for SE under three training objectives: velocity prediction, $x_1$ prediction, and preconditioned $x_1$ prediction. We analyze their impact on training dynamics and overall performance. Moreover, by introducing perceptual (PESQ) and signal-based (SI-SDR) objectives, we further enhance convergence efficiency and speech quality, yielding substantial improvements across evaluation metrics.
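
For a linear interpolation path $x_t = (1-t)x_0 + t x_1$, the velocity-prediction and $x_1$-prediction objectives differ only in the regression target; a minimal sketch follows (the authors' preconditioning weights are not reproduced):

```python
import torch

# Linear-path flow matching with x0 as noise and x1 as clean data.
# `model` is any network taking (x_t, t); only the regression target
# distinguishes the velocity and x1 objectives.
def flow_matching_losses(model, x0, x1):
    t = torch.rand(x0.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    pred = model(x_t, t)
    loss_velocity = ((pred - (x1 - x0)) ** 2).mean()  # velocity target x1 - x0
    loss_x1 = ((pred - x1) ** 2).mean()               # x1-prediction target
    return loss_velocity, loss_x1

toy_model = lambda x, t: torch.zeros_like(x)          # placeholder network
x0, x1 = torch.randn(8, 4), torch.randn(8, 4)
print(flow_matching_losses(toy_model, x0, x1))
```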

[321] BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, Wentao Zhang

Main category: cs.SD

TL;DR: BRACE is a new benchmark for evaluating reference-free audio caption alignment quality, testing both CLAPScore-based metrics and Large Audio Language Models, revealing significant limitations in current approaches.

DetailsMotivation: Current reference-free audio caption evaluation metrics lack systematic validation of robustness, and there's a need for better benchmarks to assess both evaluation metrics and Large Audio Language Models' modality alignment abilities.

Method: Created BRACE benchmark with two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinations. Constructed datasets through high-quality filtering, LLM-based corruption, and human annotation.

Result: Best CLAP-based ACEM achieved only 70.01 F1-score on BRACE-Main, while best LALM reached just 63.19, revealing significant limitations in current audio caption evaluation approaches.

Conclusion: BRACE benchmark exposes weaknesses in current CLAP models and LALMs for audio caption evaluation, providing valuable insights and direction for future research in audio-language alignment.

Abstract: Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.

[322] Universal Discrete-Domain Speech Enhancement

Fei Liu, Yang Ai, Ye-Xin Lu, Rui-Chen Zheng, Hui-Peng Du, Zhen-Hua Ling

Main category: cs.SD

TL;DR: UDSE is a universal discrete-domain speech enhancement model that treats SE as a classification task predicting clean discrete tokens from a pre-trained neural codec, enabling robust enhancement across multiple simultaneous distortions.

DetailsMotivation: Real-world speech suffers from multiple simultaneous distortions, but most SE methods only handle limited types of interference. This gap limits generalization and practical usability in real environments.

Method: UDSE reframes SE as discrete-domain classification rather than regression. It extracts global features from degraded speech, then predicts clean RVQ tokens from a pre-trained neural codec in a sequential manner (each VQ prediction depends on previous ones), and finally decodes tokens to reconstruct clean waveform.

Result: UDSE effectively enhances speech degraded by various conventional and unconventional distortions (additive noise, reverberation, band limitation, clipping, phase distortion, compression) and their combinations, outperforming advanced regression-based methods.

Conclusion: The discrete-domain classification approach provides superior universality and practicality for real-world speech enhancement compared to traditional regression-based methods, enabling robust performance across multiple simultaneous distortions.

Abstract: In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods only handle a limited range of distortions, such as additive noise, reverberation, or band limitation, while the study of SE under multiple simultaneous distortions remains limited. This gap affects the generalization and practical usability of SE methods in real-world environments. To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task, instead predicting the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, the clean token prediction for each VQ follows the rules of RVQ, where the prediction of each VQ relies on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, the UDSE model employs a teacher-forcing strategy, and is optimized with cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.

cs.LG

[323] HGC-Herd: Efficient Heterogeneous Graph Condensation via Representative Node Herding

Fuyan Ou, Siqi Ai, Yulin Hu

Main category: cs.LG

TL;DR: HGC-Herd: A training-free condensation framework for scalable heterogeneous graph neural networks that generates compact graphs while maintaining semantic and structural fidelity.

DetailsMotivation: HGNNs face scalability challenges on large graphs due to structural redundancy and high-dimensional features. Existing graph condensation methods like GCond are designed for homogeneous graphs and rely on gradient matching, causing computational, memory, and optimization overhead.

Method: HGC-Herd integrates lightweight feature propagation to encode multi-hop relational context and uses a class-wise herding mechanism to identify representative nodes per class, producing balanced and discriminative subsets without training.
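
Herding itself is a classic greedy selection rule; a minimal NumPy sketch (illustrative, not the authors' code) picks nodes whose running mean tracks the class mean of the propagated features:

```python
import numpy as np

def herding_select(X, m):
    """Greedily pick m rows of X so the mean of the picks tracks X's mean."""
    mu = X.mean(axis=0)
    w = mu.copy()
    chosen, avail = [], np.ones(len(X), dtype=bool)
    for _ in range(m):
        scores = X @ w
        scores[~avail] = -np.inf
        i = int(np.argmax(scores))
        chosen.append(i)
        avail[i] = False
        w += mu - X[i]          # classic herding update
    return chosen

reps = herding_select(np.random.rand(500, 16), m=20)  # run per class in practice
```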

Result: Extensive experiments on ACM, DBLP, and Freebase show HGC-Herd achieves comparable or superior accuracy to full-graph training while significantly reducing runtime and memory consumption.

Conclusion: HGC-Herd demonstrates practical value for efficient and scalable heterogeneous graph representation learning by addressing the scalability limitations of existing HGNN approaches.

Abstract: Heterogeneous graph neural networks (HGNNs) have demonstrated strong capability in modeling complex semantics across multi-type nodes and relations. However, their scalability to large-scale graphs remains challenging due to structural redundancy and high-dimensional node features. Existing graph condensation approaches, such as GCond, are primarily developed for homogeneous graphs and rely on gradient matching, resulting in considerable computational, memory, and optimization overhead. We propose HGC-Herd, a training-free condensation framework that generates compact yet informative heterogeneous graphs while maintaining both semantic and structural fidelity. HGC-Herd integrates lightweight feature propagation to encode multi-hop relational context and employs a class-wise herding mechanism to identify representative nodes per class, producing balanced and discriminative subsets for downstream learning tasks. Extensive experiments on ACM, DBLP, and Freebase validate that HGC-Herd attains comparable or superior accuracy to full-graph training while markedly reducing both runtime and memory consumption. These results underscore its practical value for efficient and scalable heterogeneous graph representation learning.

[324] BAMBO: Construct Ability and Efficiency LLM Pareto Set via Bayesian Adaptive Multi-objective Block-wise Optimization

Kesheng Chen, Wenjian Luo, Zhenqian Zhu, Yamin Hu, Yiya Xi

Main category: cs.LG

TL;DR: BAMBO is a novel framework that automatically constructs Pareto sets for LLMs by balancing capability-efficiency trade-offs through hybrid block partitioning and Bayesian optimization.

DetailsMotivation: Existing model merging techniques are inadequate for constructing Pareto sets - coarse-grained methods produce sparse suboptimal solutions, while fine-grained approaches suffer from computational intractability due to the "curse of dimensionality."

Method: BAMBO introduces Hybrid Optimal Block Partitioning as a 1D clustering problem solved via dynamic programming to balance intra-block homogeneity and inter-block information distribution. The framework uses an evolutionary loop with q-Expected Hypervolume Improvement (qEHVI) acquisition function for automated Pareto set construction.
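
The 1D-clustering-by-dynamic-programming step can be sketched in a few lines; the within-block variance objective below is a stand-in assumption, since the paper's exact criterion balances homogeneity against information distribution:

```python
import numpy as np

def optimal_blocks(scores, n_blocks):
    """DP over contiguous 1-D partitions, minimizing within-block variance."""
    n = len(scores)
    s1 = np.concatenate([[0.0], np.cumsum(scores)])
    s2 = np.concatenate([[0.0], np.cumsum(np.square(scores))])

    def cost(i, j):  # sum of squared deviations within scores[i:j]
        m = j - i
        return (s2[j] - s2[i]) - (s1[j] - s1[i]) ** 2 / m

    dp = np.full((n_blocks + 1, n + 1), np.inf)
    dp[0, 0] = 0.0
    cut = np.zeros((n_blocks + 1, n + 1), dtype=int)
    for b in range(1, n_blocks + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = dp[b - 1, i] + cost(i, j)
                if c < dp[b, j]:
                    dp[b, j], cut[b, j] = c, i
    bounds, j = [], n
    for b in range(n_blocks, 0, -1):      # backtrack the block boundaries
        i = cut[b, j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]

print(optimal_blocks(np.array([1.0, 1.1, 0.9, 5.0, 5.2, 9.0]), 3))
```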

Result: Experiments show BAMBO discovers superior and more comprehensive Pareto frontiers than baselines, enabling agile model selection tailored to diverse operational constraints.

Conclusion: BAMBO resolves the dichotomy between coarse-grained and fine-grained model merging approaches, making Pareto set construction tractable for LLMs while maintaining critical granularity.

Abstract: Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the “curse of dimensionality,” rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/xin8coder/BAMBO.

[325] Latent Action World Models for Control with Unlabeled Trajectories

Marvin Alles, Xingyuan Zhang, Patrick van der Smagt, Philip Becker-Ehmck

Main category: cs.LG

TL;DR: Latent-action world models learn from both action-conditioned and action-free data by learning shared latent action representations, enabling efficient training with minimal labeled data.

DetailsMotivation: Standard world models rely heavily on action-conditioned trajectories, which limits their effectiveness when action labels are scarce. Humans learn by combining direct interaction with passive observation (like watching videos), suggesting that world models should similarly leverage heterogeneous data sources.

Method: Introduce latent-action world models that learn a shared latent action representation aligning observed control signals with actions inferred from passive observations. This enables a single dynamics model to train on large-scale unlabeled trajectories while requiring only a small set of action-labeled ones. Use this model to learn latent-action policies through offline reinforcement learning.
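
A hedged sketch of one plausible wiring (sizes and module names are invented): an inverse-dynamics encoder infers a latent action from consecutive observations, a dynamics model predicts the next observation from it, and an alignment head maps real actions into the same latent space on the labeled subset only:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, lat_dim = 32, 4, 8
infer = nn.Sequential(nn.Linear(2 * obs_dim, 64), nn.ReLU(), nn.Linear(64, lat_dim))
dynamics = nn.Sequential(nn.Linear(obs_dim + lat_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))
act2lat = nn.Linear(act_dim, lat_dim)   # aligns labeled actions with the latent space

o_t, o_next = torch.randn(16, obs_dim), torch.randn(16, obs_dim)
a_t = torch.randn(16, act_dim)          # only available for the labeled subset

z = infer(torch.cat([o_t, o_next], -1))                  # latent action from observations
recon_loss = ((dynamics(torch.cat([o_t, z], -1)) - o_next) ** 2).mean()
align_loss = ((act2lat(a_t) - z) ** 2).mean()            # applied to labeled data only
(recon_loss + align_loss).backward()
```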

Result: On DeepMind Control Suite, the approach achieves strong performance while using about an order of magnitude fewer action-labeled samples than purely action-conditioned baselines.

Conclusion: Latent actions enable training on both passive and interactive data, making world models learn more efficiently by bridging offline RL (which typically needs action-conditioned data) with action-free training (rarely used with RL).

Abstract: Inspired by how humans combine direct interaction with action-free experience (e.g., videos), we study world models that learn from heterogeneous data. Standard world models typically rely on action-conditioned trajectories, which limits effectiveness when action labels are scarce. We introduce a family of latent-action world models that jointly use action-conditioned and action-free data by learning a shared latent action representation. This latent space aligns observed control signals with actions inferred from passive observations, enabling a single dynamics model to train on large-scale unlabeled trajectories while requiring only a small set of action-labeled ones. We use the latent-action world model to learn a latent-action policy through offline reinforcement learning (RL), thereby bridging two traditionally separate domains: offline RL, which typically relies on action-conditioned data, and action-free training, which is rarely used with subsequent RL. On the DeepMind Control Suite, our approach achieves strong performance while using about an order of magnitude fewer action-labeled samples than purely action-conditioned baselines. These results show that latent actions enable training on both passive and interactive data, which makes world models learn more efficiently.

[326] Cluster-DAGs as Powerful Background Knowledge For Causal Discovery

Jan Marco Ruiz de Vargas, Kirtan Padh, Niki Kilbertus

Main category: cs.LG

TL;DR: Cluster-DAGs used as prior knowledge to warm-start causal discovery, with new algorithms Cluster-PC and Cluster-FCI outperforming baselines.

DetailsMotivation: Current causal discovery methods struggle with high-dimensional data and complex dependencies. Incorporating prior knowledge can help improve causal discovery performance.

Method: Use Cluster-DAGs as a flexible prior knowledge framework. Introduce two modified constraint-based algorithms: Cluster-PC for fully observed settings and Cluster-FCI for partially observed settings.

Result: Empirical evaluation on simulated data shows that Cluster-PC and Cluster-FCI outperform their respective baselines without prior knowledge.

Conclusion: Cluster-DAGs provide an effective prior knowledge framework for warm-starting causal discovery, offering greater flexibility than existing tiered background knowledge approaches.

Abstract: Finding cause-effect relationships is of key importance in science. Causal discovery aims to recover a graph from data that succinctly describes these cause-effect relationships. However, current methods face several challenges, especially when dealing with high-dimensional data and complex dependencies. Incorporating prior knowledge about the system can aid causal discovery. In this work, we leverage Cluster-DAGs as a prior knowledge framework to warm-start causal discovery. We show that Cluster-DAGs offer greater flexibility than existing approaches based on tiered background knowledge and introduce two modified constraint-based algorithms, Cluster-PC and Cluster-FCI, for causal discovery in the fully and partially observed setting, respectively. Empirical evaluation on simulated data demonstrates that Cluster-PC and Cluster-FCI outperform their respective baselines without prior knowledge.

[327] Robust Gradient Descent via Heavy-Ball Momentum with Predictive Extrapolation

Sarwan Ali

Main category: cs.LG

TL;DR: HB-SGE combines heavy-ball momentum with predictive gradient extrapolation for robust acceleration that prevents divergence on ill-conditioned/non-convex problems where NAG and standard momentum fail.

DetailsMotivation: Accelerated gradient methods like NAG achieve fast convergence on well-conditioned problems but often diverge on ill-conditioned or non-convex landscapes due to aggressive momentum accumulation. There's a need for a robust first-order method that maintains stability while providing adaptive acceleration.

Method: Heavy-Ball Synthetic Gradient Extrapolation (HB-SGE) combines heavy-ball momentum with predictive gradient extrapolation. Unlike classical momentum methods that accumulate historical gradients, HB-SGE estimates future gradient directions using local Taylor approximations, providing adaptive acceleration while maintaining stability.
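
The summary pins down the update only loosely; one plausible reading, sketched here under that assumption, extrapolates the next gradient with a first-order finite difference (a local Taylor step) and adds heavy-ball momentum:

```python
import numpy as np

def hb_sge(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Hedged sketch: heavy-ball momentum plus one-step gradient extrapolation."""
    x, x_prev = x0.copy(), x0.copy()
    g_prev = grad(x)
    for _ in range(steps):
        g = grad(x)
        g_pred = g + (g - g_prev)                # first-order Taylor extrapolation
        x_next = x - lr * g_pred + beta * (x - x_prev)
        x_prev, x, g_prev = x, x_next, g
    return x

# ill-conditioned quadratic f(x) = 0.5 * x' diag(1, 50) x, gradient diag(1, 50) x
x = hb_sge(lambda x: np.array([1.0, 50.0]) * x, np.array([1.0, 1.0]))
```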

Result: On ill-conditioned quadratics (κ=50), HB-SGE converges in 119 iterations while both SGD and NAG diverge. On the non-convex Rosenbrock function, HB-SGE achieves convergence in 2,718 iterations where classical momentum methods diverge within 10 steps. While NAG remains faster on well-conditioned problems, HB-SGE provides robust acceleration across diverse landscapes.

Conclusion: HB-SGE offers a robust alternative to NAG with proven convergence guarantees for strongly convex functions, requiring only O(d) memory overhead and the same hyperparameters as standard momentum. It provides reliable acceleration where traditional momentum methods fail, making it suitable for ill-conditioned and non-convex optimization problems.

Abstract: Accelerated gradient methods like Nesterov’s Accelerated Gradient (NAG) achieve faster convergence on well-conditioned problems but often diverge on ill-conditioned or non-convex landscapes due to aggressive momentum accumulation. We propose Heavy-Ball Synthetic Gradient Extrapolation (HB-SGE), a robust first-order method that combines heavy-ball momentum with predictive gradient extrapolation. Unlike classical momentum methods that accumulate historical gradients, HB-SGE estimates future gradient directions using local Taylor approximations, providing adaptive acceleration while maintaining stability. We prove convergence guarantees for strongly convex functions and demonstrate empirically that HB-SGE prevents divergence on problems where NAG and standard momentum fail. On ill-conditioned quadratics (condition number $κ=50$), HB-SGE converges in 119 iterations while both SGD and NAG diverge. On the non-convex Rosenbrock function, HB-SGE achieves convergence in 2,718 iterations where classical momentum methods diverge within 10 steps. While NAG remains faster on well-conditioned problems, HB-SGE provides a robust alternative with speedup over SGD across diverse landscapes, requiring only $O(d)$ memory overhead and the same hyperparameters as standard momentum.

[328] Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs

Skyler Wu, Aymen Echarghaoui

Main category: cs.LG

TL;DR: The paper introduces four new weighting strategies for Multiple-Reference Preference Optimization (MRPO) that outperform current methods, but surprisingly finds that single-reference DPO often outperforms all multiple-reference approaches.

DetailsMotivation: Current methods for setting reference weights in MRPO are ad-hoc and statistically unsound, leading to unreliable performance. The authors aim to develop more principled weighting strategies to better leverage multiple reference models' collective desirable properties.

Method: The authors introduce four new weighting strategies: two offline methods using held-out validation signal, one online method using sliding-window estimator to reduce overfitting, and one online method treating reference weighting as a K-armed bandit via Thompson Sampling. Experiments use Qwen2.5-0.5B as policy model with seven reference models from various families.
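
The bandit view of reference weighting is easy to sketch; below is a minimal Bernoulli Thompson Sampling loop, with a simulated reward standing in for the per-step validation signal (the paper's actual reward definition is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 7                                    # one arm per reference model
true_p = rng.uniform(0.3, 0.7, size=K)   # simulated, unknown arm qualities
alpha, beta = np.ones(K), np.ones(K)     # Beta posteriors over arm reward rates

for step in range(2000):
    theta = rng.beta(alpha, beta)        # sample a plausible quality per arm
    k = int(np.argmax(theta))            # regularize toward reference k this step
    reward = rng.random() < true_p[k]    # stand-in for "did validation improve?"
    alpha[k] += reward
    beta[k] += 1 - reward

print("most-pulled reference:", int(np.argmax(alpha + beta)))
```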

Result: All four new weighting strategies outperform current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. However, surprisingly, single-reference DPO using any of 6 out of 7 references consistently outperforms all tested multiple-reference approaches.

Conclusion: While the new weighting strategies improve MRPO performance, the finding that single-reference DPO often outperforms multiple-reference approaches calls into question the practical appeal of multiple-reference methods for preference optimization.

Abstract: Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and an online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all 4 of our strategies outperform the current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More thought-provokingly, however, we find that single-reference DPO, using any of 6 out of 7 references, consistently outperforms all tested multiple-reference approaches – calling into question the practical appeal of multiple-reference approaches.

[329] SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

Jongmin Lee, Meiqi Sun, Pieter Abbeel

Main category: cs.LG

TL;DR: SEMDICE: A principled off-policy algorithm for unsupervised RL pre-training that learns state entropy maximizing policies from arbitrary off-policy datasets by optimizing directly in the space of stationary distributions.

DetailsMotivation: In unsupervised RL pre-training, agents need to learn prior policies for downstream tasks without task-specific rewards. State entropy maximization (SEM) is a promising approach, but existing methods need improvement in learning from arbitrary off-policy datasets.

Method: SEMDICE is an off-policy algorithm that computes a single stationary Markov state-entropy-maximizing policy directly from arbitrary off-policy datasets by optimizing within the space of stationary distributions.

Result: SEMDICE outperforms baseline algorithms in maximizing state entropy and achieves the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.

Conclusion: SEMDICE provides an effective approach for unsupervised RL pre-training through state entropy maximization, enabling efficient adaptation to downstream tasks from arbitrary off-policy data.

Abstract: In the unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state stationary distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes an SEM policy from an arbitrary off-policy dataset, which optimizes the policy directly within the space of stationary distributions. SEMDICE computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.

[330] Local LLM Ensembles for Zero-shot Portuguese Named Entity Recognition

João Lucas Luz Lima Sarcinelli, Diego Furtado Silva

Main category: cs.LG

TL;DR: A three-step ensemble pipeline for zero-shot NER using multiple small LLMs outperforms individual models for Portuguese NER tasks, requiring minimal annotated data.

DetailsMotivation: LLMs under-perform in NER for lower-resource languages like Portuguese, and while open-weight LLMs enable local deployment, no single model dominates all tasks, creating a need for ensemble approaches specifically for NER.

Method: A novel three-step ensemble pipeline for zero-shot NER using similarly capable, locally run LLMs with a heuristic to select optimal model combinations using minimal annotated data.
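
The summary does not spell out the three steps; one common way to combine span predictions from several LLMs, shown purely as an illustration rather than the paper's pipeline, is entity-level majority voting:

```python
from collections import Counter

def vote_entities(model_outputs, quorum=2):
    """Keep (start, end, label) spans proposed by at least `quorum` models."""
    counts = Counter(span for spans in model_outputs for span in set(spans))
    return {span for span, c in counts.items() if c >= quorum}

preds = [
    [(0, 5, "PER"), (10, 12, "LOC")],
    [(0, 5, "PER")],
    [(0, 5, "PER"), (10, 12, "ORG")],
]
print(vote_entities(preds))  # {(0, 5, 'PER')}
```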

Result: Outperforms individual LLMs in 4 out of 5 Portuguese NER datasets, and ensembles obtained on different source datasets generally outperform individual LLMs in cross-dataset configurations.

Conclusion: Advances scalable, low-resource, zero-shot NER by effectively combining multiple small LLMs without fine-tuning, potentially eliminating the need for annotated data for current tasks.

Abstract: Large Language Models (LLMs) excel in many Natural Language Processing (NLP) tasks through in-context learning but often under-perform in Named Entity Recognition (NER), especially for lower-resource languages like Portuguese. While open-weight LLMs enable local deployment, no single model dominates all tasks, motivating ensemble approaches. However, existing LLM ensembles focus on text generation or classification, leaving NER under-explored. In this context, this work proposes a novel three-step ensemble pipeline for zero-shot NER using similarly capable, locally run LLMs. Our method outperforms individual LLMs in four out of five Portuguese NER datasets by leveraging a heuristic to select optimal model combinations with minimal annotated data. Moreover, we show that ensembles obtained on different source datasets generally outperform individual LLMs in cross-dataset configurations, potentially eliminating the need for annotated data for the current task. Our work advances scalable, low-resource, and zero-shot NER by effectively combining multiple small LLMs without fine-tuning. Code is available at https://github.com/Joao-Luz/local-llm-ner-ensemble.

[331] Better Prevent than Tackle: Valuing Defense in Soccer Based on Graph Neural Networks

Hyunsung Kim, Sangwoo Seo, Hoyoung Choi, Tom Boomstra, Jinsung Yoon, Chanyoung Park

Main category: cs.LG

TL;DR: DEFCON is a new framework that uses Graph Attention Networks to quantify defensive contributions in soccer by measuring how defenders prevent dangerous attacks before they happen, not just through visible on-ball actions.

DetailsMotivation: Current defensive evaluation methods focus too much on visible on-ball actions (interceptions, tackles) while missing the crucial defensive work that prevents dangerous opportunities from arising in the first place. There's a gap in measuring defenders' true impact on preventing attacks.

Method: Uses Graph Attention Networks to estimate success probability and expected value of each attacking option, plus each defender’s responsibility for stopping it. Calculates Expected Possession Value (EPV) before and after each action, assigning positive/negative credits to defenders based on whether they reduced or increased opponent’s EPV.
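
The credit rule reduces to a signed EPV difference split by responsibility; a toy worked example with invented numbers:

```python
# Opponent EPV drops from 0.12 to 0.04 after a defensive action; two defenders
# carry responsibilities 0.7 and 0.3 for the attacking option that was stopped.
epv_before, epv_after = 0.12, 0.04
for resp in (0.7, 0.3):
    credit = resp * (epv_before - epv_after)   # positive: danger was reduced
    print(round(credit, 3))                    # 0.056 and 0.024
```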

Result: Trained on 2023-24 and evaluated on 2024-25 Eredivisie data, DEFCON’s aggregated player credits show strong positive correlations with market valuations. The framework enables practical applications like in-game defensive contribution timelines, spatial analyses across pitch zones, and attacker-defender interaction summaries.

Conclusion: DEFCON provides a comprehensive framework for quantifying defensive contributions that goes beyond traditional on-ball metrics, offering a more complete picture of defenders’ impact and enabling various practical applications for performance analysis.

Abstract: Evaluating defensive performance in soccer remains challenging, as effective defending is often expressed not through visible on-ball actions such as interceptions and tackles, but through preventing dangerous opportunities before they arise. Existing approaches have largely focused on valuing on-ball actions, leaving much of defenders’ true impact unmeasured. To address this gap, we propose DEFCON (DEFensive CONtribution evaluator), a comprehensive framework that quantifies player-level defensive contributions for every attacking situation in soccer. Leveraging Graph Attention Networks, DEFCON estimates the success probability and expected value of each attacking option, along with each defender’s responsibility for stopping it. These components yield an Expected Possession Value (EPV) for the attacking team before and after each action, and DEFCON assigns positive or negative credits to defenders according to whether they reduced or increased the opponent’s EPV. Trained on 2023-24 and evaluated on 2024-25 Eredivisie event and tracking data, DEFCON’s aggregated player credits exhibit strong positive correlations with market valuations. Finally, we showcase several practical applications, including in-game timelines of defensive contributions, spatial analyses across pitch zones, and pairwise summaries of attacker-defender interactions.

[332] Detailed balance in large language model-driven agents

Zhuo-Yang Song, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu

Main category: cs.LG

TL;DR: Researchers discovered a detailed balance principle in LLM-generated transitions, suggesting LLMs implicitly learn underlying potential functions rather than explicit rules, revealing a universal physical law in LLM dynamics.

DetailsMotivation: Despite empirical success of LLM-driven agents, there's a lack of theoretical framework to understand their macroscopic dynamics. The paper aims to establish a scientific foundation for studying AI agents beyond engineering practices.

Method: Used least action principle to estimate generative directionality of LLMs. Experimentally measured transition probabilities between LLM-generated states to statistically analyze the dynamics.
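
Detailed balance for a transition matrix P with stationary distribution π means π_i P_ij = π_j P_ji for all i, j; a small NumPy check of that property (illustrative, not the paper's measurement pipeline):

```python
import numpy as np

P = np.array([[0.8, 0.15, 0.05],        # example transition matrix between
              [0.3, 0.6,  0.1 ],        # LLM-generated states
              [0.2, 0.2,  0.6 ]])

evals, evecs = np.linalg.eig(P.T)       # stationary distribution: left eigenvector
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

flux = pi[:, None] * P                  # probability flux i -> j
print("max detailed-balance violation:", np.abs(flux - flux.T).max())
```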

Result: Discovered detailed balance in LLM-generated transitions, indicating LLMs learn underlying potential functions rather than explicit rule sets. This principle appears to transcend different LLM architectures and prompt templates.

Conclusion: First discovery of macroscopic physical law in LLM generative dynamics independent of specific model details. This work establishes foundation for macroscopic dynamics theory of complex AI systems, elevating AI agent study to a quantifiable science.

Abstract: Large language model (LLM)-driven agents are emerging as a powerful new paradigm for solving complex problems. Despite the empirical success of these practices, a theoretical framework to understand and unify their macroscopic dynamics remains lacking. This Letter proposes a method based on the least action principle to estimate the underlying generative directionality of LLMs embedded within agents. By experimentally measuring the transition probabilities between LLM-generated states, we statistically discover a detailed balance in LLM-generated transitions, indicating that LLM generation may not be achieved by generally learning rule sets and strategies, but rather by implicitly learning a class of underlying potential functions that may transcend different LLM architectures and prompt templates. To our knowledge, this is the first discovery of a macroscopic physical law in LLM generative dynamics that does not depend on specific model details. This work is an attempt to establish a macroscopic dynamics theory of complex AI systems, aiming to elevate the study of AI agents from a collection of engineering practices to a science built on effective measurements that are predictable and quantifiable.

[333] DB2-TransF: All You Need Is Learnable Daubechies Wavelets for Time Series Forecasting

Moulik Gupta, Achyut Mani Tripathi

Main category: cs.LG

TL;DR: DB2-TransF replaces Transformer self-attention with learnable Daubechies wavelet coefficients for efficient multi-scale time series forecasting.

DetailsMotivation: Transformers have quadratic complexity limiting scalability for large-scale time series forecasting, needing more efficient architectures that maintain modeling power.

Method: Novel Transformer-inspired architecture using learnable Daubechies wavelet coefficient layer instead of self-attention to capture multi-scale local/global patterns and cross-series correlations.
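
A minimal sketch of the core ingredient under stated assumptions: a strided Conv1d whose filters start from the db2 analysis pair (low-pass h, high-pass g with g_k = (-1)^k h_{3-k}) and are left trainable. The layer shape and wiring are illustrative, not the paper's architecture:

```python
import math
import torch
import torch.nn as nn

# Daubechies-2 scaling (low-pass) filter; high-pass via alternating-sign flip.
s3 = math.sqrt(3.0)
h = torch.tensor([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * math.sqrt(2.0))
g = torch.tensor([h[3], -h[2], h[1], -h[0]])

class LearnableDB2(nn.Module):
    """One learnable wavelet analysis level: stride-2 conv, db2 initialization."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 2, kernel_size=4, stride=2, bias=False)
        with torch.no_grad():
            self.conv.weight.copy_(torch.stack([h, g]).unsqueeze(1))

    def forward(self, x):               # x: (batch, 1, length)
        return self.conv(x)             # (batch, 2, ~length/2): approx + detail

coeffs = LearnableDB2()(torch.randn(8, 1, 64))
```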

Result: Achieves comparable or superior accuracy to conventional Transformers on 13 benchmarks while substantially reducing memory usage.

Conclusion: DB2-TransF provides a scalable, resource-efficient framework for advanced time series forecasting with wavelet-based attention replacement.

Abstract: Time series forecasting requires models that can efficiently capture complex temporal dependencies, especially in large-scale and high-dimensional settings. While Transformer-based architectures excel at modeling long-range dependencies, their quadratic computational complexity poses limitations on scalability and adaptability. To overcome these challenges, we introduce DB2-TransF, a novel Transformer-inspired architecture that replaces the self-attention mechanism with a learnable Daubechies wavelet coefficient layer. This wavelet-based module efficiently captures multi-scale local and global patterns and enhances the modeling of correlations across multiple time series for the time series forecasting task. Extensive experiments on 13 standard forecasting benchmarks demonstrate that DB2-TransF achieves comparable or superior predictive accuracy to conventional Transformers, while substantially reducing memory usage for the time series forecasting task. The obtained experimental results position DB2-TransF as a scalable and resource-efficient framework for advanced time series forecasting. Our code is available at https://github.com/SteadySurfdom/DB2-TransF

[334] Mitigating Exposure Bias in Risk-Aware Time Series Forecasting with Soft Tokens

Alireza Namazi, Amirreza Dolatpour Fathkouhi, Heman Shakeri

Main category: cs.LG

TL;DR: SoTra uses soft token trajectory forecasting with continuous probability distributions to reduce exposure bias and improve multi-step forecasting for clinical applications like diabetes and blood pressure management.

DetailsMotivation: Standard autoregressive models trained with teacher forcing suffer from exposure bias, leading to unstable multi-step forecasts that are unsuitable for closed-loop clinical control where different operating zones have different clinical risks.

Method: Introduces Soft-Token Trajectory Forecasting (SoTra) which propagates continuous probability distributions (soft tokens) to mitigate exposure bias and learn calibrated, uncertainty-aware trajectories, plus a risk-aware decoding module that minimizes expected clinical harm.
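
Risk-aware decoding from a soft token is an expected-loss argmin over the predicted distribution; a toy sketch with an invented harm matrix:

```python
import numpy as np

p = np.array([0.05, 0.15, 0.60, 0.20])      # predicted distribution over 4 zones
H = np.array([[0., 1., 3., 6.],             # H[true, decoded]: invented clinical harm
              [1., 0., 1., 3.],
              [3., 1., 0., 1.],
              [6., 3., 1., 0.]])

expected_harm = p @ H                       # expected harm of decoding each zone
print("risk-aware decode:", int(np.argmin(expected_harm)))
print("argmax-probability decode:", int(np.argmax(p)))
```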

Result: In glucose forecasting, SoTra reduces average zone-based risk by 18%; in blood-pressure forecasting, it lowers effective clinical risk by approximately 15%.

Conclusion: SoTra’s improvements in forecasting accuracy and risk reduction support its use in safety-critical predictive control applications for clinical management.

Abstract: Autoregressive forecasting is central to predictive control in diabetes and hemodynamic management, where different operating zones carry different clinical risks. Standard models trained with teacher forcing suffer from exposure bias, yielding unstable multi-step forecasts for closed-loop use. We introduce Soft-Token Trajectory Forecasting (SoTra), which propagates continuous probability distributions ("soft tokens") to mitigate exposure bias and learn calibrated, uncertainty-aware trajectories. A risk-aware decoding module then minimizes expected clinical harm. In glucose forecasting, SoTra reduces average zone-based risk by 18%; in blood-pressure forecasting, it lowers effective clinical risk by approximately 15%. These improvements support its use in safety-critical predictive control.

[335] Text2Graph: Combining Lightweight LLMs and GNNs for Efficient Text Classification in Label-Scarce Scenarios

João Lucas Luz Lima Sarcinelli, Ricardo Marcondes Marcacini

Main category: cs.LG

TL;DR: Text2Graph is an open-source Python package that combines LLM-based partial annotation with GNN label propagation for sustainable, energy-efficient zero-shot text classification.

DetailsMotivation: LLMs are effective zero-shot classifiers but have high computational requirements and environmental costs, limiting their practicality for large-scale annotation in HPC environments. There's a need for more sustainable workflows.

Method: Text2Graph provides modular implementation of text-to-graph classification approaches, combining LLM-based partial annotation with Graph Neural Network label propagation. It allows flexible swapping of components like feature extractors, edge construction methods, and sampling strategies.
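
The propagation step can be illustrated independently of the package: seed a kNN graph over text embeddings with sparse LLM labels, then iterate Y <- alpha * S @ Y + (1 - alpha) * Y0, a standard label-spreading scheme. All specifics below are assumptions, not the package's API:

```python
import numpy as np

def propagate(X, seed_labels, n_classes, k=5, alpha=0.9, iters=50):
    """Label spreading on a kNN graph; seed_labels[i] = -1 means unlabeled."""
    n = len(X)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    np.fill_diagonal(d, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d[i])[:k]] = 1.0                  # connect k nearest texts
    W = np.maximum(W, W.T)                                # symmetrize
    S = W / W.sum(1, keepdims=True)                       # row-normalize

    Y0 = np.zeros((n, n_classes))
    Y0[seed_labels >= 0, seed_labels[seed_labels >= 0]] = 1.0
    Y = Y0.copy()
    for _ in range(iters):
        Y = alpha * S @ Y + (1 - alpha) * Y0
    return Y.argmax(1)

X = np.random.rand(100, 8)                                # stand-in text embeddings
seeds = np.full(100, -1)
seeds[:10] = np.random.randint(0, 3, 10)                  # sparse LLM annotations
labels = propagate(X, seeds, n_classes=3)
```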

Result: Benchmarked on five datasets for topic classification and sentiment analysis, graph-based propagation achieves competitive results at a fraction of the energy and environmental cost compared to other zero-shot approaches.

Conclusion: Text2Graph enables sustainable text classification by significantly reducing computational requirements and carbon emissions while maintaining competitive performance through the combination of LLM partial annotation and GNN label propagation.

Abstract: Large Language Models (LLMs) have become effective zero-shot classifiers, but their high computational requirements and environmental costs limit their practicality for large-scale annotation in high-performance computing (HPC) environments. To support more sustainable workflows, we present Text2Graph, an open-source Python package that provides a modular implementation of existing text-to-graph classification approaches. The framework enables users to combine LLM-based partial annotation with Graph Neural Network (GNN) label propagation in a flexible manner, making it straightforward to swap components such as feature extractors, edge construction methods, and sampling strategies. We benchmark Text2Graph in a zero-shot setting using five datasets spanning topic classification and sentiment analysis tasks, comparing multiple variants against other zero-shot approaches for text classification. In addition to reporting performance, we provide detailed estimates of energy consumption and carbon emissions, showing that graph-based propagation achieves competitive results at a fraction of the energy and environmental cost.

[336] MedXAI: A Retrieval-Augmented and Self-Verifying Framework for Knowledge-Guided Medical Image Analysis

Midhat Urooj, Ayan Banerjee, Farhat Shaikh, Kuntal Thakur, Sandeep Gupta

Main category: cs.LG

TL;DR: MedXAI is an explainable medical imaging framework that integrates deep learning with clinician expert knowledge to improve generalization, reduce rare-class bias, and provide clinically aligned explanations without relying on technical post-hoc methods.

DetailsMotivation: Deep learning models struggle with real-world distribution shifts, exhibit bias against infrequent pathologies, and lack transparency needed for safety-critical clinical deployment. Current methods rely on technical post-hoc explanations (like Saliency Maps, LIME) that may not align with clinical reasoning.

Method: MedXAI integrates deep vision models with clinician-derived expert knowledge in a unified framework. It uses symbolic components as clinical priors and regularizers to localize relevant diagnostic features rather than relying on technical post-hoc explanation methods.

Result: Experiments across ten multicenter datasets show consistent gains: 3% improvement in cross-domain generalization and 10% improvement in F1 score for rare classes, substantially outperforming strong deep learning baselines. The framework performs well on seizure onset zone localization from fMRI and diabetic retinopathy grading.

Conclusion: MedXAI delivers clinically aligned explanations while achieving superior in-domain and cross-domain performance, particularly for rare diseases in multimodal medical AI. The symbolic components act as effective clinical priors and regularizers, improving robustness under distribution shift.

Abstract: Accurate and interpretable image-based diagnosis remains a fundamental challenge in medical AI, particularly under domain shifts and rare-class conditions. Deep learning models often struggle with real-world distribution changes, exhibit bias against infrequent pathologies, and lack the transparency required for deployment in safety-critical clinical environments. We introduce MedXAI (An Explainable Framework for Medical Imaging Classification), a unified expert-knowledge-based framework that integrates deep vision models with clinician-derived expert knowledge to improve generalization, reduce rare-class bias, and provide human-understandable explanations by localizing the relevant diagnostic features rather than relying on technical post-hoc methods (e.g., Saliency Maps, LIME). We evaluate MedXAI across heterogeneous modalities on two challenging tasks: (i) Seizure Onset Zone localization from resting-state fMRI, and (ii) Diabetic Retinopathy grading. Experiments on ten multicenter datasets show consistent gains, including a 3% improvement in cross-domain generalization and a 10% improvement in F1 score for rare classes, substantially outperforming strong deep learning baselines. Ablations confirm that the symbolic components act as effective clinical priors and regularizers, improving robustness under distribution shift. MedXAI delivers clinically aligned explanations while achieving superior in-domain and cross-domain performance, particularly for rare diseases in multimodal medical AI.

[337] CHyLL: Learning Continuous Neural Representations of Hybrid Systems

Sangli Teng, Hang Liu, Jingyu Song, Koushil Sreenath

Main category: cs.LG

TL;DR: CHyLL learns continuous neural representations of hybrid systems without trajectory segmentation or mode switching by embedding state space into higher-dimensional quotient manifolds where flows become continuous.

DetailsMotivation: Existing methods for learning hybrid system dynamics suffer from discontinuities at mode switches and guard surfaces, requiring explicit segmentation and event detection. CHyLL aims to overcome these limitations by creating a continuous representation that handles both continuous and discrete dynamics seamlessly.

Method: CHyLL reformulates hybrid system state space as a piecewise smooth quotient manifold where reset maps “glue” states at guard surfaces, making flows spatially continuous. Using differential topology embedding theorems, it learns a singularity-free neural embedding in higher-dimensional space and the continuous flow within it, eliminating need for trajectory segmentation or mode switching.

Result: CHyLL achieves superior accuracy in predicting hybrid system flows compared to existing methods, successfully identifies topological invariants of hybrid systems, and demonstrates practical application to stochastic optimal control problems.

Conclusion: CHyLL provides a novel continuous learning framework for hybrid systems that bypasses traditional segmentation and mode-switching approaches through quotient manifold embeddings, enabling more accurate flow prediction and topological analysis while being applicable to control problems.

Abstract: Learning the flows of hybrid systems that have both continuous and discrete time dynamics is challenging. The existing method learns the dynamics in each discrete mode, which suffers from the combination of mode switching and discontinuities in the flows. In this work, we propose CHyLL (Continuous Hybrid System Learning in Latent Space), which learns a continuous neural representation of a hybrid system without trajectory segmentation, event functions, or mode switching. The key insight of CHyLL is that the reset map glues the state space at the guard surface, reformulating the state space as a piecewise smooth quotient manifold where the flow becomes spatially continuous. Building upon these insights and the embedding theorems grounded in differential topology, CHyLL concurrently learns a singularity-free neural embedding in a higher-dimensional space and the continuous flow in it. We showcase that CHyLL can accurately predict the flow of hybrid systems with superior accuracy and identify the topological invariants of the hybrid systems. Finally, we apply CHyLL to the stochastic optimal control problem.

[338] Partitioning the Sample Space for a More Precise Shannon Entropy Estimation

Gabriel F. A. Bastos, Jugurta Montalvão

Main category: cs.LG

TL;DR: Proposes a new discrete entropy estimator that addresses negative bias in undersampled regimes by combining decomposability with missing mass and unseen outcomes estimation.

DetailsMotivation: Reliable entropy estimation from small datasets is critical when sample size is smaller than possible outcomes, but existing estimators suffer from negative bias in undersampled regimes.

Method: Uses decomposability property combined with estimations of missing mass and number of unseen outcomes to compensate for negative bias in entropy estimation.
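
The paper's exact estimator is not given in the summary; the sketch below shows one simple instance of the idea, using the Good-Turing singleton ratio as the missing-mass estimate and spreading it over a guessed number of unseen outcomes:

```python
from collections import Counter
import numpy as np

def entropy_estimate(samples, n_unseen=None):
    """Plug-in entropy plus a Good-Turing-style missing-mass correction (sketch)."""
    n = len(samples)
    counts = np.array(list(Counter(samples).values()))
    n1 = int((counts == 1).sum())
    m0 = n1 / n                                # Good-Turing missing-mass estimate
    p = (counts / n) * (1.0 - m0)              # rescaled mass of observed outcomes
    h_seen = float(-(p * np.log(np.clip(p, 1e-300, None))).sum())
    u = n_unseen or max(n1, 1)                 # crude guess at unseen outcome count
    h_unseen = -m0 * np.log(m0 / u) if m0 > 0 else 0.0
    return h_seen + h_unseen                   # missing mass spread uniformly over u

print(entropy_estimate(list("aabbbcddddde")))
```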

Result: Outperforms classical estimators in undersampled regimes and performs comparably with well-established state-of-the-art estimators.

Conclusion: The proposed method provides reliable entropy estimation for small datasets where traditional estimators fail due to undersampling bias.

Abstract: Reliable data-driven estimation of Shannon entropy from small data sets, where the number of examples is potentially smaller than the number of possible outcomes, is a critical matter in several applications. In this paper, we introduce a discrete entropy estimator, where we use the decomposability property in combination with estimations of the missing mass and the number of unseen outcomes to compensate for the negative bias induced by them. Experimental results show that the proposed method outperforms some classical estimators in undersampled regimes, and performs comparably with some well-established state-of-the-art estimators.

[339] Sequence-to-Image Transformation for Sequence Classification Using Rips Complex Construction and Chaos Game Representation

Sarwan Ali, Taslim Murad, Imdadullah Khan

Main category: cs.LG

TL;DR: A topological approach transforms molecular sequences into images using Chaos Game Representation and Rips complexes, achieving superior performance on anticancer peptide classification.

DetailsMotivation: Traditional feature engineering for molecular sequences suffers from sparsity and computational complexity, while deep learning models underperform on tabular biological data. There's a need for better representations that preserve sequence information while enabling effective use of vision-based architectures.

Method: Combines Chaos Game Representation (CGR) with Rips complex construction from algebraic topology. Maps sequence elements to 2D coordinates via CGR, computes pairwise distances, and constructs Rips complexes to capture both local structural and global topological features.
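
The CGR step itself is standard and easy to sketch: place one corner per alphabet symbol on the unit circle and repeatedly move halfway toward the corner of each residue (the paper's exact corner layout for the 20-letter amino-acid alphabet is an assumption here):

```python
import numpy as np

def cgr_coords(seq, alphabet):
    """Map a sequence to 2-D points: half-step toward each symbol's corner."""
    k = len(alphabet)
    angles = 2 * np.pi * np.arange(k) / k
    corners = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    idx = {a: i for i, a in enumerate(alphabet)}
    p, pts = np.zeros(2), []
    for ch in seq:
        p = (p + corners[idx[ch]]) / 2.0
        pts.append(p.copy())
    return np.array(pts)

pts = cgr_coords("GLFDIVKKVV", "ACDEFGHIKLMNPQRSTVWY")  # (len(seq), 2) points
```

The Rips complex is then built on the pairwise distances between these points, with its filtration capturing the topological features the classifier consumes.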

Result: Achieves 86.8% accuracy on breast cancer dataset and 94.5% accuracy on lung cancer dataset, outperforming vector-based, sequence language models, and existing image-based methods. Provides formal guarantees on representation uniqueness, topological stability, and information preservation.

Conclusion: The topological representation effectively preserves critical sequence information while enabling successful application of vision-based deep learning architectures for molecular sequence analysis, offering a novel and superior approach to molecular sequence classification.

Abstract: Traditional feature engineering approaches for molecular sequence classification suffer from sparsity issues and computational complexity, while deep learning models often underperform on tabular biological data. This paper introduces a novel topological approach that transforms molecular sequences into images by combining Chaos Game Representation (CGR) with Rips complex construction from algebraic topology. Our method maps sequence elements to 2D coordinates via CGR, computes pairwise distances, and constructs Rips complexes to capture both local structural and global topological features. We provide formal guarantees on representation uniqueness, topological stability, and information preservation. Extensive experiments on anticancer peptide datasets demonstrate superior performance over vector-based, sequence language models, and existing image-based methods, achieving 86.8% and 94.5% accuracy on breast and lung cancer datasets, respectively. The topological representation preserves critical sequence information while enabling effective utilization of vision-based deep learning architectures for molecular sequence analysis.

[340] Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

Sarwan Ali, Taslim Murad

Main category: cs.LG

TL;DR: A scalable embedding method using hashing for SARS-CoV-2 spike sequences achieves 86.4% classification accuracy with 99.81% faster embedding generation compared to existing methods.

DetailsMotivation: Current methods for COVID-19 sequence analysis face scalability issues: phylogenetic trees are computationally intensive for millions of sequences, while existing embedding methods require aligned sequences or have poor performance and high runtime costs.

Method: Developed a scalable embedding method using hashing to generate compact, low-dimensional representations of SARS-CoV-2 spike protein sequences, then trained various machine learning models for supervised lineage classification.
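
The name suggests MurmurHash over k-mers; the sketch below uses stdlib hashing as a stand-in for the same hashing-trick idea (bucket k-mer counts into a fixed-width vector), with all parameters invented:

```python
import hashlib
import numpy as np

def hash_embedding(seq, k=3, dim=64):
    """Hashing trick: bucket k-mer counts into a fixed-width vector."""
    v = np.zeros(dim)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].encode()
        bucket = int(hashlib.md5(kmer).hexdigest(), 16) % dim
        v[bucket] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

emb = hash_embedding("MFVFLVLLPLVSSQCVNLT")  # start of a spike protein sequence
```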

Result: The proposed embeddings achieved up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81% compared to baseline and state-of-the-art biological sequence embedding methods.

Conclusion: The hashing-based embedding method provides a fast, effective, and scalable solution for large-scale viral sequence analysis, addressing computational bottlenecks in current approaches.

Abstract: Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today’s multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method’s potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.

[341] Rethinking Causal Discovery Through the Lens of Exchangeability

Tiago Brogueira, Mário Figueiredo

Main category: cs.LG

TL;DR: The paper reframes i.i.d. causal discovery in terms of exchangeability, shows existing methods rely on exchangeability, creates an exchangeable-only synthetic dataset that better matches real-world data, and demonstrates a neural network trained on this dataset performs competitively.

DetailsMotivation: Traditional causal discovery methods have been developed separately for i.i.d. and time-series data, but the i.i.d. setting can be reframed using exchangeability - a more general symmetry principle that better reflects real-world data assumptions.

Method: 1) Conceptual argument linking experimental causal inference’s dependency on exchangeability to causal discovery; 2) Empirical analysis showing existing i.i.d. methods rely on exchangeability; 3) Creation of novel synthetic dataset enforcing only exchangeability (not i.i.d.); 4) Development of neural-network-based causal discovery algorithm trained exclusively on the exchangeable synthetic dataset.

Result: The exchangeable synthetic dataset mirrors the statistical structure of the real-world Tübingen dataset more closely than i.i.d. synthetic datasets. The neural network trained on this exchangeable dataset performs similarly to state-of-the-art i.i.d. methods on the real-world benchmark.

Conclusion: Exchangeability provides a more appropriate foundation for causal discovery than i.i.d. assumptions, better reflecting real-world data characteristics and enabling more effective synthetic data generation and algorithm development.

Abstract: Causal discovery methods have traditionally been developed under two distinct regimes: independent and identically distributed (i.i.d.) and timeseries data, each governed by separate modelling assumptions. In this paper, we argue that the i.i.d. setting can and should be reframed in terms of exchangeability, a strictly more general symmetry principle. We present the implications of this reframing, alongside two core arguments: (1) a conceptual argument, based on extending the dependency of experimental causal inference on exchangeability to causal discovery; and (2) an empirical argument, showing that many existing i.i.d. causal-discovery methods are predicated on exchangeability assumptions, and that the sole extensive widely-used real-world “i.i.d.” benchmark (the Tübingen dataset) consists mainly of exchangeable (and not i.i.d.) examples. Building on this insight, we introduce a novel synthetic dataset that enforces only the exchangeability assumption, without imposing the stronger i.i.d. assumption. We show that our exchangeable synthetic dataset mirrors the statistical structure of the real-world “i.i.d.” dataset more closely than all other i.i.d. synthetic datasets. Furthermore, we demonstrate the predictive capability of this dataset by proposing a neural-network-based causal-discovery algorithm trained exclusively on our synthetic dataset, and which performs similarly to other state-of-the-art i.i.d. methods on the real-world benchmark.

[342] CIEGAD: Cluster-Conditioned Interpolative and Extrapolative Framework for Geometry-Aware and Domain-Aligned Data Augmentation

Keito Inoshita, Xiaokang Zhou, Akira Kawai, Katsutoshi Yada

Main category: cs.LG

TL;DR: CIEGAD is a geometry-aware, domain-aligned data augmentation framework that uses cluster conditioning and hierarchical allocation to systematically fill semantically uncovered regions in real-world data distributions, improving performance on imbalanced classification tasks.

DetailsMotivation: Practical deep learning faces challenges from data scarcity and label imbalance, creating semantically uncovered regions in data distributions that cause misclassification near class boundaries and unstable model behavior. Existing LLM-based augmentation lacks integrated directional control, domain alignment, and quality control.

Method: CIEGAD uses cluster conditioning to construct domain profiles, hierarchical frequency-geometric allocation for generation planning, interpolative and extrapolative synthesis for directional control, and geometry-constrained filtering with LLM-as-a-Judge for quality control.
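
Directional control via interpolation versus extrapolation has a simple geometric core; a sketch in embedding space (the LLM generation and filtering stages are omitted, and the data is random):

```python
import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(size=(50, 8))           # embeddings of one cluster
z_i, z_j = cluster[0], cluster[1]

z_interp = z_i + 0.5 * (z_j - z_i)           # lambda in (0, 1): in-distribution fill
z_extrap = z_i + 1.5 * (z_j - z_i)           # lambda > 1: pushes past the periphery

centroid = cluster.mean(0)
print(np.linalg.norm(z_interp - centroid), np.linalg.norm(z_extrap - centroid))
```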

Result: Experiments show CIEGAD effectively extends data distribution peripheries while maintaining high alignment with real-world data and semantic diversity. It consistently improves F1 and recall scores for long-tailed and multi-class classification tasks.

Conclusion: CIEGAD provides a practical data augmentation framework that complements underrepresented regions while preserving real-world data alignment, achieving harmony between distributional consistency, diversity, and quality.

Abstract: In practical deep learning deployment, the scarcity of data and the imbalance of label distributions often lead to semantically uncovered regions within the real-world data distribution, hindering model training and causing misclassification near class boundaries as well as unstable behaviors in peripheral areas. Although recent large language models (LLMs) show promise for data augmentation, an integrated framework that simultaneously achieves directional control of generation, domain alignment, and quality control has not yet been fully established. To address these challenges, we propose a Cluster-conditioned Interpolative and Extrapolative framework for Geometry-Aware and Domain-aligned data augmentation (CIEGAD), which systematically complements both in-distribution and out-of-distribution semantically uncovered regions. CIEGAD constructs domain profiles through cluster conditioning, allocates generation with a hierarchical frequency-geometric allocation integrating class frequency and geometric indicators, and finely controls generation directions via the coexistence of interpolative and extrapolative synthesis. It further performs quality control through geometry-constrained filtering combined with an LLM-as-a-Judge mechanism. Experiments on multiple classification tasks demonstrate that CIEGAD effectively extends the periphery of real-world data distributions while maintaining high alignment between generated and real-world data as well as semantic diversity. In particular, for long-tailed and multi-class classification tasks, CIEGAD consistently improves F1 and recall, validating the triple harmony of distributional consistency, diversity, and quality. These results indicate that CIEGAD serves as a practically oriented data augmentation framework that complements underrepresented regions while preserving alignment with real-world data.

[343] Assessing Neuromorphic Computing for Fingertip Force Decoding from Electromyography

Abolfazl Shahrooei, Luke Arthur, Om Patel, Derek Kamper

Main category: cs.LG

TL;DR: SNN vs TCN for decoding fingertip force from HD-sEMG: TCN outperforms SNN (4.44% vs 8.25% MVC RMSE), but SNN shows promise as neuromorphic baseline.

DetailsMotivation: HD-sEMG provides noninvasive neural interface for assistive/rehabilitation control, but mapping neural activity to motor intent remains challenging. Need to compare neuromorphic (SNN) vs conventional (TCN) approaches for force decoding.

Method: Used HD-sEMG data from single participant (10 trials) with two forearm electrode arrays. Extracted MU firing via FastICA-based decomposition. Trained SNN (neuromorphic) and TCN (temporal convolutional network) on overlapping windows with end-to-end causal convolutions for fingertip force decoding.
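
The causal-convolution ingredient is standard; a minimal PyTorch version (left-pad, right-trim) so each output frame depends only on past samples:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d whose output at time t uses only inputs up to time t."""
    def __init__(self, c_in, c_out, kernel_size):
        super().__init__(c_in, c_out, kernel_size, padding=kernel_size - 1)

    def forward(self, x):                               # x: (batch, channels, time)
        return super().forward(x)[..., :x.shape[-1]]    # trim the look-ahead tail

y = CausalConv1d(8, 16, kernel_size=5)(torch.randn(2, 8, 200))  # (2, 16, 200)
```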

Result: TCN achieved better performance: 4.44% MVC RMSE (Pearson r = 0.974). SNN achieved 8.25% MVC RMSE (r = 0.922). TCN was more accurate on held-out trials.

Conclusion: While TCN outperformed SNN, the SNN serves as a realistic neuromorphic baseline that could close the performance gap with architectural and hyperparameter refinements, offering potential for neuromorphic computing in neural interfaces.

Abstract: High-density surface electromyography (HD-sEMG) provides a noninvasive neural interface for assistive and rehabilitation control, but mapping neural activity to user motor intent remains challenging. We assess a spiking neural network (SNN) as a neuromorphic architecture against a temporal convolutional network (TCN) for decoding fingertip force from motor-unit (MU) firing derived from HD-sEMG. Data were collected from a single participant (10 trials) with two forearm electrode arrays; MU activity was obtained via FastICA-based decomposition, and models were trained on overlapping windows with end-to-end causal convolutions. On held-out trials, the TCN achieved 4.44% MVC RMSE (Pearson r = 0.974) while the SNN achieved 8.25% MVC (r = 0.922). While the TCN was more accurate, we view the SNN as a realistic neuromorphic baseline that could close much of this gap with modest architectural and hyperparameter refinements.

[344] MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification

Mantas Baksys, Stefan Zetzsche, Olivier Bouissou

Main category: cs.LG

TL;DR: First translation of miniF2F math reasoning benchmark to Dafny automated theorem prover, achieving 40-45% verification with empty proofs and 55.7% success with LLM hints.

DetailsMotivation: To bridge mathematical reasoning benchmarks from interactive theorem provers to automated theorem provers, specifically translating miniF2F to Dafny to enable automated verification and explore LLM-assisted proof generation.

Method: Translated miniF2F benchmark to Dafny, tested empty proof verification, then evaluated 12 LLMs on providing proof hints for remaining problems using iterative error correction techniques.

Result: Dafny verified 40.6% of test set and 44.7% of validation set with empty proofs. Best LLM achieved 55.7% pass@4 success rate with iterative error correction for remaining problems.

Conclusion: Demonstrates effective division of labor: LLMs provide high-level guidance while automation handles low-level details. The benchmark enables further research on automated theorem proving with Dafny.

Abstract: We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny’s automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves a 55.7% pass@4 success rate employing iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F.

[345] Exact Recovery of Non-Random Missing Multidimensional Time Series via Temporal Isometric Delay-Embedding Transform

Hao Shu, Jicheng Li, Yu Jin, Ling Zhou

Main category: cs.LG

TL;DR: Proposes LRTC-TIDT, a tensor completion method using temporal isometric delay-embedding transform to handle non-random missing data in multidimensional time series with theoretical guarantees.

DetailsMotivation: Non-random missing data in multidimensional time series undermines data-driven analysis, but existing low-rank tensor completion methods fail to handle non-random missingness both methodologically and theoretically. Hankel-based approaches lack clear sources of low-rankness and recovery theory for non-random patterns.

Method: Introduces temporal isometric delay-embedding transform to construct Hankel tensor whose low-rankness comes naturally from time series smoothness/periodicity. Develops LRTC-TIDT model using Tensor Singular Value Decomposition (t-SVD) framework for low-rank structure characterization.
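
The classical delay-embedding step behind Hankel tensor methods can be sketched in a few lines of NumPy (illustrative only; the paper's temporal isometric variant adds structure on top of this plain construction):

```python
import numpy as np

def delay_embed(series, window):
    """Stack `window` lagged copies of a 1-D series into a Hankel matrix."""
    T = len(series)
    return np.stack([series[i:T - window + 1 + i] for i in range(window)])

t = np.linspace(0, 4 * np.pi, 200)
H = delay_embed(np.sin(t), window=16)            # shape (16, 185)
print(np.linalg.matrix_rank(H, tol=1e-8))        # ~2: sin/cos span the rows
```

A smooth or periodic series yields delay vectors confined to a low-dimensional subspace, which is exactly the source of Hankel low-rankness the model exploits.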

Result: Theoretical exact recovery guarantee under prescribed non-random sampling conditions and mild incoherence assumptions. Simulation experiments confirm exact recovery across various non-random missing patterns. Outperforms existing tensor methods in real-world tasks: network flow reconstruction, urban traffic estimation, and temperature field prediction.

Conclusion: LRTC-TIDT provides theoretically grounded solution for non-random missing data in multidimensional time series, addressing limitations of existing Hankel-based methods through principled delay-embedding transform with proven recovery guarantees and superior practical performance.

Abstract: Non-random missing data is a ubiquitous yet undertreated flaw in multidimensional time series, fundamentally threatening the reliability of data-driven analysis and decision-making. Pure low-rank tensor completion, as a classical data recovery method, falls short in handling non-random missingness, both methodologically and theoretically. Hankel-structured tensor completion models provide a feasible approach for recovering multidimensional time series with non-random missing patterns. However, most Hankel-based multidimensional data recovery methods both suffer from unclear sources of Hankel tensor low-rankness and lack an exact recovery theory for non-random missing data. To address these issues, we propose the temporal isometric delay-embedding transform, which constructs a Hankel tensor whose low-rankness is naturally induced by the smoothness and periodicity of the underlying time series. Leveraging this property, we develop the \textit{Low-Rank Tensor Completion with Temporal Isometric Delay-embedding Transform} (LRTC-TIDT) model, which characterizes the low-rank structure under the \textit{Tensor Singular Value Decomposition} (t-SVD) framework. Once the prescribed non-random sampling conditions and mild incoherence assumptions are satisfied, the proposed LRTC-TIDT model achieves exact recovery, as confirmed by simulation experiments under various non-random missing patterns. Furthermore, LRTC-TIDT consistently outperforms existing tensor-based methods across multiple real-world tasks, including network flow reconstruction, urban traffic estimation, and temperature field prediction. Our implementation is publicly available at https://github.com/HaoShu2000/LRTC-TIDT.

[346] Federated Domain Generalization with Latent Space Inversion

Ragja Palakkadavath, Hung Le, Thanh Nguyen-Tang, Svetha Venkatesh, Sunil Gupta

Main category: cs.LG

TL;DR: FedDG method improves federated domain generalization with latent space inversion for privacy and important weight aggregation for non-i.i.d. clients, achieving SOTA results with less communication.

DetailsMotivation: Existing FedDG methods compromise privacy by sharing client data statistics, and struggle with non-i.i.d. client distributions where local adaptations may be lost during aggregation.

Method: Two key innovations: 1) Latent space inversion to enforce domain invariance across local models while preserving privacy, 2) Important weight aggregation strategy that prioritizes parameters significantly influencing local model predictions.
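
One plausible reading of the important-weight aggregation, sketched in PyTorch (hypothetical: the paper's actual importance score and normalization are not specified in this summary, and a gradient- or Fisher-style score is assumed here):

```python
import torch

def important_weight_aggregate(client_states, client_importance):
    """Average client parameters, weighting each entry by a per-parameter
    importance score (e.g. a gradient-magnitude or Fisher-style statistic)."""
    global_state = {}
    for key in client_states[0]:
        imp = torch.stack([ci[key] for ci in client_importance])
        imp = imp / imp.sum(dim=0).clamp_min(1e-12)   # normalize per parameter
        params = torch.stack([cs[key] for cs in client_states])
        global_state[key] = (imp * params).sum(dim=0)
    return global_state

s1, s2 = {"w": torch.tensor([1.0, 2.0])}, {"w": torch.tensor([3.0, 4.0])}
i1, i2 = {"w": torch.tensor([0.9, 0.1])}, {"w": torch.tensor([0.1, 0.9])}
print(important_weight_aggregate([s1, s2], [i1, i2]))  # {'w': [1.2, 3.8]}
```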

Result: Superior results over state-of-the-art methods with reduced communication overhead, demonstrating effective generalization to unseen clients while maintaining privacy.

Conclusion: The proposed approach successfully addresses privacy concerns in FedDG while handling non-i.i.d. client distributions through novel training and aggregation techniques, achieving better generalization with less communication.

Abstract: Federated domain generalization (FedDG) addresses distribution shifts among clients in a federated learning framework. FedDG methods aggregate the parameters of locally trained client models to form a global model that generalizes to unseen clients while preserving data privacy. While improving the generalization capability of the global model, many existing approaches in FedDG jeopardize privacy by sharing statistics of client data between themselves. Our solution addresses this problem by contributing new ways to perform local client training and model aggregation. To improve local client training, we enforce (domain) invariance across local models with the help of a novel technique, \textbf{latent space inversion}, which enables better client privacy. When clients are not \emph{i.i.d}, aggregating their local models may discard certain local adaptations. To overcome this, we propose an \textbf{important weight} aggregation strategy to prioritize parameters that significantly influence predictions of local models during aggregation. Our extensive experiments show that our approach achieves superior results over state-of-the-art methods with less communication overhead.

[347] Adaptive Information Routing for Multimodal Time Series Forecasting

Jun Seo, Hyeokjun Choe, Seohui Bae, Soyeon Park, Wonbin Ahn, Taeyoon Lim, Junhyuk Kang, Sangjun Han, Jaehoon Lee, Dongwan Kang, Minjae Kim, Sungdong Yoo, Soonyoung Lee

Main category: cs.LG

TL;DR: AIR framework uses text data to dynamically guide time series models by controlling how multivariate time series information should be combined, improving forecasting accuracy.

DetailsMotivation: Traditional time series forecasting relying only on historical data is insufficient for accurate predictions due to limited information. Multimodal approaches incorporating text data are needed but existing methods treat text as interchangeable auxiliary features rather than effectively guiding the forecasting process.

Method: Proposes Adaptive Information Routing (AIR) framework that uses text information to dynamically guide time series models by controlling how and to what extent multivariate time series information should be combined. Also introduces a text-refinement pipeline using LLMs to convert raw text data into suitable form for multimodal forecasting, and creates a benchmark for multimodal forecasting experiments.
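
A toy version of text-conditioned routing (illustrative; the module names, sigmoid gate, and blending rule are assumptions rather than the paper's design):

```python
import torch
import torch.nn as nn

class TextRoutedMixer(nn.Module):
    """A text embedding gates how much cross-series information is mixed in."""
    def __init__(self, n_series, d_text):
        super().__init__()
        self.mix = nn.Linear(n_series, n_series, bias=False)   # cross-variable mixing
        self.gate = nn.Sequential(nn.Linear(d_text, n_series), nn.Sigmoid())

    def forward(self, x, text_emb):
        # x: (batch, time, n_series); text_emb: (batch, d_text)
        g = self.gate(text_emb).unsqueeze(1)        # (batch, 1, n_series)
        return g * self.mix(x) + (1 - g) * x        # text decides the blend

mixer = TextRoutedMixer(n_series=4, d_text=32)
out = mixer(torch.randn(8, 96, 4), torch.randn(8, 32))  # -> (8, 96, 4)
```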

Result: Experiments with real-world market data (crude oil price and exchange rates) show AIR effectively modulates time series model behavior using textual inputs, significantly enhancing forecasting accuracy across various time series forecasting tasks.

Conclusion: AIR framework successfully addresses limitations of traditional multimodal forecasting by using text data to dynamically guide time series models, demonstrating superior performance in real-world applications and providing a practical text-refinement pipeline and benchmark for future research.

Abstract: Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods, which incorporate additional data modalities (mainly text) alongside time series data, have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experimental results with real-world market data, such as crude oil prices and exchange rates, demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.

[348] R^2-HGP: A Double-Regularized Gaussian Process for Heterogeneous Transfer Learning

Duo Wang, Xinming Wang, Chao Wang, Xiaowei Yue, Jianguo Wu

Main category: cs.LG

TL;DR: R²-HGP: A double-regularized heterogeneous Gaussian process framework that addresses multi-source transfer learning challenges by aligning heterogeneous input domains, incorporating physical knowledge, and adaptively selecting informative sources to prevent negative transfer.

DetailsMotivation: Traditional multi-output Gaussian process models face three key challenges in transfer learning: 1) heterogeneous input spaces between source and target domains, 2) ignoring prior knowledge and physical information, and 3) inappropriate information sharing leading to negative transfer. Existing models fail to address these issues in a unified framework.

Method: Proposes R²-HGP framework with: 1) trainable prior probability mapping model to align heterogeneous input domains, 2) multi-source transfer GP model built on aligned latent variables integrated into CVAE framework, 3) physical knowledge regularization to ensure alignment adheres to known physics, and 4) sparsity penalty on transfer coefficients to adaptively select informative sources and suppress negative transfer.

Result: Extensive simulations and real-world engineering case studies validate R²-HGP’s effectiveness, demonstrating consistent superiority over state-of-the-art benchmarks across diverse evaluation metrics.

Conclusion: R²-HGP successfully addresses the three key challenges of heterogeneous transfer learning by providing a unified framework that aligns heterogeneous domains, incorporates physical knowledge, and prevents negative transfer through adaptive source selection.

Abstract: Multi-output Gaussian process (MGP) models have attracted significant attention for their flexibility and uncertainty-quantification capabilities, and have been widely adopted in multi-source transfer learning scenarios due to their ability to capture inter-task correlations. However, they still face several challenges in transfer learning. First, the input spaces of the source and target domains are often heterogeneous, which makes direct knowledge transfer difficult. Second, potential prior knowledge and physical information are typically ignored during heterogeneous transfer, hampering the utilization of domain-specific insights and leading to unstable mappings. Third, inappropriate information sharing among target and sources can easily lead to negative transfer. Traditional models fail to address these issues in a unified way. To overcome these limitations, this paper proposes a Double-Regularized Heterogeneous Gaussian Process framework (R^2-HGP). Specifically, a trainable prior probability mapping model is first proposed to align the heterogeneous input domains. The resulting aligned inputs are treated as latent variables, upon which a multi-source transfer GP model is constructed and the entire structure is integrated into a novel conditional variational autoencoder (CVAE) based framework. Physical insights are further incorporated as a regularization term to ensure that the alignment results adhere to known physical knowledge. Next, within the multi-source transfer GP model, a sparsity penalty is imposed on the transfer coefficients, enabling the model to adaptively select the most informative source outputs and suppress negative transfer. Extensive simulations and real-world engineering case studies validate the effectiveness of our R^2-HGP, demonstrating consistent superiority over state-of-the-art benchmarks across diverse evaluation metrics.

[349] A Kernel-based Resource-efficient Neural Surrogate for Multi-fidelity Prediction of Aerodynamic Field

Apurba Sarker, Reza T. Batley, Darshan Sarojini, Sourav Saha

Main category: cs.LG

TL;DR: KHRONOS, a kernel-based neural surrogate using variational principles and tensor decomposition, outperforms MLP, GNN, and PINN in resource-constrained aerodynamic field prediction with sparse high-fidelity data.

DetailsMotivation: Need for fast surrogate models in aerodynamic design/optimization that can effectively blend sparse high-fidelity data with abundant low-fidelity information under varying computational constraints.

Method: Proposes KHRONOS - a kernel-based neural surrogate built on variational principles, interpolation theory, and tensor decomposition for heavy pruning. Uses AirfRANS dataset (HF) and NeuralFoil (LF) to predict surface pressure coefficient distribution. Compares with MLP, GNN, and PINN across varying HF data availability (0%, 10%, 30%) and geometry complexity.

Result: While all models achieve comparable accuracy eventually, KHRONOS excels in resource-constrained conditions: requires orders of magnitude fewer trainable parameters, delivers much faster training/inference at comparable accuracy, and performs best with limited HF data.

Conclusion: KHRONOS and similar architectures effectively balance accuracy and efficiency in multi-fidelity aerodynamic field prediction, particularly valuable when computational resources are constrained.

Abstract: Surrogate models provide fast alternatives to costly aerodynamic simulations and are extremely useful in design and optimization applications. This study proposes the use of a recent kernel-based neural surrogate, KHRONOS. In this work, we blend sparse high-fidelity (HF) data with low-fidelity (LF) information to predict aerodynamic fields under varying constraints in computational resources. Unlike traditional approaches, KHRONOS is built upon variational principles, interpolation theory, and tensor decomposition. These elements provide a mathematical basis for heavy pruning compared to dense neural networks. Using the AirfRANS dataset as a high-fidelity benchmark and NeuralFoil to generate low-fidelity counterparts, this work compares the performance of KHRONOS with three contemporary model architectures: a multilayer perceptron (MLP), a graph neural network (GNN), and a physics-informed neural network (PINN). We consider varying levels of high-fidelity data availability (0%, 10%, and 30%) and increasingly complex geometry parameterizations. These are used to predict the surface pressure coefficient distribution over the airfoil. Results indicate that, whilst all models eventually achieve comparable predictive accuracy, KHRONOS excels in resource-constrained conditions. In this domain, KHRONOS consistently requires orders of magnitude fewer trainable parameters and delivers much faster training and inference than contemporary dense neural networks at comparable accuracy. These findings highlight the potential of KHRONOS and similar architectures to balance accuracy and efficiency in multi-fidelity aerodynamic field prediction.

[350] An Interpretable AI Tool for SAVR vs TAVR in Low to Intermediate Risk Patients with Severe Aortic Stenosis

Vasiliki Stoumpou, Maciej Tysarowski, Talhat Azemi, Jawad Haider, Howard L. Haronian, Robert C. Hagberg, Dimitris Bertsimas

Main category: cs.LG

TL;DR: Interpretable prescriptive framework using Optimal Policy Trees provides personalized TAVR vs SAVR recommendations, reducing estimated 5-year mortality by 20.3% and 13.8% in two hospital cohorts.

DetailsMotivation: Treatment selection between SAVR and TAVR for aortic stenosis patients remains variable due to patient heterogeneity and institutional preferences. Existing models only predict risk but lack interpretable, individualized treatment recommendations that directly optimize long-term outcomes.

Method: Developed an interpretable prescriptive framework integrating prognostic matching, counterfactual outcome modeling, and Optimal Policy Trees (OPT). Used data from two hospitals, emulated randomization via prognostic matching and sample weighting, estimated counterfactual mortality under both treatments, and trained policy model to partition patients into clinically coherent subgroups for treatment recommendations.
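
The counterfactual step that produces training targets for a policy tree can be sketched as follows (synthetic data and scikit-learn models are stand-ins; the paper's prognostic matching and OPT fitting are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))            # matched patient features (synthetic)
t = rng.integers(0, 2, 500)              # observed arm: 0 = SAVR, 1 = TAVR
y = rng.integers(0, 2, 500)              # 5-year mortality (toy labels)

# Fit one outcome model per arm, predict both counterfactual risks for
# every patient, then prescribe the lower-risk arm.
risk = np.stack(
    [GradientBoostingClassifier().fit(X[t == a], y[t == a]).predict_proba(X)[:, 1]
     for a in (0, 1)], axis=1)
prescription = risk.argmin(axis=1)       # training target for the policy tree
```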

Result: Counterfactual evaluation showed estimated 5-year mortality reduction of 20.3% in Hartford Hospital and 13.8% in St. Vincent’s Hospital when applying OPT prescriptions compared to real-life treatments. The framework demonstrated promising generalizability across institutions, with learned decision boundaries aligning with real-world outcomes and clinical observations.

Conclusion: This is the first interpretable prescriptive framework providing transparent, data-driven TAVR vs SAVR recommendations that improve estimated long-term outcomes in both internal and external cohorts. It contributes to a more systematic, evidence-based approach to precision medicine in structural heart disease while remaining clinically grounded.

Abstract: Background. Treatment selection for low to intermediate risk patients with severe aortic stenosis between surgical (SAVR) and transcatheter (TAVR) aortic valve replacement remains variable in clinical practice, driven by patient heterogeneity and institutional preferences. While existing models predict postprocedural risk, there is a lack of interpretable, individualized treatment recommendations that directly optimize long-term outcomes. Methods. We introduce an interpretable prescriptive framework that integrates prognostic matching, counterfactual outcome modeling, and an Optimal Policy Tree (OPT) to recommend the treatment minimizing expected 5-year mortality. Using data from Hartford Hospital and St. Vincent’s Hospital, we emulate randomization via prognostic matching and sample weighting and estimate counterfactual mortality under both SAVR and TAVR. The policy model, trained on these counterfactual predictions, partitions patients into clinically coherent subgroups and prescribes the treatment associated with lower estimated risk. Findings. If the OPT prescriptions are applied, counterfactual evaluation showed an estimated reduction in 5-year mortality of 20.3% in Hartford and 13.8% in St. Vincent’s relative to real-life prescriptions, showing promising generalizability to unseen data from a different institution. The learned decision boundaries aligned with real-world outcomes and clinical observations. Interpretation. Our interpretable prescriptive framework is, to the best of our knowledge, the first to provide transparent, data-driven recommendations for TAVR versus SAVR that improve estimated long-term outcomes both in an internal and external cohort, while remaining clinically grounded and contributing toward a more systematic and evidence-based approach to precision medicine in structural heart disease.

[351] A Privacy-Preserving Cloud Architecture for Distributed Machine Learning at Scale

Vinoth Punniyamoorthy, Ashok Gadi Parthi, Mayilsamy Palanigounder, Ravi Kiran Kodali, Bikesh Kumar, Kabilan Kannan

Main category: cs.LG

TL;DR: A cloud-native privacy-preserving architecture combining federated learning, differential privacy, zero-knowledge proofs, and RL-based governance for scalable, verifiable distributed ML across multi-cloud environments.

DetailsMotivation: Distributed ML systems need strong privacy guarantees, verifiable compliance, and scalable deployment across heterogeneous multi-cloud environments while preventing sensitive data centralization.

Method: Integrated framework with federated learning for decentralized training, differential privacy for formal privacy protection, zero-knowledge proofs for cryptographically verifiable compliance, and reinforcement learning for adaptive governance. Deployed across hybrid Kubernetes clusters.
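
The differential-privacy ingredient typically follows the clip-then-add-noise recipe, sketched below (a generic Gaussian mechanism, not the paper's specific configuration or privacy accounting):

```python
import numpy as np

def gaussian_mechanism(update, clip_norm, noise_multiplier, rng):
    """Clip an update to a bounded L2 norm, then add calibrated Gaussian
    noise -- the usual building block of DP federated averaging."""
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return update * scale + noise

rng = np.random.default_rng(0)
private_update = gaussian_mechanism(rng.normal(size=1000), 1.0, 1.1, rng)
```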

Result: Prototype demonstrates reduced membership-inference risk, consistent enforcement of formal privacy budgets, stable model performance under differential privacy, and maintains utility with minimal overhead across multi-institution workloads.

Conclusion: The framework establishes a practical foundation for deploying trustworthy and compliant distributed ML systems at scale with continuous, risk-aware governance.

Abstract: Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deployment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture that integrates federated learning, differential privacy, zero-knowledge compliance proofs, and adaptive governance powered by reinforcement learning. The framework supports secure model training and inference without centralizing sensitive data, while enabling cryptographically verifiable policy enforcement across institutions and cloud platforms. A full prototype deployed across hybrid Kubernetes clusters demonstrates reduced membership-inference risk, consistent enforcement of formal privacy budgets, and stable model performance under differential privacy. Experimental evaluation across multi-institution workloads shows that the architecture maintains utility with minimal overhead while providing continuous, risk-aware governance. The proposed framework establishes a practical foundation for deploying trustworthy and compliant distributed machine learning systems at scale.

[352] Dynamics of Agentic Loops in Large Language Models: A Geometric Theory of Trajectories

Nicolas Tacheny

Main category: cs.LG

TL;DR: Agentic LLM systems operate through recursive loops, but their geometric behavior is poorly understood. This paper introduces a framework to analyze these loops as discrete dynamical systems in semantic embedding space, identifying convergent and divergent regimes controlled by prompt design.

DetailsMotivation: Agentic systems built on large language models operate through recursive feedback loops where each output becomes the next input, but the geometric behavior of these loops (whether they converge, diverge, or exhibit complex dynamics) remains poorly understood. There's a need for a systematic framework to analyze and control these iterative transformations.

Method: The paper introduces a geometric framework treating iterative LLM transformations as discrete dynamical systems in semantic embedding space. It distinguishes between artifact space (where linguistic transformations occur) and embedding space (where geometric measurements are performed). To address cosine similarity bias from embedding anisotropy, the authors introduce an isotonic calibration that eliminates systematic bias while aligning similarities with human semantic judgments and preserving local stability. This enables rigorous measurement of trajectories, clusters, and attractors.
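
The calibration idea maps raw cosine similarities monotonically onto human judgments; a minimal stand-in with scikit-learn's isotonic regression (toy values; the paper's calibration data and procedure may differ):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_cos = np.array([0.62, 0.70, 0.75, 0.81, 0.88, 0.93])  # raw cosine similarities
human = np.array([0.05, 0.20, 0.35, 0.55, 0.75, 0.95])    # human judgments

# Monotone map: removes the anisotropy offset while preserving the ranking.
calibrate = IsotonicRegression(out_of_bounds="clip").fit(raw_cos, human)
print(calibrate.predict([0.72, 0.90]))
```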

Result: Through controlled experiments on singular agentic loops, the paper identifies two fundamental regimes: 1) A contractive rewriting loop that converges toward a stable attractor with decreasing dispersion, and 2) An exploratory summarize and negate loop that produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion.

Conclusion: Prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence, and trajectory structure in iterative LLM transformations. The geometric framework provides tools for understanding and controlling agentic system behavior.

Abstract: Agentic systems built on large language models operate through recursive feedback loops, where each output becomes the next input. Yet the geometric behavior of these agentic loops (whether they converge, diverge, or exhibit more complex dynamics) remains poorly understood. This paper introduces a geometric framework for analyzing agentic trajectories in semantic embedding space, treating iterative transformations as discrete dynamical systems. We distinguish the artifact space, where linguistic transformations occur, from the embedding space, where geometric measurements are performed. Because cosine similarity is biased by embedding anisotropy, we introduce an isotonic calibration that eliminates systematic bias and aligns similarities with human semantic judgments while preserving high local stability. This enables rigorous measurement of trajectories, clusters and attractors. Through controlled experiments on singular agentic loops, we identify two fundamental regimes. A contractive rewriting loop converges toward a stable attractor with decreasing dispersion, while an exploratory summarize and negate loop produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion. Our results show that prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence and trajectory structure in iterative LLM transformations.

[353] GPG: Generalized Policy Gradient Theorem for Transformer-based Policies

Hangyu Mao, Guangting Dong, Zhicheng Dou

Main category: cs.LG

TL;DR: GPG Theorem generalizes policy gradient for Transformer policies, showing standard PG and GRPO as special cases, with applications to LLM training.

DetailsMotivation: Current policy gradient methods are not specifically designed for Transformer-based policies, and there's a need for a unified framework that can encompass existing approaches like standard Policy Gradient and GRPO while providing efficient optimization for LLMs.

Method: Develops Generalized Policy Gradient (GPG) Theorem specifically for Transformer-based policies, demonstrating that it generalizes both standard Policy Gradient Theorem and GRPO as special cases within this framework.
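
For reference, the standard token-level policy gradient that GPG is said to generalize looks like this in PyTorch (a vanilla REINFORCE-style sketch, not the GPG estimator itself):

```python
import torch

def sequence_pg_loss(logits, actions, reward):
    """Vanilla policy gradient over a token sequence:
    loss = -reward * sum_t log pi(a_t | context_t)."""
    logp = torch.log_softmax(logits, dim=-1)                    # (T, vocab)
    chosen = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1) # (T,)
    return -(reward * chosen.sum())

logits = torch.randn(5, 100, requires_grad=True)  # 5 generated tokens, vocab 100
actions = torch.randint(0, 100, (5,))
sequence_pg_loss(logits, actions, reward=1.0).backward()
```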

Result: Provides a unified theoretical framework for policy gradient methods applied to Transformer policies, with practical applications demonstrated for training Large Language Models.

Conclusion: GPG Theorem offers a generalized approach to policy optimization for Transformer-based policies, providing new insights into efficient training of LLMs while unifying existing policy gradient methods.

Abstract: We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.

[354] Fitting magnetization data using continued fraction of straight lines

Vijay Prakash S

Main category: cs.LG

TL;DR: The paper approximates the nonlinear magnetization function of ferromagnetic materials using continued fraction of straight lines to model domain alignment behavior.

DetailsMotivation: To develop a mathematical framework for understanding the nonlinear magnetization behavior in ferromagnetic materials, where magnetic domains align with applied fields in a complex nonlinear manner.

Method: Approximates the nonlinear magnetization function as a combination of continued fraction of straight lines, which provides an algebraic expression suitable for parameter estimation via nonlinear regression.
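
A small SciPy sketch of fitting such a form by nonlinear regression (the two-level truncation and parameterization below are illustrative guesses; this summary does not fix the depth or arrangement):

```python
import numpy as np
from scipy.optimize import curve_fit

def cf_lines(H, a0, b0, a1, b1, a2, b2):
    """Two-level continued fraction whose elements are straight lines:
    M(H) = (a0 + b0*H) / (1 + (a1 + b1*H) / (1 + a2 + b2*H))."""
    return (a0 + b0 * H) / (1.0 + (a1 + b1 * H) / (1.0 + a2 + b2 * H))

rng = np.random.default_rng(0)
H = np.linspace(0.1, 5.0, 80)                        # applied field (arb. units)
M = cf_lines(H, 0.0, 2.0, 0.3, 0.5, 1.0, 0.2)        # synthetic M-H curve
params, _ = curve_fit(cf_lines, H, M + 0.01 * rng.normal(size=H.size),
                      p0=np.ones(6), maxfev=20000)
```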

Result: The continued fraction approximation successfully models both growing and shrinking magnetic domains, providing a mathematical tool to interpret nonlinear magnetization behavior.

Conclusion: The continued fraction of straight lines approach offers an effective algebraic method for approximating and analyzing the nonlinear magnetization characteristics of ferromagnetic materials.

Abstract: Magnetization of a ferromagnetic substance in response to an externally applied magnetic field increases with the strength of the field. This is because at the microscopic level, magnetic moments in certain regions or domains of the substance increasingly align with the applied field, while the amount of misaligned domains decreases. The alignment of such magnetic domains with an applied magnetic field forms the physical basis for the nonlinearity of magnetization. In this paper, the nonlinear function is approximated as a combination of continued fraction of straight lines. The resulting fit is used to interpret the nonlinear behavior in both growing and shrinking magnetic domains. The continued fraction of straight lines used here is an algebraic expression which can be used to estimate parameters using nonlinear regression.

[355] The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks

Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Tianyu Du, Jinbao Li, Jianhai Chen, Shouling Ji

Main category: cs.LG

TL;DR: The paper proposes Eminence, an explainable black-box backdoor attack framework with theoretical guarantees that achieves >90% attack success with <0.1% poison rate by exploiting sparse decision boundaries.

DetailsMotivation: Current backdoor attacks rely on heuristic methods without theoretical understanding, limiting predictability and adaptability. The lack of rigorous analysis prevents understanding of why low poison rates can be effective.

Method: Theoretical analysis reveals sparse decision boundaries enable disproportionate manipulation. Derives closed-form ambiguous boundary region where minimal relabeled samples cause misclassification. Proposes Eminence framework that optimizes universal subtle triggers exploiting vulnerable boundaries.

Result: Eminence achieves >90% attack success rate with <0.1% poison rate (vs SOTA requiring >1%), negligible clean-accuracy loss, high transferability across models/datasets, and demonstrates exponential relationship between margin poisoning and boundary manipulation.

Conclusion: Theoretical grounding explains why low poison rates suffice for effective backdoor attacks. Eminence provides explainable, robust backdoor attacks with provable guarantees, advancing understanding of backdoor mechanisms beyond empirical methods.

Abstract: Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks, typically reliant on heuristic brute-force methods. Despite significant empirical advancements in backdoor research, the lack of rigorous theoretical analysis limits understanding of underlying mechanisms, constraining attack predictability and adaptability. Therefore, we provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. Based on this finding, we derive a closed-form, ambiguous boundary region, wherein negligible relabeled samples induce substantial misclassification. Influence function analysis further quantifies significant parameter shifts caused by these margin samples, with minimal impact on clean accuracy, formally grounding why such low poison rates suffice for efficacious attacks. Leveraging these insights, we propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties. Eminence optimizes a universal, visually subtle trigger that strategically exploits vulnerable decision boundaries and effectively achieves robust misclassification with exceptionally low poison rates (< 0.1%, compared to SOTA methods typically requiring > 1%). Comprehensive experiments validate our theoretical discussions and demonstrate the effectiveness of Eminence, confirming an exponential relationship between margin poisoning and adversarial boundary manipulation. Eminence maintains > 90% attack success rate, exhibits negligible clean-accuracy loss, and demonstrates high transferability across diverse models, datasets and scenarios.

[356] The Operator Origins of Neural Scaling Laws: A Generalized Spectral Transport Dynamics of Deep Learning

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper develops a unified operator-theoretic framework for neural training dynamics, deriving a spectral transport-dissipation PDE that explains scaling laws, double descent, and connects NTK training with feature learning.

DetailsMotivation: Modern deep networks operate in a rough, finite-regularity regime where Jacobian-induced operators show heavy-tailed spectra and strong basis drift. There's a need for a unified theoretical description connecting operator geometry, optimization dynamics, and universal scaling behavior.

Method: Derive training dynamics from gradient descent, apply Kato perturbation theory to obtain coupled mode ODEs, coarse-grain to get a spectral transport-dissipation PDE. Analyze the PDE in weak-coupling regime to find self-similar solutions and scaling laws.

Result: Neural training preserves functional regularity, forcing drift to take asymptotic power-law form. The PDE admits self-similar solutions with resolution frontier, polynomial amplitude growth, and power-law dissipation. NTK training and feature learning emerge as two limits of the same PDE.

Conclusion: Provides a unified spectral framework connecting operator geometry, optimization dynamics, and universal scaling behavior of deep networks, explaining double descent geometry and showing effective training time follows specific scaling laws.

Abstract: Modern deep networks operate in a rough, finite-regularity regime where Jacobian-induced operators exhibit heavy-tailed spectra and strong basis drift. In this work, we derive a unified operator-theoretic description of neural training dynamics directly from gradient descent. Starting from the exact evolution $\dot e_t = -M(t)e_t$ in function space, we apply Kato perturbation theory to obtain a rigorous system of coupled mode ODEs and show that, after coarse-graining, these dynamics converge to a spectral transport-dissipation PDE $$\partial_t g + \partial_\lambda(v g) = -\lambda g + S,$$ where $v$ captures eigenbasis drift and $S$ encodes nonlocal spectral coupling. We prove that neural training preserves functional regularity, forcing the drift to take an asymptotic power-law form $v(\lambda,t)\sim -c(t)\lambda^b$. In the weak-coupling regime, naturally induced by spectral locality and SGD noise, the PDE admits self-similar solutions with a resolution frontier, polynomial amplitude growth, and power-law dissipation. This structure yields explicit scaling-law exponents, explains the geometry of double descent, and shows that the effective training time satisfies $\tau(t)=t^\alpha L(t)$ for slowly varying $L$. Finally, we show that NTK training and feature learning arise as two limits of the same PDE: $v\equiv 0$ recovers lazy dynamics, while $v\neq 0$ produces representation drift. Our results provide a unified spectral framework connecting operator geometry, optimization dynamics, and the universal scaling behavior of modern deep networks.

[357] Metacognitive Sensitivity for Test-Time Dynamic Model Selection

Le Tuan Minh Trinh, Le Minh Vu Pham, Thi Minh Anh Pham, An Duc Nguyen

Main category: cs.LG

TL;DR: The paper proposes a metacognition framework for AI models using meta-d’ to measure metacognitive sensitivity, then uses this score for bandit-based model selection, improving ensemble accuracy.

DetailsMotivation: Deep learning models often have poor calibration where their confidence doesn't reflect true competence. The paper aims to evaluate whether models truly know what they know, drawing inspiration from human metacognition in cognitive science.

Method: Introduces meta-d’, a psychologically-grounded measure of metacognitive sensitivity to characterize how reliably a model’s confidence predicts its accuracy. Uses this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which expert model to trust for given tasks.
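
As a rough stand-in for metacognitive sensitivity, one can measure how well confidence ranks correct answers above incorrect ones (an AUC-style proxy; the actual meta-d' computation comes from signal detection theory and is more involved):

```python
import numpy as np

def metacog_sensitivity(confidence, correct):
    """Probability that a correct answer receives higher confidence than an
    incorrect one (0.5 = confidence carries no information)."""
    pos, neg = confidence[correct == 1], confidence[correct == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, 200)
confidence = 0.5 * correct + 0.5 * rng.random(200)   # confidence tracks accuracy
print(metacog_sensitivity(confidence, correct))      # well above 0.5
```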

Result: Experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that the metacognitive approach improves joint-inference accuracy over constituent models.

Conclusion: Provides a novel behavioral account of AI models, recasting ensemble selection as evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).

Abstract: A key aspect of human cognition is metacognition - the ability to assess one’s own knowledge and judgment reliability. While deep learning models can express confidence in their predictions, they often suffer from poor calibration, a cognitive bias where expressed confidence does not reflect true competence. Do models truly know what they know? Drawing from human cognitive science, we propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d’, a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model’s confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which of several expert models to trust for a given task. Our experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that this metacognitive approach improves joint-inference accuracy over constituent models. This work provides a novel behavioural account of AI models, recasting ensemble selection as a problem of evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).

[358] Hybrid Physics-ML Model for Forward Osmosis Flux with Complete Uncertainty Quantification

Shiv Ratn, Shivang Rampriyan, Bahni Ray

Main category: cs.LG

TL;DR: A hybrid physics-ML framework using Gaussian Process Regression achieves highly accurate water flux prediction in Forward Osmosis with robust uncertainty quantification, outperforming traditional models with only 120 training data points.

DetailsMotivation: Traditional mechanistic models for Forward Osmosis water flux prediction struggle with empirical parameter variability, while purely data-driven models lack physical consistency and rigorous uncertainty quantification, creating a need for a robust hybrid approach.

Method: A hybrid physics-ML framework using Gaussian Process Regression trained on residual errors between physical model predictions and experimental data, with full uncertainty quantification decomposing variance into epistemic (model) and aleatoric (input) uncertainties using the Delta method.
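
The residual-learning pattern is easy to sketch with scikit-learn (toy physics model and synthetic data; the paper's FO model, inputs, and Delta-method input-uncertainty propagation are not reproduced):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def physics_flux(x):                                  # stand-in mechanistic model
    return 2.0 * np.log1p(x)

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 5.0, size=(120, 1))              # 120 training points
y = 2.2 * np.log1p(X[:, 0]) + 0.1 * X[:, 0]           # "experimental" flux

# The GPR learns only the physics-model error; its posterior std supplies
# the epistemic half of the uncertainty budget.
gpr = GaussianProcessRegressor(RBF() + WhiteKernel()).fit(X, y - physics_flux(X[:, 0]))
corr, std = gpr.predict(np.array([[2.5]]), return_std=True)
flux_pred = physics_flux(2.5) + corr                  # hybrid prediction
```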

Result: The model achieved state-of-the-art performance with Mean Absolute Percentage Error of 0.26% and R² of 0.999 on independent test data, demonstrating highly accurate predictions despite being trained on only 120 data points.

Conclusion: The proposed robust hybrid physics-ML framework provides a reliable surrogate model for Forward Osmosis process optimization and digital twin development, successfully addressing the limitations of both traditional mechanistic and purely data-driven approaches.

Abstract: Forward Osmosis (FO) is a promising low-energy membrane separation technology, but challenges in accurately modelling its water flux (Jw) persist due to complex internal mass transfer phenomena. Traditional mechanistic models struggle with empirical parameter variability, while purely data-driven models lack physical consistency and rigorous uncertainty quantification (UQ). This study introduces a novel Robust Hybrid Physics-ML framework employing Gaussian Process Regression (GPR) for highly accurate, uncertainty-aware Jw prediction. The core innovation lies in training the GPR on the residual error between the detailed, non-linear FO physical model prediction (Jw_physical) and the experimental water flux (Jw_actual). Crucially, we implement a full UQ methodology by decomposing the total predictive variance (sigma2_total) into model uncertainty (epistemic, from GPR’s posterior variance) and input uncertainty (aleatoric, analytically propagated via the Delta method for multi-variate correlated inputs). Leveraging the inherent strength of GPR in low-data regimes, the model, trained on a meagre 120 data points, achieved a state-of-the-art Mean Absolute Percentage Error (MAPE) of 0.26% and an R2 of 0.999 on the independent test data, validating a truly robust and reliable surrogate model for advanced FO process optimization and digital twin development.

[359] T-SKM-Net: Trainable Neural Network Framework for Linear Constraint Satisfaction via Sampling Kaczmarz-Motzkin Method

Haoyu Zhu, Yao Zhang, Jiashen Ren, Qingchun Hou

Main category: cs.LG

TL;DR: T-SKM-Net integrates Sampling Kaczmarz-Motzkin method into neural networks for constraint satisfaction, achieving 25× speedup over traditional solvers with zero constraint violations.

DetailsMotivation: Existing constraint satisfaction methods face efficiency-applicability trade-offs, with hard constraint methods suffering from high computational complexity or restrictive assumptions on constraint structures. Neural network constraint satisfaction is crucial for safety-critical applications like power systems, robotics, and autonomous driving.

Method: Proposes Trainable Sampling Kaczmarz-Motzkin Network (T-SKM-Net) framework that: 1) transforms mixed constraint problems into pure inequality problems via null space transformation, 2) employs SKM for iterative solving, 3) maps solutions back to original constraint space. Provides theoretical guarantees for post-processing effectiveness and end-to-end trainability despite non-differentiable operations.
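
The core SKM iteration for a system of linear inequalities is compact (a plain NumPy sketch of the classical algorithm; the null-space transform, trainability machinery, and GPU batching are omitted):

```python
import numpy as np

def skm(A, b, x, beta=50, iters=3000, rng=None):
    """Sampling Kaczmarz-Motzkin for A @ x <= b: sample `beta` rows, pick the
    most-violated sampled constraint, and project x onto it."""
    rng = rng or np.random.default_rng(0)
    m = A.shape[0]
    for _ in range(iters):
        idx = rng.choice(m, size=min(beta, m), replace=False)
        viol = A[idx] @ x - b[idx]             # positive entries are violated
        if viol.max() <= 0:
            continue                           # all sampled rows satisfied
        j = idx[np.argmax(viol)]
        x = x - (A[j] @ x - b[j]) / (A[j] @ A[j]) * A[j]
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 10))
b = np.ones(500)                               # x = 0 is strictly feasible
x = skm(A, b, x=5.0 * rng.normal(size=10))
print(float(np.maximum(A @ x - b, 0.0).max()))  # max violation, ~0 after SKM
```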

Result: On DCOPF case118 benchmark: 4.27ms/item GPU inference with 0.0025% max optimality gap (post-processing mode) and 5.25ms/item with 0.0008% max optimality gap (joint training mode). Achieves over 25× speedup compared to pandapower solver while maintaining zero constraint violations under given tolerance.

Conclusion: T-SKM-Net successfully integrates SKM-type methods into neural network constraint satisfaction, overcoming non-differentiability challenges and providing efficient, trainable solutions for mixed constraint problems with theoretical guarantees and practical performance improvements.

Abstract: Neural network constraint satisfaction is crucial for safety-critical applications such as power system optimization, robotic path planning, and autonomous driving. However, existing constraint satisfaction methods face efficiency-applicability trade-offs, with hard constraint methods suffering from either high computational complexity or restrictive assumptions on constraint structures. The Sampling Kaczmarz-Motzkin (SKM) method is a randomized iterative algorithm for solving large-scale linear inequality systems with favorable convergence properties, but its argmax operations introduce non-differentiability, posing challenges for neural network applications. This work proposes the Trainable Sampling Kaczmarz-Motzkin Network (T-SKM-Net) framework and, for the first time, systematically integrates SKM-type methods into neural network constraint satisfaction. The framework transforms mixed constraint problems into pure inequality problems through null space transformation, employs SKM for iterative solving, and maps solutions back to the original constraint space, efficiently handling both equality and inequality constraints. We provide theoretical proof of post-processing effectiveness in expectation and end-to-end trainability guarantees based on unbiased gradient estimators, demonstrating that despite non-differentiable operations, the framework supports standard backpropagation. On the DCOPF case118 benchmark, our method achieves 4.27ms/item GPU serial forward inference with 0.0025% max optimality gap with post-processing mode and 5.25ms/item with 0.0008% max optimality gap with joint training mode, delivering over 25$\times$ speedup compared to the pandapower solver while maintaining zero constraint violations under given tolerance.

[360] Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Alina Shutova, Denis Kuznedelev

Main category: cs.LG

TL;DR: Enables LLMs to think, listen, and generate outputs simultaneously using rotary embeddings, reducing response delays by 6-11x for real-time applications.

DetailsMotivation: Current LLMs must think sequentially before responding, making them unsuitable for real-time interactive applications like voice assistants that require simultaneous listening, thinking, and responding.

Method: Uses properties of rotary embeddings to enable LLMs designed for sequential interactions to operate asynchronously - thinking while listening and generating outputs simultaneously, without additional training.
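
The rotary-embedding property being exploited is that position enters as a rotation, so position shifts compose additively (a minimal sketch; the pairing convention below is one common variant):

```python
import torch

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to `pos`.
    Rotations compose additively, so a cached segment can be re-positioned
    with a single extra rotation instead of being recomputed."""
    d = x.shape[-1]
    inv = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    cos, sin = torch.cos(pos * inv), torch.sin(pos * inv)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1).flatten(-2)

k = torch.randn(64)
# Shifting by 3 then 5 equals shifting by 8: the additivity used above.
assert torch.allclose(rope(rope(k, 3.0), 5.0), rope(k, 8.0), atol=1e-5)
```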

Result: Reduces time to first non-thinking token from minutes to ≤5 seconds, decreases overall real-time delays by 6-11x, and maintains accuracy on math, commonsense, and safety reasoning tasks.

Conclusion: Enables reasoning-capable LLMs to operate in real-time interactive scenarios without retraining, bridging the gap between reasoning capabilities and interactive requirements.

Abstract: Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to <= 5 s, and the overall real-time delays by 6-11x.

[361] UACER: An Uncertainty-Aware Critic Ensemble Framework for Robust Adversarial Reinforcement Learning

Jiaxi Wu, Tiantian Zhang, Yuxing Wang, Yongzhe Chang, Xueqian Wang

Main category: cs.LG

TL;DR: UACER proposes a robust adversarial RL method with diversified critic ensemble and time-varying decay uncertainty mechanism to address training instability in adversarial environments.

DetailsMotivation: Robust adversarial RL faces training instability due to non-stationary learning dynamics caused by trainable adversaries, especially in high-dimensional complex environments like autonomous driving and robotic control.

Method: UACER uses two strategies: 1) Diversified critic ensemble with K parallel critic networks for stable Q-value estimation, and 2) Time-varying Decay Uncertainty mechanism that uses variance-derived Q-value aggregation with epistemic uncertainty to dynamically regulate exploration-exploitation trade-off.
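
The TDU aggregation plausibly reduces to a mean-minus-scaled-standard-deviation rule with a decaying coefficient (an assumption-laden sketch; the paper's exact schedule and variance derivation are not given in this summary):

```python
import torch

def tdu_target(q_values, step, total_steps, kappa0=1.0):
    """Aggregate K critic estimates as mean - kappa(t) * std, with kappa
    decaying over training (pessimistic early, near-mean late)."""
    kappa = kappa0 * (1.0 - step / total_steps)
    return q_values.mean(dim=0) - kappa * q_values.std(dim=0)

q = torch.randn(5, 256)                       # K = 5 critics, batch of 256
target = tdu_target(q, step=1_000, total_steps=100_000)
```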

Result: Comprehensive experiments across MuJoCo control problems show UACER outperforms state-of-the-art methods in overall performance, stability, and efficiency.

Conclusion: UACER effectively addresses training instability in robust adversarial RL through ensemble critics and uncertainty-aware mechanisms, providing superior performance in complex control environments.

Abstract: Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbance in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: a diverse set of K critic networks is exploited in parallel to stabilize Q-value estimation rather than conventional single-critic architectures for both variance reduction and robustness enhancement. 2) Time-varying Decay Uncertainty (TDU) mechanism: advancing beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while simultaneously stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the superior effectiveness of UACER, outperforming state-of-the-art methods in terms of overall performance, stability, and efficiency.

[362] Stronger Normalization-Free Transformers

Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Main category: cs.LG

TL;DR: Derf (erf-based function) outperforms normalization layers like LayerNorm, RMSNorm, and Dynamic Tanh across vision, speech, and DNA tasks, offering better generalization for normalization-free Transformers.

DetailsMotivation: To find point-wise functions that can surpass Dynamic Tanh (DyT) and traditional normalization layers by studying how intrinsic properties of such functions influence training and performance.

Method: 1. Study intrinsic properties of point-wise functions and their influence on training. 2. Conduct large-scale search for more effective function designs. 3. Identify Derf(x) = erf(αx + s) as optimal, where erf is the rescaled Gaussian CDF.
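
A drop-in Derf layer is a few lines of PyTorch (the learnable per-channel alpha/shift and the affine wrapper are assumptions patterned on DyT, not confirmed details):

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """Point-wise Derf(x) = erf(alpha * x + s), wrapped with the usual
    affine parameters so it can replace LayerNorm/RMSNorm/DyT in-place."""
    def __init__(self, dim, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))
        self.shift = nn.Parameter(torch.zeros(dim))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.weight * torch.erf(self.alpha * x + self.shift) + self.bias

y = Derf(512)(torch.randn(2, 16, 512))   # same shape in and out, no statistics
```

Like DyT, the layer computes no batch or token statistics, which is what makes the architecture normalization-free.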

Result: Derf outperforms LayerNorm, RMSNorm, and DyT across multiple domains: vision (image recognition and generation), speech representation, and DNA sequence modeling.

Conclusion: Derf’s performance gains come from improved generalization rather than stronger fitting capacity. Its simplicity and superior performance make it a practical choice for normalization-free Transformer architectures.

Abstract: Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.

[363] Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Chihyeon Song, Jaewoo Lee, Jinkyoo Park

Main category: cs.LG

TL;DR: ARB is a learning-free adaptive replay buffer that uses ‘on-policyness’ to dynamically balance offline and online data sampling in O2O RL, improving both early stability and final performance.

DetailsMotivation: Standard O2O RL methods with fixed data-mixing ratios struggle to balance early learning stability (needing offline data) with asymptotic performance (needing relevant online data). There's a need for a simple, adaptive approach that doesn't require complex learning procedures.

Method: Introduces Adaptive Replay Buffer (ARB) that dynamically prioritizes data sampling based on ‘on-policyness’ - a lightweight metric measuring how closely collected trajectories align with current policy behavior. It assigns proportional sampling weights to transitions within trajectories, is learning-free, and easily integrates into existing O2O RL algorithms.
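
A toy version of turning on-policyness into sampling weights (hypothetical: likelihood ratios are used as the score here, while the paper's learning-free metric is only described qualitatively in this summary):

```python
import numpy as np

def on_policyness_weights(logp_current, logp_behavior):
    """Score stored trajectories by how likely the current policy is to
    reproduce them, then normalize into sampling weights."""
    score = np.exp(logp_current - logp_behavior)     # per-trajectory ratio
    return score / score.sum()

logp_now = np.array([-12.0, -30.0, -8.0])    # current-policy log-likelihoods
logp_then = np.array([-10.0, -11.0, -9.0])   # behavior-policy log-likelihoods
print(on_policyness_weights(logp_now, logp_then))   # favors the third trajectory
```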

Result: Extensive experiments on D4RL benchmarks show ARB consistently mitigates early performance degradation and significantly improves final performance of various O2O RL algorithms.

Conclusion: ARB demonstrates the importance of adaptive, behavior-aware replay buffer design for effective offline-to-online RL, providing a simple yet powerful solution to the data-mixing dilemma.

Abstract: Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call ‘on-policyness’. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy’s behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design.

[364] Disentangled and Distilled Encoder for Out-of-Distribution Reasoning with Rademacher Guarantees

Zahra Rahiminasab, Michael Yuhas, Arvind Easwaran

Main category: cs.LG

TL;DR: Proposes Disentangled Distilled Encoder (DDE) framework to compress disentangled VAE models for resource-constrained devices while preserving disentanglement properties through constrained optimization and theoretical guarantees.

DetailsMotivation: Disentangled VAEs are useful for multi-label OOD reasoning but are too large for deployment on resource-constrained devices. Need to compress these models while preserving their valuable disentanglement properties.

Method: DDE framework formalizes student-teacher distillation as constrained optimization with disentanglement constraints. Uses theoretical guarantees based on Rademacher complexity to ensure disentanglement preservation during compression.

Result: Empirical evaluation shows successful deployment of compressed disentangled models on NVIDIA devices. The framework reduces model size while maintaining disentanglement for OOD reasoning tasks.

Conclusion: DDE enables practical deployment of disentangled OOD reasoners on resource-constrained devices by providing a theoretically-grounded compression method that preserves disentanglement properties essential for multi-label OOD reasoning.

Abstract: Recently, the disentangled latent space of a variational autoencoder (VAE) has been used to reason about multi-label out-of-distribution (OOD) test samples that are derived from different distributions than training samples. Disentangled latent space means having one-to-many maps between latent dimensions and generative factors or important characteristics of an image. This paper proposes a disentangled distilled encoder (DDE) framework to decrease the OOD reasoner size for deployment on resource-constrained devices while preserving disentanglement. DDE formalizes student-teacher distillation for model compression as a constrained optimization problem with explicit disentanglement constraints. Theoretical guarantees for disentanglement during distillation based on Rademacher complexity are established. The approach is evaluated empirically by deploying the compressed model on an NVIDIA device.

[365] Mode-Seeking for Inverse Problems with Diffusion Models

Sai Bharath Chandra Gutha, Ricardo Vinuesa, Hossein Azizpour

Main category: cs.LG

TL;DR: VML-MAP: A new algorithm using variational mode-seeking loss to guide diffusion models toward maximum a posteriori estimates for solving inverse problems without task-specific training.

DetailsMotivation: Existing methods for solving inverse problems with pre-trained diffusion models rely on approximations and are computationally expensive. There's a need for more efficient and theoretically grounded approaches.

Method: Proposes variational mode-seeking loss (VML) derived from minimizing KL divergence between diffusion posterior and measurement posterior. For linear inverse problems, VML can be analytically derived without approximations. VML-MAP algorithm minimizes VML during each reverse diffusion step.

Result: VML-MAP outperforms existing methods in both performance and computational time across diverse image-restoration tasks on multiple datasets.

Conclusion: VML-MAP provides an effective, theoretically grounded, and computationally efficient approach for solving inverse problems using pre-trained diffusion models without task-specific training.

Abstract: A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
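Written out, the divergence the abstract describes is, schematically:

```latex
% Minimized at each reverse diffusion step t to steer the sample toward the MAP estimate:
\mathcal{L}_{\mathrm{VML}}(\mathbf{x}_t)
  \;=\; D_{\mathrm{KL}}\big( p(\mathbf{x}_0 \mid \mathbf{x}_t) \,\big\|\, p(\mathbf{x}_0 \mid \mathbf{y}) \big)
```

For linear measurements of the form $\mathbf{y} = A\mathbf{x}_0 + \text{noise}$ (a standard reading of "linear inverse problems"), the abstract states this loss admits an analytic form, which is what removes the approximations other guidance methods rely on.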

[366] Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders

Qingsen Ma, Dianyun Wang, Jiaming Lyu, Yaoye Wang, Lechen Ning, Sujie Zhu, Zhenbo Xu, Liuyu Xiang, Huining Li, Huijia Wu, Zhaofeng He

Main category: cs.LG

TL;DR: STA-Attention uses Top-K Sparse Autoencoders to decompose KV cache into interpretable semantic atoms, revealing Key-Value asymmetry and enabling efficient compression while maintaining model performance.

DetailsMotivation: KV cache is the primary memory bottleneck in long-context LLMs, but current approaches treat it as an opaque numerical tensor without interpretability. There's a need to bridge mechanistic interpretability with efficient attention modeling.

Method: Proposes STA-Attention framework using Top-K Sparse Autoencoders (SAEs) to decompose KV cache into semantic atoms. Unlike standard L1-regularized SAEs, Top-K eliminates shrinkage bias to preserve dot-product geometry. Introduces Dual-Budget Strategy based on discovered Key-Value asymmetry: Keys are sparse routers dominated by a “Semantic Elbow”, while Values carry dense content requiring a larger budget.

Result: Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B show semantic reconstructions maintain perplexity and zero-shot performance comparable to original models. The approach effectively bridges mechanistic interpretability with faithful attention modeling.

Conclusion: STA-Attention provides an interpretable decomposition of KV cache that reveals fundamental Key-Value asymmetry, enabling efficient compression while preserving model performance, thus addressing both memory efficiency and interpretability challenges in long-context LLMs.

Abstract: The Key-Value (KV) cache is the primary memory bottleneck in long-context Large Language Models, yet it is typically treated as an opaque numerical tensor. In this work, we propose STA-Attention, a framework that utilizes Top-K Sparse Autoencoders (SAEs) to decompose the KV cache into interpretable “semantic atoms.” Unlike standard $L_1$-regularized SAEs, our Top-K approach eliminates shrinkage bias, preserving the precise dot-product geometry required for attention. Our analysis uncovers a fundamental Key-Value Asymmetry: while Key vectors serve as highly sparse routers dominated by a “Semantic Elbow,” deep Value vectors carry dense content payloads requiring a larger budget. Based on this structure, we introduce a Dual-Budget Strategy that selectively preserves the most informative semantic components while filtering representational noise. Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B, and others show that our semantic reconstructions maintain perplexity and zero-shot performance comparable to the original models, effectively bridging the gap between mechanistic interpretability and faithful attention modeling.
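The Top-K mechanism the abstract contrasts with $L_1$ regularization is easy to sketch. Below is a minimal Top-K SAE in PyTorch; dictionary size, tying, and bias choices are generic assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder: keep the K largest latent activations
    and reconstruct, with no L1 penalty (hence no shrinkage bias)."""
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.enc(x))
        topk = torch.topk(z, self.k, dim=-1)           # the K active "semantic atoms"
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse)                      # reconstructed key or value vector

# Usage: reconstruct a batch of 128-dim KV vectors from 8 active atoms each.
sae = TopKSAE(d_model=128, d_dict=4096, k=8)
print(sae(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```

Because surviving activations keep their full magnitude rather than being shrunk toward zero, dot products against the reconstruction stay close to those of the original vectors, which is the property the abstract says attention requires.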

[367] Is the Information Bottleneck Robust Enough? Towards Label-Noise Resistant Information Bottleneck Learning

Yi Huang, Qingyun Sun, Yisen Gao, Haonan Yuan, Xingcheng Fu, Jianxin Li

Main category: cs.LG

TL;DR: LaT-IB is a novel Information Bottleneck method that introduces a “Minimal-Sufficient-Clean” criterion to make representation learning robust against label noise, using noise-aware latent disentanglement and a three-phase training framework.

DetailsMotivation: Standard Information Bottleneck (IB) methods are vulnerable to label noise because they strongly rely on accurate labels, which leads to performance degradation and overfitting in real-world scenarios where noisy labels are common.

Method: LaT-IB introduces a “Minimal-Sufficient-Clean” criterion as a mutual information regularizer, employs noise-aware latent disentanglement to separate clean label information from noise, and uses a three-phase training framework (Warmup, Knowledge Injection, Robust Training) to progressively build noise-resistant representations.

Result: Extensive experiments show LaT-IB achieves superior robustness and efficiency under label noise, significantly enhancing its applicability in real-world scenarios with noisy labels.

Conclusion: LaT-IB effectively addresses the vulnerability of standard IB methods to label noise through its MSC criterion and noise-aware disentanglement approach, making representation learning more robust and practical for real-world applications with imperfect labels.

Abstract: The Information Bottleneck (IB) principle facilitates effective representation learning by preserving label-relevant information while compressing irrelevant information. However, its strong reliance on accurate labels makes it inherently vulnerable to label noise, prevalent in real-world scenarios, resulting in significant performance degradation and overfitting. To address this issue, we propose LaT-IB, a novel Label-Noise ResistanT Information Bottleneck method which introduces a “Minimal-Sufficient-Clean” (MSC) criterion. Instantiated as a mutual information regularizer to retain task-relevant information while discarding noise, MSC addresses standard IB’s vulnerability to noisy label supervision. To achieve this, LaT-IB employs a noise-aware latent disentanglement that decomposes the latent representation into components aligned with the clean label space and the noise space. Theoretically, we first derive mutual information bounds for each component of our objective including prediction, compression, and disentanglement, and moreover prove that optimizing it encourages representations invariant to input noise and separates clean and noisy label information. Furthermore, we design a three-phase training framework: Warmup, Knowledge Injection and Robust Training, to progressively guide the model toward noise-resistant representations. Extensive experiments demonstrate that LaT-IB achieves superior robustness and efficiency under label noise, significantly enhancing its applicability in real-world scenarios with label noise.

[368] THeGAU: Type-Aware Heterogeneous Graph Autoencoder and Augmentation

Ming-Yi Hong, Miao-Chen Chiang, Youchen Teng, Yu-Hsiang Wang, Chih-Yu Wang, Che Lin

Main category: cs.LG

TL;DR: THeGAU is a model-agnostic framework that combines type-aware graph autoencoder with guided graph augmentation to improve HGNN performance on heterogeneous graphs by preserving type semantics and refining noisy structures.

DetailsMotivation: HGNNs suffer from type information loss and structural noise in heterogeneous information networks, which limits their representational fidelity and generalization capabilities for node classification tasks.

Method: THeGAU uses a type-aware graph autoencoder that reconstructs schema-valid edges as an auxiliary task to preserve node-type semantics, combined with a decoder-driven augmentation mechanism to selectively refine noisy graph structures.

Result: Extensive experiments on three benchmark HIN datasets (IMDB, ACM, DBLP) show THeGAU consistently outperforms existing HGNN methods, achieving state-of-the-art performance across multiple backbones while reducing computational overhead.

Conclusion: THeGAU effectively addresses type information loss and structural noise in HGNNs through its joint design of type-aware reconstruction and guided augmentation, enhancing robustness, accuracy, and efficiency for heterogeneous graph learning.

Abstract: Heterogeneous Graph Neural Networks (HGNNs) are effective for modeling Heterogeneous Information Networks (HINs), which encode complex multi-typed entities and relations. However, HGNNs often suffer from type information loss and structural noise, limiting their representational fidelity and generalization. We propose THeGAU, a model-agnostic framework that combines a type-aware graph autoencoder with guided graph augmentation to improve node classification. THeGAU reconstructs schema-valid edges as an auxiliary task to preserve node-type semantics and introduces a decoder-driven augmentation mechanism to selectively refine noisy structures. This joint design enhances robustness, accuracy, and efficiency while significantly reducing computational overhead. Extensive experiments on three benchmark HIN datasets (IMDB, ACM, and DBLP) demonstrate that THeGAU consistently outperforms existing HGNN methods, achieving state-of-the-art performance across multiple backbones.

[369] Multi-Objective Reward and Preference Optimization: Theory and Algorithms

Akhil Agnihotri

Main category: cs.LG

TL;DR: This thesis advances constrained reinforcement learning across three areas: average-cost CMDPs (ACPO), episodic CMDPs (e-COP), and preference-based RL for model alignment, with applications to large language models.

DetailsMotivation: To develop unified theoretical frameworks and algorithms for constrained RL that can handle different paradigms (average-cost, episodic, preference-driven) and scale to real-world applications including safety-critical environments and large language model alignment.

Method: Four main contributions: 1) ACPO for average-cost CMDPs using sensitivity analysis with trust-region updates; 2) e-COP for episodic CMDPs based on episodic policy difference lemma; 3) warmPref-PS for RLHF with posterior sampling and rater competence modeling; 4) PSPL for preference-based RL with joint sampling of rewards and transitions; plus MOPO for large-scale model alignment via multi-objective constrained optimization.

Result: Theoretical guarantees for all algorithms, state-of-the-art empirical performance for ACPO, provable performance and scalability for e-COP, substantial regret reduction for warmPref-PS, Bayesian simple-regret guarantees for PSPL, and robust scaling to multi-billion-parameter models for MOPO.

Conclusion: The thesis successfully unifies constrained RL across average-cost, episodic, and preference-driven paradigms, providing both theoretical advances and practical tools for safe and aligned decision-making in complex real-world applications.

Abstract: This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences. warmPref-PS introduces a posterior sampling strategy for linear bandits that integrates offline preference data from heterogeneous raters into online learning. Explicit modeling of rater competence yields substantial regret reduction and more efficient data collection for RLHF. The PSPL algorithm further advances preference-based RL by jointly sampling reward models and transition dynamics from pairwise trajectory comparisons, providing Bayesian simple-regret guarantees and robust empirical identification of optimal policies. The final contribution applies these methods to large-scale model alignment. A multi-objective constrained optimization view yields MOPO, an iterative algorithm with closed-form updates that scales to multi-billion-parameter language models and remains robust across alignment settings. Collectively, the thesis unifies constrained RL across average-cost, episodic, and preference-driven paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.

[370] Uncertainty-Preserving QBNNs: Multi-Level Quantization of SVI-Based Bayesian Neural Networks for Image Classification

Hendrik Borras, Yong Wu, Bernhard Klein, Holger Fröning

Main category: cs.LG

TL;DR: BNNs can be quantized to 4-bit precision with minimal performance loss using a novel multi-level quantization framework that maintains uncertainty calibration.

DetailsMotivation: Bayesian Neural Networks provide uncertainty quantification but have high computational and memory overhead, limiting deployment on resource-constrained devices. While quantization helps standard models, it hasn't been systematically explored for probabilistic models.

Method: A multi-level quantization framework for Stochastic Variational Inference BNNs with three strategies: Variational Parameter Quantization (VPQ), Sampled Parameter Quantization (SPQ), and Joint Quantization (JQ). Uses logarithmic quantization for variance parameters and specialized activation functions to preserve distributional structure.

Result: BNNs can be quantized down to 4-bit precision while maintaining classification accuracy and uncertainty disentanglement. Joint Quantization achieves 8x memory reduction at 4 bits with minimal degradation in epistemic and aleatoric uncertainty estimation on Dirty-MNIST.

Conclusion: The framework enables deployment of BNNs on resource-constrained edge devices and provides design guidelines for future analog “Bayesian Machines” operating at low precision, bridging the gap between uncertainty quantification and practical deployment.

Abstract: Bayesian Neural Networks (BNNs) provide principled uncertainty quantification but suffer from substantial computational and memory overhead compared to deterministic networks. While quantization techniques have successfully reduced resource requirements in standard deep learning models, their application to probabilistic models remains largely unexplored. We introduce a systematic multi-level quantization framework for Stochastic Variational Inference based BNNs that distinguishes between three quantization strategies: Variational Parameter Quantization (VPQ), Sampled Parameter Quantization (SPQ), and Joint Quantization (JQ). Our logarithmic quantization of variance parameters and specialized activation functions that preserve the distributional structure are essential for calibrated uncertainty estimation. Through comprehensive experiments on Dirty-MNIST, we demonstrate that BNNs can be quantized down to 4-bit precision while maintaining both classification accuracy and uncertainty disentanglement. At 4 bits, Joint Quantization achieves up to 8x memory reduction compared to floating-point implementations with minimal degradation in epistemic and aleatoric uncertainty estimation. These results enable deployment of BNNs on resource-constrained edge devices and provide design guidelines for future analog “Bayesian Machines” operating at inherently low precision.
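As a concrete reading of the variance-handling idea, here is a hypothetical logarithmic quantizer: values are snapped to a uniform grid in log-space, so small variances keep relative precision. The clipping range and bit width are illustrative assumptions:

```python
import numpy as np

def log_quantize(var, bits=4, v_min=1e-6, v_max=1.0):
    """Quantize positive variance parameters on a uniform grid in log-space."""
    levels = 2 ** bits
    log_v = np.clip(np.log(var), np.log(v_min), np.log(v_max))
    step = (np.log(v_max) - np.log(v_min)) / (levels - 1)
    idx = np.round((log_v - np.log(v_min)) / step)
    return np.exp(np.log(v_min) + idx * step)  # de-quantized variance

print(log_quantize(np.array([1e-5, 1e-3, 0.25])))  # small values keep relative precision
```

A uniform (linear) grid with the same bit budget would collapse most small variances into one or two levels, which is exactly the kind of distortion that degrades calibrated uncertainty.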

[371] Supporting Migration Policies with Forecasts: Illegal Border Crossings in Europe through a Mixed Approach

C. Bosco, U. Minora, D. de Rigo, J. Pingsdorf, R. Cortinovis

Main category: cs.LG

TL;DR: A mixed-methodology combining machine learning and expert qualitative insights to forecast illegal border crossings in Europe across five migratory routes with one-year horizon, addressing EU migration policy needs.

DetailsMotivation: To address challenges posed by sudden shifts in migration patterns and limitations in traditional datasets, while responding to the forecasting needs outlined in the EU Pact on Migration and Asylum and supporting the Asylum and Migration Management Regulation (AMMR).

Method: Integrates machine learning techniques with qualitative insights from migration experts, including a human-assessed covariate to improve predictive capacity of data-driven models.

Result: The methodology is tested and validated with known data to demonstrate its applicability and reliability in a migration-related policy context, providing policy-relevant forecasts.

Conclusion: This work introduces a novel operational tool for EU migration governance that aligns with academic recommendations by combining data-driven modeling with expert judgment to inform strategic decisions, early warning systems, and solidarity mechanisms.

Abstract: This paper presents a mixed-methodology to forecast illegal border crossings in Europe across five key migratory routes, with a one-year time horizon. The methodology integrates machine learning techniques with qualitative insights from migration experts. This approach aims at improving the predictive capacity of data-driven models through the inclusion of a human-assessed covariate, an innovation that addresses challenges posed by sudden shifts in migration patterns and limitations in traditional datasets. The proposed methodology responds directly to the forecasting needs outlined in the EU Pact on Migration and Asylum, supporting the Asylum and Migration Management Regulation (AMMR). It is designed to provide policy-relevant forecasts that inform strategic decisions, early warning systems, and solidarity mechanisms among EU Member States. By joining data-driven modeling with expert judgment, this work aligns with existing academic recommendations and introduces a novel operational tool tailored for EU migration governance. The methodology is tested and validated with known data to demonstrate its applicability and reliability in a migration-related policy context.

[372] Token Sample Complexity of Attention

Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard

Main category: cs.LG

TL;DR: The paper introduces token-sample complexity to analyze how attention converges as sequence length increases, showing different convergence rates for attention maps vs. transformed token distributions, with experimental validation.

DetailsMotivation: As LLM context windows expand, understanding attention behavior at extreme sequence lengths becomes crucial. The paper aims to characterize how attention converges to its infinite-token limit.

Method: Introduces token-sample complexity concept and analyzes convergence at two levels: 1) pointwise uniform convergence of attention maps with rate C(R)/√n, and 2) convergence of moments for transformed token distributions with rate C'(R)/n^β (β<½). Also examines the hardmax limit case.

Result: For compactly supported/sub-Gaussian distributions, attention maps converge uniformly at √n rate but with exponential dependence on radius R. Moment convergence has polynomial dependence on support size and β<½. Hardmax limit shows logarithmic convergence. Experiments on synthetic Gaussian data and BERT on Wikipedia confirm predictions.

Conclusion: Token-sample complexity provides theoretical framework for understanding attention convergence at large sequence lengths, revealing different convergence regimes with practical implications for scaling LLM context windows.

Abstract: As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token-sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We estimate finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^β$ with $β<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $β$ depends on the attention geometry and the spectral properties of the token distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic Gaussian data and real BERT models on Wikipedia text confirm our predictions.
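The first-level claim is easy to probe numerically. The sketch below estimates the infinite-token limit with a large reference set of Gaussian tokens and measures how fast the n-token attention output approaches it; all sizes and the Gaussian token model are illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_out(query, keys, values, beta=1.0):
    """Softmax attention output for a single query over n tokens."""
    logits = beta * keys @ query
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values

d, n_ref = 8, 200_000
query = rng.normal(size=d)
K_ref = rng.normal(size=(n_ref, d))   # Gaussian tokens, as in the synthetic experiments
V_ref = rng.normal(size=(n_ref, d))
ref = attention_out(query, K_ref, V_ref)  # surrogate for the infinite-token limit

for n in [100, 1_000, 10_000]:
    errs = [np.linalg.norm(attention_out(query, K_ref[idx], V_ref[idx]) - ref)
            for idx in (rng.choice(n_ref, size=n, replace=False) for _ in range(20))]
    print(n, float(np.mean(errs)))  # should shrink roughly like 1/sqrt(n)
```

A halving of the error for every 4x increase in n is the $C(R)/\sqrt{n}$ signature; at very large $\beta$, where softmax approaches hardmax, the abstract predicts a much slower, logarithmic rate instead.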

[373] DCFO Additional Material

Tommaso Amico, Pernille Matthews, Lena Krieger, Arthur Zimek, Ira Assent

Main category: cs.LG

TL;DR: DCFO is a new method that generates counterfactual explanations specifically for the Local Outlier Factor (LOF) outlier detection algorithm, outperforming existing approaches on 50 datasets.

DetailsMotivation: Outlier detection needs interpretability, especially for widely-used algorithms like LOF. Current counterfactual explanation methods don't address the unique challenges of outlier detection and fail to target classical algorithms like LOF, which lacks interpretability despite its popularity.

Method: DCFO (Density-based Counterfactuals for Outliers) partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimization to generate counterfactual explanations for LOF outliers.

Result: Extensive experiments on 50 OpenML datasets show DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.

Conclusion: DCFO successfully addresses the interpretability gap in LOF outlier detection by providing effective counterfactual explanations, demonstrating better performance than existing methods across diverse datasets.

Abstract: Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one of the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF’s widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.

[374] Learning by Analogy: A Causal Framework for Composition Generalization

Lingjing Kong, Shaoan Xie, Yang Jiao, Yetian Chen, Yanhui Guo, Simone Shao, Yan Gao, Guangyi Chen, Kun Zhang

Main category: cs.LG

TL;DR: The paper proposes a causal modularity framework for compositional generalization, showing how hierarchical concept decomposition enables novel concept combinations and proving identifiability from observable data.

DetailsMotivation: Current models lack understanding of the data structures and principles enabling compositional generalization - the ability to understand and generate novel combinations of learned concepts. The authors aim to formalize how humans achieve this through analogical reasoning and concept decomposition.

Method: The authors formalize compositional generalization using principles of causal modularity and minimal changes. They introduce a hierarchical data-generating process encoding different concept levels and their interactions. Theoretically, they prove identifiability of latent hierarchical structure from observable data like text-image pairs.

Result: The approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work assuming simpler interactions. The latent hierarchical structure is provably recoverable from observable data. Applying insights from the framework achieves significant improvements on benchmark datasets.

Conclusion: Compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across contexts. The causal modularity framework provides a principled approach to understanding and achieving this capability, with theoretical guarantees and practical improvements on benchmarks.

Abstract: Compositional generalization – the ability to understand and generate novel combinations of learned concepts – enables models to extend their capabilities beyond limited experiences. While effective, the data structures and principles that enable this crucial capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice. In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects. Critically, we also prove that this latent hierarchical structure is provably recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.

[375] HybridVFL: Disentangled Feature Learning for Edge-Enabled Vertical Federated Multimodal Classification

Mostafa Anoosha, Zeinab Dehghani, Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker

Main category: cs.LG

TL;DR: HybridVFL improves Vertical Federated Learning for Edge AI by using client-side feature disentanglement and server-side cross-modal transformers for better feature fusion, achieving superior performance on medical datasets while preserving privacy.

DetailsMotivation: Standard VFL systems in Edge AI scenarios (like mobile health diagnostics) suffer from performance limitations due to simplistic feature fusion approaches, especially when dealing with sensitive multimodal data on distributed, resource-constrained devices.

Method: HybridVFL framework employs client-side feature disentanglement to extract meaningful features locally, paired with a server-side cross-modal transformer that performs context-aware fusion of the disentangled features from different modalities.

Result: Systematic evaluation on the multimodal HAM10000 skin lesion dataset shows that HybridVFL significantly outperforms standard federated baselines, demonstrating improved performance in privacy-preserving medical diagnostics.

Conclusion: Advanced fusion mechanisms like cross-modal transformers are critical for building robust, privacy-preserving VFL systems, especially in medical applications where both performance and data privacy are essential.

Abstract: Vertical Federated Learning (VFL) offers a privacy-preserving paradigm for Edge AI scenarios like mobile health diagnostics, where sensitive multimodal data reside on distributed, resource-constrained devices. Yet, standard VFL systems often suffer performance limitations due to simplistic feature fusion. This paper introduces HybridVFL, a novel framework designed to overcome this bottleneck by employing client-side feature disentanglement paired with a server-side cross-modal transformer for context-aware fusion. Through systematic evaluation on the multimodal HAM10000 skin lesion dataset, we demonstrate that HybridVFL significantly outperforms standard federated baselines, validating the criticality of advanced fusion mechanisms in robust, privacy-preserving systems.

[376] UniExtreme: A Universal Foundation Model for Extreme Weather Forecasting

Hang Ni, Weijia Zhang, Hao Liu

Main category: cs.LG

TL;DR: UniExtreme is a universal extreme weather forecasting foundation model that addresses spectral disparities and hierarchical drivers of diverse extreme events through adaptive frequency modulation and event prior augmentation.

DetailsMotivation: Current foundation models for weather forecasting have limited ability to predict extreme weather events. Existing approaches either focus on general weather conditions or specialize in specific-type extremes, neglecting the real-world atmospheric patterns of diversified extreme events.

Method: Proposes UniExtreme with two key components: (1) Adaptive Frequency Modulation (AFM) module that captures region-wise spectral differences between normal and extreme weather using learnable Beta-distribution filters and multi-granularity spectral aggregation, and (2) Event Prior Augmentation (EPA) module that incorporates region-specific extreme event priors to resolve hierarchical extreme diversity and composite extreme schema via a dual-level memory fusion network.

Result: Extensive experiments demonstrate that UniExtreme outperforms state-of-the-art baselines in both extreme and general weather forecasting, showcasing superior adaptability across diverse extreme scenarios.

Conclusion: UniExtreme provides a comprehensive solution for universal extreme weather forecasting by addressing both spectral characteristics and hierarchical drivers of diverse extreme events, offering improved performance over existing approaches.

Abstract: Recent advancements in deep learning have led to the development of Foundation Models (FMs) for weather forecasting, yet their ability to predict extreme weather events remains limited. Existing approaches either focus on general weather conditions or specialize in specific-type extremes, neglecting the real-world atmospheric patterns of diversified extreme events. In this work, we identify two key characteristics of extreme events: (1) the spectral disparity against normal weather regimes, and (2) the hierarchical drivers and geographic blending of diverse extremes. Along this line, we propose UniExtreme, a universal extreme weather forecasting foundation model that integrates (1) an Adaptive Frequency Modulation (AFM) module that captures region-wise spectral differences between normal and extreme weather, through learnable Beta-distribution filters and multi-granularity spectral aggregation, and (2) an Event Prior Augmentation (EPA) module which incorporates region-specific extreme event priors to resolve hierarchical extreme diversity and composite extreme schema, via a dual-level memory fusion network. Extensive experiments demonstrate that UniExtreme outperforms state-of-the-art baselines in both extreme and general weather forecasting, showcasing superior adaptability across diverse extreme scenarios.

[377] Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality

Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song, Eric P. Xing, Kun Zhang

Main category: cs.LG

TL;DR: The paper proposes using causal minimality principles to make deep generative models interpretable, enabling component-wise identifiable control and extraction of hierarchical concept graphs from diffusion and autoregressive models.

DetailsMotivation: Deep generative models operate as opaque black boxes, hindering human understanding, control, and alignment. Current methods like sparse autoencoders lack theoretical guarantees, risking subjective insights.

Method: Introduces a theoretical framework for hierarchical selection models where higher-level concepts emerge from constrained composition of lower-level variables. Applies causal minimality principles (manifesting as sparsity/compression constraints) to diffusion vision and autoregressive language models.

Result: Under minimality conditions, learned representations become equivalent to true latent variables of the data-generating process. Empirically extracts innate hierarchical concept graphs from leading generative models and enables fine-grained model steering.

Conclusion: Causal minimality provides a principled foundation for interpretable generative models, offering transparent, reliable systems with clear causal interpretation and robust control.

Abstract: Deep generative models, while revolutionizing fields like image and text generation, largely operate as opaque black boxes, hindering human understanding, control, and alignment. While methods like sparse autoencoders (SAEs) show remarkable empirical success, they often lack theoretical guarantees, risking subjective insights. Our primary objective is to establish a principled foundation for interpretable generative models. We demonstrate that the principle of causal minimality – favoring the simplest causal explanation – can endow the latent representations of diffusion vision and autoregressive language models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables, better capturing the complex dependencies in data generation. Under theoretically derived minimality conditions (manifesting as sparsity or compression constraints), we show that learned representations can be equivalent to the true latent variables of the data-generating process. Empirically, applying these constraints to leading generative models allows us to extract their innate hierarchical concept graphs, offering fresh insights into their internal knowledge organization. Furthermore, these causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.

[378] Generalized Spherical Neural Operators: Green’s Function Formulation

Hao Tang, Hao Chen, Chao Li

Main category: cs.LG

TL;DR: Proposes GSNO, a spherical neural operator framework using designable Green’s functions for flexible balance of equivariance/invariance, with GSHNet architecture for multi-scale modeling on spherical domains.

DetailsMotivation: Existing neural operators struggle with spherical domains due to geometry preservation needs and lack flexibility for real-world complexity, despite rotational equivariance approaches.

Method: Operator-design framework based on spherical Green’s function harmonic expansion; absolute/relative position-dependent Green’s function; GSNO with spectral learning; GSHNet hierarchical architecture with multi-scale spectral modeling and spherical up-down sampling.

Result: GSNO and GSHNet outperform state-of-the-art methods on diffusion MRI, shallow water dynamics, and global weather forecasting tasks.

Conclusion: GSNO provides principled, general framework for spherical operator learning that bridges rigorous theory with real-world complexity.

Abstract: Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a general operator-design framework based on the designable spherical Green’s function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green’s function that enables flexible balance of equivariance and invariance for real-world modeling. The resulting operator, Green’s-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to anisotropic, constraint-rich systems while retaining spectral efficiency. To exploit GSNO, we develop GSHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. In evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and GSHNet consistently outperform state-of-the-art methods. Our results position GSNO as a principled and general framework for spherical operator learning, bridging rigorous theory with real-world complexity.

[379] LGAN: An Efficient High-Order Graph Neural Network via the Line Graph Aggregation

Lin Du, Lu Bai, Jincheng Li, Lixin Cui, Hangyuan Du, Lichi Zhang, Yuting Chen, Zhao Li

Main category: cs.LG

TL;DR: LGAN is a novel GNN architecture that constructs line graphs from node-centered subgraphs for higher-order aggregation, achieving greater expressivity than 2-WL with lower complexity and better interpretability.

DetailsMotivation: Existing GNNs have limited expressivity (bounded by 1-WL), while k-WL-based GNNs suffer from high computational cost and poor interpretability due to their inability to retain node/edge-level semantics needed for attribution methods.

Method: Proposes Line Graph Aggregation Network (LGAN) that constructs a line graph from the induced subgraph centered at each node to perform higher-order aggregation, theoretically achieving greater expressive power than 2-WL with lower time complexity.

Result: Empirical evaluations show LGAN outperforms state-of-the-art k-WL-based GNNs on benchmarks while offering better interpretability.

Conclusion: LGAN provides an effective solution to overcome the limitations of existing GNNs by achieving higher expressivity than 2-WL with lower computational cost and better interpretability through line graph construction from node-centered subgraphs.

Abstract: Graph Neural Networks (GNNs) have emerged as a dominant paradigm for graph classification. Specifically, most existing GNNs mainly rely on the message passing strategy between neighbor nodes, where the expressivity is limited by the 1-dimensional Weisfeiler-Lehman (1-WL) test. Although a number of k-WL-based GNNs have been proposed to overcome this limitation, their computational cost increases rapidly with k, significantly restricting the practical applicability. Moreover, since the k-WL models mainly operate on node tuples, these k-WL-based GNNs cannot retain fine-grained node- or edge-level semantics required by attribution methods (e.g., Integrated Gradients), making them less interpretable. To overcome the above shortcomings, in this paper, we propose a novel Line Graph Aggregation Network (LGAN) that constructs a line graph from the induced subgraph centered at each node to perform higher-order aggregation. We theoretically prove that the LGAN not only possesses greater expressive power than the 2-WL under injective aggregation assumptions, but also has lower time complexity. Empirical evaluations on benchmarks demonstrate that the LGAN outperforms state-of-the-art k-WL-based GNNs, while offering better interpretability.
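The underlying construction is standard graph machinery, which a few lines of networkx make concrete; the 1-hop ego radius is an assumed choice of neighborhood:

```python
import networkx as nx

G = nx.karate_club_graph()
center = 0
sub = nx.ego_graph(G, center, radius=1)  # induced subgraph centered at the node
L = nx.line_graph(sub)                   # nodes of L are the edges of `sub`
print(sub.number_of_edges(), L.number_of_nodes())  # equal by construction
```

Message passing on L aggregates over pairs of incident edges, which lifts the receptive structure beyond single-node neighborhoods while keeping edge-level semantics intact.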

[380] Template-Free Retrosynthesis with Graph-Prior Augmented Transformers

Youjun Zhao

Main category: cs.LG

TL;DR: Transformer-based template-free retrosynthesis model with graph-enhanced attention and data augmentation achieves SOTA on USPTO-50K.

DetailsMotivation: Existing retrosynthesis models lack the accuracy and robustness needed for practical deployment, and many rely on handcrafted templates or chemical rule engines which limit flexibility.

Method: Template-free Transformer framework that injects molecular graph information into attention mechanism to combine SMILES sequences with structural cues, plus paired data augmentation for training diversity and scale.

Result: Achieves state-of-the-art performance among template-free methods on USPTO-50K benchmark, substantially outperforming vanilla Transformer baseline.

Conclusion: The proposed graph-enhanced Transformer with data augmentation provides an effective template-free approach for retrosynthesis prediction that improves accuracy without relying on handcrafted rules.

Abstract: Retrosynthesis reaction prediction seeks to infer plausible reactant molecules for a given product and is a central problem in computer-aided organic synthesis. Despite recent progress, many existing models still fall short of the accuracy and robustness required for practical deployment. This work studies a template-free, Transformer-based framework that eliminates reliance on handcrafted reaction templates or additional chemical rule engines. The model injects molecular graph information into the attention mechanism to jointly exploit SMILES sequences and structural cues, and further applies a paired data augmentation strategy to enhance training diversity and scale. On the USPTO-50K benchmark, our proposed approach achieves state-of-the-art performance among template-free methods and substantially outperforms a vanilla Transformer baseline.

[381] Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla, Kowshik Thopalli

Main category: cs.LG

TL;DR: CB-SAE improves sparse autoencoders for LVLMs by pruning low-utility neurons and adding concept bottlenecks, boosting interpretability by 32.1% and steerability by 14.5%.

DetailsMotivation: Current sparse autoencoders (SAEs) have limitations: most neurons have low interpretability/steerability, and user-desired concepts are often missing from the learned dictionary, limiting practical utility for mechanistic interpretability and model steering.

Method: Propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to user-defined concepts.

Result: CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks compared to baseline SAEs.

Conclusion: CB-SAE addresses key limitations of SAEs by combining pruning with concept bottlenecks, making SAEs more practical for interpretability and steering applications in vision-language models.

Abstract: Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations: (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.

[382] Extrapolation of Periodic Functions Using Binary Encoding of Continuous Numerical Values

Brian P. Powell, Jordan A. Caraballo-Vega, Mark L. Carroll, Thomas Maxwell, Andrew Ptak, Greg Olmschenk, Jorge Martinez-Palomera

Main category: cs.LG

TL;DR: Binary encoding enables neural networks to extrapolate periodic functions beyond training bounds using Normalized Base-2 Encoding (NB2E).

DetailsMotivation: Neural networks typically struggle with extrapolation beyond training data, especially for periodic functions. The paper aims to address this limitation by exploring how binary encoding can enable better extrapolation capabilities.

Method: Introduces Normalized Base-2 Encoding (NB2E) for encoding continuous numerical values. Uses vanilla multi-layer perceptrons (MLPs) with this encoding to learn periodic signals without prior knowledge of their functional form.

Result: MLPs with NB2E successfully extrapolate diverse periodic signals beyond training bounds. Internal activation analysis reveals that NB2E induces bit-phase representations, allowing networks to learn signal structure independently of position.

Conclusion: Binary encoding through NB2E provides a simple yet effective method for enabling neural networks to extrapolate periodic functions, overcoming a fundamental limitation in neural network generalization.

Abstract: We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
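The precise NB2E definition is not given in this summary, so the following fixed-point reading is purely an assumption: normalize each value into [0, 1), then expand it into its leading base-2 digits, turning one scalar input into a binary feature vector for the MLP:

```python
import numpy as np

def nb2e(x, x_min, x_max, bits=16):
    """Hypothetical normalized base-2 encoding: scale to [0, 1), emit binary digits."""
    u = (np.asarray(x, dtype=float) - x_min) / (x_max - x_min)
    u = np.clip(u, 0.0, 1.0 - 1e-12)
    ints = (u * (1 << bits)).astype(np.int64)
    # Most-significant bit first; each scalar becomes a `bits`-long 0/1 vector.
    return np.stack([(ints >> (bits - 1 - b)) & 1 for b in range(bits)], axis=-1)

print(nb2e([0.3, 7.9], x_min=0.0, x_max=10.0, bits=8))
```

An encoding like this makes the low-order bits cycle with fixed periods as x grows, which is one plausible reading of the "bit-phase representations" the activation analysis describes.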

[383] Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments

Atahan Cilan, Atay Özgövde

Main category: cs.LG

TL;DR: A reinforcement learning framework for generating controllable and diverse player behaviors without human data, using continuous behavior space sampling and distance-based rewards.

DetailsMotivation: Existing methods require large human gameplay datasets, train separate models for different player types, or lack interpretable mapping between behavioral parameters and learned policies, limiting scalability and controllability.

Method: Define player behavior in N-dimensional continuous space, uniformly sample target behavior vectors from real human style region. Train agents with current and target behavior vectors as input, using reward based on normalized reduction in distance between them. Single PPO-based multi-agent policy learns how actions influence behavioral statistics.

Result: Framework produces significantly greater behavioral diversity than win-only baseline, reliably matches specified behavior vectors across diverse targets. Single policy can reproduce new or unseen play styles without retraining.

Conclusion: Method offers scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games without requiring human gameplay data.

Abstract: This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.
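The reward signal described above reduces to a few lines. In this sketch the normalization by the previous distance is an assumed choice; the summary only specifies a "normalized reduction in distance":

```python
import numpy as np

def behavior_reward(prev_behavior, curr_behavior, target, eps=1e-8):
    """Reward = normalized reduction in distance to the target behavior vector."""
    d_prev = np.linalg.norm(prev_behavior - target)
    d_curr = np.linalg.norm(curr_behavior - target)
    return (d_prev - d_curr) / (d_prev + eps)

# Moving the running behavior vector toward the target yields a positive reward.
print(behavior_reward(np.array([0.8, 0.2]), np.array([0.6, 0.3]), np.array([0.2, 0.5])))
```

Because the target vector is an input to the policy rather than baked into the reward weights, a single trained network can be steered to any point in the behavior space at inference time.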

[384] Bayesian Symbolic Regression via Posterior Sampling

Geoffrey F. Bomarito, Patrick E. Leser

Main category: cs.LG

TL;DR: Bayesian symbolic regression using Sequential Monte Carlo (SMC) framework improves robustness to noise and enables uncertainty quantification compared to traditional genetic programming approaches.

DetailsMotivation: Symbolic regression is sensitive to noise, which limits its broader application in scientific discovery and engineering. Current methods lack robustness and uncertainty quantification when dealing with noisy data.

Method: Proposes a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates posterior distributions over symbolic expressions. Combines probabilistic selection, adaptive tempering, and normalized marginal likelihood to efficiently explore the search space.

Result: The method outperforms standard genetic programming baselines on noisy benchmark datasets, showing reduced overfitting and improved ability to discover accurate, interpretable, and parsimonious equations.

Conclusion: The SMC-based Bayesian symbolic regression framework provides enhanced robustness to noise and uncertainty quantification, paving the way for more reliable symbolic regression in scientific discovery and engineering applications.

Abstract: Symbolic regression is a powerful tool for discovering governing equations directly from data, but its sensitivity to noise hinders its broader application. This paper introduces a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates the posterior distribution over symbolic expressions, enhancing robustness and enabling uncertainty quantification for symbolic regression in the presence of noise. Differing from traditional genetic programming approaches, the SMC-based algorithm combines probabilistic selection, adaptive tempering, and the use of normalized marginal likelihood to efficiently explore the search space of symbolic expressions, yielding parsimonious expressions with improved generalization. When compared to standard genetic programming baselines, the proposed method better deals with challenging, noisy benchmark datasets. The reduced tendency to overfit and enhanced ability to discover accurate and interpretable equations paves the way for more robust symbolic regression in scientific discovery and engineering design applications.
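As a concrete anchor for the adaptive-tempering component, here is the standard SMC reweighting step (a generic sketch, not the paper's exact algorithm): when the inverse temperature rises from β_old to β_new, each candidate expression's weight is multiplied by its data likelihood raised to the increment:

```python
import numpy as np

def reweight(log_liks, beta_old, beta_new):
    """Tempered SMC update: w_i proportional to p(D | expr_i)^(beta_new - beta_old)."""
    log_w = (beta_new - beta_old) * log_liks
    w = np.exp(log_w - log_w.max())  # stabilized in log-space
    return w / w.sum()

# Five candidate expressions with these data log-likelihoods, tempering 0.2 -> 0.35.
print(reweight(np.array([-10.0, -12.5, -9.1, -20.0, -11.0]), 0.2, 0.35))
```

Raising the temperature gradually keeps the particle population diverse early on, letting poor-but-promising expressions survive long enough to be mutated toward better fits.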

[385] Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants

Chirag Modi, Jiequn Han, Eric Vanden-Eijnden, Joan Bruna

Main category: cs.LG

TL;DR: SCSI: A transport-based generative model that learns to invert corruption channels using only noisy data and black-box access to the forward model, enabling clean data generation from corrupted observations.

DetailsMotivation: Many scientific and engineering domains lack clean datasets - only noisy, corrupted measurements are available. Need generative models that can handle ill-conditioned inverse problems at the distribution level.

Method: Self-consistent stochastic interpolants: iteratively update transport map between corrupted and clean data using only corrupted dataset and black-box access to corruption channel. Converges to self-consistent map that inverts the corruption.

Result: Superior performance on inverse problems in natural image processing and scientific reconstruction. Computationally efficient, handles arbitrary nonlinear forward models, and has theoretical convergence guarantees.

Conclusion: SCSI provides an effective framework for generative modeling from corrupted data, offering computational efficiency, flexibility with black-box forward models, and theoretical soundness for distribution-level inverse problems.

Abstract: Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.

[386] Scaling Behavior of Discrete Diffusion Language Models

Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto

Main category: cs.LG

TL;DR: DLMs’ scaling behavior depends on noise type, with uniform diffusion requiring more parameters but less data than masked diffusion, making it promising for data-limited scenarios.

DetailsMotivation: To understand the scaling behavior of discrete diffusion language models (DLMs) compared to autoregressive language models (ALMs), as prior work suggests DLMs need more resources to match ALM performance, and to explore how different noise types affect DLM scaling.

Method: Studied DLM scaling behavior by interpolating between masked and uniform diffusion noise types, carefully controlling hyperparameters like batch size and learning rate, and scaling uniform diffusion models up to 10B parameters trained for 10^22 FLOPs.

Result: DLM scaling behavior strongly depends on noise type and differs from ALMs. While all noise types converge to similar loss in compute-bound scaling, uniform diffusion requires more parameters but less data for compute-efficient training compared to masked diffusion, making it better for data-bound settings.

Conclusion: Uniform diffusion DLMs are promising for data-limited scenarios due to their parameter-efficient scaling behavior, with the 10B parameter uniform diffusion model confirming predicted scaling patterns and becoming the largest publicly known model of its type.

Abstract: Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
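
The masked-to-uniform interpolation can be pictured as a forward corruption whose replacement distribution mixes a [MASK] token with a uniform random token. This is a schematic sketch, not the authors' exact parameterization; the corruption probability `t` and mixing weight `lam` are illustrative.

```python
import numpy as np

def corrupt(tokens, t, lam, vocab_size, mask_id, rng):
    """Forward corruption interpolating between masked (lam=1) and
    uniform (lam=0) discrete diffusion. Each token is corrupted
    independently with probability t; a corrupted position becomes
    [MASK] with probability lam, otherwise a uniform random token."""
    tokens = tokens.copy()
    corrupt_pos = rng.random(tokens.shape) < t
    use_mask = rng.random(tokens.shape) < lam
    random_tok = rng.integers(0, vocab_size, size=tokens.shape)
    tokens[corrupt_pos & use_mask] = mask_id
    tokens[corrupt_pos & ~use_mask] = random_tok[corrupt_pos & ~use_mask]
    return tokens

rng = np.random.default_rng(0)
x0 = rng.integers(0, 100, size=16)
print(corrupt(x0, t=0.5, lam=1.0, vocab_size=100, mask_id=100, rng=rng))  # masked
print(corrupt(x0, t=0.5, lam=0.0, vocab_size=100, mask_id=100, rng=rng))  # uniform
```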

[387] UrbanAI 2025 Challenge: Linear vs Transformer Models for Long-Horizon Exogenous Temperature Forecasting

Ruslan Gokhman

Main category: cs.LG

TL;DR: Linear models outperform Transformers for long-horizon temperature forecasting using only past temperature data, with DLinear achieving best accuracy.

DetailsMotivation: To evaluate forecasting performance in challenging exogenous-only settings where only past temperature values are available for prediction, comparing linear and Transformer-family models.

Method: Systematic comparison of Linear, NLinear, DLinear, Transformer, Informer, and Autoformer models using standardized train/validation/test splits on long-horizon temperature forecasting with only past temperature data.

Result: Linear baselines (Linear, NLinear, DLinear) consistently outperform Transformer-family architectures, with DLinear achieving the best overall accuracy across all splits.

Conclusion: Carefully designed linear models remain strong baselines for time series forecasting in challenging exogenous-only settings, outperforming more complex Transformer architectures.

Abstract: We study long-horizon exogenous-only temperature forecasting - a challenging univariate setting where only the past values of the indoor temperature are used for prediction - using linear and Transformer-family models. We evaluate Linear, NLinear, DLinear, Transformer, Informer, and Autoformer under standardized train, validation, and test splits. Results show that linear baselines (Linear, NLinear, DLinear) consistently outperform more complex Transformer-family architectures, with DLinear achieving the best overall accuracy across all splits. These findings highlight that carefully designed linear models remain strong baselines for time series forecasting in challenging exogenous-only settings.
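
DLinear (Zeng et al., 2023), the best performer in this study, is compact enough to sketch in full: the lookback window is decomposed into a moving-average trend and a seasonal remainder, each forecast by its own linear layer. A minimal univariate PyTorch version with illustrative window sizes:

```python
import torch
import torch.nn as nn

class DLinear(nn.Module):
    """Minimal DLinear: decompose the input window into a moving-average
    trend and a seasonal remainder, forecast each with its own linear
    layer, and sum. Univariate, exogenous-only: input (batch, lookback),
    output (batch, horizon)."""
    def __init__(self, lookback, horizon, kernel=25):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                 count_include_pad=False)
        self.trend = nn.Linear(lookback, horizon)
        self.seasonal = nn.Linear(lookback, horizon)

    def forward(self, x):                       # x: (batch, lookback)
        trend = self.pool(x.unsqueeze(1)).squeeze(1)[:, : x.shape[1]]
        seasonal = x - trend
        return self.trend(trend) + self.seasonal(seasonal)

model = DLinear(lookback=96, horizon=24)
y = model(torch.randn(8, 96))                   # (8, 24) forecast
print(y.shape)
```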

[388] Guided Transfer Learning for Discrete Diffusion Models

Julian Kleutgens, Claudio Battiloro, Lingkai Kong, Benjamin Grewe, Francesca Dominici, Mauricio Tec

Main category: cs.LG

TL;DR: GTL enables transfer learning for discrete diffusion models without fine-tuning, using guidance to adapt pretrained models to new domains, with efficient sampling for large vocabularies.

DetailsMotivation: Discrete diffusion models perform well but need large training datasets, which are costly/risky for new domains. Current transfer learning requires expensive fine-tuning of large models.

Method: Guided Transfer Learning (GTL) uses guidance to sample from target distributions without modifying pretrained denoisers. Works for both discrete-time and continuous-time discrete diffusion. Includes efficient sampler focusing on planner-selected positions and top tokens.

Result: GTL enables practical guided language modeling at scale for large vocabularies and long sequences. Evaluated on sequential data including synthetic Markov chains and language modeling.

Conclusion: GTL provides a unified, efficient transfer learning approach for discrete diffusion models that avoids expensive fine-tuning and makes guided sampling practical for real-world applications.

Abstract: Discrete diffusion models achieve strong performance across language and other discrete domains, providing a powerful alternative to autoregressive models. However, their strong performance relies on large training datasets, which are costly or risky to obtain, especially when adapting to new domains. Transfer learning is the natural way to adapt pretrained discrete diffusion models, but current methods require fine-tuning large diffusion models, which is computationally expensive and often impractical. Building on ratio-based transfer learning for continuous diffusion, we provide Guided Transfer Learning for discrete diffusion models (GTL). This enables sampling from a target distribution without modifying the pretrained denoiser. The same guidance formulation applies to both discrete-time diffusion and continuous-time score-based discrete diffusion, yielding a unified treatment. Guided discrete diffusion often requires many forward passes of the guidance network, which becomes impractical for large vocabularies and long sequences. To address this, we further present an efficient guided sampler that concentrates evaluations on planner-selected positions and top candidate tokens, thus lowering sampling time and computation. This makes guided language modeling practical at scale for large vocabularies and long sequences. We evaluate GTL on sequential data, including synthetic Markov chains and language modeling, and provide empirical analyses of its behavior.

[389] Classifier Reconstruction Through Counterfactual-Aware Wasserstein Prototypes

Xuan Zhao, Zhuo Cao, Arya Bangun, Hanno Scharr, Ira Assent

Main category: cs.LG

TL;DR: Counterfactual explanations improve model reconstruction by using boundary-proximate counterfactuals as informative samples, integrated with original data via Wasserstein barycenter to preserve class distributions and prevent boundary shift.

DetailsMotivation: Counterfactuals provide interpretability but can also enhance model reconstruction, especially when labeled data is limited. However, naive use of counterfactuals as training samples causes decision boundary shift due to their proximity to decision boundaries.

Method: Integrate original data samples with counterfactuals to approximate class prototypes using Wasserstein barycenter. This preserves the underlying distributional structure of each class while leveraging counterfactuals as informative boundary samples.

Result: Empirical results across multiple datasets show improved fidelity between surrogate and target models, validating the effectiveness of the approach in enhancing model reconstruction quality.

Conclusion: Counterfactuals can significantly improve model reconstruction when properly integrated with original data via Wasserstein barycenter, mitigating decision boundary shift and enhancing surrogate model quality.

Abstract: Counterfactual explanations provide actionable insights by identifying minimal input changes required to achieve a desired model prediction. Beyond their interpretability benefits, counterfactuals can also be leveraged for model reconstruction, where a surrogate model is trained to replicate the behavior of a target model. In this work, we demonstrate that model reconstruction can be significantly improved by recognizing that counterfactuals, which typically lie close to the decision boundary, can serve as informative though less representative samples for both classes. This is particularly beneficial in settings with limited access to labeled data. We propose a method that integrates original data samples with counterfactuals to approximate class prototypes using the Wasserstein barycenter, thereby preserving the underlying distributional structure of each class. This approach enhances the quality of the surrogate model and mitigates the issue of decision boundary shift, which commonly arises when counterfactuals are naively treated as ordinary training instances. Empirical results across multiple datasets show that our method improves fidelity between the surrogate and target models, validating its effectiveness.
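
For two equal-size empirical measures with uniform weights, the free-support Wasserstein-2 barycenter reduces to interpolating optimally matched point pairs, which gives a compact picture of how class samples and counterfactuals could be blended into a prototype. The interpolation weight `alpha` is an illustrative choice, not necessarily the paper's:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_measure_barycenter(X, Y, alpha=0.5):
    """Free-support Wasserstein-2 barycenter of two equal-size empirical
    measures: optimally match points, then interpolate matched pairs.
    Here X are original class samples and Y counterfactuals."""
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)     # optimal matching
    return alpha * X[rows] + (1 - alpha) * Y[cols]

rng = np.random.default_rng(0)
X = rng.normal([0, 0], 1.0, size=(50, 2))        # class samples
Y = rng.normal([2, 2], 0.3, size=(50, 2))        # boundary counterfactuals
proto = two_measure_barycenter(X, Y, alpha=0.7)  # prototype support points
print(proto.mean(axis=0))
```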

[390] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Main category: cs.LG

TL;DR: The paper introduces Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradient methods in LLM reasoning that corrects off-policy weighting issues and enables stable, scalable training.

DetailsMotivation: KL regularization is widely used in policy gradient algorithms for LLMs, but there's inconsistency in KL direction, normalization, and estimators across literature, often mixed with off-policy estimation. The paper aims to clarify what weighting is needed for each KL variant to ensure the surrogate optimization yields the exact gradient of the intended KL-regularized objective.

Method: Proposes Regularized Policy Gradient (RPG) view - a unified derivation that: (1) unifies normalized/unnormalized KL variants, (2) specifies conditions for gradient-equivalence between REINFORCE-style losses and differentiable surrogates, (3) corrects off-policy importance-weighting mismatch in GRPO’s KL term, and (4) introduces RPG-Style Clip for stable off-policy training.

Result: RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 percentage points over DAPO on mathematical reasoning benchmarks (AIME24, AIME25). At 8K context length, achieves 52% accuracy on AIME25, surpassing Qwen3-4B-Instruct (47%).

Conclusion: RPG provides a stable and scalable RL algorithm for LLM reasoning through three key components: KL-correct objective, clipped importance sampling, and iterative reference-policy update scheme.

Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface (the choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, $k_1/k_2/k_3$) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO’s KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. We extend our experiments to 8K context length, and RPG-REINFORCE with RPG-Style Clip achieves 52% accuracy on AIME25, surpassing the official Qwen3-4B-Instruct model (47%). Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme.
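
The $k_3$ penalty that the paper identifies with the unnormalized KL, combined with a clipped off-policy importance weight, can be sketched compactly. The loss below is schematic: the clipping form, constants, and toy shapes are illustrative rather than the paper's exact recipe.

```python
import torch

def k3_penalty(logp_cur, logp_ref):
    """Per-token k3 estimator: r - 1 - log r with r = p_ref / p_cur.
    The paper shows this widely used penalty is exactly the
    unnormalized KL."""
    log_r = logp_ref - logp_cur
    return log_r.exp() - 1.0 - log_r

def rpg_style_loss(logp_cur, logp_old, logp_ref, advantage, beta=0.01, clip=2.0):
    """Schematic RPG-REINFORCE-style objective: a REINFORCE term weighted
    by a clipped off-policy importance ratio (stop-gradient), plus a k3
    KL penalty to the reference policy."""
    ratio = (logp_cur - logp_old).detach().exp().clamp(max=clip)
    pg = -(ratio * advantage * logp_cur)
    return (pg + beta * k3_penalty(logp_cur, logp_ref)).mean()

B, T = 4, 16                                     # toy shapes
logp_cur = torch.randn(B, T, requires_grad=True)
logp_old = logp_cur.detach() + 0.1 * torch.randn(B, T)
logp_ref = logp_cur.detach() + 0.1 * torch.randn(B, T)
adv = torch.randn(B, 1).expand(B, T)
loss = rpg_style_loss(logp_cur, logp_old, logp_ref, adv)
loss.backward()
print(loss.item())
```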

[391] Physics-Informed Learning of Flow Distribution and Receiver Heat Losses in Parabolic Trough Solar Fields

Stefan Matthes, Markus Schramm

Main category: cs.LG

TL;DR: Physics-informed learning framework infers loop-level mass-flow ratios and receiver heat-transfer coefficients from CSP operational data using nocturnal homogenization periods and differentiable optimization.

DetailsMotivation: Parabolic trough CSP plants have unobserved loop-level mass flows and receiver heat-loss parameters, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools.

Method: Physics-informed learning framework that exploits nocturnal homogenization periods (hot oil circulated through non-irradiated field) to isolate hydraulic and thermal-loss effects. Uses differentiable conjugate heat-transfer model discretized and embedded into end-to-end learning pipeline optimized with historical plant data from Andasol 3.

Result: Model accurately reconstructs loop temperatures (RMSE <2°C), produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography shows strong correspondence, correctly identifying all areas with high-loss receivers.

Conclusion: Noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.

Abstract: Parabolic trough Concentrating Solar Power (CSP) plants operate large hydraulic networks of collector loops that must deliver a uniform outlet temperature despite spatially heterogeneous optical performance, heat losses, and pressure drops. While loop temperatures are measured, loop-level mass flows and receiver heat-loss parameters are unobserved, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools. We present a physics-informed learning framework that infers (i) loop-level mass-flow ratios and (ii) time-varying receiver heat-transfer coefficients directly from routine operational data. The method exploits nocturnal homogenization periods – when hot oil is circulated through a non-irradiated field – to isolate hydraulic and thermal-loss effects. A differentiable conjugate heat-transfer model is discretized and embedded into an end-to-end learning pipeline optimized using historical plant data from the 50 MW Andasol 3 solar field. The model accurately reconstructs loop temperatures (RMSE $<2^\circ$C) and produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography (QScan) shows strong correspondence, correctly identifying all areas with high-loss receivers. This demonstrates that noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.

[392] Reparameterized LLM Training via Orthogonal Equivalence Transformation

Zeju Qiu, Simon Buchholz, Tim Z. Xiao, Maximilian Dax, Bernhard Schölkopf, Weiyang Liu

Main category: cs.LG

TL;DR: POET is a novel reparameterized training algorithm using orthogonal equivalence transformation to optimize neurons, improving training stability and generalization for large language models.

DetailsMotivation: Effectively and reliably training large language models remains a significant challenge in AI. Current training methods may lack stability and generalization capabilities for these massive models.

Method: POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. It uses orthogonal equivalence transformation to optimize neurons while preserving spectral properties of weight matrices. The method includes efficient approximations for scalability.

Result: POET enables stable optimization with improved generalization. Extensive experiments validate its effectiveness and scalability in training large-scale neural networks and LLMs.

Conclusion: POET provides a novel, effective, and scalable approach to address the significant challenge of training large language models, offering improved stability and generalization through orthogonal reparameterization.

Abstract: While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field’s most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
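
The reparameterization itself is easy to sketch: a frozen random weight W0 is multiplied on both sides by learnable orthogonal matrices, so the effective weight keeps W0's singular values. A minimal PyTorch version using the built-in orthogonal parametrization; the paper's efficient approximations are omitted.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class POETLinear(nn.Module):
    """Sketch of POET's reparameterization: trained weight
    W = R @ W0 @ P with R, P learnable orthogonal and W0 fixed random,
    so the spectral properties of W0 are provably preserved."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.register_buffer("W0", torch.randn(d_out, d_in) / d_in**0.5)
        self.R = orthogonal(nn.Linear(d_out, d_out, bias=False))
        self.P = orthogonal(nn.Linear(d_in, d_in, bias=False))

    def forward(self, x):
        W = self.R.weight @ self.W0 @ self.P.weight
        return x @ W.T

layer = POETLinear(32, 16)
print(layer(torch.randn(4, 32)).shape)          # (4, 16)
# Orthogonal equivalence: singular values of W equal those of W0.
W = layer.R.weight @ layer.W0 @ layer.P.weight
print(torch.allclose(torch.linalg.svdvals(W),
                     torch.linalg.svdvals(layer.W0), atol=1e-5))
```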

[393] SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale

Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta

Main category: cs.LG

TL;DR: A new pruning method for LLMs that uses efficient 1-swap optimization to minimize pruning error without full retraining, outperforming existing approaches.

DetailsMotivation: Traditional pruning methods are inefficient for LLMs - full retraining is too expensive, and existing approaches use suboptimal approximations. There's a need for more tractable, effective pruning at LLM scale.

Method: Decouples rows by enforcing equal sparsity per row, derives optimal 1-swaps using Gram matrix of calibration data, and proposes a GPU-efficient 1-swap algorithm that warm starts from any pruning mask.

Result: Reduces per-layer pruning error by up to 60% over Wanda, consistently improves perplexity and zero-shot accuracy across GPT architectures, and runs efficiently on GPUs at LLM scale.

Conclusion: The proposed 1-swap algorithm provides a tractable, hyperparameter-free solution to LLM pruning that significantly outperforms state-of-the-art methods while maintaining computational efficiency.

Abstract: The resource requirements of Neural Networks can be significantly reduced through pruning – the removal of seemingly less important parameters. However, with the rise of Large Language Models (LLMs), full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as global magnitude pruning are suboptimal on Transformer architectures. State-of-the-art methods hence solve a layer-wise mask selection problem, the problem of finding a pruning mask which minimizes the per-layer pruning error on a small set of calibration data. Exactly solving this problem to optimality using Integer Programming (IP) solvers is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches therefore rely on approximations or heuristics. In this work, we demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) that can be computed efficiently using the Gram matrix of the calibration data. Using these observations, we propose a tractable and simple 1-swap algorithm that warm starts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. We demonstrate that our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2023) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.
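
The key computational point is that, given the Gram matrix of the calibration data, the effect of one kept/pruned exchange on the per-layer error has a closed form. A minimal per-row sketch starting from a magnitude mask; the data and row are synthetic, and the efficient GPU sweep over all candidate swaps is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_calib, d = 128, 64
X = rng.normal(size=(n_calib, d))     # calibration activations
w = rng.normal(size=d)                # one row of the layer's weight matrix
G = X.T @ X                           # Gram matrix of the calibration data

# Magnitude mask at 50% sparsity for this row (per-row sparsity, as in
# the paper's decoupling); `removed` holds the pruned entries of w.
keep = np.abs(w) >= np.median(np.abs(w))
removed = np.where(keep, 0.0, w)
err = removed @ G @ removed           # pruning error ||X w_masked - X w||^2

def swap_delta(i, j, removed, g, G):
    """Exact change in the pruning error when re-keeping pruned index i
    and pruning kept index j (one 1-swap). Given g = G @ removed, each
    candidate swap is scored in O(1), which is what makes sweeping all
    swaps tractable."""
    wi, wj = removed[i], w[j]
    return (-2 * wi * g[i] + 2 * wj * g[j]
            + wi**2 * G[i, i] + wj**2 * G[j, j] - 2 * wi * wj * G[i, j])

g = G @ removed
i = np.flatnonzero(~keep)[0]          # a pruned index
j = np.flatnonzero(keep)[0]           # a kept index
print("error:", err, "delta if swapped:", swap_delta(i, j, removed, g, G))
```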

[394] Digital Twin Supervised Reinforcement Learning Framework for Autonomous Underwater Navigation

Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap

Main category: cs.LG

TL;DR: Deep reinforcement learning (PPO) enables autonomous underwater navigation for BlueROV2, outperforming traditional DWA planner in cluttered environments with successful sim-to-real transfer.

DetailsMotivation: Underwater navigation is challenging due to GPS absence, poor visibility, and submerged obstacles. The paper addresses these issues using the BlueROV2 platform to develop robust autonomous navigation solutions.

Method: Proximal Policy Optimization (PPO) deep reinforcement learning with observation space combining target-oriented navigation info, virtual occupancy grid, and ray-casting along operational boundaries. Compared against Dynamic Window Approach (DWA) baseline.

Result: PPO policy consistently outperforms DWA in highly cluttered environments with better local adaptation and reduced collisions. Successful transfer from simulation to physical BlueROV2 using 3D digital twin validation.

Conclusion: Deep reinforcement learning is relevant and effective for autonomous underwater navigation, demonstrating successful sim-to-real transfer and superior performance over traditional kinematic planners in complex environments.

Abstract: Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.

[395] Decoupled Q-Chunking

Qiyang Li, Seohong Park, Sergey Levine

Main category: cs.LG

TL;DR: Proposes a novel algorithm that decouples critic and policy chunk lengths to address bootstrapping bias in TD methods, enabling policy reactivity while retaining multi-step value benefits.

DetailsMotivation: TD methods suffer from bootstrapping bias where errors accumulate across steps. Chunked critics speed up value backup but force policies to output entire action chunks open-loop, which is suboptimal for reactive environments and challenging for long chunks.

Method: Decouples critic chunk length from policy chunk length. Optimizes policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate maximum value when partial chunks are extended.

Result: Method reliably outperforms prior methods on challenging, long-horizon offline goal-conditioned tasks.

Conclusion: The proposed approach retains benefits of multi-step value propagation while avoiding open-loop sub-optimality and difficulty of learning long action chunking policies.

Abstract: Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences (“chunks”) rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal for environments that require policy reactivity and also challenging to model especially when the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.
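
The optimistic backup used to distill a partial-chunk critic can be sketched as a max of the chunked critic over sampled completions of the partial chunk. The names, toy critic, and sampling scheme are illustrative.

```python
import torch

def distill_partial_target(chunked_q, state, partial, completions):
    """Optimistic backup for a partial action chunk: approximate the
    maximum value achievable when the partial chunk is extended to a
    complete one by taking the max of the chunked critic over sampled
    completions."""
    values = torch.stack([
        chunked_q(state, torch.cat([partial, c], dim=-1)) for c in completions
    ])
    return values.max(dim=0).values

# Toy chunked critic over chunks of K=4 scalar actions.
chunked_q = lambda s, chunk: -(chunk - s).pow(2).sum(-1)
state = torch.tensor([0.5])
partial = torch.tensor([0.4, 0.6])          # first k=2 actions
completions = [torch.rand(2) for _ in range(16)]
print(distill_partial_target(chunked_q, state, partial, completions))
```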

[396] Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation

Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

Main category: cs.LG

TL;DR: The paper investigates the trade-off between the number of evaluation items (N) and annotations per item (K) for reliable ML evaluation under a fixed budget, showing that disagreement-aware evaluation is attainable with N×K ≤ 1000, almost always at K > 10, for the datasets tested.

DetailsMotivation: Reproducibility is crucial for scientific validation in ML, but human disagreement in ground truth annotations is often ignored despite being prevalent. Limited annotation budgets make it expensive to collect multiple ratings per item, creating a need to understand the optimal allocation between items and annotations per item.

Method: Analyzed diverse categorical datasets with multiple annotations per item, and simulated distributions fit to these datasets. Investigated the optimal (N, K) configuration given fixed budget (N×K) for reliable model performance comparison across different evaluation metrics.

Result: Accounting for human disagreement was attainable with N×K ≤ 1000 (often far lower) for every tested dataset on at least one metric, and the minimal N×K almost always occurred at K > 10. Whether a trade-off between K and N exists at all depends on the evaluation metric; metrics sensitive to the full response distribution perform better at higher K.

Conclusion: ML practitioners can optimize evaluation data collection by finding optimal metrics and (N, K) configurations for their budget. The study provides practical guidance for more effective test data collection that accounts for human disagreement while staying within budget constraints.

Abstract: Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple raters for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, and simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement can be achieved with $N \times K$ no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the tradeoff between $K$ and $N$, or whether one exists at all, depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of $K$. Our methods can be used to help ML practitioners get more effective test data by finding the optimal metrics and number of items and annotations per item to collect to get the most reliability for their budget.
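
A toy simulation of why distribution-sensitive metrics favor larger K: with few votes per item, plug-in estimates of per-item response distributions are biased, and no number of items can repair that under a fixed N×K budget. The Beta(2,2) rater model and the metric (mean per-item entropy) are illustrative assumptions, not the paper's datasets or metrics.

```python
import numpy as np

def H(p):
    """Binary entropy in bits, clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def mean_entropy_estimate(N, K, rng):
    p = rng.beta(2, 2, size=N)          # latent per-item agreement rates
    votes = rng.binomial(K, p) / K      # empirical soft labels from K votes
    return H(votes).mean()              # plug-in, biased for small K

rng = np.random.default_rng(0)
true_value = H(rng.beta(2, 2, size=200_000)).mean()
budget = 1000                           # fixed N * K
for K in (1, 2, 5, 10, 25, 50):
    N = budget // K
    est = np.mean([mean_entropy_estimate(N, K, rng) for _ in range(300)])
    print(f"K={K:3d} N={N:4d}  plug-in mean entropy {est:.3f} (true {true_value:.3f})")
```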

[397] Empirical evaluation of the Frank-Wolfe methods for constructing white-box adversarial attacks

Kristina Korotkova, Aleksandr Katrutsa

Main category: cs.LG

TL;DR: The paper proposes using modified Frank-Wolfe (projection-free) methods to construct efficient white-box adversarial attacks, comparing them with standard projection-based approaches on MNIST and CIFAR-10 datasets.

DetailsMotivation: Adversarial attack construction is crucial for assessing neural network robustness, but current methods need to be faster and more efficient. The optimization-based nature of attack construction suggests that advanced numerical optimization techniques could improve attack efficiency and effectiveness.

Method: The authors propose using modified Frank-Wolfe methods (projection-free optimization techniques) to construct white-box adversarial attacks. They perform theoretical analysis and numerical experiments comparing these methods with standard projection-based approaches on MNIST and CIFAR-10 datasets, testing on logistic regression, CNNs, and Vision Transformers.

Result: The modified Frank-Wolfe methods demonstrate competitive performance compared to standard projection-based approaches for constructing adversarial attacks. The projection-free nature of these methods offers computational advantages while maintaining attack effectiveness across different model architectures.

Conclusion: Projection-free optimization methods, specifically modified Frank-Wolfe approaches, provide an efficient and effective alternative to standard projection-based methods for constructing adversarial attacks, offering theoretical advantages and practical benefits for assessing neural network robustness.

Abstract: The construction of adversarial attacks for neural networks appears to be a crucial challenge for their deployment in various services. To estimate the adversarial robustness of a neural network, a fast and efficient approach is needed to construct adversarial attacks. Since the formalization of adversarial attack construction involves solving a specific optimization problem, we consider the problem of constructing an efficient and effective adversarial attack from a numerical optimization perspective. Specifically, we suggest utilizing advanced projection-free methods, known as modified Frank-Wolfe methods, to construct white-box adversarial attacks on the given input data. We perform a theoretical and numerical evaluation of these methods and compare them with standard approaches based on projection operations or geometrical intuition. Numerical experiments are performed on the MNIST and CIFAR-10 datasets, utilizing a multiclass logistic regression model, convolutional neural networks (CNNs), and a Vision Transformer (ViT).
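
The projection-free construction is the heart of the method: for an L-infinity ball around the clean input, the linear minimization oracle has the closed form x0 + eps*sign(grad), and the Frank-Wolfe iterate stays feasible as a convex combination, so no projection is needed. A minimal sketch against a toy logistic model; the step-size rule and loss are illustrative, and the paper's modified variants are omitted.

```python
import numpy as np

def frank_wolfe_attack(grad_fn, x0, eps=0.1, steps=20):
    """Projection-free white-box attack over the L-inf ball of radius
    eps around x0. The LMO for maximizing the linearized loss over this
    ball is s = x0 + eps * sign(grad); the convex-combination update
    keeps the iterate feasible without any projection."""
    x = x0.copy()
    for t in range(steps):
        g = grad_fn(x)                      # gradient of the attack loss
        s = x0 + eps * np.sign(g)           # LMO solution (ball vertex)
        gamma = 2.0 / (t + 2.0)             # standard FW step size
        x = (1 - gamma) * x + gamma * s
    return x

# Toy target: binary logistic regression; the attack raises its loss.
rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.1
x0, y = rng.normal(size=8), 1.0
sigmoid = lambda z: 1 / (1 + np.exp(-z))
grad_fn = lambda x: (sigmoid(w @ x + b) - y) * w   # d(loss)/dx for label y
x_adv = frank_wolfe_attack(grad_fn, x0)
print("clean prob:", sigmoid(w @ x0 + b), "adv prob:", sigmoid(w @ x_adv + b))
```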

[398] GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

Main category: cs.LG

TL;DR: GTPO improves upon GRPO by addressing token-level penalization and policy collapse issues through selective gradient updates and entropy filtering, eliminating the need for KL-divergence regularization and reference models.

DetailsMotivation: GRPO suffers from training instability and suboptimal convergence due to two main issues: (1) token-level penalization where valuable tokens receive contradictory feedback, and (2) policy collapse where negative rewards penalize confident responses and shift decisions toward unlikely tokens.

Method: GTPO introduces two key improvements: (1) prevents conflicting gradients by skipping negative updates while amplifying positive ones for valuable tokens, and (2) filters out completions whose entropy exceeds a provable threshold to prevent policy collapse. Unlike GRPO, GTPO doesn’t require KL-divergence regularization or reference models.

Result: GTPO demonstrates greater training stability and improved performance compared to GRPO, validated through experiments on GSM8K, MATH, AIME 2024, AIME 2025, and AMC 2023 datasets.

Conclusion: GTPO successfully addresses GRPO’s limitations by eliminating contradictory gradient updates and preventing policy collapse, resulting in more stable training and better performance without requiring reference models or KL-divergence regularization.

Abstract: Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) the token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) the policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones and filters out completions whose entropy exceeds a provable threshold, to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025 and AMC 2023.
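
The two fixes can be sketched as a per-token reweighting: negative updates on shared ("valuable") tokens are skipped and positive ones amplified, and whole completions above an entropy threshold are dropped. Constants and the definition of the shared mask are illustrative, not the paper's exact formulation.

```python
import torch

def gtpo_token_weights(advantages, shared_mask, entropy, ent_threshold,
                       amplify=1.5):
    """Schematic GTPO-style weighting. Tokens shared across completions
    in a group keep only positive-advantage updates (amplified), so
    conflicting gradients are avoided; completions whose entropy
    exceeds the threshold are filtered out to prevent policy collapse."""
    w = advantages.clone()
    w[shared_mask & (advantages < 0)] = 0.0            # skip conflicts
    w[shared_mask & (advantages > 0)] *= amplify       # amplify positives
    keep = (entropy <= ent_threshold).float().unsqueeze(-1)
    return w * keep                                    # drop high-entropy completions

adv = torch.tensor([[0.5, -0.3, 0.2], [-0.4, 0.1, -0.2]])
shared = torch.tensor([[True, True, False], [True, False, False]])
ent = torch.tensor([1.2, 3.5])                         # per-completion entropy
print(gtpo_token_weights(adv, shared, ent, ent_threshold=2.0))
```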

[399] Hierarchical Dataset Selection for High-Quality Data Sharing

Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou

Main category: cs.LG

TL;DR: DaSH is a hierarchical dataset selection method that models utility at both dataset and group levels to efficiently select entire datasets from heterogeneous pools, outperforming existing methods by up to 26.2% in accuracy with fewer exploration steps.

DetailsMotivation: Real-world machine learning often involves data from multiple sources (public repositories, institutions) that vary in quality and relevance. Existing methods select individual samples and treat all data equally, ignoring differences between datasets and their sources, making dataset selection a critical but unsolved problem.

Method: DaSH (Dataset Selection via Hierarchies) formalizes dataset selection as selecting entire datasets from heterogeneous pools. It models utility at both dataset and group levels (e.g., collections, institutions), enabling efficient generalization from limited observations through hierarchical modeling.

Result: DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy on two public benchmarks (Digit-Five and DomainNet), while requiring significantly fewer exploration steps. It’s robust to low-resource settings and lack of relevant datasets.

Conclusion: DaSH provides an effective solution for scalable and adaptive dataset selection in practical multi-source learning workflows, addressing the critical need to select high-quality datasets rather than just individual samples from heterogeneous data sources.

Abstract: The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.

[400] Bidirectional Normalizing Flow: From Data to Noise and Back

Yiyang Lu, Qiao Sun, Xianbang Wang, Zhicheng Jiang, Hanhong Zhao, Kaiming He

Main category: cs.LG

TL;DR: BiFlow removes the need for exact analytic inverses in normalizing flows by learning a reverse model that approximates the noise-to-data mapping, enabling more flexible architectures and faster sampling.

DetailsMotivation: Standard normalizing flows require explicit invertibility constraints, and recent Transformer-based autoregressive flows (like TARFlow) suffer from causal decoding bottlenecks that slow down sampling.

Method: BiFlow learns a separate reverse model that approximates the inverse mapping from noise to data, rather than requiring exact analytic inverses. This allows more flexible loss functions and architectures without invertibility constraints.

Result: On ImageNet, BiFlow improves generation quality while accelerating sampling by up to two orders of magnitude compared to causal decoding methods. It achieves state-of-the-art results among NF-based methods and competitive performance with single-evaluation methods.

Conclusion: BiFlow demonstrates that removing the exact inverse requirement enables faster, higher-quality normalizing flows, potentially revitalizing interest in this classical generative modeling paradigm.

Abstract: Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation (“1-NFE”) methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.

[401] OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Yuping Yan, Yuhan Xie, Yuanshuai Li, Yingchao Yu, Lingjuan Lyu, Yaochu Jin

Main category: cs.LG

TL;DR: OutSafe-Bench is a comprehensive multimodal safety evaluation suite with 18K+ bilingual prompts, 4,500 images, 450 audio/video clips across 9 risk categories, featuring novel metrics (MCRS) and evaluation framework (FairScore) to assess MLLM safety vulnerabilities.

DetailsMotivation: Growing concerns about MLLMs outputting unsafe content (toxic language, biased imagery, privacy violations, misinformation) and limitations of current safety benchmarks in modality coverage and evaluation robustness.

Method: Created OutSafe-Bench dataset spanning 4 modalities with bilingual prompts and multimodal content. Introduced Multidimensional Cross Risk Score (MCRS) for overlapping risk assessment. Developed FairScore framework using top-performing models as adaptive juries for weighted aggregation.

Result: Evaluation of 9 state-of-the-art MLLMs revealed persistent and substantial safety vulnerabilities, highlighting significant gaps in current MLLM safety measures.

Conclusion: The work demonstrates pressing need for robust safeguards in MLLMs and provides comprehensive evaluation tools (OutSafe-Bench, MCRS, FairScore) to advance multimodal content safety research.

Abstract: Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.

[402] CC-GRMAS: A Multi-Agent Graph Neural System for Spatiotemporal Landslide Risk Assessment in High Mountain Asia

Mihir Panchal, Ying-Jung Chen, Surya Parkash

Main category: cs.LG

TL;DR: CC-GRMAS is a multi-agent framework using satellite data and environmental signals for improved landslide forecasting and disaster response in high mountain Asia.

DetailsMotivation: Landslides are increasing climate hazards in high mountain Asia with severe consequences. Current detection and response systems are fragmented and underdeveloped despite available satellite data.

Method: CC-GRMAS framework uses three interlinked agents (Prediction, Planning, Execution) that leverage satellite observations and environmental signals for real-time situational awareness, response planning, and intervention.

Result: The approach enhances landslide forecasting accuracy and enables proactive disaster response through multi-agent coordination and local environmental factor integration.

Conclusion: CC-GRMAS offers a scalable, proactive solution for climate-resilient disaster preparedness in vulnerable mountainous terrains by operationalizing multi-agent coordination with environmental data.

Abstract: Landslides are a growing climate-induced hazard with severe environmental and human consequences, particularly in high mountain Asia. Despite increasing access to satellite and temporal datasets, timely detection and disaster response remain underdeveloped and fragmented. This work introduces CC-GRMAS, a framework leveraging a series of satellite observations and environmental signals to enhance the accuracy of landslide forecasting. The system is structured around three interlinked agents: Prediction, Planning, and Execution, which collaboratively enable real-time situational awareness, response planning, and intervention. By incorporating local environmental factors and operationalizing multi-agent coordination, this approach offers a scalable and proactive solution for climate-resilient disaster preparedness across vulnerable mountainous terrains.

[403] Noisy Spiking Actor Network for Exploration

Ding Chen, Peixi Peng, Tiejun Huang, Yonghong Tian

Main category: cs.LG

TL;DR: NoisySAN introduces time-correlated noise to spiking neural networks for better exploration in deep RL, outperforming state-of-the-art methods on continuous control tasks.

DetailsMotivation: Spiking neural networks (SNNs) have strong robustness to noise due to their binary firing mechanism, which makes it difficult to achieve efficient exploration with local disturbances in reinforcement learning. The paper aims to solve this exploration problem in SNN-based RL.

Method: Proposes NoisySAN (noisy spiking actor network) that introduces time-correlated noise during both charging and transmission phases. Also develops a noise reduction method to find stable policies for agents.

Result: Extensive experiments demonstrate that the method outperforms state-of-the-art methods on a wide range of continuous control tasks from OpenAI gym.

Conclusion: The proposed NoisySAN successfully addresses the exploration challenge in SNN-based RL by introducing time-correlated noise and noise reduction techniques, achieving superior performance on continuous control benchmarks.

Abstract: As a general method for exploration in deep reinforcement learning (RL), NoisyNet can produce problem-specific exploration strategies. Spiking neural networks (SNNs), due to their binary firing mechanism, have strong robustness to noise, making it difficult to realize efficient exploration with local disturbances. To solve this exploration problem, we propose a noisy spiking actor network (NoisySAN) that introduces time-correlated noise during charging and transmission. Moreover, a noise reduction method is proposed to find a stable policy for the agent. Extensive experimental results demonstrate that our method outperforms state-of-the-art methods on a wide range of continuous control tasks from OpenAI gym.
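
Time-correlated noise can be pictured as an Ornstein-Uhlenbeck process injected into the charging dynamics of a leaky integrate-and-fire neuron, which perturbs membrane potentials persistently enough to matter despite binary firing. Parameters and the exact injection point are illustrative (the paper injects noise during both charging and transmission).

```python
import numpy as np

def lif_with_ou_noise(inputs, tau_mem=10.0, theta=0.2, sigma=0.3,
                      v_th=1.0, rng=None):
    """LIF neuron with time-correlated (Ornstein-Uhlenbeck) noise added
    during charging, in the spirit of NoisySAN: the OU state carries
    noise across time steps instead of resampling it independently."""
    rng = rng or np.random.default_rng(0)
    v, noise, spikes = 0.0, 0.0, []
    for x in inputs:
        noise += -theta * noise + sigma * rng.normal()   # OU update
        v += (-v + x + noise) / tau_mem                  # noisy charging
        if v >= v_th:
            spikes.append(1); v = 0.0                    # fire and reset
        else:
            spikes.append(0)
    return np.array(spikes)

print(lif_with_ou_noise(np.full(100, 0.5)).sum(), "spikes in 100 steps")
```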

[404] Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation

Seo Taek Kong, Sihan Zeng, Thinh T. Doan, R. Srikant

Main category: cs.LG

TL;DR: First nonasymptotic CLT for two-time-scale stochastic approximation with Polyak-Ruppert averaging, showing expected error decays at optimal 1/√n rate.

DetailsMotivation: Recent machine learning applications need finite-time error rates for two-time-scale stochastic approximation, but conventional analysis focuses on asymptotic convergence or suboptimal finite-time bounds. Prior CLTs suggest 1/√n error is possible, but existing finite-time rates are much slower.

Method: Derive first nonasymptotic central limit theorem with respect to Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging.

Result: Expected error achieved by Polyak-Ruppert averaging decays at rate 1/√n, significantly improving on prior convergence rates.

Conclusion: Establishes optimal finite-time convergence rate for two-time-scale stochastic approximation with averaging, bridging gap between asymptotic CLTs and practical finite-time performance.

Abstract: We consider linear two-time-scale stochastic approximation algorithms driven by martingale noise. Recent applications in machine learning motivate the need to understand finite-time error rates, but conventional stochastic approximation analyses focus on either asymptotic convergence in distribution or finite-time bounds that are far from optimal. Prior work on asymptotic central limit theorems (CLTs) suggests that two-time-scale algorithms may be able to achieve $1/\sqrt{n}$ error in expectation, with a constant given by the expected norm of the limiting Gaussian vector. However, the best known finite-time rates are much slower. We derive the first nonasymptotic central limit theorem with respect to the Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging. As a corollary, we show that the expected error achieved by Polyak-Ruppert averaging decays at rate $1/\sqrt{n}$, which significantly improves on the rates of convergence in prior works.
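
The object of study is concrete enough to simulate: a linear two-time-scale recursion with martingale-difference noise and Polyak-Ruppert averaging of the slow iterate. Multiplying the averaged error by sqrt(n) shows it hovering around a constant, consistent with the 1/sqrt(n) rate; the linear system and step-size exponents are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = 1.0, 1.0              # slow (x) and fast (y) iterates; fixed point (0, 0)
x_bar = 0.0                  # Polyak-Ruppert average of the slow iterate
for n in range(1, 200_001):
    a, b = n ** -0.6, n ** -0.4              # slow/fast step sizes
    xi, zeta = rng.normal(0.0, 0.1, size=2)  # martingale-difference noise
    x += a * (-(x + 0.5 * y) + xi)           # slow recursion
    y += b * (-(y - x) + zeta)               # fast recursion tracks y* = x
    x_bar += (x - x_bar) / n                 # running average
    if n % 50_000 == 0:
        print(f"n={n:7d}  sqrt(n)*|x_bar| = {n**0.5 * abs(x_bar):.3f}")
```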

[405] Deferred Poisoning: Making the Model More Vulnerable via Hessian Singularization

Yuhao He, Jinyu Tian, Xianwei Zheng, Li Dong, Yuanman Li, Jiantao Zhou

Main category: cs.LG

TL;DR: The paper introduces a stealthy “Deferred Poisoning Attack” that makes models vulnerable to evasion attacks/natural noise while appearing normal during training/validation, achieved via singular Hessian regularization.

DetailsMotivation: Traditional poisoning attacks are less threatening because they create detectable inconsistencies between training and validation performance, alerting defenders. The authors aim to create a more stealthy attack that only reveals vulnerabilities later.

Method: Proposes Deferred Poisoning Attack that ensures poisoned models have similar loss values as normal models but with large local curvature. Uses Singularization Regularization to make the model’s Hessian singular at optimal points, creating high sensitivity to small perturbations while maintaining normal-appearing training/validation performance.

Result: Theoretical and empirical analysis validates effectiveness on image classification tasks. The attack successfully creates models that appear normal during training/validation but are highly vulnerable to evasion attacks and natural noise, demonstrating worse robustness without detection.

Conclusion: Deferred Poisoning Attack represents a more threatening form of poisoning that bypasses traditional detection methods by only revealing vulnerabilities later, offering new security research perspectives and highlighting the need for more robust defenses.

Abstract: Recent studies have shown that deep learning models are very vulnerable to poisoning attacks. Many defense methods have been proposed to address this issue. However, traditional poisoning attacks are not as threatening as commonly believed. This is because they often cause differences in how the model performs on the training set compared to the validation set. Such inconsistency can alert defenders that their data has been poisoned, allowing them to take the necessary defensive actions. In this paper, we introduce a more threatening type of poisoning attack called the Deferred Poisoning Attack. This new attack allows the model to function normally during the training and validation phases but makes it very sensitive to evasion attacks or even natural noise. We achieve this by ensuring the poisoned model’s loss function has a similar value to that of a normally trained model at each input sample but with a large local curvature. A similar model loss ensures that there is no obvious inconsistency between the training and validation accuracy, demonstrating high stealthiness. On the other hand, the large curvature implies that a small perturbation may cause a significant increase in model loss, leading to substantial performance degradation, which reflects worse robustness. We fulfill this purpose by making the model have singular Hessian information at the optimal point via our proposed Singularization Regularization term. We have conducted both theoretical and empirical analyses of the proposed method and validated its effectiveness through experiments on image classification tasks. Furthermore, we have confirmed the hazards of this form of poisoning attack under more general scenarios using natural noise, offering a new perspective for research in the field of security.
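
The "large local curvature" the attack induces can be probed with a Hutchinson-style estimate of the input-space Hessian trace via double backprop. This is a diagnostic sketch, not the attack itself or the proposed Singularization Regularization term.

```python
import torch

def curvature_probe(loss_fn, x, n_probes=8):
    """Hutchinson estimate of the input-space Hessian trace at x:
    average v^T H v over random Rademacher vectors v, with the
    Hessian-vector products computed by double backprop."""
    x = x.detach().requires_grad_(True)
    loss = loss_fn(x)
    (grad,) = torch.autograd.grad(loss, x, create_graph=True)
    est = 0.0
    for _ in range(n_probes):
        v = torch.randint(0, 2, x.shape).float() * 2 - 1   # Rademacher probe
        hv = torch.autograd.grad((grad * v).sum(), x, retain_graph=True)[0]
        est += (v * hv).sum().item()
    return est / n_probes

x = torch.randn(16)
# For loss = sum(z^4), the Hessian is 12*diag(z^2), trace = 12*sum(z^2).
print(curvature_probe(lambda z: (z ** 4).sum(), x))
print(12 * (x ** 2).sum().item())
```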

[406] Enhanced Spatial Clustering of Single-Molecule Localizations with Graph Neural Networks

Jesús Pineda, Sergi Masó-Orriols, Montse Masoliver, Joan Bertran, Mattias Goksör, Giovanni Volpe, Carlo Manzo

Main category: cs.LG

TL;DR: MIRO is a graph neural network algorithm that transforms single-molecule localization microscopy point clouds to improve clustering efficiency for analyzing molecular organization.

DetailsMotivation: Spatial cluster analysis of single-molecule localization microscopy data is challenging due to localization noise, high point density, and complex biological structures, requiring better methods for accurate molecular organization analysis.

Method: MIRO uses recurrent graph neural networks to transform point clouds, enabling improved clustering efficiency when applying conventional clustering techniques. It supports simultaneous processing of clusters with different shapes and at multiple scales.

Result: MIRO demonstrates improved performance across varied datasets, showing transformative potential for single-molecule localization applications with accurate, reliable details of molecular architecture.

Conclusion: MIRO revolutionizes cluster analysis in single-molecule localization microscopy and has promising applications in neuroscience (neural connectivity) and environmental science (ecological spatial distributions).

Abstract: Single-molecule localization microscopy generates point clouds corresponding to fluorophore localizations. Spatial cluster identification and analysis of these point clouds are crucial for extracting insights about molecular organization. However, this task becomes challenging in the presence of localization noise, high point density, or complex biological structures. Here, we introduce MIRO (Multifunctional Integration through Relational Optimization), an algorithm that uses recurrent graph neural networks to transform the point clouds in order to improve clustering efficiency when applying conventional clustering techniques. We show that MIRO supports simultaneous processing of clusters of different shapes and at multiple scales, demonstrating improved performance across varied datasets. Our comprehensive evaluation demonstrates MIRO’s transformative potential for single-molecule localization applications, showcasing its capability to revolutionize cluster analysis and provide accurate, reliable details of molecular architecture. In addition, MIRO’s robust clustering capabilities hold promise for applications in various fields such as neuroscience, for the analysis of neural connectivity patterns, and environmental science, for studying spatial distributions of ecological data.
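
A toy rendering of the MIRO recipe, with one hand-written neighborhood-contraction step standing in for the learned recurrent GNN: transform the point cloud so clusters tighten, then hand the result to a conventional method such as DBSCAN. The parameters (`k`, `alpha`, `eps`) and the synthetic blobs are arbitrary.

```python
# Transform-then-cluster sketch: an untrained neighborhood-averaging step
# plays the role of MIRO's learned recurrent GNN transform.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in (0, 5)])

def contract(points, k=10, alpha=0.5, steps=3):
    """Pull each localization toward its k-NN centroid, tightening clusters."""
    for _ in range(steps):
        nbrs = NearestNeighbors(n_neighbors=k).fit(points)
        _, idx = nbrs.kneighbors(points)
        points = (1 - alpha) * points + alpha * points[idx].mean(axis=1)
    return points

# conventional clustering applied to the transformed point cloud
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(contract(points))
```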

[407] Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits

Shogo Iwazaki

Main category: cs.LG

TL;DR: The paper shows that noise-free GP-UCB achieves nearly optimal regret bounds, including constant cumulative regret for certain kernels, resolving the gap between theoretical and empirical performance.

DetailsMotivation: There's a gap between GP-UCB's strong empirical performance in noise-free Gaussian Process bandits and its suboptimal theoretical regret bounds compared to other algorithms. The paper aims to resolve this discrepancy by providing better theoretical analysis.

Method: The authors analyze the noise-free Gaussian Process bandits problem using GP-UCB algorithm, focusing on theoretical regret analysis for specific kernel types (squared exponential and Matérn kernels).

Result: The analysis shows nearly optimal regret upper bounds for noise-free GP-UCB, including the first constant cumulative regret results for squared exponential kernel and Matérn kernel with sufficient smoothness.

Conclusion: The paper successfully bridges the theory-practice gap for GP-UCB in noise-free settings, demonstrating its theoretical near-optimality matches its strong empirical performance.

Abstract: We study the noise-free Gaussian Process (GP) bandits problem, in which the learner seeks to minimize regret through noise-free observations of the black-box objective function lying on the known reproducing kernel Hilbert space (RKHS). Gaussian process upper confidence bound (GP-UCB) is the well-known GP-bandits algorithm whose query points are adaptively chosen based on the GP-based upper confidence bound score. Although several existing works have reported the practical success of GP-UCB, the current theoretical results indicate its suboptimal performance. However, GP-UCB tends to perform well empirically compared with other nearly optimal noise-free algorithms that rely on a non-adaptive sampling scheme of query points. This paper resolves this gap between theoretical and empirical performance by showing the nearly optimal regret upper bound of noise-free GP-UCB. Specifically, our analysis shows the first constant cumulative regret in the noise-free settings for the squared exponential kernel and Matérn kernel with some degree of smoothness.
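
For readers unfamiliar with the algorithm being analyzed, a minimal noise-free GP-UCB loop on a 1-D toy objective looks like the sketch below; the RBF lengthscale and the confidence multiplier are ad hoc choices, not the paper’s theoretical schedule.

```python
# Minimal noise-free GP-UCB: fit a GP posterior to past (noise-free)
# observations, then query the point maximizing mean + beta * std.
import numpy as np

def rbf(a, b, ls=0.15):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls ** 2))

f = lambda x: np.sin(6 * x) * x          # "unknown" objective in the RKHS
grid = np.linspace(0, 1, 200)            # candidate query points
X, y = [0.5], [f(0.5)]                   # initial observation

for t in range(15):
    Xa = np.array(X)
    K = rbf(Xa, Xa) + 1e-8 * np.eye(len(Xa))   # jitter for numerical stability
    Ks = rbf(grid, Xa)
    mu = Ks @ np.linalg.solve(K, np.array(y))
    var = 1.0 - np.einsum('ij,ij->i', Ks @ np.linalg.inv(K), Ks)
    ucb = mu + 2.0 * np.sqrt(np.clip(var, 0, None))  # beta = 2 is arbitrary
    x_next = grid[np.argmax(ucb)]
    X.append(x_next); y.append(f(x_next))
```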

[408] Balanced Online Class-Incremental Learning via Dual Classifiers

Shunjie Wen, Thomas Heinis, Dong-Wan Choi

Main category: cs.LG

TL;DR: BISON: A replay-based OCIL method using dual classifiers with inclusive training separation to achieve better balance between plasticity and stability.

DetailsMotivation: Existing OCIL methods struggle to balance plasticity (learning new classes) and stability (preserving old knowledge) due to exclusive training separation and difficulty in knowledge integration.

Method: Proposes Balanced Inclusive Separation (BISON) with dual classifiers and inclusive training strategy that allows knowledge from both old and new classes to be integrated, plus implicit knowledge transfer between classifiers.

Result: Extensive experiments on three OCIL benchmarks show BISON achieves more balanced and superior performance compared to state-of-the-art replay-based methods.

Conclusion: BISON effectively addresses the plasticity-stability trade-off in OCIL through inclusive training separation and dual classifiers, demonstrating better balanced performance.

Abstract: Online class-incremental learning (OCIL) focuses on gradually learning new classes (called plasticity) from a stream of data in a single pass, while concurrently preserving knowledge of previously learned classes (called stability). The primary challenge in OCIL lies in maintaining a good balance between the knowledge of old and new classes within the continually updated model. Most existing methods rely on explicit knowledge interaction through experience replay, and often employ exclusive training separation to address bias problems. Nevertheless, it still remains a big challenge to achieve a well-balanced learner, as these methods often exhibit either reduced plasticity or limited stability due to difficulties in continually integrating knowledge in the OCIL setting. In this paper, we propose a novel replay-based method, called Balanced Inclusive Separation for Online iNcremental learning (BISON), which can achieve both high plasticity and stability, thus ensuring more balanced performance in OCIL. Our BISON method proposes an inclusive training separation strategy using dual classifiers so that knowledge from both old and new classes can effectively be integrated into the model, while introducing implicit approaches for transferring knowledge across the two classifiers. Extensive experimental evaluations over three widely-used OCIL benchmark datasets demonstrate the superiority of BISON, showing more balanced yet better performance compared to state-of-the-art replay-based OCIL methods.

[409] Internal Evaluation of Density-Based Clusterings with Noise

Anna Beer, Lena Krieger, Pascal Weber, Martin Ritzert, Ira Assent, Claudia Plant

Main category: cs.LG

TL;DR: DISCO is a new cluster validation index that evaluates noise assignments in density-based clustering methods like DBSCAN/HDBSCAN, unlike traditional CVIs that ignore noise quality assessment.

DetailsMotivation: Most cluster validation indices fail to properly evaluate noise assignments in density-based clustering methods, even though accurate noise detection is crucial for successful clustering. Existing CVIs either ignore noise or simply count noise points without assessing their quality.

Method: DISCO extends the Silhouette Coefficient concept by incorporating density-connectivity to handle arbitrary cluster shapes and introduces explicit noise evaluation. It rewards correctly assigned noise labels and penalizes noise labels that should belong to clusters. The pointwise definition enables seamless integration of noise evaluation into overall clustering assessment.

Result: DISCO is the first CVI to explicitly assess noise assignment quality rather than just counting noise points. It handles edge cases like singleton clusters or single cluster plus noise scenarios that regularly appear in clustering outputs.

Conclusion: DISCO provides a comprehensive cluster validation index that properly evaluates noise assignments in density-based clustering, enabling more accurate assessment of clustering quality and explainable evaluations of clustered data.

Abstract: Being able to evaluate the quality of a clustering result even in the absence of ground truth cluster labels is fundamental for research in data mining. However, most cluster validation indices (CVIs) do not capture noise assignments by density-based clustering methods like DBSCAN or HDBSCAN, even though the ability to correctly determine noise is crucial for successful clustering. In this paper, we propose DISCO, a Density-based Internal Score for Clusterings with nOise, the first CVI to explicitly assess the quality of noise assignments rather than merely counting them. DISCO is based on the established idea of the Silhouette Coefficient, but adopts density-connectivity to evaluate clusters of arbitrary shapes, and proposes explicit noise evaluation: it rewards correctly assigned noise labels and penalizes noise labels where a cluster label would have been more appropriate. The pointwise definition of DISCO allows for the seamless integration of noise evaluation into the final clustering evaluation, while also enabling explainable evaluations of the clustered data. In contrast to most state-of-the-art, DISCO is well-defined and also covers edge cases that regularly appear as output from clustering algorithms, such as singleton clusters or a single cluster plus noise.
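
The gap DISCO fills is easy to reproduce: scikit-learn’s Silhouette Coefficient treats DBSCAN’s noise label (-1) as just another cluster, so the quality of noise assignments is never assessed. A small demonstration on synthetic data with arbitrary parameters:

```python
# Demonstrating the problem DISCO addresses: the standard silhouette score
# treats the noise label -1 as an ordinary cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
blob = rng.normal(0, 0.2, size=(80, 2))
noise = rng.uniform(-3, 3, size=(20, 2))       # scattered background points
X = np.vstack([blob, blob + 4, noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# The -1 entries enter the score as a regular "cluster"; a CVI like DISCO
# instead scores each noise assignment as correct or incorrect.
print(silhouette_score(X, labels))
```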

[410] LLM4FS: Leveraging Large Language Models for Feature Selection

Jianhao Li, Xianchao Xiu

Main category: cs.LG

TL;DR: LLM4FS: A hybrid feature selection method combining LLMs with traditional data-driven techniques that outperforms both individual approaches.

DetailsMotivation: To leverage the strengths of both LLMs (contextual understanding) and traditional data-driven methods (statistical reliability) for improved feature selection performance, addressing limitations of each approach individually.

Method: Proposes LLM4FS hybrid strategy that inputs data samples into LLMs and directly integrates traditional techniques like random forest and forward sequential selection, combining LLM contextual analysis with statistical methods.

Result: The hybrid approach achieves excellent feature selection performance, surpassing both LLM-only methods (including state-of-the-art models like DeepSeek-R1, GPT-o3-mini, GPT-4.5) and traditional data-driven methods alone.

Conclusion: LLM4FS demonstrates the value of combining LLMs with traditional techniques for feature selection, though limitations in decision-making applications are acknowledged, with code made publicly available.

Abstract: Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a new hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, it feeds data samples into LLMs and directly calls traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy leverages the contextual understanding of LLMs and the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, even surpassing both LLM-based and traditional data-driven methods on their own. Finally, we point out the limitations of its application in decision-making. Our code is available at https://github.com/xianchaoxiu/LLM4FS.
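
A sketch of the traditional, data-driven half of such a hybrid pipeline using scikit-learn; the LLM half (scoring features from their textual descriptions) is omitted, and the dataset and model settings are placeholders.

```python
# Random-forest importances plus forward sequential selection, the two
# traditional techniques named in the abstract.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_top = np.argsort(rf.feature_importances_)[::-1][:5]   # top-5 by importance

sfs = SequentialFeatureSelector(RandomForestClassifier(n_estimators=50),
                                n_features_to_select=5,
                                direction='forward').fit(X, y)
sfs_top = np.flatnonzero(sfs.get_support())
# A hybrid strategy could then intersect or rank-aggregate these sets with an
# LLM's feature ranking.
```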

[411] Learning (Approximately) Equivariant Networks via Constrained Optimization

Andrei Manolache, Luiz F. O. Chamon, Mathias Niepert

Main category: cs.LG

TL;DR: ACE (Adaptive Constrained Equivariance) is a constrained optimization method that gradually transitions flexible non-equivariant models toward equivariance, balancing symmetry constraints with data fitting needs.

DetailsMotivation: Real-world data often has imperfect symmetries due to noise, structural variation, or measurement bias. Strictly equivariant models struggle to fit such data, while unconstrained models can't leverage partial symmetries. Even with symmetric data, enforced equivariance can limit parameter exploration during training.

Method: ACE uses constrained optimization inspired by homotopy principles. It starts with a flexible non-equivariant model and gradually reduces its deviation from equivariance through adaptive constraint tightening, creating a smooth training trajectory that finds a data-driven balance.

Result: Across multiple architectures and tasks, ACE consistently improves performance metrics, sample efficiency, and robustness to input perturbations compared to strictly equivariant models and heuristic equivariance relaxations.

Conclusion: ACE provides a principled approach to handle partial symmetries in real-world data, offering better performance than both rigidly equivariant models and unconstrained approaches by adaptively balancing symmetry constraints with data fitting needs.

Abstract: Equivariant neural networks are designed to respect symmetries through their architecture, boosting generalization and sample efficiency when those symmetries are present in the data distribution. Real-world data, however, often departs from perfect symmetry because of noise, structural variation, measurement bias, or other symmetry-breaking effects. Strictly equivariant models may struggle to fit the data, while unconstrained models lack a principled way to leverage partial symmetries. Even when the data is fully symmetric, enforcing equivariance can hurt training by limiting the model to a restricted region of the parameter space. Guided by homotopy principles, where an optimization problem is solved by gradually transforming a simpler problem into a complex one, we introduce Adaptive Constrained Equivariance (ACE), a constrained optimization approach that starts with a flexible, non-equivariant model and gradually reduces its deviation from equivariance. This gradual tightening smooths training early on and settles the model at a data-driven equilibrium, balancing between equivariance and non-equivariance. Across multiple architectures and tasks, our method consistently improves performance metrics, sample efficiency, and robustness to input perturbations compared with strictly equivariant models and heuristic equivariance relaxations.
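
The core mechanism can be sketched under a toy symmetry (sign flips, $g(x) = -x$): fit freely at first, then ramp up an equivariance penalty so the model settles toward $f(-x) = -f(x)$. The fixed linear ramp below is a simplification of ACE’s adaptive constrained-optimization update.

```python
# Start non-equivariant, then gradually tighten an equivariance penalty.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 8))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
target = nn.Linear(8, 8)                       # stand-in ground-truth map

for step in range(500):
    x = torch.randn(64, 8)
    with torch.no_grad():
        y = target(x)                          # synthetic regression targets
    fit_loss = ((f(x) - y) ** 2).mean()
    equiv_gap = ((f(-x) + f(x)) ** 2).mean()   # ||f(gx) - g f(x)||^2, g = -I
    lam = min(1.0, step / 250)                 # gradually tightened multiplier
    loss = fit_loss + lam * equiv_gap
    opt.zero_grad(); loss.backward(); opt.step()
```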

[412] Improved Regret Bounds for Gaussian Process Upper Confidence Bound in Bayesian Optimization

Shogo Iwazaki

Main category: cs.LG

TL;DR: GP-UCB achieves near-optimal regret bounds for Bayesian optimization under Matérn and squared exponential kernels, closing the gap to known lower bounds.

DetailsMotivation: The paper addresses the Bayesian optimization/Gaussian process bandit problem, where existing regret bounds for GP-UCB didn't match the best-known lower bounds. There was a gap between theoretical guarantees and what was achievable.

Method: The authors analyze the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm. The key innovation is a refined proof technique that captures the concentration behavior of the input sequence generated by GP-UCB, enabling better analysis of the Gaussian process’s information gain.

Result: Under Matérn kernels with certain smoothness, GP-UCB achieves $\tilde{O}(\sqrt{T})$ cumulative regret with high probability. For squared exponential kernels, the algorithm achieves $O(\sqrt{T \ln^2 T})$ regret. These results match or come close to the best-known lower bounds from Scarlett (2018).

Conclusion: The paper successfully closes the theoretical gap between GP-UCB’s performance and known lower bounds, showing that GP-UCB achieves near-optimal regret rates for Bayesian optimization with Gaussian processes under common kernel choices.

Abstract: This paper addresses the Bayesian optimization problem (also referred to as the Bayesian setting of the Gaussian process bandit), where the learner seeks to minimize the regret under a function drawn from a known Gaussian process (GP). Under a Matérn kernel with a certain degree of smoothness, we show that the Gaussian process upper confidence bound (GP-UCB) algorithm achieves $\tilde{O}(\sqrt{T})$ cumulative regret with high probability. Furthermore, our analysis yields $O(\sqrt{T \ln^2 T})$ regret under a squared exponential kernel. These results fill the gap between the existing regret upper bound for GP-UCB and the best-known bound provided by Scarlett (2018). The key idea in our proof is to capture the concentration behavior of the input sequence realized by GP-UCB, enabling a more refined analysis of the GP’s information gain.

[413] Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks

Artun Saday, Yaşar Cahit Yıldırım, Cem Tekin

Main category: cs.LG

TL;DR: The paper proposes novel algorithms for Gaussian Process optimization under adversarial perturbations using a robust satisficing framework, where the goal is to consistently achieve a predefined performance threshold rather than optimizing for worst-case scenarios.

DetailsMotivation: Traditional robust optimization focuses on worst-case performance maximization, which may be overly conservative. The authors address GP optimization with unknown and varying adversarial perturbations, proposing a more practical robust satisficing objective where the goal is to consistently achieve a predefined performance threshold under adversarial conditions.

Method: Two novel algorithms based on distinct formulations of robust satisficing, both instances of a general robust satisficing framework. The algorithms offer different guarantees depending on the adversary’s nature: one with sublinear regret bounds under certain conditions, and another that scales with perturbation magnitude but requires no adversary assumptions.

Result: Derived two regret bounds: (1) sublinear over time with certain conditions on adversary and satisficing threshold τ, and (2) scaling with perturbation magnitude but requiring no assumptions on the adversary. Experimental results show the approach outperforms established robust optimization methods in achieving satisficing objectives, especially when ambiguity sets are inaccurately specified.

Conclusion: The proposed robust satisficing framework provides a practical alternative to traditional robust optimization for GP optimization under adversarial perturbations, offering flexible algorithms with different theoretical guarantees and demonstrating superior performance in achieving consistent threshold satisfaction.

Abstract: We address the problem of Gaussian Process (GP) optimization in the presence of unknown and potentially varying adversarial perturbations. Unlike traditional robust optimization approaches that focus on maximizing performance under worst-case scenarios, we consider a robust satisficing objective, where the goal is to consistently achieve a predefined performance threshold $\tau$, even under adversarial conditions. We propose two novel algorithms based on distinct formulations of robust satisficing, and show that they are instances of a general robust satisficing framework. Further, each algorithm offers different guarantees depending on the nature of the adversary. Specifically, we derive two regret bounds: one that is sublinear over time, assuming certain conditions on the adversary and the satisficing threshold $\tau$, and another that scales with the perturbation magnitude but requires no assumptions on the adversary. Through extensive experiments, we demonstrate that our approach outperforms the established robust optimization methods in achieving the satisficing objective, particularly when the ambiguity set of the robust optimization framework is inaccurately specified.

[414] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari

Main category: cs.LG

TL;DR: ENMA is a generative neural operator that predicts spatio-temporal PDE dynamics using masked autoregressive transformers in latent space, enabling robust generalization across physical parameters with in-context learning.

DetailsMotivation: Solving time-dependent parametric PDEs is challenging for neural solvers, especially with uncertain/incomplete data. Generative models offer a natural approach to handle these limitations and generalize across diverse physical regimes.

Method: ENMA uses a generative masked autoregressive transformer with flow matching loss for tokenwise generation in compressed latent space. Irregular spatial observations are encoded via attention mechanisms and spatio-temporal convolutional encoder, enabling in-context learning by conditioning on past states or similar auxiliary trajectories.

Result: ENMA creates a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

Conclusion: ENMA provides an effective generative approach for modeling spatio-temporal dynamics from physical phenomena, addressing data uncertainty and enabling generalization across diverse parametric PDE settings through latent space prediction and in-context learning.

Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete, as is often the case, a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

[415] Symmetry in Neural Network Parameter Spaces

Bo Zhao, Robin Walters, Rose Yu

Main category: cs.LG

TL;DR: Survey paper on parameter space symmetries in deep learning, explaining how transformations that leave network functions unchanged shape loss landscapes and constrain learning dynamics.

DetailsMotivation: Deep learning models are highly overparameterized with many parameter configurations yielding same outputs. Understanding symmetries in parameter space can provide new insights into optimization, generalization, and model complexity that complement existing theory.

Method: Survey methodology - summarizing existing literature on parameter space symmetry, uncovering connections between symmetry and learning theory, and identifying gaps in this emerging field.

Result: Provides overview of parameter space symmetry research, connects symmetry concepts to learning theory, and identifies opportunities for future work in understanding deep learning through symmetry lens.

Conclusion: Parameter space symmetries offer a valuable new perspective for understanding deep learning phenomena that complements existing theoretical frameworks, with potential to advance optimization, generalization, and model complexity analysis.

Abstract: Modern deep learning models are highly overparameterized, resulting in large sets of parameter configurations that yield the same outputs. A significant portion of this redundancy is explained by symmetries in the parameter space–transformations that leave the network function unchanged. These symmetries shape the loss landscape and constrain learning dynamics, offering a new lens for understanding optimization, generalization, and model complexity that complements existing theory of deep learning. This survey provides an overview of parameter space symmetry. We summarize existing literature, uncover connections between symmetry and learning theory, and identify gaps and opportunities in this emerging field.
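
One canonical example of such a symmetry, positive rescaling around a ReLU layer, can be verified in a few lines: scaling the first layer by $c > 0$ and dividing the next layer by $c$ changes the parameters but leaves the network function unchanged, because $\mathrm{relu}(cz) = c\,\mathrm{relu}(z)$.

```python
# A concrete parameter-space symmetry: ReLU rescaling invariance.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=8)
c = 3.7  # any c > 0 works

out = W2 @ relu(W1 @ x + b1)
out_rescaled = (W2 / c) @ relu(c * W1 @ x + c * b1)  # relu(c*z) = c*relu(z)
assert np.allclose(out, out_rescaled)  # same function, different parameters
```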

[416] Geometric Regularity in Deterministic Sampling Dynamics of Diffusion-based Generative Models

Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu

Main category: cs.LG

TL;DR: Diffusion model sampling trajectories exhibit surprising geometric regularity - all follow nearly identical boomerang-shaped paths in extremely low-dimensional subspaces, regardless of model architecture or content.

DetailsMotivation: The paper aims to uncover hidden geometric patterns in diffusion model sampling dynamics, which could lead to better understanding and improved sampling efficiency.

Method: Analyzed deterministic sampling trajectories of diffusion models, characterized their geometric properties, and proposed a dynamic programming scheme to align sampling schedules with trajectory structure.

Result: Discovered that all sampling trajectories lie in extremely low-dimensional subspaces and exhibit almost identical boomerang shapes across different models, conditions, and generated content.

Conclusion: The discovered trajectory regularity enables practical improvements to sampling efficiency through better time schedule alignment, achieving superior image generation with minimal computational overhead.

Abstract: Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics of diffusion generative models: each simulated sampling trajectory along the gradient field lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical boomerang shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing deterministic numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in regions with only 5 - 10 function evaluations.

[417] T-SHRED: Symbolic Regression for Regularization and Model Discovery with Transformer Shallow Recurrent Decoders

Alexey Yermakov, David Zoro, Mars Liyao Gao, J. Nathan Kutz

Main category: cs.LG

TL;DR: T-SHRED modifies SHRED by replacing RNNs with transformers and adding symbolic regression via SINDy attention, enabling interpretable latent dynamics and avoiding auto-regressive forecasting.

DetailsMotivation: To improve upon SHRED models by enhancing interpretability and avoiding auto-regressive long-term forecasting limitations through symbolic regression and transformer-based temporal encoding.

Method: Modified SHRED architecture by replacing RNN temporal encoding with transformers, incorporating SINDy attention mechanism for sparse identification of nonlinear dynamics, and using symbolic regression to regularize and interpret latent space dynamics.

Result: T-SHRED was analyzed on three different dynamical systems across low-data to high-data regimes, demonstrating improved interpretability while maintaining forecasting capabilities.

Conclusion: T-SHRED successfully combines transformers with symbolic regression to create interpretable models for chaotic dynamical systems forecasting from sparse measurements, overcoming limitations of auto-regressive approaches.

Abstract: SHallow REcurrent Decoders (SHRED) are effective for system identification and forecasting from sparse sensor measurements. Such models are light-weight and computationally efficient, allowing them to be trained on consumer laptops. SHRED-based models rely on Recurrent Neural Networks (RNNs) and a simple Multi-Layer Perceptron (MLP) for the temporal encoding and spatial decoding respectively. Despite the relatively simple structure of SHRED, they are able to predict chaotic dynamical systems on different physical, spatial, and temporal scales directly from a sparse set of sensor measurements. In this work, we modify SHRED by leveraging transformers (T-SHRED) embedded with symbolic regression for the temporal encoding, circumventing auto-regressive long-term forecasting for physical data. This is achieved by incorporating a new sparse identification of nonlinear dynamics (SINDy) attention mechanism into T-SHRED to impose sparsity regularization on the latent space, which also allows for immediate symbolic interpretation. Symbolic regression improves model interpretability by learning and regularizing the dynamics of the latent space during training. We analyze the performance of T-SHRED on three different dynamical systems ranging from low-data to high-data regimes.
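
For context, the SINDy machinery that T-SHRED embeds boils down to sparse regression over a library of candidate terms. A minimal sequential-thresholded least-squares (STLSQ) sketch on the toy system $\dot{x} = -2x$, with an arbitrary threshold and library:

```python
# SINDy in miniature: sparse regression of dx/dt onto a candidate library.
import numpy as np

t = np.linspace(0, 2, 200)
x = 3.0 * np.exp(-2.0 * t)              # trajectory of dx/dt = -2x
dx = np.gradient(x, t)                  # numerical derivative

Theta = np.column_stack([x, x**2, x**3])  # candidate-function library

xi = np.linalg.lstsq(Theta, dx, rcond=None)[0]
for _ in range(10):                       # sequential thresholding (STLSQ)
    small = np.abs(xi) < 0.1
    xi[small] = 0.0
    big = ~small
    if big.any():
        xi[big] = np.linalg.lstsq(Theta[:, big], dx, rcond=None)[0]
print(xi)  # approximately [-2, 0, 0]: the sparse dynamics are recovered
```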

[418] Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

Bornali Phukon, Xiuwen Zheng, Mark Hasegawa-Johnson

Main category: cs.LG

TL;DR: Proposed a new ASR evaluation metric combining NLI, semantic similarity, and phonetic similarity to better capture intelligibility for dysarthric/dysphonic speech, achieving 0.890 correlation with human judgments.

DetailsMotivation: Traditional ASR metrics like WER/CER fail to capture intelligibility for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types but human listeners can understand the meaning despite errors.

Method: Proposed a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity to evaluate ASR output for dysarthric speech. Used Speech Accessibility Project data for evaluation.

Result: The proposed metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods like WER and CER.

Conclusion: There is a need to prioritize intelligibility over error-based measures for dysarthric/dysphonic speech evaluation. The proposed metric effectively captures intelligibility where traditional metrics fail.

Abstract: Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.

[419] Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort

Thomas Jack Samuels, Franco Rugolon, Stephan Hau, Lennart Högman

Main category: cs.LG

TL;DR: Multimodal ML detects deception in dyadic interactions better than single-modality approaches, with best results (71% accuracy) using late fusion of both participants’ audio and video data.

DetailsMotivation: To investigate deception detection in dyadic interactions using multimodal machine learning, focusing on both deceiver and deceived participants, and to establish baseline research for Scandinavian populations.

Method: Used early and late fusion approaches with audio (speech) and video (Action Units and gaze) data from both participants. Analyzed all modality combinations on a new Swedish dataset of truth/lie scenarios about emotional topics.

Result: Combining speech and facial information outperformed single modalities. Including both participants’ data significantly improved accuracy, with best performance (71%) from late fusion of both modalities and participants.

Conclusion: Multimodal approaches with data from both participants enhance deception detection, supporting psychological theories about differential control of facial/vocal expressions. This Scandinavian study provides foundation for future dyadic interaction research, especially in psychotherapy.

Abstract: This study investigates the efficacy of using multimodal machine learning techniques to detect deception in dyadic interactions, focusing on the integration of data from both the deceiver and the deceived. We compare early and late fusion approaches, utilizing audio and video data - specifically, Action Units and gaze information - across all possible combinations of modalities and participants. Our dataset, newly collected from Swedish native speakers engaged in truth or lie scenarios on emotionally relevant topics, serves as the basis for our analysis. The results demonstrate that incorporating both speech and facial information yields superior performance compared to single-modality approaches. Moreover, including data from both participants significantly enhances deception detection accuracy, with the best performance (71%) achieved using a late fusion strategy applied to both modalities and participants. These findings align with psychological theories suggesting differential control of facial and vocal expressions during initial interactions. As the first study of its kind on a Scandinavian cohort, this research lays the groundwork for future investigations into dyadic interactions, particularly within psychotherapy settings.
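
Late fusion in miniature: train one classifier per modality and average their predicted probabilities. The synthetic arrays below stand in for real speech embeddings, Action Units, and gaze features, and the equal fusion weights are an arbitrary choice.

```python
# Late fusion: per-modality classifiers, combined at the probability level.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                          # 1 = deceptive
audio = rng.normal(size=(n, 10)) + y[:, None] * 0.5  # stand-in speech features
video = rng.normal(size=(n, 12)) + y[:, None] * 0.3  # stand-in AU/gaze features

clf_a = LogisticRegression().fit(audio, y)
clf_v = LogisticRegression().fit(video, y)

proba = 0.5 * clf_a.predict_proba(audio) + 0.5 * clf_v.predict_proba(video)
pred = proba.argmax(axis=1)                        # fused decision
```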

[420] Proof of a perfect platonic representation hypothesis

Liu Ziyin, Isaac Chuang

Main category: cs.LG

TL;DR: The paper provides a detailed explanation of the proof for the “perfect” Platonic Representation Hypothesis in embedded deep linear networks, showing SGD leads to identical representations across layers up to rotation, and connects this to progressive sharpening phenomena.

DetailsMotivation: To elaborate and explain in detail the proof of the Platonic Representation Hypothesis for embedded deep linear networks, making it instructive while avoiding jargon and technical complexity.

Method: Analyzes the proof by Ziyin et al. (2025) showing that when trained with stochastic gradient descent, two embedded deep linear networks with different architectures trained on different data become “Perfectly Platonic” - all layer pairs learn identical representations up to rotation.

Result: SGD finds perfectly Platonic solutions despite most global minima not being Platonic, identifies six ways the hypothesis can be broken, and shows Platonic representations emerge from the same cause as progressive sharpening.

Conclusion: The theory highlights the importance of understanding emergent “entropic forces” from SGD’s irreversibility in representation learning, revealing a common cause for two seemingly unrelated deep learning phenomena.

Abstract: In this note, we elaborate on and explain in detail the proof given by Ziyin et al. (2025) of the "perfect" Platonic Representation Hypothesis (PRH) for the embedded deep linear network model (EDLN). We show that if trained with stochastic gradient descent (SGD), two EDLNs with different widths and depths and trained on different data will become Perfectly Platonic, meaning that every possible pair of layers will learn the same representation up to a rotation. Because most of the global minima of the loss function are not Platonic, the fact that SGD finds only the perfectly Platonic solution is rather extraordinary. The proof also suggests at least six ways the PRH can be broken. We also show that in the EDLN model, the emergence of the Platonic representations is due to the same reason as the emergence of progressive sharpening. This implies that these two seemingly unrelated phenomena in deep learning can, surprisingly, have a common cause. Overall, the theory and proof highlight the importance of understanding emergent “entropic forces” due to the irreversibility of SGD training and their role in representation learning. The goal of this note is to be instructive while avoiding jargon and lengthy technical details.
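
The claim "same representation up to a rotation" can be tested numerically with orthogonal Procrustes alignment, as in this sketch; the synthetic matrices stand in for layer activations from the two trained networks.

```python
# Checking "identical up to rotation": orthogonal Procrustes finds the best
# rotation between two representation matrices; a near-zero residual means
# the layers agree up to a rotation.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
H1 = rng.normal(size=(100, 16))          # representations from layer A
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
H2 = H1 @ Q                              # layer B: same representation, rotated

R, _ = orthogonal_procrustes(H1, H2)
residual = np.linalg.norm(H1 @ R - H2)   # ~0 here, i.e. Platonic up to rotation
```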

[421] Dynamic Regret Reduces to Kernelized Static Regret

Andrew Jacobsen, Alessandro Rudi, Francesco Orabona, Nicolo Cesa-Bianchi

Main category: cs.LG

TL;DR: The paper presents a novel reduction from dynamic regret minimization to static regret in RKHS function spaces, enabling optimal dynamic regret bounds for linear losses and new adaptive bounds for general loss sequences.

DetailsMotivation: Dynamic regret in online convex optimization aims to compete with arbitrary comparator sequences rather than fixed comparators. Existing dynamic-to-static reductions only work for linear losses, limiting their applicability to broader settings.

Method: Frame dynamic regret as static regret in a function space by treating comparator sequences as functions from time to decision space. Construct a suitable RKHS to enable this reduction, leveraging reproducing properties for practical computation.

Result: Achieves optimal $O(\sqrt{\sum_t \|u_t - u_{t-1}\| T})$ dynamic regret for linear losses, and new scale-free adaptive bounds. Extends to exp-concave and improper linear regression with $O(\|u\|^2_{\mathcal{H}} + d_{\mathrm{eff}}(\lambda)\ln T)$ bounds, where $d_{\mathrm{eff}}(\lambda)$ measures RKHS complexity.

Conclusion: The RKHS-based reduction provides a unified framework for dynamic regret that works for general loss sequences, yields optimal bounds for linear losses, and produces practical algorithms despite infinite-dimensional function spaces.

Abstract: We study dynamic regret in online convex optimization, where the objective is to achieve low cumulative loss relative to an arbitrary benchmark sequence. By observing that competing with an arbitrary sequence of comparators $u_{1},\ldots,u_{T}$ in $\mathcal{W}\subseteq\mathbb{R}^{d}$ is equivalent to competing with a fixed comparator function $u:[1,T]\to \mathcal{W}$, we frame dynamic regret minimization as a static regret problem in a function space. By carefully constructing a suitable function space in the form of a Reproducing Kernel Hilbert Space (RKHS), our reduction enables us to recover the optimal $R_{T}(u_{1},\ldots,u_{T}) = \mathcal{O}(\sqrt{\sum_{t}\|u_{t}-u_{t-1}\|T})$ dynamic regret guarantee in the setting of linear losses, and yields new scale-free and directionally-adaptive dynamic regret guarantees. Moreover, unlike prior dynamic-to-static reductions, which are valid only for linear losses, our reduction holds for any sequence of losses, allowing us to recover $\mathcal{O}\big(\|u\|^2_{\mathcal{H}}+d_{\mathrm{eff}}(\lambda)\ln T\big)$ bounds in exp-concave and improper linear regression settings, where $d_{\mathrm{eff}}(\lambda)$ is a measure of complexity of the RKHS. Despite working in an infinite-dimensional space, the resulting reduction leads to algorithms that are computable in practice, due to the reproducing property of RKHSs.

[422] Optimizing Drivers’ Discount Order Acceptance Strategies: A Policy-Improved Deep Deterministic Policy Gradient Framework

Hanwen Dai, Chang Gao, Fang He, Congyuan Ji, Yanni Yang

Main category: cs.LG

TL;DR: This paper proposes a policy-improved deep deterministic policy gradient (pi-DDPG) framework to dynamically manage drivers’ acceptance of Discount Express orders for ride-hailing platforms, addressing the challenge of online learning without historical data while ensuring reliable early-stage performance.

DetailsMotivation: Platform integration consolidates multiple ride-hailing platforms into single apps, with third-party integrators offering Discount Express services via express drivers at lower fares. Individual platforms face a trade-off: encouraging driver participation in Discount Express can expand demand and improve matching efficiency, but reduces profit margins. The lack of historical data under this new business model requires online learning, but early-stage exploration through trial-and-error is costly, necessitating reliable early-stage performance in real-world deployment.

Method: The study formulates driver acceptance of discount orders as a continuous control task. To address high stochasticity and opaque matching mechanisms of third-party integrators, the authors propose a policy-improved deep deterministic policy gradient (pi-DDPG) framework with a refiner module to boost policy performance during early training. They develop a customized simulator based on real-world data to validate the approach.

Result: Numerical experiments demonstrate that pi-DDPG achieves superior learning efficiency and significantly reduces early-stage training losses compared to alternatives, enhancing its applicability to practical ride-hailing scenarios.

Conclusion: The proposed pi-DDPG framework effectively addresses the challenge of dynamically managing drivers’ acceptance of Discount Express orders for individual ride-hailing platforms, providing a practical solution that balances learning efficiency with reliable early-stage performance in the absence of historical data.

Abstract: The rapid expansion of platform integration has emerged as an effective solution to mitigate market fragmentation by consolidating multiple ride-hailing platforms into a single application. To address heterogeneous passenger preferences, third-party integrators provide Discount Express service delivered by express drivers at lower trip fares. For the individual platform, encouraging broader participation of drivers in Discount Express services has the potential to expand the accessible demand pool and improve matching efficiency, but often at the cost of reduced profit margins. This study aims to dynamically manage drivers’ acceptance of Discount Express from the perspective of an individual platform. The lack of historical data under the new business model necessitates online learning. However, early-stage exploration through trial and error can be costly in practice, highlighting the need for reliable early-stage performance in real-world deployment. To address these challenges, this study formulates the decision regarding the proportion of drivers accepting discount orders as a continuous control task. In response to the high stochasticity and the opaque matching mechanisms employed by the third-party integrator, we propose an innovative policy-improved deep deterministic policy gradient (pi-DDPG) framework. The proposed framework incorporates a refiner module to boost policy performance during the early training phase. A customized simulator based on a real-world dataset is developed to validate the effectiveness of the proposed pi-DDPG. Numerical experiments demonstrate that pi-DDPG achieves superior learning efficiency and significantly reduces early-stage training losses, enhancing its applicability to practical ride-hailing scenarios.

[423] Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

Jonas Hübotter, Patrik Wolf, Alexander Shevchenko, Dennis Jüni, Andreas Krause, Gil Kur

Main category: cs.LG

TL;DR: TTT (test-time training) improves performance by allowing foundation models to specialize on test-task concepts after initial generalization, addressing global underparameterization.

DetailsMotivation: Existing explanations for TTT effectiveness focus on out-of-distribution adaptation or privileged data, but these don't explain why TTT works with in-distribution data for large foundation models. The paper proposes that foundation models are globally underparameterized and need specialization after generalization.

Method: Proposes theoretical model under linear representation hypothesis showing TTT achieves smaller in-distribution test error than global training. Empirically validates assumptions using sparse autoencoder on ImageNet to show semantically related data points share few concepts. Conducts scaling studies across image and language tasks.

Result: Shows TTT provides substantial performance improvements by specializing model capacity on test-task relevant concepts. Identifies regimes where specialization is most effective through empirical validation and scaling studies.

Conclusion: TTT works because foundation models are globally underparameterized - they need specialization after generalization to focus capacity on test-task concepts, not just for OOD adaptation. This explains TTT effectiveness even with in-distribution data for large models.

Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model’s key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.
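
A minimal TTT skeleton: clone the trained model and take a few gradient steps on an unsupervised test-time objective before predicting. Entropy minimization is used below as one common stand-in objective; it is not necessarily the specialization procedure studied in the paper.

```python
# Test-time training sketch: specialize a copy of the model on the unlabeled
# test batch before predicting, leaving the base model intact.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
x_test = torch.randn(32, 16)             # unlabeled test batch

ttt_model = copy.deepcopy(model)         # keep the generalist model untouched
opt = torch.optim.SGD(ttt_model.parameters(), lr=1e-3)

for _ in range(10):                      # a few test-time gradient steps
    probs = ttt_model(x_test).softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    opt.zero_grad(); entropy.backward(); opt.step()

preds = ttt_model(x_test).argmax(-1)     # predictions from the specialized copy
```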

[424] Generalized Kernelized Bandits: A Novel Self-Normalized Bernstein-Like Dimension-Free Inequality and Regret Bounds

Alberto Maria Metelli, Simone Drago, Marco Mussi

Main category: cs.LG

TL;DR: This paper introduces generalized kernelized bandits (GKBs), a unified framework combining kernelized bandits and generalized linear bandits, and proposes GKB-UCB algorithm with optimal regret bounds.

DetailsMotivation: The paper aims to bridge kernelized bandits (KBs) and generalized linear bandits (GLBs) by creating a unified framework that handles non-linear reward models in RKHS settings, addressing limitations of existing concentration inequalities.

Method: Proposes GKB-UCB algorithm using a novel self-normalized Bernstein-like concentration inequality for Hilbert spaces, and introduces a tractable version Trac-GKB-UCB with similar regret guarantees.

Result: Achieves a tight regret bound of $\widetilde{O}(\gamma_T\sqrt{T/\kappa_*})$, where $\gamma_T$ is the maximal information gain and $\kappa_*$ characterizes the reward non-linearity, with optimal dependence on $T$, $\gamma_T$, and $\kappa_*$ for both KBs and GLBs.

Conclusion: The GKB framework successfully unifies KBs and GLBs, with GKB-UCB achieving optimal regret bounds through novel concentration inequalities, providing a comprehensive solution for bandit problems with non-linear reward models in RKHS.

Abstract: We study the regret minimization problem in the novel setting of generalized kernelized bandits (GKBs), where we optimize an unknown function $f^*$ belonging to a reproducing kernel Hilbert space (RKHS) having access to samples generated by an exponential family (EF) reward model whose mean is a non-linear function $\mu(f^*)$. This setting extends both kernelized bandits (KBs) and generalized linear bandits (GLBs), providing a unified view of both settings. We propose an optimistic regret minimization algorithm, GKB-UCB, and we explain why existing self-normalized concentration inequalities used for KBs and GLBs do not allow us to provide tight regret guarantees. For this reason, we devise a novel self-normalized Bernstein-like dimension-free inequality that applies to a Hilbert space of functions with bounded norm, representing a contribution of independent interest. Based on it, we analyze GKB-UCB, deriving a regret bound of order $\widetilde{O}(\gamma_T \sqrt{T/\kappa_*})$, where $T$ is the learning horizon, $\gamma_T$ the maximal information gain, and $\kappa_*$ a term characterizing the magnitude of the expected reward non-linearity. Our result is tight in its dependence on $T$, $\gamma_T$, and $\kappa_*$ for both KBs and GLBs. Finally, we present a tractable version of GKB-UCB, Trac-GKB-UCB, which attains similar regret guarantees, and we discuss its time and space complexity.

[425] RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

The-Hai Nguyen, Dang Huu-Tien, Takeshi Suzuki, Le-Minh Nguyen

Main category: cs.LG

TL;DR: RegMean++ improves upon RegMean by incorporating intra- and cross-layer dependencies when merging models, leading to better performance across diverse tasks and settings.

DetailsMotivation: RegMean merges each linear layer independently, ignoring how features and information propagate through layers and influence final predictions. This limitation motivates a more holistic approach to model merging.

Method: RegMean++ extends RegMean by explicitly incorporating both intra- and cross-layer dependencies between merge models’ layers into the regression objective, better capturing the behaviors of the merge model.

Result: RegMean++ consistently outperforms RegMean across diverse settings including in-domain/out-of-domain generalization, sequential merging, large-scale tasks, and robustness under distribution shifts. It achieves competitive or state-of-the-art performance compared to recent advanced model merging methods.

Conclusion: RegMean++ provides a simple yet effective improvement over RegMean by accounting for layer dependencies, offering better model merging performance while maintaining explainability and computational efficiency.

Abstract: Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. In this paper, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra- and cross-layer dependencies between merge models’ layers into RegMean’s objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods. Our code is available at https://github.com/nthehai01/RegMean-plusplus.
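
For reference, the per-layer RegMean closed form that RegMean++ builds on can be written in a few lines of numpy; the activations and candidate weights below are synthetic, and RegMean++’s cross-layer coupling is not shown.

```python
# Per-layer RegMean closed form: the merged weight minimizes
# sum_i ||X_i W - X_i W_i||^2, giving
# W = (sum_i X_i^T X_i)^{-1} sum_i X_i^T X_i W_i.
import numpy as np

rng = np.random.default_rng(0)
Xs = [rng.normal(size=(50, 8)) for _ in range(3)]   # per-model layer inputs
Ws = [rng.normal(size=(8, 4)) for _ in range(3)]    # per-model layer weights

G = sum(X.T @ X for X in Xs)                        # sum of Gram matrices
W_merged = np.linalg.solve(G, sum(X.T @ X @ W for X, W in zip(Xs, Ws)))
```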

[426] Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction

Yisen Gao, Xingcheng Fu, Qingyun Sun, Jianxin Li, Xianxian Li

Main category: cs.LG

TL;DR: GeoMancer is a Riemannian graph diffusion framework that addresses geometric entanglement in graph data by learning distinct manifold signatures for different curvatures, solving numerical instability and manifold deviation issues.

DetailsMotivation: Existing graph diffusion models embed node, edge, and graph-level features into a unified latent space, which entangles features of different curvatures due to the non-Euclidean nature of graph data, failing to capture their geometric potential.

Method: Proposes GeoMancer with two key innovations: 1) replaces exponential mapping with isometric-invariant Riemannian gyrokernel approach to mitigate numerical instability, and decouples multi-level features onto task-specific manifolds; 2) introduces manifold-constrained diffusion method and self-guided strategy for unconditional generation to address manifold deviation.

Result: Extensive experiments demonstrate superior performance across various tasks, validating the effectiveness of the approach in capturing distinct manifold signatures of complex graph data.

Conclusion: GeoMancer successfully constructs an ideal Riemannian diffusion model that captures distinct manifold signatures of graph data, addressing key challenges in numerical stability and manifold alignment during generation and prediction tasks.

Abstract: Graph diffusion models have made significant progress in learning structured graph data and have demonstrated strong potential for predictive tasks. Existing approaches typically embed node, edge, and graph-level features into a unified latent space, modeling prediction tasks including classification and regression as a form of conditional generation. However, due to the non-Euclidean nature of graph data, features of different curvatures are entangled in the same latent space without releasing their geometric potential. To address this issue, we aim to construct an ideal Riemannian diffusion model to capture distinct manifold signatures of complex graph data and learn their distribution. This goal faces two challenges: numerical instability caused by exponential mapping during the encoding process and manifold deviation during diffusion generation. To address these challenges, we propose GeoMancer: a novel Riemannian graph diffusion framework for both generation and prediction tasks. To mitigate numerical instability, we replace exponential mapping with an isometric-invariant Riemannian gyrokernel approach and decouple multi-level features onto their respective task-specific manifolds to learn optimal representations. To address manifold deviation, we introduce a manifold-constrained diffusion method and a self-guided strategy for unconditional generation, ensuring that the generated data remains aligned with the manifold signature. Extensive experiments validate the effectiveness of our approach, demonstrating superior performance across a variety of tasks.

[427] A Markov Decision Process Framework for Early Maneuver Decisions in Satellite Collision Avoidance

Francesca Ferrara, Lander W. Schillinger Arana, Florian Dörfler, Sarah H. Q. Li

Main category: cs.LG

TL;DR: Reinforcement learning approach for satellite collision avoidance that optimizes when to initiate maneuvers to reduce propellant consumption while maintaining safety.

DetailsMotivation: Current satellite collision avoidance methods often initiate maneuvers too close to the time of closest approach, leading to higher propellant consumption. There's a need for an autonomous decision-making framework that can make earlier maneuver decisions to save fuel while maintaining acceptable collision risks.

Method: Developed a Markov decision process (MDP) framework for satellite collision avoidance guidance decisions, with continuous states, discrete actions, and finite horizon. Used reinforcement learning policy gradient (RL-PG) algorithm to optimize guidance policies directly from historical CAM data. The MDP models rewards using analytical models of collision risk, propellant consumption, and transit orbit geometry.

Result: On synthetic conjunction events: the trained policy consumed significantly less propellant overall and per maneuver compared to a conventional 24-hour cutoff policy. On historical events: it consumed more propellant overall but less per maneuver. The policy was slightly more conservative in identifying conjunctions requiring CAMs compared to cutoff policies.

Conclusion: The RL-based MDP framework enables autonomous satellite collision avoidance decisions that can reduce propellant consumption by making earlier maneuver decisions while maintaining safety. The approach shows promise for optimizing satellite operations and extending mission lifetimes through fuel savings.

Abstract: We develop a Markov decision process (MDP) framework to autonomously make guidance decisions for satellite collision avoidance maneuver (CAM) and a reinforcement learning policy gradient (RL-PG) algorithm to enable direct optimization of guidance policy using historic CAM data. In addition to maintaining acceptable collision risks, this approach seeks to minimize the average propellant consumption of CAMs by making early maneuver decisions. We model CAM as a continuous state, discrete action and finite horizon MDP, where the critical decision is determining when to initiate the maneuver. The MDP models decision rewards using analytical models of collision risk, propellant consumption, and transit orbit geometry. By deciding to maneuver earlier than conventional methods, the Markov policy effectively favors CAMs that achieve comparable rates of collision risk reduction while consuming less propellant. Using historical data of tracked conjunction events, we verify this framework and conduct an extensive parameter-sensitivity study. When evaluated on synthetic conjunction events, the trained policy consumes significantly less propellant overall and per maneuver in comparison to a conventional cut-off policy that initiates maneuvers 24 hours before the time of closest approach (TCA). On historical conjunction events, the trained policy consumes more propellant overall but consumes less propellant per maneuver. For both historical and synthetic conjunction events, the trained policy is slightly more conservative in identifying conjunction events that warrant CAMs in comparison to cutoff policies.
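
As a concrete illustration of the maneuver-timing MDP, here is a toy REINFORCE loop over a finite-horizon conjunction event with a binary wait/maneuver action. All of the dynamics, the propellant model, and the residual-risk penalty below are invented stand-ins for the paper's analytical models.

```python
import numpy as np

rng = np.random.default_rng(0)

def propellant_cost(t_to_tca):
    # Stylized: earlier maneuvers need less delta-v for the same risk reduction.
    return 1.0 / (1.0 + t_to_tca)

def run_episode(theta, horizon=8):
    # One conjunction event: at each epoch choose wait (0) or maneuver (1).
    grads = []
    for t in range(horizon, 0, -1):            # t = epochs remaining until TCA
        x = np.array([t / horizon, 1.0])       # toy state features (time, bias)
        p = 1.0 / (1.0 + np.exp(-theta @ x))   # P(maneuver | state)
        a = rng.random() < p
        grads.append(((1.0 if a else 0.0) - p) * x)  # grad of log-policy
        if a:
            return -propellant_cost(t), grads  # maneuver executed, pay fuel
    return (-100.0 if rng.random() < 0.05 else 0.0), grads  # residual risk

def reinforce(episodes=2000, lr=0.05):
    theta = np.zeros(2)
    for _ in range(episodes):
        ret, grads = run_episode(theta)
        for g in grads:
            theta += lr * ret * g              # REINFORCE: return x grad log pi
    return theta
```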

[428] Unveiling the Latent Directions of Reflection in Large Language Models

Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu

Main category: cs.LG

TL;DR: The paper investigates reflection in LLMs through activation steering, showing how different reflection levels can be controlled and manipulated, with implications for both enhancing defenses and creating adversarial attacks.

DetailsMotivation: While reflection is widely used to improve LLM performance on complex reasoning tasks, most prior work focuses on prompting strategies or reinforcement learning, leaving the inner mechanisms of reflection underexplored. The authors aim to understand reflection through the lens of latent directions in model activations.

Method: Proposes activation steering methodology to characterize reflection at different levels: no reflection, intrinsic reflection, and triggered reflection. Constructs steering vectors between these reflection levels to systematically identify reflection-inducing instructions and directly manipulate reflective behavior through activation interventions.

Result: Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT show clear stratification across reflection levels. Steering interventions confirm controllability of reflection, with findings that suppressing reflection is considerably easier than stimulating it.

Conclusion: The work opens a path toward mechanistic understanding of reflective reasoning in LLMs, highlighting both opportunities (reflection-enhancing defenses) and risks (adversarial inhibition of reflection in jailbreak attacks). The findings demonstrate the potential for controlling reflection through activation interventions.

Abstract: Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet, most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with three different reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward mechanistic understanding of reflective reasoning in LLMs.
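
Activation steering of this kind is commonly implemented as a mean-difference vector added to a chosen layer's hidden states. A minimal PyTorch sketch follows; the layer path in the usage comment assumes a Hugging Face-style decoder and is illustrative only.

```python
import torch

def steering_vector(acts_with, acts_without):
    # Mean-difference direction between activations gathered under
    # reflection-inducing vs. reflection-free instructions, at one layer.
    return acts_with.mean(dim=0) - acts_without.mean(dim=0)

def make_steering_hook(vec, alpha):
    # alpha > 0 stimulates the behavior; alpha < 0 suppresses it.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden)  # match dtype/device
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a HF-style model:
# handle = model.model.layers[k].register_forward_hook(make_steering_hook(v, -1.0))
```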

[429] RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models

Shikun Liu, Deyu Zou, Nima Shoghi, Victor Fung, Kai Liu, Pan Li

Main category: cs.LG

TL;DR: The paper introduces ROFT-MOL, a robust fine-tuning method for molecular graph foundation models that combines weight interpolation and ensemble techniques to improve performance on both regression and classification tasks while addressing data scarcity challenges.

DetailsMotivation: Molecular graph foundation models face unique challenges: smaller pre-training datasets, severe data scarcity for downstream tasks, and need to handle diverse objectives (regression and classification). Existing fine-tuning methods need improvement to address model overfitting and sparse labeling issues specific to molecular applications.

Method: The authors first classify eight fine-tuning methods into three mechanisms (weight-based, representation-based, and partial fine-tuning) and benchmark them. Based on insights from this evaluation, they design ROFT-MOL, which combines simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods.

Result: ROFT-MOL delivers improved performance across both regression and classification tasks while maintaining the ease of use inherent in post-hoc weight interpolation. The extensive benchmarking provides valuable insights into fine-tuning techniques for molecular graph foundation models.

Conclusion: The proposed ROFT-MOL method effectively addresses the unique challenges of fine-tuning molecular graph foundation models by combining the strengths of different fine-tuning approaches, offering a robust solution that works well for diverse molecular tasks under data-scarce conditions.

Abstract: In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
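
Post-hoc weight interpolation, one ingredient ROFT-MOL builds on, is simple to express: blend the pretrained and fine-tuned checkpoints parameter-wise. A minimal sketch, assuming the two state dicts share keys and shapes:

```python
def interpolate_state_dicts(pretrained, finetuned, alpha=0.5):
    # Post-hoc blend: no retraining, a single interpolation factor.
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }
```

In practice the blend factor alpha is typically swept on a validation set, which is what makes the method post hoc: no additional training is required.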

[430] An efficient probabilistic hardware architecture for diffusion-like models

Andraž Jelinčič, Owen Lockwood, Akhil Garlapati, Peter Schillinger, Isaac Chuang, Guillaume Verdon, Trevor McCourt

Main category: cs.LG

TL;DR: Researchers propose an all-transistor probabilistic computer that implements denoising models at hardware level, achieving GPU performance parity with ~10,000x less energy on image benchmarks.

DetailsMotivation: Existing stochastic computing proposals have failed due to limited modeling techniques and unscalable exotic hardware. There's a need for efficient probabilistic AI hardware that overcomes these limitations.

Method: Developed an all-transistor probabilistic computer architecture that implements powerful denoising models directly at the hardware level, avoiding exotic components.

Result: System-level analysis shows devices based on this architecture could achieve performance parity with GPUs on simple image benchmarks while using approximately 10,000 times less energy.

Conclusion: The proposed all-transistor probabilistic computer addresses scalability and modeling limitations of previous stochastic computing approaches, offering dramatic energy efficiency improvements for probabilistic AI tasks.

Abstract: The proliferation of probabilistic AI has prompted proposals for specialized stochastic computers. Despite promising efficiency gains, these proposals have failed to gain traction because they rely on fundamentally limited modeling techniques and exotic, unscalable hardware. In this work, we address these shortcomings by proposing an all-transistor probabilistic computer that implements powerful denoising models at the hardware level. A system-level analysis indicates that devices based on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.

[431] If generative AI is the answer, what is the question?

Ambuj Tewari

Main category: cs.LG

TL;DR: A comprehensive survey paper exploring generative AI foundations, model families, probabilistic frameworks, and social responsibility aspects, adopting a task-first perspective on generation as a distinct ML problem.

DetailsMotivation: To establish a foundational understanding of generation as a distinct machine learning task beyond just model implementations, connecting it to prediction, compression, and decision-making while addressing the core question: if generative AI is the answer, what is the question?

Method: The paper adopts a task-first framing, surveys five major generative model families (autoregressive models, VAEs, normalizing flows, GANs, diffusion models), introduces a probabilistic framework distinguishing density estimation from generation, presents a game-theoretic two-player adversary-learner setup, and discusses post-training modifications.

Result: Provides a comprehensive framework for understanding generative AI that goes beyond specific model architectures, establishing connections between generation and fundamental ML concepts while creating systematic approaches to study generation as a distinct task.

Conclusion: The paper establishes generation as a distinct machine learning task with its own theoretical foundations, emphasizes the importance of task-first understanding over model-centric approaches, and highlights critical social responsibility considerations including privacy, AI detection, and copyright/IP issues in generative systems.

Abstract: Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.
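
The density-estimation-versus-generation distinction the paper formalizes can be seen in a toy Gaussian example: density estimation recovers parameters that answer likelihood queries, while generation only requires the ability to sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

# Density estimation: fit parameters of an assumed model family.
mu_hat, sigma_hat = data.mean(), data.std()

# Generation: draw new samples from the fitted model. A good sampler
# need not expose densities, and a good density model need not sample well.
new_samples = rng.normal(mu_hat, sigma_hat, size=10)
```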

[432] Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks

Dian Jin, Yancheng Yuan, Xiaoming Tao

Main category: cs.LG

TL;DR: MMEA is a novel equivariant fine-tuning method that uses lightweight scalar gating to modulate feature magnitudes per-order and per-multiplicity, achieving SOTA performance while training fewer parameters than competing approaches.

DetailsMotivation: Existing equivariant PEFT methods like ELoRA still have high degrees of freedom that can perturb pretrained feature distributions and degrade performance. There's a need for more efficient equivariant fine-tuning that preserves symmetry while being parameter-efficient.

Method: MMEA employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis, preserving strict equivariance while being highly parameter-efficient.

Result: MMEA consistently improves energy and force predictions to state-of-the-art levels across multiple benchmarks while training fewer parameters than competing approaches.

Conclusion: Modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.

Abstract: Pretrained equivariant graph neural networks based on spherical harmonics offer efficient and accurate alternatives to computationally expensive ab-initio methods, yet adapting them to new tasks and chemical environments still requires fine-tuning. Conventional parameter-efficient fine-tuning (PEFT) techniques, such as Adapters and LoRA, typically break symmetry, making them incompatible with those equivariant architectures. ELoRA, recently proposed, is the first equivariant PEFT method. It achieves improved parameter efficiency and performance on many benchmarks. However, the relatively high degrees of freedom it retains within each tensor order can still perturb pretrained feature distributions and ultimately degrade performance. To address this, we present Magnitude-Modulated Equivariant Adapter (MMEA), a novel equivariant fine-tuning method which employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis. We demonstrate that MMEA preserves strict equivariance and, across multiple benchmarks, consistently improves energy and force predictions to state-of-the-art levels while training fewer parameters than competing approaches. These results suggest that, in many practical scenarios, modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.
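
A per-order, per-multiplicity scalar gate can be sketched directly: scaling an entire order-l irrep block by a scalar commutes with rotations acting on its (2l+1) components, so strict equivariance is preserved. The tensor layout below is an assumption for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MagnitudeGate(nn.Module):
    # One learnable scalar per (order, multiplicity) channel; scaling whole
    # irrep blocks preserves equivariance because scalars commute with
    # rotations acting on the (2l + 1)-dimensional components.
    def __init__(self, channels_per_order):
        super().__init__()
        self.gates = nn.ParameterList(
            [nn.Parameter(torch.ones(c)) for c in channels_per_order]
        )

    def forward(self, feats_per_order):
        # feats_per_order[l]: (batch, channels_l, 2l + 1) spherical features
        return [
            g.view(1, -1, 1) * f for g, f in zip(self.gates, feats_per_order)
        ]
```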

[433] Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration

Ahmed Khaled, Satyen Kale, Arthur Douillard, Chi Jin, Rob Fergus, Manzil Zaheer

Main category: cs.LG

TL;DR: Local SGD with outer optimizer tuning: Outer learning rate >1 can trade off optimization error vs gradient noise and compensate for inner learning rate mis-tuning; momentum and acceleration improve convergence rates.

DetailsMotivation: Large-scale distributed ML faces communication bottlenecks; Local SGD reduces communication but outer optimizer hyperparameter tuning is poorly understood compared to local optimization parameters.

Method: Theoretical analysis of Local SGD’s outer optimizer, proving convergence guarantees for outer learning rate tuning (including values >1), momentum, and acceleration mechanisms; plus novel data-dependent analysis and experiments with language models.

Result: Outer learning rate tuning enables trade-off between optimization error and gradient noise variance, compensates for inner learning rate mis-tuning; momentum and acceleration improve convergence rates; experiments validate theory.

Conclusion: Outer optimizer hyperparameters in Local SGD are crucial and should be tuned, with outer learning rate sometimes >1; momentum and acceleration enhance performance; data-dependent analysis provides further tuning insights.

Abstract: Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in such settings but methods like Local Stochastic Gradient Descent (Local SGD) show great promise in reducing this additional communication overhead. Local SGD consists of three parts: a local optimization process, an aggregation mechanism, and an outer optimizer that uses the aggregated updates from the nodes to produce a new model. While there exists an extensive literature on understanding the impact of hyperparameters in the local optimization process, the choice of outer optimizer and its hyperparameters is less clear. We study the role of the outer optimizer in Local SGD, and prove new convergence guarantees for the algorithm. In particular, we show that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) make up for ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than $1$. We extend our results to settings where we use momentum in the outer optimizer, and we show a similar role for the momentum-adjusted outer learning rate. We also study acceleration in the outer optimizer and show that it improves the convergence rate as a function of the number of communication rounds, improving upon the convergence rate of prior algorithms that apply acceleration locally. Finally, we also introduce a novel data-dependent analysis of Local SGD that yields further insights on outer learning rate tuning. We conduct comprehensive experiments with standard language models and various outer optimizers to validate our theory.
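
The role of the outer optimizer is easiest to see in code: the averaged client update defines a pseudo-gradient, and the outer learning rate and momentum act on it. A minimal NumPy sketch; with outer_lr = 1 and no momentum this reduces to plain averaging, i.e., vanilla Local SGD.

```python
import numpy as np

def outer_step(x_global, client_models, outer_lr=1.5, momentum=0.9, buf=None):
    # Pseudo-gradient: the negated average client update. An outer learning
    # rate above 1 extrapolates beyond the plain average of client models.
    avg = np.mean(client_models, axis=0)
    pseudo_grad = x_global - avg
    buf = pseudo_grad if buf is None else momentum * buf + pseudo_grad
    return x_global - outer_lr * buf, buf
```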

[434] RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis

Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Main category: cs.LG

TL;DR: RAD is a retrieval-augmented framework that explicitly injects external medical knowledge into multimodal diagnostic models, improving performance and interpretability by aligning with clinical guidelines.

DetailsMotivation: Current AI medical approaches rely on implicitly encoded knowledge in model parameters, neglecting task-specific knowledge needed for diverse downstream diagnostic tasks. There's also a lack of quantitative evaluation for multimodal diagnostic model interpretability.

Method: Three key mechanisms: 1) Retrieval and refinement of disease-centered knowledge from multiple medical sources, 2) Guideline-enhanced contrastive loss to constrain distance between multimodal features and guideline knowledge, 3) Dual transformer decoder using guidelines as queries to steer cross-modal fusion, aligning with clinical workflows.

Result: Extensive evaluations across four datasets with different anatomies demonstrate state-of-the-art performance and generalizability. RAD enables models to focus more precisely on abnormal regions and critical indicators, ensuring evidence-based diagnosis.

Conclusion: RAD provides a novel framework for explicit knowledge injection in multimodal diagnostic models, improving both performance and interpretability while aligning with clinical workflows. The approach addresses limitations of current AI medical research and enables more trustworthy, evidence-based diagnosis.

Abstract: Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and the dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD’s generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.
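
A guideline-enhanced contrastive loss of the kind described can be approximated by an InfoNCE-style objective that pulls each sample's fused multimodal feature toward its matching guideline embedding. This is a generic stand-in, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def guideline_contrastive_loss(feats, guide_embs, temperature=0.07):
    # feats: (B, d) fused multimodal features; guide_embs: (B, d) matching
    # guideline text embeddings, paired row-by-row.
    feats = F.normalize(feats, dim=-1)
    guide_embs = F.normalize(guide_embs, dim=-1)
    logits = feats @ guide_embs.t() / temperature
    targets = torch.arange(feats.size(0), device=feats.device)
    return F.cross_entropy(logits, targets)
```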

[435] Bidirectional Representations Augmented Autoregressive Biological Sequence Generation

Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun

Main category: cs.LG

TL;DR: Hybrid AR-NAR framework for biological sequence generation that combines AR stability with NAR bidirectional context through cross-decoder attention.

DetailsMotivation: AR models fail to capture bidirectional dependencies in biological sequences, while NAR models lack generative coherence. Need a solution that combines strengths of both approaches.

Method: Shared encoder with two decoders: NAR decoder learns bidirectional features, AR decoder generates sequences using cross-decoder attention to query NAR features. Training uses importance annealing and gradient blocking.

Result: Outperforms AR and NAR baselines on nine-species peptide sequencing benchmark, harmonizing AR stability with NAR contextual awareness.

Conclusion: Proposes novel architectural paradigm for enhancing AR models with bidirectional understanding, advancing biological sequence modeling techniques.

Abstract: Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
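
The cross-decoder attention module can be sketched as standard multi-head attention in which AR decoder states are queries and the NAR decoder's bidirectional features are keys and values; the residual fusion below is an assumed design detail.

```python
import torch.nn as nn

class CrossDecoderAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, ar_states, nar_features):
        # AR hidden states attend over the NAR decoder's bidirectional
        # features, enriching next-token predictions with global context.
        out, _ = self.attn(query=ar_states, key=nar_features, value=nar_features)
        return ar_states + out  # residual fusion (assumed)
```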

[436] An Eulerian Perspective on Straight-Line Sampling

Panos Tsimpos, Youssef Marzouk

Main category: cs.LG

TL;DR: The paper studies dynamic measure transport for generative modeling, focusing on stochastic processes that bridge source and target distributions. It characterizes which processes produce straight-line flows (vanishing acceleration) that are easily integrable with first-order methods.

DetailsMotivation: In generative modeling, flows induced by stochastic processes can transport between distributions, but many require complex integration. The authors aim to identify processes that produce straight-line flows, which are easier to integrate numerically with simple first-order methods.

Method: The authors provide a PDE characterization of straightness as a balance between conditional acceleration and divergence of a weighted covariance (Reynolds) tensor. They analyze affine-in-time interpolants and examine deterministic endpoint couplings to identify conditions for straight-line flows.

Result: They fully characterize affine-in-time interpolants and show that straightness occurs exactly under deterministic endpoint couplings. They derive necessary conditions that constrain flow geometry for general processes, providing guidance for designing easily integrable transports.

Conclusion: The paper offers a theoretical framework for identifying and designing stochastic processes that produce straight-line flows, making transport between distributions more computationally efficient through simpler integration methods.

Abstract: We study dynamic measure transport for generative modeling: specifically, flows induced by stochastic processes that bridge a specified source and target distribution. The conditional expectation of the process’ velocity defines an ODE whose flow map achieves the desired transport. We ask \emph{which processes produce straight-line flows} – i.e., flows whose pointwise acceleration vanishes and thus are exactly integrable with a first-order method? We provide a concise PDE characterization of straightness as a balance between conditional acceleration and the divergence of a weighted covariance (Reynolds) tensor. Using this lens, we fully characterize affine-in-time interpolants and show that straightness occurs exactly under deterministic endpoint couplings. We also derive necessary conditions that constrain flow geometry for general processes, offering broad guidance for designing transports that are easier to integrate.
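
The headline result is easy to verify in the affine case: with a deterministic endpoint coupling, the interpolant's velocity is constant, so one forward-Euler step integrates the flow exactly. The coupling x1 = T(x0) below is an arbitrary deterministic map chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=3)      # source sample
x1 = x0 + 1.0                # deterministic endpoint coupling: x1 = T(x0)

def path(t):
    return (1.0 - t) * x0 + t * x1   # affine-in-time interpolant

velocity = x1 - x0                   # constant along the path: zero acceleration
# A single forward-Euler step integrates the flow exactly:
assert np.allclose(path(0.0) + velocity, path(1.0))
```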

[437] State-Space Models for Tabular Prior-Data Fitted Networks

Felix Koch, Marcel Wever, Fabian Raisch, Benjamin Tischler

Main category: cs.LG

TL;DR: Hydra (bidirectional SSM) replaces Transformer in TabPFN to reduce quadratic complexity while maintaining competitive performance despite SSM’s order sensitivity.

DetailsMotivation: Transformers in tabular foundation models like TabPFN have quadratic complexity with sequence length, motivating exploration of more efficient sequence models like structured state space models (SSMs).

Method: Replace Transformer in TabPFN with Hydra, a bidirectional linear-time structured state space model (SSM), to address SSM’s inherent sensitivity to input token order through bidirectional context aggregation.

Result: The Hydra-based approach reduces order-dependence and achieves predictive performance competitive with the original TabPFN Transformer model.

Conclusion: Bidirectional SSMs like Hydra can serve as efficient alternatives to Transformers in tabular foundation models, maintaining performance while reducing computational complexity.

Abstract: Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSM’s inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive to the original TabPFN model.
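
Bidirectional context aggregation with a causal linear-time module can be sketched by running the sequence in both directions and combining; here `ssm` stands for any causal sequence layer (e.g., a Hydra-style block), and the additive combination is an assumption.

```python
import torch

def bidirectional_scan(ssm, tokens):
    # tokens: (batch, seq, d). Combining a forward and a reversed pass lets
    # each row see full context, reducing sensitivity to (semantically
    # meaningless) row order in tabular data.
    fwd = ssm(tokens)
    bwd = ssm(torch.flip(tokens, dims=[1]))
    return fwd + torch.flip(bwd, dims=[1])
```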

[438] Enforcing hidden physics in physics-informed neural networks

Nanxi Chen, Sifan Wang, Rujin Ma, Airong Chen, Chuanjie Cui

Main category: cs.LG

TL;DR: PINNs struggle to enforce irreversible physical processes. This paper introduces an irreversibility-regularized strategy that enforces hidden physical laws as soft constraints, significantly improving accuracy across diverse scientific problems.

DetailsMotivation: Current Physics-Informed Neural Networks (PINNs) often fail to fully capture the physical structure embedded in governing equations, particularly for irreversible processes. There's a need for more robust frameworks that can maintain physical consistency across diverse scientific problems.

Method: The authors introduce an irreversibility-regularized strategy that enforces hidden physical laws as soft constraints during training. This approach recovers missing physics associated with irreversible processes in conventional PINNs, ensuring solutions respect the intrinsic one-way nature of irreversible physical processes.

Result: The method shows substantial performance improvements over conventional PINNs across multiple benchmarks (traveling wave propagation, steady combustion, ice melting, corrosion evolution, and crack growth). The regularization scheme reduces predictive errors by more than an order of magnitude while requiring minimal modification to existing PINN frameworks.

Conclusion: The proposed irreversibility-regularized strategy effectively addresses the challenge of enforcing irreversible physical processes in PINNs, offering a simple, generalized, and robust approach that significantly improves accuracy while maintaining compatibility with existing frameworks.

Abstract: Physics-informed neural networks (PINNs) represent a new paradigm for solving partial differential equations (PDEs) by integrating physical laws into the learning process of neural networks. However, ensuring that such frameworks fully reflect the physical structure embedded in the governing equations remains an open challenge, particularly for maintaining robustness across diverse scientific problems. In this work, we address this issue by introducing a simple, generalized, yet robust irreversibility-regularized strategy that enforces hidden physical laws as soft constraints during training, thereby recovering the missing physics associated with irreversible processes in the conventional PINN. This approach ensures that the learned solutions consistently respect the intrinsic one-way nature of irreversible physical processes. Across a wide range of benchmarks spanning traveling wave propagation, steady combustion, ice melting, corrosion evolution, and crack growth, we observe substantial performance improvements over the conventional PINN, demonstrating that our regularization scheme reduces predictive errors by more than an order of magnitude, while requiring only minimal modification to existing PINN frameworks.
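
An irreversibility regularizer of this general shape can be written as a soft penalty on negative time-derivatives of a field that physics says must be monotone (e.g., damage or phase). This is a schematic version, not the paper's exact term.

```python
import torch

def irreversibility_penalty(phi, t):
    # Hidden physics as a soft constraint: a damage/phase-like field phi(t)
    # must be non-decreasing, so penalize negative time-derivatives.
    # t must have requires_grad=True and phi must be computed from t.
    dphi_dt = torch.autograd.grad(phi.sum(), t, create_graph=True)[0]
    return torch.mean(torch.relu(-dphi_dt) ** 2)

# total_loss = pde_residual + bc_loss + lam * irreversibility_penalty(phi, t)
```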

[439] Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

Jingchu Gai, Guanning Zeng, Huaqing Zhang, Aditi Raghunathan

Main category: cs.LG

TL;DR: RL fine-tuning of LLMs causes diversity collapse. The paper provides a theoretical proof of this issue and proposes differential smoothing, a method that improves both correctness and diversity, outperforming existing heuristics.

DetailsMotivation: RL fine-tuning of large language models leads to diversity collapse where outputs lack variety. Existing heuristics are ad hoc, trade correctness for diversity, have inconsistent effectiveness across tasks, and sometimes contradict each other.

Method: The paper first provides formal proof of why RL fine-tuning exhibits diversity collapse via selection and reinforcement bias. Then introduces differential smoothing - a principled method that only applies reward modifications on correct trajectories, provably improving both correctness and diversity.

Result: Differential smoothing outperforms vanilla RL and entropy-based heuristics, with consistent gains across models from 1B to 7B parameters. Improves both Pass@1 and Pass@k metrics, with up to 6.7% improvements on AIME24 dataset. Works well across domains including CountDown and real-world mathematical reasoning.

Conclusion: The paper provides rigorous theoretical foundation for understanding diversity collapse in RL fine-tuning, explains when existing heuristics help/fail, and introduces differential smoothing as a universally superior method that improves both correctness and diversity.

Abstract: It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method – differential smoothing – that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on AIME24 dataset.
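
The key observation, that reward modifications need only touch correct trajectories, has a very compact schematic form; `modify` below stands in for whatever smoothing rule is chosen and is not the paper's exact update.

```python
def apply_on_correct(rewards, is_correct, modify):
    # Diversity-oriented reward shaping applied only where the trajectory
    # is correct; incorrect trajectories keep their raw rewards.
    return [modify(r) if ok else r for r, ok in zip(rewards, is_correct)]
```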

[440] DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction

Yifan Song, Fenglin Yu, Yihong Luo, Xingjian Tao, Siya Qiu, Kai Han, Jing Tang

Main category: cs.LG

TL;DR: DDFI is a novel graph feature imputation method that combines feature propagation with masked autoencoder to handle incomplete node features, addressing issues with disconnected graphs, over-smoothing, and distribution shift in inductive tasks.

DetailsMotivation: Real-world graphs often have incomplete node features (e.g., private user attributes), which degrades GNN performance. Existing feature propagation methods struggle with disconnected graphs, over-smoothing, and distribution shift in inductive tasks.

Method: DDFI combines feature propagation with a graph-based masked autoencoder. It uses Co-Label Linking to connect same-label nodes in the training set, and a two-step inference process: first FP imputation, then MAE reconstruction to reduce distribution shift and enhance feature diversity.

Result: Extensive experiments on six public datasets and a new real-world dataset (Sailing) show DDFI outperforms state-of-the-art methods under both transductive and inductive settings.

Conclusion: DDFI effectively addresses key limitations of existing feature imputation methods for graphs, providing robust performance across various graph connectivity patterns and task settings while handling real-world missing feature scenarios.

Abstract: Incomplete node features are ubiquitous in real-world scenarios, e.g., the attributes of web users may be partly private, which causes the performance of Graph Neural Networks (GNNs) to decline significantly. Feature propagation (FP) is a well-known method that performs well for imputation of missing node features on graphs, but it still has the following three issues: 1) it struggles with graphs that are not fully connected, 2) imputed features face the over-smoothing problem, and 3) FP is tailored for transductive tasks, overlooking the feature distribution shift in inductive tasks. To address these challenges, we introduce DDFI, a Diverse and Distribution-aware Missing Feature Imputation method that combines feature propagation with a graph-based Masked AutoEncoder (MAE) in a nontrivial manner. It first designs a simple yet effective algorithm, namely Co-Label Linking (CLL), that randomly connects nodes in the training set with the same label to enhance the performance on graphs with numerous connected components. Then we develop a novel two-step representation generation process at the inference stage. Specifically, instead of directly using FP-imputed features as input during inference, DDFI further reconstructs the features through the whole MAE to reduce feature distribution shift in the inductive tasks and enhance the diversity of node features. Meanwhile, since existing feature imputation methods for graphs only evaluate by simulating the missing scenes with manually masking the features, we collect a new dataset called Sailing from the records of voyages that contains naturally missing features to help better evaluate the effectiveness. Extensive experiments conducted on six public datasets and Sailing show that DDFI outperforms the state-of-the-art methods under both transductive and inductive settings.
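
Co-Label Linking is straightforward to sketch: randomly add edges among training nodes sharing a label so feature propagation can cross otherwise disconnected components. The sampling scheme and `k` below are assumptions for illustration.

```python
import random
from collections import defaultdict

def co_label_links(train_idx, labels, k=1, seed=0):
    # Randomly link training nodes that share a label, so feature
    # propagation can reach otherwise disconnected components.
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i in train_idx:
        by_label[labels[i]].append(i)
    edges = []
    for nodes in by_label.values():
        for u in nodes:
            for v in rng.sample(nodes, min(k, len(nodes))):
                if u != v:
                    edges.append((u, v))
    return edges
```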

[441] RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs

Runlong Zhou, Lefan Zhang, Shang-Chen Wu, Kelvin Zou, Hanzhi Zhou, Ke Ye, Yihao Feng, Dong Yin, Alex Guillen Garcia, Dmytro Babych, Rohit Chatterjee, Matthew Hopkins, Xiang Kong, Chang Lan, Lezhi Li, Yiping Ma, Daniele Molinari, Senyu Tong, Yanchao Sun, Thomas Voice, Jianyu Wang, Chong Wang, Simon Wang, Floris Weers, Yechen Xu, Guolin Yin, Muyang Yu, Yi Zhang, Zheng Zhou, Danyang Zhuo, Ruoming Pang, Cheng Leong

Main category: cs.LG

TL;DR: RLAX is a scalable RL framework on TPUs that improves LLM reasoning through efficient parameter-server architecture and novel dataset techniques, achieving significant accuracy gains in under 13 hours.

DetailsMotivation: RL has become the standard approach for enhancing LLM reasoning capabilities, but existing frameworks lack scalability and robustness to preemptions on TPU infrastructure.

Method: Parameter-server architecture with master trainer pushing weights and inference workers pulling weights for rollouts; system techniques for scalable, preemptible RL; novel dataset curation and alignment methods.

Result: Improved QwQ-32B’s pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs while maintaining robustness to training preemptions.

Conclusion: RLAX demonstrates efficient, scalable RL training for LLMs on TPUs with significant performance improvements and practical robustness to infrastructure interruptions.

Abstract: Reinforcement learning (RL) has emerged as the de-facto paradigm for improving the reasoning capabilities of large language models (LLMs). We have developed RLAX, a scalable RL framework on TPUs. RLAX employs a parameter-server architecture. A master trainer periodically pushes updated model weights to the parameter server while a fleet of inference workers pull the latest weights and generates new rollouts. We introduce a suite of system techniques to enable scalable and preemptible RL for a diverse set of state-of-art RL algorithms. To accelerate convergence and improve model quality, we have devised new dataset curation and alignment techniques. Large-scale evaluations show that RLAX improves QwQ-32B’s pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while remaining robust to preemptions during training.
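
The push/pull contract of a parameter-server design like this can be sketched in a few lines; versioning lets preempted inference workers detect staleness when they resume. This is a schematic, not RLAX's actual implementation.

```python
import threading

class ParameterServer:
    # Trainer pushes versioned weights; inference workers pull the latest
    # before generating rollouts, tolerating preemption and restart.
    def __init__(self):
        self._lock = threading.Lock()
        self._weights, self._version = None, 0

    def push(self, weights):
        with self._lock:
            self._weights, self._version = weights, self._version + 1

    def pull(self):
        with self._lock:
            return self._weights, self._version
```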

[442] LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding

Yu Yu, Qian Xie, Nairen Cao, Li Jin

Main category: cs.LG

TL;DR: LLM-driven neural architecture search for composite state encoders in multi-source RL outperforms traditional NAS and LLM-based baselines with fewer evaluations.

DetailsMotivation: Designing state encoders for RL with multiple information sources (sensor data, images, text) is challenging and requires manual design. Existing NAS methods overlook useful intermediate-output signals, limiting sample efficiency in multi-source RL.

Method: Proposes LLM-driven NAS pipeline where LLM serves as neural architecture design agent, leveraging language-model priors and intermediate-output signals to guide sample-efficient search for composite state encoders with source-specific modules and fusion module.

Result: On mixed-autonomy traffic control task, approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and LLM-based GENIUS framework.

Conclusion: LLM-driven NAS effectively addresses composite state encoder design for multi-source RL by leveraging language priors and intermediate signals for sample-efficient architecture search.

Abstract: Designing state encoders for reinforcement learning (RL) with multiple information sources – such as sensor measurements, time-series signals, image observations, and textual instructions – remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules – such as their representation quality – limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline in which the LLM serves as a neural architecture design agent, leveraging language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.
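
The outer loop of such an LLM-driven search reduces to propose, evaluate, feed back, where the history (including intermediate-output signals) conditions the next proposal. Both callbacks below are placeholders for an LLM prompt and an RL training run, respectively.

```python
def llm_nas_search(propose, evaluate, budget=20):
    # propose(history) -> candidate encoder architecture (e.g., an LLM call);
    # evaluate(arch) -> (score, side_info), where side_info carries
    # intermediate-output signals such as representation quality.
    history, best = [], None
    for _ in range(budget):
        arch = propose(history)
        score, side_info = evaluate(arch)
        history.append((arch, score, side_info))
        if best is None or score > best[1]:
            best = (arch, score)
    return best, history
```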

[443] DS FedProxGrad: Asymptotic Stationarity Without Noise Floor in Fair Federated Learning

Huzaifa Arif

Main category: cs.LG

TL;DR: Improved asymptotic convergence analysis for FedProxGrad in group fair federated learning, showing convergence to exact stationarity without variance-induced noise floor.

DetailsMotivation: Previous FedProxGrad analysis only showed convergence to a noise-dominated neighborhood with explicit dependence on variance-induced noise floor, limiting theoretical guarantees for group fair federated learning.

Method: Proposed DS FedProxGrad (Decay Step Size FedProxGrad) with Robbins-Monro step-size schedule and mild decay condition on local inexactness, extending the analytical framework with explicit fairness regularization.

Result: Proved liminf_{r→∞} 𝔼[‖∇F(x^r)‖²] = 0, meaning the algorithm is asymptotically stationary with convergence rate independent of variance-induced noise floor.

Conclusion: The improved analysis establishes stronger convergence guarantees for FedProxGrad-type methods in group fair federated learning, showing exact asymptotic stationarity rather than convergence to a noise-dominated neighborhood.

Abstract: Recent work \cite{arifgroup} introduced Federated Proximal Gradient \textbf{(\texttt{FedProxGrad})} for solving non-convex composite optimization problems in group fair federated learning. However, the original analysis established convergence only to a \textit{noise-dominated neighborhood of stationarity}, with explicit dependence on a variance-induced noise floor. In this work, we provide an improved asymptotic convergence analysis for a generalized \texttt{FedProxGrad}-type analytical framework with inexact local proximal solutions and explicit fairness regularization. We call this extended analytical framework \textbf{DS \texttt{FedProxGrad}} (Decay Step Size \texttt{FedProxGrad}). Under a Robbins-Monro step-size schedule \cite{robbins1951stochastic} and a mild decay condition on local inexactness, we prove that $\liminf_{r\to\infty} \mathbb{E}[\|\nabla F(\mathbf{x}^r)\|^2] = 0$, i.e., the algorithm is asymptotically stationary and the convergence rate does not depend on a variance-induced noise floor.
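
The Robbins-Monro conditions on the step-size schedule are concrete: the steps must sum to infinity while their squares sum to a finite value. A standard polynomial decay satisfying both:

```python
def robbins_monro_steps(eta0=1.0, power=0.6, rounds=10):
    # For power in (0.5, 1]: sum(eta_r) diverges while sum(eta_r**2)
    # converges, the classic Robbins-Monro conditions.
    return [eta0 / (r + 1) ** power for r in range(rounds)]
```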

cs.MA

[444] Norm-Governed Multi-Agent Decision-Making in Simulator-Coupled Environments:The Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP)

Stella C. Dong

Main category: cs.MA

TL;DR: R-CMASP: A multi-agent framework for reinsurance decision-making that combines simulator-coupled dynamics, role-specialized agents with typed communication, and normative constraints to improve stability and compliance over deterministic automation.

DetailsMotivation: Reinsurance decision-making has distributed/asymmetric information, partial observability, heterogeneous responsibilities, simulator-driven dynamics, and regulatory constraints that deterministic workflow automation cannot handle due to lack of epistemic flexibility, coordination mechanisms, and norm-sensitive behavior.

Method: Proposed Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP) extends stochastic games and Dec-POMDPs with: (1) simulator-coupled transition dynamics using catastrophe/capital/portfolio engines; (2) role-specialized agents with structured observability, belief updates, and typed communication; (3) normative feasibility layer encoding solvency/regulatory/organizational rules as admissibility constraints.

Result: LLM-based agents with tool access and typed message protocols in domain-calibrated synthetic environment show governed multi-agent coordination yields more stable, coherent, and norm-adherent behavior than deterministic automation or monolithic LLM baselines—reducing pricing variance, improving capital efficiency, and increasing clause-interpretation accuracy.

Conclusion: Regulated, simulator-driven decision environments are most naturally modeled as norm-governed, simulator-coupled multi-agent systems, with normative constraints and structured communication enhancing equilibrium stability.

Abstract: Reinsurance decision-making exhibits the core structural properties that motivate multi-agent models: distributed and asymmetric information, partial observability, heterogeneous epistemic responsibilities, simulator-driven environment dynamics, and binding prudential and regulatory constraints. Deterministic workflow automation cannot meet these requirements, as it lacks the epistemic flexibility, cooperative coordination mechanisms, and norm-sensitive behaviour required for institutional risk-transfer. We propose the Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP), a formal model that extends stochastic games and Dec-POMDPs by adding three missing elements: (i) simulator-coupled transition dynamics grounded in catastrophe, capital, and portfolio engines; (ii) role-specialized agents with structured observability, belief updates, and typed communication; and (iii) a normative feasibility layer encoding solvency, regulatory, and organizational rules as admissibility constraints on joint actions. Using LLM-based agents with tool access and typed message protocols, we show in a domain-calibrated synthetic environment that governed multi-agent coordination yields more stable, coherent, and norm-adherent behaviour than deterministic automation or monolithic LLM baselines–reducing pricing variance, improving capital efficiency, and increasing clause-interpretation accuracy. Embedding prudential norms as admissibility constraints and structuring communication into typed acts measurably enhances equilibrium stability. Overall, the results suggest that regulated, simulator-driven decision environments are most naturally modelled as norm-governed, simulator-coupled multi-agent systems.

[445] Empirical Hardness in Multi-Agent Pathfinding: Research Challenges and Opportunities

Jingyao Ren, Eric Ewing, T. K. Satish Kumar, Sven Koenig, Nora Ayanian

Main category: cs.MA

TL;DR: This paper identifies three key research challenges in understanding the empirical hardness of multi-agent pathfinding (MAPF) instances, focusing on algorithm selection, instance features affecting hardness, and generating hard/diverse benchmark datasets.

DetailsMotivation: There's a significant gap between theoretical NP-hardness of MAPF and the varying empirical hardness of individual instances. Understanding this empirical hardness phenomenon is crucial for developing better algorithms and benchmarks.

Method: The paper outlines a conceptual framework identifying three research challenges: 1) algorithm selection for given instances, 2) understanding instance features affecting hardness (structural properties like phase transition and backbone/backdoor), and 3) leveraging hardness knowledge to generate challenging or diverse benchmark datasets.

Result: The work establishes a foundational framework for future empirical hardness research in MAPF, identifying promising but underexplored research directions that could bridge the gap between theoretical complexity and practical instance hardness.

Conclusion: Understanding MAPF empirical hardness through these three challenges can lead to better algorithm selection, improved understanding of instance characteristics, and more effective benchmark generation, ultimately advancing the field beyond theoretical complexity analysis.

Abstract: Multi-agent pathfinding (MAPF) is the problem of finding collision-free paths for a team of agents on a map. Although MAPF is NP-hard, the hardness of solving individual instances varies significantly, revealing a gap between theoretical complexity and actual hardness. This paper outlines three key research challenges in MAPF empirical hardness to understand such phenomena. The first challenge, known as algorithm selection, is determining the best-performing algorithms for a given instance. The second challenge is understanding the key instance features that affect MAPF empirical hardness, such as structural properties like phase transition and backbone/backdoor. The third challenge is how to leverage our knowledge of MAPF empirical hardness to effectively generate hard MAPF instances or diverse benchmark datasets. This work establishes a foundation for future empirical hardness research and encourages deeper investigation into these promising and underexplored areas.

[446] Emergent Collective Memory in Decentralized Multi-Agent AI Systems

Khushiyant

Main category: cs.MA

TL;DR: Collective memory emerges in decentralized multi-agent systems through individual memory and environmental traces, with memory providing 68.7% performance improvement while traces require cognitive infrastructure to function.

DetailsMotivation: To understand how collective memory emerges in decentralized multi-agent systems through the interplay between individual agent memory and environmental trace communication, without centralized control.

Method: Agents maintain internal memory states while depositing persistent environmental traces, creating spatially distributed collective memory. Comprehensive validation across various environmental conditions (grid sizes 20x20 to 50x50, 5-20 agents, 50 runs per configuration) and systematic density-sweep experiments (rho in [0.049, 0.300], up to 625 agents).

Result: Individual memory alone provides 68.7% performance improvement over no-memory baselines (1563.87 vs 927.23, p < 0.001), while environmental traces without memory fail completely. Stigmergic coordination dominates above rho ~ 0.20, with traces outperforming memory by 36-41% on composite metrics despite lower food efficiency. Experimental crossover confirms predicted critical density rho_c = 0.230 within 13% error.

Conclusion: Memory functions independently but environmental traces require cognitive infrastructure for interpretation. The study demonstrates a critical asymmetry between individual memory and environmental communication, validating theoretical phase transition predictions in collective memory emergence.

Abstract: We demonstrate how collective memory emerges in decentralized multi-agent systems through the interplay between individual agent memory and environmental trace communication. Our agents maintain internal memory states while depositing persistent environmental traces, creating a spatially distributed collective memory without centralized control. Comprehensive validation across five environmental conditions (20x20 to 50x50 grids, 5-20 agents, 50 runs per configuration) reveals a critical asymmetry: individual memory alone provides 68.7% performance improvement over no-memory baselines (1563.87 vs 927.23, p < 0.001), while environmental traces without memory fail completely. This demonstrates that memory functions independently but traces require cognitive infrastructure for interpretation. Systematic density-sweep experiments (rho in [0.049, 0.300], up to 625 agents) validate our theoretical phase transition prediction. On realistic large grids (30x30, 50x50), stigmergic coordination dominates above rho ~ 0.20, with traces outperforming memory by 36-41% on composite metrics despite lower food efficiency. The experimental crossover confirms the predicted critical density rho_c = 0.230 within 13% error.
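
The trace side of the mechanism is classic stigmergy: a decaying scalar field that agents write to and read from. A minimal sketch, with illustrative decay rate and deposit amounts:

```python
import numpy as np

def step_traces(grid, deposits, decay=0.95):
    # Environmental traces persist but decay each tick; agents reading the
    # field share a spatially distributed memory with no central store.
    grid *= decay
    for (i, j), amount in deposits:
        grid[i, j] += amount
    return grid

field = step_traces(np.zeros((20, 20)), deposits=[((3, 4), 1.0), ((3, 5), 0.5)])
```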

[447] Thinking While Driving: A Concurrent Framework for Real-Time, LLM-Based Adaptive Routing

Xiaopei Tan, Muyang Fan

Main category: cs.MA

TL;DR: Thinking While Driving enables LLM-based route planning while agents keep moving, cutting decision latency at intersections to an average of just 0.75 seconds under high traffic via a non-blocking asynchronous architecture.

DetailsMotivation: Traditional approaches require agents to stop and deliberate for route planning, causing delays at intersections. The authors aim to enable concurrent routing where LLM-based agents can plan routes while moving, reducing wait times and improving traffic flow.

Method: A concurrent routing framework integrating LLMs into a graph-based traffic environment. Uses Unity coroutines and a dedicated request manager for non-blocking asynchronous architecture. Environment is a weighted undirected graph with live congestion metrics continuously updated by agents for shared perception.

Result: Agents average just 0.75 seconds of decision latency under high traffic. LLM-driven agents can dynamically adapt to traffic, reroute around congestion, and exhibit behaviors beyond static pathfinding while maintaining real-time performance.

Conclusion: The framework successfully enables LLM-based agents to plan routes while moving, significantly reducing intersection wait times. It provides a reproducible framework for future research in adaptive routing and multi-agent cooperation with real-time performance.

Abstract: We present Thinking While Driving, a concurrent routing framework that integrates LLMs into a graph-based traffic environment. Unlike approaches that require agents to stop and deliberate, our system enables LLM-based route planning while agents are moving, significantly reducing intersection wait times. Under high traffic, agents average just 0.75 seconds of decision latency. To coordinate many agents in real-time, we implement a non-blocking asynchronous architecture using Unity coroutines and a dedicated request manager. The environment is a weighted undirected graph with live congestion metrics, updated continuously by the agents to enable shared perception. Our results show LLM-driven agents can dynamically adapt to traffic, reroute around congestion, and exhibit behaviors beyond static pathfinding, all while maintaining real-time performance. This work provides a reproducible framework for future research in adaptive routing and multi-agent cooperation.
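
The non-blocking pattern translates naturally from Unity coroutines to Python's asyncio: motion continues every tick while a route request is in flight. The agent API (`snapshot`, `advance`, `set_route`, `arrived`) is hypothetical.

```python
import asyncio

async def drive_and_plan(agent, request_route):
    # "Thinking while driving": the LLM route request runs concurrently with
    # motion; the agent never stops at an intersection to wait for a reply.
    pending = asyncio.create_task(request_route(agent.snapshot()))
    while not agent.arrived():
        agent.advance()                      # keep moving on current route
        if pending.done():
            agent.set_route(pending.result())
            pending = asyncio.create_task(request_route(agent.snapshot()))
        await asyncio.sleep(0.05)            # yield, like a coroutine frame
    pending.cancel()
```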

cs.MM

[448] It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models

Xiangyu Zhao, Yaling Shen, Yiwen Jiang, Zimu Wang, Jiahe Liu, Maxmartwell H Cheng, Guilherme C Oliveira, Robert Desimone, Dominic Dwyer, Zongyuan Ge

Main category: cs.MM

TL;DR: A novel multi-modal LLM framework for depression detection that aligns audio-visual features at timestamp level, outperforming single-modality and previous multi-modal methods while requiring less training data.

DetailsMotivation: Depression is a prevalent mental health disorder, and while multi-modal data (speech, video, transcripts) and LLMs show promise for AI-assisted assessment, conventional LLMs are text-centric and cannot process critical non-verbal cues. Existing multi-modal LLMs are not tailored for psychological applications.

Method: Proposes a multi-modal LLM framework that augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves temporal dynamics modeling while reducing training data and computational requirements.

Result: Experiments on the DAIC-WoZ dataset show the model outperforms both single-modality approaches and previous multi-modal methods.

Conclusion: The framework demonstrates effective depression detection and can be extended to incorporate additional physiological signals, enabling broader clinical applications beyond mental health.

Abstract: Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.
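
The timestamp-level alignment can be sketched as resampling one modality's features onto the other's time axis. The snippet below uses nearest-frame alignment with illustrative frame rates and feature dimensions; the paper's exact alignment mechanism may differ.

```python
import numpy as np

# Map each audio frame to the nearest video frame so both modalities share one
# time axis before fusion. All rates and dimensions here are illustrative.
audio_fps, video_fps, seconds = 50, 25, 4
audio_feat = np.random.randn(audio_fps * seconds, 256)   # (T_a, D_a)
video_feat = np.random.randn(video_fps * seconds, 512)   # (T_v, D_v)

audio_t = np.arange(len(audio_feat)) / audio_fps          # audio timestamps
video_idx = np.clip(np.round(audio_t * video_fps).astype(int),
                    0, len(video_feat) - 1)               # nearest video frame

fused = np.concatenate([audio_feat, video_feat[video_idx]], axis=1)
print(fused.shape)  # (200, 768): one fused vector per audio timestamp
```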

eess.AS

[449] Exploring Perceptual Audio Quality Measurement on Stereo Processing Using the Open Dataset of Audio Quality

Pablo M. Delgado, Sascha Dick, Christoph Thompson, Chih-Wei Wu, Phillip A. Williams

Main category: eess.AS

TL;DR: ODAQ dataset update evaluates stereo processing effects on audio quality metrics, showing timbre-focused metrics struggle with complex presentation contexts.

DetailsMotivation: To provide a comprehensive framework for evaluating audio quality degradations, particularly focusing on stereo processing methods (MS/LR) and their impact on objective audio quality metrics.

Method: Updated ODAQ dataset with test signals and subjective ratings for stereo processing methods, used to evaluate state-of-the-art objective audio quality metrics under various conditions.

Result: Timbre-focused metrics perform well under simple conditions but suffer with complex presentation contexts, highlighting limitations of current objective metrics.

Conclusion: Future audio quality models need to better integrate both timbral and spatial dimensions by modeling the interplay between bottom-up psychoacoustic processes and top-down contextual factors.

Abstract: ODAQ (Open Dataset of Audio Quality) provides a comprehensive framework for exploring both monaural and binaural audio quality degradations across a range of distortion classes and signals, accompanied by subjective quality ratings. A recent update of ODAQ, focusing on the impact of stereo processing methods such as Mid/Side (MS) and Left/Right (LR), provides test signals and subjective ratings for the in-depth investigation of state-of-the-art objective audio quality metrics. Our evaluation results suggest that, while timbre-focused metrics often yield robust results under simpler conditions, their prediction performance tends to suffer under the conditions with a more complex presentation context. Our findings underscore the importance of modeling the interplay of bottom-up psychoacoustic processes and top-down contextual factors, guiding future research toward models that more effectively integrate both timbral and spatial dimensions of perceived audio quality.
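
For readers unfamiliar with the two stereo domains compared here, Mid/Side coding is a simple linear transform of the Left/Right channels, lossless until quantization is applied:

```python
import numpy as np

# Mid/Side (MS) vs Left/Right (LR) stereo representation, the two processing
# domains studied in the ODAQ update. The round trip below is exact.
def lr_to_ms(left, right):
    return (left + right) / 2.0, (left - right) / 2.0   # mid, side

def ms_to_lr(mid, side):
    return mid + side, mid - side                        # left, right

L = np.sin(np.linspace(0, 2 * np.pi, 8))
R = 0.5 * L + 0.1                                        # correlated channels
M, S = lr_to_ms(L, R)
L2, R2 = ms_to_lr(M, S)
assert np.allclose(L, L2) and np.allclose(R, R2)         # lossless round trip
```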

[450] Lightweight Model Attribution and Detection of Synthetic Speech via Audio Residual Fingerprints

Matías Pizarro, Mike Laszkiewicz, Dorothea Kolossa, Asja Fischer

Main category: eess.AS

TL;DR: A lightweight, training-free method for detecting synthetic speech and attributing it to source models using standardized average residuals as model-agnostic fingerprints.

DetailsMotivation: As speech generation technologies advance, risks of impersonation, misinformation, and spoofing increase, creating a need for effective synthetic speech detection and attribution methods for digital forensics and security applications.

Method: Compute standardized average residuals (difference between audio signal and its filtered version) to extract model-agnostic fingerprints capturing synthesis artifacts. Uses Mahalanobis distances for out-of-domain detection.

Result: AUROC scores above 99% across multiple synthesis systems and languages, strong reliability with partial model outputs, high performance under common audio distortions, and F1 score of 0.91 on unseen models for out-of-domain detection.

Conclusion: The method is efficient, generalizable, and suitable for digital forensics and security applications, offering a lightweight, training-free solution for synthetic speech detection and attribution with strong performance across various conditions.

Abstract: As speech generation technologies advance, so do risks of impersonation, misinformation, and spoofing. We present a lightweight, training-free approach for detecting synthetic speech and attributing it to its source model. Our method addresses three tasks: (1) single-model attribution in an open-world setting, (2) multi-model attribution in a closed-world setting, and (3) real vs. synthetic speech classification. The core idea is simple: we compute standardized average residuals (the difference between an audio signal and its filtered version) to extract model-agnostic fingerprints that capture synthesis artifacts. Experiments across multiple synthesis systems and languages show AUROC scores above 99%, with strong reliability even when only a subset of model outputs is available. The method maintains high performance under common audio distortions, including echo and moderate background noise, while data augmentation can improve results in more challenging conditions. In addition, out-of-domain detection is performed using Mahalanobis distances to in-domain residual fingerprints, achieving an F1 score of 0.91 on unseen models, reinforcing the method’s efficiency, generalizability, and suitability for digital forensics and security applications.
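
The fingerprinting recipe is straightforward to sketch: subtract a filtered copy of the waveform from the original, summarize the residual, and standardize the average over clips. The moving-average filter, FFT size, and spectral summary below are illustrative stand-ins for the paper's configuration.

```python
import numpy as np

# Residual fingerprint sketch: the residual (signal minus a filtered copy)
# retains high-frequency synthesis artifacts; averaging standardized residual
# spectra over many clips from one generator yields a model fingerprint.
def residual(x, k=9):
    lowpass = np.convolve(x, np.ones(k) / k, mode="same")
    return x - lowpass

def fingerprint(clips, n_fft=256):
    spectra = [np.abs(np.fft.rfft(residual(c), n=n_fft)) for c in clips]
    avg = np.mean(spectra, axis=0)
    return (avg - avg.mean()) / (avg.std() + 1e-8)   # standardized residual

rng = np.random.default_rng(0)
clips = [rng.standard_normal(4000) for _ in range(10)]
print(fingerprint(clips).shape)  # (129,): one fingerprint vector per model
```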

[451] A Low-Complexity Speech Codec Using Parametric Dithering for ASR

Ellison Murray, Morriel Kasher, Predrag Spasojevic

Main category: eess.AS

TL;DR: Dithering improves ASR input compression performance, with the proposed parametric dithering yielding 25-33.5% relative CER improvements at 1-3 bit resolutions while maintaining low complexity.

DetailsMotivation: Dithering is known to improve perceptual quality in lossy compression, but its application to ASR input compression needs analytical justification and optimization for speech recognition performance rather than just perceptual quality.

Method: The authors formalize optimal ASR performance under lossy input compression and propose a parametric dithering technique for a low-complexity speech compression pipeline. The method is adaptable to meet performance targets or stay within entropy constraints.

Result: The method performs well at 1-bit resolution with 25% relative CER improvement, and shows 32.4% and 33.5% improvements at 2- and 3-bit resolutions respectively. A second dither choice yields reduced data rate while maintaining performance.

Conclusion: Dithering is analytically and experimentally justified for ASR input compression, with the proposed parametric dithering technique providing significant performance improvements at low bit resolutions while maintaining low complexity and adaptability.

Abstract: Dithering is a technique commonly used to improve the perceptual quality of lossy data compression. In this work, we analytically and experimentally justify the use of dithering for ASR input compression. We formalize an understanding of optimal ASR performance under lossy input compression and leverage this to propose a parametric dithering technique for a low-complexity speech compression pipeline. The method performs well at 1-bit resolution, showing a 25% relative CER improvement, while also demonstrating improvements of 32.4% and 33.5% at 2- and 3-bit resolution, respectively, with our second dither choice yielding a reduced data rate. The proposed codec is adaptable to meet performance targets or stay within entropy constraints.
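
The core operation, adding zero-mean noise before a low-bit quantizer, is easy to demonstrate. The sketch below uses a mid-tread uniform quantizer and a uniform dither whose scale is a free parameter; it decorrelates the quantization error from the signal but does not reproduce the paper's optimized parametric dither.

```python
import numpy as np

# Dithered low-bit quantization sketch. Signal is assumed in [-1, 1]; the
# mid-tread quantizer and uniform dither here are illustrative choices.
def quantize(x, bits, dither_scale=0.0):
    step = 2.0 / (2 ** bits)
    d = np.random.uniform(-0.5, 0.5, x.shape) * step * dither_scale
    return np.clip(np.round((x + d) / step) * step, -1.0, 1.0)

t = np.linspace(0, 1, 16000)
speech = 0.6 * np.sin(2 * np.pi * 220 * t)     # toy stand-in for speech
for bits in (1, 2, 3):
    plain = np.mean((speech - quantize(speech, bits)) ** 2)
    dithered = np.mean((speech - quantize(speech, bits, 1.0)) ** 2)
    print(f"{bits}-bit  MSE plain: {plain:.5f}  dithered: {dithered:.5f}")
```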

eess.IV

[452] Active Optics for Hyperspectral Imaging of Reflective Agricultural Leaf Sensors

Dexter Burns, Sanjeev Koppal

Main category: eess.IV

TL;DR: Autonomous system using LiDAR, liquid lens, and Fast Steering Mirror to detect and interrogate plant-mounted sensors for real-time plant health monitoring.

DetailsMotivation: Current plant health monitoring relies on leaf-mounted sensors, but efficiently locating and sampling these sensors in complex agricultural environments is challenging.

Method: Integrated system with LiDAR to identify sensor reflective signatures, Fast Steering Mirror to redirect camera field of view, liquid lens for continuous focus adjustment, and hyperspectral imaging for spectral measurements.

Result: Validated in controlled indoor experiments with accurate detection and tracking of reflective plant sensors and successful acquisition of spectral data.

Conclusion: Establishes foundation for adaptive, low-cost, automated plant sensor interrogation, representing significant step toward scalable real-time plant health monitoring in precision agriculture.

Abstract: Monitoring plant health increasingly relies on leaf-mounted sensors that provide real-time physiological data, yet efficiently locating and sampling these sensors in complex agricultural environments remains a major challenge. We present an integrated, adaptive, and scalable system that autonomously detects and interrogates plant sensors using a coordinated suite of low-cost optical components including a LiDAR, liquid lens, monochrome camera, filter wheel, and Fast Steering Mirror (FSM). The system first uses LiDAR to identify the distinct reflective signatures of sensors within the field, then dynamically redirects the camera's field of view via the FSM to target each sensor for hyperspectral imaging. The liquid lens continuously adjusts focus to maintain image sharpness across varying depths, enabling precise spectral measurements. We validated the system in controlled indoor experiments, demonstrating accurate detection and tracking of reflective plant sensors and successful acquisition of their spectral data. To our knowledge, no other system currently integrates these sensing and optical modalities for agricultural monitoring. This work establishes a foundation for adaptive, low-cost, and automated plant sensor interrogation, representing a significant step toward scalable, real-time plant health monitoring in precision agriculture.
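
The acquisition pipeline can be summarized as a detect-steer-focus-capture loop. The sketch below encodes that loop with hypothetical device stubs (steer_fsm, focus_lens, capture_band); it is a reading of the abstract, not the authors' hardware API.

```python
# Acquisition-loop sketch of the pipeline above: LiDAR finds retroreflective
# sensor returns, the FSM steers the camera toward each one, and the liquid
# lens focuses using the LiDAR depth. All device functions are hypothetical.
def detect_sensors(lidar_points, intensity_thresh=0.9):
    # Sensors appear as high-intensity (retroreflective) LiDAR returns.
    return [(x, y, z) for (x, y, z, i) in lidar_points if i > intensity_thresh]

def acquire(lidar_points, steer_fsm, focus_lens, capture_band):
    spectra = []
    for (x, y, z) in detect_sensors(lidar_points):
        steer_fsm(x, y)            # redirect field of view to the target
        focus_lens(z)              # liquid-lens focus from LiDAR depth
        # One frame per filter-wheel band builds the hyperspectral stack.
        spectra.append([capture_band(b) for b in range(8)])
    return spectra

# Toy run with stand-in devices:
points = [(0.1, 0.2, 1.5, 0.95), (0.4, -0.1, 2.0, 0.3)]
out = acquire(points, steer_fsm=lambda x, y: None,
              focus_lens=lambda z: None, capture_band=lambda b: 0.0)
print(len(out), "sensor(s) imaged")  # 1
```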

[453] Hyperspectral Image Data Reduction for Endmember Extraction

Tomohiko Mizutani

Main category: eess.IV

TL;DR: Proposes a data reduction technique for self-dictionary endmember extraction that removes non-endmember pixels to reduce computational cost while maintaining accuracy.

DetailsMotivation: Self-dictionary methods for hyperspectral endmember extraction have high computational costs that limit their applicability to large-scale images, despite achieving good accuracy.

Method: Develops a data reduction technique based on linear mixing model with pure-pixel assumption to remove non-endmember pixels, then integrates this with a self-dictionary method using linear programming formulation.

Result: The proposed method substantially reduces computational time of original self-dictionary method without sacrificing endmember extraction accuracy.

Conclusion: Data reduction approach effectively addresses computational bottleneck of self-dictionary methods for hyperspectral endmember extraction while preserving accuracy.

Abstract: Endmember extraction from hyperspectral images aims to identify the spectral signatures of materials present in a scene. Recent studies have shown that self-dictionary methods can achieve high extraction accuracy; however, their high computational cost limits their applicability to large-scale hyperspectral images. Although several approaches have been proposed to mitigate this issue, it remains a major challenge. Motivated by this situation, this paper pursues a data reduction approach. Assuming that the hyperspectral image follows the linear mixing model with the pure-pixel assumption, we develop a data reduction technique that removes pixels that do not contain endmembers. We analyze the theoretical properties of this reduction step and show that it preserves pixels that lie close to the endmembers. Building on this result, we propose a data-reduced self-dictionary method that integrates the data reduction with a self-dictionary method based on a linear programming formulation. Numerical experiments demonstrate that the proposed method can substantially reduce the computational time of the original self-dictionary method without sacrificing endmember extraction accuracy.
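
The reduction idea, dropping pixels that cannot be endmembers under the pure-pixel assumption, can be illustrated with a simple convex-hull screening heuristic: endmembers are extreme points of the data cloud, so pixels never extreme along any direction are safe to discard. The random-projection rule below is an illustrative stand-in, not the paper's analyzed reduction criterion.

```python
import numpy as np

# Screen pixels by keeping only those that maximize or minimize a projection
# along random directions, i.e., candidate vertices of the convex hull.
def screen_pixels(X, n_dirs=200, seed=0):
    # X: (n_pixels, n_bands) spectra. Returns indices of retained pixels.
    rng = np.random.default_rng(seed)
    keep = set()
    for _ in range(n_dirs):
        d = rng.standard_normal(X.shape[1])
        proj = X @ d
        keep.update((int(proj.argmax()), int(proj.argmin())))
    return sorted(keep)

X = np.random.rand(5000, 50)
idx = screen_pixels(X)
print(f"kept {len(idx)} of {X.shape[0]} pixels for the self-dictionary step")
```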

[454] Fast and Robust LRSD-based SAR/ISAR Imaging and Decomposition

Hamid Reza Hashempour, Majid Moradikia, Hamed Bastami, Ahmed Abdelhadi, Mojtaba Soltanalian

Main category: eess.IV

TL;DR: A fast unified joint SAR imaging framework using robust low-rank-sparse decomposition that handles platform residual phase errors and reduces computational burden for real-time processing.

DetailsMotivation: Existing LRSD-based SAR imaging methods fail to handle platform residual phase errors (PRPE) from airborne instability and ignore real-time processing requirements by focusing only on image quality rather than computational efficiency.

Method: Proposes a unified joint SAR imaging framework that decomposes sparse objects and low-rank background features using robust LRSD. The method avoids computing large matrix inverses for image formation and uses constrained quadratic programming to handle unimodular constraints from PRPE. Also extends to ISAR autofocusing and imaging.

Result: Experiments with synthetic and real data show the proposed method outperforms state-of-the-art methods in both imaging quality and computational cost.

Conclusion: The framework successfully addresses both PRPE handling and computational efficiency for real-time SAR/ISAR imaging applications through a unified LRSD approach.

Abstract: The earlier works in the context of low-rank-sparse-decomposition (LRSD)-driven stationary synthetic aperture radar (SAR) imaging have shown significant improvement in the reconstruction-decomposition process. None of these frameworks, however, can achieve satisfactory performance when facing a platform residual phase error (PRPE) arising from the instability of airborne platforms. More importantly, in spite of the significance of real-time processing requirements in remote sensing applications, these prior works have only focused on enhancing the quality of the formed image, not reducing the computational burden. To address these two concerns, this article presents a fast and unified joint SAR imaging framework where the dominant sparse objects and low-rank features of the image background are decomposed and enhanced through a robust LRSD. In particular, our unified algorithm circumvents the tedious task of computing the inverse of large matrices for image formation and takes advantage of the recent advances in constrained quadratic programming to handle the unimodular constraint imposed due to the PRPE. Furthermore, we extend our approach to ISAR autofocusing and imaging. Specifically, due to the intrinsic sparsity of ISAR images, the LRSD framework is essentially tasked with the recovery of a sparse image. Several experiments based on synthetic and real data are presented to validate the superiority of the proposed method in terms of imaging quality and computational cost compared to the state-of-the-art methods.
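
For context, the LRSD split the paper accelerates separates a low-rank background from sparse scatterers, classically recovered by alternating singular-value thresholding with soft thresholding. The sketch below shows only that simplified baseline decomposition; the paper's contributions (avoiding large matrix inverses, the PRPE-aware quadratic programming) are not reproduced, and the parameters are illustrative.

```python
import numpy as np

def soft(x, t):
    # Elementwise soft thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lrsd(Y, lam=None, mu=2.0, iters=50):
    # Alternate singular-value thresholding (low-rank L) with soft
    # thresholding (sparse S). Parameter choices here are illustrative.
    lam = lam if lam is not None else 1.0 / np.sqrt(max(Y.shape))
    L, S = np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(iters):
        U, sv, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = (U * soft(sv, mu)) @ Vt        # shrunken-SVD background
        S = soft(Y - L, lam * mu)          # sparse scatterers
    return L, S

# Toy scene: smooth rank-1 background plus a few strong point scatterers.
rng = np.random.default_rng(0)
Y = np.outer(np.ones(64), np.ones(64)) + 5.0 * (rng.random((64, 64)) > 0.98)
L, S = lrsd(Y)
print("rank(L):", np.linalg.matrix_rank(L, tol=1e-2),
      "nnz(S):", int((np.abs(S) > 1e-2).sum()))
```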

[455] Artificial Intelligence in Image-based Cardiovascular Disease Analysis

Xin Wang, Mingcheng Hu, Connie W. Tsao, Hongtu Zhu

Main category: eess.IV

TL;DR: Comprehensive review of AI applications in image-based cardiovascular disease analysis, categorizing literature by anatomical structures and imaging modalities, with discussion of current challenges and future directions.

DetailsMotivation: Recent AI advancements have significantly impacted cardiovascular disease analysis, particularly in image-based diagnostics, creating a need for systematic review and categorization of current applications and future potential.

Method: Systematic literature review categorizing studies based on anatomical structures (non-vessel structures like ventricles/atria and vessel structures like aorta/coronary arteries) and imaging modalities (CT, MRI), providing structured analysis of AI applications in CVD analysis.

Result: Comprehensive review offering insights into current state of AI applications in image-based CVD analysis, covering diverse imaging techniques integrated with AI, and providing broad perspective on the field.

Conclusion: Identifies challenges and limitations in current AI-based CVD analysis methods and suggests directions for future research to overcome these hurdles, highlighting both current state and future potential of AI in cardiovascular diagnostics.

Abstract: Recent advancements in Artificial Intelligence (AI) have significantly influenced the field of Cardiovascular Disease (CVD) analysis, particularly in image-based diagnostics. Our paper presents an extensive review of AI applications in image-based CVD analysis, offering insights into its current state and future potential. We systematically categorize the literature based on the primary anatomical structures related to CVD, dividing them into non-vessel structures (such as ventricles and atria) and vessel structures (including the aorta and coronary arteries). This categorization provides a structured approach to explore various imaging modalities like Computed tomography (CT) and Magnetic Resonance Imaging (MRI), which are commonly used in CVD research. Our review encompasses these modalities, giving a broad perspective on the diverse imaging techniques integrated with AI for CVD analysis. We conclude with an examination of the challenges and limitations inherent in current AI-based CVD analysis methods and suggest directions for future research to overcome these hurdles.

[456] Equivariant Test-Time Training with Operator Sketching for Imaging Inverse Problems

Guixian Xu, Jinglai Li, Junqi Tang

Main category: eess.IV

TL;DR: Proposes sketched EI regularization using randomized sketching for computational acceleration in unsupervised deep imaging networks, with applications to test-time adaptation and parameter-efficient optimization.

DetailsMotivation: EI regularization has computational redundancy and inefficiency in high-dimensional applications, needing acceleration for practical use in test-time network adaptation.

Method: Develops sketched EI regularization using randomized sketching techniques, creates accelerated deep internal learning framework, and proposes parameter-efficient approach optimizing only normalization layers.

Result: Achieves significant computational acceleration over standard EI, especially in test-time training tasks for X-ray CT and multicoil MRI reconstruction.

Conclusion: Sketched EI regularization provides efficient acceleration for unsupervised deep imaging networks, enabling practical test-time adaptation with reduced computational overhead.

Abstract: Equivariant Imaging (EI) regularization has become the de-facto technique for unsupervised training of deep imaging networks, without any need for ground-truth data. Observing that the EI-based unsupervised training paradigm currently has significant computational redundancy leading to inefficiency in high-dimensional applications, we propose a sketched EI regularization which leverages randomized sketching techniques for acceleration. We apply our sketched EI regularization to develop an accelerated deep internal learning framework, which can be efficiently applied for test-time network adaptation. Additionally, for network adaptation tasks, we propose a parameter-efficient approach to accelerate both EI and Sketched-EI via optimizing only the normalization layers. Our numerical study on X-ray CT and multicoil magnetic resonance image reconstruction tasks demonstrates that our approach can achieve significant computational acceleration over the standard EI counterpart, especially in test-time training tasks.
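
The EI penalty that the sketching accelerates has a compact form: the reconstruction map should commute with a transformation group acting on the image. The snippet below evaluates that penalty with a stand-in linear reconstructor and circular shifts as the group; in practice f is a trainable network, and the paper's sketched variant reduces the cost of this extra forward pass via randomized sketching.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 64))        # underdetermined forward operator
f = lambda y: np.linalg.pinv(A) @ y      # stand-in for a trainable network
T = lambda x, s: np.roll(x, s)           # group action: circular shift by s

x_true = rng.standard_normal(64)
y = A @ x_true                           # measurements
x1 = f(y)                                # reconstruction from the data
x2 = f(A @ T(x1, 5))                     # reconstruct a transformed view
ei_loss = np.sum((x2 - T(x1, 5)) ** 2)   # equivariance penalty: f(AT x) ~ T x
mc_loss = np.sum((A @ x1 - y) ** 2)      # measurement-consistency term
print(round(float(ei_loss), 4), round(float(mc_loss), 4))
```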

Last updated: 2025-12-19
Built with Hugo; theme modified from Stack