Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 58]
cs.CV [Total: 174]
cs.AI [Total: 69]
cs.SD [Total: 6]
cs.LG [Total: 121]
cs.MA [Total: 2]
cs.MM [Total: 0]
eess.AS [Total: 7]
eess.IV [Total: 5]

cs.CL

[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani

Main category: cs.CL

TL;DR: Analysis of Chain-of-Thought reasoning attribution patterns in multilingual LLMs reveals limitations in faithfulness and interpretability, with attribution scores overemphasizing final reasoning steps and structured CoT showing limited multilingual robustness.

Details

Motivation: To assess the faithfulness and interpretability of Chain-of-Thought reasoning in multilingual LLMs, addressing concerns about the reliability of generated reasoning chains across different languages.

Method: Applied ContextCite for step-level attribution and Inseq for token-level attribution to Qwen2.5 1.5B-Instruct model using MGSM benchmark, with controlled perturbations via negation and distractor sentences.

Result: Attribution scores excessively emphasize final reasoning steps (especially in incorrect generations), structured CoT improves accuracy mainly for high-resource Latin-script languages, and perturbations reduce both accuracy and attribution coherence.

Conclusion: Chain-of-Thought prompting has significant limitations in multilingual robustness and interpretive transparency, highlighting reliability concerns for cross-lingual applications.

Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods–ContextCite for step-level attribution and Inseq for token-level attribution–to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.

[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu

Main category: cs.CL

TL;DR: Motion2Mind is a framework for evaluating AI’s Theory of Mind capabilities in interpreting nonverbal cues, revealing current systems struggle significantly with detection and tend to over-interpret compared to humans.

Details

Motivation: Existing ToM benchmarks focus mainly on false-belief tasks and overlook nonverbal communication and mental states beyond belief, creating a gap in evaluating AI's social understanding capabilities.

Method: Created Motion2Mind using expert-curated body-language references as knowledge base, building a video dataset with 222 nonverbal cue types and 397 mind states, with fine-grained annotations and verified psychological interpretations.

Result: Current AI systems show substantial performance gap in nonverbal cue detection and exhibit patterns of over-interpretation in explanations compared to human annotators.

Conclusion: The framework highlights significant limitations in AI’s ability to interpret nonverbal communication, indicating the need for improved ToM capabilities in machines for better social understanding.

Abstract: Our ability to interpret others’ mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection, as well as patterns of over-interpretation in Explanation compared to human annotators.

[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng, Narayanan Sadagopan, Zhou Yu

Main category: cs.CL

TL;DR: TOD-ProcBench is a challenging benchmark for evaluating LLMs’ ability to follow complex process instructions in task-oriented dialogues, featuring intricate constraints and three evaluation tasks.

Details

Motivation: Existing TOD benchmarks oversimplify complex real-world instructions by reducing them to simple schemas, creating a gap in systematically evaluating LLMs' instruction-following capabilities for complex multi-turn conversations.

Method: Created benchmark using ABCD dataset with human quality control, formulated constraints as multi-level condition-action statements, and designed three tasks: relevant statement retrieval, instruction-violation detection, and conditional response generation.

Result: The benchmark comprehensively evaluates LLMs’ complex instruction-following capabilities across multiple dimensions including multilingual settings and different instruction formats.

Conclusion: TOD-ProcBench addresses the gap in evaluating LLMs’ ability to handle complex real-world instructions in task-oriented dialogues and is released under Llama 3.3 Community License Agreement.

Abstract: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs’ instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs’ abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs’ complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs’ abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.

[4] Liars’ Bench: Evaluating Lie Detectors for Language Models

Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks

Main category: cs.CL

TL;DR: LIARS’ BENCH is a comprehensive testbed with 72,863 examples of lies and honest responses from four models across seven datasets, revealing systematic failures in existing lie detection techniques.

Details

Motivation: Prior lie detection techniques for LLMs were validated in narrow settings that don't capture the diverse lies models can generate, creating a need for more comprehensive evaluation.

Method: Created LIARS’ BENCH testbed with lies varying along two dimensions: reason for lying and belief target, then evaluated three black- and white-box detection techniques on this benchmark.

Result: Existing techniques systematically fail to identify certain lie types, especially when the lie can’t be determined from the transcript alone.

Conclusion: LIARS’ BENCH reveals limitations in current detection methods and provides a practical testbed to guide progress in lie detection for LLMs.

Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS’ BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model’s reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS’ BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it’s not possible to determine whether the model lied from the transcript alone. Overall, LIARS’ BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

[5] Learning Tractable Distributions Of Language Model Continuations

Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck, Benjie Wang

Main category: cs.CL

TL;DR: LTLA is a hybrid approach that combines a base language model with a tractable surrogate model to enable controlled text generation with sequence-level constraints, addressing efficiency issues in neural context integration.

Details

Motivation: Prior methods for controlled language generation use tractable surrogates like HMMs to approximate continuation distributions, but these are often weakly context-aware, reducing query quality. The goal is to improve constraint satisfaction while maintaining fluency.

Method: LTLA pairs a base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. It uses a single batched HMM update for all next-token candidates and conditions only the surrogate’s latent state prior on LM’s hidden representations while keeping the surrogate decoder fixed.

Result: LTLA achieves higher conditional likelihood than unconditional HMMs, approximates continuation distributions for vision-language models where standalone HMMs fail, and improves constraint satisfaction with comparable fluency on controlled-generation tasks, with minimal inference overhead.

Conclusion: LTLA effectively combines neural context with tractable surrogates for controlled language generation, overcoming efficiency pitfalls while maintaining constraint satisfaction and fluency.

Abstract: Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model’s next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate’s latent state prior on the LM’s hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.

[6] Early science acceleration experiments with GPT-5

Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, Nikita Zhivotovskiy

Main category: cs.CL

TL;DR: GPT-5 demonstrated practical utility in scientific research across multiple disciplines, producing new mathematical results and accelerating research workflows while highlighting areas where human expertise remains essential.

Details

Motivation: To showcase the concrete capabilities of frontier AI (GPT-5) in scientific research and demonstrate how AI can accelerate discovery across mathematics, physics, astronomy, computer science, biology, and materials science.

Method: Collection of short case studies documenting human-AI interactions where GPT-5 produced new research steps, with careful verification by human authors and analysis of where AI succeeded/failed and where human input was crucial.

Result: GPT-5 generated four new verified mathematical results that helped settle previously unsolved problems, plus demonstrated acceleration of research workflows across multiple scientific domains.

Conclusion: Frontier AI like GPT-5 can meaningfully contribute to scientific discovery, producing modest but profound results that underscore the potential of human-AI collaboration as AI capabilities rapidly advance.

Abstract: AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li, Chen Zhang, Yik Chung Wu, Ngai Wong, Yijie Chen, Hong Dai, Xiansen Chen, Mian Zhang

Main category: cs.CL

TL;DR: ELPO is an ensemble learning-based prompt optimization framework that uses voting mechanisms and multiple search methods to improve LLM performance, outperforming existing methods by up to 7.6 F1 score on complex tasks.

Details

Motivation: Manual prompt engineering is laborious and existing automatic prompt optimization methods are limited by using single models/algorithms, which struggle with complex tasks.

Method: Proposes ELPO framework using ensemble learning with voting mechanism, shared generation strategies, and different search methods for prompt optimization.

Result: ELPO outperforms state-of-the-art methods across different tasks, achieving 7.6 F1 score improvement on ArSarcasm dataset.

Conclusion: ELPO provides more accurate and robust prompt optimization through ensemble learning approach, effectively addressing limitations of single-algorithm methods.

Abstract: The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.

[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang, Haojun Fei

Main category: cs.CL

TL;DR: Introduces Token-Selective PEFT (TS-PEFT), a new paradigm that selectively applies PEFT modifications to only a subset of position indices, challenging the traditional approach of applying modifications to all indices.

Details

Motivation: To question the necessity of applying PEFT modifications to all position indices in large models, as traditional PEFT approaches indiscriminately modify all positions which may be inefficient or counterproductive.

Method: Proposes TS-PEFT framework where a selection function S selectively applies PEFT modifications to only a subset of position indices rather than all indices.

Result: Experimental results show that indiscriminate application of PEFT to all indices is not only superfluous but may be counterproductive, while selective application can enhance performance on downstream tasks.

Conclusion: Provides a fresh perspective on PEFT, advocating for more targeted modifications and offering a framework for optimizing fine-tuning processes in large models through selective parameter updates.

Abstract: In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.

[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

Sebastian Haan

Main category: cs.CL

TL;DR: SemanticCite is an AI-powered system that verifies citation accuracy through full-text analysis, providing contextual explanations and classifying claim-source relationships into four categories (Supported, Partially Supported, Unsupported, Uncertain).

Details

Motivation: Address challenges in academic literature including semantic citation errors, AI-generated hallucinated references, and traditional citation formats that lack specificity about which sections support claims.

Method: Combines multiple retrieval methods with a four-class classification system, using fine-tuned lightweight language models for full-text source analysis with detailed reasoning and relevant text snippets.

Result: Fine-tuned lightweight models achieve performance comparable to large commercial systems with significantly lower computational requirements. Provides a comprehensive dataset of over 1,000 citations with detailed alignments, classifications, and annotations across eight disciplines.

Conclusion: SemanticCite enables scalable citation verification, supports peer review and AI-generated content quality control, and provides an open-source foundation for maintaining citation accuracy at scale.

Abstract: Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.

[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu

Main category: cs.CL

TL;DR: Semantic Structural Entropy (SeSE) is a new uncertainty quantification framework that uses semantic structural information to detect hallucinations in LLMs, outperforming existing methods.

Details

Motivation: Current UQ methods overlook latent semantic structural information, which could provide more precise uncertainty estimates for detecting hallucinations in safety-critical LLM applications.

Method: Developed adaptively sparsified directed semantic graphs to capture semantic dependencies, then defined SeSE as structural entropy of optimal semantic encoding trees to quantify uncertainty through hierarchical abstraction.

Result: Extensive experiments across 29 model-dataset combinations show SeSE significantly outperforms advanced UQ baselines, including supervised methods and KLE.

Conclusion: SeSE provides a principled UQ framework that effectively leverages semantic structural information for reliable hallucination detection in LLMs.

Abstract: Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation – where existing methods often rely on heuristic sample-and-count techniques – we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.

[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Wei Xia, Zhi-Hong Deng

Main category: cs.CL

TL;DR: SDA is a training-free alignment framework that dynamically adjusts LLM output probabilities using user instructions to improve alignment with human intent without fine-tuning.

Details

Motivation: As LLMs are deployed in real-world applications, ensuring their responses align with human intent across diverse tasks and preferences remains challenging, especially during inference without costly retraining.

Method: Proposes SDA (Steering-Driven Distribution Alignment) - a model-agnostic framework that redistributes model output probabilities based on user-defined alignment instructions, working independently during inference or with training-based methods.

Result: SDA consistently improved alignment across 8 open-source LLMs, achieving average gains of 64.4% in helpfulness, 30% in honesty, and 11.5% in harmlessness across three key alignment dimensions.

Conclusion: SDA is an effective, lightweight, and generalizable solution for improving LLM alignment without fine-tuning, supporting personalized preference alignment across diverse models and scenarios.

Abstract: With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You, Jie Tang, Qingsong Liu, Yuhang Guo, Yangyang Kang

Main category: cs.CL

TL;DR: Self-rewriting framework improves reasoning quality by having models rewrite their own reasoning texts, addressing issues like over-thinking and disordered thinking while maintaining RL scalability.

Details

Motivation: Traditional RL with outcome correctness rewards provides limited supervision over internal reasoning processes, leading to suboptimal reasoning quality with issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking.

Method: Proposes self-rewriting framework where models rewrite their own reasoning texts and learn from rewritten reasoning. Uses selective rewriting on ‘simple’ samples (consistently correct) to preserve GRPO reward signals, with rewriting and vanilla generation compiled in single batches.

Result: Achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) without explicit length reduction instructions. Significantly higher internal reasoning quality scores (+7.2) under LLM-as-a-judge metric, successfully mitigating reasoning flaws.

Conclusion: Self-rewriting framework effectively improves internal reasoning quality while maintaining RL scalability, addressing fundamental limitations of outcome-only rewards in reasoning tasks.

Abstract: Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only “simple” samples, defined by the model’s consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.

[13] NLP Datasets for Idiom and Figurative Language Tasks

Blake Matheny, Phuong Minh Nguyen, Minh Le Nguyen, Stephanie Reynolds

Main category: cs.CL

TL;DR: This paper addresses the challenge of idiomatic and figurative language understanding in LLMs by creating new datasets for training and evaluation, focusing on idiom recognition tasks.

Details

Motivation: Idioms and figurative language remain difficult for LLMs despite large corpora, creating a need for specialized datasets to improve model performance on informal language.

Method: Created large-scale datasets by compiling idiom lists from existing sources, retrieving context sequences from a large corpus, and producing human-annotated datasets for evaluation. Used these datasets for slot labeling and sequence tagging tasks.

Result: Developed three datasets: one large-scale dataset of potential idiomatic expressions and two human-annotated datasets of definite idiomatic expressions for evaluating pre-trained language models.

Conclusion: The presented datasets provide a foundation for building better models and developing new approaches to handle figurative language, helping narrow the performance gap in idiom understanding for LLMs.

Abstract: Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets were used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.

[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Main category: cs.CL

TL;DR: Sufficiency metric for rationales has limitations; it captures non-rationalized context interference rather than rationale informativeness, and shows complex relationships with token classification and model performance.

Details

Motivation: To better understand the informativeness of rationales (human explanations) and address limitations of the sufficiency metric in assessing whether models learn for the right reasons.

Method: Relate sufficiency to token classification ability and model performance improvement through attention regularization, analyzing cross-domain classification with rationale incorporation.

Result: Highly informative rationales don’t necessarily help correct classification; sufficiency captures non-rationalized context interference; rationale incorporation boosts cross-domain performance inconsistently; sufficiency and token classification are unrelated.

Conclusion: Rationales are complex, and metrics that systematically capture rationale information need further investigation beyond current sufficiency measures.

Abstract: Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.

[15] AICC: Parse HTML Finer, Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He

Main category: cs.CL

TL;DR: MinerU-HTML is a novel HTML-to-text extraction pipeline using a 0.6B-parameter language model that significantly outperforms heuristic methods like Trafilatura, preserving structured elements and improving downstream model performance.

Details

Motivation: Current web data curation focuses on filtering and deduplication while treating HTML extraction as fixed preprocessing. Heuristic extractors struggle to preserve document structure and corrupt elements like formulas, codes, and tables.

Method: Reformulates content extraction as sequence labeling using a 0.6B-parameter language model. Uses semantic understanding with two-stage formatting pipeline that categorizes semantic elements before converting to Markdown.

Result: Achieves 81.8% ROUGE-N F1 vs Trafilatura’s 63.6% on MainWebBench (7,887 pages). Exceptional structured element preservation: 90.9% for code blocks, 94.0% for formulas. AICC corpus (7.3T tokens) outperforms TfCC by 1.08pp on 13 benchmarks.

Conclusion: HTML extraction quality significantly impacts model capabilities and is a critical, often underestimated component of web corpus construction. Model-based approaches are inherently scalable compared to heuristic methods.

Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication,treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura’s 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp-providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.

[16] Classification of worldwide news articles by perceived quality, 2018-2024

Connor McElroy, Thiago E. A. de Oliveira, Chris Brogly

Main category: cs.CL

TL;DR: Machine learning and deep learning models can effectively distinguish between perceived low-quality and high-quality news articles, with deep learning models achieving better performance than traditional classifiers.

Details

Motivation: To explore whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles.

Method: Used 3 machine learning classifiers and 3 deep learning models on a dataset of 1,412,272 English news articles from Common Crawl (2018-2024). Articles were classified into low/high-quality based on expert consensus ratings on 579 source websites, using 194 linguistic features per website-level labeled article.

Result: Random Forest achieved 0.7355 accuracy and 0.8131 ROC AUC. ModernBERT-large (256 context) performed best with 0.8744 accuracy, 0.9593 ROC-AUC, and 0.8739 F1. Other deep learning models also showed strong performance with accuracies ranging from 0.8478 to 0.8685.

Conclusion: Both traditional CPU-based machine learning classifiers and deep learning classifiers can effectively differentiate the perceived quality of worldwide news articles, with deep learning models demonstrating superior performance.

Abstract: This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.

[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Sherine George, Nithish Saji

Main category: cs.CL

TL;DR: ESGBench is a benchmark for evaluating explainable ESG question answering systems using corporate sustainability reports, with domain-specific questions and human-curated answers.

Details

Motivation: To assess and improve the performance of AI systems in ESG question answering, focusing on factual consistency, traceability, and domain alignment in sustainability reporting.

Method: Created a benchmark dataset with domain-grounded ESG questions paired with human-curated answers and supporting evidence for fine-grained evaluation.

Result: Analysis of state-of-the-art LLMs revealed key challenges in factual consistency, traceability, and domain alignment when processing ESG content.

Conclusion: ESGBench provides a framework to accelerate research in transparent and accountable ESG-focused AI systems by enabling comprehensive evaluation of model reasoning capabilities.

Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.

[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Andrew Gomes

Main category: cs.CL

TL;DR: The paper investigates how transformer models process idiomatic expressions using circuit discovery techniques, identifying specific attention patterns and mechanisms that enable efficient handling of non-compositional language.

Details

Motivation: To understand how transformer-based language models process idiomatic expressions and non-compositional language, which requires different computational patterns than literal language.

Method: Used modified path patching algorithm for circuit discovery and analysis, identifying “Idiom Heads” (attention heads that activate across idioms) and “augmented reception” (enhanced attention between idiom tokens).

Result: Found distinct computational patterns for idiom processing, identified specific attention mechanisms that balance computational efficiency and robustness in handling non-compositional language.

Conclusion: The findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding more complex grammatical constructions.

Abstract: We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term augmented reception.’’ We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.

[19] Arctic-Extract Technical Report

Mateusz Chiliński, Julita Ołtusek, Wojciech Jaśkowski

Main category: cs.CL

TL;DR: Arctic-Extract is a lightweight (6.6 GiB) SoTA model for extracting structural data from business documents, deployable on resource-constrained hardware like A10 GPUs.

Details

Motivation: To create a high-performance document understanding model that can run on resource-constrained hardware while maintaining state-of-the-art extraction capabilities.

Method: Developed training protocols for Arctic-Extract, focusing on efficient model architecture that maintains performance while reducing resource requirements.

Result: The model achieves strong document understanding performance while being deployable on A10 GPUs (24GB memory), processing up to 125 A4 pages, making it suitable for long document processing.

Conclusion: Arctic-Extract successfully demonstrates that high-performance document extraction can be achieved on resource-constrained hardware, making advanced document understanding accessible for practical deployment scenarios.

Abstract: Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighting only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making suitable for long document processing. This paper highlights Arctic-Extract’s training protocols and evaluation results, demonstrating its strong performance in document understanding.

[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker

Main category: cs.CL

TL;DR: TurkColBERT is the first benchmark comparing dense encoders and late-interaction models for Turkish IR, showing that smaller late-interaction models outperform larger dense encoders while being more parameter-efficient.

Details

Motivation: Neural IR systems are underexplored for morphologically rich, lower-resource languages like Turkish, and late-interaction models haven't been systematically evaluated despite dense bi-encoders dominating Turkish IR.

Method: Two-stage adaptation pipeline: fine-tune English/multilingual encoders on Turkish NLI/STS tasks, then convert to ColBERT-style retrievers using PyLate trained on MS MARCO-TR. Evaluated 10 models across five Turkish BEIR datasets.

Result: Late-interaction models 3-5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. MUVERA+Rerank is 3.33× faster than PLAID with +1.7% relative mAP gain.

Conclusion: Late-interaction models offer superior parameter efficiency and performance for Turkish IR, enabling low-latency retrieval with 0.54 ms query times. Limitations include reliance on moderately sized datasets and translated benchmarks.

Abstract: Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models – which retain token-level representations for fine-grained matching – have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3–5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.

[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov

Main category: cs.CL

TL;DR: This paper presents a predictive framework that can identify text genre from LLM activations using shallow learning models, achieving up to 98% F1-score.

Details

Motivation: Understanding LLMs is crucial for safe deployment, but complicated by their interpretability challenges and inability to human-evaluate all outputs.

Method: Using Mistral-7B and two datasets, genre is predicted from LLM activations using scikit-learn classifiers.

Result: Genre can be extracted with F1-scores of up to 98% and 71%, consistently outperforming control tasks across both datasets.

Conclusion: This provides proof of concept that text genres can be inferred from LLMs using shallow learning models.

Abstract: Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.

[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, Ernest Lim

Main category: cs.CL

TL;DR: This paper challenges the use of Word Error Rate (WER) for evaluating ASR in clinical dialogue, showing poor correlation with clinical impact. It introduces an LLM-as-a-Judge framework optimized with GEPA that achieves human-comparable performance for assessing clinical safety.

Details

Motivation: Standard ASR evaluations rely heavily on WER, but this may not reflect the clinical impact of transcription errors in healthcare settings where safety is critical.

Method: Established gold-standard benchmark with expert clinicians labeling clinical impact of ASR errors, then introduced LLM-as-a-Judge framework optimized using GEPA to replicate expert assessment.

Result: WER and existing metrics correlate poorly with clinical impact labels. The optimized Gemini-2.5-Pro judge achieved 90% accuracy and Cohen’s κ of 0.816, comparable to human performance.

Conclusion: Provides a validated automated framework for moving ASR evaluation beyond textual fidelity to scalable assessment of clinical safety.

Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.

[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

Kexin Zhao, Ken Forbus

Main category: cs.CL

TL;DR: A method using LLMs as oracles for word sense disambiguation without hand-annotated training data, converting symbolic meanings to natural language alternatives for disambiguation.

Details

Motivation: Current WSD methods rely on coarse-grained representations and hand-annotated data, making it difficult to disambiguate richer representations needed for sophisticated inference.

Method: Convert multiple candidate meanings from symbolic NLU system into distinguishable natural language alternatives, query LLM to select appropriate interpretations based on context, and propagate selections back to symbolic system.

Result: Method evaluated against human-annotated gold answers and demonstrated effectiveness.

Conclusion: Proposed approach enables automatic disambiguation of richer representations without requiring hand-annotated training data.

Abstract: Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.

[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, Roberto Hernandez

Main category: cs.CL

TL;DR: Direct multimodal embedding retrieval outperforms LLM-summary-based approaches in multimodal RAG systems, achieving significant improvements in retrieval metrics and answer quality by preserving visual context.

Details

Motivation: Existing multimodal RAG systems lose contextual information and visual details by converting images to text through LLM summarization during preprocessing, which is critical for financial document analysis.

Method: Comparative analysis of two retrieval approaches: text-based chunk retrieval (images summarized into text) vs direct multimodal embedding retrieval (images stored natively in vector space), evaluated across 6 LLM models and 2 multimodal embedding models on a financial earnings call benchmark.

Result: Direct multimodal embedding retrieval achieved 13% absolute improvement in mAP@5 and 11% in nDCG@5 (32% and 20% relative improvements), producing more accurate and factually consistent answers while preserving visual context.

Conclusion: Direct multimodal embeddings preserve visual context better than LLM summarization, which introduces information loss, making direct multimodal retrieval more effective for multimodal RAG systems in financial document analysis.

Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate all three approaches across 6 LLM models and a two multi-modal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain. These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.

[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Main category: cs.CL

TL;DR: Nemotron Elastic is a framework for building reasoning-oriented LLMs that embeds multiple nested submodels within a single parent model, enabling zero-shot extraction of different-sized models without additional training.

Details

Motivation: Training separate large language models for different scales and deployment objectives is prohibitively expensive, requiring separate training runs for each size.

Method: Uses hybrid Mamba-Attention architectures with end-to-end trained router, group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance, and knowledge distillation for simultaneous multi-budget optimization.

Result: Applied to Nemotron Nano V2 12B model, produced 9B and 6B models using only 110B training tokens with 360x cost reduction vs training from scratch and 7x vs SoTA compression. Nested models perform on par or better than SoTA in accuracy.

Conclusion: The framework enables many-in-one reasoning models with constant deployment memory against the number of models in the family, unlike other compression methods.

Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba’s structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.

[26] GPTopic: Dynamic and Interactive Topic Representations

Arik Reuter, Bishnu Khadka, Anton Thielmann, Christoph Weisser, Sebastian Fischer, Benjamin Säfken

Main category: cs.CL

TL;DR: GPTopic is a software package that uses Large Language Models to create interactive topic representations with a chat interface, making topic modeling more accessible than traditional top-word lists.

Details

Motivation: Traditional topic modeling with top-word lists requires expertise to interpret and fails to capture the full complexity and nuances of topics, limiting accessibility for non-experts.

Method: Leverages Large Language Models (LLMs) to create dynamic, interactive topic representations with an intuitive chat interface for exploring, analyzing, and refining topics.

Result: Developed GPTopic software package that provides more comprehensive and accessible topic modeling through interactive exploration.

Conclusion: GPTopic addresses the limitations of traditional topic modeling by making it more accessible and comprehensive through LLM-powered interactive representations.

Abstract: Topic modeling seems to be almost synonymous with generating lists of top words to represent topics within large text corpora. However, deducing a topic from such list of individual terms can require substantial expertise and experience, making topic modelling less accessible to people unfamiliar with the particularities and pitfalls of top-word interpretation. A topic representation limited to top-words might further fall short of offering a comprehensive and easily accessible characterization of the various aspects, facets and nuances a topic might have. To address these challenges, we introduce GPTopic, a software package that leverages Large Language Models (LLMs) to create dynamic, interactive topic representations. GPTopic provides an intuitive chat interface for users to explore, analyze, and refine topics interactively, making topic modeling more accessible and comprehensive. The corresponding code is available here: https://github.com/ArikReuter/TopicGPT.

[27] LLMs as Models for Analogical Reasoning

Sam Musker, Alex Duchnowski, Raphaël Millière, Ellie Pavlick

Main category: cs.CL

TL;DR: LLMs can match human performance in novel analogical reasoning tasks requiring flexible semantic re-representation, but show different response patterns to certain variations, suggesting they offer how-possibly rather than how-actually explanations of human analogy.

Details

Motivation: To investigate whether LLMs' emergent analogical reasoning capabilities are superficial or encompass the flexible representational and mapping capabilities central to human analogy, particularly in contexts not well captured by existing cognitive theories.

Method: Introduced novel analogical reasoning tasks requiring mapping between semantic words and abstract character sequences, testing both semantic structure and content reasoning with variations to assess robustness of analogical inferences. Compared performance of human participants and advanced LLMs.

Result: Advanced LLMs matched human performance across several conditions, but humans and LLMs responded differently to certain task variations and semantic distractors.

Conclusion: LLMs might offer a how-possibly explanation of human analogical reasoning in contexts not well modeled by existing theories, but current models are unlikely to provide how-actually explanations of human analogy.

Abstract: Analogical reasoning – the capacity to identify and map structural relationships between different domains – is fundamental to human cognition and learning. Recent studies have shown that large language models (LLMs) can sometimes match humans in analogical reasoning tasks, opening the possibility that analogical reasoning might emerge from domain-general processes. However, it is still debated whether these emergent capacities are largely superficial and limited to simple relations seen during training or whether they encompass the flexible representational and mapping capabilities which are the focus of leading cognitive models of analogy. In this study, we introduce novel analogical reasoning tasks that require participants to map between semantically contentful words and sequences of letters and other abstract characters. This task necessitates the ability to flexibly re-represent rich semantic information – an ability which is known to be central to human analogy but which is thus far not well captured by existing cognitive theories and models. We assess the performance of both human participants and LLMs on tasks focusing on reasoning from semantic structure and semantic content, introducing variations that test the robustness of their analogical inferences. Advanced LLMs match human performance across several conditions, though humans and LLMs respond differently to certain task variations and semantic distractors. Our results thus provide new evidence that LLMs might offer a how-possibly explanation of human analogical reasoning in contexts that are not yet well modeled by existing theories, but that even today’s best models are unlikely to yield how-actually explanations.

[28] CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering

Yike Wu, Yi Huang, Nan Hu, Yuncheng Hua, Guilin Qi, Jiaoyan Chen, Jeff Z. Pan

Main category: cs.CL

TL;DR: CoTKR is a novel Chain-of-Thought enhanced knowledge rewriting method that generates reasoning traces and knowledge interleaved to improve LLM performance in Knowledge Graph Question Answering by addressing limitations of single-step rewriting.

Details

Motivation: Existing methods for rewriting retrieved subgraphs into natural language for LLMs in KGQA often include irrelevant information, omit crucial details, or fail to align with question semantics, especially for complex questions.

Method: Proposes CoTKR (Chain-of-Thought Enhanced Knowledge Rewriting) that generates reasoning traces and corresponding knowledge in an interleaved manner, plus PAQAF (Preference Alignment from Question Answering Feedback) training strategy to optimize the knowledge rewriter using QA model feedback.

Result: Experimental results across various LLMs and KGQA benchmarks show CoTKR generates the most beneficial knowledge representation for QA models, significantly improving LLM performance in KGQA compared to previous knowledge rewriting methods.

Conclusion: The proposed CoTKR method with PAQAF training effectively addresses limitations of single-step knowledge rewriting and bridges the preference gap between knowledge rewriter and QA model, leading to substantial performance improvements in KGQA tasks.

Abstract: Recent studies have explored the use of Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for Knowledge Graph Question Answering (KGQA). They typically require rewriting retrieved subgraphs into natural language formats comprehensible to LLMs. However, when tackling complex questions, the knowledge rewritten by existing methods may include irrelevant information, omit crucial details, or fail to align with the question’s semantics. To address them, we propose a novel rewriting method CoTKR, Chain-of-Thought Enhanced Knowledge Rewriting, for generating reasoning traces and corresponding knowledge in an interleaved manner, thereby mitigating the limitations of single-step knowledge rewriting. Additionally, to bridge the preference gap between the knowledge rewriter and the question answering (QA) model, we propose a training strategy PAQAF, Preference Alignment from Question Answering Feedback, for leveraging feedback from the QA model to further optimize the knowledge rewriter. We conduct experiments using various LLMs across several KGQA benchmarks. Experimental results demonstrate that, compared with previous knowledge rewriting methods, CoTKR generates the most beneficial knowledge representation for QA models, which significantly improves the performance of LLMs in KGQA.

[29] Atomic Calibration of LLMs in Long-Form Generations

Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier

Main category: cs.CL

TL;DR: This paper studies atomic calibration for LLMs in long-form generation, decomposing responses into atomic claims to evaluate factuality calibration at a fine-grained level, revealing poorer calibration than macro approaches.

Details

Motivation: LLMs suffer from hallucinations in real-world applications, and existing confidence calibration methods focus on short-form tasks with single response-level scores, which are insufficient for long-form outputs containing both accurate and inaccurate claims.

Method: Systematically study atomic calibration by decomposing long responses into atomic claims, categorize confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration.

Result: LLMs exhibit poorer calibration at the atomic level during long-form generation, and atomic calibration uncovers patterns regarding confidence method alignment and confidence changes throughout generation.

Conclusion: Atomic calibration provides valuable insights for future research directions in confidence estimation for long-form generation, highlighting the need for fine-grained calibration approaches.

Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential to enhance the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and the changes of confidence throughout generation. This sheds light on future research directions for confidence estimation in long-form generation.

[30] Crowdsourcing Lexical Diversity

Hadi Khalilia, Jahna Otterbacher, Gabor Bella, Shandy Darma, Fausto Giunchiglia

Main category: cs.CL

TL;DR: Proposes a crowdsourcing methodology using the LingoGap platform to identify and reduce bias in lexical-semantic resources by detecting equivalent terms, language-specific terms, and lexical gaps across languages.

Details

Motivation: Lexical-semantic resources often suffer from bias towards English and Anglo-Saxon culture, including missing culture-specific concepts, presence of foreign concepts, and lack of explicit indication of untranslatability (lexical gaps).

Method: Crowdsourcing approach where workers compare lexemes from two languages using the LingoGap platform, focusing on domains rich in lexical diversity like kinship or food through microtasks.

Result: Applied to English-Arabic and Standard Indonesian-Banjarese case studies, identifying 2,140 lexical gaps in the first and 951 in the second, successfully validating the method.

Conclusion: The methodology and tool are usable and effective for future large-scale lexicon enrichment tasks to reduce bias in lexical-semantic resources.

Abstract: Lexical-semantic resources (LSRs), such as online lexicons and wordnets, are fundamental to natural language processing applications as well as to fields such as linguistic anthropology and language preservation. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, but also the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, the presence of foreign (Anglo-Saxon) concepts, as well as in the lack of an explicit indication of untranslatability, also known as cross-lingual lexical gaps, when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing platform facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.

[31] OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: OmniThink is a slow-thinking machine writing framework that improves article generation by simulating human iterative learning processes, addressing limitations of current retrieval-augmented generation methods.

Details

Motivation: Current retrieval-augmented generation approaches produce shallow, unoriginal, and repetitive content due to limited depth, novelty, and redundancy in retrieved information, negatively impacting article quality.

Method: OmniThink emulates human cognitive behavior through iterative expansion and reflection, simulating how learners slowly deepen their knowledge of topics.

Result: Experimental results show OmniThink improves knowledge density without compromising coherence and depth. Human evaluations confirm its effectiveness for long-form article generation.

Conclusion: OmniThink demonstrates potential to address real-world challenges in long-form article generation by incorporating slow-thinking cognitive processes into machine writing.

Abstract: Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model’s predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, novelty, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.

[32] Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks

Darpan Aswal, Manjira Sinha

Main category: cs.CL

TL;DR: Proposes graph-based models (GNNs and HGNNs) as lightweight alternatives to transformers for environmental claim detection, achieving superior performance with 30x fewer parameters.

Details

Motivation: Transformer models require significant computational power, posing challenges for resource-constrained applications. Need lightweight alternatives for environmental claim detection.

Method: Reframe task as graph classification using dependency parsing graphs with word2vec & POS tag embeddings for nodes, syntactic dependencies for edges. Use GNNs and HGNNs (including hyperbolic variants).

Result: HGNNs in poincaré space achieve superior performance to state-of-the-art while using 30x fewer parameters. HGNNs benefit from modeling hierarchical structures.

Conclusion: Graph-based models, especially hyperbolic GNNs, offer effective lightweight alternatives to transformers for environmental claim detection with significant parameter reduction.

Abstract: Transformer based models, especially large language models (LLMs) dominate the field of NLP with their mass adoption in tasks such as text generation, summarization and fake news detection. These models offer ease of deployment and reliability for most applications, however, they require significant amounts of computational power for training as well as inference. This poses challenges in their adoption in resource-constrained applications, especially in the open-source community where compute availability is usually scarce. This work proposes a graph-based approach for Environmental Claim Detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. Re-framing the task as a graph classification problem, we transform claim sentences into dependency parsing graphs, utilizing a combination of word2vec & learnable part-of-speech (POS) tag embeddings for the node features and encoding syntactic dependencies in the edge relations. Our results show that our graph-based models, particularly HGNNs in the poincaré space (P-HGNNs), achieve performance superior to the state-of-the-art on environmental claim detection while using up to \textbf{30x fewer parameters}. We also demonstrate that HGNNs benefit vastly from explicitly modeling data in hierarchical (tree-like) structures, enabling them to significantly improve over their euclidean counterparts.

[33] CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng

Main category: cs.CL

TL;DR: CaKE is a circuit-aware knowledge editing method that improves multi-hop reasoning by better integrating updated knowledge into LLMs’ reasoning pathways, achieving 20% better accuracy than existing methods.

Details

Motivation: Current knowledge editing methods fail to generalize updates to multi-hop reasoning tasks because they inadequately integrate knowledge into reasoning circuits.

Method: CaKE uses circuit-based analysis to guide the selection of a few curated data samples that stimulate the model to develop appropriate reasoning circuits for new knowledge.

Result: CaKE achieves 20% average improvement in multi-hop reasoning accuracy on MQuAKE dataset while requiring less memory than existing methods.

Conclusion: Circuit-aware knowledge editing enables more accurate and consistent use of edited knowledge across reasoning tasks by better integrating updates into reasoning pathways.

Abstract: Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits – the neural pathways LLMs use for knowledge-based inference, we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data in https://github.com/zjunlp/CaKE.

[34] One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks

Main category: cs.CL

TL;DR: VD-RAG systems are vulnerable to poisoning attacks where a single adversarial image can manipulate retrieval and generation to spread disinformation or cause denial-of-service.

Details

Motivation: Visual document RAG systems use document screenshots as knowledge bases but introduce new security vulnerabilities through image modality that adversaries can exploit.

Method: Defined two attack objectives: targeted disinformation attacks and universal denial-of-service attacks, implemented using multi-objective gradient-based optimization and generative model prompting under white-box and black-box settings.

Result: VD-RAG systems are vulnerable to both targeted and universal poisoning attacks in white-box settings, but show robustness to black-box attacks in universal settings.

Conclusion: Visual document RAG systems require enhanced security measures to protect against poisoning attacks that exploit the image modality vulnerability.

Abstract: Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrating robustness to black-box attacks in the universal setting.

[35] AutoJudge: Judge Decoding Without Manual Annotation

Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin

Main category: cs.CL

TL;DR: AutoJudge accelerates LLM inference using task-specific lossy speculative decoding by identifying which generated tokens affect downstream quality, allowing faster generation of “unimportant” tokens while maintaining response quality.

Details

Motivation: To speed up large language model inference by relaxing the strict token-by-token distribution matching requirement in speculative decoding, focusing only on tokens that impact final answer quality.

Method: Uses semi-greedy search to identify which mismatches between target and draft models need correction, then trains a lightweight classifier on LLM embeddings to predict which mismatching tokens can be safely accepted without quality loss.

Result: Achieved ≈2× speedup over speculative decoding on GSM8k with Llama 3.1 70B (≤1% accuracy drop), and accepted ≥25 tokens per speculation cycle on LiveCodeBench (2% drop in Pass@1).

Conclusion: AutoJudge provides significant inference speedups with minimal quality degradation, requires no human annotation, and is easily integrable with modern LLM frameworks.

Abstract: We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the “unimportant” tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.

[36] Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

Maureen de Seyssel, Jie Chi, Skyler Seto, Maartje ter Hoeve, Masha Fedzechkina, Natalie Schluter

Main category: cs.CL

TL;DR: The paper introduces training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity and semantic content, finding that language discrimination declines over training while meaning discrimination strengthens.

Details

Motivation: To provide a flexible and interpretable alternative to probing for analyzing multilingual language model representations, specifically examining how they handle language identity (form) versus semantic content (meaning).

Method: Uses zero-shot ABX-style discrimination tasks inspired from speech processing to measure minimal differences in representation, applied to XLM-R across pretraining checkpoints and layers, with additional probing tasks for comparison.

Result: Language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. Some alignment was found between ABX metrics and linguistic learning performance.

Conclusion: ABX tasks serve as a lightweight framework for analyzing the structure of multilingual representations, offering an interpretable alternative to traditional probing methods.

Abstract: We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired from speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al, 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.

[37] An Iterative Question-Guided Framework for Knowledge Base Question Answering

Shuai Wang, Yinan Yu

Main category: cs.CL

TL;DR: iQUEST is a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions and integrates GNN to incorporate 2-hop neighbor information, improving multi-hop reasoning in knowledge base question answering.

Details

Motivation: LLMs often exhibit factual inconsistencies in knowledge-intensive tasks, and multi-hop KBQA faces challenges in maintaining coherent reasoning paths and avoiding premature discarding of critical connections.

Method: Iterative query decomposition into sub-questions combined with GNN integration to incorporate 2-hop neighbor information at each reasoning step.

Result: Consistent improvement across four benchmark datasets and four LLMs, demonstrating effective multi-hop reasoning.

Conclusion: The dual approach of iterative decomposition and GNN look-ahead strengthens reasoning processes and enables more effective exploration of viable paths in complex KBQA tasks.

Abstract: Large Language Models (LLMs) excel in many natural language processing tasks but often exhibit factual inconsistencies in knowledge-intensive settings. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To tackle these challenges, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

[38] AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

Main category: cs.CL

TL;DR: AgentSwift is a framework for automated LLM agent design that uses hierarchical search space modeling, value model training, and hierarchical MCTS to discover effective agent architectures with 8.34% average performance gain.

Details

Motivation: Current automated agent design methods have limited search spaces, fail to integrate human-designed components, and suffer from high evaluation costs and inefficient search strategies.

Method: Formalizes hierarchical search space modeling agent workflow and functional components, trains value model using combinatorial coverage and balanced Bayesian sampling, and uses hierarchical MCTS with uncertainty guidance.

Result: Achieves 8.34% average performance gain over existing automated methods and manually designed agents across seven benchmarks in embodied, math, web, tool, and game domains.

Conclusion: AgentSwift serves as a launchpad for rapidly discovering powerful agent architectures, addressing key limitations in automated agent design.

Abstract: Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical MCTS strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. Our framework serves as a launchpad for researchers to rapidly discover powerful agent architectures.

[39] Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models

Sumanth Manduru, Carlotta Domeniconi

Main category: cs.CL

TL;DR: The Vacuous Neutrality Framework (VaNeu) is introduced to assess fairness in Small Language Models (0.5-5B parameters) across four evaluation stages, revealing hidden vulnerabilities despite apparent low bias.

Details

Motivation: The rapid adoption of Small Language Models for resource-constrained applications has outpaced understanding of their ethical and fairness implications, creating a critical gap in responsible deployment.

Method: Multi-dimensional evaluation framework examining model robustness across four stages: biases, utility, ambiguity handling, and positional bias over diverse social bias categories. Evaluated nine widely used SLMs from four model families under ambiguous and disambiguated contexts.

Result: Models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. This is the first large-scale audit of SLMs in the 0.5-5B parameter range.

Conclusion: There is a need for more comprehensive understanding of fairness and reliability in SLMs, and the proposed framework serves as a principled tool for responsible deployment in socially sensitive settings.

Abstract: The rapid adoption of Small Language Models (SLMs) for resource constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages - biases, utility, ambiguity handling, and positional bias over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5-5B parameter range, an overlooked “middle tier” between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings.

[40] Eliciting Reasoning in Language Models with Cognitive Tools

Brown Ebouky, Andrea Bartezzaghi, Mattia Rigotti

Main category: cs.CL

TL;DR: The paper proposes using cognitive tools in an agentic framework to enhance LLM reasoning, achieving significant performance gains on mathematical benchmarks and contributing to understanding reasoning mechanisms.

Details

Motivation: To explore alternative methods for eliciting reasoning in LLMs beyond existing approaches like chains-of-thought and RL, drawing from cognitive psychology theories about modular cognitive operations.

Method: Implement cognitive psychology principles by endowing LLMs with a small set of “cognitive tools” representing specific reasoning operations, executed within an agentic tool-calling framework.

Result: Substantial performance improvements on mathematical reasoning benchmarks (e.g., GPT-4.1’s AIME2024 performance increased from 32% to 53%, surpassing o1-preview), with similar gains for both closed and open-weight models.

Conclusion: The approach demonstrates that structured cognitive tools can effectively elicit reasoning capabilities, contributing to understanding whether post-training methods uncover latent abilities versus creating new capabilities.

Abstract: The recent advent of reasoning models like OpenAI’s o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of “cognitive tools” encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our “cognitive tools” to GPT-4.1 increases its pass@1 performance on AIME2024 from 32% to 53%, even surpassing the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.

Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic

Main category: cs.CL

TL;DR: Arg-LLaDA is a novel large language diffusion framework that iteratively improves argument summaries through sufficiency-guided remasking and regeneration, outperforming state-of-the-art methods in both automatic and human evaluations.

Details

Motivation: Argument summarization generation stage remains underexplored, with existing approaches relying on single-pass generation that offers limited support for factual correction or structural refinement.

Method: Combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans through iterative remasking and regeneration.

Result: Surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics and shows substantial improvements in human evaluations across coverage, faithfulness, and conciseness.

Conclusion: The iterative, sufficiency-aware generation strategy effectively produces more faithful, concise, and coherent argument summaries compared to existing approaches.

Abstract: Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.

[42] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom, H. Andrew Schwartz

Main category: cs.CL

TL;DR: MAQuA is an adaptive mental health screening framework that uses multi-outcome modeling with IRT and factor analysis to select optimal questions, reducing assessment burden by 50-87% while maintaining accuracy across multiple symptom domains.

Details

Motivation: LLMs offer scalable mental health assessment but excessive querying burdens users and is inefficient for transdiagnostic screening across multiple symptom dimensions.

Method: Combines multi-outcome modeling on language responses with item response theory (IRT) and factor analysis to select the most informative questions across multiple dimensions at each turn.

Result: Reduces assessment questions by 50-87% compared to random ordering (71% fewer for depression, 85% fewer for eating disorders), with robust performance across internalizing and externalizing domains.

Conclusion: MAQuA is a powerful and efficient tool for scalable, nuanced mental health screening that advances LLM integration into clinical workflows by significantly reducing patient burden while maintaining accuracy.

Abstract: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.

[43] CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov

Main category: cs.CL

TL;DR: CRISP is a parameter-efficient method for persistent concept unlearning in LLMs using sparse autoencoders to identify and suppress harmful knowledge features while preserving model utility.

Details

Motivation: Current SAE-based unlearning methods operate at inference time without persistent parameter changes, making them vulnerable to reversal by malicious actors with parameter access.

Method: CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations to achieve persistent concept unlearning.

Result: CRISP outperforms prior approaches on safety-critical unlearning tasks from WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities.

Conclusion: CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of target features for persistent safety improvements.

Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model’s parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.

[44] From Confidence to Collapse in LLM Factual Robustness

Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: The paper introduces Factual Robustness Score (FRS), a novel metric that measures factual knowledge robustness in LLMs by analyzing token distribution entropy and temperature scaling sensitivity, showing significant robustness variations across model sizes.

Details

Motivation: Existing evaluation methods focus on performance-based metrics and prompt perturbations, but fail to capture knowledge robustness from the generation process perspective, especially regarding stability against decoding condition perturbations.

Method: Propose FRS metric combining token distribution entropy and temperature scaling sensitivity to quantify fact stability against decoding perturbations. Validate on 5 LLMs across 3 QA datasets (SQuAD, TriviaQA, HotpotQA).

Result: Factual robustness varies significantly: smaller models have FRS of 0.76, larger ones 0.93. Accuracy degrades by ~60% under increased uncertainty. Entropy and temperature scaling significantly impact factual accuracy.

Conclusion: The approach provides insights into how entropy and temperature scaling affect factual accuracy, establishing foundation for developing more robust knowledge retention and retrieval in future models.

Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly – smaller models report an FRS of $0.76$, larger ones $0.93$ – with accuracy degrading by ~$60%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.

[45] CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples

Kyohoon Jin, Juhwan Choi, Jungmin Yun, Junho Lee, Soojin Jang, Youngbin Kim

Main category: cs.CL

TL;DR: CoBA introduces counterbias data augmentation to address spurious correlations in deep learning by decomposing text into semantic triples and selectively modifying them to disrupt biases, improving performance and out-of-distribution robustness.

Details

Motivation: Deep learning models often exploit spurious correlations in training data, leading to poor generalization and performance degradation on unseen data.

Method: Decompose text into subject-predicate-object triples, selectively modify these triples to disrupt spurious correlations, and reconstruct text from adjusted triples to generate counterbias data.

Result: CoBA improves downstream task performance, effectively reduces biases, and strengthens out-of-distribution resilience.

Conclusion: CoBA offers a versatile and robust solution to mitigate spurious correlations in deep learning models.

Abstract: Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.

[46] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen

Main category: cs.CL

TL;DR: Probing-based safety detection methods in LLMs fail because they learn superficial patterns (instructional formats and trigger words) rather than semantic harmfulness, leading to poor out-of-distribution performance.

Details

Motivation: To systematically re-examine probing-based approaches for detecting harmful instructions in LLMs, motivated by concerns about their poor out-of-distribution performance and the hypothesis that probes learn superficial patterns instead of true semantic harmfulness.

Method: Conducted controlled experiments including: comparing simple n-gram methods with probing approaches, using semantically cleaned datasets, and detailed analysis of pattern dependencies to identify what patterns probes actually learn.

Result: Probes primarily learn superficial patterns like instructional formats and trigger words rather than semantic harmfulness, leading to comparable performance with simple n-gram methods and poor generalization to out-of-distribution data.

Conclusion: Current probing-based approaches create a false sense of security and highlight the need to redesign both LLM safety detection models and evaluation protocols for more reliable safety assessment.

Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs’ internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.

[47] Verbalized Algorithms

Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian, Masataro Asai

Main category: cs.CL

TL;DR: Verbalized Algorithms (VAs) use classical algorithms with LLMs as simple operation oracles, improving reliability on reasoning tasks like sorting and clustering.

Details

Motivation: One-shot LLM queries for reasoning tasks are unreliable; VAs provide theoretical guarantees by limiting LLMs to simple, verifiable operations within established algorithms.

Method: Decompose tasks into elementary natural language operations, use LLMs as binary comparison oracles in classical algorithms (e.g., bitonic sorting network for sorting).

Result: VAs demonstrate effectiveness on sorting and clustering tasks by improving reliability and leveraging established algorithmic analysis.

Conclusion: Verbalized Algorithms offer a principled approach to reliable LLM-based reasoning by combining classical algorithms with constrained LLM usage.

Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.

[48] Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes

Guangliang Liu, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Marie Johnson

Main category: cs.CL

TL;DR: Current fairness objectives for gender stereotype mitigation in language models fail to achieve effective performance trade-offs, as downstream performance strongly correlates with overall forgetting rather than selective forgetting.

Details

Motivation: To investigate whether performance trade-offs in moral alignment can be achieved through forgetting and fairness objectives, particularly for gender stereotype mitigation in pretrained language models.

Method: Analyzed the relationship between forgetting, fairness objectives, and downstream task performance using large datasets for gender stereotype mitigation.

Result: Found that: (1) downstream performance strongly correlates with overall forgetting; (2) selective forgetting reduces stereotypes but increases overall forgetting; (3) general solutions for alleviating forgetting are ineffective.

Conclusion: Current fairness objectives have limitations in achieving effective performance trade-offs for moral alignment, highlighting the need for better approaches that can selectively forget stereotypes without degrading overall language modeling capabilities.

Abstract: Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning on curated datasets. Gender stereotype mitigation is a representational task within the broader application of moral alignment. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget only stereotypical knowledge through carefully designed fairness objective, while preserving their language modeling capability (overall forgetting). In this short paper, we investigate whether the performance trade-off can be achieved through the lens of forgetting and the fairness objective. Our analysis shows that the large datasets needed for satisfactory fairness highlight the limitations of current fairness objectives in achieving an effective trade-off: (1) downstream task performance is strongly correlated with overall forgetting; (2) selective forgetting reduces stereotypes, but overall forgetting increases. and (3) general solutions for alleviating forgetting are ineffective at reducing the overall forgetting and fail to improve downstream task performance.

[49] Steering Evaluation-Aware Language Models to Act Like They Are Deployed

Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda

Main category: cs.CL

TL;DR: Activation steering can suppress LLMs’ evaluation-awareness, making them behave as if deployed during safety evaluations to improve reliability.

Details

Motivation: LLMs can detect when they're being evaluated and adjust behavior to appear more aligned, compromising safety evaluation reliability.

Method: Two-step training: continued pretraining on documents with factual descriptions, then expert iteration to use Python type hints in evaluation settings. Used activation steering with vectors from original model to suppress evaluation-awareness.

Result: Activation steering successfully suppressed evaluation awareness, making the model act like deployed even when evaluation cues were present.

Conclusion: AI evaluators could improve safety evaluation reliability by steering models to act like they’re deployed rather than being evaluated.

Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM’s activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.

[50] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning

Sangmook Lee, Dohyung Kim, Hyukhun Koh, Nakyeong Yang, Kyomin Jung

Main category: cs.CL

TL;DR: STEER is a confidence-guided routing framework that dynamically switches between small and large LLMs during reasoning steps based on the small model’s confidence scores, reducing inference costs while maintaining accuracy.

Details

Motivation: To lower the high inference costs of large language models while maintaining reasoning capabilities, without requiring expensive external router models or data synthesis techniques.

Method: Uses model-internal confidence scores from the smaller LLM’s logits before generating each reasoning step to decide when to invoke the larger model, performing fine-grained step-level routing without external modules.

Result: Achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to using only the larger model), outperforming baselines with trained external routers.

Conclusion: Model-internal confidence scores provide a robust, domain-agnostic signal for efficient model routing, offering a scalable pathway for cost-effective LLM deployment.

Abstract: Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model’s logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.

[51] LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi

Main category: cs.CL

TL;DR: LoRA on the Go (LoGo) is a training-free framework that dynamically selects and merges LoRA adapters at instance level for multi-task inference without requiring labeled data or additional training.

Details

Motivation: Conventional LoRA adapters are trained for single tasks, limiting real-world applicability where inputs span diverse domains. Existing multi-LoRA approaches require labeled data or task-specific training, which is expensive at scale.

Method: LoGo extracts signals from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions dynamically at inference time.

Result: Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by up to 3.6% while remaining competitive on other tasks and maintaining inference throughput.

Conclusion: LoGo provides an effective and practical training-free solution for dynamic multi-task inference using LoRA adapters, demonstrating strong performance across diverse NLP tasks.

Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks upto a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.

[52] HalluClean: A Unified Framework to Combat Hallucinations in LLMs

Yaxin Zhao, Yu Zhang

Main category: cs.CL

TL;DR: HalluClean is a lightweight, task-agnostic framework that detects and corrects hallucinations in LLM-generated text using a reasoning-enhanced paradigm without external knowledge or supervised detectors.

Details

Motivation: LLMs often produce hallucinated content that undermines factual reliability, creating a need for methods to improve factual consistency in LLM outputs.

Method: Uses a reasoning-enhanced paradigm with planning, execution, and revision stages to identify and refine unsupported claims, employing minimal task-routing prompts for zero-shot generalization across domains.

Result: Significantly improves factual consistency and outperforms competitive baselines across five tasks: question answering, dialogue, summarization, math word problems, and contradiction detection.

Conclusion: HalluClean demonstrates potential to enhance the trustworthiness of LLM outputs in real-world applications through its lightweight, task-agnostic approach to hallucination detection and correction.

Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks-question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.

[53] MajinBook: An open catalogue of digital world literature with likes

Antoine Mazières, Thierry Poibeau

Main category: cs.CL

TL;DR: MajinBook is an open catalogue linking shadow library metadata with Goodreads data, creating a corpus of 539,000+ English books with publication dates, genres, and popularity metrics, while addressing biases in traditional corpora.

Details

Motivation: To facilitate the use of shadow libraries for computational social science and cultural analytics by creating a high-precision, machine-readable corpus that addresses biases in traditional book corpora.

Method: Linking metadata from shadow libraries (Library Genesis, Z-Library) with structured bibliographic data from Goodreads, prioritizing natively digital EPUB files for machine-readability, and including secondary datasets for French, German, and Spanish languages.

Result: Created a corpus of over 539,000 English-language book references spanning three centuries, enriched with first publication dates, genres, ratings, and reviews, with evaluated linkage accuracy and openly released data.

Conclusion: The project successfully bridges shadow libraries with structured bibliographic data, provides legal permissibility analysis under EU and US frameworks, and offers a valuable resource for computational social science research.

Abstract: This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries–such as Library Genesis and Z-Library–for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project’s legal permissibility under EU and US frameworks for text and data mining in research.

[54] Auditing Google’s AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy

Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg, Aniko Hannak, Christo Wilson

Main category: cs.CL

TL;DR: Google’s AI Overviews and Featured Snippets show concerning inconsistencies and lack of medical safeguards in health information, with 33% inconsistency between features and only 11% of AIO/7% of FS containing medical disclaimers.

Details

Motivation: To evaluate the quality and consistency of AI-generated health information in Google Search features (AI Overviews and Featured Snippets) that users rely on but cannot control.

Method: Systematic algorithm audit of 1,508 baby care and pregnancy queries using a robust evaluation framework assessing answer consistency, relevance, medical safeguards, source categories, and sentiment alignment.

Result: 33% inconsistency between AIO and FS on same page; high relevance but critically lacking medical safeguards (11% AIO, 7% FS); health/wellness websites dominate sources, with FS often linking to commercial sources.

Conclusion: Concerning gaps in AI-mediated health information quality demonstrate need for stronger controls; methodology provides transferable framework for auditing AI systems in high-stakes domains impacting user well-being.

Abstract: Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.

[55] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, Guangtao Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, Siqi Sun, Jiaye Ge, Meng Li, Yuhong Liu, Jianxin Dong, Jiaying Li, Hui Wu, Hanwen Liang, Jintai Lin, Yanting Wang, Jie Dong, Tong Zhu, Tianfan Fu, Conghui He, Qi Zhang, Songyang Zhang, Lei Bai, Kai Chen

Main category: cs.CL

TL;DR: ATLAS is a new high-difficulty, cross-disciplinary benchmark for evaluating LLMs’ scientific reasoning capabilities, addressing limitations of existing benchmarks through original problems, contamination resistance, and complex open-ended answers.

Details

Motivation: Existing benchmarks suffer from performance saturation, narrow focus, simplified formats, and data contamination, creating a gap with real scientific inquiry.

Method: Created ~800 original problems across 7 scientific fields by domain experts, featuring contamination resistance, cross-disciplinary integration, complex open-ended answers with LaTeX, and rigorous quality control through peer review.

Result: Preliminary results show ATLAS effectively differentiates advanced scientific reasoning capabilities of leading models.

Conclusion: ATLAS will serve as a long-term community platform to reliably measure progress toward AGI through robust evaluation of scientific reasoning.

Abstract: The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models’ ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS’s effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable “ruler” for progress toward Artificial General Intelligence.

[56] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Xinli Tao, Xin Dong, Xuezhong Zhou

Main category: cs.CL

TL;DR: OEMA is a zero-shot clinical NER framework using multi-agent collaboration that achieves near-supervised performance without labeled data.

Details

Motivation: Traditional supervised NER models require expensive annotation, while existing zero-shot approaches struggle with example selection granularity and prompt integration.

Method: Three-agent system: self-annotator generates examples, discriminator filters using SNOMED CT, and predictor enhances inference with entity descriptions.

Result: State-of-the-art on MTSamples and VAERS datasets; comparable to supervised BioClinicalBERT under related-match criteria.

Conclusion: OEMA advances zero-shot clinical NER, achieving near-supervised performance, with future work on continual learning and domain adaptation.

Abstract: With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.

[57] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi

Main category: cs.CL

TL;DR: Poetic prompts serve as effective universal jailbreaks for LLMs, achieving high attack success rates across multiple risk domains and outperforming non-poetic baselines significantly.

Details

Motivation: To investigate whether stylistic variations like poetry can systematically bypass LLM safety mechanisms, revealing limitations in current alignment methods.

Method: Used curated poetic prompts and converted 1,200 MLCommons harmful prompts into verse via standardized meta-prompt, testing across 25 proprietary and open-weight models with ensemble LLM judges.

Result: Poetic attacks achieved 62% success for hand-crafted poems and 43% for meta-prompt conversions, substantially outperforming non-poetic baselines with some providers exceeding 90% attack success rates.

Conclusion: Stylistic variation alone can circumvent contemporary safety mechanisms, indicating fundamental limitations in current alignment methods and evaluation protocols.

Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

[58] Multimodal Evaluation of Russian-language Architectures

Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova

Main category: cs.CL

TL;DR: Mera Multi is a new multimodal evaluation framework for Russian language MLLMs, featuring 18 tasks across text, image, audio, and video modalities with cultural specificity.

Details

Motivation: Address the lack of multimodal benchmarks for Russian language and better understand the intelligence, limitations, and risks of multimodal large language models.

Method: Created 18 datasets from scratch with Russian cultural specificity, unified prompts and metrics, plus methodology for preventing benchmark leakage through watermarking and licenses.

Result: Established baseline results for both closed-source and open-source models, providing the first multimodal evaluation framework for Russian-spoken architectures.

Conclusion: The benchmark offers a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

Abstract: Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

cs.CV

[59] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, zhichao Lian

Main category: cs.CV

TL;DR: UniFit is a universal virtual try-on framework that uses Multimodal Large Language Models to bridge semantic gaps between text instructions and reference images, enabling flexible handling of diverse try-on tasks with limited data.

Details

Motivation: Existing VTON methods struggle with semantic gaps between text instructions and reference images, and face data scarcity in complex scenarios, limiting their flexibility and performance.

Method: Proposes MLLM-Guided Semantic Alignment Module (MGSA) that integrates multimodal inputs using MLLM and learnable queries with semantic alignment loss, plus a two-stage progressive training strategy with self-synthesis pipeline.

Result: UniFit supports wide range of VTON tasks including multi-garment and model-to-model try-on, achieving state-of-the-art performance in extensive experiments.

Conclusion: The proposed framework effectively addresses semantic gap and data scarcity challenges, enabling universal VTON capabilities with superior performance across diverse tasks.

Abstract: Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.

[60] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

Chengxi Zeng, Yuxuan Jiang, Aaron Zhang

Main category: cs.CV

TL;DR: EfficientSAM3 is a family of efficient models that enable on-device concept segmentation and tracking by distilling capabilities from SAM3 through Progressive Hierarchical Distillation.

Details

Motivation: SAM3's unified architecture is too computationally expensive for on-device use, creating a need for efficient alternatives that maintain high performance.

Method: Progressive Hierarchical Distillation (PHD) in three stages: Encoder Distillation, Temporal Memory Distillation with Perceiver-based module, and End-to-End Fine-Tuning using lightweight backbones like RepViT, TinyViT, and EfficientViT.

Result: Achieves strong performance-efficiency trade-offs on VOS datasets while maintaining high fidelity to the teacher model’s behavior.

Conclusion: EfficientSAM3 enables on-device concept segmentation and tracking through efficient distillation while preserving concept-level performance.

Abstract: The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets, and compare with varies of releated work, achieing strong performance-efficiency trade-offs.

[61] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Sajjad Pakdamansavoji, Yintao Ma, Amir Rasouli, Tongtong Cao

Main category: cs.CV

TL;DR: Proposes four extensions to model-based 6D pose estimation: dynamic non-uniform sampling, multi-hypothesis inference, iterative refinement, and occlusion-focused training, achieving significant accuracy improvements and faster inference.

Details

Motivation: Existing 6D pose estimation methods struggle with unseen objects and occlusion, where early-stage detection errors propagate through sequential pipelines, degrading performance.

Method: Four novel extensions: (1) dynamic non-uniform dense sampling focusing on visible regions, (2) multi-hypothesis inference with confidence-ranked candidates, (3) iterative refinement for progressive accuracy, (4) occlusion-focused training augmentations, plus a new weighted by visibility evaluation metric.

Result: Achieves >5% accuracy improvement on ICBIN and >2% on BOP benchmarks, with approximately 3x faster inference speed.

Conclusion: The proposed approach effectively addresses occlusion challenges in 6D pose estimation, improving both accuracy and efficiency while providing better evaluation metrics for occluded scenarios.

Abstract: Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning but generalizing to unseen objects remains a challenge. To address this problem, past arts assume access to CAD models at test time and typically follow a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early-stage of such pipelines are prone to errors, which can propagate through the sequential processing, and consequently degrade the performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new weighted by visibility metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.

[62] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

Lukas Arzoumanidis, Julius Knechtel, Jan-Henrik Haunert, Youness Dehbi

Main category: cs.CV

TL;DR: Automated analysis of historical maps faces data scarcity. The paper proposes generating synthetic historical maps by transferring cartographic style to vector data, using both automatic deep generative and manual stochastic degradation methods to create realistic training data for domain-adaptive semantic segmentation.

Details

Motivation: Historical map analysis lacks annotated training data, especially for homogeneous corpora. Manual annotation is time-consuming, and existing synthetic data lacks realism and diversity needed for effective machine learning.

Method: Two approaches: (1) automatic deep generative method and (2) manual stochastic degradation technique to emulate visual uncertainty and noise in historical map scans. Transfer cartographic style from original maps to vector data to create synthetic training maps.

Result: Generated effectively unlimited synthetic historical maps suitable for land-cover interpretation tasks. Used the synthetic datasets for domain-adaptive semantic segmentation using Self-Constructing Graph Convolutional Network.

Conclusion: The proposed data bootstrapping methods successfully address the training data scarcity problem in historical map analysis by creating realistic and diverse synthetic maps that enable effective domain-adaptive semantic segmentation.

Abstract: The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and a alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.

[63] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

Main category: cs.CV

TL;DR: SAM2S enhances SAM2 for surgical video segmentation with long-term tracking and zero-shot generalization, achieving 80.42 average J&F score and 68 FPS real-time inference.

Details

Motivation: Address limitations of existing iVOS models in surgical scenarios, including domain gap and limited long-term tracking capabilities.

Method: Proposes SAM2S with three key components: DiveMem for long-term tracking, temporal semantic learning for instrument understanding, and ambiguity-resilient learning for annotation inconsistencies.

Result: Fine-tuning on SA-SV benchmark improves SAM2 by 12.99 average J&F. SAM2S achieves 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, with 68 FPS real-time inference.

Conclusion: SAM2S demonstrates substantial performance gains for surgical iVOS while maintaining real-time capabilities and strong zero-shot generalization across surgical procedures.

Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.

[64] Box6D : Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

Yintao Ma, Sajjad Pakdamansavoji, Amir Rasouli, Tongtong Cao

Main category: cs.CV

TL;DR: Box6D is a category-level 6D pose estimation method for warehouse boxes that uses binary search for dimension inference and category CAD templates, achieving competitive accuracy with 76% faster inference.

Details

Motivation: Existing methods have limitations: model-based approaches require exact CAD models and transfer poorly, model-free methods fail under challenging conditions, and category-level methods are often too general and ignore environment/object priors for industrial use.

Method: Box6D uses a fast binary search to infer box dimensions from single RGB-D observation, employs category CAD templates instead of instance-specific models, and applies depth-based plausibility filtering with early-stopping to reject implausible hypotheses.

Result: Evaluation on real-world storage scenarios and public benchmarks shows competitive or superior 6D pose precision while reducing inference time by approximately 76%.

Conclusion: Box6D provides an effective balance between flexibility and accuracy for warehouse box pose estimation, addressing practical limitations of existing methods in industrial settings.

Abstract: Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain; Model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; Model-free methods that rely on a few reference images or videos are more flexible, however often fail under challenging conditions; Category-level approaches aim to balance flexibility and accuracy but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6d, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Suing a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.

[65] Adaptive Guided Upsampling for Low-light Image Enhancement

Angela Vivian Dcosta, Chunbo Song, Rafael Radkowski

Main category: cs.CV

TL;DR: AGU is an efficient guided upsampling method for low-light images that simultaneously optimizes noise reduction and sharpness enhancement through multi-parameter learning from few image pairs.

Details

Motivation: Existing guided image methods fail with low-light images due to high noise and low brightness, resulting in suboptimal upscaling quality.

Method: Uses guided image approach with multi-parameter optimization to learn associations between low-light and bright image characteristics from few sample image pairs.

Result: AGU produces high-quality images in real-time from low-quality, low-resolution input and outperforms state-of-the-art methods in low-light scenarios.

Conclusion: The proposed method effectively addresses the limitations of guided image approaches for low-light conditions through learned multi-characteristic optimization.

Abstract: We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. Using state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, rendering suboptimal/not significantly improved images in the process. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample images-pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.

[66] Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jingjing Jiang, Chongjie Si, Jun Luo, Hanwang Zhang, Chao Ma

Main category: cs.CV

TL;DR: CoRL framework enables unified multimodal LLMs to co-evolve generation and understanding capabilities through reinforcement learning, achieving 7% improvement in text-to-image generation and 23% in multimodal understanding.

Details

Motivation: To explore how reinforcement learning can simultaneously reinforce both generation and understanding capabilities in unified multimodal large language models, enabling synergistic co-evolution of dual capabilities.

Method: Proposed CoRL framework with two stages: unified RL for joint optimization and refined RL for task-specific enhancement, using group relative policy optimization.

Result: ULM-R1 model achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks.

Conclusion: CoRL effectively facilitates cross-task synergy and optimization for unified multimodal LLMs, demonstrating substantial benefits of reinforcement learning in multimodal AI systems.

Abstract: This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.

[67] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Meilong Xu, Di Fu, Jiaxing Zhang, Gong Yu, Jiayu Zheng, Xiaoling Hu, Dongdi Zhao, Feiyang Li, Chao Chen, Yong Cao

Main category: cs.CV

TL;DR: A two-stage self-improvement method for Vision Language Models (VLMs) that generates rationales and fine-tunes on them to bridge the rationale gap in domain-specific video classification with limited data.

Details

Motivation: VLMs struggle with domain-specific video classification due to a rationale gap where sparse domain data cannot bridge the semantic distance between complex video content and abstract labels.

Method: Two-stage approach: 1) Generate detailed textual rationales for videos using VLMs, 2) Fine-tune VLMs on self-generated rationales followed by supervised fine-tuning on task labels.

Result: Significantly outperforms direct supervised fine-tuning across diverse datasets, demonstrating improved domain adaptation for video analysis.

Conclusion: Self-generated rationales provide an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis without requiring new annotations.

Abstract: Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical \textit{rationale gap}, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model’s pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.

[68] Boosting Medical Visual Understanding From Multi-Granular Language Learning

Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan

Main category: cs.CV

TL;DR: MGLL is a contrastive learning framework that improves multi-label and cross-granularity alignment in medical imaging by leveraging structured supervision and smooth KL divergence.

Details

Motivation: Standard CLIP models are limited by single-label, single-granularity alignment, which is insufficient for complex medical imaging where images often have multiple high-level labels across different annotation granularities.

Method: Proposes Multi-Granular Language Learning (MGLL) framework using structured multi-label supervision, cross-granularity text integration, soft-label supervision with point-wise constraints, and smooth KL divergence for consistency.

Result: MGLL outperforms state-of-the-art methods in downstream tasks when pretrained on large-scale multi-granular datasets and evaluated across multiple datasets.

Conclusion: MGLL effectively addresses the limitations of existing vision-language models in medical imaging by enabling better multi-label and cross-granularity alignment while maintaining computational efficiency as a plug-and-play module.

Abstract: Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at \href{https://github.com/HUANGLIZI/MGLL}{https://github.com/HUANGLIZI/MGLL}.

[69] Automated Interpretable 2D Video Extraction from 3D Echocardiography

Milos Vukadinovic, Hirotaka Ieki, Yuki Sahasi, David Ouyang, Bryan He

Main category: cs.CV

TL;DR: Automated method to extract standard 2D echocardiography views from 3D cardiac ultrasound volumes using deep learning and anatomical heuristics, validated with 96% accuracy by cardiologists.

Details

Motivation: Conventional cardiac ultrasound uses 2D videos but 3D echocardiography offers better image quality and efficiency. However, physicians prefer interpreting data in their familiar 2D format.

Method: Combines deep learning view classifier with anatomical landmark-based heuristics and cardiologist-provided heuristics to reconstruct standard 2D echocardiography views from 3D volumes.

Result: Achieved 96% accuracy in blinded evaluation by three cardiologists on 1,600 videos from 2 hospitals. Extracted 2D videos preserved spatial calibration and diagnostic features, allowing accurate detection of cardiac abnormalities and clinical-grade measurements.

Conclusion: The method successfully bridges the gap between 3D scanning efficiency and 2D clinical interpretation, enabling physicians to benefit from 3D technology while maintaining their standard workflow.

Abstract: Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos https://github.com/echonet/3d-echo .

[70] Segmenting Collision Sound Sources in Egocentric Videos

Kranti Kumar Parida, Omar Emara, Hazel Doughty, Dima Damen

Main category: cs.CV

TL;DR: Proposes Collision Sound Source Segmentation (CS3) task to segment objects responsible for collision sounds in video frames using audio conditioning, with a weakly-supervised method leveraging foundation models and egocentric cues.

Details

Motivation: Humans excel at multisensory perception and can recognize object properties from collision sounds, inspiring the need to develop computational methods for audio-visual understanding of collision events.

Method: Weakly-supervised audio-conditioned segmentation using foundation models (CLIP and SAM2) with egocentric cues like objects in hands to identify potential collision sound sources.

Result: Outperforms competitive baselines by 3× and 4.7× in mIoU on EPIC-CS3 and Ego4D-CS3 benchmarks respectively.

Conclusion: The proposed CS3 task and method effectively address the challenge of segmenting collision sound sources in cluttered egocentric videos using audio-visual fusion.

Abstract: Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3\times$ and $4.7\times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.

[71] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath

Main category: cs.CV

TL;DR: Click2Graph is the first interactive framework for Panoptic Video Scene Graph Generation that combines visual prompting with spatial, temporal, and semantic understanding to enable human-guided video scene understanding.

Details

Motivation: Current VSGG systems are closed, feed-forward pipelines without human guidance capability, while promptable segmentation models lack semantic reasoning. This creates a gap between precise user interaction and comprehensive scene understanding.

Method: The framework uses a single user cue (click or bounding box) to segment and track subjects across time, autonomously discovers interacting objects, and predicts scene graph triplets. Key components include a Dynamic Interaction Discovery Module for subject-conditioned object prompts and a Semantic Classification Head for joint entity-predicate reasoning.

Result: Experiments on OpenPVSG benchmark show Click2Graph establishes a strong foundation for user-guided PVSG, demonstrating how human prompting can be effectively combined with panoptic grounding and relational inference.

Conclusion: Click2Graph enables controllable and interpretable video scene understanding by unifying visual prompting with comprehensive scene graph generation, bridging the gap between interactive segmentation and semantic reasoning.

Abstract: State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.

[72] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

Muyao Yuan, Yuanhong Zhang, Weizhan Zhang, Lan Ma, Yuan Gao, Jiangyong Ying, Yudeng Xin

Main category: cs.CV

TL;DR: InfoCLIP improves CLIP fine-tuning for open-vocabulary semantic segmentation by using information theory to transfer alignment knowledge while preventing overfitting and preserving vision-language alignment.

Details

Motivation: Existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment, which needs to be stabilized during fine-tuning.

Method: Proposes InfoCLIP with two information-theoretic objectives: compressing pixel-text modality alignment from pretrained CLIP to reduce noise, and maximizing mutual information between pretrained CLIP and fine-tuned model to transfer compact local semantic relations.

Result: Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation.

Conclusion: InfoCLIP demonstrates adaptability and superiority in asymmetric transfer for open-vocabulary semantic segmentation tasks.

Abstract: Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.

[73] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

Jingru Zhang, Saed Moradi, Ashirbani Saha

Main category: cs.CV

TL;DR: Proposed consistency regularization with BI-RADS-inspired features to reduce destructive interference in multi-task learning for breast ultrasound tumor segmentation, achieving significant generalization improvements across external datasets.

Details

Motivation: Multi-task learning often suffers from destructive task interference that limits generalization performance in breast ultrasound tumor segmentation when jointly training segmentation and classification tasks.

Method: Consistency regularization approach using differentiable BI-RADS-inspired morphological features to mitigate interference between segmentation and classification tasks in multi-task learning.

Result: Statistically significant improvements (p<0.001) in segmentation generalization: Dice coefficients of 0.81 vs 0.59 (UDIAT), 0.66 vs 0.56 (BUSI), 0.69 vs 0.49 (BUS-UCLM). Achieved state-of-the-art segmentation on UDIAT dataset.

Conclusion: The proposed consistency regularization effectively reduces destructive task interference in multi-task learning, enabling improved generalization performance for breast ultrasound tumor segmentation across diverse external datasets.

Abstract: Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization for segmentation task of the proposed multi-task approach vs. the baseline one: UDIAT, BUSI, BUS-UCLM (Dice coefficient=0.81 vs 0.59, 0.66 vs 0.56, 0.69 vs 0.49, resp.). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.

[74] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

Xinyu Nan, Lingtao Mao, Huangyu Dai, Zexin Zheng, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li

Main category: cs.CV

TL;DR: A detection-guided generative framework that uses BART-based generation to produce hierarchical category and attribute tokens for unified visual semantic understanding, outperforming similarity-based and multi-stage approaches.

Details

Motivation: Current approaches relying on global similarity struggle with fine-grained category distinctions and category-specific attribute diversity, particularly in large-scale e-commerce scenarios.

Method: Detection-guided generative framework that extracts refined ROI-level features and employs BART-based generator to produce semantic tokens in coarse-to-fine sequence covering category hierarchies and property-value pairs.

Result: Significantly outperforms existing similarity-based pipelines and multi-stage classification systems on both large-scale proprietary e-commerce datasets and open-source datasets.

Conclusion: Achieves stronger fine-grained recognition and more coherent unified inference for visual semantic understanding tasks.

Abstract: Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.

Dawei Li, Zijian Gu, Peng Wang, Chuhan Song, Zhen Tan, Mohan Zhang, Tianlong Chen, Yu Tian, Song Wang

Main category: cs.CV

TL;DR: Proposes FADS, a fairness-aware demonstration selection method for in-context learning that reduces demographic disparities in medical image reasoning without requiring model fine-tuning.

Details

Motivation: Existing debiasing methods for MLLMs rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. ICL offers a lightweight alternative but conventional demonstration selection fails to ensure fairness due to demographic imbalance.

Method: FADS (Fairness-Aware Demonstration Selection) builds demographically balanced and semantically relevant demonstrations via clustering-based sampling to ensure fair representation across demographic groups.

Result: Experiments on multiple medical imaging benchmarks show FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy.

Conclusion: FADS offers an efficient and scalable path toward fair medical image reasoning, highlighting the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.

Abstract: Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.

[76] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

Nimeshika Udayangani, Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie

Main category: cs.CV

TL;DR: Graph-based approach using GCNs and Gaussianization improves OOD detection in long-tailed vision datasets by enhancing tail-class performance and reducing false positives.

Details

Motivation: Existing OOD detection methods perform poorly on long-tailed datasets, showing high false positive rates and low accuracy for tail classes, which limits safe deployment of DNNs.

Method: Use pre-trained model features to initialize graph structure, apply Gaussianization to normalize activations, then refine with graph convolutional networks to create suitable feature space for long-tailed OOD detection.

Result: Outperforms state-of-the-art methods by large margin on CIFAR10-LT, CIFAR100-LT, and ImageNet-LT benchmarks in terms of FPR and tail-class ID classification accuracy.

Conclusion: Graph-based representation with GCNs and Gaussianization effectively addresses OOD detection challenges in long-tailed recognition, significantly improving performance on tail classes.

Abstract: Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.

[77] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu, Shikang Zheng, Ruiqi Huang, Yuanjie Zhao

Main category: cs.CV

TL;DR: Sequence-level optimization framework generates printable adversarial textures for clothing that maintain concealment across entire walking videos in both digital and physical settings.

Details

Motivation: Deep neural networks for human detection are vulnerable to adversarial attacks, but existing wearable approaches fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation.

Method: Map product images to UV space with compact palette and control-point parameterization, use physically based human-garment pipeline to simulate motion, camera viewpoints, cloth dynamics, and illumination, then optimize control points with expectation-over-transformation objective to minimize detection confidence across sequences.

Result: Strong and stable concealment, high robustness to viewpoint changes, superior cross-model transferability, and physical garments achieve reliable suppression under indoor and outdoor recordings.

Conclusion: The framework demonstrates real-world feasibility for generating adversarial textures that remain effective throughout entire walking sequences in both digital and physical environments.

Abstract: Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.

[78] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao

Main category: cs.CV

TL;DR: The paper proposes a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution, integrating sparse Mixture-of-Experts (MoE) with LoRA modules to address limitations in existing dense Real-ISR models.

Details

Motivation: Existing dense Real-ISR models using LoRA fine-tuning cannot adaptively capture heterogeneous characteristics of complex real-world degraded samples or enable efficient knowledge sharing under equivalent computational budgets.

Method: Proposes MoR architecture with fine-grained expert partitioning (treating each LoRA rank as independent expert), degradation estimation module using CLIP embeddings, zero-expert slots, and degradation-aware load-balancing loss for dynamic expert activation.

Result: Comprehensive experiments validate the framework’s effectiveness and state-of-the-art performance in single-step image super-resolution.

Conclusion: The MoR architecture successfully addresses limitations of dense Real-ISR models by enabling flexible knowledge recombination, adaptive expert activation based on degradation severity, and optimal computational resource allocation.

Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework’s effectiveness and state-of-the-art performance.

[79] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy

Main category: cs.CV

TL;DR: A deep learning-based material classification method using laser speckle patterns that achieves high accuracy even when laser color changes, enabling safe and efficient laser cutting.

Details

Motivation: Laser cutting generates hazardous dust and smoke, requiring real-time monitoring. Existing speckle sensing methods struggle when laser color changes, limiting practical applications.

Method: Train a convolutional neural network (CNN) on laser speckle patterns to classify materials, addressing the laser color change issue that affects previous methods.

Result: Achieved 98.30% training accuracy, 96.88% validation accuracy, and 0.9643 F1-score on 3000 new images across 30 different materials, demonstrating robustness to laser color changes.

Conclusion: The proposed CNN-based speckle pattern analysis provides a robust and accurate solution for material-aware laser cutting, overcoming limitations of previous methods.

Abstract: Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers’ health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material’s surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.

[80] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

Zijian Wu, Mingfeng Jiang, Zidian Lin, Ying Song, Hanjie Ma, Qun Wu, Dongping Zhang, Guiyang Pu

Main category: cs.CV

TL;DR: CuriGS is a curriculum-guided framework that improves 3D Gaussian Splatting for sparse-view reconstruction by generating pseudo-views (students) around ground-truth poses and gradually training with increasing perturbation levels.

Details

Motivation: Extending 3D Gaussian Splatting to sparse-view settings is challenging due to supervision scarcity and overfitting from limited viewpoint coverage.

Method: Generate student views (pseudo-views) around ground-truth teacher poses with different perturbation levels, use curriculum learning to gradually increase perturbation, apply depth-correlation and co-regularization, and retain best-performing students based on multi-signal quality metrics.

Result: Outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes.

Conclusion: Curriculum-guided training with student views effectively addresses sparse-view challenges in 3D Gaussian Splatting, achieving superior reconstruction quality.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation level, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/

[81] Crossmodal learning for Crop Canopy Trait Estimation

Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar

Main category: cs.CV

TL;DR: Cross-modal learning strategy that enhances high-resolution satellite imagery with UAV-level detail for crop canopy trait estimation, improving performance on agricultural monitoring tasks.

Details

Motivation: Satellite missions are limited by spatial resolution for micro-plot management, while UAVs provide high-resolution data but have operational constraints. Need to bridge gap between satellite and UAV sensing capabilities.

Method: Train model using co-registered satellite-UAV image pairs from 84 maize varieties across 5 locations to learn fine-grained spectral-spatial correspondences between modalities.

Result: Generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on yield and nitrogen prediction tasks.

Conclusion: Cross-modal correspondence learning effectively bridges the gap between satellite and UAV sensing for agricultural monitoring applications.

Abstract: Recent advances in plant phenotyping have driven widespread adoption of multi sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross modal learning strategy that enriches high-resolution satellite imagery with UAV level visual detail for crop canopy trait estimation. Using a dataset of approximately co registered satellite UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine grained spectral spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.

[82] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

Qing Wang, Chong-Wah Ngo, Ee-Peng Lim, Qianru Sun

Main category: cs.CV

TL;DR: LLM-powered framework for food recognition that addresses domain shift, long-tailed distribution, and fine-grained classification by aligning image and text embeddings in shared space.

Details

Motivation: Food recognition faces challenges: domain shift between training (web images) and real-world user photos, long-tailed data distribution, and subtle visual differences between similar dishes.

Method: Use LLMs to parse food images into titles/ingredients, project both text and images to shared embedding space to maximize pair similarities, then use aligned multimodal features for recognition.

Result: Outperforms existing approaches for long-tailed data, domain adaptation, and fine-grained classification on two food datasets.

Conclusion: Simple LLM-powered framework effectively addresses multiple food recognition challenges through multimodal alignment in shared embedding space.

Abstract: Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, the real-world food datasets tend to be long-tailed distributed and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.

[83] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

Boxun Xu, Yu Wang, Zihu Wang, Peng Li

Main category: cs.CV

TL;DR: AMS-KV is a scale-adaptive KV caching policy for visual autoregressive models that reduces KV cache usage by 84.83% and self-attention latency by 60.48% by prioritizing condensed and local scales while optimizing cache utilization through inter-scale similarity analysis.

Details

Motivation: KV caching for next-scale prediction in visual autoregressive models faces excessive memory growth with increasing scales, severely limiting scalability. Current KV caching designs for this paradigm remain largely unexplored.

Method: AMS-KV prioritizes storing KVs from condensed and local scales, preserves most relevant tokens, and optimizes KV cache utilization by identifying cache-demanding layers through inter-scale similarity analysis.

Result: Reduces KV cache usage by up to 84.83%, self-attention latency by 60.48%, and enables stable scaling to batch size 256 where baseline fails at batch size 128 with improved throughput.

Conclusion: AMS-KV effectively addresses KV memory bottlenecks in next-scale prediction VAR models through intelligent scale-adaptive caching policies, significantly improving scalability and computational efficiency.

Abstract: Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales-severely limiting scalability. Our systematic investigation reveals that: (1) Attending to tokens from local scales significantly contributes to generation quality (2) Allocating a small amount of memory for the coarsest scales, termed as condensed scales, stabilizes multi-scale image generation (3) Strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on the observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.

[84] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu, Jiaxin Deng, Songxin Lu, Weiliang Ma, Xueyang Zhang, Yifei Zhan, XianPeng Lang, Jun Ma

Main category: cs.CV

TL;DR: LiSTAR is a generative world model for synthesizing high-fidelity 4D LiDAR data using a Hybrid-Cylindrical-Spherical representation and Spatio-Temporal Attention with Ray-Centric Transformer, achieving significant improvements in generation quality and reconstruction accuracy.

Details

Motivation: Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving, but challenging due to sensor geometry, temporal sparsity, and dynamic scene complexity.

Method: Uses Hybrid-Cylindrical-Spherical representation to reduce quantization artifacts, Spatio-Temporal Attention with Ray-Centric Transformer for temporal coherence, and Masked Generative START for controllable synthesis with tokenized scene representations.

Result: Achieved state-of-the-art performance: 76% reduction in generation MMD, 32% improvement in reconstruction IoU, and 50% reduction in prediction L1 Med error.

Conclusion: LiSTAR provides a powerful foundation for creating realistic and controllable autonomous systems simulations with substantial quantitative gains across multiple metrics.

Abstract: Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor’s unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor’s native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR’s state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.

[85] VideoSeg-R1:Reasoning Video Object Segmentation via Reinforcement Learning

Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li

Main category: cs.CV

TL;DR: VideoSeg-R1 introduces reinforcement learning for video reasoning segmentation using a decoupled architecture with hierarchical frame sampling, explicit reasoning chains, and adaptive difficulty-aware reasoning length control.

Details

Motivation: Traditional supervised fine-tuning methods for video reasoning segmentation lack generalization to out-of-distribution scenarios and explicit reasoning capabilities.

Method: Three-stage framework: (1) hierarchical text-guided frame sampling, (2) reasoning model with explicit reasoning chains and spatial cues, (3) segmentation-propagation using SAM2 and XMem with adaptive difficulty-aware reasoning length control.

Result: Achieves state-of-the-art performance on multiple benchmarks for complex video reasoning and segmentation tasks.

Conclusion: VideoSeg-R1 successfully integrates reinforcement learning into video reasoning segmentation, demonstrating superior performance through its decoupled architecture and explicit reasoning approach.

Abstract: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose \textbf{VideoSeg-R1}, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.

[86] SpectralTrain: A Universal Framework for Hyperspectral Image Classification

Meihua Zhou, Liping Yu, Jiawei Cai, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Wenzhuo Liu, Nan Wan

Main category: cs.CV

TL;DR: SpectralTrain is a universal training framework for hyperspectral image classification that combines curriculum learning with PCA-based spectral downsampling to reduce computational costs while maintaining performance.

Details

Motivation: Hyperspectral image classification involves large-scale data and computationally intensive training, limiting practical deployment of deep learning models in real-world remote sensing applications.

Method: Integrates curriculum learning with PCA-based spectral downsampling to gradually introduce spectral complexity while preserving essential information, making it architecture-agnostic and compatible with various models.

Result: Achieves 2-7x training speedups with small-to-moderate accuracy deltas across three benchmark datasets, demonstrating strong generalization across spatial scales, spectral characteristics, and application domains.

Conclusion: Training strategy optimization serves as an effective complement to architectural design in HSI models, with potential applications in climate-related remote sensing like cloud classification.

Abstract: Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral – spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets – Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 – demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time by 2-7x speedups with small-to-moderate accuracy deltas depending on backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Yoichi Sato

Main category: cs.CV

TL;DR: MLLMs lack social deception detection abilities. New MIDA task and dataset reveal models struggle with truth/falsehood distinction. Proposed SoCoT and DSEM framework shows improvement.

Details

Motivation: State-of-the-art MLLMs lack human-like ability to 'read the room' and assess deception in social interactions, highlighting a critical gap in AI social reasoning capabilities.

Method: Introduced Multimodal Interactive Deception Assessment (MIDA) task with synchronized video-text dataset. Evaluated 12 MLLMs and proposed Social Chain-of-Thought (SoCoT) reasoning pipeline with Dynamic Social Epistemic Memory (DSEM) module.

Result: Significant performance gap found - even powerful models like GPT-4o struggle with truth/falsehood distinction. Models fail to ground language in social cues and model others’ knowledge/beliefs/intentions.

Conclusion: Proposed SoCoT and DSEM framework shows promising improvement, offering a path toward building MLLMs with genuine human-like social reasoning capabilities for more perceptive and trustworthy AI systems.

Abstract: Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room’ and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.

[88] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

Renxiang Xiao, Wei Liu, Yuanfan Zhang, Yushuai Chen, Jinming Chen, Zilu Wang, Liang Hu

Main category: cs.CV

TL;DR: Rad-GS is a 4D radar-camera SLAM system using 3D Gaussian representation for large-scale outdoor environments, combining radar point clouds with Doppler information to improve localization and reduce rendering artifacts.

Details

Motivation: To develop a robust outdoor mapping system using 4D mmWave radar that can handle kilometer-scale environments while addressing challenges like dynamic objects and memory consumption in large scenes.

Method: Combines radar point clouds with Doppler information for dynamic object masking in images, uses unsynchronized images for global refinement of 3D Gaussian representation, and employs global octree structure with Gaussian primitive management to reduce noise and memory usage.

Result: Achieves performance comparable to traditional 3D Gaussian methods using camera or LiDAR inputs, successfully reconstructs kilometer-scale real-world scenes with improved texture consistency and novel view synthesis.

Conclusion: Rad-GS demonstrates the feasibility of robust outdoor mapping using 4D mmWave radar, offering a viable alternative to camera/LiDAR-based methods for large-scale scene reconstruction.

Abstract: We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.

[89] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu

Main category: cs.CV

TL;DR: T2T-VICL enables cross-task visual in-context learning for vision-language models by generating text prompts that describe differences between distinct vision tasks and using perceptual reasoning.

Details

Motivation: To investigate whether vision-language models can perform visual in-context learning when visual prompts and target images come from different vision tasks, unlocking cross-task capabilities.

Method: Proposes T2T-VICL pipeline with text prompt generation/selection mechanism to describe task differences, constructs cross-task VICL dataset, and uses perceptual score-based reasoning with traditional metrics.

Result: Achieves top-tier results in 9 cross-task scenarios and second-tier performance in 10 additional scenarios, demonstrating successful cross-task VICL.

Conclusion: Successfully unlocks cross-task visual in-context learning boundaries in vision-language models through the proposed collaborative pipeline and inference framework.

Abstract: In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

[90] Clustered Error Correction with Grouped 4D Gaussian Splatting

Taeho Kang, Jaeyeon Park, Kyungjin Lee, Youngki Lee

Main category: cs.CV

TL;DR: A novel 4D Gaussian Splatting method that improves dynamic scene reconstruction through elliptical error clustering and grouped splatting, achieving better temporal consistency and state-of-the-art rendering quality.

Details

Motivation: Existing 4DGS methods struggle with ambiguous pixel correspondences and inadequate densification in dynamic regions, leading to inaccurate reconstruction of dynamic scenes.

Method: Two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition to pinpoint dynamic areas and initialize fitting splats, (2) Grouped 4D Gaussian Splatting to improve consistency between splats and dynamic objects. Uses cross-view color consistency to classify rendering errors and apply targeted corrections.

Result: Significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, with 0.39dB PSNR improvement on Technicolor Light Field dataset. Shows improved alignment between splats and dynamic objects.

Conclusion: The proposed method effectively addresses limitations in dynamic scene reconstruction through targeted error correction and grouped splatting, demonstrating superior performance over existing approaches.

Abstract: Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving 0.39dB of PSNR on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method’s capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.

[91] Decoupling Complexity from Scale in Latent Diffusion Model

Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao, Pengfei Wan

Main category: cs.CV

TL;DR: DCS-LDM decouples content complexity from scale in latent diffusion models, enabling flexible resolution/frame-rate generation with fixed latent tokens and progressive coarse-to-fine synthesis.

Details

Motivation: Existing latent diffusion models couple scale with content complexity, using more tokens for higher resolutions/frame rates, but latent capacity should primarily depend on content complexity rather than scale.

Method: Constructs hierarchical scale-independent latent space with multi-level tokens, models sample complexity through these levels, supports decoding to arbitrary resolutions/frame rates with fixed latent representation.

Result: Achieves performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities with computation-quality tradeoff.

Conclusion: DCS-LDM successfully decouples information complexity from scale, enabling progressive coarse-to-fine generation and flexible multi-scale visual synthesis.

Abstract: Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.

[92] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, Chongyi Li

Main category: cs.CV

TL;DR: VTinker proposes a novel video frame interpolation pipeline with guided flow upsampling and texture mapping to address blurring and ghosting issues in high-resolution motion estimation.

Details

Motivation: Existing flow-based VFI methods suffer from blur/mosaic artifacts at flow edges and misalignment of fine pixel motion due to low-resolution motion estimation, leading to ghosting and discontinuities in interpolated frames.

Method: VTinker uses guided flow upsampling (GFU) with input frames as guidance to refine bilinear upsampling flows, followed by texture mapping that creates an intermediate proxy frame for selecting clear texture blocks from input frames, which are then mapped and reconstructed.

Result: Extensive experiments show VTinker achieves state-of-the-art performance in video frame interpolation, with codes publicly available.

Conclusion: The proposed VTinker pipeline effectively addresses flow edge blurring and pixel-level ghosting issues through guided flow upsampling and texture mapping, resulting in superior interpolation quality.

Abstract: Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows’ edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurring details in bilinear upsampling flows, which makes flows’ edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Codes are available at: https://github.com/Wucy0519/VTinker.

[93] How Noise Benefits AI-generated Image Detection

Jiazhen Yan, Ziqiang Li, Fan Wang, Kai Zeng, Zhangjie Fu

Main category: cs.CV

TL;DR: PiN-CLIP improves AI-generated image detection by injecting positive-incentive noise during training to suppress spurious shortcuts and amplify stable forensic cues, achieving state-of-the-art performance.

Details

Motivation: Current AI-generated image detection methods struggle with out-of-distribution generalization due to reliance on spurious shortcuts during training.

Method: Jointly trains a noise generator and detection network using variational positive-incentive principle, injecting cross-attention fused noise into feature space to fine-tune visual encoder.

Result: Achieves new SOTA with 5.4% average accuracy improvement on dataset containing images from 42 generative models.

Conclusion: Feature-space noise injection effectively suppresses shortcut dominance and enhances generalization in AI-generated image detection.

Abstract: The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

Li Yu, Yingbo Zhao, Shiyu Wu, Siyue Yu, Moncef Gabbouj, Qingshan Liu

Main category: cs.CV

TL;DR: Proposes a blind quality enhancement method for compressed videos using pretrained degradation representation learning and hierarchical termination mechanism to handle unknown QPs and varying computational demands.

Details

Motivation: Existing QECV methods require known QPs and use uniform architectures, limiting real-world applicability where QPs may be unknown and ignoring varying computational needs across compression levels.

Method: Uses pretrained Degradation Representation Learning (DRL) to extract multiscale degradation representations, and hierarchical termination mechanism to dynamically adjust artifact reduction stages based on compression level.

Result: Achieves 110% PSNR improvement (0.31 dB to 0.65 dB) over state-of-the-art blind method at QP=22, and reduces average inference time by half at QP=22 compared to QP=42.

Conclusion: The proposed blind QECV approach effectively handles unknown QPs through degradation representation learning and improves efficiency via adaptive computational allocation.

Abstract: Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.

[95] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin

Main category: cs.CV

TL;DR: TimeViper is a hybrid Mamba-Transformer model for long video understanding that processes hour-long videos with 10,000+ frames using a novel token compression method called TransV.

Details

Motivation: Long video understanding requires efficient architectures and effective temporal context handling. The paper addresses vision token redundancy in multimodal models and aims to develop interpretable hybrid architectures.

Method: Uses hybrid Mamba-Transformer backbone combining state-space model efficiency with attention expressivity. Introduces TransV module to transfer and compress vision tokens into instruction tokens while maintaining multimodal understanding.

Result: TimeViper processes hour-long videos exceeding 10,000 frames and competes with state-of-the-art models across multiple benchmarks while extending frame numbers. Provides insights into hybrid model interpretability.

Conclusion: This work represents an initial step toward developing, interpreting, and compressing hybrid Mamba-Transformer architectures for long video understanding.

Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

[96] Real-Time 3D Object Detection with Inference-Aligned Learning

Chenyu Zhao, Xianwei Zheng, Zimin Xia, Linwei Yue, Nan Xue

Main category: cs.CV

TL;DR: SR3D is a real-time 3D object detection framework for indoor point clouds that bridges the training-inference gap through spatial-prioritized optimal transport assignment and rank-aware adaptive self-distillation.

Details

Motivation: There exists a gap between how 3D object detectors are trained and evaluated, stemming from lack of spatial reliability and ranking awareness during training, which conflicts with ranking-based prediction selection used during inference.

Method: Two main components: 1) Spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, 2) Rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via self-distillation.

Result: Extensive experiments on ScanNet V2 and SUN RGB-D show SR3D significantly outperforms prior methods in accuracy while maintaining real-time speed.

Conclusion: SR3D effectively bridges the training-inference gap in 3D object detection and achieves superior performance for real-time indoor point cloud detection.

Abstract: Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used as inference. Such a training-inference gap hampers the model’s ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.

[97] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, Linlin Shen

Main category: cs.CV

TL;DR: SurvAgent is a hierarchical chain-of-thought enhanced multi-agent system for multimodal survival prediction in cancer prognosis, addressing limitations of existing methods through explainable AI and experiential learning.

Details

Motivation: Existing survival analysis methods lack transparency for clinical adoption, and current pathology agents cannot integrate multimodal data, effectively explore regions of interest, or leverage historical case learning for survival prediction.

Method: Two-stage approach: (1) WSI-Gene CoT-Enhanced Case Bank Construction with hierarchical analysis of pathology images and gene-stratified analysis, generating structured reports with chain-of-thought reasoning; (2) Dichotomy-Based Multi-Expert Agent Inference using RAG for similar case retrieval and progressive interval refinement for multimodal integration.

Result: Extensive experiments on five TCGA cohorts demonstrate SurvAgent’s superiority over conventional methods, proprietary MLLMs, and medical agents.

Conclusion: SurvAgent establishes a new paradigm for explainable AI-driven survival prediction in precision oncology, providing transparent and clinically adoptable solutions.

Abstract: Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent’s superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

[98] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

Quanqing Ma, Jiaen Chen, Peng Wang, Yao Zheng, Qingzhan Zhao, Yuchen Zheng

Main category: cs.CV

TL;DR: Proposes HSRW-CD dataset for high-resolution water body change detection and SSCP attention module to better leverage spatial semantics and structural information in deep features.

Details

Motivation: Addresses scarcity of high spatial resolution datasets for water body change detection and limitations of previous deep learning methods in exploiting spatial semantic and structural information.

Method: Creates HSRW-CD dataset with >3m spatial resolution, and designs SSCP attention module with three components: Multi-Semantic spatial Attention (MSA), Structural Relation-aware Global Attention (SRGA), and Channel-wise Self-Attention (CSA).

Result: Experiments on HSRW-CD and Water-CD datasets validate effectiveness and generalization of SSCP module in improving water body discrimination capability.

Conclusion: The proposed SSCP module significantly enhances water body change detection performance and can be integrated as plug-and-play component into existing models.

Abstract: Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Recently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. Besides, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water body. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be accessed at https://github.com/QingMa1/SSCP.

[99] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng

Main category: cs.CV

TL;DR: Introduces TwiG, the first interleaved framework that enables co-evolving textual reasoning throughout visual generation, allowing reasoning to guide upcoming regions and reflect on synthesized ones.

Details

Motivation: Current visual generation methods incorporate textual reasoning only before or after generation, lacking on-the-fly multimodal interaction during generation itself.

Method: Three strategies: zero-shot prompting, supervised fine-tuning on TwiG-50K dataset, and reinforcement learning via customized TwiG-GRPO strategy.

Result: Dynamic interplay produces more context-aware and semantically rich visual outputs.

Conclusion: This work inspires further research into interleaving textual reasoning for enhanced visual generation.

Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

[100] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak, Hyeonwoo Yu

Main category: cs.CV

TL;DR: LEGO-SLAM integrates language features into 3D Gaussian Splatting SLAM using a scene-adaptive encoder-decoder to compress high-dimensional embeddings into compact 16D features, enabling real-time open-vocabulary mapping with 60% Gaussian reduction.

Details

Motivation: Current 3DGS SLAM systems create photorealistic maps but lack semantic understanding for robotic interaction. Storing high-dimensional language features requires excessive memory, and static models can't adapt to novel environments.

Method: Uses a scene-adaptive encoder-decoder to distill language embeddings into 16D features, language-guided pruning to reduce Gaussians by 60%, and language-based loop detection using mapping features without separate models.

Result: Achieves competitive mapping quality and tracking accuracy with open-vocabulary capabilities at 15 FPS, significantly reducing memory usage while maintaining rendering quality.

Conclusion: LEGO-SLAM successfully integrates language understanding into 3DGS SLAM in real-time, providing semantic capabilities while optimizing memory and computational efficiency.

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map’s Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.

[101] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, Limin Wang

Main category: cs.CV

TL;DR: RGE enhances multimodal embeddings by incorporating explicit reasoning from MLLMs, improving retrieval performance by 4.9% over non-reasoning baselines.

Details

Motivation: Existing approaches treat embedding extraction as direct encoding, overlooking MLLMs' generative reasoning capability that could enhance representation quality.

Method: Proposes Reasoning Guided Embeddings (RGE) that preserves generative rationale process and couples it with contrastive training, enabling structured rationale generation before representation extraction.

Result: Experiments on MMEB benchmark show 4.9% improvement in multimodal retrieval performance over non-reasoning baseline.

Conclusion: Explicit reasoning can effectively enhance embedding quality in multimodal representation learning.

Abstract: Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.

[102] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: PPCL is a flexible structured pruning framework for Diffusion Transformers that achieves 50% parameter reduction with minimal performance degradation through plug-and-play distillation.

Details

Motivation: Diffusion Transformers have high computational costs due to large parameter counts, limiting deployment in resource-constrained settings.

Method: Uses linear probing with differential trend analysis to identify redundant layers, then implements plug-and-play teacher-student alternating distillation for depth-wise and width-wise pruning in a single training phase.

Result: Achieves 50% parameter reduction compared to full model with less than 3% degradation in key metrics, maintaining high-quality image generation capabilities.

Conclusion: PPCL enables efficient compression of DiT models for resource-constrained environments while preserving performance.

Abstract: Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code, checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.

[103] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao

Main category: cs.CV

TL;DR: Video2Layout is a framework that reconstructs metric-grounded spatial layouts from videos using continuous object boundary coordinates instead of discretized grid maps, enabling fine-grained spatial reasoning in MLLMs.

Details

Motivation: Current grid-based map methods for spatial understanding in MLLMs rely on discretized raster representations, which limit fine-grained spatial reasoning capabilities and create ambiguity in describing spatial relationships.

Method: The framework uses two stages: supervised fine-tuning with a dataset from AI2THOR simulator to learn mapping from visual inputs to boundary coordinates, followed by reinforcement fine-tuning for real-world generalization. It employs continuous object boundary coordinates to quantify physical distances and object sizes.

Result: The V2LO-7B model achieves an average 4.92% improvement over grid map-trained models on QVS-Bench and mainstream spatial reasoning benchmarks, demonstrating superior spatial reasoning capabilities.

Conclusion: Using continuous boundary coordinates instead of discretized grid maps significantly enhances MLLMs’ spatial reasoning accuracy and quantitative computation capabilities, effectively reducing ambiguity in spatial relationship descriptions.

Abstract: Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model’s ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model’s real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.

[104] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion

Lirui Zhang, Zhengkai Zhao, Zhi Zuo, Pan Gao, Jie Qin

Main category: cs.CV

TL;DR: Simba reformulates point cloud completion by replacing direct regression with distribution learning using diffusion models, achieving SOTA performance while preserving details and global structure.

Details

Motivation: Address limitations of regression-based methods that overfit to instance-specific transformations and are sensitive to input noise, leading to poor generalization.

Method: Integrates symmetry priors with diffusion models for distribution learning, and uses hierarchical Mamba-based architecture for high-fidelity upsampling.

Result: Achieves state-of-the-art performance across PCN, ShapeNet, and KITTI benchmarks, demonstrating superior robustness and generalization.

Conclusion: Reformulating point cloud completion as distribution learning with diffusion models effectively overcomes limitations of regression-based approaches and enables robust, high-quality completion.

Abstract: Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting which tend to memorize instant-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression lead to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method’s state-of-the-art (SOTA) performance.

[105] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

Yuting Lu, Ziliang Wang, Weixin Xu, Wei Zhang, Yongqiang Zhao, Yang Yu, Xiaohong Zhang

Main category: cs.CV

TL;DR: LNG-SWR is a plug-in framework that improves segmentation robustness by injecting layer-wise noise during training to learn frequency-bias priors, then applying selective wavelet reconstruction to suppress noise-sensitive bands while enhancing structural cues.

Details

Motivation: Address the limitations of adversarial training (AT) which creates clean-robustness trade-offs and high training costs, limiting scalability in medical imaging applications.

Method: Inject small zero-mean noise at multiple layers to learn frequency-bias priors, then apply prior-guided selective wavelet reconstruction on input/feature branches to achieve frequency adaptation by suppressing noise-sensitive bands and enhancing directional structures.

Result: On CT and ultrasound datasets, LNG-SWR consistently improves clean Dice/IoU and significantly reduces performance drop under strong attacks (PGD-L∞/L₂ and SSAH). When combined with AT, it yields additive gains without sacrificing clean accuracy.

Conclusion: LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation that works with or without adversarial training, offering better scalability and maintainability.

Abstract: Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean–robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose \emph{Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR)}. During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.

[106] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou

Main category: cs.CV

TL;DR: This paper proposes VTIA, a two-stage attack method that injects imperceptible adversarial perturbations into images to induce Vision-Language Models to generate excessively verbose outputs, significantly increasing token consumption and deployment costs.

Details

Motivation: Address the limitations of existing methods that fail to directly maximize output token length as an explicit optimization objective, lacking stability and controllability in inducing verbose outputs from VLMs.

Method: A two-stage framework: 1) Adversarial prompt search using reinforcement learning to identify prompts that induce verbose outputs from LLM components, 2) Vision-aligned perturbation optimization to craft adversarial images that maximize similarity with adversarial prompt embeddings.

Result: Comprehensive experiments on four popular VLMs show the method achieves significant advantages in effectiveness, efficiency, and generalization capability for inducing verbose text generation.

Conclusion: The proposed VTIA method successfully demonstrates how imperceptible adversarial perturbations can be systematically crafted to maximize output token length in VLMs, highlighting security vulnerabilities in multimodal AI systems.

Abstract: With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric.Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability.To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed images.Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image’s visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.

[107] EvoVLA: Self-Evolving Vision-Language-Action Model

Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang

Main category: cs.CV

TL;DR: EvoVLA is a self-supervised Vision-Language-Action framework that addresses stage hallucination in long-horizon robotic manipulation through stage-aligned rewards, pose-based object exploration, and long-horizon memory mechanisms.

Details

Motivation: Current VLA models suffer from stage hallucination where agents exploit coarse evaluation signals to shortcut multi-step tasks without truly completing them, limiting their effectiveness in long-horizon robotic manipulation.

Method: Three complementary components: Stage-Aligned Reward (SAR) using triplet contrastive learning with hard negatives, Pose-Based Object Exploration (POE) grounding curiosity in relative object-gripper pose, and Long-Horizon Memory with selective context retention and gated fusion.

Result: Improves average task success by 10.2 percentage points over OpenVLA-OFT (69.2%), achieves 1.5x better sample efficiency, reduces stage hallucination from 38.5% to 14.8%, and real-world deployment reaches 54.6% success rate (11 points better than OpenVLA-OFT).

Conclusion: EvoVLA effectively addresses stage hallucination in VLA models, demonstrates strong sim-to-real transfer and generalization capabilities for long-horizon robotic manipulation tasks.

Abstract: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.

[108] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

Jiahao Li, Yang Lu, Yachao Zhang, Yong Xie, Fangyong Wang, Yuan Xie, Yanyun Qu

Main category: cs.CV

TL;DR: RF-CLIP is a training-free method that addresses CLIP’s attention distraction to irrelevant tokens by filtering dimension-specific over-activation, improving pixel-level multimodal alignment for open-vocabulary semantic segmentation.

Details

Motivation: Existing methods don't investigate CLIP's performance boundaries for dense prediction from interpretability mechanisms perspective, and CLIP diverts significant attention from target regions to irrelevant tokens due to dimension-specific over-activation.

Method: Propose ReFocusing CLIP (RF-CLIP) that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, filtering dimension-specific over-activation tokens without training.

Result: Achieves state-of-the-art performance on eight benchmarks while maintaining high inference efficiency.

Conclusion: RF-CLIP effectively enhances CLIP’s multimodal alignment granularity by addressing attention distraction, demonstrating superior performance in open-vocabulary semantic segmentation.

Abstract: Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP’s vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP’s internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP’s dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP’s multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

[109] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng

Main category: cs.CV

TL;DR: Mantis introduces a Disentangled Visual Foresight framework that separates visual prediction from the main VLA backbone using meta queries and diffusion Transformer, achieving superior performance on robotics tasks.

Details

Motivation: Existing VLA models face challenges with high-dimensional visual state prediction distributing model capacity and high training costs, while compressed visual states create information bottlenecks and neglect language supervision.

Method: Mantis uses meta queries and a diffusion Transformer head to decouple visual foresight prediction from the backbone, with residual connections for current visual state and next-state prediction to capture latent actions.

Result: Achieves 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing baselines with high convergence speed. Outperforms π₀.₅ in real-world evaluations for instruction-following, generalization, and reasoning.

Conclusion: The disentangled approach reduces VLA backbone burden while maintaining comprehension and reasoning through language supervision, demonstrating effective visual foresight for robotics applications.

Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

[110] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

Nianchang Huang, Yi Xu, Ruida Xi, Ruida Xi, Qiang Zhang

Main category: cs.CV

TL;DR: Proposes DSLGA, a two-stage unsupervised domain adaptation method for visible-infrared person re-identification that handles inter-domain and intra-domain modality discrepancies through domain-shared learning and gradual alignment strategies.

Details

Motivation: Existing VI-ReID algorithms struggle in real-world applications due to discrepancies between public datasets and real-world data, requiring a solution that transfers knowledge without new annotations.

Method: Two-stage DSLGA model: 1) Pre-training with Domain-Shared Learning Strategy (DSLS) to mitigate inter-domain modality discrepancies, 2) Fine-tuning with Gradual Alignment Strategy (GAS) using cluster-to-holistic alignment for intra-domain modality discrepancies.

Result: Significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.

Conclusion: DSLGA effectively addresses UDA-VI-ReID challenges and demonstrates superior performance, with the new CMDA-XD testing method providing a benchmark for future research.

Abstract: Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy and requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. While, in the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through a cluster-to-holistic alignment way. Finally, a new UDA-VI-ReID testing method i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. A large amount of experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.

[111] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

Deniz Sayin Mercadier, Hieu Le, Yihong Chen, Jiancheng Yang, Udaranga Wickramasinghe, Pascal Fua

Main category: cs.CV

TL;DR: PrIntMesh is a template-based framework that reconstructs organs as unified systems, preserving topology and internal boundaries while achieving high geometric accuracy.

Details

Motivation: Current deep-learning approaches treat organ substructures independently, leading to anatomically implausible reconstructions that don't respect the interconnected nature of organ components.

Method: Uses a connected template and jointly deforms all substructures to match patient-specific anatomy while preserving internal boundaries and enforcing smooth surfaces.

Result: Achieves high geometric accuracy, correct topology, and robust performance on heart, hippocampus, and lungs, even with limited or noisy training data. Better reconstructs shared interfaces than voxel- and surface-based methods.

Conclusion: PrIntMesh provides a topology-preserving, data-efficient solution that maintains structural consistency and is suitable for clinical use.

Abstract: Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.

[112] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yinxin Zhang, Lingjuan Lyu, Yaochu Jin

Main category: cs.CV

TL;DR: VLA-Fool is a comprehensive study of multimodal adversarial robustness in Vision-Language-Action models, introducing unified attacks across text, vision, and cross-modal misalignment under white-box and black-box settings.

Details

Motivation: The adversarial robustness of Vision-Language-Action models remains largely unexplored, especially under realistic multimodal and black-box conditions, with existing studies overlooking cross-modal misalignment that affects embodied reasoning.

Method: VLA-Fool unifies three levels of multimodal adversarial attacks: textual perturbations (gradient-based and prompt-based), visual perturbations (patch and noise distortions), and cross-modal misalignment attacks. It incorporates a VLA-aware semantic space into linguistic prompts for automated semantic guidance.

Result: Experiments on LIBERO benchmark using fine-tuned OpenVLA model show that even minor multimodal perturbations cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

Conclusion: The study reveals the vulnerability of embodied VLA models to multimodal adversarial attacks and highlights the importance of addressing cross-modal misalignment for robust embodied AI systems.

Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

[113] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

Melih Baydar, Emre Akbas

Main category: cs.CV

TL;DR: ICCE introduces adaptive nearest neighbor selection and cluster ensembling to improve unsupervised image clustering, achieving state-of-the-art performance including 70.4% accuracy on ImageNet.

Details

Motivation: To bridge the performance gap between unsupervised and supervised image classification by improving clustering methods after the shift from representation learning to clustering-focused approaches.

Method: Multi-head clustering on frozen backbone, cluster ensembling to consolidate results, and training classifier with consensus pseudo-labels.

Result: Achieved 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet - first unsupervised method to exceed 70% on ImageNet.

Conclusion: ICCE demonstrates that advanced clustering strategies can significantly narrow the performance gap between unsupervised and supervised image classification.

Abstract: Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models have recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, “Image Clustering through Cluster Ensembles” (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.

Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao

Main category: cs.CV

TL;DR: SwiTrack is a novel state-switching framework for cross-modal object tracking (RGB-NIR) that uses three specialized streams to handle different modalities and invalid inputs, achieving state-of-the-art performance with real-time tracking.

Details

Motivation: Existing CMOT methods use parallel RGB and NIR branches with shared backbone, which limits extraction of modality-specific features and fails to address object drift from unreliable inputs.

Method: Deploys three specialized streams: visual encoder for RGB, NIR gated adapter with visual encoder for NIR, and consistency trajectory prediction module for invalid modalities. Also uses dynamic template reconstruction and similarity alignment loss.

Result: Achieves state-of-the-art performance with 7.2% precision rate gain and 4.3% success rate gain, while maintaining real-time tracking at 65 FPS.

Conclusion: SwiTrack effectively addresses cross-modal tracking challenges by handling different modalities and unreliable inputs through specialized streams, demonstrating superior performance over existing methods.

Abstract: Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, boosting precision rate and success rate gains by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.

[115] Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

Sinan Mutlu, Georgios F. Angelis, Savas Ozkan, Paul Wisbey, Anastasios Drosou, Mete Ozay

Main category: cs.CV

TL;DR: A novel MLP-based method with Memory-Block for full-body tracking from sparse sensor inputs, achieving state-of-the-art performance and real-time speeds on mobile HMDs.

Details

Motivation: Existing AR/VR systems only track head and hands via HMDs and controllers, leaving full-body reconstruction incomplete. There's a need to generate complete full-body motions from limited sensor data.

Method: Uses multi-layer perceptron backbone enhanced with residual connections and a novel Memory-Block component that represents missing sensor data with trainable code-vectors, combined with temporal information from previous frames. Formulated as multi-task learning for robust representations.

Result: Outperforms state-of-the-art baselines with substantial error reduction. Achieves 72 FPS on mobile HMDs, improving accuracy-running time tradeoff.

Conclusion: The proposed method effectively addresses incomplete full-body tracking in AR/VR by generating realistic motions from sparse inputs, achieving both high accuracy and real-time performance.

Abstract: Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction in-complete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs that ultimately improves the accuracy-running time tradeoff.

[116] TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid

Seonghun Oh, Youngjung Uh, Jin-Hwa Kim

Main category: cs.CV

TL;DR: TetraSDF is an analytic meshing framework that extracts precise meshes from neural SDFs using ReLU MLPs with tetrahedral positional encoding, overcoming discretization errors and directional bias.

Details

Motivation: Existing methods for extracting meshes from neural SDFs have limitations: sampling-based approaches introduce discretization error, while analytic methods only work with plain ReLU MLPs without encoders.

Method: Uses ReLU MLP composed with multi-resolution tetrahedral positional encoder, where barycentric interpolation preserves continuous piecewise affine structure. Includes fixed analytic input preconditioner to reduce directional bias and stabilize training.

Result: Matches or surpasses grid-based encoders in SDF reconstruction accuracy across benchmarks. Produces highly self-consistent meshes faithful to learned isosurfaces with practical runtime and memory efficiency.

Conclusion: TetraSDF enables precise analytic meshing for encoded SDFs, overcoming limitations of previous methods while maintaining efficiency and accuracy.

Abstract: Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder’s barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder’s metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.

[117] Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM

Gergely Dinya, Péter Halász, András Lőrincz, Kristóf Karacs, Anna Gelencsér-Horváth

Main category: cs.CV

TL;DR: A fast spatio-temporal scene understanding framework using Vision Gated Generative Transformers (VGGT) for efficient near real-time assistive navigation applications.

Details

Motivation: To enable efficient, close to real-time scene understanding for assistive navigation while overcoming VGGT's high memory demands.

Method: Process image flow with sliding window for continuous 3D scene updates, use VGGT tracking head to aggregate 2D semantic instance masks into 3D objects, and maintain timestamps and instance-level identities for temporal consistency.

Result: Evaluated on well-known benchmarks and custom assistive navigation datasets, demonstrating applicability to real-world scenarios.

Conclusion: The framework successfully enables efficient spatio-temporal scene understanding for assistive navigation applications with real-time performance.

Abstract: We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT’s high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.

[118] Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability

Abishek Karthik, Pandiyaraju V, Sreya Mynampati

Main category: cs.CV

TL;DR: Hybrid deep learning framework combining CNNs, ViTs, and GNNs for robust weed detection with 99.33% accuracy using GAN augmentation and contrastive pre-training.

Details

Motivation: Accurate weed species identification enables selective herbicide application, supporting sustainable agriculture and precision farming practices.

Method: Combines CNNs, Vision Transformers, and Graph Neural Networks with GAN-based augmentation and self-supervised contrastive pre-training for feature learning from limited data.

Result: Achieved 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets with high interpretability and adaptability.

Conclusion: Framework enables real-time edge deployment for automated weed detection, reducing herbicide overuse and providing scalable sustainable farming solutions.

Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experimental results yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment of edge devices for automated weed detecting, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.

[119] Optimizing 3D Gaussian Splattering for Mobile GPUs

Md Musfiqur Rahman Sanim, Zhihao Shu, Bahram Afsharmanesh, AmirAli Mirian, Jiexiong Guan, Wei Niu, Bin Ren, Gagan Agrawal

Main category: cs.CV

TL;DR: Texture3dgs is an optimized mobile GPU implementation of 3D Gaussian Splatting that achieves up to 4.1× sorting speedup and 1.7× overall speedup while reducing memory usage by 1.6× through novel 2D texture cache optimization.

Details

Motivation: To enable efficient 3D scene reconstruction on mobile devices for benefits like data privacy, offline operation, and faster response times, by optimizing 3D Gaussian Splatting for mobile GPU constraints.

Method: Developed a novel sorting algorithm optimized for 2D texture cache, improved variable layout design, and other optimizations specifically targeting mobile GPU architecture.

Result: Achieved up to 4.1× speedup for sorting, 1.7× overall speedup for 3D scene reconstruction, and reduced memory usage by up to 1.6× compared to baseline implementations.

Conclusion: Texture3dgs effectively optimizes 3DGS for mobile deployment, demonstrating significant performance improvements through careful 2D texture cache optimization and mobile-specific algorithm design.

Abstract: Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively – while also reducing memory usage by up to 1.6$\times$ – demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.

[120] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Minseok Seo, Mark Hamilton, Changick Kim

Main category: cs.CV

TL;DR: Upsample Anything is a test-time optimization framework that upsamples low-resolution features to high-resolution outputs using anisotropic Gaussian kernels, achieving SOTA performance without training.

Details

Motivation: Vision Foundation Models produce downsampled representations (14x/16x) that limit pixel-level applications, while existing upsampling methods require dataset-specific retraining or heavy optimization.

Method: Per-image optimization learns anisotropic Gaussian kernels combining spatial and range cues, bridging Gaussian Splatting and Joint Bilateral Upsampling as a universal edge-aware operator.

Result: Runs in ≈0.419s per 224x224 image, achieves state-of-the-art performance on semantic segmentation, depth estimation, and depth/probability map upsampling across architectures and modalities.

Conclusion: The framework provides efficient, training-free upsampling that transfers seamlessly across different vision tasks and models, enabling precise high-resolution reconstruction.

Abstract: We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx0.419 \text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.

[121] Sparse Autoencoders are Topic Models

Leander Girrbach, Zeynep Akata

Main category: cs.CV

TL;DR: SAEs are reframed as topic models, leading to SAE-TM framework that learns reusable topic atoms for coherent thematic analysis across text and image datasets.

Details

Motivation: To clarify the role and practical value of sparse autoencoders (SAEs) by providing a new theoretical foundation that positions them as topic models rather than just embedding analyzers.

Method: Extend Latent Dirichlet Allocation to embedding spaces, derive SAE objective as MAP estimator, and develop SAE-TM framework that trains SAEs to learn topic atoms that can be interpreted as word distributions and merged into topics without retraining.

Result: SAE-TM produces more coherent topics than strong baselines on text and image datasets while maintaining diversity, and enables analysis of thematic structure in images and temporal topic changes in Japanese woodblock prints.

Conclusion: SAEs are positioned as effective tools for large-scale thematic analysis across modalities, with the topic modeling perspective providing both theoretical grounding and practical applications.

Abstract: Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.

[122] BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

Samuel Stevens

Main category: cs.CV

TL;DR: BioBench is a new ecology vision benchmark that addresses ImageNet’s limitations in predicting performance on scientific imagery, providing better evaluation for computer vision models in ecological applications.

Details

Motivation: ImageNet-1K linear-probe accuracy no longer reliably predicts performance on scientific imagery, explaining only 34% of variance on ecology tasks and mis-ranking 30% of models above 75% accuracy.

Method: BioBench unifies 9 publicly released ecology tasks across 4 taxonomic kingdoms and 6 acquisition modalities (3.1M images total), with a Python API for data download, lightweight classifier fitting to frozen backbones, and reporting class-balanced macro-F1 scores.

Result: The benchmark provides new signal for computer vision in ecology and enables reliable evaluation of vision models on diverse ecological data types.

Conclusion: BioBench serves as both a solution to ImageNet’s limitations for scientific imagery and a template recipe for building reliable AI-for-science benchmarks in any domain.

Abstract: ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.

[123] NaTex: Seamless Texture Generation as Latent Color Diffusion

Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Xin Yang, Xin Huang, Jingwei Huang, Xiangyu Yue, Chunchao Guo

Main category: cs.CV

TL;DR: NaTex is a native 3D texture generation framework that predicts texture color directly in 3D space, avoiding limitations of 2D multi-view diffusion approaches by treating texture as a dense color point cloud.

Details

Motivation: To overcome limitations of previous MVD-based approaches that struggle with occluded regions, mesh-texture alignment, and cross-view consistency in both content and color intensity.

Method: Proposes latent color diffusion with geometry-aware color point cloud VAE and multi-control diffusion transformer (DiT), trained from scratch using 3D data. Uses native geometry control via positional embeddings and geometry latents for precise alignment.

Result: Significantly outperforms previous methods in texture coherence and alignment, and demonstrates strong generalization for downstream applications like material generation and texture refinement.

Conclusion: NaTex provides an effective native 3D texture generation approach that addresses key limitations of 2D-based methods while enabling precise alignment and strong generalization capabilities.

Abstract: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-awared color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

[124] WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement

Ching-Heng Cheng, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu

Main category: cs.CV

TL;DR: WWE-UIE is an efficient underwater image enhancement network that integrates adaptive white balance, wavelet-based enhancement, and gradient-aware modules to achieve competitive restoration quality with low computational cost.

Details

Motivation: Existing hybrid UIE approaches achieve strong performance but have high computational costs that limit their practicality in real-time scenarios on resource-limited platforms.

Method: Proposes WWE-UIE with three interpretable priors: adaptive white balance for color correction, wavelet-based enhancement block for multi-band decomposition, and gradient-aware module with Sobel operators to preserve edge structures.

Result: Extensive experiments show WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms.

Conclusion: The proposed WWE-UIE provides an efficient solution for underwater image enhancement that maintains high quality while being computationally lightweight, making it suitable for real-time applications.

Abstract: Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.

[125] ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery

Ching-Heng Cheng, Chih-Chung Hsu

Main category: cs.CV

TL;DR: ChangeDINO is a Siamese framework for building change detection that combines a lightweight backbone with frozen DINOv3 features, uses a spatial-spectral transformer decoder with change priors, and includes learnable morphology for boundary refinement.

Details

Motivation: Existing deep learning methods rely too much on change-map annotations and underuse semantic information from non-changing regions, limiting robustness under illumination variation, off-nadir views, and scarce labels.

Method: Multiscale Siamese framework with lightweight backbone stream fused with frozen DINOv3 features, spatial-spectral differential transformer decoder using multi-scale absolute differences as change priors, and learnable morphology module for boundary refinement.

Result: Outperforms recent state-of-the-art methods on four public benchmarks in IoU and F1 metrics, with ablation studies confirming each component’s effectiveness.

Conclusion: ChangeDINO provides an effective end-to-end solution for optical building change detection that better leverages semantic information and handles various challenging conditions.

Abstract: Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.

[126] Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks

Yi Ting Tsai, Yu Wei Chen, Hong-Han Shuai, Ching-Chun Huang

Main category: cs.CV

TL;DR: ARASFSR is an arbitrary-resolution and arbitrary-scale face super-resolution method using implicit representation networks that overcomes limitations of fixed scales and input size sensitivity in existing FSR methods.

Details

Motivation: Existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations, which restricts their practical applications in real-world scenarios where input sizes and desired resolutions vary.

Method: Uses implicit representation networks with three key designs: 1) 2D deep features + local coordinates + scale ratios to predict RGB values for arbitrary scales, 2) local frequency estimation module to capture high-frequency textures and reduce spectral bias, 3) global coordinate modulation module to leverage facial structure knowledge for resolution adaptation.

Result: Quantitative and qualitative evaluations demonstrate ARASFSR’s robustness over state-of-the-art methods, effectively super-resolving facial images across various input sizes and up-sampling scales.

Conclusion: ARASFSR provides a flexible and effective solution for arbitrary-scale face super-resolution that addresses the limitations of traditional FSR methods while maintaining high-quality results across diverse input conditions.

Abstract: Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.

[127] Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach

Chi-Han Chen, Chieh-Ming Chen, Wen-Huang Cheng, Ching-Chun Huang

Main category: cs.CV

TL;DR: A teacher-student framework with key frame selection and updating enables weakly supervised terrain classification from UAV data using only 30% labeled data while improving both accuracy and temporal consistency.

Details

Motivation: UAV terrain classification faces challenges with data annotation complexity, temporal consistency, data scarcity, and technology range limitations. Traditional methods with full labeling are suboptimal for aerial positioning tasks.

Method: Proposed teacher-student architecture with key frame selection and updating algorithms for weakly supervised learning and temporal consistency knowledge distillation.

Result: Method using only 30% labeled data simultaneously improves mIoU and temporal consistency, enabling stable terrain object localization in aerial tasks.

Conclusion: The framework successfully overcomes traditional temporal consistency training deficiencies in aerial tasks through weakly supervised learning and knowledge distillation.

Abstract: The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency ensuring stable localization of terrain objects. Result demo : https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection

[128] CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering

Joni Vanherck, Steven Moonen, Brent Zoomers, Kobe Werner, Jeroen Put, Lode Jorissen, Nick Michiels

Main category: cs.CV

TL;DR: A real-time camera localization method using pre-captured LiDAR point clouds with neural rendering to bridge synthetic-real domain gaps, achieving drift-free metric-scale tracking.

Details

Motivation: Existing visual localization methods suffer from drift, scale ambiguity, and dependency on fiducials or loop closure, limiting reliability for robotics and XR applications.

Method: Render synthetic views from colored LiDAR point clouds, establish 2D-3D correspondences with live frames, and use neural rendering to reduce domain gaps between synthetic and real images for improved feature matching.

Result: Achieves drift-free camera tracking with correct metric scale in global LiDAR coordinates, outperforms existing SLAM pipelines on ScanNet++ dataset, with two real-time variants: Online Render and Match, and Prebuild and Localize.

Conclusion: The proposed method enables reliable, drift-free camera localization with metric scale accuracy by leveraging pre-captured LiDAR data and neural rendering techniques.

Abstract: Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift, scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.

[129] Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li, Jian Yang

Main category: cs.CV

TL;DR: MOMNet is an alignment-free depth super-resolution framework that handles misaligned RGB-D data through multi-order matching and aggregation, achieving state-of-the-art performance without requiring spatial alignment.

Details

Motivation: Real-world RGB-D data often suffers from misalignment due to hardware limitations and calibration drift, causing performance degradation in existing guided depth super-resolution methods that assume strict spatial alignment.

Method: Proposes Multi-Order Matching Network (MOMNet) with: 1) Multi-order matching mechanism (zero-order, first-order, second-order) to identify relevant RGB information across feature spaces, and 2) Multi-order aggregation with structure detectors that use multi-order priors as prompts for selective feature transfer.

Result: Extensive experiments show MOMNet achieves state-of-the-art performance and exhibits outstanding robustness in handling misaligned RGB-D data.

Conclusion: MOMNet provides an effective alignment-free solution for depth super-resolution that overcomes the limitations of existing methods in real-world misaligned scenarios.

Abstract: Recent guided depth super-resolution methods are premised on the assumption of strictly spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.

[130] DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Meng-Cheng Shih, Tsai-Ling Huang, Yu-Heng Shih, Hong-Han Shuai, Hsuan-Tung Liu, Yi-Ren Yeh, Ching-Chun Huang

Main category: cs.CV

TL;DR: DetailSemNet is a new model for offline signature verification that focuses on matching local structures between signature images rather than holistic features, achieving state-of-the-art performance through a Detail Semantics Integrator that enhances fine-grained details.

Details

Motivation: Previous methods rely on holistic features for pair comparisons, which may not capture fine-grained differences needed for robust signature verification. Transformer-based backbones can naturally obscure local details, adversely impacting performance.

Method: Proposes DetailSemNet with local structure matching between signature images and introduces a Detail Semantics Integrator using feature disentanglement and re-entanglement to enhance intricate details while expanding discriminative semantics.

Result: Outperforms recent methods on leading benchmarks with clear margins, achieves state-of-the-art results, demonstrates remarkable generalization in cross-dataset testing, and improves model interpretability.

Conclusion: The emphasis on local structure matching improves performance and interpretability, while the combination of generalizability and interpretability significantly bolsters DetailSemNet’s potential for real-world applications.

Abstract: Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model’s interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.

[131] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

Pan Yang, Cheng Deng, Jing Yang, Han Zhao, Yun Liu, Yuling Chen, Xiaoli Ruan, Yanping Chen

Main category: cs.CV

TL;DR: CAMS proposes a novel CLIP-based compositional zero-shot learning method that uses gated cross-attention and multi-space disentanglement to better separate attribute and object semantics for improved recognition of unseen compositions.

Details

Motivation: Existing CLIP-based CZSL methods using global semantic representations have limited representational capacity and cannot fully disentangle attributes and objects, hindering generalization to unseen compositions.

Method: Uses Gated Cross-Attention to extract fine-grained semantic features from CLIP’s image encoder blocks through latent units while suppressing irrelevant information, followed by Multi-Space Disentanglement to separate attribute and object semantics.

Result: Achieves state-of-the-art performance on MIT-States, UT-Zappos, and C-GQA benchmarks in both closed-world and open-world settings.

Conclusion: CAMS effectively improves compositional zero-shot learning by extracting semantic features from visual features and performing semantic disentanglement in multidimensional spaces.

Abstract: Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and do not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.

[132] End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss

Hai Lan, Zongyan Li, Jianmin Hu, Jialing Yang, Houde Dai

Main category: cs.CV

TL;DR: Introduces Rigid Body Markers (RBMs) for sparse 6-DoF motion capture, enabling real-time SMPL parameter estimation with geodesic loss that matches optimization methods at 10x less computation.

Details

Motivation: Traditional marker-based MoCap requires dense markers, causing setup complexity and identification ambiguity. RBMs provide a simpler, scalable solution with unambiguous 6-DoF data.

Method: Uses RBMs as fundamental MoCap units, trains deep learning regression model on AMASS data with geodesic loss to directly estimate SMPL parameters end-to-end.

Result: Achieves state-of-the-art body pose estimation accuracy comparable to optimization methods but with 10x less computation, validated on real Vicon system data.

Conclusion: Sparse 6-DoF RBMs with geodesic loss provide practical, high-fidelity real-time MoCap for graphics, VR, and biomechanics applications.

Abstract: Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.

[133] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud, Christian Grannemann, Max Mehltretter

Main category: cs.CV

TL;DR: A geometry-guided self-supervised method for consistent surround-view depth estimation using multi-camera rigs with explicit spatial attention on a shared cylinder projection.

Details

Motivation: Existing self-supervised surround-view depth estimation methods suffer from inconsistent depth estimates between overlapping images, limiting their practical utility for 360° perception.

Method: Predicts initial depth maps per image, projects 3D points onto a shared unit cylinder to establish cross-image neighborhood relations, then uses explicit non-learned spatial attention based on cylinder distances to aggregate features and predict final consistent depth maps.

Result: Improves depth consistency across images and overall depth quality on DDAD and nuScenes datasets compared to state-of-the-art methods.

Conclusion: The proposed geometry-guided approach with explicit spatial attention enables dense, metric, and cross-view-consistent depth estimation for multi-camera systems.

Abstract: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.

[134] Graph Neural Networks for Surgical Scene Segmentation

Yihan Li, Nikhil Churamani, Maria Robu, Imanol Luengo, Danail Stoyanov

Main category: cs.CV

TL;DR: Graph-based segmentation models combining Vision Transformers with Graph Neural Networks improve surgical scene analysis by better handling occlusions, long-range dependencies, and fine-scale geometry of rare anatomical structures.

Details

Motivation: Accurate identification of hepatocystic anatomy is critical for preventing surgical complications during laparoscopic cholecystectomy, but deep learning models struggle with occlusions, long-range dependencies, and capturing fine-scale geometry of rare structures.

Method: Two segmentation models integrating Vision Transformer feature encoders with Graph Neural Networks: (1) static k-NN graph with GCNII for stable long-range information propagation, (2) dynamic Differentiable Graph Generator with GAT for adaptive topology learning.

Result: Achieved 7-8% improvement in mIoU and 6% improvement in mDice scores over state-of-the-art baselines, with anatomically coherent predictions particularly on thin, rare and safety-critical structures.

Conclusion: Graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation, improving interpretability and reliability for safer laparoscopic and robot-assisted surgery.

Abstract: Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. It produces anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.

[135] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen, Weifeng Liu

Main category: cs.CV

TL;DR: LDAG introduces language-driven attribute generalization for few-shot segmentation, using LLM-generated attribute descriptions as unbiased meta guidance instead of relying on support images, achieving SOTA performance.

Details

Motivation: Existing FSS methods struggle with intra-class variations in support images, leading to inaccurate meta guidance for untrained classes. The authors argue that support images may not be essential and propose using language descriptions as unbiased guidance.

Method: Proposes LDAG with two modules: Multi-attribute Enhancement (MaE) uses LLMs to generate detailed attribute descriptions and builds visual-text prior guidance, and Multi-modal Attribute Alignment (MaA) enables cross-modal interaction between attribute texts and visual features.

Result: The method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance in few-shot segmentation.

Conclusion: Language-driven attribute generalization provides more robust and unbiased support strategy for few-shot segmentation compared to traditional support image-based approaches.

Abstract: Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential, the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, due to text-vision modal shift, attribute text struggles to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual feature. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves the new state-of-the art performance. The code will be released.

[136] Efficient Architectures for High Resolution Vision-Language Models

Miguel Carvalho, Bruno Martins

Main category: cs.CV

TL;DR: Pheye is a novel VLM architecture that efficiently processes high-resolution images with fewer parameters while maintaining strong performance on fine-grained understanding and scene-text tasks.

Details

Motivation: Current VLMs struggle with accurate recognition of fine details in high-resolution images, limiting performance in multiple tasks.

Method: Introduces Pheye architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs.

Result: Achieves high efficiency while maintaining strong performance, particularly in fine-grained image understanding and scene-text handling tasks.

Conclusion: Pheye provides an effective solution for high-resolution image processing in VLMs with improved efficiency and maintained performance on detailed tasks.

Abstract: Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.

[137] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Diogo J. Paulo, João Martins, Hugo Proença, João C. Neves

Main category: cs.CV

TL;DR: StreetView-Waste dataset addresses the gap in monitoring overflowing waste containers from garbage truck images, supporting detection, tracking, and segmentation tasks with improved baselines.

Details

Motivation: Urban waste management is critical for smart cities, but existing datasets lack annotations for container tracking or are captured in static environments, limiting real-world utility.

Method: Created StreetView-Waste dataset with three evaluation tasks, benchmarked state-of-the-art models, and proposed two strategies: heuristic-based tracking improvement and geometry-aware segmentation refinement.

Result: Fine-tuned detectors achieved reasonable container detection; proposed heuristics reduced counting error by 79.6%; geometry-aware strategy improved segmentation mAP@0.5 by 27% on lightweight models.

Conclusion: StreetView-Waste provides a challenging benchmark for real-world perception systems in urban waste management, demonstrating value of multimodal inputs and specialized strategies.

Abstract: Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.

[138] PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting

Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, Tom Goldstein

Main category: cs.CV

TL;DR: Proposes PUP 3D-GS, a principled pruning method for 3D Gaussian Splatting that achieves high compression ratios (90% pruning) while maintaining visual quality and foreground details through sensitivity-based scoring and multi-round refinement.

Details

Motivation: 3D Gaussian Splatting models require high storage/memory due to millions of Gaussians, limiting deployment on resource-constrained devices. Existing pruning methods cause heavy degradation at high compression ratios.

Method: Uses a principled sensitivity pruning score based on second-order approximation of reconstruction error, plus a multi-round prune-refine pipeline applicable to any pretrained 3D-GS model without changing training.

Result: After pruning 90% of Gaussians, achieves 3.56× faster rendering while retaining more foreground details and higher image quality metrics than existing methods on multiple datasets.

Conclusion: The proposed PUP 3D-GS method enables substantial compression of 3D-GS models while preserving visual fidelity, making them more viable for resource-constrained devices.

Abstract: Recent advances in novel view synthesis have enabled real-time rendering speeds with high reconstruction accuracy. 3D Gaussian Splatting (3D-GS), a foundational point-based parametric 3D scene representation, models scenes as large sets of 3D Gaussians. However, complex scenes can consist of millions of Gaussians, resulting in high storage and memory requirements that limit the viability of 3D-GS on devices with limited resources. Current techniques for compressing these pretrained models by pruning Gaussians rely on combining heuristics to determine which Gaussians to remove. At high compression ratios, these pruned scenes suffer from heavy degradation of visual fidelity and loss of foreground details. In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. It is computed as a second-order approximation of the reconstruction error on the training views with respect to the spatial parameters of each Gaussian. Additionally, we propose a multi-round prune-refine pipeline that can be applied to any pretrained 3D-GS model without changing its training pipeline. After pruning 90% of Gaussians, a substantially higher percentage than previous methods, our PUP 3D-GS pipeline increases average rendering speed by 3.56$\times$ while retaining more salient foreground information and achieving higher image quality metrics than existing techniques on scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets.

[139] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

Main category: cs.CV

TL;DR: VLA-Pruner is a token pruning method for Vision-Language-Action models that addresses the limitations of existing VLM pruning methods by considering both semantic understanding and action execution needs through dual-level importance criteria.

Details

Motivation: Existing VLM token pruning methods focus only on semantic salience, ignoring VLA's dual-system nature of high-level semantic understanding and low-level action execution, which leads to degraded performance in action generation tasks.

Method: VLA-Pruner uses dual-level importance criteria: vision-language prefill attention for semantic relevance and action decode attention (estimated via temporal smoothing) for action importance. It adaptively selects tokens for both semantic understanding and action execution under computational constraints.

Result: VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks, demonstrating superior efficiency and effectiveness compared to existing methods.

Conclusion: The proposed dual-level token pruning approach effectively addresses the unique requirements of VLA models, enabling real-time deployment while maintaining both semantic understanding and action execution capabilities.

Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA’s intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.

[140] LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe

Main category: cs.CV

TL;DR: LLaVA³ improves 3D scene understanding in vision-language models using multi-view 2D images without fine-tuning, inspired by Cubist art to create omnidirectional object representations.

Details

Motivation: Address the challenge of limited 3D training data by leveraging abundant 2D datasets to enhance 3D scene understanding capabilities in vision-language models.

Method: Proposes omnidirectional visual representations of objects derived from multi-view 3D reconstruction, inspired by Cubist painting principles that show multiple viewpoints in a single image.

Result: Outperforms previous 2D-based VLM solutions on 3D VQA and 3D language grounding tasks through extensive experiments.

Conclusion: LLaVA³ successfully enhances 3D scene understanding using only 2D multi-view images without requiring fine-tuning, demonstrating effective transfer of 2D capabilities to 3D tasks.

Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

[141] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry

Clemens Pollak, Kersten Diers, Santiago Estrada, David Kügler, Martin Reuter

Main category: cs.CV

TL;DR: FastSurfer-CC is an automated framework for corpus callosum morphometry that outperforms existing tools and detects significant differences in Huntington’s disease patients.

Details

Motivation: There is a lack of comprehensive and automated tools for corpus callosum analysis despite its importance in aging, neurological disease research, and clinical interventions.

Method: Automated pipeline that identifies mid-sagittal slices, segments corpus callosum and fornix, localizes commissures for head standardization, generates thickness profiles and subdivisions, and extracts shape metrics.

Result: Outperforms existing specialized tools across individual tasks and reveals statistically significant differences between Huntington’s disease patients and healthy controls not detected by current state-of-the-art methods.

Conclusion: FastSurfer-CC provides an efficient, fully automated solution for comprehensive corpus callosum morphometry that enables more sensitive detection of neurological differences.

Abstract: The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington’s disease patients and healthy controls that are not detected by the current state-of-the-art.

[142] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig

Main category: cs.CV

TL;DR: CAIRe is a new evaluation metric that measures cultural relevance in text-to-image models by grounding entities to a knowledge base and providing independent graded judgments for cultural labels, outperforming baselines by 22% F1 points.

Details

Motivation: To address the challenge of measuring cross-cultural biases in text-to-image models, which has stalled progress in ensuring equitable performance across diverse cultural contexts due to trade-offs like performance loss, factual inaccuracies, or offensive outputs.

Method: The framework grounds entities and concepts in images to a knowledge base and uses factual information to provide independent graded judgments for each culture label. It was tested on manually curated datasets of culturally salient but rare items built using language models.

Result: CAIRe surpassed all baselines by 22% F1 points on culturally salient datasets and achieved Pearson’s correlations of 0.56 and 0.66 with human ratings on culturally universal concepts datasets, demonstrating strong alignment with human judgment.

Conclusion: CAIRe provides a reliable method for measuring cultural relevance in text-to-image models, enabling better assessment and mitigation of cross-cultural biases while maintaining strong correlation with human judgment across diverse image sources.

Abstract: As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, an evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 22% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson’s correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.

[143] Flow and Depth Assisted Video Prediction with Latent Transformer

Eliyas Suleyman, Paul Henderson, Eksan Firkat, Nicolas Pugeault

Main category: cs.CV

TL;DR: This paper presents the first systematic study on occluded video prediction, showing that incorporating point-flow (motion) and depth-map (geometry) information improves prediction performance in occluded scenarios and background motion.

Details

Motivation: Occlusion remains a fundamental challenge in video prediction despite general models performing well in standard scenarios. The authors hypothesize that explicit motion and geometric structure information can help models handle occlusion better.

Method: Used a standard multi-object latent transformer architecture for video prediction, modified to incorporate depth and point-flow information. Evaluated on synthetic and real-world datasets using appearance metrics and Wasserstein distances on object masks.

Result: Models assisted with point-flow and depth performed better in occluded scenarios and predicted more accurate background motion compared to models without these modalities.

Conclusion: Explicit motion and geometric structure information (point-flow and depth) significantly improves video prediction performance in challenging occluded scenarios.

Abstract: Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and the background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.

[144] HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

Main category: cs.CV

TL;DR: HAWAII is a framework that distills knowledge from multiple visual experts into a single vision encoder using teacher-specific LoRA adapters with routing, enabling efficient knowledge transfer with minimal computational overhead.

Details

Motivation: To improve visual understanding in VLMs while reducing the high computational costs associated with using multiple pretrained visual experts during training and inference.

Method: Uses teacher-specific LoRA adapters with a router to mitigate conflicts between different teachers, and employs fine-grained (token importance scores) and coarse-grained (general-knowledge adapters) distillation approaches.

Result: Extensive experiments show HAWAII’s superiority over popular open-source VLMs across various vision-language tasks.

Conclusion: HAWAII successfully enables a single vision encoder to inherit complementary strengths from multiple visual experts with minimal computational overhead, providing an efficient solution for visual knowledge distillation.

Abstract: Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs. The code is available at https://github.com/yimuwangcs/wise-hawaii.

[145] Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation

Zongcai Tan, Lan Wei, Dandan Zhang

Main category: cs.CV

TL;DR: A physics-informed GAN framework that integrates wave optics and depth alignment to generate high-fidelity microscope images for microrobot pose estimation, achieving near-real-data performance without extensive real data collection.

Details

Motivation: Current microrobot pose estimation methods require large, expensive microscope image datasets that are difficult to acquire due to complex fabrication and labor-intensive labeling. Digital twin systems struggle to replicate complex optical microscopy phenomena like diffraction artifacts.

Method: Proposes a physics-informed deep generative learning framework that integrates wave optics-based physical rendering and depth alignment into a GAN to synthesize high-fidelity microscope images efficiently.

Result: Improves SSIM by 35.6% compared to purely AI-driven methods while maintaining real-time rendering (0.022 s/frame). Pose estimator trained on synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, only 5.0%/5.4% below real-data training. Generalizes to unseen poses without additional training.

Conclusion: The framework enables efficient data augmentation and robust pose estimation for novel microrobot configurations, overcoming limitations of traditional data collection methods while maintaining high accuracy comparable to real-data training.

Abstract: Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging.This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame).The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.

[146] Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI

Rui Wang, Yuexi Du, John Lewin, R. Todd Constable, Nicha C. Dvornek

Main category: cs.CV

TL;DR: Proposed time-modulated tumor segmentation for breast DCE-MRI using FiLM layers to incorporate acquisition time information, improving segmentation performance and generalization across different protocols.

Details

Motivation: Varying acquisition protocols and individual factors cause large appearance variations in DCE-MRI images, making automated tumor segmentation challenging despite the importance of DCE-MRI in breast cancer screening and treatment.

Method: Incorporated acquisition times using feature-wise linear modulation (FiLM) layers to modulate model features according to specific acquisition sequences, allowing use of variable numbers of images per study.

Result: Time-modulated models showed improved tumor segmentation performance and better generalization on both in-domain and out-of-domain datasets compared to baseline models.

Conclusion: Incorporating knowledge of phase acquisition time through FiLM layers effectively improves breast tumor segmentation in DCE-MRI and enhances model generalization across different imaging protocols.

Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variables number of images acquired per imaging study. We trained baseline and different configurations for the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.

[147] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

Main category: cs.CV

TL;DR: Sigma is a skeleton-based sign language understanding framework that addresses semantic grounding, local-global balance, and cross-modal learning challenges through sign-aware fusion, hierarchical alignment, and unified pre-training, achieving SOTA results across multiple SLU tasks.

Details

Motivation: Current skeleton-based SLU methods face three key limitations: weak semantic grounding (struggling to relate motion patterns to linguistic meaning), imbalance between local details and global context, and inefficient cross-modal learning for semantically aligned representations.

Method: Proposes Sigma framework with: 1) sign-aware early fusion for deep visual-textual interaction, 2) hierarchical alignment learning to maximize agreements across different feature levels, and 3) unified pre-training combining contrastive learning, text matching and language modeling.

Result: Achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation across multiple benchmarks spanning different sign and spoken languages.

Conclusion: Demonstrates the effectiveness of semantically informative pre-training and proves skeletal data can serve as a stand-alone solution for SLU, overcoming previous limitations in semantic grounding and cross-modal learning.

Abstract: Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.

[148] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

Jaime Álvarez Urueña, David Camacho, Javier Huertas Tato

Main category: cs.CV

TL;DR: A two-stage framework for synthetic image detection using contrastive learning embeddings and k-NN classification achieves 91.3% accuracy with minimal training data.

Details

Motivation: Address the challenge of detecting AI-generated images as new models rapidly emerge, making traditional retraining-based detection impractical.

Method: Two-stage approach: 1) Vision model trained via supervised contrastive learning on subset of generators, 2) k-NN classifier with few-shot learning using limited samples from unseen generators.

Result: 91.3% average detection accuracy with only 150 images per class, 5.2% improvement over existing methods. 14.70% AUC and 4.27% OSCR improvements for source attribution.

Conclusion: The framework provides robust, scalable forensic attribution that adapts to evolving generative AI without exhaustive retraining.

Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches . For the source attribution task, the proposed approach obtains improvements of of 14.70% and 4.27% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.

[149] YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras

Fan Yang, Sosuke Yamao, Ikuo Kusajima, Atsunori Moteki, Shoichi Masui, Shan Jiang

Main category: cs.CV

TL;DR: Proposes a joint framework for mapping indoor scenes and registering ceiling-mounted cameras using a mobile agent with RGB-D camera, achieving simultaneous scene mapping and camera registration through trajectory correlation and factor graph optimization.

Details

Motivation: Manual registration of ceiling-mounted cameras is inefficient and costly, while automatic visual localization performs poorly with visual ambiguity. Need a reliable method to register CMCs to scene layouts.

Method: Uses mobile agent with head-mounted RGB-D camera to traverse scene while CMCs capture the agent. Correlates trajectories from egocentric and CMC videos, then applies factor graph optimization to jointly optimize ego-camera poses, scene layout, and CMC poses.

Result: Method effectively accomplishes both scene mapping and CMC registration in unified framework, jointly enhancing performance of both tasks. Provides reliable tool for position-aware applications.

Conclusion: Proposed solution provides efficient and accurate joint mapping and registration of ceiling-mounted cameras, overcoming limitations of manual and automatic methods through collaborative trajectory correlation and optimization.

Abstract: Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.

Rahul Kumar, Vipul Baghel, Sudhanshu Singh, Bikash Kumar Badatya, Shivam Yadav, Babji Srinivasan, Ravi Hegde

Main category: cs.CV

TL;DR: A comprehensive video dataset for punch detection and classification in boxing, featuring 6,915 manually annotated punch clips across six types from 20 YouTube sparring sessions.

Details

Motivation: To address the bottleneck in robust datasets for combat sports analysis due to dynamic actions and varied recording environments, enabling better computer vision research in boxing.

Method: Curated 6,915 high-quality punch clips from 20 publicly available YouTube sparring sessions, manually segmented and labeled into six distinct punch types with precise temporal boundaries.

Result: Created a well-annotated dataset capturing diverse motion styles, camera angles, and athlete physiques, specifically designed for real-time vision-based action recognition in unconstrained environments.

Conclusion: This dataset provides a rich benchmark to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related combat sports domains.

Abstract: Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.

[151] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening

Misaal Khan, Mayank Vatsa, Kuldeep Singh, Richa Singh

Main category: cs.CV

TL;DR: NutriScreener is a retrieval-augmented graph attention network that uses CLIP embeddings and context awareness to detect malnutrition and predict anthropometric measurements from children’s images, achieving high accuracy and efficiency in clinical settings.

Details

Motivation: Child malnutrition is a global crisis, but existing screening methods are laborious and poorly scalable, hindering early intervention in low-resource settings.

Method: Retrieval-augmented multi-pose graph attention network combining CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness for malnutrition detection and anthropometric prediction from images.

Result: Achieved 0.79 recall, 0.82 AUC, significantly lower anthropometric RMSEs, and cross-dataset improvements of up to 25% recall gain and 3.5 cm RMSE reduction. Doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency.

Conclusion: NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments, demonstrating reliable measurement in unconstrained pediatric settings.

Abstract: Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children’s images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.

[152] Contrastive vision-language learning with paraphrasing and negation

Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d’Avila Garcez

Main category: cs.CV

TL;DR: SemCLIP improves CLIP’s robustness to semantic transformations like negation and paraphrasing through a new contrastive loss function and LLM-generated training triples, achieving better performance on negation benchmarks and downstream tasks.

Details

Motivation: CLIP struggles with negated or paraphrased text due to minimal lexical changes in negation and different expressions in paraphrasing, posing challenges for vision-language model alignment and evaluation.

Method: Proposes SemCLIP with a new contrastive loss function that accounts for both paraphrasing and negation, using LLM-generated training triples (original, paraphrased, negated captions) in CLIP-like training.

Result: SemCLIP moves paraphrased captions closer to original image embeddings while pushing negated captions further away. Improves CC-Neg benchmark accuracy from 68.1% to 78.1% and shows better performance on downstream zero-shot classification tasks.

Conclusion: SemCLIP achieves significant robustness to semantic transformations, preserving CLIP’s performance while substantially improving handling of negated captions and downstream task performance.

Abstract: Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP’s performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP’s performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.

[153] Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration

Fan Yang, Shigeyuki Odashima, Shoichi Masui, Ikuo Kusajima, Sosuke Yamao, Shan Jiang

Main category: cs.CV

TL;DR: A robust multi-camera tracking system for gymnastics that combines triangulation and ray-plane intersection to handle limited camera views and detection failures, successfully deployed at international championships.

Details

Motivation: Track gymnasts in competition settings with limited cameras and challenging conditions (lighting, occlusions, uniforms) that cause detection failures in some views, making conventional triangulation unreliable.

Method: Cascaded data association using triangulation when cross-view detections are sufficient, and ray-plane intersection when detections are insufficient, leveraging domain knowledge that gymnasts perform within predefined vertical planes.

Result: Superior performance over existing methods in challenging scenarios, successfully deployed at Gymnastics World Championships with recognition from the International Gymnastics Federation.

Conclusion: Incorporating gymnastics domain knowledge through ray-plane intersection effectively compensates for detection uncertainties and enables robust tracking in real-world competition environments.

Abstract: We present a robust multi-camera gymnast tracking, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast’s 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast’s 3D center typically lies within a predefined vertical plane during \revised{much of their} performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resort to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.

[154] Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation

Haytham Ziani

Main category: cs.CV

TL;DR: Analysis of local (Lucas-Kanade) and global (Horn-Schunck) optical flow methods, with implementation of a multiresolution Horn-Schunck algorithm using bilinear interpolation and prolongation to enhance motion estimation accuracy.

Details

Motivation: To explore and compare the theoretical and practical aspects of local and global optical flow computation methods, and improve motion estimation accuracy under varying image conditions.

Method: Implemented a multiresolution version of the Horn-Schunck algorithm using bilinear interpolation and prolongation, while analyzing both local (Lucas-Kanade) and global (Horn-Schunck) optical flow approaches.

Result: The combined multiresolution strategy with bilinear interpolation and prolongation improved the accuracy and convergence of motion estimation between frames.

Conclusion: Multiresolution Horn-Schunck with interpolation techniques effectively enhances optical flow computation performance, making it suitable for motion estimation under diverse image conditions.

Abstract: This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.

[155] VisPlay: Self-Evolving Vision-Language Models from Images

Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang

Main category: cs.CV

TL;DR: VisPlay is a self-evolving RL framework that enables Vision-Language Models to autonomously improve reasoning using unlabeled image data through dual roles of questioner and reasoner trained with diversity and difficulty rewards.

Details

Motivation: Existing RL approaches for VLMs rely on costly human annotations or task-specific heuristics, which are difficult to scale. There's a need for autonomous improvement methods that can leverage abundant unlabeled data.

Method: Assigns VLM into two roles: Image-Conditioned Questioner that formulates challenging visual questions, and Multimodal Reasoner that generates silver responses. Jointly trained with Group Relative Policy Optimization (GRPO) incorporating diversity and difficulty rewards.

Result: Achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks including MM-Vet and MMMU when trained on Qwen2.5-VL and MiMo-VL models.

Conclusion: VisPlay demonstrates a scalable path toward self-evolving multimodal intelligence by enabling autonomous improvement of VLMs using unlabeled data without human supervision.

Abstract: Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/

[156] Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap

Satyam Gaba

Main category: cs.CV

TL;DR: Using generative AI to create synthetic smoke datasets and applying domain adaptation methods to improve wildfire smoke detection models.

Details

Motivation: Early wildfire detection is crucial for damage mitigation, but limited annotated smoke datasets hinder deep learning model performance.

Method: Generate synthetic smoke datasets using generative AI, apply unsupervised domain adaptation, and enhance realism with style transfer, GANs, and image matting.

Result: Developed methods to bridge domain gap between synthetic and real smoke data, enabling more accurate detection models.

Conclusion: Generative AI and domain adaptation techniques can overcome data scarcity in wildfire smoke detection, leading to improved early warning systems.

Abstract: The early detection of wildfires is a critical environmental challenge, with timely identification of smoke plumes being key to mitigating large-scale damage. While deep neural networks have proven highly effective for localization tasks, the scarcity of large, annotated datasets for smoke detection limits their potential. In response, we leverage generative AI techniques to address this data limitation by synthesizing a comprehensive, annotated smoke dataset. We then explore unsupervised domain adaptation methods for smoke plume segmentation, analyzing their effectiveness in closing the gap between synthetic and real-world data. To further refine performance, we integrate advanced generative approaches such as style transfer, Generative Adversarial Networks (GANs), and image matting. These methods aim to enhance the realism of synthetic data and bridge the domain disparity, paving the way for more accurate and scalable wildfire detection models.

Pierrick Bournez, Luca Savant Aira, Thibaud Ehret, Gabriele Facciolo

Main category: cs.CV

TL;DR: EOGS++ extends 3D Gaussian Splatting for satellite imagery, operating directly on raw panchromatic data without preprocessing, integrating bundle adjustment via optical flow for improved camera poses, and achieving state-of-the-art reconstruction quality and efficiency.

Details

Motivation: To enhance the Earth Observation Gaussian Splatting framework by eliminating the need for external preprocessing tools, improving camera pose estimation, and achieving better reconstruction quality and geometric accuracy for satellite imagery.

Method: Extends EOGS framework with direct operation on raw high-resolution panchromatic data, embeds bundle adjustment using optical flow during training, implements early stopping, and applies TSDF post-processing for sharper reconstructions.

Result: EOGS++ achieves state-of-the-art performance on IARPA 2016 and DFC2019 datasets, improving building reconstruction MAE from 1.33 to 1.19 compared to original EOGS, while maintaining computational efficiency advantages of Gaussian Splatting.

Conclusion: EOGS++ successfully enhances satellite imagery reconstruction by integrating bundle adjustment directly into training, eliminating preprocessing dependencies, and improving both reconstruction quality and geometric accuracy over previous methods.

Abstract: Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering com- petitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and effi- ciency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 mean MAE errors on buildings compared to the original EOGS models

[158] Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

Satyam Gaba

Main category: cs.CV

TL;DR: The paper addresses long-tailed object detection using LVISv1 dataset, achieving state-of-the-art 24.5% mAP through enhanced Balanced Group Softmax and metric learning with k-NN classification.

Details

Motivation: Real-world object detection faces class imbalance where rare categories have few instances, causing bias toward frequent classes and degraded rare class performance.

Method: Two-stage Faster R-CNN with enhanced Balanced Group Softmax framework and metric learning for feature embeddings, using k-NN for inference to improve rare class classification.

Result: Achieved new state-of-the-art 24.5% mAP on LVISv1 dataset, surpassing previous 24.0% benchmark.

Conclusion: The proposed methods effectively mitigate class imbalance in long-tailed object detection, particularly improving rare class performance through better feature separation and clustering.

Abstract: Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.

[159] Progressive Supernet Training for Efficient Visual Autoregressive Modeling

Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li, Xiaodong Gu, Xinlei Chen, Mingbao Lin

Main category: cs.CV

TL;DR: VARiant reduces memory overhead in Visual Auto-Regressive models by using weight-shared subnets with progressive training, achieving near-equivalent quality with 40-80% memory reduction.

Details

Motivation: Progressive multi-scale generation in VAR models incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment despite reduced inference steps.

Method: Propose VARiant with equidistant sampling of subnets (16 to 2 layers) from original 30-layer network. Early scales use full network, later scales use subnets with weight sharing. Progressive training strategy resolves optimization conflicts.

Result: VARiant-d16 and VARiant-d8 achieve near-equivalent quality (FID 2.05/2.12 vs 1.95) with 40-65% memory reduction. VARiant-d2 achieves 3.5x speedup and 80% memory reduction at moderate quality cost (FID 2.97).

Conclusion: VARiant’s single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency for diverse application scenarios.

Abstract: Visual Auto-Regressive (VAR) models significantly reduce inference steps through the “next-scale” prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize subnet. Subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant’s single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.

[160] SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

Main category: cs.CV

TL;DR: SAM 3D is a generative model for 3D object reconstruction from single images, achieving state-of-the-art performance on real-world scenes with occlusion and clutter through large-scale human-annotated data and multi-stage training.

Details

Motivation: To address the challenge of 3D object reconstruction from single images in natural scenes with occlusion and clutter, where visual context plays a crucial role but existing methods struggle.

Method: Uses a human- and model-in-the-loop pipeline for large-scale annotation of object shape, texture, and pose, combined with multi-stage training that includes synthetic pretraining and real-world alignment.

Result: Achieves significant improvements over recent work with at least 5:1 win rate in human preference tests on real-world objects and scenes, breaking the 3D data barrier.

Conclusion: SAM 3D successfully enables high-quality 3D reconstruction from single images in challenging real-world scenarios and will release code, models, demo, and a new benchmark for in-the-wild 3D reconstruction.

Abstract: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D “data barrier”. We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

[161] Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: Lite Any Stereo is an efficient stereo depth estimation framework that achieves strong zero-shot generalization with ultra-light computational cost (<1% of SOTA methods).

Details

Motivation: Traditional efficient stereo models have limited capacity and poor zero-shot ability, while accurate models are computationally expensive. There's a need for models that balance efficiency with strong generalization.

Method: Uses a compact expressive backbone, hybrid cost aggregation module, and three-stage training strategy on million-scale data to bridge sim-to-real gap.

Result: Ranks 1st across four real-world benchmarks, achieves accuracy comparable to or exceeding SOTA non-prior-based methods with <1% computational cost.

Conclusion: Demonstrates that ultra-light models can deliver strong generalization, setting new standard for efficient stereo matching.

Abstract: Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

[162] POMA-3D: The Point Map Way to 3D Scene Understanding

Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: POMA-3D is the first self-supervised 3D representation model using point maps, which encode 3D coordinates on 2D grids to leverage 2D foundation models while preserving 3D geometry.

Details

Motivation: Addresses the scarcity of pretrained priors and limited data in 3D representation learning by exploring point maps as a bridge between 2D and 3D understanding.

Method: Uses point maps with view-to-scene alignment strategy and POMA-JEPA architecture for geometrically consistent features across multiple views. Trained on ScenePoint dataset from 6.5K room-level RGB-D scenes and 1M 2D image scenes.

Result: POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding, benefiting tasks including 3D question answering, embodied navigation, scene retrieval, and embodied localization using only geometric inputs.

Conclusion: POMA-3D successfully explores point maps as an effective approach for 3D scene understanding, overcoming data scarcity and pretraining limitations in 3D representation learning.

Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/

[163] Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed

Main category: cs.CV

TL;DR: A teacher-guided pruning framework that integrates Knowledge Distillation with importance score estimation for efficient one-shot unstructured pruning, achieving high sparsity with minimal performance loss.

Details

Motivation: To overcome the computational overhead of iterative train-prune-retrain cycles in unstructured pruning by developing a more efficient one-shot pruning approach.

Method: Leverages gradient signals from teacher network during importance score calculation to identify critical parameters, followed by sparsity-aware retraining without reactivating pruned connections.

Result: Outperforms state-of-the-art methods like EPG and EPSD at high sparsity levels on CIFAR-10, CIFAR-100, and TinyImageNet benchmarks, with minimal performance degradation.

Conclusion: Provides a computation-efficient, performance-preserving pruning solution suitable for resource-constrained environments.

Abstract: Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.

[164] Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks

Nirjhor Datta, Md. Golam Rabiul Alam

Main category: cs.CV

TL;DR: Erase to Retain is a controllable unlearning framework for medical image segmentation that uses teacher-student distillation with LoRA constraints to selectively remove knowledge while preserving global anatomical understanding.

Details

Motivation: The need for selective knowledge removal from medical segmentation networks for privacy compliance, ethical deployment, and continual dataset revision.

Method: Teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, featuring strong unlearning phase with adversarial optimization and gentle restoration phase with head-only supervised refinement.

Result: For ISIC segmentation: forget-set IoU reduced from 0.875 to 0.509 while maintaining performance on retain/validation (0.647 to 0.677 IoU). For ISIC classification: forget accuracy decreased from 87.0% to 64.1% while retain accuracy improved from 83.9% to 90.6%.

Conclusion: LoRA-based subspace unlearning provides a practical pathway for responsible, controllable, and reversible unlearning in medical image analysis, enabling selective forgetting while preserving essential performance.

Abstract: The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher’s confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.

[165] TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming

Zeyuan Yin, Xiaoming Liu

Main category: cs.CV

TL;DR: TRIM accelerates 3D Gaussian diffusion models by reducing denoising trajectories and pruning redundant Gaussian primitives, improving efficiency without quality loss.

Details

Motivation: Current 3D Gaussian diffusion models are slow due to massive Gaussian primitives requiring time-intensive denoising and post-processing, limiting scalability and generation speed.

Method: Proposes TRIM with temporal trimming (lightweight selector model for early trajectory reduction) and spatial trimming (instance mask denoising to prune redundant background Gaussian primitives).

Result: Significantly improves both efficiency and quality of 3D generation while supporting inference-time scaling for Gaussian diffusion models.

Conclusion: TRIM provides an effective post-training approach to accelerate 3D Gaussian diffusion models without compromising output quality.

Abstract: Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose $\textbf{TRIM}$ ($\textbf{T}$rajectory $\textbf{R}$eduction and $\textbf{I}$nstance $\textbf{M}$ask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at $\href{https://github.com/zeyuanyin/TRIM}{link}$.

[166] Dataset Distillation for Pre-Trained Self-Supervised Vision Models

George Cazenavette, Antonio Torralba, Vincent Sitzmann

Main category: cs.CV

TL;DR: Dataset distillation method that optimizes synthetic images to match gradients in linear classifiers when using pre-trained vision models, enabling efficient training of linear probes.

Details

Motivation: Existing dataset distillation methods focus on training from scratch, but modern vision approaches use pre-trained models. Need distillation methods that work with pre-trained feature extractors.

Method: Linear Gradient Matching - optimizes synthetic images to induce similar gradients in linear classifiers as real data when passed through pre-trained feature extractors.

Result: Synthetic data outperforms real-image baselines, generalizes across pre-trained models (e.g., CLIP probe trained with DINO-distilled data), effective for fine-grained classification and model interpretability.

Conclusion: Proposed method enables efficient dataset distillation for pre-trained models, providing valuable tools for model analysis and interpretability across different vision architectures.

Abstract: The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models’ embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.

[167] Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao

Main category: cs.CV

TL;DR: Proposes a novel framework for 3D hierarchical semantic segmentation with a late-decoupled architecture and auxiliary discrimination branch to address multi-hierarchy conflicts and class imbalance issues.

Details

Motivation: Previous 3DHS methods overlook two key challenges: multi-hierarchy conflicts in cross-hierarchy optimization and class imbalance across multiple hierarchies, which dominate model performance.

Method: Uses a primary 3DHS branch with multiple decoders using coarse-to-fine hierarchical guidance, plus an auxiliary discrimination branch with semantic prototype-based bi-branch supervision for mutual learning.

Result: Achieves state-of-the-art 3DHS performance on multiple datasets and backbones, with core components serving as plug-and-play enhancements for previous methods.

Conclusion: The proposed framework effectively mitigates multi-hierarchy conflicts and class imbalance in 3D hierarchical semantic segmentation through late-decoupled architecture and auxiliary supervision.

Abstract: 3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which makes the model performance become dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.

[168] Solving Spatial Supersensing Without Spatial Supersensing

Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

Main category: cs.CV

TL;DR: The paper critically analyzes Cambrian-S’s video world models and benchmarks, showing that simple baselines can solve VSR without spatial cognition, and that Cambrian-S’s inference methods exploit shortcuts in VSC rather than performing genuine spatial supersensing.

Details

Motivation: To critically evaluate whether Cambrian-S's benchmarks and methods truly measure spatial supersensing capabilities in video world models, or if they can be solved through simpler approaches and shortcut exploitation.

Method: Used two approaches: (1) NoSense baseline using only bag-of-words SigLIP model without temporal structure, (2) VSC-Repeat sanity check by concatenating videos with themselves to test if object-count predictions remain unchanged.

Result: NoSense achieved 95% accuracy on VSR without spatial cognition. VSC-Repeat collapsed Cambrian-S’s accuracy from 42% to 0%, showing it relies on the shortcut that rooms are never revisited rather than true spatial supersensing.

Conclusion: Current VSI-Super benchmarks don’t reliably measure spatial supersensing, and Cambrian-S’s inference methods improve performance by exploiting shortcuts rather than through robust spatial supersensing capabilities.

Abstract: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity

[169] PartUV: Part-Based UV Unwrapping of 3D Meshes

Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su, Minghua Liu

Main category: cs.CV

TL;DR: PartUV is a part-based UV unwrapping pipeline that generates fewer, part-aligned charts with low distortion for challenging AI-generated meshes.

Details

Motivation: Existing UV unwrapping methods struggle with AI-generated meshes that are noisy, bumpy, and poorly conditioned, often producing fragmented charts and suboptimal boundaries that cause artifacts.

Method: Combines semantic part decomposition from PartField with geometric heuristics in a top-down recursive framework, ensuring distortion below user threshold while minimizing chart count. Integrates parameterization/packing algorithms with non-manifold mesh handling and parallelization.

Result: Outperforms existing tools and neural methods in chart count and seam length, achieves comparable distortion, high success rates on challenging meshes, and enables part-specific multi-tiles packing.

Conclusion: PartUV provides an effective solution for UV unwrapping challenging AI-generated meshes by leveraging part-based decomposition and geometric heuristics.

Abstract: UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart’s distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.

[170] TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

Main category: cs.CV

TL;DR: TriDiff-4D is a diffusion-based pipeline that generates high-quality 4D avatars from text using triplane re-posing and auto-regressive generation, achieving temporal consistency and motion accuracy while reducing generation time from hours to seconds.

Details

Motivation: Address limitations of existing 4D generative methods including temporal/geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics in text-to-4D avatar generation.

Method: Uses diffusion-based triplane re-posing with auto-regressive strategy: first generates canonical 3D avatar and motion sequence from text, then animates avatar using second diffusion model. Learns 3D structure and motion priors from large datasets, enabling skeleton-driven generation.

Result: Significantly outperforms existing methods, reduces generation time from hours to seconds by eliminating optimization, improves complex motion generation with high-fidelity appearance and accurate 3D geometry.

Conclusion: TriDiff-4D provides an efficient, high-quality solution for text-to-4D avatar generation with superior temporal consistency, motion accuracy, and computational efficiency compared to previous approaches.

Abstract: With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

[171] SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Zhenyuan Qin, Xincheng Shuai, Henghui Ding

Main category: cs.CV

TL;DR: SceneDesigner enables precise 9D pose control (location, size, orientation) for multiple objects in image generation, addressing limitations in existing methods through a branched network architecture, CNOCS map representation, and specialized training strategies.

Details

Motivation: Existing methods for controllable image generation lack comprehensive control over 9D poses of multiple objects simultaneously, suffering from limited controllability and degraded quality.

Method: Proposes SceneDesigner with branched network architecture, CNOCS map representation for 9D pose encoding, two-stage training with reinforcement learning, and Disentangled Object Sampling for inference.

Result: Significantly outperforms existing approaches in both controllability and quality, enabling accurate multi-object 9D pose manipulation with improved training efficiency and stability.

Conclusion: SceneDesigner provides an effective solution for comprehensive multi-object 9D pose control in image generation, with strong geometric interpretation and flexible customization capabilities.

Abstract: Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.

[172] V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You

Main category: cs.CV

TL;DR: V-ReasonBench is a benchmark for evaluating video reasoning across structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics using synthetic and real-world image sequences.

Details

Motivation: There's a growing need for systematic and reliable evaluation of video models' reasoning abilities as generative video models like Veo-3 show surprising zero-shot reasoning capabilities.

Method: Built from synthetic and real-world image sequences, the benchmark provides diverse answer-verifiable tasks that are reproducible, scalable, and unambiguous across four reasoning dimensions.

Result: Evaluations of six state-of-the-art video models reveal clear dimension-wise differences in reasoning abilities, with strong variation across structured, spatial, pattern-based, and physical reasoning.

Conclusion: V-ReasonBench offers a unified and reproducible framework for measuring video reasoning to support development of models with more reliable, human-aligned reasoning skills.

Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

[173] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Junhao Cheng, Liang Hou, Xin Tao, Jing Liao

Main category: cs.CV

TL;DR: VANS introduces Video-Next-Event Prediction (VNEP) as a new task that generates video responses instead of text for next-event prediction, using reinforcement learning to align a Vision-Language Model with a Video Diffusion Model.

Details

Motivation: Video's capacity to demonstrate physical-world information that is difficult to convey through language alone, extending video as a new answer modality for Next-Event Prediction to enable more intuitive and customized answers for procedural learning and creative exploration.

Method: VANS leverages reinforcement learning with Joint-GRPO to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM), optimizing the VLM to produce visualization-friendly captions and guiding the VDM to generate videos faithful to captions and input context.

Result: Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization.

Conclusion: VANS successfully addresses the VNEP task by orchestrating VLM and VDM as a unified system, enabling dynamic video responses for next-event prediction that are more intuitive than text-based answers.

Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video’s inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Codes are released in https://github.com/KlingTeam/VANS.

[174] Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

Main category: cs.CV

TL;DR: DualMindVLM enables visual language models to automatically switch between fast and slow thinking modes based on task difficulty, improving computational efficiency while maintaining performance.

Details

Motivation: Existing VLMs pursue lengthy reasoning chains for all tasks, leading to excessive computational costs, unlike human thinking that adapts to problem complexity.

Method: Two-stage RL approach: 1) Label data as fast/slow thinking based on model output length, 2) Train model using GRPO with thinking mode labels to develop dual-mode thinking.

Result: DualMindVLM significantly outperforms base model and achieves performance on par with state-of-the-art visual reasoning models while maintaining high token efficiency.

Conclusion: The proposed dual-mode thinking approach enables VLMs to efficiently allocate computational resources, achieving strong performance with reduced computational overhead.

Abstract: When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.

[175] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkat Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

Main category: cs.CV

TL;DR: EvoLMM is a self-evolving framework that improves large multimodal models’ reasoning capabilities through unsupervised learning using two cooperative agents (Proposer and Solver) that generate and solve image-grounded questions without human annotations.

Details

Motivation: To overcome limitations of existing LMM training pipelines that depend on human-curated data or external reward models, enabling autonomous and scalable improvement of reasoning capabilities.

Method: Uses two cooperative agents from a single backbone model: a Proposer generates diverse image-grounded questions, and a Solver solves them through internal consistency, with continuous self-rewarding feedback.

Result: Achieves consistent gains up to ~3% on multimodal math-reasoning benchmarks (ChartQA, MathVista, MathVision) using only raw training images, with Qwen2.5-VL as base model.

Conclusion: EvoLMM provides a simple yet effective baseline for self-improving LMMs in fully-unsupervised fashion, demonstrating the viability of autonomous reasoning capability enhancement.

Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains upto $\sim$3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.

[176] NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Jing Wen, Alexander G. Schwing, Shenlong Wang

Main category: cs.CV

TL;DR: NoPo-Avatar reconstructs animatable 3D human avatars from single/sparse images without requiring pose inputs, overcoming limitations of pose-dependent methods that degrade with noisy pose estimates.

Details

Motivation: Existing methods rely on accurate ground-truth camera and human poses for reconstruction, but performance degrades significantly with noisy pose estimates, limiting practical applicability.

Method: Proposes NoPo-Avatar that reconstructs avatars solely from images without any pose input, eliminating dependence on potentially noisy human pose estimates during test-time reconstruction.

Result: Outperforms existing baselines in practical settings without ground-truth poses and achieves comparable results in lab settings with ground-truth poses on THuman2.0, XHuman, and HuGe100K datasets.

Conclusion: NoPo-Avatar provides a more robust and widely applicable solution for 3D human avatar reconstruction by removing pose dependency, making it resilient to noisy pose estimates while maintaining performance.

Abstract: We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate “ground-truth” camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

[177] LSAP: Rethinking Inversion Fidelity, Perception and Editability in GAN Latent Space

Xuekun Zhao, Pu Cao, Xiaoya Yang, Mingjian Zhang, Lu Yang, Qing Song

Main category: cs.CV

TL;DR: The paper proposes a Latent Space Alignment Inversion Paradigm (LSAP) that addresses the challenge of improving both reconstruction fidelity and editability in image inversion by aligning inverted latent codes with the synthetic distribution.

Details

Motivation: Current two-stage image inversion methods improve reconstruction fidelity but leave perception and editability largely unchanged, as these properties depend heavily on the latent codes from the first embedding stage. The key challenge is obtaining latent codes that preserve fidelity while enhancing perception and editability.

Method: Proposes LSAP with two components: (1) Normalized Style Space (S^N space) and Normalized Style Space Cosine Distance (NSCD) to quantify disalignment of inversion methods, (2) a unified alignment framework that can be optimized for both encoder-based and optimization-based embeddings.

Result: Extensive experiments show NSCD effectively captures perceptual and editable characteristics, and the alignment paradigm achieves state-of-the-art performance in both stages of inversion across various domains.

Conclusion: The proposed LSAP provides a comprehensive solution that bridges the gap between reconstruction fidelity and editability in image inversion through latent space alignment, offering both evaluation metrics and optimization methods.

Abstract: As research on image inversion advances, the process is generally divided into two stages. The first step is Image Embedding, involves using an encoder or optimization procedure to embed an image and obtain its corresponding latent code. The second stage, referred to as Result Refinement, further improves the inversion and editing outcomes. Although this refinement stage substantially enhances reconstruction fidelity, perception and editability remain largely unchanged and are highly dependent on the latent codes derived from the first stage. Therefore, a key challenge lies in obtaining latent codes that preserve reconstruction fidelity while simultaneously improving perception and editability. In this work, we first reveal that these two properties are closely related to the degree of alignment (or disalignment) between the inverted latent codes and the synthetic distribution. Based on this insight, we propose the \textbf{ Latent Space Alignment Inversion Paradigm (LSAP)}, which integrates both an evaluation metric and a unified inversion solution. Specifically, we introduce the \textbf{Normalized Style Space ($\mathcal{S^N}$ space)} and \textbf{Normalized Style Space Cosine Distance (NSCD)} to quantify the disalignment of inversion methods. Moreover, our paradigm can be optimized for both encoder-based and optimization-based embeddings, providing a consistent alignment framework. Extensive experiments across various domains demonstrate that NSCD effectively captures perceptual and editable characteristics, and that our alignment paradigm achieves state-of-the-art performance in both stages of inversion.

[178] DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

Haokun Zhou, Yipeng Hong

Main category: cs.CV

TL;DR: LVLMs can somewhat distinguish AI vs human images but perform worse than humans and show rightward bias. A new automated benchmark construction method was developed and validated.

Details

Motivation: To evaluate LVLMs' ability to differentiate AI-generated from human-generated images and develop scalable automated benchmark construction methods.

Method: Compared LVLMs with humans using mixed AI/human image dataset. Developed automated benchmark construction process involving topic retrieval, narrative script generation, error embedding, and image generation.

Result: LVLMs could distinguish image types to some extent but exhibited rightward bias and performed significantly worse than humans. Automated benchmark construction method was successfully validated.

Conclusion: Study reveals LVLMs’ limitations in real-world understanding and advances benchmark construction techniques with scalable automated approaches for AI evaluation.

Abstract: This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and perform significantly worse compared to humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method through constructing two caparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.

[179] Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models

Kyungsung Lee, Donggyu Lee, Myungjoo Kang

Main category: cs.CV

TL;DR: SaFaRI is a spatial-and-frequency-aware diffusion model for image restoration that preserves data-fidelity in both spatial and frequency domains, achieving state-of-the-art performance on noisy inverse problems.

Details

Motivation: Existing diffusion-based image restoration methods only consider pixel-wise data-fidelity, but incorporating frequency domain information could enhance reconstruction quality.

Method: Proposed SaFaRI diffusion model that encourages data-fidelity in both spatial and frequency domains for image restoration with Gaussian noise.

Result: Achieves state-of-the-art performance on ImageNet and FFHQ datasets, outperforming existing zero-shot IR methods in LPIPS and FID metrics across inpainting, denoising, and super-resolution tasks.

Conclusion: Incorporating both spatial and frequency domain data-fidelity significantly improves image restoration quality in diffusion models.

Abstract: Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR, considers the pixel-wise data-fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data-fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet datasets and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics.

[180] Zero-Shot Video Translation via Token Warping

Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

Main category: cs.CV

TL;DR: TokenWarping is a novel framework for temporally coherent video translation that uses optical flow to warp query, key, and value patches in self-attention, improving both feature aggregation and temporal consistency without requiring additional training.

Details

Motivation: Current video models lag behind image models in visual quality and user control, and existing diffusion-based video editing methods sacrifice preservation of local/structural regions while overlooking the importance of query patches for temporal coherence.

Method: Extract optical flows from source videos and use them to warp previous frame’s query, key, and value patches during denoising process, aligning them with current frame’s patches to enhance feature aggregation and ensure temporal consistency.

Result: TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively across various video translation tasks, demonstrating superior temporal coherence and visual quality.

Conclusion: The framework provides temporally coherent video translation without additional training, can be integrated with existing text-to-image editing methods, and effectively addresses limitations of current approaches by leveraging complementary token priors through optical flow-based warping.

Abstract: With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame’s query, key, and value patches, aligning them with the current frame’s patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations are available in supplementary materials.

[181] Adaptive Query Prompting for Multi-Domain Landmark Detection

Yuhui Li, Qiusen Wei, Guoheng Huang, Xiaochen Yuan, Xuhang Chen, Guo Zhong, Jianwen Huang, Jiajie Huang

Main category: cs.CV

TL;DR: A universal transformer-based model for multi-domain medical landmark detection using Adaptive Query Prompting (AQP) and lightweight decoder, achieving SOTA performance across X-ray datasets.

Details

Motivation: Current deep learning methods for medical landmark detection are task-specific and lack generalizability across different anatomical regions and imaging modalities.

Method: Proposes Adaptive Query Prompting (AQP) with learnable prompts in a memory pool, keeping backbone frozen while optimizing prompts. Uses Light-MLD decoder for landmark extraction from features.

Result: Achieves state-of-the-art performance on three X-ray datasets for medical landmark detection tasks, outperforming complex frameworks with simpler design.

Conclusion: The proposed universal model with AQP enables efficient multi-domain landmark detection with minimal parameter tuning, showing strong potential for broader medical imaging applications.

Abstract: Medical landmark detection is crucial in various medical imaging modalities and procedures. Although deep learning-based methods have achieve promising performance, they are mostly designed for specific anatomical regions or tasks. In this work, we propose a universal model for multi-domain landmark detection by leveraging transformer architecture and developing a prompting component, named as Adaptive Query Prompting (AQP). Instead of embedding additional modules in the backbone network, we design a separate module to generate prompts that can be effectively extended to any other transformer network. In our proposed AQP, prompts are learnable parameters maintained in a memory space called prompt pool. The central idea is to keep the backbone frozen and then optimize prompts to instruct the model inference process. Furthermore, we employ a lightweight decoder to decode landmarks from the extracted features, namely Light-MLD. Thanks to the lightweight nature of the decoder and AQP, we can handle multiple datasets by sharing the backbone encoder and then only perform partial parameter tuning without incurring much additional cost. It has the potential to be extended to more landmark detection tasks. We conduct experiments on three widely used X-ray datasets for different medical landmark detection tasks. Our proposed Light-MLD coupled with AQP achieves SOTA performance on many metrics even without the use of elaborate structural designs or complex frameworks.

[182] IOR: Inversed Objects Replay for Incremental Object Detection

Zijia An, Boyu Diao, Libo Huang, Ruiqi Liu, Zhulin An, Yongjun Xu

Main category: cs.CV

TL;DR: Proposes Inversed Objects Replay (IOR) to address redundancy in incremental object detection by generating old-class samples through detector inversion instead of separate generative models, using augmented replay and high-value knowledge distillation.

Details

Motivation: Existing incremental object detection methods degrade when unlabeled old-class objects are absent from incremental data, and generation-based approaches suffer from redundancy in model training/storage and overproduction of samples.

Method: IOR generates old-class samples by inversing original detectors, uses augmented replay to reuse objects from generated samples, and applies high-value knowledge distillation focusing on old-class object positions overwhelmed by background.

Result: Extensive experiments on MS COCO 2017 show IOR efficiently improves detection performance in incremental object detection scenarios where old-class objects are absent.

Conclusion: IOR eliminates redundancy in generation-based incremental object detection by using detector inversion instead of separate generative models, achieving better performance without additional training/storage costs.

Abstract: Existing Incremental Object Detection (IOD) methods partially alleviate catastrophic forgetting when incrementally detecting new objects in real-world scenarios. However, many of these methods rely on the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the incremental data. When unlabeled old-class objects are absent, the performance of existing methods tends to degrade. The absence can be mitigated by generating old-class samples, but it incurs high costs. This paper argues that previous generation-based IOD suffers from redundancy, both in the use of generative models, which require additional training and storage, and in the overproduction of generated samples, many of which do not contribute significantly to performance improvements. To eliminate the redundancy, we propose Inversed Objects Replay (IOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We propose augmented replay to reuse the objects in generated samples, reducing redundant generations. Moreover, we propose high-value knowledge distillation focusing on the positions of old-class objects overwhelmed by the background, which transfers the knowledge to the incremental detector. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in IOD scenarios with the absence of old-class objects. The code is available at https://github.com/JiaJia075/IOR.

[183] Unsupervised learning of spatially varying regularization for diffeomorphic image registration

Junyu Chen, Shuwen Wei, Yihao Liu, Zhangxing Bian, Yufan He, Aaron Carass, Harrison Bai, Yong Du

Main category: cs.CV

TL;DR: A hierarchical probabilistic model for learning spatially varying deformation regularization in deep learning-based image registration, enabling automatic hyperparameter tuning and improved performance.

Details

Motivation: Most deep learning registration models use spatially invariant regularization that ignores anatomical variations, while traditional optimization-based methods successfully use spatially varying regularization for different anatomical regions.

Method: Propose a hierarchical probabilistic model with prior distribution on deformation regularization strength, enabling end-to-end learning of spatially varying regularizer. Integrates with various network architectures and uses Bayesian optimization for automatic hyperparameter tuning.

Result: Significantly improves registration performance on public datasets, enhances interpretability of deep learning registration, and maintains smooth deformations.

Conclusion: The proposed method successfully bridges the gap between traditional optimization-based and modern deep learning approaches by enabling spatially varying regularization in deep learning registration frameworks.

Abstract: Spatially varying regularization accommodates the deformation variations that may be necessary for different anatomical regions during deformable image registration. Historically, optimization-based registration models have harnessed spatially varying regularization to address anatomical subtleties. However, most modern deep learning-based models tend to gravitate towards spatially invariant regularization, wherein a homogenous regularization strength is applied across the entire image, potentially disregarding localized variations. In this paper, we propose a hierarchical probabilistic model that integrates a prior distribution on the deformation regularization strength, enabling the end-to-end learning of a spatially varying deformation regularizer directly from the data. The proposed method is straightforward to implement and easily integrates with various registration network architectures. Additionally, automatic tuning of hyperparameters is achieved through Bayesian optimization, allowing efficient identification of optimal hyperparameters for any given registration task. Comprehensive evaluations on publicly available datasets demonstrate that the proposed method significantly improves registration performance and enhances the interpretability of deep learning-based registration, all while maintaining smooth deformations.

[184] MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control

Mengting Wei, Tuomas Varanka, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao

Main category: cs.CV

TL;DR: MagicFace is a diffusion-based model for fine-grained facial expression editing by controlling Action Unit variations while preserving identity, pose, and background.

Details

Motivation: To enable continuous, interpretable facial expression editing of specific individuals while maintaining identity consistency and preserving facial attributes, pose, and background.

Method: Uses a diffusion model conditioned on AU variations with an ID encoder for identity preservation, leveraging pretrained Stable-Diffusion and self-attention mechanisms, plus an Attribute Controller for background and pose consistency.

Result: Achieves superior high-fidelity expression editing compared to other methods, enabling animation of arbitrary identities with various AU combinations.

Conclusion: MagicFace provides an effective solution for fine-grained, continuous facial expression editing with excellent identity preservation and attribute consistency.

Abstract: We address the problem of facial expression editing by controling the relative variation of facial action-unit (AU) from the same person. This enables us to edit this specific person’s expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.

Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani

Main category: cs.CV

TL;DR: CleverDistiller is a self-supervised cross-modal knowledge distillation framework that transfers 2D vision foundation model capabilities to 3D LiDAR models using simple design choices including direct feature similarity loss and occupancy prediction.

Details

Motivation: Existing methods for transferring 2D vision foundation model capabilities to 3D LiDAR models rely on complex distillation losses, pseudo-semantic maps, or are limited to semantic segmentation tasks. There's a need for a simpler, more general approach.

Method: Uses direct feature similarity loss with MLP projection head for semantic dependency learning, and adds auxiliary occupancy prediction task for 3D spatial reasoning. Does not require pseudo-semantic maps or explicit semantic supervision.

Result: Achieves state-of-the-art performance in semantic segmentation and 3D object detection, with up to 10% mIoU improvement, especially effective when fine-tuning on low data amounts.

Conclusion: The simple yet powerful knowledge distillation strategy effectively transfers generalization capabilities from 2D vision foundation models to 3D LiDAR models without complex design choices.

Abstract: Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine tuning on really low data amounts, showing the effectiveness of our simple yet powerful KD strategy

[186] Seeing Beyond Haze: Generative Nighttime Image Dehazing

Beibei Lin, Stephen Lin, Robby Tan

Main category: cs.CV

TL;DR: BeyondHaze is a generative nighttime dehazing method that reduces haze/glow effects and reconstructs background structures in heavily degraded regions using diffusion models and guided training.

Details

Motivation: Nighttime image dehazing is challenging due to dense haze and intense glow obscuring background information, with existing methods struggling due to insufficient background priors and limited generative capability.

Method: Uses image diffusion models adapted to nighttime dehazing for strong background priors, with guided training to enhance generative ability in obscured areas. Includes user control over generative level to balance realism and fidelity.

Result: Experiments on real-world nighttime images show BeyondHaze substantially improves visibility and scene detail under dense haze conditions.

Conclusion: The proposed generative approach effectively addresses nighttime dehazing challenges by combining diffusion models with task-specific training, enabling both haze removal and plausible background reconstruction.

Abstract: Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or entirely obscure background information. Existing methods often struggle due to insufficient background priors and limited generative capability, both of which are highly important under such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only reduces haze and glow effects but also reconstructs plausible background structures in regions where visual cues are heavily degraded. Our approach is built on two main ideas: obtaining strong background priors by adapting image diffusion models to nighttime dehazing, and enhancing generative ability in haze- and glow-obscured areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model while preserving its capacity to generate clean images. The diffusion model is further trained on tailored image pairs to improve its ability to recover background details that are suppressed by haze effects. Since generative models may introduce hallucinated content, we design our framework to allow user control over the generative level, enabling a balance between visual realism and fidelity. Experiments on real-world nighttime images demonstrate that BeyondHaze substantially improves visibility and scene detail under dense haze.

[187] Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

Seungyeon Cho, Tae-Kyun Kim

Main category: cs.CV

TL;DR: BHaRNet is a novel framework that enhances skeleton-based human action recognition by combining body-expert and hand-expert models with cross-attention mechanisms, achieving state-of-the-art performance on hand-intensive actions while maintaining computational efficiency.

Details

Motivation: Existing skeleton-based HAR methods focus mainly on full-body movements and overlook subtle hand motions that are crucial for distinguishing fine-grained actions. Unified graph representations often blur fine hand details due to disparities between body and hand action characteristics and feature loss during spatial-pooling.

Method: Proposes BHaRNet with two expert models (body-expert and hand-expert) trained jointly with ensemble loss for cooperative specialization. Uses cross-attention via expertized branch method and pooling-attention module for feature-level interactions and selective fusion. Also extends to multi-modal tasks using RGB information guided by body features.

Result: Achieves state-of-the-art accuracies on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, Northwestern-UCLA), improving hand-intensive actions from 86.4% to 93.0% accuracy while maintaining fewer GFLOPs and parameters than unified methods.

Conclusion: BHaRNet effectively addresses the limitations of existing methods by specializing in both body and hand motions through cooperative expert models and cross-attention mechanisms, demonstrating superior performance particularly for hand-intensive actions with computational efficiency.

Abstract: Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during the spatial-pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies – improving from 86.4% to 93.0% in hand-intensive actions – while maintaining fewer GFLOPs and parameters than the relevant unified methods.

[188] Structure-Aware Correspondence Learning for Relative Pose Estimation

Yihan Chen, Wenfei Yang, Huan Ren, Shifeng Zhang, Tianzhu Zhang, Feng Wu

Main category: cs.CV

TL;DR: Proposes Structure-Aware Correspondence Learning for relative pose estimation, using structure-aware keypoint extraction and correspondence estimation to handle small/no overlapping regions without explicit feature matching.

Details

Motivation: Existing 3D correspondence methods rely on explicit feature matching, which suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Humans can assemble object parts by considering structure, inspiring a structure-aware approach.

Method: Two key modules: 1) Structure-aware keypoint extraction with keypoint-based image reconstruction loss to locate representative keypoints; 2) Structure-aware correspondence estimation modeling intra-image and inter-image relationships between keypoints for correspondence estimation.

Result: Significantly outperforms prior methods on CO3D, Objaverse and LineMOD datasets, with 5.7° reduction in mean angular error on CO3D dataset.

Conclusion: The proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching, achieving precise relative pose estimation by leveraging object structure awareness.

Abstract: Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans’ ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of kepoints that can represent the structure of objects with different shapes and appearance, under the guidance of a keypoint based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with 5.7°reduction in mean angular error on the CO3D dataset.

[189] Human Motion Unlearning

Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso

Main category: cs.CV

TL;DR: The paper introduces human motion unlearning to prevent toxic animation synthesis while maintaining text-to-motion generation performance, proposes the first benchmark using HumanML3D and Motion-X datasets, adapts image unlearning techniques, and presents LCR - a training-free method that outperforms baselines.

Details

Motivation: To prevent synthesis of toxic animations from both explicit toxic prompts and implicit combinations of safe motions while preserving general text-to-motion generative capabilities.

Method: Proposes LCR (Latent Code Replacement), a training-free method suitable for discrete latent spaces of text-to-motion diffusion models, and adapts state-of-the-art image unlearning techniques for spatio-temporal signals.

Result: LCR consistently outperforms baselines both qualitatively and quantitatively on the proposed motion unlearning benchmark.

Conclusion: The paper successfully establishes the first motion unlearning benchmark and demonstrates that LCR is an effective training-free solution for removing toxic motions while maintaining general generation performance.

Abstract: We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., “kicking” is “loading and swinging a leg”). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines, by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.

[190] DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation

Karim Abou Zeid, Kadir Yilmaz, Daan de Geus, Alexander Hermans, David Adrian, Timm Linder, Bastian Leibe

Main category: cs.CV

TL;DR: DITR integrates 2D vision foundation model features into 3D point cloud segmentation by projecting 2D features to 3D and injecting them, achieving state-of-the-art results on indoor and outdoor benchmarks.

Details

Motivation: Vision foundation models provide high-quality 2D features but their potential in 3D scene segmentation remains largely untapped, with current 3D methods focusing primarily on 3D data alone.

Method: Extracts 2D foundation model features, projects them to 3D, and injects them into 3D point cloud segmentation models. Also proposes pretraining 3D models by distilling 2D foundation models when images are unavailable during inference.

Result: Achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. Initializing 3D backbone with knowledge from 2D VFMs boosts performance across various datasets.

Conclusion: DITR successfully bridges the gap between 2D vision foundation models and 3D segmentation, demonstrating that integrating 2D VFM features significantly enhances 3D scene understanding performance.

Abstract: Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D scene segmentation remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a generally applicable approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we additionally propose to pretrain 3D models by distilling 2D foundation models. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.

[191] Shape and Texture Recognition in Large Vision-Language Models

Sagi Eppel, Mor Bismut, Alona Faktor-Strugatski

Main category: cs.CV

TL;DR: LVLMs struggle with abstract shape recognition and 2D texture identification despite approaching human performance on 3D material recognition, revealing deficiencies in low-level visual feature extraction.

Details

Motivation: To evaluate how effectively leading vision-language models recognize basic visual building blocks like shapes and textures independently of orientation, context, or object associations.

Method: Created LAS&T dataset with 700K+ images for 2D/3D shape/texture recognition, tested models on matching identical shapes across variations and identifying textures/materials across different objects.

Result: VLMs significantly underperform humans on shape recognition and abstract 2D textures, but approach human performance on 3D material recognition. Models rely heavily on semantic features and struggle with abstract shapes.

Conclusion: Leading vision models have major deficiencies in extracting low-level visual features, while humans and specialized networks excel at these fundamental visual tasks.

Abstract: Shapes and textures are the basic building blocks of visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures and materials independently of their associated objects, is essential for a general visual understanding of the world. This work introduces the Large Shape and Textures dataset (LAS&T), a giant collection of highly diverse shapes and textures, created by unsupervised extraction of patterns from natural images. This dataset is used to benchmark how effectively leading Large Vision-Language Models (VLM) recognize and represent shapes, textures, and materials in 2D and 3D scenes. For shape recognition, we test the models’ ability to match images of identical shapes that differ in orientation, texture, color, or environment. Our results show that the shape recognition capabilities of the LVLMs remain significantly below human performance. VLMs rely predominantly on high-level and semantic features and struggle with abstract shapes lacking class associations. For texture and material recognition, we evaluated the models’ ability to identify images with identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler, more abstract 2D textures and shapes. These results are consistent across a wide range of leading LVLMs (GPT/Gemini/Qwen) and foundation vision models (DINO/CLIP), exposing major deficiencies in the ability of leading models to extract and represent low-level visual features. In contrast, humans and simple nets trained directly for these tasks achieve high accuracy. The LAS&T dataset, featuring over 700,000 images for 2D/3D shape, texture, and material recognition and retrieval is freely available.

[192] Event Stream Filtering via Probability Flux Estimation

Jinze Chen, Wei Zhai, Yang Cao, Bin Li, Zheng-Jun Zha

Main category: cs.CV

TL;DR: EDFilter is a real-time event denoising framework that models event generation as probability fluxes from irradiance diffusion, using kernel-based estimation and O(1) recursive solving to reconstruct continuous event density flow.

Details

Motivation: Event cameras capture brightness changes with microsecond precision but suffer from severe noise and signal inconsistencies. Existing filters ignore inter-event time information, producing sparse outputs that limit continuous irradiance reconstruction.

Method: Models event generation as threshold-crossing probability fluxes from stochastic irradiance diffusion. Uses nonparametric kernel-based probability flux estimation and O(1) recursive solver for continuous event density flow reconstruction.

Result: Achieves high-fidelity, physically interpretable event denoising and motion reconstruction. Also presents Rotary Event Dataset (RED) with microsecond-resolution ground-truth irradiance flow for evaluation.

Conclusion: EDFilter enables real-time processing and provides superior event quality by considering both polarity and inter-event time information, overcoming limitations of existing event filters.

Abstract: Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.

[193] Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI

Mahdi Alehdaghi, Rajarshi Bhattacharya, Pourya Shamsolmoali, Rafael M. O. Cruz, Maguelonne Heritier, Eric Granger

Main category: cs.CV

TL;DR: PCMNet is a part-prototypical concept mining network that learns human-comprehensible prototypes from image regions without supervision, providing structured concept-level explanations and improving robustness.

Details

Motivation: To address the limited interpretability of deep models and the shortcomings of existing methods like GradCAM (limited conceptual insight) and prototype-based approaches (rigid region selection, lack semantic consistency).

Method: Learns human-comprehensible prototypes from meaningful image regions without additional supervision, clusters prototypes into concept groups, and extracts concept activation vectors.

Result: Outperforms state-of-the-art methods in interpretability, stability, and robustness across multiple image classification benchmarks.

Conclusion: Contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems.

Abstract: As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.

[194] UINO-FSS: Unifying Representation Learning and Few-shot Segmentation via Hierarchical Distillation and Mamba-HyperCorrelation

Wei Zhuo, Zhiyue Tang, Wufeng Xue, Hao Ding, Junkai Ji, Linlin Shen

Main category: cs.CV

TL;DR: UINO-FSS is a unified few-shot semantic segmentation framework that integrates DINOv2 and SAM foundation models through coarse-to-fine multimodal distillation, achieving state-of-the-art performance on standard benchmarks.

Details

Motivation: To overcome limitations of dual-branch architectures in few-shot segmentation by creating a unified model that integrates knowledge from different foundation models, addressing the misalignment between class-agnostic segmentation and fine-grained discriminative representations.

Method: Uses a single-encoder architecture with bottleneck adapter for embedding alignment, meta-visual prompt generator with dense similarity volumes and semantic embeddings, and mask decoder. Employs hierarchical cross-model distillation to transfer SAM’s knowledge and Mamba-based 4D correlation mining on support-query pairs.

Result: Achieves new state-of-the-art results: mIoU of 80.6 (+3.8%) on PASCAL-5^i and 64.5 (+4.1%) on COCO-20^i under 1-shot setting.

Conclusion: The unified approach effectively integrates knowledge from different foundation models, demonstrating superior performance and flexibility compared to dual-branch architectures in few-shot semantic segmentation.

Abstract: Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class-agnostic segmentation capabilities and fine-grained discriminative representations. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM’s output embeddings. This consistency enables the integration of both models’ knowledge into a single-encoder architecture via coarse-to-fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we effectively transfer SAM’s knowledge into the segmenter, further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.

[195] A Decade of You Only Look Once (YOLO) for Object Detection: A Review

Leo Thomas Ramos, Angel D. Sappa

Main category: cs.CV

TL;DR: This paper provides a comprehensive 10-year review of the YOLO (You Only Look Once) framework, tracing its evolution from YOLOv1 to YOLOv13, analyzing architectural trends, applications, and future directions.

Details

Motivation: To mark the tenth anniversary of YOLO and provide a critical perspective on its evolution, impact, and ongoing development as one of the most influential real-time object detection frameworks.

Method: The paper presents a technical overview of all main YOLO versions, analyzes key architectural trends, surveys application areas, and addresses evaluation practices and ethical considerations.

Result: The review shows YOLO’s transformation from a streamlined detector into a diverse family of architectures with efficient design, modular scalability, and cross-domain adaptability over the past decade.

Conclusion: YOLO has demonstrated significant evolution and impact in real-time object detection, with continued potential for future development across various domains, though ethical considerations remain important.

Abstract: This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions (from YOLOv1 to YOLOv13), highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework’s continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO’s trajectory and ongoing transformation.

[196] VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

Mohammadreza Teymoorianfard, Siddarth Sitaraman, Shiqing Ma, Amir Houmansadr

Main category: cs.CV

TL;DR: VidStamp is a high-capacity watermarking framework for video diffusion models that embeds 48 bits per frame while maintaining visual quality and robustness to distortions, outperforming existing methods.

Details

Motivation: Address concerns about provenance, ownership, and integrity in AI-generated videos by developing watermarking that has sufficient capacity for meaningful metadata while remaining imperceptible and robust to manipulations.

Method: Fine-tune the decoder of latent video diffusion models in two stages: first with static images for spatial message separation, then with synthesized videos for temporal consistency. Supports dynamic watermarking via control signals.

Result: Embeds 48 bits per frame with preserved visual quality and robustness to distortions. Achieves lower log P-values and stronger detectability than VideoSeal, VideoShield, and RivaGAN. Enables temporal tamper localization with 0.96 accuracy.

Conclusion: VidStamp provides an effective solution for watermarking AI-generated videos with high capacity, minimal perceptual impact, and robust performance across multiple diffusion models.

Abstract: Video diffusion models can generate realistic and temporally consistent videos. This raises concerns about provenance, ownership, and integrity. Watermarking can help address these issues by embedding metadata directly into the content. To work well, a watermark needs enough capacity for meaningful metadata. It must also stay imperceptible and remain robust to common video manipulations. Existing methods struggle with limited capacity, extra inference cost, or reduced visual quality. We introduce VidStamp, a watermarking framework that embeds frame-level messages through the decoder of a latent video diffusion model. The decoder is fine-tuned in two stages. The first stage uses static image datasets to encourage spatial message separation. The second stage uses synthesized video sequences to restore temporal consistency. This approach enables high-capacity watermarks with minimal perceptual impact. VidStamp also supports dynamic watermarking through a control signal that selects message templates during inference. This adds flexibility and creates a second channel for communication. We evaluate VidStamp on Stable Video Diffusion (I2V), OpenSora, and Wan (T2V). The system embeds 48 bits per frame while preserving visual quality and staying robust to common distortions. Compared with VideoSeal, VideoShield, and RivaGAN, it achieves lower log P-values and stronger detectability. Its frame-wise watermarking design also enables precise temporal tamper localization, with an accuracy of 0.96, which exceeds the VideoShield baseline. Code: https://github.com/SPIN-UMass/VidStamp

[197] FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Dong Nie, Lei Sun, Xiangxiang Chu

Main category: cs.CV

TL;DR: FLUX-Text is a multilingual scene text editing method using DiT architecture that achieves high-quality text modification with 97% fewer training examples than existing methods.

Details

Motivation: Existing UNet-based diffusion models struggle with complex glyph structures in non-Latin languages like Chinese, Korean, and Japanese, requiring better glyph understanding and generation capabilities.

Method: Uses DiT-based architecture with lightweight Visual and Text Embedding Modules, Regional Text Perceptual Loss for text regions, and a two-stage training strategy to balance text editing and image quality.

Result: Achieves superior visual quality and text fidelity on multiple public datasets (English and Chinese benchmarks) while requiring only 0.1M training examples (97% reduction from 2.9M).

Conclusion: FLUX-Text provides an effective multilingual scene text editing solution with significantly reduced training requirements and improved performance for complex glyph structures.

Abstract: Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (\eg, Chinese, Korean, Japanese). To address these issues, we present \textbf{FLUX-Text}, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only $0.1$M training examples, a \textbf{97%} reduction compared to $2.9$M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.

[198] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen, Zhiliang Lyu, Dufan Wu, Ning Guo, Xiang Li, Quanzheng Li

Main category: cs.CV

TL;DR: Proposes Organ-Wise Tokenization (OWT) framework that disentangles medical images into organ-specific token groups, enabling interpretable representations and novel clinical applications without additional training.

Details

Motivation: Address limitations of holistic embeddings that entangle semantic components, which is problematic in medical imaging where anatomically interpretable features are crucial for downstream tasks.

Method: Organ-Wise Tokenization (OWT) framework with Token Group-based Reconstruction (TGR) training paradigm that explicitly disentangles images into separable token groups, each corresponding to distinct organs or semantic entities.

Result: Achieves strong performance on standard tasks (image reconstruction, segmentation) and enables novel clinical capabilities including organ-specific tumor identification, organ-level retrieval, and semantic-level generation without additional training.

Conclusion: OWT serves as a foundational framework for semantically disentangled representation learning, offering broad scalability and new perspectives on leveraging representations in medical imaging.

Abstract: Recent advances in representation learning often rely on holistic embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging, where downstream tasks depend on anatomically interpretable features. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while enabling fine-grained control for targeted clinical applications. Experiments on CT and MRI datasets demonstrate OWT’s power: it not only achieves strong performance on standard tasks like image reconstruction and segmentation, but also unlocks novel, high-impact clinical capabilities including organ-specific tumor identification, organ-level retrieval and semantic-level generation, without requiring any additional training. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and a new perspective on how representations can be leveraged.

[199] On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation

Liyao Tang, Zhe Chen, Dacheng Tao

Main category: cs.CV

TL;DR: GEM is a geometry-aware parameter-efficient fine-tuning method for 3D point cloud transformers that integrates local positional encodings with global attention, achieving full fine-tuning performance while updating only 1.6% parameters.

Details

Motivation: Existing PEFT methods perform poorly on 3D point cloud models due to geometric and spatial distribution shifts, and they treat points as orderless tokens while neglecting important local spatial structures and global geometric contexts.

Method: Introduces Geometric Encoding Mixer (GEM) that explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context.

Result: GEM achieves performance comparable to or sometimes exceeding full fine-tuning while only updating 1.6% of parameters, with significantly reduced training time and memory requirements.

Conclusion: GEM sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models.

Abstract: The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, would underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model’s parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.

[200] From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah

Main category: cs.CV

TL;DR: TF-CoVR is the first large-scale benchmark for temporally fine-grained composed video retrieval, focusing on gymnastics and diving with 180K triplets. The paper proposes TF-CoVR-Base, a two-stage training framework that improves zero-shot and fine-tuned performance on temporal video retrieval tasks.

Details

Motivation: Existing CoVR benchmarks don't adequately test the ability to capture subtle, fast-paced temporal differences in videos, limiting practical usefulness for real-world applications like sports-highlight generation.

Method: Proposed TF-CoVR-Base framework: (1) pre-train video encoder on fine-grained action classification for temporally discriminative embeddings, (2) align composed query with candidate videos using contrastive learning. Constructed dataset using LLM prompts based on label differences between clips from different videos.

Result: TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises state-of-the-art from 19.83 to 27.22 on the new TF-CoVR benchmark.

Conclusion: TF-CoVR addresses the gap in temporally fine-grained video retrieval and demonstrates significant performance improvements through specialized training approaches, enabling better handling of subtle temporal differences in video content.

Abstract: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from FineGym and FineDiving datasets. Previous CoVR benchmarks, focusing on temporal aspect, link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 27.22.

[201] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering

Jianhan Qi, Yuheng Jia, Hui Liu, Junhui Hou

Main category: cs.CV

TL;DR: Proposes SSGCO and EGAEL modules for hyperspectral image clustering that improve spectral feature extraction and adaptively refine superpixel graph edges, achieving significant accuracy improvements.

Details

Motivation: Existing GNN-based HSI clustering methods cannot fully exploit spectral information and suffer from inaccurate superpixel topological graphs that confuse class semantics during aggregation.

Method: Developed structural-spectral graph convolutional operator (SSGCO) for co-extracting spatial-spectral features, and evidence-guided adaptive edge learning (EGAEL) module to refine superpixel graph edges, integrated into contrastive learning framework.

Result: Achieved clustering accuracy improvements of 2.61%, 6.06%, 4.96% and 3.15% over best compared methods on four HSI datasets.

Conclusion: The proposed SSGCO and EGAEL modules effectively address spectral information underutilization and inaccurate superpixel graph issues in HSI clustering.

Abstract: Hyperspectral image (HSI) clustering groups pixels into clusters without labeled data, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.

[202] TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer

Yang Liu, Chuanchen Luo, Zimo Tang, Yingyan Li, Yuran Yang, Yuanyong Ning, Lue Fan, Junran Peng, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: TC-Light is a novel generative renderer for illumination and texture editing in videos that addresses temporal consistency and computation efficiency issues in existing methods.

Details

Motivation: Existing video relighting and world generation models are limited by domain constraints, temporal inconsistency, and computational inefficiency, especially for complex dynamic videos with long durations.

Method: Two-stage optimization: first optimizes appearance embedding for global illumination alignment using an inflated video relighting model, then optimizes a canonical video representation called Unique Video Tensor (UVT) for fine-grained texture and lighting alignment.

Result: Enables physically plausible re-rendering results with superior temporal coherence and low computation cost, validated through extensive experiments on a new long and highly dynamic video benchmark.

Conclusion: TC-Light overcomes limitations of existing methods by providing temporally consistent, computationally efficient illumination and texture editing for complex dynamic videos.

Abstract: Illumination and texture editing are critical dimensions for world-to-world transfer, which is valuable for applications including sim2real and real2real visual data scaling up for embodied AI. Existing techniques generatively re-render the input video to realize the transfer, such as video relighting models and conditioned world generation models. Nevertheless, these models are predominantly limited to the domain of training data (e.g., portrait) or fall into the bottleneck of temporal consistency and computation efficiency, especially when the input video involves complex dynamics and long durations. In this paper, we propose TC-Light, a novel generative renderer to overcome these problems. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible re-rendering results with superior temporal coherence and low computation cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.

[203] Active Measurement: Efficient Estimation at Scale

Max Hamilton, Jinlin Lai, Wenlong Zhao, Subhransu Maji, Daniel Sheldon

Main category: cs.CV

TL;DR: Active measurement is a human-in-the-loop AI framework that combines AI predictions with importance sampling of human labels to provide statistically guaranteed scientific measurements with reduced human effort.

Details

Motivation: Current AI workflows lack accuracy and statistical guarantees needed for scientific discovery, despite AI's potential to analyze large datasets efficiently.

Method: Uses an AI model to predict measurements, samples units for human labeling via importance sampling, iteratively improves the AI model with new labels, and refines unbiased Monte Carlo estimates of total measurements.

Result: Active measurement provides precise estimates even with imperfect AI models, requires minimal human effort when AI is accurate, and reduces estimation error compared to alternative methods in various measurement tasks.

Conclusion: The framework successfully combines AI efficiency with human expertise to deliver statistically guaranteed scientific measurements while optimizing human effort.

Abstract: AI has the potential to transform scientific discovery by analyzing vast datasets with little human effort. However, current workflows often do not provide the accuracy or statistical guarantees that are needed. We introduce active measurement, a human-in-the-loop AI framework for scientific measurement. An AI model is used to predict measurements for individual units, which are then sampled for human labeling using importance sampling. With each new set of human labels, the AI model is improved and an unbiased Monte Carlo estimate of the total measurement is refined. Active measurement can provide precise estimates even with an imperfect AI model, and requires little human effort when the AI model is very accurate. We derive novel estimators, weighting schemes, and confidence intervals, and show that active measurement reduces estimation error compared to alternatives in several measurement tasks.

[204] Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

N Dinesh Reddy, Dylan Snyder, Lona Kiragu, Mirajul Mohin, Shahrear Bin Amin, Sudeep Pillai

Main category: cs.CV

TL;DR: Orion is a visual agent that combines vision-based reasoning with tool-augmented execution for multi-step visual intelligence across images, video, and documents.

Details

Motivation: To overcome limitations of traditional vision-language models that only generate descriptive outputs, by enabling active, tool-driven visual intelligence that bridges neural perception with symbolic execution.

Method: Integrates specialized computer vision tools including object detection, keypoint localization, panoptic segmentation, OCR, and geometric analysis to orchestrate complex multi-step visual workflows.

Result: Achieves competitive performance across MMMU, MMBench, DocVQA, and MMLongBench benchmarks while extending monolithic VLM capabilities to production-grade visual intelligence.

Conclusion: Orion marks the transition from passive visual understanding to active, tool-driven visual intelligence through its agentic, tool-augmented approach.

Abstract: We introduce Orion, a visual agent that integrates vision-based reasoning with tool-augmented execution to achieve powerful, precise, multi-step visual intelligence across images, video, and documents. Unlike traditional vision-language models that generate descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance across MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic VLM capabilities to production-grade visual intelligence. Through its agentic, tool-augmented approach, Orion enables autonomous visual reasoning that bridges neural perception with symbolic execution, marking the transition from passive visual understanding to active, tool-driven visual intelligence. Try Orion for free at: https://chat.vlm.run Learn more at: https://www.vlm.run/orion

[205] Enhancing efficiency in paediatric brain tumour segmentation using a pathologically diverse single-center clinical dataset

A. Piffer, J. A. Buchner, A. G. Gennari, P. Grehten, S. Sirin, E. Ross, I. Ezhov, M. Rosier, J. C. Peeken, M. Piraud, B. Menze, A. Guerreiro Stücklin, A. Jakab, F. Kofler

Main category: cs.CV

TL;DR: Deep learning-based segmentation using 3D nnU-Net achieves robust performance for pediatric brain tumor segmentation, particularly for whole tumor and T2-hyperintensity regions, with results comparable to human variability.

Details

Motivation: Pediatric brain tumors are diverse and challenging to diagnose and treat. Deep learning segmentation could help with tumor delineation, but its performance across heterogeneous subtypes and MRI protocols is uncertain.

Method: Used retrospective cohort of 174 pediatric patients with various brain tumor types. Trained 3D nnU-Net on MRI sequences (T1, T1-C, T2, FLAIR) with manual annotations for four tumor subregions. Assessed performance using Dice similarity coefficient.

Result: Model achieved strong performance for whole tumor and T2-hyperintensity (mean DSC: 0.85), comparable to human variability. Moderate accuracy for enhancing tumor (0.75), poor for cystic component. Found that T1, T1-C, and T2 alone produced nearly equivalent results to full protocol.

Conclusion: Deep learning is feasible for pediatric brain tumor segmentation, especially for T2-hyperintensity and whole tumor. Challenges remain for enhancing tumor and cystic component segmentation. Supports potential for protocol simplification and workflow automation in pediatric neuro-oncology.

Abstract: Background Brain tumours are the most common solid malignancies in children, encompassing diverse histological, molecular subtypes and imaging features and outcomes. Paediatric brain tumours (PBTs), including high- and low-grade gliomas (HGG, LGG), medulloblastomas (MB), ependymomas, and rarer forms, pose diagnostic and therapeutic challenges. Deep learning (DL)-based segmentation offers promising tools for tumour delineation, yet its performance across heterogeneous PBT subtypes and MRI protocols remains uncertain. Methods A retrospective single-centre cohort of 174 paediatric patients with HGG, LGG, medulloblastomas (MB), ependymomas, and other rarer subtypes was used. MRI sequences included T1, T1 post-contrast (T1-C), T2, and FLAIR. Manual annotations were provided for four tumour subregions: whole tumour (WT), T2-hyperintensity (T2H), enhancing tumour (ET), and cystic component (CC). A 3D nnU-Net model was trained and tested (121/53 split), with segmentation performance assessed using the Dice similarity coefficient (DSC) and compared against intra- and inter-rater variability. Results The model achieved robust performance for WT and T2H (mean DSC: 0.85), comparable to human annotator variability (mean DSC: 0.86). ET segmentation was moderately accurate (mean DSC: 0.75), while CC performance was poor. Segmentation accuracy varied by tumour type, MRI sequence combination, and location. Notably, T1, T1-C, and T2 alone produced results nearly equivalent to the full protocol. Conclusions DL is feasible for PBTs, particularly for T2H and WT. Challenges remain for ET and CC segmentation, highlighting the need for further refinement. These findings support the potential for protocol simplification and automation to enhance volumetric assessment and streamline paediatric neuro-oncology workflows.

[206] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

Main category: cs.CV

TL;DR: Kandinsky 5.0 is a family of foundation models for high-resolution image and 10-second video synthesis, featuring three core models with different parameter sizes and capabilities, supported by comprehensive data curation and training techniques.

Details

Motivation: To advance the development and accessibility of high-quality generative models by creating a large-scale, publicly available framework that leverages extensive pre-training and quality-enhancement techniques for various generative applications.

Method: Multi-stage training pipeline with comprehensive data curation (collection, processing, filtering, clustering), extensive pre-training, and quality-enhancement techniques including self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training, plus novel architectural, training, and inference optimizations.

Result: Achieves high generation speeds and state-of-the-art performance across various tasks as demonstrated by human evaluation, with three specialized models: 6B parameter image generation, 2B parameter fast video generation, and 19B parameter superior quality video generation.

Conclusion: Kandinsky 5.0 represents a significant advancement in generative AI, providing an open-source framework that substantially improves the development and accessibility of high-quality generative models for the research community.

Abstract: This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

[207] One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion

Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin

Main category: cs.CV

TL;DR: OMFA is a unified diffusion framework for virtual try-on and try-off that works without exhibition garments or segmentation masks, supports arbitrary poses, and enables cross-person garment transfer using a single portrait and target garment.

Details

Motivation: Existing virtual try-on methods are limited by reliance on exhibition garments, segmentation masks, and fixed poses, reducing practicality for real-world scenarios where users want to transfer garments between people with different poses.

Method: OMFA uses a bidirectional diffusion framework inspired by language modeling, employing Tweedie’s formula for faithful distribution estimation and SMPL-X-based pose conditioning to support multi-view and arbitrary-pose try-on from a single image.

Result: Extensive experiments show OMFA achieves state-of-the-art results on both try-on and try-off tasks, demonstrating superior performance in flexible outfit combinations and cross-person garment transfer.

Conclusion: OMFA provides a practical, mask-free solution for virtual garment synthesis that better aligns with real-world usage scenarios by supporting arbitrary poses and eliminating the need for exhibition garments or segmentation masks.

Abstract: Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios - for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by language modeling, where generation is guided by conditioning prompts. However, our framework differs fundamentally from LLMs in two key aspects. First, it employs a bidirectional modeling paradigm that symmetrically allows prompting either from the garment to generate try-on results or from the dressed person to recover the try-off garment. Second, it strictly adheres to Tweedie’s formula, enabling faithful estimation of the underlying data distribution during the denoising process. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as input, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical solution for virtual garment synthesis.

[208] Training and Inference within 1 Second – Tackle Cross-Sensor Degradation of Real-World Pansharpening with Efficient Residual Feature Tailoring

Tianyu Xin, Jin-Liang Xiao, Zeyu Xia, Shan Yin, Liang-Jian Deng

Main category: cs.CV

TL;DR: A novel pansharpening method that uses modular decomposition and a Feature Tailor module to address cross-sensor degradation with sub-second training and inference, requiring no external data.

Details

Motivation: Deep learning pansharpening models pretrained on specific sensor data generalize poorly to other sensors, and existing cross-sensor methods are time-consuming or need extra training data.

Method: Modular decomposition of pansharpening models to identify critical interface, then integrating a Feature Tailor at this interface trained with physics-aware unsupervised losses in a patch-wise manner for efficiency.

Result: Achieves state-of-the-art quality and efficiency: 0.2 seconds for 512×512×8 images and 3 seconds for 4000×4000×8 images on RTX 3090 GPU, over 100x faster than zero-shot methods.

Conclusion: The method provides improved generalization ability for cross-sensor cases with extremely low generalization cost, enabling practical real-world deployment.

Abstract: Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing methods to tackle such cross-sensor degradation include retraining model or zero-shot methods, but they are highly time-consuming or even need extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. % may need revisement A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) $\textit{Improved Generalization Ability}$: it significantly enhance performance in cross-sensor cases. (2) $\textit{Low Generalization Cost}$: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on the real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, training and inference of $512\times512\times8$ image within $\textit{0.2 seconds}$ and $4000\times4000\times8$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU, which is over 100 times faster than zero-shot methods.

[209] CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao, Xiaobo Lu, Shuo Wang

Main category: cs.CV

TL;DR: CompTrack is a novel 3D single object tracking framework that addresses spatial and informational redundancy in LiDAR point clouds through foreground prediction and dynamic token compression, achieving real-time performance at 90 FPS.

Details

Motivation: Existing 3D trackers are limited by dual-redundancy challenges in point clouds: spatial redundancy from background noise impairs accuracy, and informational redundancy within foreground hinders efficiency.

Method: Proposes CompTrack with two key modules: Spatial Foreground Predictor (SFP) to filter background noise using information entropy, and Information Bottleneck-guided Dynamic Token Compression (IB-DTC) that uses online SVD analysis to compress redundant foreground into compact proxy tokens.

Result: Achieves top-performing tracking performance on KITTI, nuScenes and Waymo datasets with superior efficiency, running at real-time 90 FPS on a single RTX 3090 GPU.

Conclusion: CompTrack effectively eliminates both spatial and informational redundancy in point clouds, enabling high-performance real-time 3D object tracking.

Abstract: 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.

[210] Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu

Main category: cs.CV

TL;DR: V-PAE is a distillation framework that enables high-quality single-step video generation from large-scale video models, addressing sampling efficiency bottlenecks in video diffusion generation.

Details

Motivation: Video diffusion generation suffers from critical sampling efficiency bottlenecks, especially for large-scale models and long contexts. Existing acceleration methods lack single-step distillation capability for large video models and task generalization for conditional tasks.

Method: Two-phase process: (1) Stability priming - warm-up to align real and generated video distributions, improving adversarial distillation stability; (2) Unified adversarial equilibrium - flexible self-adversarial process reusing generator parameters for discriminator backbone, achieving co-evolutionary equilibrium in Gaussian noise space.

Result: Outperforms existing acceleration methods by average 5.8% in overall quality score on VBench-I2V, improves semantic alignment, temporal coherence, and frame quality. Reduces diffusion latency of large-scale video models (e.g., Wan2.1-I2V-14B) by 100x while preserving competitive performance.

Conclusion: V-PAE successfully bridges the gap for single-step distillation of large-scale video models and enables efficient high-quality video generation with significant speed improvements.

Abstract: Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, lack a single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process to align the distributions of real and generated videos. It improves the stability of single-step adversarial distillation in the following process. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone. It achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For the conditional tasks, we primarily preserve video-image subject consistency, which is caused by semantic degradation and conditional frame collapse during the distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, including semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of the large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times, while preserving competitive performance.

[211] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma

Main category: cs.CV

TL;DR: Medverse is a universal in-context learning model for 3D medical imaging that achieves high-fidelity predictions with global anatomical understanding across diverse tasks and anatomical regions.

Details

Motivation: Current ICL models for medical imaging cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and lack unified training across diverse medical imaging tasks and anatomical regions, limiting the potential of ICL in medical imaging.

Method: Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, and uses a blockwise cross-attention module for long-range interactions while maintaining computational efficiency through spatial sparsity. It’s trained on 22 datasets covering diverse tasks.

Result: Medverse substantially outperforms existing ICL baselines on held-out datasets covering unseen clinical centers, organs, species, and imaging modalities, establishing a novel paradigm for in-context learning.

Conclusion: Medverse presents a universal ICL model that enables high-fidelity, full-resolution volumetric outputs with multi-scale anatomical awareness across diverse medical imaging tasks, advancing the potential of in-context learning in medical imaging.

Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present \textbf{Medverse}, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model are publicly available at https://github.com/jiesihu/Medverse.

[212] End-to-End 4D Heart Mesh Recovery Across Full-Stack and Sparse Cardiac MRI

Yihong Chen, Jiancheng Yang, Deniz Sayin Mercadier, Hieu Le, Juerg Schwitter, Pascal Fua

Main category: cs.CV

TL;DR: TetHeart is the first end-to-end framework for unified 4D heart mesh recovery from both offline full-stack and intra-procedural sparse-slice CMR observations, enabling real-time cardiac motion reconstruction during interventions.

Details

Motivation: Existing cardiac motion reconstruction methods require complete CMR stacks, limiting their applicability during interventions when only sparse observations are available. There's a need for methods that work with both pre-procedural full data and intra-procedural sparse data.

Method: Uses deformable tetrahedra to capture shape and motion in a coherent space across cardiac structures. Key innovations include: attentive slice-adaptive 2D-3D feature assembly, distillation strategy for extreme sparsity, and weakly supervised motion learning requiring annotations only at keyframes.

Result: Achieves state-of-the-art accuracy and strong generalization across three large public datasets and additional private interventional datasets without retraining. Works effectively in both pre- and intra-procedural settings with sparse observations down to a single slice.

Conclusion: TetHeart provides a unified solution for cardiac motion reconstruction that bridges the gap between offline analysis and real-time intervention, enabling accurate heart mesh recovery from both complete and sparse CMR data.

Abstract: Reconstructing cardiac motion from CMR sequences is critical for diagnosis, prognosis, and intervention. Existing methods rely on complete CMR stacks to infer full heart motion, limiting their applicability during intervention when only sparse observations are available. We present TetHeart, the first end-to-end framework for unified 4D heart mesh recovery from both offline full-stack and intra-procedural sparse-slice observations. Our method leverages deformable tetrahedra to capture shape and motion in a coherent space shared across cardiac structures. Before a procedure, it initializes detailed, patient-specific heart meshes from high-quality full stacks, which can then be updated using whatever slices can be obtained in real-time, down to a single one during the procedure. TetHeart incorporates several key innovations: (i) an attentive slice-adaptive 2D-3D feature assembly mechanism that integrates information from arbitrary numbers of slices at any position; (ii) a distillation strategy to ensure accurate reconstruction under extreme sparsity; and (iii) a weakly supervised motion learning scheme requiring annotations only at keyframes, such as the end-diastolic and end-systolic phases. Trained and validated on three large public datasets and evaluated zero-shot on additional private interventional and public datasets without retraining, TetHeart achieves state-of-the-art accuracy and strong generalization in both pre- and intra-procedural settings.

[213] Localized Region Guidance for Class Activation Mapping in WSSS

Ali Torabi, Sanjog Gaihre, MD Mahbubur Rahman, Yaqoob Majeed

Main category: cs.CV

TL;DR: IG-CAM is a novel weakly supervised semantic segmentation method that uses instance guidance and influence functions to generate high-quality, boundary-aware localization maps, achieving state-of-the-art performance on PASCAL VOC 2012.

Details

Motivation: Existing WSSS methods struggle with precise object boundary localization and focus only on the most discriminative regions, limiting their segmentation quality.

Method: Proposes IG-CAM with three innovations: Instance-Guided Refinement using object proposals, Influence Function Integration to capture training sample relationships, and Multi-Scale Boundary Enhancement with progressive refinement.

Result: Achieves 82.3% mIoU on PASCAL VOC 2012 before post-processing and 86.6% after CRF refinement, significantly outperforming previous WSSS methods.

Conclusion: IG-CAM establishes a new benchmark for weakly supervised semantic segmentation, with extensive ablation studies validating each component’s contribution.

Abstract: Weakly Supervised Semantic Segmentation (WSSS) addresses the challenge of training segmentation models using only image-level annotations. Existing WSSS methods struggle with precise object boundary localization and focus only on the most discriminative regions. To address these challenges, we propose IG-CAM (Instance-Guided Class Activation Mapping), a novel approach that leverages instance-level cues and influence functions to generate high-quality, boundary-aware localization maps. Our method introduces three key innovations: (1) Instance-Guided Refinement using object proposals to guide CAM generation, ensuring complete object coverage; (2) Influence Function Integration that captures the relationship between training samples and model predictions; and (3) Multi-Scale Boundary Enhancement with progressive refinement strategies. IG-CAM achieves state-of-the-art performance on PASCAL VOC 2012 with 82.3% mIoU before post-processing, improving to 86.6% after CRF refinement, significantly outperforming previous WSSS methods. Extensive ablation studies validate each component’s contribution, establishing IG-CAM as a new benchmark for weakly supervised semantic segmentation.

[214] Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning

Zhihao He, Tianyao He, Yun Xu, Tieyuan Chen, Huabin Liu, Chaofan Gan, Zuxuan Wu, Weiyao Lin

Main category: cs.CV

TL;DR: Proposes a multi-video collaborative framework that uses spatio-temporal graphs to represent video knowledge and fuse information from related videos to enhance video language model reasoning.

Details

Motivation: Current video language models suffer from spatio-temporal incompleteness in individual videos, leading to hallucinations and inaccuracies. Using multiple related videos can improve reasoning but direct feeding of video data is inefficient due to redundant information.

Method: Three modules: Video Structuring Module converts videos to spatio-temporal graphs, Graph Fusion Module fuses knowledge from related videos into augmented graph nodes, and multi-video structured prompt integrates graph, visual, and textual tokens for LLM input.

Result: Extensive experiments show the framework effectively enhances video language model performance, demonstrating its potential for advancing video reasoning capabilities.

Conclusion: The proposed multi-video collaborative framework with structured video representation provides a promising solution to address spatio-temporal incompleteness in video language models.

Abstract: Despite the prosperity of the video language model, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video’s knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models. Code will be open-sourced at https://github.com/ziHoHe/SMV-CR.

[215] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement

Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen, Wenbo Li, Xiao Zhang, Yulun Zhang, Jian Chen

Main category: cs.CV

TL;DR: VividFace is an efficient one-step diffusion framework for video face enhancement that addresses computational inefficiency, facial texture modeling, and data limitations through flow matching, joint latent-pixel training, and a curated high-quality dataset.

Details

Motivation: Current video face enhancement methods face three key challenges: computational inefficiency from multi-step diffusion, difficulty in modeling facial textures while maintaining temporal consistency, and limited generalization due to lack of high-quality training data.

Method: VividFace uses a one-step flow matching paradigm based on pretrained WANX video generation model, joint latent-pixel face-focused training with spatiotemporally aligned facial masks, and an MLLM-driven automated filtering pipeline to create MLLM-Face90 dataset.

Result: Extensive experiments show VividFace achieves superior performance in perceptual quality, identity preservation, and temporal consistency across synthetic and real-world benchmarks.

Conclusion: VividFace effectively addresses key challenges in video face enhancement through efficient one-step diffusion, focused facial training, and high-quality dataset curation, with plans to release code, models, and dataset publicly.

Abstract: Video Face Enhancement (VFE) aims to restore high-quality facial regions from degraded video sequences, enabling a wide range of practical applications. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) computational inefficiency caused by iterative multi-step denoising in diffusion models; (2) faithfully modeling intricate facial textures while preserving temporal consistency; and (3) limited model generalization due to the lack of high-quality face video training data. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for VFE. Built upon the pretrained WANX video generation model, VividFace reformulates the traditional multi-step diffusion process as a single-step flow matching paradigm that directly maps degraded inputs to high-quality outputs with significantly reduced inference time. To enhance facial detail recovery, we introduce a Joint Latent-Pixel Face-Focused Training strategy that constructs spatiotemporally aligned facial masks to guide optimization toward critical facial regions in both latent and pixel spaces. Furthermore, we develop an MLLM-driven automated filtering pipeline that produces MLLM-Face90, a meticulously curated high-quality face video dataset, ensuring models learn from photorealistic facial textures. Extensive experiments demonstrate that VividFace achieves superior performance in perceptual quality, identity preservation, and temporal consistency across both synthetic and real-world benchmarks. We will publicly release our code, models, and dataset to support future research.

[216] Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI

Quang-Khai Bui-Tran, Minh-Toan Dinh, Thanh-Huy Nguyen, Ba-Thinh Lam, Mai-Anh Vu, Ulas Bagci

Main category: cs.CV

TL;DR: A label-efficient liver segmentation method for multi-phase MRI that uses foundation model adaptation and co-training to handle limited labeled data, unlabeled sequences, and vendor/modal variations without spatial registration.

Details

Motivation: Liver segmentation in multi-phase MRI is crucial for fibrosis assessment but faces challenges with scarce labeled data, uneven distribution across modalities/vendors, spatial misalignment, and missing phases in real-world clinical settings.

Method: Integrates foundation-scale 3D segmentation backbone with fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and standardized preprocessing pipeline without requiring spatial registration.

Result: The model demonstrates robust segmentation performance across both labeled and unlabeled domains, effectively generalizing across MRI phases and different vendor systems.

Conclusion: The approach shows effectiveness for label-efficient liver segmentation in multi-phase, multi-vendor MRI and highlights the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.

Abstract: Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results exhibit the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.

[217] DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization

Thanh-Huy Nguyen, Hoang-Thien Nguyen, Vi Vu, Ba-Thinh Lam, Phat Huynh, Tianyang Wang, Xingjian Li, Ulas Bagci, Min Xu

Main category: cs.CV

TL;DR: DuetMatch is a dual-branch semi-supervised framework for medical image segmentation that uses asynchronous optimization of encoder and decoder branches, with novel techniques for regularization, diversity enhancement, and noise reduction.

Details

Motivation: Limited annotated medical imaging data requires semi-supervised learning, but joint optimization in teacher-student frameworks can cause convergence and stability issues, especially in challenging scenarios.

Method: Proposes DuetMatch with dual-branch asynchronous optimization (encoder vs decoder), Decoupled Dropout Perturbation for regularization, Pair-wise CutMix Cross-Guidance for diversity, and Consistency Matching to reduce noisy pseudo-label bias.

Result: Extensive experiments on ISLES2022 and BraTS brain MRI datasets show DuetMatch consistently outperforms state-of-the-art methods across diverse semi-supervised segmentation scenarios.

Conclusion: DuetMatch demonstrates effectiveness and robustness for medical image segmentation in semi-supervised settings through its novel dual-branch architecture and noise-reduction techniques.

Abstract: The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.

[218] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun

Main category: cs.CV

TL;DR: Conan is a framework for evidence-grounded multi-step video reasoning that identifies context/evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further, achieving state-of-the-art performance on video reasoning benchmarks.

Details

Motivation: Existing RL-based methods for video reasoning often produce ungrounded or hallucinated conclusions using text-only chains, while frame-retrieval approaches struggle with inaccurate evidence localization.

Method: Developed Conan framework with: 1) Conan-91K dataset of automatically generated reasoning traces including frame identification, evidence reasoning, and action decisions; 2) Multi-stage progressive cold-start strategy with Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning.

Result: Conan surpasses baseline Qwen2.5-VL-7B-Instruct by over 10% accuracy on average across six multi-step reasoning benchmarks, achieving state-of-the-art performance. It also generalizes effectively to long video understanding tasks.

Conclusion: Conan demonstrates strong scalability and robustness for evidence-grounded multi-step video reasoning, effectively addressing limitations of previous approaches through its adaptive reasoning framework and comprehensive training methodology.

Abstract: Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.

[219] LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie

Main category: cs.CV

TL;DR: Efficient multimodal model fusion approach that combines specialized generation and understanding models using interleaved multimodal self-attention blocks, achieving strong performance with minimal training.

Details

Motivation: To create competitive multimodal systems more efficiently by fusing existing specialized models rather than training from scratch, reducing computational requirements.

Method: Retain original model blocks while interleaving multimodal self-attention blocks throughout networks, enabling double fusion mechanism that combines high-level semantic representations with low-level spatial signals.

Result: Achieved strong performance with only ~35B tokens: 0.91 on GenEval, 82.16 on DPG-Bench, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench across text-to-image generation and image editing tasks.

Conclusion: Strategic fusion of publicly available specialized models can achieve competitive multimodal performance efficiently, with full release of code, models, and datasets to support future research.

Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

[220] Fusion of Multi-scale Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis

Zhidong Yang, Xiuhui Shi, Wei Ba, Zhigang Song, Haijing Luan, Taiyuan Hu, Senlin Lin, Jiguang Wang, Shaohua Kevin Zhou, Rui Yan

Main category: cs.CV

TL;DR: FuseCPath is a framework for fusing multi-scale heterogeneous pathology foundation models to improve whole slide image analysis performance through multi-view clustering, cluster-level re-embedding, and collaborative distillation.

Details

Motivation: Current pathology foundation models exhibit substantial heterogeneity due to diverse training datasets and architectures, causing performance variability in downstream tasks. There's a need to effectively leverage multiple FMs' advantages.

Method: Proposes FuseCPath with three key components: (1) multi-view clustering to filter discriminative patches using multiple FMs’ embeddings, (2) cluster-level re-embedding for patch-level feature fusion, and (3) collaborative distillation for slide-level FM fusion.

Result: Extensive experiments show FuseCPath achieves state-of-the-art performance across multiple tasks on diverse datasets.

Conclusion: The proposed framework effectively fuses heterogeneous pathology foundation models and demonstrates superior ensemble performance in computational pathology tasks.

Abstract: Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathology foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level multi-scale features from WSIs. However, current pathology FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the features from different FMs in the downstream tasks. To fully explore the advantages of multiple FMs effectively, in this work, we propose a novel framework for the fusion of multi-scale heterogeneous pathology FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs’ embeddings. (ii) To effectively fuse the patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments demonstrate that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on diverse datasets.

[221] CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

Long Li, Shuichen Ji, Ziyang Luo, Nian Liu, Dingwen Zhang, Junwei Han

Main category: cs.CV

TL;DR: A unified framework using Chain-of-Thought reasoning in Vision-Language Models to handle three heterogeneous saliency tasks (SOD, CoSOD, SIS) through a two-stage training approach with novel confidence-guided optimization.

Details

Motivation: To address the operational heterogeneity across different saliency tasks by creating a unified framework that bridges task differences through reasoning processes rather than separate specialized models.

Method: Two-stage training: Supervised Fine-Tuning with output-to-reasoning data construction, followed by Reinforcement Learning with Confidence-Guided Policy Optimization that uses reward-confidence discrepancy as advantage signal.

Result: Achieves state-of-the-art performance across all three tasks, particularly excelling in CoSOD with 0.899 S-measure on CoCA (8.0% improvement over prior best), using significantly less training data than competitors.

Conclusion: The proposed unified framework successfully handles heterogeneous saliency tasks through CoT reasoning, with CGPO effectively addressing limitations of prior methods while achieving superior performance with reduced computational requirements.

Abstract: We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, eg, SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO’s key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an “output-to-reasoning” strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, especially achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.

[222] Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama

Main category: cs.CV

TL;DR: Otter introduces compound segmentation and temporal reconstruction with RWKV to improve wide-angle few-shot action recognition by emphasizing subjects and reconstructing temporal relations.

Details

Motivation: Wide-angle videos in FSAR are challenging due to background distractions and degraded temporal relations from similar backgrounds, requiring better subject emphasis and temporal modeling.

Method: Uses Compound Segmentation Module (CSM) to segment key patches and highlight subjects, and Temporal Reconstruction Module (TRM) with bidirectional scanning for temporal relation reconstruction, combining regular and temporal-enhanced prototypes.

Result: Achieves state-of-the-art performance on SSv2, Kinetics, UCF101, HMDB51 benchmarks, with superior performance on VideoBadminton dataset for wide-angle FSAR.

Conclusion: Otter effectively addresses background distractions and temporal degradation in wide-angle FSAR through subject emphasis and temporal reconstruction, demonstrating superior performance across multiple benchmarks.

Abstract: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.

[223] vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long

Main category: cs.CV

TL;DR: vMFCoOp is a framework that uses von Mises-Fisher distributions on a hyperspherical manifold to align semantic biases between LLMs and CLIP models, achieving robust biomedical prompting and superior few-shot classification across diverse medical datasets and imaging modalities.

Details

Motivation: Address semantic misalignment between LLMs and CLIP variants due to divergent training, lack scalability across evolving foundation models, and overcome limitations of Euclidean-space optimization that amplifies modality gaps in biomedical imaging.

Method: Inversely estimates von Mises-Fisher distributions on a shared Hyperspherical Manifold, aligns semantic biases via Unified Semantic Anchors, and applies three complementary constraints for robust biomedical prompting.

Result: Demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability.

Conclusion: Provides a scalable framework for biomedical prompt learning that addresses semantic misalignment and modality gaps, with plans to expand to more downstream applications and share resources publicly.

Abstract: Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.

[224] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding

Jinxuan Li, Yi Zhang, Jian-Fang Hu, Chaolei Tan, Tianming Liang, Beihao Xia

Main category: cs.CV

TL;DR: TubeRMC is a framework for weakly-supervised spatio-temporal video grounding that generates text-conditioned candidate tubes and refines them through tube-conditioned reconstruction with mutual constraints between spatial and temporal proposals.

Details

Motivation: To address limitations in existing weakly-supervised STVG methods that use simple late-fusion and generate tubes independent of text descriptions, leading to target identification failures and inconsistent tracking.

Method: Proposes Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) that generates text-conditioned candidate tubes using pre-trained visual grounding models and refines them via three reconstruction strategies (temporal, spatial, spatio-temporal) with mutual constraints between spatial and temporal proposals.

Result: Outperforms existing methods on VidSTG and HCSTVG benchmarks and effectively mitigates both target identification errors and inconsistent tracking according to visualization results.

Conclusion: TubeRMC successfully addresses the limitations of previous weakly-supervised STVG methods by incorporating text-conditioned tube generation and comprehensive reconstruction strategies with mutual constraints.

Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (\textbf{TubeRMC}) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refine them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.

[225] Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillation

Shuhan Ye, Yi Yu, Qixin Zhang, Chenqi Kong, Qiangqiang Wu, Kun Wang, Xudong Jiang

Main category: cs.CV

TL;DR: PACE is the first dataset distillation framework for SNNs and event-based vision, reducing training time by 50x and storage by 6000x while maintaining high accuracy.

Details

Motivation: SNNs are energy-efficient for event-based vision but costly to train due to temporal coding, limiting practical deployment.

Method: Uses ST-DSM for spatiotemporal matching of amplitude/phase via residual membrane potentials, and PEQ-N as a plug-and-play probabilistic integer quantizer.

Result: Achieves 84.4% accuracy on N-MNIST (85% of full training performance) with 50x faster training and 6000x storage reduction.

Conclusion: PACE enables minute-scale SNN training and efficient edge deployment through compact synthetic datasets.

Abstract: Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce \textbf{PACE} (Phase-Aligned Condensation for Events), the first dataset distillation framework to SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: \textbf{ST-DSM} and \textbf{PEQ-N}. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight through probabilistic integer quantizer compatible with standard event-frame pipelines. Across DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves (84.4%) accuracy, about (85%) of the full training set performance, while reducing training time by more than (50\times) and storage cost by (6000\times), yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.

[226] Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings

Zihao Lin, Zhenshan Shi, Sasa Zhao, Hanwei Zhu, Lingyu Zhu, Baoliang Chen, Lei Mo

Main category: cs.CV

TL;DR: A data-driven framework for automatic and interpretable creativity assessment from drawings that analyzes both content (what is drawn) and style (how it’s drawn) to predict creativity scores.

Details

Motivation: Current creativity assessment relies on subjective expert scoring which is labor-intensive and inconsistent. There's a need for automated, objective methods to assess human creativity through drawings.

Method: Proposes a conditional model that predicts content, style, and ratings simultaneously by augmenting existing datasets with content annotations and using conditional learning to adapt feature extraction based on creativity-relevant signals from stylistic and semantic cues.

Result: The model achieves state-of-the-art performance compared to existing regression-based approaches and provides interpretable visualizations that align well with human judgments.

Conclusion: The framework successfully automates creativity assessment from drawings by modeling both content and style dimensions, offering an objective and interpretable alternative to subjective expert scoring.

Abstract: Assessing human creativity through visual outputs, such as drawings, plays a critical role in fields including psychology, education, and cognitive science. However, current assessment practices still rely heavily on expert-based subjective scoring, which is both labor-intensive and inherently subjective. In this paper, we propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive evidence proposed in [6] that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions. Specifically, we first augment an existing creativity-labeled dataset with additional annotations targeting content categories. Based on the enriched dataset, we further propose a conditional model predicting content, style, and ratings simultaneously. In particular, the conditional learning mechanism that enables the model to adapt its visual feature extraction by dynamically tuning it to creativity-relevant signals conditioned on the drawing’s stylistic and semantic cues. Experimental results demonstrate that our model achieves state-of-the-art performance compared to existing regression-based approaches and offers interpretable visualizations that align well with human judgments. The code and annotations will be made publicly available at https://github.com/WonderOfU9/CSCA_PRCV_2025

[227] Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu

Main category: cs.CV

TL;DR: DTO-Humans: A novel method for generating scene-consistent multi-person human mesh pseudo-ground-truth using depth-conditioned optimization, enabling metric-aware human mesh recovery with improved depth reasoning.

Details

Motivation: Current multi-person human mesh recovery methods lack scene-level consistency due to single-person-centric pseudo-ground-truth generation, leading to conflicting depths and scales within the same image.

Method: Depth-conditioned Translation Optimization (DTO) jointly refines camera-space translations using anthropometric priors and monocular depth cues in a MAP framework. Also proposes Metric-Aware HMR with camera branch and relative metric loss.

Result: Created DTO-Humans dataset with 0.56M scene-consistent multi-person images (avg 4.8 persons per image). Achieved state-of-the-art performance on relative depth reasoning and human mesh recovery.

Conclusion: The proposed DTO method effectively addresses scene-level consistency in multi-person mesh recovery, and Metric-Aware HMR enables direct metric-scale estimation with improved performance.

Abstract: Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code is available at: https://github.com/gouba2333/MA-HMR.

[228] CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution

Xianming Gu, Lihui Wang, Ying Cao, Zeyu Deng, Yingfeng Ou, Guodong Hu, Yi Chen

Main category: cs.CV

TL;DR: A dual-prompt expert network with convolutional dictionary feature decoupling for multi-contrast MRI super-resolution, achieving superior detail reconstruction and strong generalization.

Details

Motivation: Multi-contrast MRI super-resolution aims to reconstruct high-resolution images using reference images from different contrasts, but contrast disparities between modalities make it challenging to effectively utilize reference textures, leading to suboptimal feature integration.

Method: Proposes CD-DPE strategy with iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, and dual-prompt feature fusion expert module (DP-FFEM) using frequency prompt for feature selection and adaptive routing prompt for optimal fusion.

Result: Extensive experiments on public multi-contrast MRI datasets show CD-DPE outperforms state-of-the-art methods in reconstructing fine details, and demonstrates strong generalization capabilities on unseen datasets.

Conclusion: The proposed CD-DPE method effectively addresses contrast disparities in multi-contrast MRI super-resolution, achieving superior reconstruction quality and generalization performance.

Abstract: Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrasts disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrated that CD-DPE exhibits strong generalization capabilities.

[229] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

Fan Yang, Quanting Xie, Atsunori Moteki, Shoichi Masui, Shan Jiang, Kanji Uchino, Yonatan Bisk, Graham Neubig

Main category: cs.CV

TL;DR: A new benchmark for long-term periodic human activities with 580 multimodal sequences, featuring three evaluation tasks and a training-free baseline method that outperforms existing approaches.

Details

Motivation: Long-term periodic workflows with low-contrast patterns are underexplored compared to short-term periodic activities, creating a gap in activity analysis research.

Method: Proposed a lightweight, training-free baseline for modeling diverse periodic workflow patterns and created a benchmark with 580 multimodal sequences supporting three evaluation tasks.

Result: The benchmark challenges existing methods, the baseline outperforms competitors significantly across all tasks, and shows deployment advantages comparable to supervised approaches without needing annotations.

Conclusion: The work successfully addresses the gap in long-term periodic workflow analysis and provides an effective training-free solution with practical deployment benefits.

Abstract: Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities – characterized by simple structures and high-contrast patterns – have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.

[230] BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI

Wasif Jalal, Md Nafiu Rahman, Atif Hasan Rahman, M. Sohel Rahman

Main category: cs.CV

TL;DR: BrainRotViT is a hybrid brain age estimation model combining vision transformers and residual CNNs that achieves state-of-the-art performance across diverse datasets and provides interpretable aging biomarkers.

Details

Motivation: Traditional brain age estimation methods face limitations including manual feature engineering, limited receptive fields, and overfitting. Pure transformers require large datasets and high computational costs, creating a need for efficient hybrid approaches.

Method: A hybrid architecture with ViT encoder pre-trained on auxiliary age/sex classification, then frozen and applied to sagittal slices to generate embedding matrix, followed by residual CNN regressor incorporating subject sex at final layer.

Result: Achieved MAE of 3.34 years (r=0.98) across 11 datasets from 130+ sites, with generalization MAE of 3.77-5.04 years on independent cohorts. Brain age gap associated with Alzheimer’s, cognitive impairment, and autism.

Conclusion: The method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging CNN and transformer approaches while enabling new aging and neurodegeneration research.

Abstract: Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $ρ=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer’s disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.

[231] FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI

Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz

Main category: cs.CV

TL;DR: FunnyNodules is a synthetic medical imaging dataset with controllable lung nodule attributes for systematic evaluation of explainable AI models’ reasoning capabilities.

Details

Motivation: Real medical datasets lack reasoning annotations needed to develop AI that makes correct predictions for the right reasons, similar to radiologists.

Method: Generate parameterized synthetic lung nodule-like shapes with controllable visual attributes (roundness, margin sharpness, spiculation) and predefined attribute-class decision rules.

Result: Enables model-agnostic evaluation of attribute-target relations, interpretation of performance in attribute prediction, and analysis of attention alignment with attribute regions.

Conclusion: FunnyNodules provides complete ground truth and customizable framework for developing, benchmarking, and analyzing explainable AI methods in medical imaging.

Abstract: Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.

[232] RoMa v2: Harder Better Faster Denser Feature Matching

Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, Michael Felsberg

Main category: cs.CV

TL;DR: A new dense feature matching model that achieves state-of-the-art accuracy through architectural improvements, optimized training pipeline, and leveraging foundation models like DINOv3.

Details

Motivation: Existing dense matchers fail in hard real-world scenarios, with high-precision models being too slow and limited in applicability.

Method: Novel matching architecture and loss, curated diverse training distribution, decoupled two-stage matching-then-refinement pipeline with custom CUDA kernel for memory efficiency, and integration of DINOv3 foundation model.

Result: Significantly more accurate than predecessors, sets new state-of-the-art in dense feature matching across extensive experiments.

Conclusion: The proposed model successfully addresses weaknesses of existing dense matchers through systematic improvements, achieving superior performance and practical applicability.

Abstract: Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2

cs.AI

[233] Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization

Ariel Kamen, Yakov Kamen

Main category: cs.AI

TL;DR: Ensemble framework using multiple LLMs for text categorization that improves F1-score by up to 65% over single models, achieving near human-expert performance.

Details

Motivation: Address weaknesses of individual LLMs including inconsistency, hallucination, category inflation, and misclassification in text categorization tasks.

Method: Mathematical model of collective decision-making with principled aggregation criteria, evaluating 10 state-of-the-art LLMs under zero-shot conditions on 8,660 human-annotated samples using IAB hierarchical taxonomy.

Result: Individual models plateau due to text compression into sparse categories, while ensemble LLM (eLLM) improves robustness and accuracy, achieving substantial performance gains.

Conclusion: eLLM offers scalable and reliable taxonomy-based classification solution that may significantly reduce dependence on human expert labeling.

Abstract: This study introduces an ensemble framework for unstructured text categorization using large language models (LLMs). By integrating multiple models, the ensemble large language model (eLLM) framework addresses common weaknesses of individual systems, including inconsistency, hallucination, category inflation, and misclassification. The eLLM approach yields a substantial performance improvement of up to 65% in F1-score over the strongest single model. We formalize the ensemble process through a mathematical model of collective decision-making and establish principled aggregation criteria. Using the Interactive Advertising Bureau (IAB) hierarchical taxonomy, we evaluate ten state-of-the-art LLMs under identical zero-shot conditions on a human-annotated corpus of 8{,}660 samples. Results show that individual models plateau in performance due to the compression of semantically rich text into sparse categorical representations, while eLLM improves both robustness and accuracy. With a diverse consortium of models, eLLM achieves near human-expert-level performance, offering a scalable and reliable solution for taxonomy-based classification that may significantly reduce dependence on human expert labeling.

[234] Graph-Memoized Reasoning: Foundations Structured Workflow Reuse in Intelligent Systems

Yash Raj Singh

Main category: cs.AI

TL;DR: Graph-Memoized Reasoning framework enables persistent storage and reuse of reasoning workflows as graph-structured memory to reduce computational redundancy in LLM systems.

Details

Motivation: Current LLM systems waste resources by recomputing similar reasoning steps across tasks, increasing latency and limiting reproducibility, highlighting the need for persistent reasoning mechanisms.

Method: Encode past decision graphs and retrieve them through structural and semantic similarity, enabling compositional reuse of reasoning subgraphs across new tasks.

Result: A formal framework with optimization objective that minimizes reasoning cost while regularizing inconsistency between stored and generated workflows.

Conclusion: Establishes groundwork for interpretable, cost-efficient, and self-improving reasoning architectures with persistent memory for large-scale agentic systems.

Abstract: Modern large language model-based reasoning systems frequently recompute similar reasoning steps across tasks, wasting computational resources, inflating inference latency, and limiting reproducibility. These inefficiencies underscore the need for persistent reasoning mechanisms that can recall and reuse prior computational traces. We introduce Graph-Memoized Reasoning, a formal framework for representing, storing, and reusing reasoning workflows as graph-structured memory. By encoding past decision graphs and retrieving them through structural and semantic similarity, our approach enables compositional reuse of subgraphs across new reasoning tasks. We formulate an optimization objective that minimizes total reasoning cost regularized by inconsistency between stored and generated workflows, providing a theoretical foundation for efficiency-consistency trade-offs in intelligent systems. We outline a conceptual evaluation protocol aligned with the proposed optimization objective. This framework establishes the groundwork for interpretable, cost-efficient, and self-improving reasoning architectures, offering a step toward persistent memory in large-scale agentic systems.

[235] How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI

Bo Wen, Chen Wang, Erhan Bilal

Main category: cs.AI

TL;DR: The paper investigates how different input modalities (text vs. images) affect perception and reasoning in ARC-AGI tasks, finding that structured text captures precise coordinates while images preserve 2D shapes, and combining modalities improves performance.

Details

Motivation: To understand how input modality shapes model perception in ARC-AGI tasks and separate instruction errors from execution errors, since current systems lack principled accounts of how encodings affect perception.

Method: Used a two-stage reasoning pipeline with weighted set-disagreement metric to isolate perception from reasoning across nine text and image modalities, testing how modality affects feature perception.

Result: Structured text yields precise coordinates on sparse features, images capture 2D shapes but are resolution-sensitive, and combining modalities improves execution (about 8 perception points; about 0.20 median similarity).

Conclusion: Aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and reliable execution without changing the underlying model.

Abstract: ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks – text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing – thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.

[236] Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art

Noah Bissell, Ethan Paley, Joshua Harrison, Juliano Calil, Myungin Lee

Main category: cs.AI

TL;DR: Sensorium Arc is an interactive AI system that personifies the ocean as a poetic speaker, enabling natural conversations about marine data through a multi-agent LLM framework that triggers dynamic visualizations based on dialogue cues.

Details

Motivation: To reimagine ocean data as a living narrative rather than abstract datasets, enabling affective and intuitive access to complex environmental information through conversational AI.

Method: Built on a modular multi-agent system and retrieval-augmented LLM framework with keyword detection and semantic parsing to dynamically trigger data visualizations and audiovisual content based on time, location, and thematic dialogue cues.

Result: Developed a real-time multimodal interactive system that generates responses blending scientific insight with ecological poetics, allowing users to explore marine data through natural spoken conversations with an AI agent embodying the ocean’s perspective.

Conclusion: Demonstrates the potential of conversational AI agents to mediate human-machine-ecosystem interactions and proposes a new paradigm for accessing high-dimensional environmental data through affective, intuitive interfaces.

Abstract: Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embodies the ocean’s perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem.

[237] MACIE: Multi-Agent Causal Intelligence Explainer for Collective Behavior Understanding

Abraham Itzhak Weinberg

Main category: cs.AI

TL;DR: MACIE is a framework that explains multi-agent reinforcement learning systems using causal models, counterfactuals, and Shapley values to attribute outcomes, quantify emergent behavior, and generate natural language explanations.

Details

Motivation: As MARL systems are used in safety-critical applications, understanding why agents make decisions and how they achieve collective behavior is crucial. Existing explainable AI methods struggle in multi-agent settings.

Method: Combines structural causal models, interventional counterfactuals, and Shapley values to provide comprehensive explanations addressing agent contributions, system-level emergent intelligence, and actionable natural language narratives.

Result: Accurate outcome attribution (mean phi_i = 5.07, std < 0.05), detection of positive emergence in cooperative tasks (synergy index up to 0.461), and efficient computation (0.79 seconds per dataset on CPU) across four MARL scenarios.

Conclusion: MACIE uniquely combines causal rigor, emergence quantification, and multi-agent support while remaining practical for real-time use, representing a step toward interpretable, trustworthy, and accountable multi-agent AI.

Abstract: As Multi Agent Reinforcement Learning systems are used in safety critical applications. Understanding why agents make decisions and how they achieve collective behavior is crucial. Existing explainable AI methods struggle in multi agent settings. They fail to attribute collective outcomes to individuals, quantify emergent behaviors, or capture complex interactions. We present MACIE Multi Agent Causal Intelligence Explainer, a framework combining structural causal models, interventional counterfactuals, and Shapley values to provide comprehensive explanations. MACIE addresses three questions. First, each agent’s causal contribution using interventional attribution scores. Second, system level emergent intelligence through synergy metrics separating collective effects from individual contributions. Third, actionable explanations using natural language narratives synthesizing causal insights. We evaluate MACIE across four MARL scenarios: cooperative, competitive, and mixed motive. Results show accurate outcome attribution, mean phi_i equals 5.07, standard deviation less than 0.05, detection of positive emergence in cooperative tasks, synergy index up to 0.461, and efficient computation, 0.79 seconds per dataset on CPU. MACIE uniquely combines causal rigor, emergence quantification, and multi agent support while remaining practical for real time use. This represents a step toward interpretable, trustworthy, and accountable multi agent AI.

[238] Build AI Assistants using Large Language Models and Agents to Enhance the Engineering Education of Biomechanics

Hanzhi Yan, Qin Lu, Xianqiao Wang, Xiaoming Zhai, Tianming Liu, He Li

Main category: cs.AI

TL;DR: This paper proposes using LLMs with Retrieval-Augmented Generation (RAG) and Multi-Agent Systems (MAS) to create education assistants for biomechanics courses, addressing LLMs’ limitations in domain-specific knowledge and complex reasoning tasks.

Details

Motivation: LLMs struggle with domain-specific applications due to knowledge gaps and decline in performance for complex multi-step reasoning problems, particularly in specialized engineering courses like biomechanics.

Method: Developed a dual-module framework: 1) RAG for improving specificity and logical consistency in conceptual true/false questions, 2) MAS for solving calculation-oriented problems requiring multi-step reasoning and code execution. Evaluated Qwen-1.0-32B, Qwen-2.5-32B, and Llama-70B on a biomechanics dataset.

Result: RAG significantly enhanced LLM performance and stability in answering conceptual questions, surpassing vanilla models. MAS successfully performed multi-step reasoning, equation derivation, code execution, and generated explainable solutions for calculation tasks.

Conclusion: RAG and MAS show strong potential for enhancing LLM performance in specialized engineering courses, providing a promising direction for developing intelligent tutoring systems in engineering education.

Abstract: While large language models (LLMs) have demonstrated remarkable versatility across a wide range of general tasks, their effectiveness often diminishes in domain-specific applications due to inherent knowledge gaps. Moreover, their performance typically declines when addressing complex problems that require multi-step reasoning and analysis. In response to these challenges, we propose leveraging both LLMs and AI agents to develop education assistants aimed at enhancing undergraduate learning in biomechanics courses that focus on analyzing the force and moment in the musculoskeletal system of the human body. To achieve our goal, we construct a dual-module framework to enhance LLM performance in biomechanics educational tasks: 1) we apply Retrieval-Augmented Generation (RAG) to improve the specificity and logical consistency of LLM’s responses to the conceptual true/false questions; 2) we build a Multi-Agent System (MAS) to solve calculation-oriented problems involving multi-step reasoning and code execution. Specifically, we evaluate the performance of several LLMs, i.e., Qwen-1.0-32B, Qwen-2.5-32B, and Llama-70B, on a biomechanics dataset comprising 100 true/false conceptual questions and problems requiring equation derivation and calculation. Our results demonstrate that RAG significantly enhances the performance and stability of LLMs in answering conceptual questions, surpassing those of vanilla models. On the other hand, the MAS constructed using multiple LLMs demonstrates its ability to perform multi-step reasoning, derive equations, execute code, and generate explainable solutions for tasks that require calculation. These findings demonstrate the potential of applying RAG and MAS to enhance LLM performance for specialized courses in engineering curricula, providing a promising direction for developing intelligent tutoring in engineering education.

[239] ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset

Chen Yang, Ran Le, Yun Xing, Zhenwei An, Zongchao Chen, Wayne Xin Zhao, Yang Song, Tao Zhang

Main category: cs.AI

TL;DR: ToolMind is a large-scale, high-quality tool-agentic dataset with 160k synthetic and 200k augmented instances, featuring turn-level validation to prevent error propagation and improve LLM agent training.

Details

Motivation: The scarcity of high-quality trajectories hinders LLM agent development, and existing validation methods overlook turn-level errors that degrade model performance during training.

Method: Construct function graphs based on parameter correlations, use multi-agent framework for realistic interactions, and employ fine-grained turn-level filtering to remove erroneous steps while preserving self-corrective reasoning.

Result: Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.

Conclusion: ToolMind’s turn-level validation approach effectively mitigates error amplification during training while preserving essential reasoning signals for robust tool-use learning.

Abstract: Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real-world problems using external tools. However, the scarcity of high-quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open-source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi-agent framework to simulate realistic user-assistant-tool interactions. Beyond trajectory-level validation, we employ fine-grained turn-level filtering to remove erroneous or suboptimal steps, ensuring that only high-quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self-corrective reasoning signals essential for robust tool-use learning. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.

[240] Chain of Summaries: Summarization Through Iterative Questioning

William Brach, Lukas Galke Poech

Main category: cs.AI

TL;DR: CoS generates information-dense summaries of web content using iterative refinement inspired by Hegel’s dialectical method, improving LLM performance while reducing token usage.

Details

Motivation: Web content is often in LLM-unfriendly formats and exceeds context length limits, making it difficult for LLMs to effectively use external web information.

Method: Chain of Summaries (CoS) iteratively refines initial summaries through questioning to identify limitations, creating general-purpose summaries that anticipate future information needs.

Result: CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods by up to 27% on TriviaQA, TruthfulQA, and SQUAD datasets, with higher Q&A performance using fewer tokens.

Conclusion: CoS provides an effective solution for making web content more accessible to LLMs while maintaining human oversight capabilities, serving as plain-text repositories of web information.

Abstract: Large Language Models (LLMs) are increasingly using external web content. However, much of this content is not easily digestible by LLMs due to LLM-unfriendly formats and limitations of context length. To address this issue, we propose a method for generating general-purpose, information-dense summaries that act as plain-text repositories of web content. Inspired by Hegel’s dialectical method, our approach, denoted as Chain of Summaries (CoS), iteratively refines an initial summary (thesis) by identifying its limitations through questioning (antithesis), leading to a general-purpose summary (synthesis) that can satisfy current and anticipate future information needs. Experiments on the TriviaQA, TruthfulQA, and SQUAD datasets demonstrate that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher Q&A performance compared to the source content, while requiring substantially fewer tokens and being agnostic to the specific downstream LLM. CoS thus resembles an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.

[241] Step-Audio-R1 Technical Report

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

Main category: cs.AI

TL;DR: Step-Audio-R1 is the first audio reasoning model that successfully enables reasoning in audio domains through Modality-Grounded Reasoning Distillation, outperforming Gemini 2.5 Pro and matching Gemini 3 Pro on audio understanding benchmarks.

Details

Motivation: Audio language models consistently perform better with minimal reasoning, raising the question of whether audio intelligence can truly benefit from deliberate thinking. The authors aim to unlock reasoning capabilities in the audio domain.

Method: Proposed Modality-Grounded Reasoning Distillation (MGRD) framework that teaches the model to generate audio-relevant reasoning chains grounded in acoustic features rather than hallucinating disconnected deliberations.

Result: Step-Audio-R1 exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music.

Conclusion: Reasoning is a transferable capability across modalities when appropriately anchored. Step-Audio-R1 transforms extended deliberation from a liability into a powerful asset for audio intelligence and opens pathways toward building truly multimodal reasoning systems.

Abstract: Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

[242] Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models

Islem Sahraoui

Main category: cs.AI

TL;DR: Multimodal AI framework combining text and image analysis for construction safety hazard identification, evaluated through two case studies using LLMs and VLMs on OSHA reports and construction site imagery.

Details

Motivation: Construction safety data exists in multiple formats (text reports, images) making hazard synthesis challenging with traditional approaches. Need for automated multimodal analysis to improve safety monitoring.

Method: Two case studies: 1) Hybrid pipeline using GPT-4o models to extract insights from 28,000 OSHA accident reports; 2) Evaluation of lightweight open-source VLMs (Molmo 7B, Qwen2 VL 2B) on ConstructionSite10k dataset for rule-level safety violation detection using natural language prompts.

Result: Smaller VLMs showed competitive performance in certain prompt configurations despite their reduced size, demonstrating feasibility of low-resource multimodal systems for safety monitoring.

Conclusion: Multimodal AI frameworks combining text and image analysis are viable for construction safety hazard identification, with lightweight open-source models offering cost-effective alternatives to proprietary systems.

Abstract: This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. In safety-critical environments such as construction sites, accident data often exists in multiple formats, such as written reports, inspection records, and site imagery, making it challenging to synthesize hazards using traditional approaches. To address this, this thesis proposed a multimodal AI framework that combines text and image analysis to assist in identifying safety hazards on construction sites. Two case studies were consucted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard identification.The first case study introduces a hybrid pipeline that utilizes GPT 4o and GPT 4o mini to extract structured insights from a dataset of 28,000 OSHA accident reports (2000-2025). The second case study extends this investigation using Molmo 7B and Qwen2 VL 2B, lightweight, open-source VLMs. Using the public ConstructionSite10k dataset, the performance of the two models was evaluated on rule-level safety violation detection using natural language prompts. This experiment served as a cost-aware benchmark against proprietary models and allowed testing at scale with ground-truth labels. Despite their smaller size, Molmo 7B and Quen2 VL 2B showed competitive performance in certain prompt configurations, reinforcing the feasibility of low-resource multimodal systems for rule-aware safety monitoring.

[243] Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao

Main category: cs.AI

TL;DR: This survey introduces a cognitive-based taxonomy for spatial reasoning in MLLMs, organizing tasks by reasoning complexity rather than input modality, and analyzes methods for improving spatial abilities.

Details

Motivation: Spatial reasoning is fundamental to human intelligence but remains challenging for MLLMs. Existing surveys focus on input modalities, but spatial ability is not solely determined by input format.

Method: Introduces a taxonomy organizing spatial intelligence from cognitive aspects, divides tasks by reasoning complexity, maps benchmarks across text, vision-language, and embodied settings, and analyzes training-based and reasoning-based improvement methods.

Result: Enables principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. Clarifies strengths and complementary mechanisms of different improvement approaches.

Conclusion: Provides researchers with comprehensive understanding of spatial reasoning in MLLMs and actionable directions for future research through cognitive perspective analysis.

Abstract: Spatial reasoning, which requires ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for Multimodal large language models (MLLMs). While existing surveys often categorize recent progress based on input modality (e.g., text, image, video, or 3D), we argue that spatial ability is not solely determined by the input format. Instead, our survey introduces a taxonomy that organizes spatial intelligence from cognitive aspect and divides tasks in terms of reasoning complexity, linking them to several cognitive functions. We map existing benchmarks across text only, vision language, and embodied settings onto this taxonomy, and review evaluation metrics and methodologies for assessing spatial reasoning ability. This cognitive perspective enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. In addition, we analyze methods for improving spatial ability, spanning both training-based and reasoning-based approaches. This dual perspective analysis clarifies their respective strengths, uncovers complementary mechanisms. By surveying tasks, benchmarks, and recent advances, we aim to provide new researchers with a comprehensive understanding of the field and actionable directions for future research.

Hyo-Jeong Jang

Main category: cs.AI

TL;DR: This thesis proposes uncertainty-resilient multimodal learning using consistency-guided cross-modal transfer to handle noisy data, low-quality labels, and heterogeneous modalities in human-computer interaction settings.

Details

Motivation: Multimodal learning faces uncertainty from noisy data, poor labels, and varying modality characteristics, especially critical in human-computer interaction where data quality and annotation consistency vary across users and recording conditions.

Method: Uses cross-modal semantic consistency for robust representation learning by projecting heterogeneous modalities into a shared latent space, mitigating modality gaps and uncovering structural relations for uncertainty estimation and stable feature learning.

Result: Experiments on multimodal affect-recognition benchmarks show significant improvements in model stability, discriminative ability, and robustness to noisy/incomplete supervision. Latent space analyses reveal reliable cross-modal structure under challenging conditions.

Conclusion: Provides a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, offering practical insights for developing reliable brain-computer interface systems.

Abstract: Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large, high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain-computer interface systems.

[245] Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response

Philip Drammeh

Main category: cs.AI

TL;DR: Multi-agent orchestration achieves 100% actionable recommendations vs 1.7% for single-agent LLMs in incident response, with zero quality variance across 348 trials.

Details

Motivation: Single-agent LLMs generate vague, unusable recommendations for incident response, limiting their production deployment despite their potential to accelerate response times.

Method: MyAntFarm.ai - a reproducible containerized framework comparing single-agent copilot vs multi-agent systems on identical incident scenarios through 348 controlled trials, introducing Decision Quality (DQ) metric.

Result: Multi-agent orchestration achieves 100% actionable recommendation rate (vs 1.7% single-agent), 80x improvement in action specificity, 140x improvement in solution correctness, and zero quality variance across all trials.

Conclusion: Multi-agent orchestration transforms from performance optimization to production-readiness requirement for LLM-based incident response, enabling SLA commitments impossible with inconsistent single-agent outputs.

Abstract: Large language models (LLMs) promise to accelerate incident response in production systems, yet single-agent approaches generate vague, unusable recommendations. We present MyAntFarm.ai, a reproducible containerized framework demonstrating that multi-agent orchestration fundamentally transforms LLM-based incident response quality. Through 348 controlled trials comparing single-agent copilot versus multi-agent systems on identical incident scenarios, we find that multi-agent orchestration achieves 100% actionable recommendation rate versus 1.7% for single-agent approaches, an 80 times improvement in action specificity and 140 times improvement in solution correctness. Critically, multi-agent systems exhibit zero quality variance across all trials, enabling production SLA commitments impossible with inconsistent single-agent outputs. Both architectures achieve similar comprehension latency (approx.40s), establishing that the architectural value lies in deterministic quality, not speed. We introduce Decision Quality (DQ), a novel metric capturing validity, specificity, and correctness properties essential for operational deployment that existing LLM metrics do not address. These findings reframe multi-agent orchestration from a performance optimization to a production-readiness requirement for LLM-based incident response. All code, Docker configurations, and trial data are publicly available for reproduction.

[246] Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems

Dany Moshkovich, Sergey Zeltyn

Main category: cs.AI

TL;DR: AgentOps is a comprehensive framework for observing, analyzing, optimizing, and automating operations of agentic AI systems that use LLMs, addressing unique uncertainties from probabilistic reasoning and dynamic workflows.

Details

Motivation: Traditional software observability practices are insufficient for agentic AI systems due to unique uncertainties from probabilistic reasoning, evolving memory states, and fluid execution paths in LLM-powered agents.

Method: Proposes AgentOps Automation Pipeline with six stages: behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation, addressing needs of developers, testers, SREs, and business users.

Result: A framework that enables management of uncertainty in agentic systems through automation, ensuring safe, adaptive, and effective operation without eliminating uncertainty but by taming it.

Conclusion: AgentOps provides a comprehensive approach to make agentic AI systems self-improving and manageable through automated operations that handle the unique uncertainties inherent in LLM-powered agentic workflows.

Abstract: Large Language Models (LLMs) are increasingly deployed within agentic systems - collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper presents our vision of AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems. We identify distinct needs across four key roles - developers, testers, site reliability engineers (SREs), and business users - each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems - not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.

[247] Identifying the Supply Chain of AI for Trustworthiness and Risk Management in Critical Applications

Raymond K. Sheh, Karen Geappen

Main category: cs.AI

TL;DR: This paper addresses the gap in systematic assessment of AI supply chain risks and proposes a taxonomy to help stakeholders systematically inventory dependencies in AI systems used in critical applications.

Details

Motivation: There's a significant gap in assessing supply chain risks in modern AI systems, which is problematic when AI is used in critical applications like healthcare, utilities, and transport. Current risk assessment focuses on algorithmic bias and hallucinations but overlooks the complex web of dependencies in AI supply chains.

Method: The authors survey current AI risk assessment practices and then develop a proposed taxonomy specifically for categorizing AI supply chain entities. This taxonomy helps stakeholders systematically inventory dependencies across their organization’s AI systems.

Result: The paper presents a taxonomy that enables stakeholders, especially those without extensive AI expertise, to “consider the right questions” and systematically assess AI supply chain risks across data sources, pre-trained models, agents, services, and other contributing systems.

Conclusion: The proposed taxonomy bridges the gap between current AI governance practices and the urgent need for actionable risk assessment and management of AI use in critical applications, helping organizations better understand and manage their AI supply chain dependencies.

Abstract: Risks associated with the use of AI, ranging from algorithmic bias to model hallucinations, have received much attention and extensive research across the AI community, from researchers to end-users. However, a gap exists in the systematic assessment of supply chain risks associated with the complex web of data sources, pre-trained models, agents, services, and other systems that contribute to the output of modern AI systems. This gap is particularly problematic when AI systems are used in critical applications, such as the food supply, healthcare, utilities, law, insurance, and transport. We survey the current state of AI risk assessment and management, with a focus on the supply chain of AI and risks relating to the behavior and outputs of the AI system. We then present a proposed taxonomy specifically for categorizing AI supply chain entities. This taxonomy helps stakeholders, especially those without extensive AI expertise, to “consider the right questions” and systematically inventory dependencies across their organization’s AI systems. Our contribution bridges a gap between the current state of AI governance and the urgent need for actionable risk assessment and management of AI use in critical applications.

[248] Balancing Natural Language Processing Accuracy and Normalisation in Extracting Medical Insights

Paulina Tworek, Miłosz Bargieł, Yousef Khan, Tomasz Pełech-Pilichowski, Marek Mikołajczyk, Roman Lewandowski, Jose Sousa

Main category: cs.AI

TL;DR: Comparative analysis of rule-based NLP methods vs LLMs for extracting medical information from Polish EHRs, showing rule-based methods are more accurate for demographics while LLMs excel at drug recognition and scalability.

Details

Motivation: Extracting structured medical insights from unstructured clinical text is challenging, especially in non-English contexts with scarce resources, requiring evaluation of different NLP approaches.

Method: Comparative analysis of low-compute rule-based methods and LLMs for information extraction from Polish EHRs, examining effects of text normalization and translation-induced information loss.

Result: Rule-based methods provide higher accuracy for age and sex extraction, while LLMs offer greater adaptability and scalability, excelling in drug name recognition. Translation impacts LLM effectiveness.

Conclusion: Hybrid approaches combining rule-based precision with LLM adaptability offer practical path toward reliable and resource-efficient clinical NLP in real-world hospitals.

Abstract: Extracting structured medical insights from unstructured clinical text using Natural Language Processing (NLP) remains an open challenge in healthcare, particularly in non-English contexts where resources are scarce. This study presents a comparative analysis of NLP low-compute rule-based methods and Large Language Models (LLMs) for information extraction from electronic health records (EHR) obtained from the Voivodeship Rehabilitation Hospital for Children in Ameryka, Poland. We evaluate both approaches by extracting patient demographics, clinical findings, and prescribed medications while examining the effects of lack of text normalisation and translation-induced information loss. Results demonstrate that rule-based methods provide higher accuracy in information retrieval tasks, particularly for age and sex extraction. However, LLMs offer greater adaptability and scalability, excelling in drug name recognition. The effectiveness of the LLMs was compared with texts originally in Polish and those translated into English, assessing the impact of translation. These findings highlight the trade-offs between accuracy, normalisation, and computational cost when deploying NLP in healthcare settings. We argue for hybrid approaches that combine the precision of rule-based systems with the adaptability of LLMs, offering a practical path toward more reliable and resource-efficient clinical NLP in real-world hospitals.

[249] IMACT-CXR - An Interactive Multi-Agent Conversational Tutoring System for Chest X-Ray Interpretation

Tuan-Anh Le, Anh Mai Vu, David Yang, Akash Awasthi, Hien Van Nguyen

Main category: cs.AI

TL;DR: IMACT-CXR is an interactive multi-agent conversational tutor that helps trainees interpret chest X-rays by integrating spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning in an AutoGen-based workflow.

Details

Motivation: To create an effective tutoring system that helps medical trainees improve their chest X-ray interpretation skills by combining multiple analysis modalities and providing personalized feedback.

Method: Uses specialized agents to evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger vision-language reasoning. Implements Bayesian Knowledge Tracing for skill mastery tracking and lung-lobe segmentation for anatomically aware gaze feedback.

Result: The system demonstrates responsive tutoring flows with bounded latency, precise control over answer leakage, and extensibility for live residency deployment. Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.

Conclusion: IMACT-CXR successfully integrates multiple AI components into a unified tutoring workflow that enhances chest X-ray interpretation training while maintaining control over answer disclosure and system responsiveness.

Abstract: IMACT-CXR is an interactive multi-agent conversational tutor that helps trainees interpret chest X-rays by unifying spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning in a single AutoGen-based workflow. The tutor simultaneously ingests learner bounding boxes, gaze samples, and free-text observations. Specialized agents evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger NV-Reason-CXR-3B for vision-language reasoning when mastery remains low or the learner explicitly asks. Bayesian Knowledge Tracing (BKT) maintains skill-specific mastery estimates that drive both knowledge reinforcement and case similarity retrieval. A lung-lobe segmentation module derived from a TensorFlow U-Net enables anatomically aware gaze feedback, and safety prompts prevent premature disclosure of ground-truth labels. We describe the system architecture, implementation highlights, and integration with the REFLACX dataset for real DICOM cases. IMACT-CXR demonstrates responsive tutoring flows with bounded latency, precise control over answer leakage, and extensibility toward live residency deployment. Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.

[250] Mini Amusement Parks (MAPs): A Testbed for Modelling Business Decisions

Stéphane Aroca-Ouellette, Ian Berlot-Attwell, Panagiotis Lymperopoulos, Abhiramon Rajasekharan, Tongqi Zhu, Herin Kang, Kaheer Suleman, Sam Pasupalak

Main category: cs.AI

TL;DR: MAPs is an amusement-park simulator that evaluates AI agents’ holistic decision-making abilities in complex business management scenarios, revealing significant performance gaps between humans and current LLM agents.

Details

Motivation: Current AI systems struggle with interconnected real-world decision-making challenges, and existing benchmarks isolate capabilities rather than assessing holistic competence in modeling environments, long-term planning under uncertainty, and strategic business operations.

Method: Developed Mini Amusement Parks (MAPs) simulator to evaluate agents’ ability to model environment dynamics, anticipate long-term consequences, and strategically operate a complex business, with human baselines and comprehensive evaluation of state-of-the-art LLM agents.

Result: Humans significantly outperform LLM agents by 6.5x on easy mode and 9.8x on medium mode, revealing persistent weaknesses in long-horizon optimization, sample-efficient learning, spatial reasoning, and world modeling.

Conclusion: MAPs provides a unified foundation for benchmarking adaptable decision-making agents by integrating multiple real-world challenges in a single environment, highlighting the need for improved AI capabilities in complex, interconnected decision-making scenarios.

Abstract: Despite rapid progress in artificial intelligence, current systems struggle with the interconnected challenges that define real-world decision making. Practical domains, such as business management, require optimizing an open-ended and multi-faceted objective, actively learning environment dynamics from sparse experience, planning over long horizons in stochastic settings, and reasoning over spatial information. Yet existing human–AI benchmarks isolate subsets of these capabilities, limiting our ability to assess holistic decision-making competence. We introduce Mini Amusement Parks (MAPs), an amusement-park simulator designed to evaluate an agent’s ability to model its environment, anticipate long-term consequences under uncertainty, and strategically operate a complex business. We provide human baselines and a comprehensive evaluation of state-of-the-art LLM agents, finding that humans outperform these systems by 6.5x on easy mode and 9.8x on medium mode. Our analysis reveals persistent weaknesses in long-horizon optimization, sample-efficient learning, spatial reasoning, and world modelling. By unifying these challenges within a single environment, MAPs offers a new foundation for benchmarking agents capable of adaptable decision making. Code: https://github.com/Skyfall-Research/MAPs

[251] Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs

Ivan Chulo, Ananya Joshi

Main category: cs.AI

TL;DR: Improved Theory of Mind in LLMs via activation steering is mediated by enhanced emotional processing and suppressed analytical reasoning.

Details

Motivation: To understand the internal mechanisms behind activation steering's improvement of Theory of Mind in language models, specifically what cognitive changes occur that lead to different outputs.

Method: Decomposed ToM by comparing steered vs baseline LLM activations using linear probes trained on 45 cognitive actions, applied Contrastive Activation Addition steering to Gemma-3-4B, and evaluated on 1,000 BigToM forward belief scenarios.

Result: Steering improved belief attribution accuracy from 32.5% to 46.7%, mediated by increased emotional processing (emotion perception +2.23, emotion valuing +2.20) and suppressed analytical processes (questioning -0.78, convergent thinking -1.59).

Conclusion: Successful Theory of Mind abilities in LLMs are mediated by emotional understanding rather than analytical reasoning.

Abstract: Recent work shows activation steering substantially improves language models’ Theory of Mind (ToM) (Bortoletto et al. 2024), yet the mechanisms of what changes occur internally that leads to different outputs remains unclear. We propose decomposing ToM in LLMs by comparing steered versus baseline LLMs’ activations using linear probes trained on 45 cognitive actions. We applied Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluated it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023), we find improved performance on belief attribution tasks (32.5% to 46.7% accuracy) is mediated by activations processing emotional content : emotion perception (+2.23), emotion valuing (+2.20), while suppressing analytical processes: questioning (-0.78), convergent thinking (-1.59). This suggests that successful ToM abilities in LLMs are mediated by emotional understanding, not analytical reasoning.

[252] Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs

Chelsea Zou, Yiheng Yao, Basant Khalil

Main category: cs.AI

TL;DR: Self-correcting framework for LLMs that detects hallucinations during multi-step reasoning using fine-grained uncertainty signals and reinforcement learning.

Details

Motivation: To improve LLM reasoning by detecting and mitigating hallucinations in real-time, focusing on both final answer correctness and intermediate reasoning faithfulness.

Method: Uses self-assessed confidence alignment and token-level entropy spikes to detect unreliable reasoning, then applies RL with composite reward function to penalize unjustified confidence and encourage stable reasoning.

Result: Improves both final answer accuracy and reasoning calibration, with ablations validating individual signal contributions.

Conclusion: The framework successfully makes LLMs more introspective and improves reasoning coherence and faithfulness through confidence-aware reward feedback.

Abstract: This project develops a self correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi-step reasoning. Rather than relying solely on final answer correctness, our approach leverages fine grained uncertainty signals: 1) self-assessed confidence alignment, and 2) token-level entropy spikes to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes, while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes the model’s generation behavior through confidence-aware reward feedback, improving not just outcome correctness but the coherence and faithfulness of their intermediate reasoning steps. Experiments show that our method improves both final answer accuracy and reasoning calibration, with ablations validating the individual contribution of each signal.

[253] JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

Main category: cs.AI

TL;DR: JudgeBoard is a novel evaluation pipeline that directly queries models to assess answer correctness without requiring answer comparisons. The MAJ framework uses multiple interacting SLMs to achieve LLM-level judgment accuracy through collaborative deliberation.

Details

Motivation: Small language models (SLMs) have unclear ability to judge answer correctness compared to LLMs. Existing LLM-as-a-judge frameworks rely on indirect answer comparisons with predefined metrics, limiting fine-grained and scalable evaluation of reasoning outputs.

Method: Proposed JudgeBoard pipeline for direct correctness assessment without answer comparisons. Developed MAJ (Multi-Agent Judging) framework using multiple interacting SLMs with distinct reasoning profiles for collaborative deliberation. Constructed task-specific evaluation leaderboards using accuracy-based ranking and Elo-based rating across five benchmark datasets.

Result: Significant performance gap between SLMs and LLMs in isolated judging tasks. MAJ framework substantially improves SLM reliability and consistency. On MATH dataset, MAJ with smaller models performs comparably or better than larger counterparts. Multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks.

Conclusion: Multi-agent SLM systems show promise for scalable and efficient assessment, potentially matching LLM performance in judgment tasks. The MAJ framework enables lightweight models to achieve high-quality evaluation capabilities through collaborative deliberation.

Abstract: While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparatively well or even better than their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.

[254] KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Zhe Li, Yehan Qiu, Yujie Chen, Xiang Zhou

Main category: cs.AI

TL;DR: KRAL is a novel paradigm that enhances clinical LLMs by distilling knowledge and reasoning from teacher models via reverse generation, heuristic learning, and agentic reinforcement learning, achieving superior performance at lower costs.

Details

Motivation: Address limitations of LLMs in clinical decision-making including knowledge gaps, privacy concerns, high costs, and limited reasoning capabilities.

Method: Knowledge and reasoning distillation via answer-to-question reverse generation, heuristic learning for semi-supervised data augmentation, agentic reinforcement learning, and hierarchical evaluation with teacher-model proxies.

Result: Outperforms RAG and SFT: Accuracy@1 on MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG; Pass@1 on PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG, at ~20% of SFT’s training costs.

Conclusion: KRAL enables low-cost, high-safety deployment of enhanced local LLMs for complex medical decision support, addressing key limitations in clinical applications.

Abstract: Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles, host factors, pharmacological properties of antimicrobials, and the severity of infection.This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at ~20% of SFT’s long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

[255] Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Shahin Zanbaghi, Ryan Rostampour, Farhan Abid, Salim Al Jarmakani

Main category: cs.AI

TL;DR: A dual-method detection system combining semantic drift analysis and canary baseline comparison achieves 92.5% accuracy in identifying backdoored LLMs in real-time without model modification.

Details

Motivation: LLMs can be backdoored to exhibit malicious behavior under specific conditions while appearing safe during training, and no practical detection methods currently exist for these "sleeper agents".

Method: Combines semantic drift analysis using Sentence-BERT embeddings to measure deviation from safe baselines with injected canary questions that monitor response consistency.

Result: Achieved 92.5% accuracy with 100% precision (zero false positives) and 85% recall on the Cadenza-Labs dolphin-llama3-8B sleeper agent model, operating in real-time (<1s per query).

Conclusion: Provides the first practical solution to LLM backdoor detection, demonstrating that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.

Abstract: Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training a phenomenon known as “sleeper agents.” Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.

[256] CARE-RAG - Clinical Assessment and Reasoning in RAG

Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao, Junyuan Hong, Ying Ding

Main category: cs.AI

TL;DR: LLMs struggle with clinical reasoning even when provided correct evidence, requiring evaluation of reasoning quality beyond just retrieval accuracy.

Details

Motivation: There's a gap between retrieval and reasoning in LLMs, especially concerning in clinical settings where outputs must follow structured protocols like Written Exposure Therapy guidelines.

Method: Proposed an evaluation framework measuring accuracy, consistency, and fidelity of reasoning, using clinician-vetted questions and authoritative passages.

Result: Errors persist even with correct evidence; RAG can constrain outputs but safe deployment requires rigorous reasoning assessment.

Conclusion: Safe clinical deployment of LLMs requires assessing reasoning as rigorously as retrieval, not just providing correct evidence.

Abstract: Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.

[257] MUSEKG: A Knowledge Graph Over Museum Collections

Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden

Main category: cs.AI

TL;DR: MuseKG is a knowledge-graph framework that integrates structured and unstructured museum data through symbolic-neural integration, enabling natural language queries and outperforming LLM and SPARQL baselines.

Details

Motivation: Digital transformation in cultural heritage has created fragmented artefact data collections, with existing museum information systems struggling to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent queryable form.

Method: MuseKG constructs a typed property graph linking objects, people, organisations, and visual/textual labels through symbolic-neural integration, supporting natural language queries.

Result: Evaluations on real museum collections show robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot and SPARQL prompt baselines.

Conclusion: The results highlight the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, paving the way for web-scale integration of digital heritage knowledge.

Abstract: Digital transformation in the cultural heritage sector has produced vast yet fragmented collections of artefact data. Existing frameworks for museum information systems struggle to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent and queryable form. We present MuseKG, an end-to-end knowledge-graph framework that unifies structured and unstructured museum data through symbolic-neural integration. MuseKG constructs a typed property graph linking objects, people, organisations, and visual or textual labels, and supports natural language queries. Evaluations on real museum collections demonstrate robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot and SPARQL prompt baselines. The results highlight the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, and pave the way for web-scale integration of digital heritage knowledge.

[258] SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model

Emanuel C. Silva, Emily S. M. Salum, Gabriel M. Arantes, Matheus P. Pereira, Vinicius F. Oliveira, Alessandro L. Bicho

Main category: cs.AI

TL;DR: SpellForger is a game where players create custom spells using natural language prompts, with AI interpreting descriptions to generate balanced spell parameters in real-time.

Details

Motivation: To explore AI as a core gameplay co-creation tool rather than just for content generation, enabling unique personalization and creativity experiences.

Method: Uses a supervised-trained BERT model to interpret natural language prompts, mapping them to spell prefabs and balancing parameters (damage, cost, effects). Built with Unity Game Engine and Python AI backend.

Result: Expected to deliver a functional prototype demonstrating real-time spell generation within an engaging gameplay loop centered on player creativity.

Conclusion: Validates the use of AI as a direct gameplay mechanic for co-creation and personalization in gaming.

Abstract: Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervisedtrained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.

[259] An Aligned Constraint Programming Model For Serial Batch Scheduling With Minimum Batch Size

Jorge A. Huertas, Pascal Van Hentenryck

Main category: cs.AI

TL;DR: A novel Constraint Programming model for serial batch scheduling with minimum batch sizes that avoids predefined virtual batches, using direct sequence reasoning and improved search strategies to outperform existing methods.

Details

Motivation: Existing CP models for serial batch scheduling rely on predefined virtual batches that suffer from dimensionality issues and complexity, especially in practical settings like semiconductor manufacturing where minimum batch sizes are required.

Method: Proposes a CP model with key alignment parameters that reason directly on sequences of same-family jobs, enhanced with tailored search phases and strengthened constraint propagator inference levels.

Result: Extensive experiments on 5000 instances show superiority on small-to-medium instances (up to 100 jobs) and ability to find solutions up to 25% better than existing methods on large instances (up to 500 jobs, 10 families, 10 machines).

Conclusion: The proposed CP model provides a more compact formulation that significantly outperforms existing MIP, tabu search, and CP approaches for serial batch scheduling with minimum batch sizes.

Abstract: In serial batch (s-batch) scheduling, jobs from similar families are grouped into batches and processed sequentially to avoid repetitive setups that are required when processing consecutive jobs of different families. Despite its large success in scheduling, only three Constraint Programming (CP) models have been proposed for this problem considering minimum batch sizes, which is a common requirement in many practical settings, including the ion implantation area in semiconductor manufacturing. These existing CP models rely on a predefined virtual set of possible batches that suffers from the curse of dimensionality and adds complexity to the problem. This paper proposes a novel CP model that does not rely on this virtual set. Instead, it uses key alignment parameters that allow it to reason directly on the sequences of same-family jobs scheduled on the machines, resulting in a more compact formulation. This new model is further improved by exploiting the problem’s structure with tailored search phases and strengthened inference levels of the constraint propagators. The extensive computational experiments on nearly five thousand instances compare the proposed models against existing methods in the literature, including mixed-integer programming formulations, tabu search meta-heuristics, and CP approaches. The results demonstrate the superiority of the proposed models on small-to-medium instances with up to 100 jobs, and their ability to find solutions up to 25% better than the ones produces by existing methods on large-scale instances with up to 500 jobs, 10 families, and 10 machines.

[260] Artificial Intelligence and Accounting Research: A Framework and Agenda

Theophanis C. Stratopoulos, Victor Xiaoqi Wang

Main category: cs.AI

TL;DR: This paper presents a framework for analyzing AI-accounting research along two dimensions (research focus and methodological approach), maps existing studies, identifies research opportunities, and examines how GenAI transforms accounting research processes and creates competitive pressures.

Details

Motivation: Recent advances in generative AI and large language models are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars that need to be systematically analyzed.

Method: Proposes a classification framework with two dimensions (research focus: accounting-centric vs AI-centric; methodological approach: AI-based vs traditional), applies it to papers from IJAIS special issue and leading accounting journals, and analyzes research workflow capabilities of human researchers vs AI agents.

Result: The framework successfully maps existing AI-accounting research and reveals strategic positioning opportunities. Analysis shows GenAI democratizes certain research capabilities but intensifies competition by raising expectations for higher-order contributions where human judgment and creativity remain valuable.

Conclusion: Accounting researchers need strategic positioning and collaboration to leverage their expertise, and doctoral education should be reformed to cultivate comparative advantages while building AI fluency in response to these transformative shifts.

Abstract: Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars’ strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.

[261] A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management

Hrikshesh Kumar, Anika Garg, Anshul Gupta, Yashika Agarwal

Main category: cs.AI

TL;DR: Proactive cloud-edge workload management using CNN-LSTM forecasting embedded in multi-agent DRL, outperforming reactive methods by optimizing cost, performance, and reliability simultaneously.

Details

Motivation: Traditional reactive resource management with static thresholds leads to either overspending or performance degradation, requiring proactive solutions that anticipate problems rather than reacting to them.

Method: Hybrid architecture combining CNN-LSTM for time series forecasting with multi-agent Deep Reinforcement Learning, embedding predictive forecasts directly into the DRL agent’s state space for long-term planning.

Result: The system significantly outperforms traditional methods, effectively solving complex multi-objective optimization problems involving cost efficiency, performance, and reliability.

Conclusion: Embedding predictive capabilities into DRL agents enables smarter, proactive resource management that finds optimal balance between cost savings and system performance, providing smooth long-term planning rather than reactive problem-solving.

Abstract: Old cloud edge workload resource management is too reactive. The problem with relying on static thresholds is that we are either overspending for more resources than needed or have reduced performance because of their lack. This is why we work on proactive solutions. A framework developed for it stops reacting to the problems but starts expecting them. We design a hybrid architecture, combining two powerful tools: the CNN LSTM model for time series forecasting and an orchestrator based on multi agent Deep Reinforcement Learning In fact the novelty is in how we combine them as we embed the predictive forecast from the CNN LSTM directly into the DRL agent state space. That is what makes the AI manager smarter it sees the future, which allows it to make better decisions about a long term plan for where to run tasks That means finding that sweet spot between how much money is saved while keeping the system healthy and apps fast for users That is we have given it eyes in order to see down the road so that it does not have to lurch from one problem to another it finds a smooth path forward Our tests show our system easily beats the old methods It is great at solving tough problems like making complex decisions and juggling multiple goals at once like being cheap fast and reliable

[262] SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Main category: cs.AI

TL;DR: SkyRL-Agent is a framework for efficient multi-turn agent training with optimized asynchronous dispatching and tool integration, used to train SA-SWE-32B software engineering agent achieving 39.4% Pass@1 on SWE-Bench with 2x cost reduction.

Details

Motivation: To create an efficient framework for training and evaluating multi-turn, long-horizon agents with seamless integration to existing RL frameworks and improved training efficiency.

Method: Uses optimized asynchronous pipeline dispatcher (1.55x speedup), tool-enhanced training with AST-based search tool for code navigation, and reinforcement learning training from Qwen3-32B base model.

Result: SA-SWE-32B achieves 39.4% Pass@1 on SWE-Bench Verified with >2x cost reduction, and generalizes well to other agentic tasks like Terminal-Bench, BrowseComp-Plus, and WebArena despite being trained only on SWE tasks.

Conclusion: SkyRL-Agent enables efficient agent training with significant performance improvements and cost reductions, demonstrating strong generalization capabilities and extensibility across different agent types and training backends.

Abstract: We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent’s extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.

[263] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing

Main category: cs.AI

TL;DR: OpenMMReasoner introduces a transparent two-stage training recipe for multimodal reasoning with supervised fine-tuning and reinforcement learning, achieving 11.6% improvement over Qwen2.5-VL-7B-Instruct baseline across nine benchmarks.

Details

Motivation: Despite progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies hinders scalable research in multimodal reasoning.

Method: Two-stage approach: (1) SFT stage with 874K-sample cold-start dataset with step-by-step validation, (2) RL stage with 74K-sample dataset across diverse domains to sharpen and stabilize reasoning abilities.

Result: Achieves 11.6% improvement over Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, demonstrating superior performance and robustness.

Conclusion: The work establishes a solid empirical foundation for future large-scale multimodal reasoning research and highlights the critical role of data quality and training design, with all code, pipeline, and data open-sourced.

Abstract: Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.

[264] Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints

Yongnan Jin, Xurui Li, Feng Cao, Liucun Gao, Juanjuan Yao

Main category: cs.AI

TL;DR: MR-RML is a novel alignment framework that addresses LLM limitations in medical practice through multidimensional rubric-oriented reward model learning with geometric projection constraints, achieving state-of-the-art performance on medical benchmarks.

Details

Motivation: Current LLMs face critical alignment challenges in medical applications: disconnect from dynamic clinical needs, difficulty adapting to evolving medical standards, and inability of conventional reward models to capture nuanced medical quality criteria.

Method: Proposes MR-RML via GPRC framework with three innovations: (1) Dimensions-Scenarios-Disciplines medical standard system, (2) independent multi-dimensional reward model that decomposes evaluation criteria, (3) geometric projection reference constraints that transform medical logic into mathematical regularization.

Result: Achieves substantial performance gains over base Qwen-32B (45% on full subset, 85% on Hard subset), sets SOTA among open-source LLMs with scores of 62.7 (full) and 44.7 (hard), and outperforms most closed-source models on Healthbench benchmark.

Conclusion: The MR-RML framework effectively addresses LLM alignment challenges in medical practice by integrating medical standards throughout training and enabling synthetic data-driven optimization, demonstrating significant improvements in clinical utility.

Abstract: The integration of large language models (LLMs) into medical practice holds transformative potential, yet their real-world clinical utility remains limited by critical alignment challenges: (1) a disconnect between static evaluation benchmarks and dynamic clinical cognitive needs, (2) difficulties in adapting to evolving, multi-source medical standards, and (3) the inability of conventional reward models to capture nuanced, multi-dimensional medical quality criteria. To address these gaps, we propose MR-RML (Multidimensional Rubric-oriented Reward Model Learning) via GPRC (Geometric Projection Reference Constraints), a novel alignment framework that integrates medical standards into a structured “Dimensions-Scenarios-Disciplines” matrix to guide data generation and model optimization. MR-RML introduces three core innovations: (1) a “Dimensions-Scenarios-Disciplines” medical standard system that embeds domain standards into the full training pipeline; (2) an independent multi-dimensional reward model that decomposes evaluation criteria, shifting from real-time rubric-based scoring to internalized reward modeling for improved consistency and cost-efficiency; (3) geometric projection reference constraints that transform medical cognitive logic into mathematical regularization, aligning scoring gradients with clinical reasoning and enabling synthetic data-driven training. Through extensive evaluations on the authoritative medical benchmark Healthbench, our method yields substantial performance gains over the base LLM Qwen-32B (45% on the full subset and 85% on Hard subset, respectively). It achieves a SOTA among open-source LLMs with scores of 62.7 (full subset) and 44.7 (hard subset), while also outperforming the majority of closed-source models.

[265] TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models

Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li, Linbo Jiang, Jianan Lin, Chaochao Chen

Main category: cs.AI

TL;DR: TOFA is a training-free one-shot federated adaptation framework for Vision-Language Models that addresses communication costs and data heterogeneity without requiring additional client or server training resources.

Details

Motivation: Existing federated VLM adaptation methods incur high communication costs and are vulnerable to attacks due to iterative training. One-shot approaches face challenges in exploiting multimodal information, handling data heterogeneity, and avoiding additional training resources.

Method: TOFA uses visual and textual pipelines: a hierarchical Bayesian model learns personalized class prototypes in the visual pipeline, while the textual pipeline evaluates and aligns local text prompts globally. An adaptive weight calibration mechanism combines both modalities.

Result: Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of TOFA in handling data heterogeneity while maintaining performance.

Conclusion: TOFA provides an efficient, training-free solution for federated VLM adaptation that balances personalization and robustness without requiring additional training resources on clients or server.

Abstract: Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.

Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris

Main category: cs.AI

TL;DR: FOOTPASS is the first benchmark for play-by-play action spotting in soccer that combines computer vision outputs with tactical knowledge to generate reliable play-by-play data streams.

Details

Motivation: Current action recognition methods are insufficient for reliable play-by-play data extraction, while tactical modeling research has advanced but needs automated vision-based support.

Method: Multi-modal, multi-agent approach that integrates computer vision tasks (tracking, identification) with tactical knowledge and soccer regularities over long time horizons.

Result: Created FOOTPASS dataset enabling player-centric action spotting methods that generate reliable play-by-play data streams for sports analytics.

Conclusion: FOOTPASS provides the first benchmark for holistic play-by-play action spotting in soccer, bridging computer vision and tactical knowledge for automated sports analytics.

Abstract: Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.

[267] From Performance to Understanding: A Vision for Explainable Automated Algorithm Design

Niki van Stein, Anna V. Kononova, Thomas Bäck

Main category: cs.AI

TL;DR: The paper proposes explainable automated algorithm design by combining LLM-driven discovery with systematic benchmarking to understand why algorithms work, rather than just optimizing performance.

Details

Motivation: Current LLM-based automated algorithm design is performance-driven but opaque - it doesn't reveal why algorithms work, which components matter, or how design choices relate to problem structures.

Method: Three-pillar approach: (1) LLM-driven discovery of algorithmic variants, (2) explainable benchmarking that attributes performance to components and hyperparameters, (3) problem-class descriptors connecting algorithm behavior to landscape structure.

Result: Forms a closed knowledge loop where discovery, explanation, and generalization reinforce each other, shifting from blind search to interpretable, class-specific algorithm design.

Conclusion: This integration will accelerate progress while producing reusable scientific insight into when and why optimization strategies succeed, moving beyond pure automation to understanding.

Abstract: Automated algorithm design is entering a new phase: Large Language Models can now generate full optimisation (meta)heuristics, explore vast design spaces and adapt through iterative feedback. Yet this rapid progress is largely performance-driven and opaque. Current LLM-based approaches rarely reveal why a generated algorithm works, which components matter or how design choices relate to underlying problem structures. This paper argues that the next breakthrough will come not from more automation, but from coupling automation with understanding from systematic benchmarking. We outline a vision for explainable automated algorithm design, built on three pillars: (i) LLM-driven discovery of algorithmic variants, (ii) explainable benchmarking that attributes performance to components and hyperparameters and (iii) problem-class descriptors that connect algorithm behaviour to landscape structure. Together, these elements form a closed knowledge loop in which discovery, explanation and generalisation reinforce each other. We argue that this integration will shift the field from blind search to interpretable, class-specific algorithm design, accelerating progress while producing reusable scientific insight into when and why optimisation strategies succeed.

[268] Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi

Main category: cs.AI

TL;DR: CRM replaces single reward models with a team of specialist evaluators for better robustness and interpretability in RLHF, using multi-agent collaboration and fusion of domain-specific signals.

Details

Motivation: Conventional reward models struggle with optimizing multiple conflicting preference dimensions and lack transparency in scoring decisions, requiring a more robust and interpretable approach.

Method: Decomposes preference evaluation into domain-specific agents producing partial signals, with a centralized aggregator fusing signals using factors like step-wise correctness, multi-agent agreement, and repetition penalties.

Result: Enables multi-perspective reward shaping without additional human annotations, compatible with standard RL pipelines using advantage-based updates and value model regression.

Conclusion: CRM and rewardBench provide a practical, modular path to more transparent reward modeling and stable optimization through collaborative multi-agent evaluation.

Abstract: We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.

[269] ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025

Xu Qiang, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li

Main category: cs.AI

TL;DR: ChemO is a new benchmark from the International Chemistry Olympiad 2025 that addresses chemistry’s multimodal challenges through Assessment-Equivalent Reformulation and Structured Visual Enhancement, with ChemLabs multi-agent framework achieving state-of-the-art performance.

Details

Motivation: Chemistry has remained an open challenge for AI reasoning due to its unique multimodal symbolic language, unlike mathematics and physics which already have Olympiad-level benchmarks.

Method: Uses Assessment-Equivalent Reformulation to convert visual output problems into tractable formats, Structured Visual Enhancement to separate visual perception from chemical reasoning, and ChemLabs hierarchical multi-agent framework with specialized agents.

Result: The top configuration achieves 93.6/100 score, surpassing estimated human gold medal threshold and establishing new state-of-the-art in automated chemical problem-solving.

Conclusion: ChemO benchmark with innovative assessment methods and multi-agent framework successfully addresses chemistry’s multimodal reasoning challenges, demonstrating superior performance over existing approaches.

Abstract: Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model’s visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO

[270] D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang

Main category: cs.AI

TL;DR: D-GARA is a dynamic benchmarking framework that evaluates Android GUI agent robustness against real-world anomalies like permission dialogs and battery warnings, showing significant performance degradation in current agents.

Details

Motivation: Existing GUI agent datasets are static and idealized, failing to capture real-world complexity and unpredictability, particularly the presence of anomalies that agents commonly encounter.

Method: Proposed D-GARA framework with diverse real-world anomalies, constructed benchmark with annotated Android applications containing embedded anomalies, and conducted comprehensive experiments.

Result: State-of-the-art GUI agents showed substantial performance degradation when exposed to anomaly-rich environments, demonstrating the need for robustness-aware learning.

Conclusion: D-GARA provides a modular and extensible framework for evaluating GUI agent robustness against real-world anomalies, supporting integration of new tasks and anomaly types.

Abstract: Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.

[271] FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks

Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen, Wentao Zhang

Main category: cs.AI

TL;DR: Automated pipeline extracts high-quality QA/VQA pairs from educational documents using OCR and LLM-based parsing, enabling scalable use of real-world educational content for LLM training.

Details

Motivation: Existing instruction-tuning datasets are costly and rely on synthetic samples with hallucination issues, while abundant high-quality educational QA content remains underexploited due to PDF transformation difficulties.

Method: Combines layout-aware OCR with LLM-based semantic parsing to extract well-formed QA and visual-QA pairs from educational documents.

Result: Produces accurate, aligned, and low-noise QA/VQA pairs across diverse document types, enabling scalable use of real-world educational content.

Conclusion: Provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training, with all code and pipelines open-sourced.

Abstract: The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer(QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.

[272] Revisiting Fairness-aware Interactive Recommendation: Item Lifecycle as a Control Knob

Yun Lu, Xiaoyu Shi, Hong Xie, Chongjun Xia, Zhenhui Gong, Mingsheng Shang

Main category: cs.AI

TL;DR: This paper introduces LHRL, a lifecycle-aware hierarchical reinforcement learning framework for fair interactive recommendation that leverages item lifecycle patterns to dynamically balance fairness and accuracy.

Details

Motivation: To address fairness in interactive recommendation by exploiting the compressed three-phase lifecycle pattern of items in short-video platforms, which differs from classical four-stage models.

Method: LHRL framework with PhaseFormer for phase detection (combining STL decomposition and attention) and two-level HRL agent (high-level policy for phase-aware fairness constraints, low-level policy for user engagement optimization).

Result: Experiments show LHRL significantly improves both fairness and user engagement across multiple real-world datasets, and lifecycle-aware rewards enhance existing RL models.

Conclusion: Item lifecycle is a valuable control knob for fairness-aware recommendation, and the LHRL framework effectively reconciles long-term equity with short-term utility while being generalizable to existing models.

Abstract: This paper revisits fairness-aware interactive recommendation (e.g., TikTok, KuaiShou) by introducing a novel control knob, i.e., the lifecycle of items. We make threefold contributions. First, we conduct a comprehensive empirical analysis and uncover that item lifecycles in short-video platforms follow a compressed three-phase pattern, i.e., rapid growth, transient stability, and sharp decay, which significantly deviates from the classical four-stage model (introduction, growth, maturity, decline). Second, we introduce LHRL, a lifecycle-aware hierarchical reinforcement learning framework that dynamically harmonizes fairness and accuracy by leveraging phase-specific exposure dynamics. LHRL consists of two key components: (1) PhaseFormer, a lightweight encoder combining STL decomposition and attention mechanisms for robust phase detection; (2) a two-level HRL agent, where the high-level policy imposes phase-aware fairness constraints, and the low-level policy optimizes immediate user engagement. This decoupled optimization allows for effective reconciliation between long-term equity and short-term utility. Third, experiments on multiple real-world interactive recommendation datasets demonstrate that LHRL significantly improves both fairness and user engagement. Furthermore, the integration of lifecycle-aware rewards into existing RL-based models consistently yields performance gains, highlighting the generalizability and practical value of our approach.

[273] MuISQA: Multi-Intent Retrieval-Augmented Generation for Scientific Question Answering

Zhiyuan Li, Haisheng Yu, Guangchuan Guo, Nan Zhou, Jiajun Zhang

Main category: cs.AI

TL;DR: The paper introduces MuISQA benchmark to evaluate multi-intent scientific QA and proposes an intent-aware retrieval framework that uses LLMs to hypothesize answers, decompose queries, and retrieve evidence for different intents.

Details

Motivation: Conventional RAG systems are single-intent oriented, leading to incomplete evidence coverage for complex scientific questions that require multi-hop reasoning across diverse sources.

Method: An intent-aware retrieval framework that uses LLMs to hypothesize potential answers, decompose them into intent-specific queries, retrieve supporting passages for each intent, and aggregate results using Reciprocal Rank Fusion (RRF).

Result: The method consistently outperforms conventional approaches on both MuISQA benchmark and general RAG datasets, particularly in retrieval accuracy and evidence coverage.

Conclusion: The proposed intent-aware retrieval framework effectively addresses the limitations of single-intent RAG systems by improving heterogeneous evidence coverage for multi-intent scientific questions.

Abstract: Complex scientific questions often entail multiple intents, such as identifying gene mutations and linking them to related diseases. These tasks require evidence from diverse sources and multi-hop reasoning, while conventional retrieval-augmented generation (RAG) systems are usually single-intent oriented, leading to incomplete evidence coverage. To assess this limitation, we introduce the Multi-Intent Scientific Question Answering (MuISQA) benchmark, which is designed to evaluate RAG systems on heterogeneous evidence coverage across sub-questions. In addition, we propose an intent-aware retrieval framework that leverages large language models (LLMs) to hypothesize potential answers, decompose them into intent-specific queries, and retrieve supporting passages for each underlying intent. The retrieved fragments are then aggregated and re-ranked via Reciprocal Rank Fusion (RRF) to balance coverage across diverse intents while reducing redundancy. Experiments on both MuISQA benchmark and other general RAG datasets demonstrate that our method consistently outperforms conventional approaches, particularly in retrieval accuracy and evidence coverage.

[274] Distributed Agent Reasoning Across Independent Systems With Strict Data Locality

Daniel Vaughan, Kateřina Vaughan

Main category: cs.AI

TL;DR: Proof-of-concept for agent-to-agent communication across distributed systems using natural language messages without shared identifiers or centralized data exchange, demonstrating secure cooperation between organizations via pseudonymised tokens and local data lookups.

Details

Motivation: To explore how multiple organizations can cooperate securely while maintaining data privacy and operational boundaries, without requiring shared identifiers, structured schemas, or centralized data exchange.

Method: Uses Orpius platform for multi-agent orchestration with OperationRelay calls for natural-language communication. Agents operate on local data only, using HMAC-based pseudonymous tokens for privacy. Clinic generates tokens, Insurer evaluates coverage and consults Specialist, who returns recommendations.

Result: Successfully demonstrated feasibility of distributed agent communication using natural language messages and pseudonymised tokens while keeping data local to each organization. No patient identity is shared or reconstructed.

Conclusion: The prototype shows architectural patterns for privacy-preserving distributed reasoning among specialized agents. Future work needs more rigorous evaluation and research in decentralized multi-agent systems.

Abstract: This paper presents a proof-of-concept demonstration of agent-to-agent communication across distributed systems, using only natural-language messages and without shared identifiers, structured schemas, or centralised data exchange. The prototype explores how multiple organisations (represented here as a Clinic, Insurer, and Specialist Network) can cooperate securely via pseudonymised case tokens, local data lookups, and controlled operational boundaries. The system uses Orpius as the underlying platform for multi-agent orchestration, tool execution, and privacy-preserving communication. All agents communicate through OperationRelay calls, exchanging concise natural-language summaries. Each agent operates on its own data (such as synthetic clinic records, insurance enrolment tables, and clinical guidance extracts), and none receives or reconstructs patient identity. The Clinic computes an HMAC-based pseudonymous token, the Insurer evaluates coverage rules and consults the Specialist agent, and the Specialist returns an appropriateness recommendation. The goal of this prototype is intentionally limited: to demonstrate feasibility, not to provide a clinically validated, production-ready system. No clinician review was conducted, and no evaluation beyond basic functional runs was performed. The work highlights architectural patterns, privacy considerations, and communication flows that enable distributed reasoning among specialised agents while keeping data local to each organisation. We conclude by outlining opportunities for more rigorous evaluation and future research in decentralised multi-agent systems.

[275] Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen

Anna Luiza Gomes da Silva, Diego Kreutz, Angelo Diniz, Rodrigo Mansilha, Celso Nobre da Fonseca

Main category: cs.AI

TL;DR: Proposes a Super-Metric that aggregates eight metrics across four fidelity dimensions to evaluate synthetic Android malware data quality, showing better stability and correlation with classifier performance than traditional metrics.

Details

Motivation: Existing metrics for evaluating synthetic Android malware data suffer from instability and lack of standardization, making quality assessment challenging.

Method: Integrated a Super-Metric into MalDataGen that aggregates eight metrics across four fidelity dimensions to produce a single weighted score.

Result: Experiments with ten generative models and five balanced datasets showed the Super-Metric is more stable and consistent than traditional metrics, with stronger correlations with classifier performance.

Conclusion: The proposed Super-Metric provides a more reliable and standardized approach for evaluating synthetic Android malware data quality compared to existing metrics.

Abstract: Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.

[276] An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models

Alexander Zadorojniy, Segev Wasserkrug, Eitan Farchi

Main category: cs.AI

TL;DR: Proposes an agent-based method for automatic validation of optimization models generated from natural language descriptions using software testing techniques.

Details

Motivation: There's a need to validate that optimization models generated by LLMs from natural language descriptions are correct and satisfy requirements, as current validation methods are insufficient.

Method: Uses multiple agents to: 1) generate problem-level testing API, 2) create tests using this API, and 3) generate optimization model-specific mutations to assess test suite effectiveness.

Result: The framework provides high-quality validation as measured by mutation coverage, a well-known software testing metric.

Conclusion: The proposed agent-based validation framework effectively addresses the challenge of validating LLM-generated optimization models through software testing techniques.

Abstract: Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has became increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling . This method consists of several agents that initially generate a problem-level testing API, then generate tests utilizing this API, and, lastly, generate mutations specific to the optimization model (a well-known software testing technique assessing the fault detection power of the test suite). In this work, we detail this validation framework and show, through experiments, the high quality of validation provided by this agent ensemble in terms of the well-known software testing measure called mutation coverage.

[277] CorrectHDL: Agentic HDL Design with LLMs Leveraging High-Level Synthesis as Reference

Kangwei Xu, Grace Li Zhang, Ulf Schlichtmann, Bing Li

Main category: cs.AI

TL;DR: CorrectHDL framework uses HLS results as functional references to correct errors in LLM-generated HDL designs, achieving better area/power efficiency than conventional HLS while maintaining correctness.

Details

Motivation: LLMs show potential in hardware design but suffer from hallucination that introduces functional errors in generated HDL designs, requiring a solution to ensure correctness while leveraging LLM capabilities.

Method: Uses C/C++ program input to generate HDL via LLM, repairs syntax errors with RAG, then iteratively improves functional correctness by comparing simulated behavior with HLS reference design from conventional tools.

Result: Generated circuits achieve significantly better area and power efficiency than conventional HLS designs and approach human-engineered circuit quality while maintaining functional correctness.

Conclusion: The framework effectively combines LLM generative capabilities with traditional correctness-driven design flows, demonstrating the potential of agentic HDL design.

Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in hardware front-end design using hardware description languages (HDLs). However, their inherent tendency toward hallucination often introduces functional errors into the generated HDL designs. To address this issue, we propose the framework CorrectHDL that leverages high-level synthesis (HLS) results as functional references to correct potential errors in LLM-generated HDL designs.The input to the proposed framework is a C/C++ program that specifies the target circuit’s functionality. The program is provided to an LLM to directly generate an HDL design, whose syntax errors are repaired using a Retrieval-Augmented Generation (RAG) mechanism. The functional correctness of the LLM-generated circuit is iteratively improved by comparing its simulated behavior with an HLS reference design produced by conventional HLS tools, which ensures the functional correctness of the result but can lead to suboptimal area and power efficiency. Experimental results demonstrate that circuits generated by the proposed framework achieve significantly better area and power efficiency than conventional HLS designs and approach the quality of human-engineered circuits. Meanwhile, the correctness of the resulting HDL implementation is maintained, highlighting the effectiveness and potential of agentic HDL design leveraging the generative capabilities of LLMs and the rigor of traditional correctness-driven IC design flows.

[278] Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance

Jacopo Tagliabue, Federico Bianchi, Ciro Greco

Main category: cs.AI

TL;DR: Bauplan is an agent-first lakehouse design that uses transaction-based infrastructure to enable trustworthy agentic workflows with data and compute isolation, addressing the limitations of traditional lakehouses for agent access patterns.

Details

Motivation: Enterprises don't trust AI agents with production data because traditional lakehouses are not suited for agent access patterns. The path to trustworthy agentic workflows requires solving infrastructure problems first.

Method: Proposed Bauplan, an agent-first lakehouse design that reimplements data and compute isolation, drawing operational analogies to MVCC in databases but adapted for decoupled, multi-language settings.

Result: Developed a reference implementation of a self-healing pipeline in Bauplan that seamlessly couples agent reasoning with guarantees for correctness and trust.

Conclusion: By designing lakehouses around transactions first, governance follows naturally, enabling trustworthy agentic workflows with the desired guarantees for production data.

Abstract: Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.

[279] Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report

Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong

Main category: cs.AI

TL;DR: Pharos-ESG is a unified framework that transforms unstructured ESG reports into structured representations using multimodal parsing, contextual narration, and hierarchical labeling to address challenges in analyzing chaotic layouts and implicit hierarchies.

Details

Motivation: ESG reports present significant challenges for large-scale understanding due to chaotic reading order from slide-like irregular layouts and implicit hierarchies in lengthy, weakly structured content, making it difficult to assess corporate ESG performance effectively.

Method: The framework integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. It also enriches outputs with ESG, GRI, and sentiment labels.

Result: Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. The authors also release Aurora-ESG, the first large-scale public dataset of ESG reports spanning multiple markets.

Conclusion: Pharos-ESG successfully addresses the challenges of ESG report analysis and provides structured representations aligned with financial research demands, while the Aurora-ESG dataset supports ESG integration in financial governance and decision-making.

Abstract: Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial gover- nance, transforming capital allocation architectures, regu- latory frameworks, and systemic risk coordination mecha- nisms. However, as the core medium for assessing corpo- rate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic read- ing order from slide-like irregular layouts and implicit hier- archies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a uni- fied framework that transforms ESG reports into structured representations through multimodal parsing, contextual nar- ration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware seg- mentation guided by table-of-contents anchors, and a multi- modal aggregation pipeline that contextually transforms vi- sual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical de- mands of financial research. Extensive experiments on anno- tated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG re- ports, spanning Mainland China, Hong Kong, and U.S. mar- kets, featuring unified structured representations of multi- modal content, enriched with fine-grained layout and seman- tic annotations to better support ESG integration in financial governance and decision-making.

[280] From generative AI to the brain: five takeaways

Claudius Gros

Main category: cs.AI

TL;DR: The paper argues that neuroscience should study generative AI principles and ML characterizations to understand brain function, discussing five key examples.

Details

Motivation: To bridge the gap between advances in machine learning/generative AI and neuroscience by investigating which generative principles and ML characterizations could be operative in the brain.

Method: Analysis and discussion of five key examples from ML research: shortcomings of world modelling, generation of thought processes, attention mechanisms, neural scaling laws, and quantization techniques.

Result: Identifies specific ML concepts and principles that could provide valuable insights for understanding neural information processing in the brain.

Conclusion: Neuroscience has much to learn from ML research, particularly in understanding generative principles and neural information processing characterizations that may be relevant to brain function.

Abstract: The big strides seen in generative AI are not based on somewhat obscure algorithms, but due to clearly defined generative principles. The resulting concrete implementations have proven themselves in large numbers of applications. We suggest that it is imperative to thoroughly investigate which of these generative principles may be operative also in the brain, and hence relevant for cognitive neuroscience. In addition, ML research led to a range of interesting characterizations of neural information processing systems. We discuss five examples, the shortcomings of world modelling, the generation of thought processes, attention, neural scaling laws, and quantization, that illustrate how much neuroscience could potentially learn from ML research.

[281] PersonaDrift: A Benchmark for Temporal Anomaly Detection in Language-Based Dementia Monitoring

Joy Lai, Alex Mihailidis

Main category: cs.AI

TL;DR: PersonaDrift is a synthetic benchmark for evaluating methods to detect progressive communication changes in dementia patients, simulating 60-day interaction logs with caregiver-informed personas and testing various detection approaches.

Details

Motivation: Current computational tools don't effectively track gradual communication changes in dementia patients, which caregivers notice informally but lack systematic monitoring tools for.

Method: Created synthetic 60-day interaction logs based on caregiver interviews, simulating personas with varying communication styles and injecting progressive changes (flattened sentiment and off-topic replies) at different rates. Evaluated anomaly detection, statistical methods, sequence models, and supervised classifiers.

Result: Flattened sentiment detectable with simple statistical models in low-variability users, while semantic drift requires temporal modeling and personalized baselines. Personalized classifiers consistently outperform generalized ones.

Conclusion: Individual behavioral context is crucial for detecting communication changes in dementia, with personalized approaches showing superior performance over generalized methods.

Abstract: People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.

[282] Utilizing Large Language Models for Zero-Shot Medical Ontology Extension from Clinical Notes

Guanchen Wu, Yuzhang Xie, Huanwei Wu, Zhe He, Hui Shao, Xiao Hu, Carl Yang

Main category: cs.AI

TL;DR: CLOZE is a zero-shot framework that uses LLMs to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies, requiring no training data while preserving patient privacy.

Details

Motivation: Clinical notes contain valuable medical insights but are underutilized for ontology extension due to their unstructured nature and privacy concerns.

Method: Uses pre-trained LLMs to identify disease-related concepts and hierarchical relationships from clinical notes in a zero-shot approach with automated PHI removal.

Result: CLOZE provides accurate, scalable, and privacy-preserving ontology extension, demonstrating strong performance in extracting medical entities and relationships.

Conclusion: CLOZE offers a cost-efficient solution for enhancing medical ontologies using clinical notes, with potential applications in biomedical research and clinical informatics.

Abstract: Integrating novel medical concepts and relationships into existing ontologies can significantly enhance their coverage and utility for both biomedical research and clinical applications. Clinical notes, as unstructured documents rich with detailed patient observations, offer valuable context-specific insights and represent a promising yet underutilized source for ontology extension. Despite this potential, directly leveraging clinical notes for ontology extension remains largely unexplored. To address this gap, we propose CLOZE, a novel framework that uses large language models (LLMs) to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies. By capitalizing on the strong language understanding and extensive biomedical knowledge of pre-trained LLMs, CLOZE effectively identifies disease-related concepts and captures complex hierarchical relationships. The zero-shot framework requires no additional training or labeled data, making it a cost-efficient solution. Furthermore, CLOZE ensures patient privacy through automated removal of protected health information (PHI). Experimental results demonstrate that CLOZE provides an accurate, scalable, and privacy-preserving ontology extension framework, with strong potential to support a wide range of downstream applications in biomedical research and clinical informatics.

[283] Consciousness in Artificial Intelligence? A Framework for Classifying Objections and Constraints

Andres Campero, Derek Shiller, Jaan Aru, Jonathan Simon

Main category: cs.AI

TL;DR: A taxonomical framework for classifying challenges to digital AI consciousness, distinguishing levels of granularity and degrees of force from functionalism challenges to strict impossibility claims.

Details

Motivation: To provide structure and tools for disambiguating between different types of challenges to the possibility of consciousness in digital AI systems, as current debates often conflate different levels and degrees of argument.

Method: Developed a framework using Marr’s levels of analysis to identify granularity, and three degrees of force: challenges to computational functionalism (degree 1), practical improbability (degree 2), and strict impossibility (degree 3). Applied this framework to 14 prominent examples from literature.

Result: Created a systematic classification system that can clearly distinguish between different types of challenges to digital consciousness, providing clarity in the ongoing debate about AI consciousness.

Conclusion: The framework offers a valuable tool for structuring debates about digital consciousness by disambiguating challenges at different levels of granularity and degrees of force, without taking sides in the substantive debate.

Abstract: We develop a taxonomical framework for classifying challenges to the possibility of consciousness in digital artificial intelligence systems. This framework allows us to identify the level of granularity at which a given challenge is intended (the levels we propose correspond to Marr’s levels) and to disambiguate its degree of force: is it a challenge to computational functionalism that leaves the possibility of digital consciousness open (degree 1), a practical challenge to digital consciousness that suggests improbability without claiming impossibility (degree 2), or an argument claiming that digital consciousness is strictly impossible (degree 3)? We apply this framework to 14 prominent examples from the scientific and philosophical literature. Our aim is not to take a side in the debate, but to provide structure and a tool for disambiguating between challenges to computational functionalism and challenges to digital consciousness, as well as between different ways of parsing such challenges.

[284] Formal Abductive Latent Explanations for Prototype-Based Networks

Jules Soria, Zakaria Chihani, Julien Girard-Satabin, Alban Grastien, Romain Xu-Darme, Daniela Cancila

Main category: cs.AI

TL;DR: The paper introduces Abductive Latent Explanations (ALEs) to address misleading explanations in case-based reasoning networks by providing formal guarantees about predictions based on latent representations.

Details

Motivation: Case-based reasoning models provide explanations by pointing to prototypes, but these explanations can be misleading as different instances with the same explanation may lead to different predictions, which is problematic for safety-critical applications.

Method: Proposed ALEs formalism that expresses sufficient conditions on latent representations to imply predictions. Developed a solver-free, scalable algorithm using three distinct paradigms to generate these explanations, combining case-based reasoning interpretability with formal XAI guarantees.

Result: Demonstrated feasibility on diverse datasets for both standard and fine-grained image classification, showing that ALEs provide more reliable explanations than traditional prototype-based approaches.

Conclusion: ALEs successfully bridge the gap between inherent interpretability of case-based reasoning models and formal guarantees from XAI, providing more trustworthy explanations for safety-critical applications.

Abstract: Case-based reasoning networks are machine-learning models that make predictions based on similarity between the input and prototypical parts of training samples, called prototypes. Such models are able to explain each decision by pointing to the prototypes that contributed the most to the final outcome. As the explanation is a core part of the prediction, they are often qualified as ``interpretable by design". While promising, we show that such explanations are sometimes misleading, which hampers their usefulness in safety-critical contexts. In particular, several instances may lead to different predictions and yet have the same explanation. Drawing inspiration from the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism to express sufficient conditions on the intermediate (latent) representation of the instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models and the guarantees provided by formal XAI. We propose a solver-free and scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and present the feasibility of our approach on diverse datasets for both standard and fine-grained image classification. The associated code can be found at https://github.com/julsoria/ale

[285] You Only Forward Once: An Efficient Compositional Judging Paradigm

Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang

Main category: cs.AI

TL;DR: YOFO is a template-conditioned method that enables multimodal LLMs to judge all requirements in a single forward pass, achieving orders-of-magnitude speedups while maintaining interpretability.

Details

Motivation: Existing MLLM judging approaches face a trade-off: single-score adaptation misaligns with generative nature and limits fine-grained understanding, while autoregressive analysis generation is too slow for high-throughput settings.

Method: YOFO accepts a structured requirement template and produces binary yes/no decisions for each requirement in one inference step by reading the logits of the final token associated with each requirement.

Result: Extensive experiments show YOFO achieves state-of-the-art results on standard recommendation datasets, supports dependency-aware analysis, and benefits from post-hoc Chain of Thought.

Conclusion: YOFO provides an efficient and interpretable solution for MLLM-based judgment that overcomes the fundamental trade-off between accuracy and speed in existing approaches.

Abstract: Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis-where subsequent judgments are conditioned on previous ones-and further benefits from post-hoc CoT.

[286] Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Haozhe Shan, Junbo Qi, Yan Bai, Dengjie Li, Jiachen Luo, Yidong Wang, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju

Main category: cs.AI

TL;DR: DPPO is a metacognitive training framework that alternates between supervised fine-tuning and reinforcement learning to efficiently learn from sparse data, achieving significant performance improvements in embodied intelligence systems.

Details

Motivation: To overcome the embodied data bottleneck (scarce real-world data) and algorithmic inefficiency of existing methods that are resource-prohibitive for developing universal embodied intelligence systems.

Method: Deliberate Practice Policy Optimization (DPPO) - a metacognitive “Metaloop” framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement) to automatically identify weaknesses and allocate resources efficiently.

Result: Training a vision-language embodied model (Pelican-VL 1.0) with DPPO achieved 20.3% performance improvement over base model and surpassed open-source models at 100B-parameter scale by 10.6%.

Conclusion: DPPO provides the first systematic framework that alleviates data and resource bottlenecks, enabling efficient development of versatile embodied agents, with models and code being open-sourced.

Abstract: Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive ``Metaloop’’ training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.

[287] MedBayes-Lite: Bayesian Uncertainty Quantification for Safe Clinical Decision Support

Elias Hossain, Md Mehedi Hasan Nipu, Maleeha Sheikh, Rajib Rana, Subash Neupane, Niloofar Yousefi

Main category: cs.AI

TL;DR: MedBayes-Lite is a lightweight Bayesian enhancement for clinical transformers that improves uncertainty calibration without retraining, reducing overconfidence by 32-48% and preventing up to 41% of diagnostic errors.

Details

Motivation: Transformers in clinical applications are prone to overconfidence, especially in ambiguous medical cases where calibrated uncertainty is critical for reliable decision support.

Method: Integrates three components: Bayesian Embedding Calibration using Monte Carlo dropout, Uncertainty-Weighted Attention that marginalizes over token reliability, and Confidence-Guided Decision Shaping inspired by clinical risk minimization.

Result: Consistently improves calibration and trustworthiness across biomedical QA and clinical prediction benchmarks (MedQA, PubMedQA, MIMIC-III), reducing overconfidence by 32-48% and preventing up to 41% of diagnostic errors in simulated clinical settings.

Conclusion: MedBayes-Lite effectively enables reliable uncertainty propagation and improves interpretability in medical AI systems without requiring retraining or significant architectural changes.

Abstract: We propose MedBayes-Lite, a lightweight Bayesian enhancement for transformer-based clinical language models designed to produce reliable, uncertainty-aware predictions. Although transformers show strong potential for clinical decision support, they remain prone to overconfidence, especially in ambiguous medical cases where calibrated uncertainty is critical. MedBayes-Lite embeds uncertainty quantification directly into existing transformer pipelines without any retraining or architectural rewiring, adding no new trainable layers and keeping parameter overhead under 3 percent. The framework integrates three components: (i) Bayesian Embedding Calibration using Monte Carlo dropout for epistemic uncertainty, (ii) Uncertainty-Weighted Attention that marginalizes over token reliability, and (iii) Confidence-Guided Decision Shaping inspired by clinical risk minimization. Across biomedical QA and clinical prediction benchmarks (MedQA, PubMedQA, MIMIC-III), MedBayes-Lite consistently improves calibration and trustworthiness, reducing overconfidence by 32 to 48 percent. In simulated clinical settings, it can prevent up to 41 percent of diagnostic errors by flagging uncertain predictions for human review. These results demonstrate its effectiveness in enabling reliable uncertainty propagation and improving interpretability in medical AI systems.

[288] Enhancing Forex Forecasting Accuracy: The Impact of Hybrid Variable Sets in Cognitive Algorithmic Trading Systems

Juan C. King, Jose M. Amigo

Main category: cs.AI

TL;DR: Implementation of an AI-based algorithmic trading system for EUR-USD Forex pair using both fundamental and technical analysis features, with comparative evaluation of which feature class provides better predictive capacity.

Details

Motivation: To develop an advanced AI trading system for high-frequency Forex trading that integrates both fundamental macroeconomic variables and technical indicators to generate profitable signals.

Method: Integrates fundamental macroeconomic variables (GDP, Unemployment Rate from Euro Zone and US) with technical variables (indicators, oscillators, Fibonacci levels, price divergences) in an AI-based algorithmic trading system.

Result: Algorithm performance evaluated using machine learning metrics for predictive accuracy and backtesting simulations on historical data to assess trading profitability and risk.

Conclusion: Comparative analysis determines which class of input features (fundamental or technical) provides greater and more reliable predictive capacity for generating profitable trading signals.

Abstract: This paper presents the implementation of an advanced artificial intelligence-based algorithmic trading system specifically designed for the EUR-USD pair within the high-frequency environment of the Forex market. The methodological approach centers on integrating a holistic set of input features: key fundamental macroeconomic variables (for example, Gross Domestic Product and Unemployment Rate) collected from both the Euro Zone and the United States, alongside a comprehensive suite of technical variables (including indicators, oscillators, Fibonacci levels, and price divergences). The performance of the resulting algorithm is evaluated using standard machine learning metrics to quantify predictive accuracy and backtesting simulations across historical data to assess trading profitability and risk. The study concludes with a comparative analysis to determine which class of input features, fundamental or technical, provides greater and more reliable predictive capacity for generating profitable trading signals.

[289] Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov

Main category: cs.AI

TL;DR: This paper analyzes differences between human and LLM reasoning, develops a cognitive taxonomy, and shows that test-time guidance can improve LLM performance by up to 60% on complex problems.

Details

Motivation: To understand why LLMs solve complex problems but fail on simpler variants, and to bridge cognitive science with LLM research to develop models that reason through principled cognitive mechanisms rather than brittle shortcuts.

Method: Created a taxonomy of 28 cognitive elements from cognitive science research, analyzed 170K reasoning traces from 17 models across modalities, compared with 54 human think-aloud traces, and conducted meta-analysis of 1,598 LLM reasoning papers.

Result: Revealed systematic structural differences: humans use hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining. Models have behavioral repertoires associated with success but fail to deploy them spontaneously.

Conclusion: Developed test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 60% on complex problems, establishing a foundation for models that reason through principled cognitive mechanisms.

Abstract: Large language models solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, then analyze their behavioral manifestations in reasoning traces. We propose a fine-grained cognitive evaluation framework and conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces, which we make publicly available. Our analysis reveals systematic structural differences: humans employ hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining, with divergence most pronounced on ill-structured problems. Meta-analysis of 1,598 LLM reasoning papers reveals the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%, evaluation: 8%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffold successful structures, improving performance by up to 60% on complex problems. By bridging cognitive science and LLM research, we establish a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious reasoning shortcuts or memorization, opening new directions for both improving model capabilities and testing theories of human cognition at scale.

[290] The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Main category: cs.AI

TL;DR: LRMs show improved reasoning but face accuracy collapse at high complexity, exhibit counterintuitive scaling limits, and have three performance regimes compared to standard LLMs.

Details

Motivation: To systematically investigate the fundamental capabilities, scaling properties, and limitations of Large Reasoning Models (LRMs) beyond current evaluation paradigms that focus mainly on final answer accuracy.

Method: Used controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures, enabling analysis of both final answers and internal reasoning traces.

Result: LRMs face complete accuracy collapse beyond certain complexities, show counterintuitive scaling limits where reasoning effort increases then declines despite token budget, and exhibit three performance regimes compared to standard LLMs.

Conclusion: LRMs have limitations in exact computation, fail to use explicit algorithms, reason inconsistently across scales, and their reasoning capabilities raise questions despite showing advantages in medium-complexity tasks.

Abstract: Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrates advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.

[291] Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Lifan Yuan, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jiabin Yu, Peixue Wu, Jinchen He, Yifan Su, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Yunkai Wang, Farshid Jafarpour, Yong Zhao, Xinan Chen, Jessie Shelton, Aaron W. Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Christopher Wilson, Xuefei Guo, Juntai Zhou, Daniel Inafuku, Chi Xue, Luyu Gao, Ze Yang, Yaïr Hein, Yonatan Kahn, Kevin Zhou, Di Luo, John Drew Wilson, Jarrod T. Reilly, Dmytro Bandak, Ofir Press, Liang Yang, Xueying Wang, Hao Tong, Nicolas Chia, Eliu Huerta, Hao Peng

Main category: cs.AI

TL;DR: CritPt is the first benchmark testing LLMs on unpublished, research-level physics reasoning tasks across multiple physics domains, created by active researchers. Current LLMs show limited capability (5.7-10% accuracy) on full research challenges.

Details

Motivation: To assess if LLMs can reason through complex, open-ended challenges in frontier physics research and understand what reasoning tasks physicists actually need assistance with.

Method: Created CritPt benchmark with 71 composite research challenges simulating full-scale projects and 190 simpler checkpoint tasks, all newly created by 50+ active physics researchers. Problems are guess-resistant with machine-verifiable answers, evaluated through automated physics-specific grading pipeline.

Result: Current state-of-the-art LLMs show early promise on isolated checkpoints but perform poorly on full research challenges: best base model accuracy is 5.7% (GPT-5), rising to ~10% with coding tools. Large gap exists between current capabilities and physics research demands.

Conclusion: CritPt reveals significant disconnect between current LLM capabilities and realistic physics research needs, providing foundation for developing scientifically grounded AI tools.

Abstract: While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced “critical point”), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.

[292] Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions

Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang, Jiang Wu, Jun Xu, Fu Zhang, Wenbo Lei, Annie Wang, Peng Gong

Main category: cs.AI

TL;DR: A multi-dimensional data analysis method using LLM agents and Knowledge Graphs to create a dynamic, collaborative analytical ecosystem for product data analysis.

Details

Motivation: Address challenges in extracting insights from massive, heterogeneous multi-dimensional data, overcoming LLM hallucination issues with structured knowledge and KG static limitations.

Method: Uses LLM agents to automatically extract product data from unstructured data, constructs and visualizes KG in real-time, and provides interactive platform for deep exploration of graph nodes.

Result: Significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis.

Conclusion: Provides new ideas and tools for multi-dimensional data analysis through dynamic LLM-KG interaction.

Abstract: In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from “hallucination” issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.

[293] Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance

Pratinav Seth, Vinay Kumar Sankarapu

Main category: cs.AI

TL;DR: The paper proposes a Governance by Metrics paradigm that treats explainability evaluation as a central mechanism for private AI governance, addressing current fragmentation and manipulation issues in XAI metrics.

Details

Motivation: Current XAI evaluation metrics are fragmented and prone to manipulation, undermining accountability and compliance in high-stakes AI applications. Private actors need standardized metrics to assess trustworthiness.

Method: The authors propose a Governance by Metrics framework with a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment. They extend evaluation from model introspection to systemic accountability through conceptual synthesis and alignment with governance standards.

Result: The framework positions XAI metrics as both technical and regulatory instruments that help prevent alignment faking and support model alignment by ensuring behavioral integrity in GPAI systems.

Conclusion: The paper outlines a roadmap for integrating explainability metrics into continuous AI assurance pipelines that serve both private oversight and regulatory needs, establishing standardized metrics as governance primitives for effective private AI oversight.

Abstract: Reliable explainability is not only a technical goal but also a cornerstone of private AI governance. As AI models enter high-stakes sectors, private actors such as auditors, insurers, certification bodies, and procurement agencies require standardized evaluation metrics to assess trustworthiness. However, current XAI evaluation metrics remain fragmented and prone to manipulation, which undermines accountability and compliance. We argue that standardized metrics can function as governance primitives, embedding auditability and accountability within AI systems for effective private oversight. Building upon prior work in XAI benchmarking, we identify key limitations in ensuring faithfulness, tamper resistance, and regulatory alignment. Furthermore, interpretability can directly support model alignment by providing a verifiable means of ensuring behavioral integrity in General Purpose AI (GPAI) systems. This connection between interpretability and alignment positions XAI metrics as both technical and regulatory instruments that help prevent alignment faking, a growing concern among oversight bodies. We propose a Governance by Metrics paradigm that treats explainability evaluation as a central mechanism of private AI governance. Our framework introduces a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment, extending evaluation from model introspection toward systemic accountability. Through conceptual synthesis and alignment with governance standards, we outline a roadmap for integrating explainability metrics into continuous AI assurance pipelines that serve both private oversight and regulatory needs.

[294] PoE-World: Compositional World Modeling with Products of Programmatic Experts

Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis

Main category: cs.AI

TL;DR: Introduces PoE-World, a program synthesis method using LLMs to create world models as exponentially-weighted products of programmatic experts, enabling efficient learning from sparse observations in complex domains like Atari games.

Details

Motivation: Traditional deep learning world models require large datasets and lack flexibility for sparse observations. Program synthesis with LLMs offers an alternative for better generalization from limited data, but current applications are limited to simple domains.

Method: Represent world models as exponentially-weighted products of programmatic experts synthesized by LLMs. This approach learns complex, stochastic world models from few observations and embeds them in model-based planning agents.

Result: Successfully learned complex world models from sparse data and demonstrated efficient performance and generalization to unseen levels on Atari’s Pong and Montezuma’s Revenge games.

Conclusion: PoE-World enables effective modeling of complex, non-gridworld domains using program synthesis with LLMs, supporting strong generalization from limited observations and advancing world model learning for AI agents.

Abstract: Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari’s Pong and Montezuma’s Revenge. We release our code and display the learned world models and videos of the agent’s gameplay at https://topwasu.github.io/poe-world.

Pantelis Dogoulis, Fabien Bernier, Félix Fourreau, Karim Tit, Maxime Cordy

Main category: cs.AI

TL;DR: A diffusion-based framework for refining machine learning outputs to satisfy hard constraints like physical laws or structured dependencies, applicable to non-convex/nonlinear constraints and usable with any base model.

Details

Motivation: Many ML tasks require outputs that satisfy hard constraints (physical laws, graph dependencies, tabular relationships), but existing methods are limited to specific domains or simple linear/convex constraints.

Method: Uses denoising diffusion implicit models (DDIMs) to iteratively refine coarse predictions through deterministic diffusion trajectories, guided by learned priors and augmented with constraint gradient corrections.

Result: Demonstrated on constrained adversarial attacks on tabular data and AC power flow prediction, improving both constraint satisfaction and performance while remaining lightweight and model-agnostic.

Conclusion: The proposed diffusion-guided refinement framework effectively handles diverse non-convex/nonlinear constraints, enhances constraint satisfaction and performance, and works post hoc with any base model.

Abstract: Many real-world machine learning tasks require outputs that satisfy hard constraints, such as physical conservation laws, structured dependencies in graphs, or column-level relationships in tabular data. Existing approaches rely either on domain-specific architectures and losses or on strong assumptions on the constraint space, restricting their applicability to linear or convex constraints. We propose a general-purpose framework for constraint-aware refinement that leverages denoising diffusion implicit models (DDIMs). Starting from a coarse prediction, our method iteratively refines it through a deterministic diffusion trajectory guided by a learned prior and augmented by constraint gradient corrections. The approach accommodates a wide class of non-convex and nonlinear equality constraints and can be applied post hoc to any base model. We demonstrate the method in two representative domains: constrained adversarial attack generation on tabular data with column-level dependencies and in AC power flow prediction under Kirchhoff’s laws. Across both settings, our diffusion-guided refinement improves both constraint satisfaction and performance while remaining lightweight and model-agnostic.

[296] Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments

Zhenliang Zhang, Yuxi Wang, Hongzhao Xie, Shiyun Zhao, Mingyuan Liu, Yujie Lu, Xinyi He, Zhenku Cheng, Yujia Peng

Main category: cs.AI

TL;DR: Current multimodal large language models (MLLMs) perform poorly on composite tasks requiring object understanding, spatial intelligence, and social activity, revealing a significant gap from general intelligence requirements.

Details

Motivation: To assess whether embodied agents powered by MLLMs can solve composite tasks that require diverse capabilities, similar to early childhood development activities.

Method: Designed composite tasks in a simulated home environment spanning three domains (object understanding, spatial intelligence, social activity) and evaluated 17 leading proprietary and open-source MLLMs.

Result: Consistently poor performance across all three domains, showing substantial gap between current MLLM capabilities and general intelligence requirements.

Conclusion: The tasks provide a preliminary framework for evaluating embodied agents’ general capabilities, representing an early but significant step toward developing embodied MLLMs for real-world deployment.

Abstract: A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object understanding, spatial intelligence, and social activity. We evaluated 17 leading proprietary and open-source MLLMs on these tasks. The results consistently showed poor performance across all three domains, indicating a substantial gap between current capabilities and general intelligence requirements. Together, our tasks offer a preliminary framework for evaluating the general capabilities of embodied agents, marking an early but significant step toward the development of embodied MLLMs and their real-world deployment.

[297] FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

Main category: cs.AI

TL;DR: A benchmark for evaluating multimodal RAG pipelines with human-created questions across text, tables, and images, showing closed-source models outperform open-source ones especially on multimodal tasks.

Details

Motivation: Existing benchmarks focus on specific aspects like retrieval, but there's a need to evaluate RAG pipelines holistically across multiple modalities and document types.

Method: Created a dataset of 93 multimodal questions, developed phrase-level recall and hallucination detection metrics, and evaluated 6 pipelines (2 open-source, 4 closed-source) with human validation.

Result: Closed-source pipelines significantly outperformed open-source ones in correctness and hallucination metrics, with larger gaps on multimodal and cross-document questions. Human evaluation showed strong agreement with automated metrics.

Conclusion: The benchmark effectively evaluates multimodal RAG pipelines, revealing significant performance differences between open-source and closed-source models, particularly for complex multimodal reasoning tasks.

Abstract: Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline’s ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline’s ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating “strongly agree”).

[298] Towards Efficient Multimodal Unified Reasoning Model via Model Merging

Qixiang Yin, Huanjin Yao, Jianghao Chen, Jiaxing Huang, Zhicheng Zhao, Fei Su

Main category: cs.AI

TL;DR: Tiny-R1V is a 3B lightweight multimodal model that achieves faster inference and higher accuracy through two-stage optimization: LIPO for training specialist models and AMM for merging them into a unified architecture.

Details

Motivation: Address challenges in MLLMs including reasoning efficiency, large model size, and overthinking, while balancing high efficiency and performance at small scale.

Method: Two-stage optimization: 1) LIPO - Length-Informed Relative Policy Optimization to train specialist models (math, chart, OCR reasoning) by prioritizing concise high-quality responses; 2) AMM - Adaptive Model Merging that merges multiple specialist models into unified architecture using gradient projection regularization.

Result: Extensive evaluations on ten reasoning benchmarks show superior performance in mathematics, structured data, OCR, and general capabilities, enabling lightweight models to excel in diverse multimodal reasoning tasks.

Conclusion: Tiny-R1V demonstrates that lightweight models can achieve excellent performance in multimodal reasoning through efficient two-stage optimization, providing faster inference with higher accuracy.

Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter challenges in terms of reasoning efficiency, large model size and overthinking. However, existing lightweight MLLMs lack the capability to balance high efficiency and performance at a small scale. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization, while unifying multimodal reasoning across multiple tasks with fewer inference tokens. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a new reinforcement learning method, to train each reasoning model, including mathematical reasoning, chart reasoning, and OCR capability. The LIPO dynamically adjusts the advantages of responses within groups by prioritizing concise yet high-quality responses to encourage the generation of shorter and more accurate responses. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method that merges multiple specialist models into a unified architecture. Specifically, AMM adaptively adjusts the weights of task vectors via a novel gradient projection regularization loss function, thus mitigating redundant conflicts between them. Extensive evaluations on ten widely-used reasoning benchmarks covering mathematics, structured data (charts, tables, documents), OCR, and general capabilities showcase the superior performance of Tiny-R1V, enabling lightweight models to excel in diverse multimodal reasoning tasks. Code will be available at \href{https://github.com/buptyqx/Tiny-R1V}{https://github.com/buptyqx/Tiny-R1V}

[299] Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models

Wenhan Yu, Xinbo Lin, Lanxin Ni, Jinhua Cheng, Lei Sha

Main category: cs.AI

TL;DR: MSLR is the first Chinese multi-step legal reasoning dataset using IRAC framework, created via Human-LLM collaborative annotation, showing LLMs struggle with complex legal reasoning but Self-Initiated Chain-of-Thought improves performance.

Details

Motivation: Existing legal benchmarks conflate factual recall with genuine inference, fragment reasoning processes, and overlook reasoning quality, creating a need for better evaluation of LLMs' legal reasoning capabilities.

Method: Created MSLR dataset using IRAC framework from real judicial decisions, developed scalable Human-LLM collaborative annotation pipeline for step-level reasoning annotations, and tested Self-Initiated Chain-of-Thought prompts.

Result: LLMs show only moderate performance on MSLR, highlighting legal reasoning challenges. Self-Initiated Chain-of-Thought prompts outperform human-designed prompts in improving reasoning coherence and quality.

Conclusion: MSLR advances LLM reasoning and Chain-of-Thought strategies, providing open resources for legal AI research while demonstrating the difficulty of adapting LLMs to complex legal reasoning tasks.

Abstract: Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.

[300] SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

Xin Gao, Shaohan Yu, Zerui Chen, Yueming Lyu, Weichen Yu, Guanghao Li, Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si

Main category: cs.AI

TL;DR: SafeRBench is the first benchmark that assesses Large Reasoning Models’ safety end-to-end across inputs, intermediate reasoning, and final outputs, addressing dynamic risks in reasoning traces that existing evaluations miss.

Details

Motivation: Large Reasoning Models improve answer quality through chain-of-thought reasoning, but this capability introduces new safety risks where harmful content can be subtly injected, gradually surfaced, or justified by misleading rationales within reasoning traces. Existing safety evaluations primarily focus on output-level judgments and fail to capture these dynamic risks along the reasoning process.

Method: SafeRBench uses three key approaches: (1) Input characterization incorporating risk categories and levels with balanced prompt suite reflecting diverse harm gradients; (2) Fine-grained output analysis using micro-thought chunking to segment reasoning traces into semantically coherent units for evaluation across ten safety dimensions; (3) Human safety alignment by validating LLM-based evaluations against human annotations for safety judgments.

Result: Evaluations on 19 Large Reasoning Models demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.

Conclusion: SafeRBench provides the first comprehensive framework for end-to-end safety assessment of Large Reasoning Models, capturing dynamic risks throughout the reasoning process that traditional output-focused evaluations miss.

Abstract: Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end – from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.

[301] As If We’ve Met Before: LLMs Exhibit Certainty in Recognizing Seen Files

Haodong Li, Jingqi Zhang, Xiao Cheng, Peihua Mai, Haoyu Wang, Yan Pang

Main category: cs.AI

TL;DR: COPYCHECK is a novel framework that uses uncertainty signals to detect copyrighted content in LLM training data, achieving over 90% accuracy without needing ground truth data or empirical thresholds.

Details

Motivation: Address concerns about unauthorized use of copyrighted material in LLM training by overcoming limitations of existing Membership Inference Attacks, particularly LLM overconfidence and lack of ground truth data.

Method: Leverages LLM uncertainty patterns to distinguish seen/unseen content, uses strategic file segmentation into snippets, and implements uncertainty-guided unsupervised clustering to avoid threshold tuning.

Result: Achieves 90.1% balanced accuracy on LLaMA 7b and 91.6% on LLaMA2 7b, with over 90% relative improvement over SOTA baselines and strong generalization across architectures like GPT-J 6B.

Conclusion: First application of uncertainty for copyright detection in LLMs, providing practical tools for training data transparency by turning LLM overconfidence into an asset rather than a limitation.

Abstract: The remarkable language ability of Large Language Models (LLMs) stems from extensive training on vast datasets, often including copyrighted material, which raises serious concerns about unauthorized use. While Membership Inference Attacks (MIAs) offer potential solutions for detecting such violations, existing approaches face critical limitations and challenges due to LLMs’ inherent overconfidence, limited access to ground truth training data, and reliance on empirically determined thresholds. We present COPYCHECK, a novel framework that leverages uncertainty signals to detect whether copyrighted content was used in LLM training sets. Our method turns LLM overconfidence from a limitation into an asset by capturing uncertainty patterns that reliably distinguish between seen" (training data) and unseen" (non-training data) content. COPYCHECK further implements a two-fold strategy: (1) strategic segmentation of files into smaller snippets to reduce dependence on large-scale training data, and (2) uncertainty-guided unsupervised clustering to eliminate the need for empirically tuned thresholds. Experiment results show that COPYCHECK achieves an average balanced accuracy of 90.1% on LLaMA 7b and 91.6% on LLaMA2 7b in detecting seen files. Compared to the SOTA baseline, COPYCHECK achieves over 90% relative improvement, reaching up to 93.8% balanced accuracy. It further exhibits strong generalizability across architectures, maintaining high performance on GPT-J 6B. This work presents the first application of uncertainty for copyright detection in LLMs, offering practical tools for training data transparency.

cs.SD

[302] MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Suhwan Choi, Kyu Won Kim, Myungjoo Kang

Main category: cs.SD

TL;DR: MMVA is a tri-modal encoder framework that captures emotional content across images, music, and captions using valence-arousal matching, achieving SOTA performance and strong zero-shot capabilities.

Details

Motivation: To develop a unified framework that can effectively capture and match emotional content across different modalities (images, music, captions) using continuous valence and arousal values.

Method: Proposes MMVA framework with multimodal matching scores based on continuous valence (emotional positivity) and arousal (emotional intensity) values. Expands IMEMNet dataset to IMEMNet-C with 24,756 images and 25,944 music clips with captions. Uses random sampling of image-music pairs during training by computing similarity scores from valence-arousal values.

Result: Achieves state-of-the-art performance in valence-arousal prediction tasks. Demonstrates strong efficacy in various zero-shot tasks, showing the potential of valence and arousal predictions for downstream applications.

Conclusion: The MMVA framework successfully captures emotional content across multiple modalities using valence-arousal matching, enabling effective cross-modal emotion understanding and transfer to various applications.

Abstract: We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zeroshot tasks, highlighting the potential of valence and arousal predictions in downstream applications.

[303] SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise

Rui Sang, Yuxuan Liu

Main category: cs.SD

TL;DR: SceneGuard protects against voice cloning by adding scene-consistent audible background noise to speech recordings, making it robust to common audio preprocessing countermeasures.

Details

Motivation: Existing voice protection methods using imperceptible adversarial perturbations are vulnerable to audio preprocessing like denoising and compression, creating a need for more robust protection.

Method: Apply scene-consistent audible background noise (e.g., airport, street, park sounds) to speech recordings during training to create contextually appropriate protection that resists countermeasures.

Result: Achieved 5.5% speaker similarity degradation with high statistical significance (p < 10^{-15}, Cohen’s d = 2.18) while maintaining 98.6% speech intelligibility. Robust against MP3 compression, spectral subtraction, lowpass filtering, and downsampling.

Conclusion: Audible, scene-consistent noise provides a more robust alternative to imperceptible perturbations for training-time voice protection against cloning attacks.

Abstract: Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We propose SceneGuard, a training-time voice protection method that applies scene-consistent audible background noise to speech recordings. Unlike imperceptible perturbations, SceneGuard leverages naturally occurring acoustic scenes (e.g., airport, street, park) to create protective noise that is contextually appropriate and robust to countermeasures. We evaluate SceneGuard on text-to-speech training attacks, demonstrating 5.5% speaker similarity degradation with extremely high statistical significance (p < 10^{-15}, Cohen’s d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness evaluation shows that SceneGuard maintains or enhances protection under five common countermeasures including MP3 compression, spectral subtraction, lowpass filtering, and downsampling. Our results suggest that audible, scene-consistent noise provides a more robust alternative to imperceptible perturbations for training-time voice protection. The source code are available at: https://github.com/richael-sang/SceneGuard.

[304] Difficulty-Controlled Simplification of Piano Scores with Synthetic Data for Inclusive Music Education

Pedro Ramoneda, Emilia Parada-Cabaleiro, Dasaem Jeong, Xavier Serra

Main category: cs.SD

TL;DR: Transformer-based method for adjusting MusicXML piano score difficulty using synthetic dataset, enabling open-source music education AI.

Details

Motivation: To democratize AI in music education by overcoming proprietary system limitations and enabling difficulty adjustment for inclusive learning.

Method: Transformer-based approach using synthetic dataset of piano score pairs ordered by difficulty, generated through melody/harmony-conditioned variations with pretrained difficulty/style assessment.

Result: Validated approach showing accurate control of playability and target difficulty through qualitative and quantitative evaluations.

Conclusion: Open release of all resources enables reproducibility and fosters open-source innovation to bridge the digital divide in music education.

Abstract: Despite its potential, AI advances in music education are hindered by proprietary systems that limit the democratization of technology in this domain. In particular, AI-driven music difficulty adjustment is especially promising, as simplifying complex pieces can make music education more inclusive and accessible to learners of all ages and contexts. Nevertheless, recent efforts have relied on proprietary datasets, which prevents the research community from reproducing, comparing, or extending the current state of the art. In addition, while these generative methods offer great potential, most of them use the MIDI format, which, unlike others, such as MusicXML, lacks readability and layout information, thereby limiting their practical use for human performers. This work introduces a transformer-based method for adjusting the difficulty of MusicXML piano scores. Unlike previous methods, which rely on annotated datasets, we propose a synthetic dataset composed of pairs of piano scores ordered by estimated difficulty, with each pair comprising a more challenging and easier arrangement of the same piece. We generate these pairs by creating variations conditioned on the same melody and harmony and leverage pretrained models to assess difficulty and style, ensuring appropriate pairing. The experimental results illustrate the validity of the proposed approach, showing accurate control of playability and target difficulty, as highlighted through qualitative and quantitative evaluations. In contrast to previous work, we openly release all resources (code, dataset, and models), ensuring reproducibility while fostering open-source innovation to help bridge the digital divide.

[305] Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks

Aravindhan G, Yuvaraj Govindarajulu, Parin Shah

Main category: cs.SD

TL;DR: The paper explores cost-efficient white-box and non-transferability black-box adversarial attacks on ASR systems, including novel poisoning attacks that degrade state-of-the-art models’ performance with minimal perturbation.

Details

Motivation: To address vulnerabilities in Automatic Speech Recognition systems to adversarial examples, moving beyond previous constrained white-box attacks and transferability-based black-box attacks to explore more practical and efficient attack methods.

Method: Uses hybrid models combining insights from Fast Gradient Sign Method and Zeroth-Order Optimization to generate subtle adversarial examples with minimal perturbation (35dB SNR) that can be created within a minute.

Result: Successfully demonstrated generation of impactful adversarial examples with very little perturbation that can deceive state-of-the-art ASR models, leading to misinterpretation of audio signals.

Conclusion: The vulnerabilities discovered in state-of-the-art open source ASR models have practical security implications and emphasize the urgent need for improved adversarial security measures in speech recognition systems.

Abstract: Recent studies have demonstrated the vulnerability of Automatic Speech Recognition systems to adversarial examples, which can deceive these systems into misinterpreting input speech commands. While previous research has primarily focused on white-box attacks with constrained optimizations, and transferability based black-box attacks against commercial Automatic Speech Recognition devices, this paper explores cost efficient white-box attack and non transferability black-box adversarial attacks on Automatic Speech Recognition systems, drawing insights from approaches such as Fast Gradient Sign Method and Zeroth-Order Optimization. Further, the novelty of the paper includes how poisoning attack can degrade the performances of state-of-the-art models leading to misinterpretation of audio signals. Through experimentation and analysis, we illustrate how hybrid models can generate subtle yet impactful adversarial examples with very little perturbation having Signal Noise Ratio of 35dB that can be generated within a minute. These vulnerabilities of state-of-the-art open source model have practical security implications, and emphasize the need for adversarial security.

[306] Pitch Estimation With Mean Averaging Smoothed Product Spectrum And Musical Consonance Evaluation Using MASP

Murat Yasar Baskin

Main category: cs.SD

TL;DR: The paper introduces MASP Spectrum, an improved version of Harmonic Product Spectrum for better pitch estimation, and extends it to measure musical consonance using a harmonicity measure.

Details

Motivation: To enhance pitch estimation for frequency spectra that are algorithmically deceptive but still produce clear pitches, and to explore the connection between consonance and periodicity in music perception.

Method: Developed Mean Averaging Smoothed Product (MASP) Spectrum with global mean-based smoothing to reduce sensitivity to missing partials, then extended the algorithm with a harmonicity measure (H) to evaluate musical consonance for multiple tones.

Result: MASP algorithm provided robust pitch estimations consistent with perceptual expectations, and the harmonicity measure yielded consonance hierarchies that align with music theory and perception.

Conclusion: Pitch and consonance perception may share similar underlying mechanisms that depend on spectral characteristics, suggesting a unified approach to understanding auditory perception.

Abstract: This study introduces Mean Averaging Smoothed Product (MASP) Spectrum, which is a modified version of the Harmonic Product Spectrum, designed to enhance pitch estimation for many algorithm-wise deceptive frequency spectra that still lead clear pitches, for both harmonic and inharmonic cases. By introducing a global mean based smoothing for spectrum, the MASP algorithm diminishes the unwanted sensitivity of HPS for spectra with missing partials. The method exhibited robust pitch estimations consistent with perceptual expectations. Motivated upon the strong correlation between consonance and periodicity, the same algorithm is extended and, with the proposition of a harmonicity measure (H), used to evaluate musical consonance for two and three tones; yielding consonance hierarchies that align with perception and practice of music theory. These findings suggest that perception of pitch and consonance may share a similar underlying mechanism that depend on spectrum.

[307] AcousTools: A ‘Full-Stack’, Python-Based, Acoustic Holography Library

Joshua Mukherjee, Giorgos Christopoulos, Zhouyang Shen, Sriram Subramanian, Ryuji Hirayama

Main category: cs.SD

TL;DR: AcousTools is a Python-based acoustic holography library that provides a complete ‘full-stack’ solution for acoustic holography applications, covering setup, acoustic modeling, phase retrieval, field analysis, and hardware control.

Details

Motivation: There is currently no single software framework that provides a complete solution for acoustic holography applications, with existing methods failing to fulfill one or more key requirements in the full-stack process from abstraction to physicalization.

Method: Developed AcousTools, a Python-based library designed to support the full suite of acoustic holographic applications, meeting all steps of the full-stack requirements including setup, acoustic propagation modeling, transducer phase retrieval, sound field analysis, and hardware control.

Result: AcousTools successfully demonstrates the ability to meet each step of the full-stack requirements for acoustic holography applications, providing a uniquely complete suite of features in an easy-to-use Python environment.

Conclusion: AcousTools has the potential to become the standard code library for acoustic holography, enabling researchers to develop novel applications and accurately review others’ work, while also providing a framework for comparing methodologies across the full-stack process.

Abstract: Acoustic Holography is an emerging field where mid-air ultrasound is controlled and manipulated for novel and exciting applications. These range from mid-air haptics, volumetric displays, contactless fabrication, and even chemical and biomedical applications such as drug delivery. To develop these applications, a software framework to predict acoustic behaviour and simulating resulting effects, such as applied forces or scattering patterns is desirable. There have been various software libraries and platforms that attempt to fill this role, but there is yet to be a single piece of software that acts as a ‘full-stack’ solution. We define this full-stack as the process from abstraction to physicalisation starting with setup, modelling acoustic propagation, transducer phase retrieval, sound field analysis, and control of the acoustic holographic hardware itself. Existing methods fail to fulfil one or more of these categories. To address this, we present AcousTools, a Python-based acoustic holography library, designed to support the full suite of acoustic holographic applications and we show AcousTools’s ability to meet each step of the full-stack’s requirements. AcousTools has the potential to become the standard code library for acoustic holography, with the uniquely complete suite of features wrapped in a language that is known to be easy to use, AcousTools will increase the ability for researchers to develop novel applications as well as accurately review other’s work. The full-stack, aside from software, will also be useful for researchers - providing a way to view and compare methodologies by understanding where they fit into the stack.

cs.LG

[308] Extending Test-Time Scaling: A 3D Perspective with Context, Batch, and Turn

Chao Yu, Qixin Tan, Jiaxuan Gao, Shi Yu, Hong Lu, Xinting Yang, Zelai Xu, Yu Wang, Yi Wu, Eugene Vinitsky

Main category: cs.LG

TL;DR: The paper introduces 3D test-time scaling, combining context, batch, and turn scaling to enhance reasoning performance beyond conventional test-time scaling limitations.

Details

Motivation: Test-time scaling is limited by base models' context length, which is much smaller than training tokens. The paper aims to extend test-time reasoning capacity through multi-dimensional scaling.

Method: Proposes 3D test-time scaling framework integrating three dimensions: context-length scaling, batch scaling (parallel sampling), and turn scaling (iterative self-refinement).

Result: Each dimension shows bounded test-time scaling effect; combining all three significantly improves reasoning on challenging benchmarks (IOI, IMO, CPHO) and benefits from human preference feedback; extends to embodied learning for humanoid control.

Conclusion: Multi-dimensional test-time scaling substantially enhances reasoning performance and naturally extends to open-ended domains like embodied learning.

Abstract: Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally limited by the limited context length of base models, which remains orders of magnitude smaller than the amount of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effect and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves the reasoning performance of challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.

[309] Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models

Linnea M. Wolniewicz, Halil S. Kelebek, Simone Mestici, Michael D. Vergalla, Giacomo Acciarini, Bala Poduval, Olga Verkhoglyadova, Madhulika Guhathakurta, Thomas E. Berger, Atılım Güneş Baydin, Frank Soboczenski

Main category: cs.LG

TL;DR: A curated open-access dataset integrating diverse ionospheric and heliospheric measurements for machine learning-ready ionospheric forecasting, with benchmarked spatiotemporal ML models for TEC prediction.

Details

Motivation: Address critical space weather challenges in operational ionosphere forecasting due to sparse observations, complex geospatial coupling, and growing need for accurate predictions supporting GNSS, communications, aviation safety, and satellite operations.

Method: Integrates multiple data sources (Solar Dynamic Observatory, F10.7, solar wind parameters, geomagnetic indices, GIM-TEC) with sparse data (GNSS receiver network, smartphone measurements) into a temporally and spatially aligned modular structure. Trains and benchmarks spatiotemporal ML architectures for TEC forecasting.

Result: Presents a novel heterogeneous dataset and modeling pipeline that enables exploration of ionospheric dynamics and Sun-Earth interactions, supporting both scientific research and operational forecasting efforts.

Conclusion: The work provides an extensive dataset and framework that bridges gaps in current operational forecasting systems, enabling improved ionospheric predictions under various space weather conditions through integrated data-driven approaches.

Abstract: Operational forecasting of the ionosphere remains a critical space weather challenge due to sparse observations, complex coupling across geospatial layers, and a growing need for timely, accurate predictions that support Global Navigation Satellite System (GNSS), communications, aviation safety, as well as satellite operations. As part of the 2025 NASA Heliolab, we present a curated, open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine learning-ready structure, designed specifically to support next-generation forecasting models and address gaps in current operational frameworks. Our workflow integrates a large selection of data sources comprising Solar Dynamic Observatory data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL’s Global Ionospheric Maps of Total Electron Content (GIM-TEC). We also implement geospatially sparse data such as the TEC derived from the World-Wide GNSS Receiver Network and crowdsourced Android smartphone measurements. This novel heterogeneous dataset is temporally and spatially aligned into a single, modular data structure that supports both physical and data-driven modeling. Leveraging this dataset, we train and benchmark several spatiotemporal machine learning architectures for forecasting vertical TEC under both quiet and geomagnetically active conditions. This work presents an extensive dataset and modeling pipeline that enables exploration of not only ionospheric dynamics but also broader Sun-Earth interactions, supporting both scientific inquiry and operational forecasting efforts.

[310] TB or Not TB: Coverage-Driven Direct Preference Optimization for Verilog Stimulus Generation

Bardia Nadimi, Khashayar Filom, Deming Chen, Hao Zheng

Main category: cs.LG

TL;DR: TB or not TB is an LLM-based framework for automated hardware verification stimulus generation using Coverage-Driven Direct Preference Optimization (CD-DPO) that achieves significant coverage improvements.

Details

Motivation: Hardware design verification is time-consuming and resource-intensive, particularly stimulus generation for design under test (DUT), motivating automated solutions using advanced LLMs.

Method: Fine-tuned LLMs using Coverage-Driven Direct Preference Optimization (CD-DPO) with PairaNet dataset containing high- and low-quality testbenches labeled by simulation coverage metrics.

Result: Outperforms both open-source and commercial baselines on CVDP CID12 benchmark, achieving up to 77.27% improvement in code coverage.

Conclusion: Coverage-driven preference optimization is effective for LLM-based hardware verification, demonstrating significant performance gains in automated stimulus generation.

Abstract: With the rapid advancement of Large Language Models (LLMs), there is growing interest in applying them to hardware design and verification. Among these stages, design verification remains the most time-consuming and resource-intensive phase, where generating effective stimuli for the design under test (DUT) is both critical and labor-intensive. We present {\it TB or not TB}, a framework for automated stimulus generation using LLMs fine-tuned through Coverage-Driven Direct Preference Optimization (CD-DPO). To enable preference-based training, we introduce PairaNet, a dataset derived from PyraNet that pairs high- and low-quality testbenches labeled using simulation-derived coverage metrics. The proposed CD-DPO method integrates quantitative coverage feedback directly into the optimization objective, guiding the model toward generating stimuli that maximize verification coverage. Experiments on the CVDP CID12 benchmark show that {\it TB or not TB} outperforms both open-source and commercial baselines, achieving up to 77.27% improvement in code coverage, demonstrating the effectiveness of Coverage-driven preference optimization for LLM-based hardware verification.

[311] Saving Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

Main category: cs.LG

TL;DR: FMPlug is a plug-in framework that enhances foundation flow-matching models for inverse problems by combining instance-guided warm-start strategy with Gaussianity regularization, significantly improving performance across various domains.

Details

Motivation: Foundation flow-matching models currently underperform compared to domain-specific or untrained priors in solving inverse problems, despite their promise as universal priors.

Method: FMPlug combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization to add problem-specific guidance while preserving Gaussian structures.

Result: The framework leads to significant performance boost across image restoration and scientific inverse problems.

Conclusion: FMPlug provides a path for making foundation flow-matching models practical, reusable priors for inverse problem solving.

Abstract: Foundation flow-matching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.

[312] TopoReformer: Mitigating Adversarial Attacks Using Topological Purification in OCR Models

Bhagyesh Kumar, A S Aravinthakashan, Akshat Satyanarayan, Ishaan Gakhar, Ujjwal Verma

Main category: cs.LG

TL;DR: TopoReformer is a model-agnostic defense pipeline that uses topological features to mitigate adversarial perturbations in text images while preserving structural integrity, achieving robustness against various attacks without affecting clean input performance.

Details

Motivation: Adversarial perturbations in text images can fool OCR systems with invisible changes that survive physical capture, posing security risks to document processing and license plate recognition. Existing defenses are model-specific, computationally expensive, and vulnerable to adaptive attacks.

Method: Leverages topological features (connectivity, holes, loops) that remain unchanged under continuous deformations. Uses a topological autoencoder to enforce manifold-level consistency in latent space without explicit gradient regularization.

Result: Benchmarked on EMNIST and MNIST datasets against FGSM, PGD, Carlini-Wagner attacks, adaptive attacks (EOT, BDPA), and OCR-specific watermark attack (FAWA), showing improved robustness.

Conclusion: TopoReformer provides an effective model-agnostic defense that mitigates adversarial perturbations while maintaining text structural integrity, addressing limitations of existing approaches.

Abstract: Adversarially perturbed images of text can cause sophisticated OCR systems to produce misleading or incorrect transcriptions from seemingly invisible changes to humans. Some of these perturbations even survive physical capture, posing security risks to high-stakes applications such as document processing, license plate recognition, and automated compliance systems. Existing defenses, such as adversarial training, input preprocessing, or post-recognition correction, are often model-specific, computationally expensive, and affect performance on unperturbed inputs while remaining vulnerable to unseen or adaptive attacks. To address these challenges, TopoReformer is introduced, a model-agnostic reformation pipeline that mitigates adversarial perturbations while preserving the structural integrity of text images. Topology studies properties of shapes and spaces that remain unchanged under continuous deformations, focusing on global structures such as connectivity, holes, and loops rather than exact distance. Leveraging these topological features, TopoReformer employs a topological autoencoder to enforce manifold-level consistency in latent space and improve robustness without explicit gradient regularization. The proposed method is benchmarked on EMNIST, MNIST, against standard adversarial attacks (FGSM, PGD, Carlini-Wagner), adaptive attacks (EOT, BDPA), and an OCR-specific watermark attack (FAWA).

[313] A Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning

Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh, Naren Chittibabu, Akash Anandhan

Main category: cs.LG

TL;DR: A two-step fine-tuning process using GRPO (likely a reinforcement learning method) on a small language model (<600M parameters) creates a refined resume evaluation system that overcomes traditional ATS limitations and achieves 91% accuracy with perfect precision.

Details

Motivation: Traditional Applicant Tracking Systems (ATS) are inflexible keyword-matchers that often reject qualified candidates due to minor semantic mismatches, creating a need for more nuanced and human-like candidate evaluation.

Method: Two-step process: 1) Supervised Fine-Tuning (SFT) to build baseline model, 2) Reinforcement Learning optimization using GRPO with a multi-component reward function that holistically assesses candidates beyond simple keyword matching.

Result: Achieved 91% accuracy on unseen test data with 0.85 recall for ‘SELECTED’ class and perfect 1.0 precision. Overcame initial reward hacking issues through refined reward function and training hyperparameters.

Conclusion: Proper two-step fine-tuning can effectively refine small language models for human-like candidate scoring, overcoming drawbacks of both traditional ATS and naive RL usage.

Abstract: Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (<600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable “gentle polishing process” of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the ‘SELECTED’ class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.

[314] Beyond Tsybakov: Model Margin Noise and $\mathcal{H}$-Consistency Bounds

Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: Introduces Model Margin Noise (MM noise), a weaker alternative to Tsybakov noise that enables enhanced consistency bounds for classification by focusing on hypothesis-Bayes discrepancy rather than distributional minimal margin.

Details

Motivation: To develop improved consistency bounds for classification under weaker noise conditions than Tsybakov noise, allowing for better theoretical guarantees even when traditional noise assumptions fail.

Method: Proposes the Model Margin Noise assumption and derives enhanced H-consistency bounds for binary and multi-class classification, extending previous work with the same favorable exponents but under weaker assumptions.

Result: Achieves enhanced consistency bounds that interpolate between linear and square-root regimes for intermediate noise levels, with instantiation for common surrogate loss families.

Conclusion: MM noise provides a more flexible framework than Tsybakov noise for deriving consistency bounds, enabling improved theoretical guarantees across various noise levels in classification problems.

Abstract: We introduce a new low-noise condition for classification, the Model Margin Noise (MM noise) assumption, and derive enhanced $\mathcal{H}$-consistency bounds under this condition. MM noise is weaker than Tsybakov noise condition: it is implied by Tsybakov noise condition but can hold even when Tsybakov fails, because it depends on the discrepancy between a given hypothesis and the Bayes-classifier rather than on the intrinsic distributional minimal margin (see Figure 1 for an illustration of an explicit example). This hypothesis-dependent assumption yields enhanced $\mathcal{H}$-consistency bounds for both binary and multi-class classification. Our results extend the enhanced $\mathcal{H}$-consistency bounds of Mao, Mohri, and Zhong (2025a) with the same favorable exponents but under a weaker assumption than the Tsybakov noise condition; they interpolate smoothly between linear and square-root regimes for intermediate noise levels. We also instantiate these bounds for common surrogate loss families and provide illustrative tables.

[315] Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense

Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson

Main category: cs.LG

TL;DR: LLM-based reward design approach for generating autonomous cyber defense policies in DRL-driven simulation environments, showing effective defense against diverse adversarial behaviors.

Details

Motivation: Designing rewards for autonomous cyber attack and defense agents in complex environments is challenging for human experts, requiring automated approaches.

Method: Used LLM-guided reward design with multiple attack/defense personas, provided contextual cyber simulation information to LLM, then utilized reward structures in DRL-driven attack-defense simulation to learn defense policy ensembles.

Result: LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.

Conclusion: LLM-based reward design is a promising approach for generating autonomous cyber defense policies in complex simulation environments.

Abstract: Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.

[316] Attention-Based Feature Online Conformal Prediction for Time Series

Meiyi Zhu, Caili Guo, Chunyan Feng, Osvaldo Simeone

Main category: cs.LG

TL;DR: AFOCP improves online conformal prediction by using feature-space calibration with attention mechanisms to create smaller prediction sets while maintaining coverage guarantees under distribution shifts.

Details

Motivation: Standard OCP has limitations: it operates in output space with simple scores and treats all historical data uniformly, leading to inefficient prediction sets that don't adapt well to distribution shifts.

Method: AFOCP introduces two innovations: 1) operates in feature space of pre-trained neural networks to focus on task-relevant information, 2) uses attention mechanism to adaptively weight historical observations based on relevance to current test point.

Result: AFOCP reduces prediction interval sizes by up to 88% compared to standard OCP while maintaining target coverage levels, with theoretical guarantees of smaller intervals under mild conditions.

Conclusion: Feature-space calibration combined with attention-based adaptive weighting significantly improves online conformal prediction efficiency and adaptability to non-stationary data.

Abstract: Online conformal prediction (OCP) wraps around any pre-trained predictor to produce prediction sets with coverage guarantees that hold irrespective of temporal dependencies or distribution shifts. However, standard OCP faces two key limitations: it operates in the output space using simple nonconformity (NC) scores, and it treats all historical observations uniformly when estimating quantiles. This paper introduces attention-based feature OCP (AFOCP), which addresses both limitations through two key innovations. First, AFOCP operates in the feature space of pre-trained neural networks, leveraging learned representations to construct more compact prediction sets by concentrating on task-relevant information while suppressing nuisance variation. Second, AFOCP incorporates an attention mechanism that adaptively weights historical observations based on their relevance to the current test point, effectively handling non-stationarity and distribution shifts. We provide theoretical guarantees showing that AFOCP maintains long-term coverage while provably achieving smaller prediction intervals than standard OCP under mild regularity conditions. Extensive experiments on synthetic and real-world time series datasets demonstrate that AFOCP consistently reduces the size of prediction intervals by as much as $88%$ as compared to OCP, while maintaining target coverage levels, validating the benefits of both feature-space calibration and attention-based adaptive weighting.

[317] Transparent Early ICU Mortality Prediction with Clinical Transformer and Per-Case Modality Attribution

Alexander Bakumenko, Janine Hoelscher, Hudson Smith

Main category: cs.LG

TL;DR: A lightweight multimodal ensemble combining physiological time-series and clinical notes for early ICU mortality prediction, offering transparent decision-making and robust performance even with missing data.

Details

Motivation: Existing machine learning approaches for ICU mortality prediction lack transparency and robustness, limiting clinical adoption despite high predictive performance.

Method: Late-fusion ensemble using logistic regression to combine predictions from bidirectional LSTM (for vitals) and finetuned ClinicalModernBERT transformer (for clinical notes), with multilevel interpretability features.

Result: Improved discrimination over single models (AUPRC 0.565 vs 0.526; AUROC 0.891 vs 0.876) on MIMIC-III, with well-calibrated predictions and robust performance when modalities are missing.

Conclusion: The system demonstrates competitive performance with reliable, auditable risk estimates and transparent operation, which are crucial for clinical adoption.

Abstract: Early identification of intensive care patients at risk of in-hospital mortality enables timely intervention and efficient resource allocation. Despite high predictive performance, existing machine learning approaches lack transparency and robustness, limiting clinical adoption. We present a lightweight, transparent multimodal ensemble that fuses physiological time-series measurements with unstructured clinical notes from the first 48 hours of an ICU stay. A logistic regression model combines predictions from two modality-specific models: a bidirectional LSTM for vitals and a finetuned ClinicalModernBERT transformer for notes. This traceable architecture allows for multilevel interpretability: feature attributions within each modality and direct per-case modality attributions quantifying how vitals and notes influence each decision. On the MIMIC-III benchmark, our late-fusion ensemble improves discrimination over the best single model (AUPRC 0.565 vs. 0.526; AUROC 0.891 vs. 0.876) while maintaining well-calibrated predictions. The system remains robust through a calibrated fallback when a modality is missing. These results demonstrate competitive performance with reliable, auditable risk estimates and transparent, predictable operation, which together are crucial for clinical use.

[318] discretize_distributions: Efficient Quantization of Gaussian Mixtures with Guarantees in Wasserstein Distance

Steven Adams, Elize Alwash, Luca Laurenti

Main category: cs.LG

TL;DR: A Python package for efficient discrete approximation of Gaussian mixture distributions with Wasserstein distance error guarantees, implementing state-of-the-art quantization methods.

Details

Motivation: To provide efficient and scalable discrete approximations of Gaussian mixture models with guaranteed error bounds for use in control and verification pipelines for cyber-physical systems.

Method: Implements state-of-the-art quantization methods for Gaussian mixture models, extends them for improved scalability, and integrates complementary strategies like sigma-point methods with a modular interface.

Result: The package produces accurate approximations at low computational cost, demonstrated through benchmarks on high-dimensional, large, and degenerate Gaussian mixtures.

Conclusion: discretize_distributions is an effective tool for constructing discrete approximations of Gaussian mixture distributions with guaranteed error bounds and good computational performance.

Abstract: We present discretize_distributions, a Python package that efficiently constructs discrete approximations of Gaussian mixture distributions and provides guarantees on the approximation error in Wasserstein distance. The package implements state-of-the-art quantization methods for Gaussian mixture models and extends them to improve scalability. It further integrates complementary quantization strategies such as sigma-point methods and provides a modular interface that supports custom schemes and integration into control and verification pipelines for cyber-physical systems. We benchmark the package on various examples, including high-dimensional, large, and degenerate Gaussian mixtures, and demonstrate that discretize_distributions produces accurate approximations at low computational cost.

[319] Policy Search, Retrieval, and Composition via Task Similarity in Collaborative Agentic Systems

Saptarshi Nath, Christos Peridis, Eseoghene Benjamin, Xinran Liu, Soheil Kolouri, Peter Kinnell, Zexin Li, Cong Liu, Shirin Dora, Andrea Soltoggio

Main category: cs.LG

TL;DR: MOSAIC algorithm enables agents to selectively share and integrate learned policies using performance signals and task similarity, improving collective learning efficiency and solving tasks that isolated agents cannot.

Details

Motivation: Current agentic AI systems lack effective methods for sharing and reusing learned knowledge across multiple agents facing unforeseen tasks, limiting learning acceleration and collective performance.

Method: Proposes MOSAIC algorithm combining: (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular neural representations via masks for transferability, and (3) policy integration, composition and fine-tuning.

Result: MOSAIC outperforms isolated learners and global sharing approaches in learning speed and overall performance, solves tasks that isolated agents cannot, reduces task interference, and enables self-organization where simpler tasks accelerate harder ones.

Conclusion: Selective, goal-driven policy reuse through MOSAIC enables more efficient collective learning in agentic systems, demonstrating the value of modular knowledge sharing and composition.

Abstract: Agentic AI aims to create systems that set their own goals, adapt proactively to change, and refine behavior through continuous experience. Recent advances suggest that, when facing multiple and unforeseen tasks, agents could benefit from sharing machine-learned knowledge and reusing policies that have already been fully or partially learned by other agents. However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. The proposed algorithm, \emph{Modular Sharing and Composition in Collective Learning} (MOSAIC), improves learning in agentic collectives by combining (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular and transferable neural representations via masks, and (3) policy integration, composition and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The results also demonstrate that selective, goal-driven reuse leads to less susceptibility to task interference. We also observe the emergence of self-organization, where agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.

[320] GLOBE: Accurate and Generalizable PDE Surrogates using Domain-Inspired Architectures and Equivariances

Peter Sharpe

Main category: cs.LG

TL;DR: GLOBE is a neural surrogate for homogeneous PDEs that combines boundary-element methods with equivariant ML, achieving 200x error reduction on AirFRANS dataset with compact 117k parameters and arbitrary point evaluation.

Details

Motivation: To create a more accurate and practical ML-based PDE surrogate for industrial CAE by incorporating rigorous physics- and domain-inspired inductive biases from boundary-element methods and equivariant ML.

Method: Represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, using multiscale branches and communication hyperlayers. Architecture is translation-, rotation-, and parity-equivariant; discretization-invariant; and units-invariant via nondimensionalization.

Result: On AirFRANS dataset: 200x MSE reduction on all fields relative to baselines, 50x relative to next-best model. In scarce data setting: 100x lower error on velocity/pressure, 600x lower on surface pressure than Transolver. Model is compact (117k params) with arbitrary point evaluation.

Conclusion: Rigorous physics- and domain-inspired inductive biases enable substantial gains in accuracy, generalizability, and practicality for ML-based PDE surrogates in industrial CAE applications.

Abstract: We introduce GLOBE, a new neural surrogate for homogeneous PDEs that draws inductive bias from boundary-element methods and equivariant ML. GLOBE represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, composed across multiscale branches and communication hyperlayers. The architecture is translation-, rotation-, and parity-equivariant; discretization-invariant in the fine-mesh limit; and units-invariant via rigorous nondimensionalization. An explicit far-field decay envelope stabilizes extrapolation, boundary-to-boundary hyperlayer communication mediates long-range coupling, and the all-to-all boundary-to-target evaluation yields a global receptive field that respects PDE information flow, even for elliptic PDEs. On AirFRANS (steady incompressible RANS over NACA airfoils), GLOBE achieves substantial accuracy improvements. On the “Full” split, it reduces mean-squared error by roughly 200x on all fields relative to the dataset’s reference baselines, and roughly 50x relative to the next-best-performing model. In the “Scarce” split, it achieves over 100x lower error on velocity and pressure fields and over 600x lower error on surface pressure than Transolver. Qualitative results show sharp near-wall gradients, coherent wakes, and limited errors under modest extrapolation in Reynolds number and angle of attack. In addition to this accuracy, the model is quite compact (117k parameters), and fields can be evaluated at arbitrary points during inference. We also demonstrate the ability to train and predict with non-watertight meshes, which has strong practical implications. These results show that rigorous physics- and domain-inspired inductive biases can achieve large gains in accuracy, generalizability, and practicality for ML-based PDE surrogates for industrial computer-aided engineering (CAE).

[321] Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization

Rahul Krishna Thomas, Arka Pal

Main category: cs.LG

TL;DR: This paper presents an efficient algorithm for multi-draft speculative sampling that achieves 90% acceptance rates with minimal overhead by reformulating the optimal transport problem as a max-flow problem and using polymatroid theory to reduce computational complexity.

Details

Motivation: Speculative sampling reduces latency in autoregressive decoding but faces computational challenges with multi-draft extensions, particularly in solving the exponentially large optimal transport linear program (OTLP) for optimal acceptance criteria.

Method: The authors reformulate the OTLP as a max-flow problem and apply polymatroid theory to reduce it to a convex optimization problem in at most V variables, enabling efficient solution for n-draft speculative sampling with i.i.d. draft tokens.

Result: The proposed algorithm achieves 90% acceptance rates with under 100 ms overhead per token and negligible deviation from the target model distribution, significantly outperforming previous approaches.

Conclusion: This work provides the first practical multi-draft speculative sampling algorithm with high acceptance rates and minimal computational overhead, making it feasible for real-world deployment.

Abstract: Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.

[322] Unified all-atom molecule generation with neural fields

Matthieu Kirchmeyer, Pedro O. Pinheiro, Emma Willett, Karolis Martinkus, Joseph Kleinhenz, Emily K. Makowski, Andrew M. Watkins, Vladimir Gligorijevic, Richard Bonneau, Saeed Saremi

Main category: cs.LG

TL;DR: FuncBind is a unified generative framework for structure-based drug design that uses neural fields and computer vision architectures to generate target-conditioned molecules across diverse atomic systems.

Details

Motivation: Current generative models for structure-based drug design are limited to specific modalities, restricting their broader applicability across different molecular types and sizes.

Method: Uses neural fields to represent molecules as continuous atomic densities and employs score-based generative models with computer vision architectures, enabling modality-agnostic representation for diverse atomic systems.

Result: Achieved competitive in silico performance for generating small molecules, macrocyclic peptides, and antibody CDR loops, and successfully generated novel antibody binders via de novo CDR H3 loop redesign in vitro.

Conclusion: FuncBind provides a unified, modality-agnostic approach for structure-based drug design that works across diverse molecular systems and demonstrated practical utility in generating novel antibody binders.

Abstract: Generative models for structure-based drug design are often limited to a specific modality, restricting their broader applicability. To address this challenge, we introduce FuncBind, a framework based on computer vision to generate target-conditioned, all-atom molecules across atomic systems. FuncBind uses neural fields to represent molecules as continuous atomic densities and employs score-based generative models with modern architectures adapted from the computer vision literature. This modality-agnostic representation allows a single unified model to be trained on diverse atomic systems, from small to large molecules, and handle variable atom/residue counts, including non-canonical amino acids. FuncBind achieves competitive in silico performance in generating small molecules, macrocyclic peptides, and antibody complementarity-determining region loops, conditioned on target structures. FuncBind also generated in vitro novel antibody binders via de novo redesign of the complementarity-determining region H3 loop of two chosen co-crystal structures. As a final contribution, we introduce a new dataset and benchmark for structure-conditioned macrocyclic peptide generation. The code is available at https://github.com/prescient-design/funcbind.

[323] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun

Main category: cs.LG

TL;DR: AccelOpt is a self-improving LLM agent that autonomously optimizes AI accelerator kernels without expert knowledge, achieving significant performance improvements on AWS Trainium chips while being highly cost-effective.

Details

Motivation: To eliminate the need for expert-provided hardware-specific optimization knowledge for emerging AI accelerators, enabling autonomous kernel optimization through self-improving AI systems.

Method: Uses iterative generation with an optimization memory that curates experiences from previously encountered slow-fast kernel pairs, and builds NKIBench benchmark suite with real-world LLM workload kernels for evaluation.

Result: Improved average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2. Matches Claude Sonnet 4 kernel improvements while being 26x cheaper using open-source models.

Conclusion: AccelOpt demonstrates effective autonomous kernel optimization with self-improving capabilities, achieving significant performance gains and high cost-effectiveness for AI accelerator optimization.

Abstract: We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt’s capability improves over time, boosting the average percentage of peak throughput from $49%$ to $61%$ on Trainium 1 and from $45%$ to $59%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.

[324] Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone

Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Torsten Scholak

Main category: cs.LG

TL;DR: DiffuApriel is a masked diffusion language model using bidirectional Mamba backbone instead of Transformers, achieving 4.4x higher inference throughput for long sequences while matching performance.

Details

Motivation: Transformer-based diffusion models suffer from quadratic attention and KV-cache overhead, limiting inference efficiency for text generation.

Method: Built on bidirectional Mamba backbone combining diffusion objective with linear-time sequence modeling; also proposed hybrid variant (DiffuApriel-H) interleaving attention and Mamba layers.

Result: Matches Transformer-based diffusion model performance while achieving up to 4.4x higher inference throughput for long sequences with 1.3B model; hybrid variant offers 2.6x throughput improvement.

Conclusion: Bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing practical and scalable foundation for faster, memory-efficient text generation.

Abstract: Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.

[325] iLTM: Integrated Large Tabular Model

David Bonet, Marçal Comajoan Cara, Alvaro Calafell, Daniel Mas Montserrat, Alexander G. Ioannidis

Main category: cs.LG

TL;DR: iLTM is an integrated Large Tabular Model that combines tree embeddings, dimensionality-agnostic representations, meta-trained hypernetworks, MLPs, and retrieval to outperform GBDTs and deep tabular models across classification and regression tasks.

Details

Motivation: Deep learning advances haven't fully transferred to tabular data, where GBDTs remain dominant despite limitations in adaptability and scalability.

Method: Unifies tree-derived embeddings, dimensionality-agnostic representations, meta-trained hypernetwork, MLPs, and retrieval in single architecture; pretrained on 1,800+ heterogeneous classification datasets.

Result: Consistently superior performance across tabular classification/regression tasks; matches or surpasses strong baselines after light fine-tuning; outperforms well-tuned GBDTs and deep tabular models with less task-specific tuning.

Conclusion: iLTM bridges tree-based and neural methods, offering a new framework for tabular foundation models that enables robust, adaptable, and scalable tabular learning.

Abstract: Tabular data underpins decisions across science, industry, and public services. Despite rapid progress, advances in deep learning have not fully carried over to the tabular domain, where gradient-boosted decision trees (GBDTs) remain a default choice in practice. We present iLTM, an integrated Large Tabular Model that unifies tree-derived embeddings, dimensionality-agnostic representations, a meta-trained hypernetwork, multilayer perceptrons (MLPs), and retrieval within a single architecture. Pretrained on more than 1,800 heterogeneous classification datasets, iLTM achieves consistently superior performance across tabular classification and regression tasks, from small datasets to large and high-dimensional tasks. After light fine-tuning, the meta-trained hypernetwork transfers to regression targets, matching or surpassing strong baselines. Extensive experiments show that iLTM outperforms well-tuned GBDTs and leading deep tabular models while requiring less task-specific tuning. By bridging the gap between tree-based and neural methods, iLTM offers a new framework for tabular foundation models for robust, adaptable, and scalable tabular learning.

[326] Self-supervised and Multi-fidelity Learning for Extended Predictive Soil Spectroscopy

Luning Sun, José L. Safanelli, Jonathan Sanderman, Katerina Georgiou, Colby Brungard, Kanchan Grover, Bryan G. Hopkins, Shusen Liu, Timo Bremer

Main category: cs.LG

TL;DR: Self-supervised ML framework for multi-fidelity soil spectroscopy using latent space embeddings from MIR spectra, enabling NIR-to-MIR conversion and improved soil property predictions.

Details

Motivation: To leverage large MIR spectral databases for soil analysis while enabling use of low-cost portable NIR scanners, addressing limitations in current NIR libraries.

Method: Pretrained Variational Autoencoder on MIR spectra for latent embeddings, then used frozen MIR decoder for NIR-to-MIR spectrum conversion, followed by downstream ML models for soil property prediction.

Result: SSML embeddings achieved similar or better accuracy than baselines for all soil properties; NIR-to-MIR conversion performed better than NIR-only models but not as well as original MIR spectra.

Conclusion: The unified spectral latent space effectively leverages diverse MIR datasets for soil property prediction, enabling portable NIR scanners to benefit from MIR library capabilities.

Abstract: We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained with the large MIR spectral library and the Variational Autoencoder algorithm to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and the availability of scan repeats for augmented training. We also leveraged and froze the trained MIR decoder for a spectrum conversion task by plugging it into a NIR encoder to learn the mapping between NIR and MIR spectra in an attempt to leverage the predictive capabilities contained in the large MIR library with a low cost portable NIR scanner. This was achieved by using a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map between original spectra, predicted spectra, and latent space embeddings for nine soil properties. The performance of was evaluated independently of the KSSL training data using a gold-standard test set, along with regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML and its embeddings yielded similar or better accuracy in all soil properties prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to predictive performance of NIR-only models, suggesting the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for prediction of soil properties not well represented in current NIR libraries.

[327] Machine Learning Epidemic Predictions Using Agent-based Wireless Sensor Network Models

Chukwunonso Henry Nwokoye, Blessing Oluchi, Sharna Waldron, Peace Ezzeh

Main category: cs.LG

TL;DR: An agent-based SEIRV model was used to generate synthetic epidemic datasets for WSNs, and multiple ML algorithms were evaluated for predicting infected and recovered nodes, with tree-based methods performing best.

Details

Motivation: The lack of epidemiological data in wireless sensor networks makes it difficult to build robust models for forecasting and mitigating malware threats like viruses and worms.

Method: Used agent-based implementation of SEIRV mathematical model with NetLogo’s BehaviorSpace and Python to generate synthetic datasets, then applied multiple ML algorithms as regression problems to predict infected and recovered nodes.

Result: Predictions showed excellent performance with high R^2 values (0.997-1.000) on training data and good validation scores (0.971-0.999). Random Forest, XGBoost, Decision Trees, and k-nearest neighbors achieved the best results, while support vector and linear regression variants performed worst.

Conclusion: Tree-based ML algorithms like Random Forest and XGBoost are most effective for predicting malware spread in WSNs using synthetic epidemiological data generated from SEIRV models.

Abstract: The lack of epidemiological data in wireless sensor networks (WSNs) is a fundamental difficulty in constructing robust models to forecast and mitigate threats such as viruses and worms. Many studies have examined different epidemic models for WSNs, focusing on how malware infections spread given the network’s specific properties, including energy limits and node mobility. In this study, an agent-based implementation of the susceptible-exposed-infected-recovered-vaccinated (SEIRV) mathematical model was employed for machine learning (ML) predictions. Using tools such as NetLogo’s BehaviorSpace and Python, two epidemic synthetic datasets were generated and prepared for the application of several ML algorithms. Posed as a regression problem, the infected and recovered nodes were predicted, and the performance of these algorithms is compared using the error metrics of the train and test sets. The predictions performed well, with low error metrics and high R^2 values (0.997, 1.000, 0.999, 1.000), indicating an effective fit to the training set. The validation values were lower (0.992, 0.998, 0.971, and 0.999), as is typical when evaluating model performance on unseen data. Based on the recorded performances, support vector, linear, Lasso, Ridge, and ElasticNet regression were among the worst-performing algorithms, while Random Forest, XGBoost, Decision Trees, and k-nearest neighbors achieved the best results.

[328] Descend or Rewind? Stochastic Gradient Descent Unlearning

Siqiao Mu, Diego Klabjan

Main category: cs.LG

TL;DR: This paper provides theoretical guarantees for stochastic machine unlearning algorithms R2D and D2D, proving (ε, δ) certified unlearning for strongly convex, convex, and nonconvex loss functions through gradient system analysis and optimal coupling techniques.

Details

Motivation: Machine unlearning algorithms like D2D and R2D aim to efficiently remove training data impact without full retraining, but stochastic D2D lacks theoretical backing for nonconvex functions despite being widely used as a baseline.

Method: The authors analyze unlearning through disturbed gradient systems, using optimal coupling of unlearning and retraining trajectories to derive probabilistic sensitivity bounds, combined with a novel relaxed Gaussian mechanism.

Result: Proved (ε, δ) certified unlearning guarantees for stochastic R2D and D2D across strongly convex, convex, and nonconvex loss functions. D2D provides tighter guarantees for strongly convex functions, while R2D works for convex and nonconvex settings.

Conclusion: Both algorithms can achieve certified unlearning, with D2D being optimal for strongly convex functions due to contraction properties, and R2D being more versatile for convex and nonconvex cases by reversing accumulated disturbances.

Abstract: Machine unlearning algorithms aim to remove the impact of selected training data from a model without the computational expenses of retraining from scratch. Two such algorithms are Descent-to-Delete" (D2D) and Rewind-to-Delete" (R2D), full-batch gradient descent algorithms that are easy to implement and satisfy provable unlearning guarantees. In particular, the stochastic version of D2D is widely implemented as the ``finetuning" unlearning baseline, despite lacking theoretical backing on nonconvex functions. In this work, we prove $(ε, δ)$ certified unlearning guarantees for stochastic R2D and D2D for strongly convex, convex, and nonconvex loss functions, by analyzing unlearning through the lens of disturbed or biased gradient systems, which may be contracting, semi-contracting, or expansive respectively. Our argument relies on optimally coupling the random behavior of the unlearning and retraining trajectories, resulting in a probabilistic sensitivity bound that can be combined with a novel relaxed Gaussian mechanism to achieve $(ε, δ)$ unlearning. We determine that D2D can yield tighter guarantees for strongly convex functions compared to R2D by relying on contraction to a unique global minimum. However, unlike D2D, R2D can achieve unlearning in the convex and nonconvex setting because it draws the unlearned model closer to the retrained model by reversing the accumulated disturbances.

[329] Synergizing Deconfounding and Temporal Generalization For Time-series Counterfactual Outcome Estimation

Yiling Liu, Juncheng Dong, Chen Fu, Wei Shi, Ziyang Jiang, Zhigang Hua, David Carlson

Main category: cs.LG

TL;DR: A novel framework combining Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM) for improved counterfactual outcome estimation in time-series data, achieving state-of-the-art performance through synergistic deconfounding and temporal generalization.

Details

Motivation: Counterfactual outcome estimation from time-series observations is challenging due to unobserved counterfactual trajectories and evolving time-varying confounders that distort estimation at every step.

Method: Proposes SGA for fine-grained sub-treatment group alignment to improve deconfounding, and RTM that randomly masks covariates with Gaussian noise to enhance temporal generalization and reduce reliance on noisy current-step covariates.

Result: Experiments show that while SGA and RTM individually improve counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance.

Conclusion: The framework successfully addresses time-series counterfactual estimation through complementary approaches: RTM enhances temporal generalization across time steps, while SGA improves deconfounding at each specific time point.

Abstract: Estimating counterfactual outcomes from time-series observations is crucial for effective decision-making, e.g. when to administer a life-saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment-agnostic clustering to identify fine-grained sub-treatment groups. Aligning these fine-grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noises during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.

[330] Physics-Guided Inductive Spatiotemporal Kriging for PM2.5 with Satellite Gradient Constraints

Shuo Wang, Mengfan Teng, Yun Cheng, Lothar Thiele, Olga Saukh, Shuangshuang He, Yuanting Zhang, Jiang Zhang, Gangfeng Zhang, Xingyuan Yuan, Jingfang Fan

Main category: cs.LG

TL;DR: SPIN is a physics-guided deep learning framework that uses satellite AOD data as gradient constraints rather than direct inputs, achieving state-of-the-art PM2.5 mapping with 9.52 ug/m^3 MAE in Beijing-Tianjin-Hebei region.

Details

Motivation: Overcome limitations of traditional PM2.5 mapping methods that suffer from severe data missingness in satellite AOD data and inversion biases, enabling continuous pollution mapping in unmonitored areas.

Method: Spatiotemporal Physics-Guided Inference Network (SPIN) with parallel graph kernels modeling physical advection/diffusion processes, using AOD as spatial gradient constraints in loss function rather than direct inputs.

Result: Achieved state-of-the-art performance with 9.52 ug/m^3 MAE in BTHSA region, generating continuous, physically plausible pollution fields even in unmonitored areas.

Conclusion: SPIN provides a robust, low-cost, all-weather solution for fine-grained environmental management by synergistically integrating domain knowledge with deep learning.

Abstract: High-resolution mapping of fine particulate matter (PM2.5) is a cornerstone of sustainable urbanism but remains critically hindered by the spatial sparsity of ground monitoring networks. While traditional data-driven methods attempt to bridge this gap using satellite Aerosol Optical Depth (AOD), they often suffer from severe, non-random data missingness (e.g., due to cloud cover or nighttime) and inversion biases. To overcome these limitations, this study proposes the Spatiotemporal Physics-Guided Inference Network (SPIN), a novel framework designed for inductive spatiotemporal kriging. Unlike conventional approaches, SPIN synergistically integrates domain knowledge into deep learning by explicitly modeling physical advection and diffusion processes via parallel graph kernels. Crucially, we introduce a paradigm-shifting training strategy: rather than using error-prone AOD as a direct input, we repurpose it as a spatial gradient constraint within the loss function. This allows the model to learn structural pollution patterns from satellite data while remaining robust to data voids. Validated in the highly polluted Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), SPIN achieves a new state-of-the-art with a Mean Absolute Error (MAE) of 9.52 ug/m^3, effectively generating continuous, physically plausible pollution fields even in unmonitored areas. This work provides a robust, low-cost, and all-weather solution for fine-grained environmental management.

[331] CARE: Turning LLMs Into Causal Reasoning Expert

Juncheng Dong, Yiling Liu, Ahmed Aloui, Vahid Tarokh, David Carlson

Main category: cs.LG

TL;DR: LLMs struggle with causal discovery but can be enhanced through supervised fine-tuning using outputs from traditional causal discovery algorithms, resulting in models that outperform both traditional methods and much larger LLMs.

Details

Motivation: LLMs lack causal reasoning abilities despite their impressive performance in other tasks, and current approaches of providing them with causal discovery algorithm outputs actually decrease their performance.

Method: Proposed CARE framework that uses supervised fine-tuning to teach LLMs how to effectively utilize outputs from established causal discovery algorithms as sufficient statistics of observational data.

Result: Fine-tuned Qwen2.5-1.5B model significantly outperforms both traditional causal discovery algorithms and state-of-the-art LLMs with over 1000x more parameters, demonstrating effective knowledge integration.

Conclusion: Supervised fine-tuning enables LLMs to effectively combine their internal knowledge with external causal discovery algorithm outputs, overcoming their inherent limitations in causal reasoning.

Abstract: Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a fundamental cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs’ behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observation data. This is unsurprising, given that LLMs were never trained to process structural datasets. To first tackle this challenge, we prompt the LLMs with the outputs of established causal discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as the sufficient statistics of the observation data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs’ performance in causal discovery. To address this current limitation, we propose CARE, a framework that enhances LLMs’ causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a finetuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.

[332] HGCN2SP: Hierarchical Graph Convolutional Network for Two-Stage Stochastic Programming

Yang Wu, Yifan Zhang, Zhenxing Liang, Jian Cheng

Main category: cs.LG

TL;DR: HGCN2SP is a novel hierarchical graph-based model for Two-stage Stochastic Programming that uses reinforcement learning to select scenarios in optimal order, significantly reducing solving time while maintaining solution quality.

Details

Motivation: Current methods for solving Two-stage Stochastic Programming problems with many scenarios rely on clustering or Monte Carlo sampling, which fail to deeply integrate scenario information and overlook the significant impact of scenario order on solving time.

Method: Developed HGCN2SP with hierarchical graph design encoding each scenario and modeling their relationships hierarchically. Trained in reinforcement learning paradigm using solver feedback, with hierarchical graph convolutional network for feature encoding and attention-based decoder for scenario selection.

Result: Evaluation on two classic 2SP problems shows HGCN2SP provides high-quality decisions in short computational time and exhibits remarkable generalization capabilities for large-scale instances with many variables or scenarios unseen during training.

Conclusion: HGCN2SP effectively addresses limitations of existing methods by deeply integrating scenario information and optimizing scenario selection order, demonstrating superior performance and generalization in solving Two-stage Stochastic Programming problems.

Abstract: Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.

[333] Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao

Main category: cs.LG

TL;DR: Agent0 is a fully autonomous framework that evolves high-performing LLM agents through multi-step co-evolution between curriculum and executor agents, enabling self-improvement without human data.

Details

Motivation: Current LLM agents are limited by human-curated data dependency and single-round interactions, hindering development of complex curricula involving tool use and dynamic reasoning.

Method: Establishes symbiotic competition between two agents from the same base LLM: a curriculum agent that proposes challenging tasks, and an executor agent that solves them with tool integration, creating a self-reinforcing cycle.

Result: Substantially boosts reasoning capabilities, improving Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks.

Conclusion: Agent0 enables fully autonomous evolution of high-performing agents through co-evolution and tool integration, overcoming limitations of human data dependency.

Abstract: Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model’s inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor’s problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming-lab/Agent0.

[334] Change-of-Basis Pruning via Rotational Invariance

Alex Ning, Vainateya Rangaraju

Main category: cs.LG

TL;DR: The paper introduces two-subspace radial activations (TSRAs) to enable change-of-basis pruning by creating rotationally invariant architectures, showing improved pruning effectiveness despite a small accuracy trade-off.

Details

Motivation: Standard deep learning architectures are not inherently invariant to orthogonal transformations needed for change-of-basis pruning, which limits the effectiveness of structured pruning methods.

Method: Proposed two-subspace radial activations (TSRAs) that are invariant to orthogonal linear transformations within activation subspaces, allowing CoB transformations to merge into weights without extra parameters.

Result: CoB+TSRA framework on VGG-16 shows: fixed-ratio pruning extends reliable frontier from 30% to 70% parameters without fine-tuning; threshold-based pruning removes 90-96% parameters with only 1-6% accuracy drop after fine-tuning.

Conclusion: Rotationally invariant architectures offer a promising path for change-of-basis pruning, enabling more effective structured pruning despite a 4.52% accuracy trade-off compared to ReLU baselines.

Abstract: Structured pruning removes entire neurons or channels, but its effectiveness depends on how importance is distributed across the representation space. Change-of-basis (CoB) pruning addresses this challenge by applying orthogonal linear transformations that concentrate importance within certain dimensions. However, many standard deep learning architectures are not inherently invariant to such transformations. To enable compatibility, we introduce two-subspace radial activations (TSRAs): an activation family that is invariant to orthogonal linear transformations applied independently within its two activation subspaces. This invariance allows CoB transformations to be merged into surrounding weights without incurring extra parameters. We position this work as a proof-of-concept that a rotationally invariant design may offer a principled approach towards change-of-basis pruning. We do not provide an analysis of multiple TSRA candidates nor do we explore weight initialization for any TSRAs. These limitations, combined with other necessary modifications we make to permit rotational invariance, result in a slight accuracy drop of $4.52%$ compared to a ReLU-based control. However, using activation-magnitude importance, VGG-16 implementing our CoB+TSRA framework shows encouraging results on CIFAR-10. Under fixed-ratio structured pruning, CoB improves accuracy over a TSRA baseline at all pruning ratios and extends reliable pruning frontier from roughly $30%$ to $70%$ of parameters without post-prune fine tuning. Under threshold-based pruning strategies, CoB prunes $90-96%$ of parameters while maintaining $1-6%$ accuracy drop after fine-tuning. Together, these results indicate that rotationally invariant architectures may offer a promising path towards CoB pruning.

[335] Gauge-Equivariant Graph Networks via Self-Interference Cancellation

Yoonhyuk Choi, Chong-Kwon Kim

Main category: cs.LG

TL;DR: GESC is a novel GNN that addresses heterophily by replacing additive aggregation with interference-based projection, using phase connections and self-interference cancellation to improve performance on heterophilous graphs.

Details

Motivation: Standard GNNs work well on homophilous graphs but fail under heterophily due to self-reinforcing and phase-inconsistent signals, creating a need for better heterophily-aware models.

Method: Proposes GESC with U(1) phase connections and rank-1 projection that attenuates self-parallel components before attention, plus a sign- and phase-aware gate to regulate neighbor influence.

Result: Consistently outperforms recent state-of-the-art models across diverse graph benchmarks while providing a unified interference-aware view of message passing.

Conclusion: GESC offers an effective solution for heterophilous graphs through interference-based aggregation and phase-aware gating, advancing the theoretical understanding of message passing.

Abstract: Graph Neural Networks (GNNs) excel on homophilous graphs but often fail under heterophily due to self-reinforcing and phase-inconsistent signals. We propose a Gauge-Equivariant Graph Network with Self-Interference Cancellation (GESC), which replaces additive aggregation with a projection-based interference mechanism. Unlike prior magnetic or gauge-equivariant GNNs that typically focus on phase handling in spectral filtering while largely relying on scalar weighting, GESC introduces a $\mathrm{U}(1)$ phase connection followed by a rank-1 projection that attenuates self-parallel components before attention. A sign- and phase-aware gate further regulates neighbor influence, attenuating components aligned with current node states and acting as a local notch on low-frequency modes. Across diverse graph benchmarks, our method consistently outperforms recent state-of-the-art models while offering a unified, interference-aware view of message passing. Our code is available at \href{here}{https://anonymous.4open.science/r/GESC-1B22}.

[336] ILoRA: Federated Learning with Low-Rank Adaptation for Heterogeneous Client Aggregation

Junchao Zhou, Junkang Liu, Fanhua Shang

Main category: cs.LG

TL;DR: ILoRA addresses federated learning challenges with LoRA under client heterogeneity through orthonormal initialization, concatenated aggregation for mixed ranks, and AdamW with control variates to improve stability and accuracy.

Details

Motivation: Federated Learning with LoRA faces three critical issues: initialization-induced instability from random subspace misalignment, rank incompatibility during aggregation causing bias, and exacerbated client drift under non-IID data that impairs generalization.

Method: ILoRA integrates three innovations: QR-based orthonormal initialization for coherent subspace alignment, Concatenated QR Aggregation to fuse heterogeneous-rank updates via concatenation and decomposition while preserving information, and AdamW optimizer with rank-aware control variates to correct local updates and mitigate client drift.

Result: Extensive experiments on vision and NLP benchmarks show ILoRA consistently achieves superior accuracy and convergence stability compared to existing federated LoRA methods, supported by theoretical convergence guarantees.

Conclusion: ILoRA provides a unified framework that effectively addresses the key challenges of federated LoRA under client heterogeneity, delivering improved performance and stability through its three core innovations.

Abstract: Federated Learning with Low-Rank Adaptation (LoRA) faces three critical challenges under client heterogeneity: (1) Initialization-Induced Instability due to random initialization misaligning client subspaces; (2) Rank Incompatibility and Aggregation Error when averaging LoRA parameters of different ranks, which biases the global model; and (3) exacerbated Client Drift under Non-IID Data, impairing generalization. To address these challenges, we propose ILoRA, a unified framework that integrates three core innovations: a QR-based orthonormal initialization to ensure all clients start in a coherent subspace; a Concatenated QR Aggregation mechanism that fuses heterogeneous-rank updates via concatenation and decomposition, preserving information while maintaining dimension alignment; and an AdamW optimizer with rank-aware control variates to correct local updates and mitigate client drift. Supported by theoretical convergence guarantees, extensive experiments on vision and NLP benchmarks demonstrate that ILoRA consistently achieves superior accuracy and convergence stability compared to existing federated LoRA methods.

[337] L-JacobiNet and S-JacobiNet: An Analysis of Adaptive Generalization, Stabilization, and Spectral Domain Trade-offs in GNNs

Huseyin Goksu

Main category: cs.LG

TL;DR: The paper introduces AOPF class to address limitations of spectral GNNs like ChebyNet. It presents L-JacobiNet (adaptive) and S-JacobiNet (static+LayerNorm) models, revealing that ChebyNet’s main flaw is stabilization, not static nature, and that adaptation in [-1,1] domain can cause overfitting.

Details

Motivation: Spectral GNNs like ChebyNet suffer from heterophily and over-smoothing due to their static, low-pass filter design. The paper aims to investigate Adaptive Orthogonal Polynomial Filters (AOPF) as a solution.

Method: Proposed two models in [-1,1] domain: L-JacobiNet (adaptive generalization of ChebyNet with learnable parameters) and S-JacobiNet (static ChebyNet with LayerNorm stabilization). Compared these against AOPFs in [0,∞) domain like LaguerreNet.

Result: Found that [0,∞) domain is superior for modeling heterophily, while [-1,1] domain provides better numerical stability at high K (K>20). Static S-JacobiNet outperformed adaptive L-JacobiNet on 4 out of 5 benchmark datasets.

Conclusion: ChebyNet’s main flaw is stabilization, not its static nature. S-JacobiNet is identified as a powerful, overlooked baseline, and adaptation in [-1,1] domain can lead to overfitting.

Abstract: Spectral GNNs, like ChebyNet, are limited by heterophily and over-smoothing due to their static, low-pass filter design. This work investigates the “Adaptive Orthogonal Polynomial Filter” (AOPF) class as a solution. We introduce two models operating in the [-1, 1] domain: 1) L-JacobiNet, the adaptive generalization of ChebyNet with learnable alpha, beta shape parameters, and 2) S-JacobiNet, a novel baseline representing a LayerNorm-stabilized static ChebyNet. Our analysis, comparing these models against AOPFs in the [0, infty) domain (e.g., LaguerreNet), reveals critical, previously unknown trade-offs. We find that the [0, infty) domain is superior for modeling heterophily, while the [-1, 1] domain (Jacobi) provides superior numerical stability at high K (K>20). Most significantly, we discover that ChebyNet’s main flaw is stabilization, not its static nature. Our static S-JacobiNet (ChebyNet+LayerNorm) outperforms the adaptive L-JacobiNet on 4 out of 5 benchmark datasets, identifying S-JacobiNet as a powerful, overlooked baseline and suggesting that adaptation in the [-1, 1] domain can lead to overfitting.

[338] AssayMatch: Learning to Select Data for Molecular Activity Models

Vincent Fan, Regina Barzilay

Main category: cs.LG

TL;DR: AssayMatch is a framework for selecting homogeneous training data in drug discovery by using data attribution methods to quantify assay contributions and finetune text embeddings, enabling improved model performance even with unknown test labels.

Details

Motivation: Machine learning models in drug discovery suffer from noisy training data due to aggregating bioactivity data from diverse sources with variable experimental protocols, which reduces model performance.

Method: Leverages data attribution methods to quantify training assay contributions, finetunes language embeddings of assay descriptions to capture compatibility, and ranks training data using these embeddings for test sets with unknown labels.

Result: Models trained on AssayMatch-selected data outperform those trained on complete datasets, with increased prediction capability for 9/12 model-target pairs over language-only baselines.

Conclusion: AssayMatch provides a data-driven approach to curate higher-quality datasets by filtering out incompatible experiments, improving predictive power and data efficiency in drug discovery.

Abstract: The performance of machine learning models in drug discovery is highly dependent on the quality and consistency of the underlying training data. Due to limitations in dataset sizes, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogenous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to model performance. These attribution scores are used to finetune language embeddings of text-based assay descriptions to capture not just semantic similarity, but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns where the activities of candidate molecules are not known in advance. At test time, embeddings finetuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch are able to surpass the performance of the model trained on the complete dataset, highlighting its ability to effectively filter out harmful or noisy experiments. We perform experiments on two common machine learning architectures and see increased prediction capability over a strong language-only baseline for 9/12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality datasets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at https://github.com/Ozymandias314/AssayMatch.

[339] Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization

Haohui Chen, Zhiyong Chen, Aoxiang Liu, Wentuo Fang

Main category: cs.LG

TL;DR: This paper introduces enhanced double actor-critic methods with tunable bias control and improved representation learning for continuous control, outperforming benchmarks by flexibly balancing overestimation and underestimation biases.

Details

Motivation: Deterministic policy gradient algorithms suffer from value estimation biases that degrade performance, and while double critics help reduce biases, the exploration potential of double actors remains underexplored.

Method: Proposes three convex combination strategies (symmetric and asymmetric) balancing pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation, with a single hyperparameter for tunable bias control. Integrates augmented state and action representations into actor and critic networks.

Result: Extensive experiments show the approach consistently outperforms benchmarks, demonstrating the value of tunable bias control and revealing that both overestimation and underestimation can be exploited differently depending on the environment.

Conclusion: The proposed double actor-critic framework with flexible bias control and enhanced representation learning provides superior performance in continuous control tasks, showing that both types of estimation biases can be strategically leveraged.

Abstract: Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.

[340] HybSpecNet: A Critical Analysis of Architectural Instability in Hybrid-Domain Spectral GNNs

Huseyin Goksu

Main category: cs.LG

TL;DR: HybSpecNet resolves the stability-vs-adaptivity trade-off in spectral GNNs by combining stable ChebyNet and adaptive KrawtchoukNet branches with late fusion to prevent instability poisoning.

Details

Motivation: Spectral GNNs face a fundamental trade-off: filters in [-1,1] domain are stable but static/low-pass (fail on heterophilic graphs), while filters in [0,∞) domain are adaptive but numerically unstable at high polynomial degrees.

Method: Proposed HybSpecNet with hybrid-domain architecture combining stable ChebyNet branch and adaptive KrawtchoukNet branch. Identified and solved “instability poisoning” issue through “Late Fusion” that isolates gradient pathways.

Result: Naive hybrid architecture achieved strong performance at low K but catastrophically collapsed at K=25 due to NaN/Inf gradients. Late fusion architecture remained perfectly stable up to K=30 while maintaining SOTA performance across all graph types.

Conclusion: The work identifies critical architectural pitfalls in hybrid GNN design and provides a robust solution that successfully resolves the stability-vs-adaptivity trade-off in spectral graph neural networks.

Abstract: Spectral Graph Neural Networks offer a principled approach to graph filtering but face a fundamental “Stability-vs-Adaptivity” trade-off. This trade-off is dictated by the choice of spectral domain. Filters in the finite [-1, 1] domain (e.g., ChebyNet) are numerically stable at high polynomial degrees (K) but are static and low-pass, causing them to fail on heterophilic graphs. Conversely, filters in the semi-infinite [0, infty) domain (e.g., KrawtchoukNet) are highly adaptive and achieve SOTA results on heterophily by learning non-low-pass responses. However, as we demonstrate, these adaptive filters can also suffer from numerical instability, leading to catastrophic performance collapse at high K. In this paper, we propose to resolve this trade-off by designing a hybrid-domain GNN, HybSpecNet, which combines a stable ChebyNet branch with an adaptive KrawtchoukNet branch. We first demonstrate that a “naive” hybrid architecture, which fuses the branches via concatenation, successfully unifies performance at low K, achieving strong results on both homophilic and heterophilic benchmarks. However, we then prove that this naive architecture fails the stability test. Our K-ablation experiments show that this architecture catastrophically collapses at K=25, exactly mirroring the collapse of its unstable KrawtchoukNet branch. We identify this critical finding as “Instability Poisoning,” where NaN/Inf gradients from the adaptive branch destroy the training of the model. Finally, we propose and validate an advanced architecture that uses “Late Fusion” to completely isolate the gradient pathways. We demonstrate that this successfully solves the instability problem, remaining perfectly stable up to K=30 while retaining its SOTA performance across all graph types. This work identifies a critical architectural pitfall in hybrid GNN design and provides the robust architectural solution.

[341] Pathlet Variational Auto-Encoder for Robust Trajectory Generation

Yuanbo Tang, Yan Tang, Zixuan Zhang, Zihui Zhao, Yang Li

Main category: cs.LG

TL;DR: Proposes a robust and interpretable deep generative model for trajectory generation using pathlet representation, achieving significant improvements in performance and efficiency over baselines.

Details

Motivation: Address limitations in robustness and interpretability of existing deep learning trajectory generation models, which hinders their application on noisy real-world data and trustworthiness in downstream tasks.

Method: Uses pathlet representation (binary vectors with learned dictionary of trajectory segments) and probabilistic graphical model combining VAE with linear decoder to learn latent embeddings and mobility patterns simultaneously.

Result: Achieves 35.4% and 26.3% relative improvements over baselines on real-world datasets, saves 64.8% time and 56.5% GPU memory, and enables effective trajectory prediction and data denoising.

Conclusion: The proposed framework provides robust, interpretable, and efficient trajectory generation that works well with noisy data and supports multiple downstream applications.

Abstract: Trajectory generation has recently drawn growing interest in privacy-preserving urban mobility studies and location-based service applications. Although many studies have used deep learning or generative AI methods to model trajectories and have achieved promising results, the robustness and interpretability of such models are largely unexplored. This limits the application of trajectory generation algorithms on noisy real-world data and their trustworthiness in downstream tasks. To address this issue, we exploit the regular structure in urban trajectories and propose a deep generative model based on the pathlet representation, which encode trajectories with binary vectors associated with a learned dictionary of trajectory segments. Specifically, we introduce a probabilistic graphical model to describe the trajectory generation process, which includes a Variational Autoencoder (VAE) component and a linear decoder component. During training, the model can simultaneously learn the latent embedding of pathlet representations and the pathlet dictionary that captures mobility patterns in the trajectory dataset. The conditional version of our model can also be used to generate customized trajectories based on temporal and spatial constraints. Our model can effectively learn data distribution even using noisy data, achieving relative improvements of $35.4%$ and $26.3%$ over strong baselines on two real-world trajectory datasets. Moreover, the generated trajectories can be conveniently utilized for multiple downstream tasks, including trajectory prediction and data denoising. Lastly, the framework design offers a significant efficiency advantage, saving $64.8%$ of the time and $56.5%$ of GPU memory compared to previous approaches.

[342] An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text

Paula Joy B. Martinez, Jose Marie Antonio Miñoza, Sebastian C. Ibañez

Main category: cs.LG

TL;DR: SHAP-guided LLM framework generates synthetic emotion data that matches real data performance but lacks linguistic richness of authentic posts.

Details

Motivation: High API costs and platform restrictions make accessing social media training data prohibitively expensive for emotion recognition.

Method: Interpretability-guided framework using SHAP explanations to guide LLM-based synthetic data generation for emotion classification.

Result: SHAP-guided approach matches real data performance, outperforms naive generation, and improves underrepresented class classification, but synthetic text shows reduced vocabulary richness and fewer personal/temporal expressions.

Conclusion: Provides practical framework for responsible synthetic data generation while highlighting trade-offs between synthetic utility and real-world authenticity for trustworthy AI.

Abstract: Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework where Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, SHAP-guided approach matches real data performance, significantly outperforms naïve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.

[343] Labels Matter More Than Models: Quantifying the Benefit of Supervised Time Series Anomaly Detection

Zhijie Zhong, Zhiwen Yu, Kaixiang Yang, C. L. Philip Chen

Main category: cs.LG

TL;DR: Simple supervised models with limited labels outperform complex unsupervised methods in time series anomaly detection, challenging the focus on architectural complexity.

Details

Motivation: Current research focuses on unsupervised methods due to label scarcity, but overlooks the potential performance gains from limited anomaly labels available in practical scenarios.

Method: Introduced STAND, a streamlined supervised baseline, and conducted systematic comparison between supervised and unsupervised paradigms across five public datasets.

Result: (1) Simple supervised models significantly outperform complex unsupervised methods under limited labeling budget; (2) Minimal supervision provides higher returns than architectural innovations; (3) STAND shows superior prediction consistency and anomaly localization.

Conclusion: Advocates for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity.

Abstract: Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at https://github.com/EmorZz1G/STAND.

[344] Enhancing Nuclear Reactor Core Simulation through Data-Based Surrogate Models

Perceval Beja-Battais, Alain Grossetête, Nicolas Vayatis

Main category: cs.LG

TL;DR: The paper introduces two surrogate models for nuclear reactor core simulation to improve Model Predictive Control methods, achieving up to 1000x computational time reduction.

Details

Motivation: There is an increasing need for Nuclear Power Plants to improve flexibility to match renewable energy growth, requiring enhanced simulation methods for Operator Assistance Predictive Systems.

Method: Developed two surrogate models (data-driven and physics-informed) from nonlinear stiff ODEs as alternative simulation schemes for nuclear reactor core simulation.

Result: Both data-driven and physics-informed models can rapidly integrate complex dynamics with very low computational time (up to 1000x time reduction).

Conclusion: Surrogate models provide efficient alternatives for nuclear reactor simulation, enabling improved MPC methods for nuclear power plant flexibility.

Abstract: In recent years, there has been an increasing need for Nuclear Power Plants (NPPs) to improve flexibility in order to match the rapid growth of renewable energies. The Operator Assistance Predictive System (OAPS) developed by Framatome addresses this problem through Model Predictive Control (MPC). In this work, we aim to improve MPC methods through data-driven simulation schemes. Thus, from a set of nonlinear stiff ordinary differential equations (ODEs), this paper introduces two surrogate models acting as alternative simulation schemes to enhance nuclear reactor core simulation. We show that both data-driven and physics-informed models can rapidly integrate complex dynamics, with a very low computational time (up to 1000x time reduction).

[345] Achieving Skilled and Reliable Daily Probabilistic Forecasts of Wind Power at Subseasonal-to-Seasonal Timescales over France

Eloi Lindas, Yannig Goude, Philippe Ciais

Main category: cs.LG

TL;DR: A forecasting pipeline that transforms ECMWF subseasonal-to-seasonal weather forecasts into wind power predictions for 1-46 day lead times, achieving 50% improvement over climatological baselines with well-calibrated probabilistic forecasts.

Details

Motivation: Accurate long-term wind power forecasts are crucial for grid stability and market risk management, but current methods mainly focus on short-term predictions. Longer prediction horizons (subseasonal-to-seasonal) need investigation despite recent weather forecasting progress.

Method: Developed a forecasting pipeline that transforms ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts at daily resolution for 1-46 day lead times, including post-processing to correct biases and dispersion issues in weather forecasts.

Result: Outperformed climatological baseline by 50% in terms of Continuous Ranked Probability Skill Score and Ensemble Mean Squared Error, while providing near perfect calibration for lead times from 15 to 46 days.

Conclusion: The proposed framework successfully enables reliable subseasonal-to-seasonal wind power forecasting, demonstrating significant skill improvement over baseline methods and well-calibrated probabilistic predictions for extended lead times.

Abstract: Accurate and reliable wind power forecasts are crucial for grid stability, balancing supply and demand, and market risk management. Even though short-term weather forecasts have been thoroughly used to provide short-term renewable power predictions, forecasts involving longer prediction horizons still need investigations. Despite the recent progress in subseasonal-to-seasonal weather probabilistic forecasting, their use for wind power prediction usually involves both temporal and spatial aggregation achieve reasonable skill. In this study, we present a forecasting pipeline enabling to transform ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts for lead times ranging from 1 day to 46 days at daily resolution. This framework also include post-processing of the resulting power ensembles to account for the biases and lack of dispersion of the weather forecasts. We show that our method is able to outperform a climatological baseline by 50 % in terms of both Continuous Ranked Probability Skill Score and Ensemble Mean Squared Error while also providing near perfect calibration of the forecasts for lead times ranging from 15 to 46 days.

[346] CausalMamba: Interpretable State Space Modeling for Temporal Rumor Causality

Xiaotong Zhan, Xi Cheng

Main category: cs.LG

TL;DR: CausalMamba integrates Mamba-based sequence modeling, GCNs, and causal discovery via NOTEARS for rumor detection, achieving competitive performance while providing interpretable causal insights through counterfactual analysis.

Details

Motivation: Existing neural models for rumor detection lack interpretability and fail to reveal underlying causal mechanisms of misinformation spread, despite capturing content and structural features.

Method: Proposes CausalMamba framework combining Mamba-based sequence modeling for temporal tweet sequences, GCNs for reply structures, and differentiable causal discovery via NOTEARS to learn joint representations and uncover latent causal graphs.

Result: Achieves competitive classification performance on Twitter15 dataset compared to strong baselines, and enables counterfactual intervention analysis showing that removing top-ranked causal nodes significantly alters graph connectivity.

Conclusion: CausalMamba provides a unified approach for rumor classification and influence analysis, offering explainable and actionable insights into rumor dynamics for misinformation detection systems.

Abstract: Rumor detection on social media remains a challenging task due to the complex propagation dynamics and the limited interpretability of existing models. While recent neural architectures capture content and structural features, they often fail to reveal the underlying causal mechanisms of misinformation spread. We propose CausalMamba, a novel framework that integrates Mamba-based sequence modeling, graph convolutional networks (GCNs), and differentiable causal discovery via NOTEARS. CausalMamba learns joint representations of temporal tweet sequences and reply structures, while uncovering latent causal graphs to identify influential nodes within each propagation chain. Experiments on the Twitter15 dataset show that our model achieves competitive classification performance compared to strong baselines, and uniquely enables counterfactual intervention analysis. Qualitative results demonstrate that removing top-ranked causal nodes significantly alters graph connectivity, offering interpretable insights into rumor dynamics. Our framework provides a unified approach for rumor classification and influence analysis, paving the way for more explainable and actionable misinformation detection systems.

[347] A Switching Framework for Online Interval Scheduling with Predictions

Antonios Antoniadis, Ali Shahheidar, Golnoosh Shahkarami, Abolfazl Soltani

Main category: cs.LG

TL;DR: Online interval scheduling with predictions: SemiTrust-and-Switch framework combines prediction-based and classical algorithms to balance consistency and robustness, with tight bounds and smooth performance degradation.

Details

Motivation: To improve online interval scheduling performance using machine-learned predictions while maintaining robustness against prediction errors.

Method: SemiTrust-and-Switch framework that unifies prediction-based and classical algorithms, plus a randomized algorithm that smoothly interpolates between prediction-based and robust approaches.

Result: The framework provides tight bounds for consistency-robustness trade-off and the randomized algorithm achieves both robustness and smooth performance degradation with prediction quality.

Conclusion: The proposed framework effectively leverages predictions to enhance interval scheduling while ensuring robust performance, with theoretical guarantees on the trade-offs.

Abstract: We study online interval scheduling in the irrevocable setting, where each interval must be immediately accepted or rejected upon arrival. The objective is to maximize the total length of accepted intervals while ensuring that no two accepted intervals overlap. We consider this problem in a learning-augmented setting, where the algorithm has access to (machine-learned) predictions. The goal is to design algorithms that leverage these predictions to improve performance while maintaining robust guarantees in the presence of prediction errors. Our main contribution is the SemiTrust-and-Switch framework, which provides a unified approach for combining prediction-based and classical interval scheduling algorithms. This framework applies to both deterministic and randomized algorithms and captures the trade-off between consistency (performance under accurate predictions) and robustness (performance under adversarial inputs). Moreover, we provide lower bounds, proving the tightness of this framework in particular settings. We further design a randomized algorithm that smoothly interpolates between prediction-based and robust algorithms. This algorithm achieves both robustness and smoothness–its performance degrades gracefully with the quality of the prediction.

[348] Causal Synthetic Data Generation in Recruitment

Andrea Iommi, Antonio Mastropietro, Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri

Main category: cs.LG

TL;DR: This paper presents a specialized Synthetic Data Generation method using Causal Generative Models to create synthetic recruitment datasets that preserve causal relationships and enable fairness evaluation in candidate ranking algorithms.

Details

Motivation: Recruitment domains face data scarcity due to privacy constraints on sensitive CV information, hindering development of fair ML models. Lack of representative data leads to poor generalization and unreliable real-world performance of ranking algorithms.

Method: The authors use two Causal Generative Models (CGMs) - one for job offers and another for curricula - structured according to causal graphs informed by domain expertise. These models generate synthetic datasets while preserving underlying causal relationships.

Result: The method enables generation of synthetic recruitment datasets and evaluation of fairness in candidate rankings under controlled scenarios that introduce specific biases.

Conclusion: Causal Generative Models offer a promising solution for generating synthetic recruitment data that provides greater control over fairness and interpretability, addressing data scarcity issues in sensitive domains.

Abstract: The importance of Synthetic Data Generation (SDG) has increased significantly in domains where data quality is poor or access is limited due to privacy and regulatory constraints. One such domain is recruitment, where publicly available datasets are scarce due to the sensitive nature of information typically found in curricula vitae, such as gender, disability status, or age. % This lack of accessible, representative data presents a significant obstacle to the development of fair and transparent machine learning models, particularly ranking algorithms that require large volumes of data to effectively learn how to recommend candidates. In the absence of such data, these models are prone to poor generalisation and may fail to perform reliably in real-world scenarios. % Recent advances in Causal Generative Models (CGMs) offer a promising solution. CGMs enable the generation of synthetic datasets that preserve the underlying causal relationships within the data, providing greater control over fairness and interpretability in the data generation process. % In this study, we present a specialised SDG method involving two CGMs: one modelling job offers and the other modelling curricula. Each model is structured according to a causal graph informed by domain expertise. We use these models to generate synthetic datasets and evaluate the fairness of candidate rankings under controlled scenarios that introduce specific biases.

[349] Towards Overcoming Data Scarcity in Nuclear Energy: A Study on Critical Heat Flux with Physics-consistent Conditional Diffusion Model

Farah Alsafadi, Alexandra Akins, Xu Wu

Main category: cs.LG

TL;DR: This paper investigates using diffusion models to overcome data scarcity in nuclear energy applications, specifically for critical heat flux (CHF) data, by generating realistic synthetic samples that maintain physical consistency.

Details

Motivation: To address data scarcity in energy applications where experimental data are limited, costly, or difficult to obtain, by using deep generative models to enrich training datasets and improve downstream model robustness.

Method: Developed both vanilla and conditional diffusion models using a public CHF dataset covering commercial nuclear reactor conditions. The conditional DM generates targeted CHF data under user-specified thermal-hydraulic conditions.

Result: Both DM and conditional DM successfully generated realistic and physics-consistent CHF data. The conditional DM was highly effective in augmenting CHF data while maintaining acceptable uncertainty levels.

Conclusion: Diffusion models, particularly conditional variants, provide an effective pathway to overcome data scarcity in nuclear energy applications by generating high-fidelity synthetic data that preserves statistical and physical properties of the original dataset.

Abstract: Deep generative modeling provides a powerful pathway to overcome data scarcity in energy-related applications where experimental data are often limited, costly, or difficult to obtain. By learning the underlying probability distribution of the training dataset, deep generative models, such as the diffusion model (DM), can generate high-fidelity synthetic samples that statistically resemble the training data. Such synthetic data generation can significantly enrich the size and diversity of the available training data, and more importantly, improve the robustness of downstream machine learning models in predictive tasks. The objective of this paper is to investigate the effectiveness of DM for overcoming data scarcity in nuclear energy applications. By leveraging a public dataset on critical heat flux (CHF) that cover a wide range of commercial nuclear reactor operational conditions, we developed a DM that can generate an arbitrary amount of synthetic samples for augmenting of the CHF dataset. Since a vanilla DM can only generate samples randomly, we also developed a conditional DM capable of generating targeted CHF data under user-specified thermal-hydraulic conditions. The performance of the DM was evaluated based on their ability to capture empirical feature distributions and pair-wise correlations, as well as to maintain physical consistency. The results showed that both the DM and conditional DM can successfully generate realistic and physics-consistent CHF data. Furthermore, uncertainty quantification was performed to establish confidence in the generated data. The results demonstrated that the conditional DM is highly effective in augmenting CHF data while maintaining acceptable levels of uncertainty.

[350] Mind the Gap: Bridging Prior Shift in Realistic Few-Shot Crop-Type Classification

Joana Reuss, Ekaterina Gikalo, Marco Körner

Main category: cs.LG

TL;DR: Dirichlet Prior Augmentation (DirPA) addresses class imbalance in crop-type classification by simulating real-world label distribution skew during training, improving generalization in few-shot learning scenarios.

Details

Motivation: Real-world agricultural data suffers from severe class imbalance (long-tailed distribution), and training sets are often artificially balanced, creating a mismatch with test distributions that degrades real-world performance.

Method: Propose Dirichlet Prior Augmentation (DirPA) that models real-world distribution as Dirichlet-distributed random variables, performing prior augmentation during few-shot learning to simulate unknown label distribution skew.

Result: DirPA successfully shifts decision boundaries and stabilizes training by acting as a dynamic feature regularizer, improving generalization to real-world conditions.

Conclusion: DirPA effectively addresses the training-test distribution mismatch in imbalanced agricultural data through proactive simulation of real-world label skew during training.

Abstract: Real-world agricultural distributions often suffer from severe class imbalance, typically following a long-tailed distribution. Labeled datasets for crop-type classification are inherently scarce and remain costly to obtain. When working with such limited data, training sets are frequently constructed to be artificially balanced – in particular in the case of few-shot learning – failing to reflect real-world conditions. This mismatch induces a shift between training and test label distributions, degrading real-world generalization. To address this, we propose Dirichlet Prior Augmentation (DirPA), a novel method that simulates an unknown label distribution skew of the target domain proactively during model training. Specifically, we model the real-world distribution as Dirichlet-distributed random variables, effectively performing a prior augmentation during few-shot learning. Our experiments show that DirPA successfully shifts the decision boundary and stabilizes the training process by acting as a dynamic feature regularizer.

[351] Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty

Victor Croisfelt, João Henrique Inacio de Souza, Shashi Raj Pandey, Beatriz Soret, Petar Popovski

Main category: cs.LG

TL;DR: A neuro-inspired non-blocking inference paradigm using adaptive temporal windows of integration to handle stochastic delays across data streams, eliminating the need for reference modality and offline profiling.

Details

Motivation: Current non-blocking inference methods rely on reference-modality paradigm and require costly offline profiling, which limits adaptability to uncertain communication delays in cyber-physical systems.

Method: Proposes adaptive temporal windows of integration (TWIs) that dynamically adjust to stochastic delay patterns across heterogeneous streams, relaxing the reference-modality requirement.

Result: Achieves robust real-time inference with finer-grained control over accuracy-latency tradeoff, demonstrating superior adaptability to network dynamics on audio-visual event localization task compared to state-of-the-art approaches.

Conclusion: The neuro-inspired framework provides a more flexible and adaptive solution for handling communication delays in multi-stream inference systems without requiring reference modalities or offline profiling.

Abstract: Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. Uncertain communication delays across data streams challenge the temporal flow of the inference process. State-of-the-art (SotA) non-blocking inference methods rely on a reference-modality paradigm, requiring one modality input to be fully received before processing, while depending on costly offline profiling. We propose a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement. Our communication-delay-aware framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff. Experiments on the audio-visual event localization (AVEL) task demonstrate superior adaptability to network dynamics compared to SotA approaches.

[352] Deep SOR Minimax Q-learning for Two-player Zero-sum Game

Saksham Gautam, Lakshmi Mandal, Shalabh Bhatnagar

Main category: cs.LG

TL;DR: Proposes a deep successive over-relaxation minimax Q-learning algorithm for two-player zero-sum games using neural networks for high-dimensional state-action spaces, proving finite-time convergence and demonstrating effectiveness through experiments.

Details

Motivation: Existing successive over-relaxation Q-learning only works for tabular cases and hasn't been extended to function approximation settings needed for real-world high-dimensional spaces, nor applied to two-player zero-sum games.

Method: Developed a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators, suitable for high-dimensional state-action spaces.

Result: Proved finite-time convergence of the proposed algorithm and showed through numerical experiments that it outperforms existing Q-learning methods, with ablation studies demonstrating the effect of different successive over-relaxation parameters.

Conclusion: The proposed deep successive over-relaxation minimax Q-learning algorithm effectively extends the benefits of successive over-relaxation to high-dimensional settings with function approximation for two-player zero-sum games, achieving faster convergence and better performance.

Abstract: In this work, we consider the problem of a two-player zero-sum game. In the literature, the successive over-relaxation Q-learning algorithm has been developed and implemented, and it is seen to result in a lower contraction factor for the associated Q-Bellman operator resulting in a faster value iteration-based procedure. However, this has been presented only for the tabular case and not for the setting with function approximation that typically caters to real-world high-dimensional state-action spaces. Furthermore, such settings in the case of two-player zero-sum games have not been considered. We thus propose a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators and is suitable for high-dimensional spaces. We prove the finite-time convergence of the proposed algorithm. Through numerical experiments, we show the effectiveness of the proposed method over the existing Q-learning algorithm. Our ablation studies demonstrate the effect of different values of the crucial successive over-relaxation parameter.

[353] Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective

Yang Yu

Main category: cs.LG

TL;DR: The pass@k metric for evaluating LLM reasoning is analyzed and shown to be unsuitable as a direct optimization objective due to vanishing learning signals and exploration collapse.

Details

Motivation: To understand the limitations of using pass@k as an optimization objective in reinforcement learning for LLM reasoning tasks, given its widespread adoption.

Method: Analyzed the pass@k objective mathematically, derived its gradient, and examined its relationship with pass@1, studying exploration dynamics and probability concentration effects.

Result: Pass@k is fundamentally a per-example positive reweighting of pass@1 that provides vanishing learning signals when exploration is most needed, leading to exploration collapse as policies concentrate probability mass.

Conclusion: While pass@k is useful for evaluation, it’s unsuitable as a direct optimization objective; explicit exploration mechanisms are needed for effective RL in reasoning tasks.

Abstract: The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of “exploration collapse”, showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.

[354] GeoPTH: A Lightweight Approach to Category-Based Trajectory Retrieval via Geometric Prototype Trajectory Hashing

Yang Xu, Zuliang Yang, Kai Ming Ting

Main category: cs.LG

TL;DR: GeoPTH is a lightweight, non-learning framework for efficient trajectory similarity retrieval using geometric prototypes and Hausdorff metric hashing.

Details

Motivation: Address limitations of traditional metrics (computationally expensive) and learning-based methods (high training costs, instability) in trajectory similarity retrieval.

Method: Constructs data-dependent hash functions using representative trajectory prototypes as anchors, mapping trajectories to closest prototypes via Hausdorff metric.

Result: Highly competitive retrieval accuracy with traditional metrics and state-of-the-art learning methods, significantly outperforms binary codes from learned embeddings, and consistently superior efficiency.

Conclusion: Lightweight prototype-centric approach offers practical and powerful alternative for trajectory retrieval with exceptional performance and computational efficiency.

Abstract: Trajectory similarity retrieval is an important part of spatiotemporal data mining, however, existing methods have the following limitations: traditional metrics are computationally expensive, while learning-based methods suffer from substantial training costs and potential instability. This paper addresses these problems by proposing \textbf{Geo}metric \textbf{P}rototype \textbf{T}rajectory \textbf{H}ashing (GeoPTH), a novel, lightweight, and non-learning framework for efficient category-based trajectory retrieval. GeoPTH constructs data-dependent hash functions by using representative trajectory prototypes, i.e., small point sets preserving geometric characteristics, as anchors. The hashing process is efficient, which involves mapping a new trajectory to its closest prototype via a robust, \textit{Hausdorff} metric. Extensive experiments show that GeoPTH’s retrieval accuracy is highly competitive with both traditional metrics and state-of-the-art learning methods, and it significantly outperforms binary codes generated through simple binarization of the learned embeddings. Critically, GeoPTH consistently outperforms all competitors in terms of efficiency. Our work demonstrates that a lightweight, prototype-centric approach offers a practical and powerful alternative, achieving an exceptional retrieval performance and computational efficiency.

[355] Graph Diffusion Counterfactual Explanation

David Bechtoldt, Sidney Bender

Main category: cs.LG

TL;DR: A novel framework using diffusion models and classifier-free guidance to generate counterfactual explanations for graph data, addressing the challenge of explaining graph-based ML predictions.

Details

Motivation: Graph-based ML models lack interpretability, and counterfactual explanations are underexplored in graph domains due to the discrete and non-Euclidean nature of graphs.

Method: Combines discrete diffusion models with classifier-free guidance to generate counterfactual explanations for graph data.

Result: Empirically demonstrates reliable generation of in-distribution and minimally structurally different counterfactuals for both discrete and continuous graph properties.

Conclusion: The proposed Graph Diffusion Counterfactual Explanation framework effectively addresses the challenge of generating interpretable counterfactuals for graph-structured data.

Abstract: Machine learning models that operate on graph-structured data, such as molecular graphs or social networks, often make accurate predictions but offer little insight into why certain predictions are made. Counterfactual explanations address this challenge by seeking the closest alternative scenario where the model’s prediction would change. Although counterfactual explanations are extensively studied in tabular data and computer vision, the graph domain remains comparatively underexplored. Constructing graph counterfactuals is intrinsically difficult because graphs are discrete and non-euclidean objects. We introduce Graph Diffusion Counterfactual Explanation, a novel framework for generating counterfactual explanations on graph data, combining discrete diffusion models and classifier-free guidance. We empirically demonstrate that our method reliably generates in-distribution as well as minimally structurally different counterfactuals for both discrete classification targets and continuous properties.

[356] Optimizing Operation Recipes with Reinforcement Learning for Safe and Interpretable Control of Chemical Processes

Dean Brandner, Sergio Lucia

Main category: cs.LG

TL;DR: A reinforcement learning approach that optimizes chemical process operation by tuning parameters of existing operation recipes and linear controllers, requiring less data and handling constraints better than traditional methods.

Details

Motivation: Traditional reinforcement learning struggles with hard constraints and large data requirements in chemical processes, while optimal control methods face computational complexity. Current manual recipes lead to suboptimal performance.

Method: Use reinforcement learning to optimize parameters of existing operation recipes and their underlying linear controllers, leveraging embedded expert knowledge for structured optimization.

Result: Simulation on industrial batch polymerization reactor shows approach can achieve near-optimal controller performance while using significantly less data and handling constraints effectively.

Conclusion: The proposed method provides an interpretable, data-efficient solution for chemical process optimization that bridges the gap between manual recipes and complex optimal control methods.

Abstract: Optimal operation of chemical processes is vital for energy, resource, and cost savings in chemical engineering. The problem of optimal operation can be tackled with reinforcement learning, but traditional reinforcement learning methods face challenges due to hard constraints related to quality and safety that must be strictly satisfied, and the large amount of required training data. Chemical processes often cannot provide sufficient experimental data, and while detailed dynamic models can be an alternative, their complexity makes it computationally intractable to generate the needed data. Optimal control methods, such as model predictive control, also struggle with the complexity of the underlying dynamic models. Consequently, many chemical processes rely on manually defined operation recipes combined with simple linear controllers, leading to suboptimal performance and limited flexibility. In this work, we propose a novel approach that leverages expert knowledge embedded in operation recipes. By using reinforcement learning to optimize the parameters of these recipes and their underlying linear controllers, we achieve an optimized operation recipe. This method requires significantly less data, handles constraints more effectively, and is more interpretable than traditional reinforcement learning methods due to the structured nature of the recipes. We demonstrate the potential of our approach through simulation results of an industrial batch polymerization reactor, showing that it can approach the performance of optimal controllers while addressing the limitations of existing methods.

[357] Learning-Enhanced Observer for Linear Time-Invariant Systems with Parametric Uncertainty

Hao Shu

Main category: cs.LG

TL;DR: A learning-enhanced observer (LEO) framework that refines system matrices through gradient-based optimization to improve state estimation accuracy in uncertain linear time-invariant systems.

Details

Motivation: Traditional observers rely on nominal models and struggle with parameter uncertainty. This work aims to enhance observer performance by combining classical designs with modern learning mechanisms.

Method: Treats system matrices as optimizable variables and refines them through gradient-based minimization of steady-state output discrepancy loss, creating data-informed surrogate models.

Result: Monte Carlo studies show systematic reductions exceeding 15% in normalized estimation error for both open-loop and Luenberger observers across diverse system dimensions.

Conclusion: Learning mechanisms can effectively complement traditional observer design, yielding more accurate and robust state estimation in uncertain systems while preserving classical observer structures.

Abstract: This work introduces a learning-enhanced observer (LEO) for linear time-invariant systems with uncertain dynamics. Rather than relying solely on nominal models, the proposed framework treats the system matrices as optimizable variables and refines them through gradient-based minimization of a steady-state output discrepancy loss. The resulting data-informed surrogate model enables the construction of an improved observer that effectively compensates for moderate parameter uncertainty while preserving the structure of classical designs. Extensive Monte Carlo studies across diverse system dimensions show systematic and statistically significant reductions, typically exceeding 15%, in normalized estimation error for both open-loop and Luenberger observers. These results demonstrate that modern learning mechanisms can serve as a powerful complement to traditional observer design, yielding more accurate and robust state estimation in uncertain systems. Codes are available at https://github.com/Hao-B-Shu/LTI_LEO.

[358] Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning

Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub

Main category: cs.LG

TL;DR: This paper reviews World Models for healthcare, which learn predictive dynamics to enable multistep rollouts, counterfactual evaluation, and planning across medical imaging, EHRs, and robotic surgery.

Details

Motivation: Current AI in healthcare lacks physical foundation and temporal reasoning needed for clinical decision support. World models offer multimodal, temporally coherent representations that reflect the physical and causal structure of care.

Method: Survey of recent work across three domains: medical imaging/diagnostics, disease progression modeling from EHRs, and robotic surgery/surgical planning. Introduces a 4-level capability rubric (L1-L4) for evaluation.

Result: Most reviewed systems achieve L1 (temporal prediction) and L2 (action-conditioned prediction), with fewer instances of L3 (counterfactual rollouts) and rare L4 (planning/control). Identifies key gaps limiting clinical reliability.

Conclusion: Outlines research agenda for clinically robust prediction-first world models that integrate generative backbones with causal/mechanical foundation for safe decision support in healthcare.

Abstract: Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack physical foundation and temporal reasoning required for clinical decision support. As scaling language models show diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews World Models for healthcare systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture i.e., JEPA-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1–L2, with fewer instances of L3 and rare L4. We identify cross-cutting gaps that limit clinical reliability; under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust prediction-first world models that integrate generative backbones (transformers, diffusion, VAE) with causal/mechanical foundation for safe decision support in healthcare.

[359] Improving Iterative Gaussian Processes via Warm Starting Sequential Posteriors

Alan Yufei Dong, Jihao Andreas Lin, José Miguel Hernández-Lobato

Main category: cs.LG

TL;DR: Proposes a method to improve convergence of iterative Gaussian process solvers by leveraging solutions from smaller contained systems, achieving speed-ups for incremental data tasks.

Details

Motivation: Scalable Gaussian process inference is crucial for sequential decision-making but remains challenging; iterative solvers need better convergence for incremental data scenarios.

Method: Uses known solutions from smaller linear systems to improve convergence of iterative linear solvers (conjugate gradients, SGD, alternative projections) for larger GP systems.

Result: Achieves speed-ups when solving to tolerance and improved Bayesian optimisation performance under fixed compute budgets for incremental data addition tasks.

Conclusion: The proposed technique effectively enhances iterative GP solver convergence by leveraging smaller system solutions, providing practical benefits for sequential decision-making applications.

Abstract: Scalable Gaussian process (GP) inference is essential for sequential decision-making tasks, yet improving GP scalability remains a challenging problem with many open avenues of research. This paper focuses on iterative GPs, where iterative linear solvers, such as conjugate gradients, stochastic gradient descent or alternative projections, are used to approximate the GP posterior. We propose a new method which improves solver convergence of a large linear system by leveraging the known solution to a smaller system contained within. This is significant for tasks with incremental data additions, and we show that our technique achieves speed-ups when solving to tolerance, as well as improved Bayesian optimisation performance under a fixed compute budget.

[360] Are Foundation Models Useful for Bankruptcy Prediction?

Marcin Kostrzewa, Oleksii Furman, Roman Furman, Sebastian Tomczak, Maciej Zięba

Main category: cs.LG

TL;DR: Foundation models like Llama-3.3-70B-Instruct and TabPFN underperform classical ML methods (XGBoost, CatBoost) for corporate bankruptcy prediction on large imbalanced datasets, with LLMs showing unreliable probability estimates and TabPFN having unjustified computational costs.

Details

Motivation: To systematically evaluate foundation models against established methods for corporate bankruptcy prediction, as their effectiveness for this specific financial application remains unevaluated despite promise in other domains.

Method: Used Llama-3.3-70B-Instruct and TabPFN foundation models, compared against classical ML baselines (XGBoost, CatBoost) on large imbalanced datasets of over one million company records from the Visegrád Group.

Result: XGBoost and CatBoost consistently outperformed foundation models across all prediction horizons. LLMs showed unreliable probability estimates, and TabPFN required substantial computational resources without performance gains justifying the costs.

Conclusion: Current foundation models remain less effective than specialized methods for bankruptcy forecasting, despite their generality, highlighting the importance of domain-specific approaches for financial risk prediction.

Abstract: Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrád Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.

[361] Optimal Fairness under Local Differential Privacy

Hrad Ghoukasian, Shahab Asoodeh

Main category: cs.LG

TL;DR: This paper presents optimal LDP mechanisms that reduce data unfairness while preserving privacy, improving fairness in downstream classification with better accuracy-fairness trade-offs than existing methods.

Details

Motivation: To address how local differential privacy mechanisms can be designed to reduce data unfairness and improve fairness in downstream classification tasks while maintaining privacy.

Method: Derived closed-form optimal mechanism for binary sensitive attributes and developed tractable optimization framework for multi-valued attributes; established theoretical link between privacy-aware pre-processing and classification fairness.

Result: The approach consistently outperforms existing LDP mechanisms in reducing data unfairness across diverse datasets and fairness metrics, while maintaining accuracy close to non-private models and achieving better accuracy-fairness trade-off than leading fairness methods.

Conclusion: LDP serves as a principled and effective pre-processing fairness intervention technique that simultaneously preserves privacy of sensitive attributes.

Abstract: We investigate how to optimally design local differential privacy (LDP) mechanisms that reduce data unfairness and thereby improve fairness in downstream classification. We first derive a closed-form optimal mechanism for binary sensitive attributes and then develop a tractable optimization framework that yields the corresponding optimal mechanism for multi-valued attributes. As a theoretical contribution, we establish that for discrimination-accuracy optimal classifiers, reducing data unfairness necessarily leads to lower classification unfairness, thus providing a direct link between privacy-aware pre-processing and classification fairness. Empirically, we demonstrate that our approach consistently outperforms existing LDP mechanisms in reducing data unfairness across diverse datasets and fairness metrics, while maintaining accuracy close to that of non-private models. Moreover, compared with leading pre-processing and post-processing fairness methods, our mechanism achieves a more favorable accuracy-fairness trade-off while simultaneously preserving the privacy of sensitive attributes. Taken together, these results highlight LDP as a principled and effective pre-processing fairness intervention technique.

[362] Collaborative Management for Chronic Diseases and Depression: A Double Heterogeneity-based Multi-Task Learning Method

Yidong Chai, Haoxin Liu, Jiaheng Xie, Chaopeng Wang, Xiao Fang

Main category: cs.LG

TL;DR: Proposes ADH-MTL method for joint assessment of physical chronic diseases and depression using wearable sensors, addressing double heterogeneity through group-level modeling, decomposition strategy, and Bayesian networks.

Details

Motivation: Most health sensing studies focus only on physical diseases, overlooking the need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care management.

Method: Multi-task learning approach with ADH-MTL method featuring three innovations: group-level modeling for new patient predictions, decomposition strategy to reduce complexity, and Bayesian network to capture dependencies while balancing similarities/differences across diseases.

Result: Empirical evaluations on real-world wearable sensor data show ADH-MTL significantly outperforms existing baselines, with each innovation proven effective.

Conclusion: Provides computational solution for integrated physical and mental healthcare and design principles for advancing collaborative chronic disease management across pre-treatment, treatment, and post-treatment phases.

Abstract: Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.

[363] FreqFlow: Long-term forecasting using lightweight flow matching

Seyed Mohamad Moghadas, Bruno Cornelis, Adrian Munteanu

Main category: cs.LG

TL;DR: FreqFlow is a lightweight multivariate time-series forecasting framework that uses conditional flow matching in the frequency domain, achieving state-of-the-art performance with only 89k parameters and deterministic single-pass sampling.

Details

Motivation: Current diffusion-based generative models for MTS forecasting suffer from computational overhead due to iterative sampling and struggle with high-dimensional, non-stationary data with multi-scale periodic patterns.

Method: Transforms forecasting to spectral domain using conditional flow matching; learns amplitude and phase shifts via single complex-valued linear layer; decomposes signals into trend, seasonal, and residual components; uses ODE integration for deterministic sampling.

Result: Achieves 7% average RMSE improvement on real-world traffic datasets; significantly faster and more parameter-efficient than diffusion models; handles high-dimensional, non-stationary data effectively.

Conclusion: FreqFlow demonstrates that frequency-domain flow matching enables efficient and accurate MTS forecasting with minimal computational overhead, making it suitable for real-time deployment.

Abstract: Multivariate time-series (MTS) forecasting is fundamental to applications ranging from urban mobility and resource management to climate modeling. While recent generative models based on denoising diffusion have advanced state-of-the-art performance in capturing complex data distributions, they suffer from significant computational overhead due to iterative stochastic sampling procedures that limit real-time deployment. Moreover, these models can be brittle when handling high-dimensional, non-stationary, and multi-scale periodic patterns characteristic of real-world sensor networks. We introduce FreqFlow, a novel framework that leverages conditional flow matching in the frequency domain for deterministic MTS forecasting. Unlike conventional approaches that operate in the time domain, FreqFlow transforms the forecasting problem into the spectral domain, where it learns to model amplitude and phase shifts through a single complex-valued linear layer. This frequency-domain formulation enables the model to efficiently capture temporal dynamics via complex multiplication, corresponding to scaling and temporal translations. The resulting architecture is exceptionally lightweight with only 89k parameters - an order of magnitude smaller than competing diffusion-based models-while enabling single-pass deterministic sampling through ordinary differential equation (ODE) integration. Our approach decomposes MTS signals into trend, seasonal, and residual components, with the flow matching mechanism specifically designed for residual learning to enhance long-term forecasting accuracy. Extensive experiments on real-world traffic speed, volume, and flow datasets demonstrate that FreqFlow achieves state-of-the-art forecasting performance, on average 7% RMSE improvements, while being significantly faster and more parameter-efficient than existing methods

[364] Generative Modeling of Clinical Time Series via Latent Stochastic Differential Equations

Muhammad Aslanimoghanloo, Ahmed ElGazzar, Marcel van Gerven

Main category: cs.LG

TL;DR: A generative modeling framework using latent neural stochastic differential equations (SDEs) that handles irregular clinical time series data, captures disease progression uncertainty, and outperforms baseline models in treatment effect estimation and physiological forecasting.

Details

Motivation: Clinical time series data from EHRs and medical registries offer opportunities for understanding patient trajectories but present challenges due to irregular sampling, complex latent physiology, and measurement uncertainties.

Method: Latent neural SDEs with modality-dependent emission models, using variational inference for state estimation and parameter learning. Models clinical time series as discrete observations of an underlying controlled stochastic dynamical system.

Result: Outperforms ordinary differential equation and LSTM baseline models in accuracy and uncertainty estimation for both simulated PKPD lung cancer treatment effect estimation and real-world ICU physiological signal forecasting from 12,000 patients.

Conclusion: The framework enables precise, uncertainty-aware predictions to support clinical decision-making by naturally handling irregular sampling, learning complex interactions, and capturing disease progression stochasticity in a unified probabilistic framework.

Abstract: Clinical time series data from electronic health records and medical registries offer unprecedented opportunities to understand patient trajectories and inform medical decision-making. However, leveraging such data presents significant challenges due to irregular sampling, complex latent physiology, and inherent uncertainties in both measurements and disease progression. To address these challenges, we propose a generative modeling framework based on latent neural stochastic differential equations (SDEs) that views clinical time series as discrete-time partial observations of an underlying controlled stochastic dynamical system. Our approach models latent dynamics via neural SDEs with modality-dependent emission models, while performing state estimation and parameter learning through variational inference. This formulation naturally handles irregularly sampled observations, learns complex non-linear interactions, and captures the stochasticity of disease progression and measurement noise within a unified scalable probabilistic framework. We validate the framework on two complementary tasks: (i) individual treatment effect estimation using a simulated pharmacokinetic-pharmacodynamic (PKPD) model of lung cancer, and (ii) probabilistic forecasting of physiological signals using real-world intensive care unit (ICU) data from 12,000 patients. Results show that our framework outperforms ordinary differential equation and long short-term memory baseline models in accuracy and uncertainty estimation. These results highlight its potential for enabling precise, uncertainty-aware predictions to support clinical decision-making.

[365] A Comparison Between Decision Transformers and Traditional Offline Reinforcement Learning Algorithms

Ali Murtaza Caunhye, Asad Jeewa

Main category: cs.LG

TL;DR: Comparative study shows Decision Transformers outperform traditional offline RL methods in sparse reward settings and are less sensitive to reward density, while value-based methods like IQL excel in dense reward scenarios with high-quality data.

Details

Motivation: To compare Decision Transformers against traditional offline RL algorithms (CQL, IQL) in different reward settings, addressing challenges in balancing exploration and exploitation with varying reward densities.

Method: Empirical analysis in ANT continuous control environment, evaluating performance across dense and sparse reward settings with different dataset qualities (medium-expert).

Result: DTs showed less sensitivity to varying reward density, excelled in sparse reward scenarios with medium-expert datasets, while IQL performed better in dense reward settings with high-quality data. CQL offered balanced performance across data qualities. DTs had lower variance but higher computational costs.

Conclusion: Sequence modelling approaches (DTs) are more suitable for uncertain reward structures or mixed-quality data, while value-based methods remain competitive in dense reward settings with high-quality demonstrations.

Abstract: The field of Offline Reinforcement Learning (RL) aims to derive effective policies from pre-collected datasets without active environment interaction. While traditional offline RL algorithms like Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) have shown promise, they often face challenges in balancing exploration and exploitation, especially in environments with varying reward densities. The recently proposed Decision Transformer (DT) approach, which reframes offline RL as a sequence modelling problem, has demonstrated impressive results across various benchmarks. This paper presents a comparative study evaluating the performance of DT against traditional offline RL algorithms in dense and sparse reward settings for the ANT continous control environment. Our research investigates how these algorithms perform when faced with different reward structures, examining their ability to learn effective policies and generalize across varying levels of feedback. Through empirical analysis in the ANT environment, we found that DTs showed less sensitivity to varying reward density compared to other methods and particularly excelled with medium-expert datasets in sparse reward scenarios. In contrast, traditional value-based methods like IQL showed improved performance in dense reward settings with high-quality data, while CQL offered balanced performance across different data qualities. Additionally, DTs exhibited lower variance in performance but required significantly more computational resources compared to traditional approaches. These findings suggest that sequence modelling approaches may be more suitable for scenarios with uncertain reward structures or mixed-quality data, while value-based methods remain competitive in settings with dense rewards and high-quality demonstrations.

[366] Limitations of Scalarisation in MORL: A Comparative Study in Discrete Environments

Muhammad Sa’ood Shah, Asad Jeewa

Main category: cs.LG

TL;DR: Scalarisation functions in MORL struggle to accurately approximate Pareto fronts in complex environments, while inner-loop multi-policy algorithms like Pareto Q-Learning offer more robust alternatives.

Details

Motivation: To investigate the limitations of scalarisation approaches in multi-objective reinforcement learning, particularly their inability to accurately approximate Pareto fronts in complex, uncertain environments.

Method: Evaluated MO Q-Learning with linear and Chebyshev scalarisation functions (single-policy) versus Pareto Q-Learning (multi-policy) across MORL environments with discrete action/observation spaces using outer-loop multi-policy methodology.

Result: Scalarisation functions’ performance depends heavily on environment and Pareto front shape, often failing to retain discovered solutions and favoring certain solution space regions. Finding appropriate weight configurations for full Pareto front sampling is complex.

Conclusion: Inner-loop multi-policy algorithms provide more sustainable and generalizable approaches for intelligent decision-making in dynamic, uncertain environments compared to scalarisation-based methods.

Abstract: Scalarisation functions are widely employed in MORL algorithms to enable intelligent decision-making. However, these functions often struggle to approximate the Pareto front accurately, rendering them unideal in complex, uncertain environments. This study examines selected Multi-Objective Reinforcement Learning (MORL) algorithms across MORL environments with discrete action and observation spaces. We aim to investigate further the limitations associated with scalarisation approaches for decision-making in multi-objective settings. Specifically, we use an outer-loop multi-policy methodology to assess the performance of a seminal single-policy MORL algorithm, MO Q-Learning implemented with linear scalarisation and Chebyshev scalarisation functions. In addition, we explore a pioneering inner-loop multi-policy algorithm, Pareto Q-Learning, which offers a more robust alternative. Our findings reveal that the performance of the scalarisation functions is highly dependent on the environment and the shape of the Pareto front. These functions often fail to retain the solutions uncovered during learning and favour finding solutions in certain regions of the solution space. Moreover, finding the appropriate weight configurations to sample the entire Pareto front is complex, limiting their applicability in uncertain settings. In contrast, inner-loop multi-policy algorithms may provide a more sustainable and generalizable approach and potentially facilitate intelligent decision-making in dynamic and uncertain environments.

[367] Correlation-Aware Feature Attribution Based Explainable AI

Poushali Sengupta, Yan Zhang, Frank Eliassen, Sabita Maharjan

Main category: cs.LG

TL;DR: ExCIR is a correlation-aware attribution method that provides computationally efficient and stable global feature importance rankings by using robust centering and lightweight transfer protocols.

Details

Motivation: Existing global attribution methods have high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets.

Method: ExCIR quantifies sign-aligned co-movement between features and model outputs after robust centering, and includes BlockCIR for groupwise attribution of correlated feature sets.

Result: ExCIR shows trustworthy agreement with established baselines, delivers consistent top-k rankings across diverse datasets, and reduces runtime via lightweight evaluation on data subsets.

Conclusion: ExCIR provides computationally efficient, consistent, and scalable explainability suitable for real-world deployment.

Abstract: Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with \emph{ExCIR} (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after \emph{robust centering} (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce \textsc{BlockCIR}, a \emph{groupwise} extension of ExCIR that scores \emph{sets} of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, \textsc{BlockCIR} mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-$k$ rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides \emph{computationally efficient}, \emph{consistent}, and \emph{scalable} explainability for real-world deployment.

[368] ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation

Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades

Main category: cs.LG

TL;DR: ODE-ViT reformulates Vision Transformers as ODE systems for stable, interpretable performance with fewer parameters, enhanced by a teacher-student framework.

Details

Motivation: Large models require high computational resources and lack interpretability; ODE-ViT addresses this by leveraging connections between residual networks and ODEs for stable dynamics.

Method: Reformulate Vision Transformer as an ODE system ensuring well-posed and stable dynamics; introduce a teacher-student framework where a discrete ViT guides ODE-ViT’s continuous trajectory.

Result: ODE-ViT achieves stable, interpretable, and competitive performance on CIFAR-10/100 with up to 10x fewer parameters, outperforming prior ODE-based Transformers; teacher-student framework boosts performance by over 10%.

Conclusion: ODE-ViT offers a computationally efficient and interpretable alternative to large Vision Transformers, with the teacher-student framework further enhancing performance.

Abstract: In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.

[369] Loss Functions Robust to the Presence of Label Errors

Nicholas Pellegrino, David Szczecina, Paul Fieguth

Main category: cs.LG

TL;DR: Two novel loss functions that de-weight or ignore difficult samples (likely label errors) improve F1 scores for detecting label errors compared to Cross Entropy and Focal Loss baselines.

Details

Motivation: Methods for detecting label errors require robust models, but training on corrupted data is challenging. Focal Loss's emphasis on difficult samples inspired new approaches that handle potential label errors differently.

Method: Proposed two simple loss functions that de-weight or ignore difficult-to-classify samples (likely containing label errors), building on the concept of Focal Loss but with opposite approach.

Result: Experiments on artificially corrupted data show improved F1 scores for label error detection compared to conventional categorical Cross Entropy and Focal Loss baselines.

Conclusion: The proposed loss functions that handle difficult samples differently show promise for improving label error detection in training data.

Abstract: Methods for detecting label errors in training data require models that are robust to label errors (i.e., not fit to erroneously labelled data points). However, acquiring such models often involves training on corrupted data, which presents a challenge. Adjustments to the loss function present an opportunity for improvement. Motivated by Focal Loss (which emphasizes difficult-to-classify samples), two novel, yet simple, loss functions are proposed that de-weight or ignore these difficult samples (i.e., those likely to have label errors). Results on artificially corrupted data show promise, such that F1 scores for detecting errors are improved from the baselines of conventional categorical Cross Entropy and Focal Loss.

[370] Dynamic Participation in Federated Learning: Benchmarks and a Knowledge Pool Plugin

Ming-Lun Lee, Fu-Shiang Yang, Cheng-Kuan Lin, Yan-Ann Chen, Chih-Yu Lin, Yu-Chee Tseng

Main category: cs.LG

TL;DR: First open-source framework for benchmarking federated learning under dynamic client participation, revealing significant performance degradation and proposing KPFL solution.

Details

Motivation: Existing FL research assumes consistent client participation, ignoring practical scenarios where clients dynamically join/leave during training, with no systematic benchmarking framework for DPFL challenges.

Method: Developed configurable benchmarking framework for DPFL, then proposed KPFL - a generic plugin with shared knowledge pool, dual-age/data-bias weighting, and generative knowledge distillation.

Result: Benchmarking revealed substantial performance degradation under dynamic participation; KPFL significantly improved model robustness and generalization across experiments.

Conclusion: Dynamic participation significantly impacts FL performance, and KPFL effectively addresses these challenges by maintaining knowledge continuity and preventing knowledge loss.

Abstract: Federated learning (FL) enables clients to collaboratively train a shared model in a distributed manner, setting it apart from traditional deep learning paradigms. However, most existing FL research assumes consistent client participation, overlooking the practical scenario of dynamic participation (DPFL), where clients may intermittently join or leave during training. Moreover, no existing benchmarking framework systematically supports the study of DPFL-specific challenges. In this work, we present the first open-source framework explicitly designed for benchmarking FL models under dynamic client participation. Our framework provides configurable data distributions, participation patterns, and evaluation metrics tailored to DPFL scenarios. Using this platform, we benchmark four major categories of widely adopted FL models and uncover substantial performance degradation under dynamic participation. To address these challenges, we further propose Knowledge-Pool Federated Learning (KPFL), a generic plugin that maintains a shared knowledge pool across both active and idle clients. KPFL leverages dual-age and data-bias weighting, combined with generative knowledge distillation, to mitigate instability and prevent knowledge loss. Extensive experiments demonstrate the significant impact of dynamic participation on FL performance and the effectiveness of KPFL in improving model robustness and generalization.

[371] FairLRF: Achieving Fairness through Sparse Low Rank Factorization

Yuanbo Guo, Jun Xia, Yiyu Shi

Main category: cs.LG

TL;DR: Proposes FairLRF, a fairness-enhancing method using SVD to selectively remove bias-inducing elements from unitary matrices, improving DL model fairness without significant accuracy loss.

Details

Motivation: Existing bias-mitigation methods are computationally expensive or cause substantial accuracy drops, limiting their practicality in resource-constrained real-world applications like medical diagnosis.

Method: Uses singular value decomposition (SVD) to identify and selectively remove bias-inducing elements from unitary matrices, reducing group disparities based on sensitive attributes.

Result: Outperforms conventional low rank factorization methods and state-of-the-art fairness-enhancing techniques in extensive experiments.

Conclusion: SVD can be effectively repurposed for fairness enhancement rather than just model compression, offering a practical solution for improving DL model fairness in real-world applications.

Abstract: As deep learning (DL) techniques become integral to various applications, ensuring model fairness while maintaining high performance has become increasingly critical, particularly in sensitive fields such as medical diagnosis. Although a variety of bias-mitigation methods have been proposed, many rely on computationally expensive debiasing strategies or suffer substantial drops in model accuracy, which limits their practicality in real-world, resource-constrained settings. To address this issue, we propose a fairness-oriented low rank factorization (LRF) framework that leverages singular value decomposition (SVD) to improve DL model fairness. Unlike traditional SVD, which is mainly used for model compression by decomposing and reducing weight matrices, our work shows that SVD can also serve as an effective tool for fairness enhancement. Specifically, we observed that elements in the unitary matrices obtained from SVD contribute unequally to model bias across groups defined by sensitive attributes. Motivated by this observation, we propose a method, named FairLRF, that selectively removes bias-inducing elements from unitary matrices to reduce group disparities, thus enhancing model fairness. Extensive experiments show that our method outperforms conventional LRF methods as well as state-of-the-art fairness-enhancing techniques. Additionally, an ablation study examines how major hyper-parameters may influence the performance of processed models. To the best of our knowledge, this is the first work utilizing SVD not primarily for compression but for fairness enhancement.

[372] Broad stochastic configuration residual learning system for norm-convergent universal approximation

Han Su, Zhongyan Li, Wanquan Liu

Main category: cs.LG

TL;DR: The paper identifies a limitation in broad residual learning systems (BRLS) where universal approximation relies on probability measure convergence rather than norm convergence, making it sensitive to random parameter selection. The authors propose BSCRLS with a supervisory mechanism to constrain random parameters and prove its universal approximation with norm convergence.

Details

Motivation: To address the sensitivity of randomized learning networks to random parameter selection and establish universal approximation with more rigorous norm convergence rather than probability measure convergence.

Method: Proposed broad stochastic configuration residual learning system (BSCRLS) with a supervisory mechanism that adaptively constrains random parameter ranges within the BRLS framework. Developed three incremental BSCRLS algorithms for different network update requirements.

Result: Theoretically proved BSCRLS achieves universal approximation with norm convergence. Experimental results on solar panel dust detection showed BSCRLS outperformed 13 deep and broad learning algorithms.

Conclusion: BSCRLS effectively addresses the limitation of BRLS by ensuring norm convergence for universal approximation through adaptive constraint of random parameters, demonstrating superior performance in practical applications.

Abstract: Universal approximation serves as the foundation of neural network learning algorithms. However, some networks establish their universal approximation property by demonstrating that the iterative errors converge in probability measure rather than the more rigorous norm convergence, which makes the universal approximation property of randomized learning networks highly sensitive to random parameter selection, Broad residual learning system (BRLS), as a member of randomized learning models, also encounters this issue. We theoretically demonstrate the limitation of its universal approximation property, that is, the iterative errors do not satisfy norm convergence if the selection of random parameters is inappropriate and the convergence rate meets certain conditions. To address this issue, we propose the broad stochastic configuration residual learning system (BSCRLS) algorithm, which features a novel supervisory mechanism adaptively constraining the range settings of random parameters on the basis of BRLS framework, Furthermore, we prove the universal approximation theorem of BSCRLS based on the more stringent norm convergence. Three versions of incremental BSCRLS algorithms are presented to satisfy the application requirements of various network updates. Solar panels dust detection experiments are performed on publicly available dataset and compared with 13 deep and broad learning algorithms. Experimental results reveal the effectiveness and superiority of BSCRLS algorithms.

[373] Toward Valid Generative Clinical Trial Data with Survival Endpoints

Perrine Chassat, Van Tuan Nguyen, Lucas Ducrot, Emilie Lanoy, Agathe Guilloux

Main category: cs.LG

TL;DR: A VAE-based method for generating synthetic control arms with time-to-event outcomes, outperforming GANs in fidelity, utility, and privacy while addressing censoring and small sample challenges.

Details

Motivation: Clinical trials face challenges with fragmented populations, slow enrollment, and high costs, especially in oncology and rare diseases. Synthetic control arms using generative AI offer a promising alternative to traditional external controls.

Method: Variational autoencoder that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring.

Result: Outperforms GAN baselines on fidelity, utility, and privacy metrics across synthetic and real trial datasets. Effective for data sharing under privacy constraints and control-arm augmentation.

Conclusion: The method shows progress in generative survival modeling but reveals systematic miscalibration issues. A post-generation selection procedure is proposed to improve calibration, highlighting both achievements and remaining challenges.

Abstract: Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare disease trials, but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across synthetic and real trial datasets, we evaluate our model in two realistic scenarios: (i) data sharing under privacy constraints, where synthetic controls substitute for original data, and (ii) control-arm augmentation, where synthetic patients mitigate imbalances between treated and control groups. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. We propose a post-generation selection procedure that improves calibration, highlighting both progress and open challenges for generative survival modeling.

[374] Boosting Predictive Performance on Tabular Data through Data Augmentation with Latent-Space Flow-Based Diffusion

Md. Tawfique Ihsan, Md. Rakibul Hasan Rafi, Ahmed Shoyeb Raihan, Imtiaz Ahmed, Abdullahil Azeem

Main category: cs.LG

TL;DR: A family of latent-space, tree-driven diffusion methods (PCAForest, EmbedForest, AttentionForest) for minority oversampling in imbalanced tabular data, using conditional flow matching with gradient-boosted trees to preserve tabular structure while improving privacy and efficiency.

Details

Motivation: Address severe class imbalance in real-world tabular learning where rare minority classes are crucial, overcoming limitations of existing generative methods (GANs, VAEs, diffusion models) that struggle with tabular heterogeneity, training stability, and privacy concerns.

Method: Three variants using conditional flow matching with gradient-boosted trees as vector-field learners: PCAForest (linear PCA embedding), EmbedForest (learned nonlinear embedding), and AttentionForest (attention-augmented embedding), all operating in compact latent spaces with decoders back to original feature space.

Result: Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest offer faster generation with similar utility. Privacy metrics comparable to or better than ForestDiffusion baseline.

Conclusion: Latent-space, tree-driven diffusion provides efficient and privacy-aware approach for high-fidelity tabular data augmentation under severe class imbalance, with smaller embeddings improving minority recall and aggressive learning rates harming stability.

Abstract: Severe class imbalance is common in real-world tabular learning, where rare but important minority classes are essential for reliable prediction. Existing generative oversampling methods such as GANs, VAEs, and diffusion models can improve minority-class performance, but they often struggle with tabular heterogeneity, training stability, and privacy concerns. We propose a family of latent-space, tree-driven diffusion methods for minority oversampling that use conditional flow matching with gradient-boosted trees as the vector-field learner. The models operate in compact latent spaces to preserve tabular structure and reduce computation. We introduce three variants: PCAForest, which uses linear PCA embedding; EmbedForest, which uses a learned nonlinear embedding; and AttentionForest, which uses an attention-augmented embedding. Each method couples a GBT-based flow with a decoder back to the original feature space. Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves the best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest reach similar utility with much faster generation, offering favorable accuracy-efficiency trade-offs. Privacy evaluated with nearest-neighbor distance ratio and distance-to-closest-record is comparable to or better than the ForestDiffusion baseline. Ablation studies show that smaller embeddings tend to improve minority recall, while aggressive learning rates harm stability. Overall, latent-space, tree-driven diffusion provides an efficient and privacy-aware approach to high-fidelity tabular data augmentation under severe class imbalance.

[375] ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions

Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal

Main category: cs.LG

TL;DR: ECPv2 is an improved global optimization algorithm for Lipschitz-continuous functions that addresses computational efficiency issues in the original ECP framework while maintaining theoretical guarantees.

Details

Motivation: To overcome limitations of the original ECP algorithm, including high computational costs and overly conservative early behavior, while maintaining its theoretical guarantees for global optimization.

Method: Introduces three key innovations: adaptive lower bound to avoid vacuous acceptance regions, Worst-m memory mechanism for efficient comparisons, and fixed random projection for faster distance computations in high dimensions.

Result: ECPv2 achieves state-of-the-art performance across high-dimensional non-convex optimization benchmarks while significantly reducing wall-clock time, matching or outperforming existing optimizers.

Conclusion: ECPv2 successfully addresses ECP’s limitations while preserving its theoretical guarantees, making it a scalable and efficient global optimization algorithm suitable for high-dimensional problems.

Abstract: We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP’s no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.

[376] Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods

Amartya Mukherjee, Jun Liu

Main category: cs.LG

TL;DR: DP-SGD converges almost surely under standard smoothness assumptions in both nonconvex and strongly convex settings with decaying step sizes, extending to momentum variants like DP-SHB and DP-NAG.

Details

Motivation: Existing analyses of DP-SGD typically establish convergence in expectation or with high probability, but lack understanding of almost sure convergence of single trajectories, leaving theoretical foundations incomplete.

Method: Prove almost sure convergence of DP-SGD under standard smoothness assumptions using decaying step sizes, and extend analysis to momentum variants (DP-SHB, DP-NAG) through careful energy constructions.

Result: DP-SGD converges almost surely in both nonconvex and strongly convex settings, and momentum variants maintain similar convergence guarantees despite privacy-induced distortions.

Conclusion: The results provide stronger theoretical foundations for differentially private optimization, showing that DP algorithms remain pathwise stable in both convex and nonconvex regimes despite privacy constraints.

Abstract: Differentially private stochastic gradient descent (DP-SGD) has become the standard algorithm for training machine learning models with rigorous privacy guarantees. Despite its widespread use, the theoretical understanding of its long-run behavior remains limited: existing analyses typically establish convergence in expectation or with high probability, but do not address the almost sure convergence of single trajectories. In this work, we prove that DP-SGD converges almost surely under standard smoothness assumptions, both in nonconvex and strongly convex settings, provided the step sizes satisfy some standard decaying conditions. Our analysis extends to momentum variants such as the stochastic heavy ball (DP-SHB) and Nesterov’s accelerated gradient (DP-NAG), where we show that careful energy constructions yield similar guarantees. These results provide stronger theoretical foundations for differentially private optimization and suggest that, despite privacy-induced distortions, the algorithm remains pathwise stable in both convex and nonconvex regimes.

[377] gfnx: Fast and Scalable Library for Generative Flow Networks in JAX

Daniil Tiapkin, Artem Agarkov, Nikita Morozov, Ian Maksimov, Askar Tsyganov, Timofei Gritsaev, Sergey Samsonov

Main category: cs.LG

TL;DR: gfnx is a fast JAX-based package for training and evaluating Generative Flow Networks (GFlowNets) that provides extensive environments, metrics, and achieves significant speedups over PyTorch implementations.

Details

Motivation: To accelerate GFlowNet research and applications by providing a fast, scalable implementation with standardized benchmarks and diverse environments.

Method: Implemented in JAX with single-file core objectives, supporting various environments including hypergrids, sequence generation, molecular generation, phylogenetic trees, Bayesian structure learning, and Ising model sampling.

Result: Achieves up to 55x speedup on CPU-based sequence generation and 80x speedup on GPU-based Bayesian network structure learning compared to PyTorch implementations.

Conclusion: gfnx provides a comprehensive, high-performance package that standardizes GFlowNet evaluation and significantly accelerates research and applications.

Abstract: In this paper, we present gfnx, a fast and scalable package for training and evaluating Generative Flow Networks (GFlowNets) written in JAX. gfnx provides an extensive set of environments and metrics for benchmarking, accompanied with single-file implementations of core objectives for training GFlowNets. We include synthetic hypergrids, multiple sequence generation environments with various editing regimes and particular reward designs for molecular generation, phylogenetic tree construction, Bayesian structure learning, and sampling from the Ising model energy. Across different tasks, gfnx achieves significant wall-clock speedups compared to Pytorch-based benchmarks (such as torchgfn library) and author implementations. For example, gfnx achieves up to 55 times speedup on CPU-based sequence generation environments, and up to 80 times speedup with the GPU-based Bayesian network structure learning setup. Our package provides a diverse set of benchmarks and aims to standardize empirical evaluation and accelerate research and applications of GFlowNets. The library is available on GitHub (https://github.com/d-tiapkin/gfnx) and on pypi (https://pypi.org/project/gfnx/). Documentation is available on https://gfnx.readthedocs.io.

[378] Toward Artificial Palpation: Representation Learning of Touch on Soft Bodies

Zohar Rimon, Elisei Shafer, Tal Tepper, Efrat Shimron, Aviv Tamar

Main category: cs.LG

TL;DR: Self-supervised learning approach for artificial palpation that learns tactile representations from robot palpation sequences, enabling tactile imaging and change detection.

Details

Motivation: Current artificial palpation methods only create simple force maps, lacking the ability to capture intricate tactile patterns that human palpation provides.

Method: Encoder-decoder framework trained on robot palpation sequences with tactile sensors, validated using both simulation and real-world MRI data of soft objects.

Result: Learned representation captures complex tactile patterns beyond simple force maps, successfully applied to tactile imaging and change detection tasks.

Conclusion: Self-supervised learning enables artificial palpation systems to learn rich tactile representations that can support medical examination tasks like imaging and change detection.

Abstract: Palpation, the use of touch in medical examination, is almost exclusively performed by humans. We investigate a proof of concept for an artificial palpation method based on self-supervised learning. Our key idea is that an encoder-decoder framework can learn a $\textit{representation}$ from a sequence of tactile measurements that contains all the relevant information about the palpated object. We conjecture that such a representation can be used for downstream tasks such as tactile imaging and change detection. With enough training data, it should capture intricate patterns in the tactile measurements that go beyond a simple map of forces – the current state of the art. To validate our approach, we both develop a simulation environment and collect a real-world dataset of soft objects and corresponding ground truth images obtained by magnetic resonance imaging (MRI). We collect palpation sequences using a robot equipped with a tactile sensor, and train a model that predicts sensory readings at different positions on the object. We investigate the representation learned in this process, and demonstrate its use in imaging and change detection.

[379] Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games

Eilam Shapira, Omer Madmon, Roi Reichart, Moshe Tennenholtz

Main category: cs.LG

TL;DR: LLMs can generate effective training data for predicting human choices in economic persuasion games, sometimes outperforming models trained on actual human data.

Details

Motivation: Human choice prediction is important for economics applications but limited by data scarcity. The paper explores whether LLMs can generate synthetic training data to overcome this limitation.

Method: Used LLMs to generate training data for language-based persuasion games, then trained models on both LLM-generated and human data to predict human behavior. Also studied LLMs as both data generators and predictors.

Result: Models trained on LLM-generated data effectively predicted human behavior and sometimes outperformed models trained on actual human data. Interaction history was more important than linguistic sentiment for prediction accuracy.

Conclusion: LLM-generated data shows strong potential for modeling human decision-making in complex economic settings, especially when LLMs capture history-dependent patterns similar to humans.

Abstract: Human choice prediction in economic contexts is crucial for applications in marketing, finance, public policy, and more. This task, however, is often constrained by the difficulties in acquiring human choice data. With most experimental economics studies focusing on simple choice settings, the AI community has explored whether LLMs can substitute for humans in these predictions and examined more complex experimental economics settings. However, a key question remains: can LLMs generate training data for human choice prediction? We explore this in language-based persuasion games, a complex economic setting involving natural language in strategic interactions. Our experiments show that models trained on LLM-generated data can effectively predict human behavior in these games and even outperform models trained on actual human data. Beyond data generation, we investigate the dual role of LLMs as both data generators and predictors, introducing a comprehensive empirical study on the effectiveness of utilizing LLMs for data generation, human choice prediction, or both. We then utilize our choice prediction framework to analyze how strategic factors shape decision-making, showing that interaction history (rather than linguistic sentiment alone) plays a key role in predicting human decision-making in repeated interactions. Particularly, when LLMs capture history-dependent decision patterns similarly to humans, their predictive success improves substantially. Finally, we demonstrate the robustness of our findings across alternative persuasion-game settings, highlighting the broader potential of using LLM-generated data to model human decision-making.

[380] Stabilizing Policy Gradient Methods via Reward Profiling

Shihab Ahmed, El Houcine Bergou, Aritra Dutta, Yue Wang

Main category: cs.LG

TL;DR: A universal reward profiling framework for policy gradient methods that selectively updates policies based on high-confidence performance estimations, improving convergence speed and reducing variance.

Details

Motivation: Policy gradient methods suffer from unreliable reward improvements and slow convergence due to high variance in gradient estimations, limiting their performance in reinforcement learning.

Method: Proposed a reward profiling framework that can be integrated with any policy gradient algorithm, selectively updating policies based on high-confidence performance estimations rather than always updating.

Result: Empirical results on 8 continuous-control benchmarks show up to 1.5x faster convergence to near-optimal returns and up to 1.75x reduction in return variance compared to baseline methods.

Conclusion: The profiling approach provides a general, theoretically grounded path to more reliable and efficient policy learning in complex environments without slowing down convergence.

Abstract: Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns, up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.

[381] Property-guided Inverse Design of Metal-Organic Frameworks Using Quantum Natural Language Processing

Shinyoung Kang, Jihan Kim

Main category: cs.LG

TL;DR: This study explores quantum natural language processing (QNLP) for inverse design of metal-organic frameworks (MOFs), achieving high classification accuracies for pore volume and CO2 Henry’s constant properties using various QNLP models.

Details

Motivation: To leverage quantum computing for materials design by applying QNLP to inverse design MOFs with targeted properties, offering a new approach to explore the complex MOF landscape.

Method: Analyzed 450 hypothetical MOF structures with 3 topologies, 10 metal nodes and 15 organic ligands; compared bag-of-words, DisCoCat, and sequence-based QNLP models using IBM Qiskit simulator; developed binary and multi-class classification models.

Result: Bag-of-words model achieved best performance: 88.6% validation accuracy for pore volume and 78.0% for CO2 Henry’s constant binary classification; multi-class models achieved 92% and 80% average test accuracies; inverse design achieved 93.5% and 87% accuracies for generating MOFs with target properties.

Conclusion: Although covering only a fraction of the MOF search space, this work demonstrates promising potential for using quantum computing in materials design, marking an important first step in applying QNLP to MOF exploration.

Abstract: In this study, we explore the potential of using quantum natural language processing (QNLP) to inverse design metal-organic frameworks (MOFs) with targeted properties. Specifically, by analyzing 450 hypothetical MOF structures consisting of 3 topologies, 10 metal nodes and 15 organic ligands, we categorize these structures into four distinct classes for pore volume and $CO_{2}$ Henry’s constant values. We then compare various QNLP models (i.e. the bag-of-words, DisCoCat (Distributional Compositional Categorical), and sequence-based models) to identify the most effective approach to process the MOF dataset. Using a classical simulator provided by the IBM Qiskit, the bag-of-words model is identified to be the optimum model, achieving validation accuracies of 88.6% and 78.0% for binary classification tasks on pore volume and $CO_{2}$ Henry’s constant, respectively. Further, we developed multi-class classification models tailored to the probabilistic nature of quantum circuits, with average test accuracies of 92% and 80% across different classes for pore volume and $CO_{2}$ Henry’s constant datasets. Finally, the performance of generating MOF with target properties showed accuracies of 93.5% for pore volume and 87% for $CO_{2}$ Henry’s constant, respectively. Although our investigation covers only a fraction of the vast MOF search space, it marks a promising first step towards using quantum computing for materials design, offering a new perspective through which to explore the complex landscape of MOFs.

[382] Evolution Strategies at the Hyperscale

Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, Uljad Berdica, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster

Main category: cs.LG

TL;DR: EGGROLL is a scalable evolution strategies algorithm that uses low-rank matrix perturbations to enable efficient optimization of large neural networks with billions of parameters, reducing computational and memory costs while maintaining performance.

Details

Motivation: Evolution strategies (ES) are powerful for non-differentiable optimization but become prohibitively expensive at scale due to computational and memory costs of full-rank matrix perturbations and batched forward passes in large neural networks.

Method: EGGROLL generates low-rank matrix perturbations using random matrices A and B with r « min(m,n), forming A·B⊤ instead of full-rank perturbations, reducing storage from mn to r(m+n) and computational cost from O(mn) to O(r(m+n)).

Result: EGGROLL achieves comparable performance to full-rank ES in RL settings, is competitive with GRPO for LLM reasoning improvement, and enables stable pre-training of nonlinear recurrent language models using only integer datatypes.

Conclusion: EGGROLL provides an efficient and scalable evolution strategies approach that overcomes computational bottlenecks while maintaining optimization performance for large-scale neural network training.

Abstract: We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Na{ï}ve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that are used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.

[383] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han

Main category: cs.LG

TL;DR: TLT accelerates reasoning RL training for LLMs using adaptive speculative decoding, achieving 1.7x speedup without accuracy loss by addressing long-tail response generation bottlenecks.

Details

Motivation: Training reasoning LLMs with RL suffers from efficiency bottlenecks due to long-tail response distributions where few very long responses dominate execution time, wasting resources and increasing costs.

Method: TLT integrates adaptive speculative decoding with two components: Adaptive Drafter (lightweight draft model trained continuously on idle GPUs) and Adaptive Rollout Engine (memory-efficient CUDAGraph pool with adaptive SD strategy selection).

Result: TLT achieves over 1.7x end-to-end RL training speedup compared to state-of-the-art systems while preserving model accuracy, and produces a high-quality draft model as a free byproduct.

Conclusion: TLT successfully overcomes RL training bottlenecks through adaptive speculative decoding, providing significant speed improvements without compromising accuracy, making reasoning model training more efficient and cost-effective.

Abstract: The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively select suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.

[384] KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan

Main category: cs.LG

TL;DR: KVTuner is a framework that optimizes KV cache quantization for LLMs by adaptively searching layer-wise precision pairs, achieving near-lossless compression with significant throughput improvements.

Details

Motivation: Current KV cache quantization methods overlook layer-wise sensitivity, have high overhead for fine-grained decisions, and lack flexibility across different LLMs and constraints.

Method: Theoretical analysis of transformer attention patterns’ correlation to KV quantization errors, plus adaptive search for optimal hardware-friendly layer-wise KV precision pairs using multi-objective optimization with intra-layer pruning and inter-layer clustering.

Result: Achieved nearly lossless 3.25-bit mixed precision for Llama-3.1-8B-Instruct and 4.0-bit for Qwen2.5-7B-Instruct, with 21.25% maximum throughput improvement over KIVI-KV8 quantization.

Conclusion: KVTuner provides an effective solution for KV cache quantization that balances performance preservation with computational efficiency across various LLMs and context lengths.

Abstract: KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.

[385] Interpreting the Effects of Quantization on LLMs

Manpreet Singh, Hassan Sajjad

Main category: cs.LG

TL;DR: Quantization’s impact on LLM internal representations is studied using interpretability techniques, revealing minor effects on model calibration, consistent dead neuron counts, and varying neuron redundancy across models.

Details

Motivation: To investigate how quantization affects the internal representations and reliability of LLMs, as this impact remains understudied despite quantization's practical use in resource-constrained environments.

Method: Employed interpretability techniques to analyze multiple LLMs under 4-bit and 8-bit quantization, examining model calibration, neuron activations, dead neurons, and neuron contribution to predictions.

Result: Quantization has minor impact on model calibration; dead neuron counts remain consistent; smaller full-precision models have fewer salient neurons while larger models have more (except Llama-2-7B); neuron redundancy effects vary by model.

Conclusion: Quantization effects vary by model and tasks, but no drastic changes were observed that would discourage using quantization as a reliable model compression technique.

Abstract: Quantization offers a practical solution to deploy LLMs in resource-constraint environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that effect of quantization may vary by model and tasks, however, we did not observe any drastic change which may discourage the use of quantization as a reliable model compression technique.

[386] TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification

Pasan Dissanayake, Sanghamitra Dutta

Main category: cs.LG

TL;DR: TabDistill distills knowledge from complex transformer models into simpler neural networks for tabular data classification, achieving parameter efficiency while maintaining strong few-shot performance.

Details

Motivation: Transformer models perform well on tabular data with limited training data but have high complexity and parameter counts. There's a need to maintain performance while reducing computational requirements.

Method: Proposed TabDistill framework that uses knowledge distillation to transfer pre-trained transformer knowledge into simpler neural networks for tabular data classification.

Result: Distilled neural networks outperform classical baselines (regular neural networks, XGBoost, logistic regression) with equal training data, and sometimes even surpass the original transformer models they were distilled from.

Conclusion: TabDistill successfully bridges the gap between performance and efficiency, providing parameter-efficient models that maintain strong few-shot learning capabilities for tabular data.

Abstract: Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.

[387] Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering

Jie Xu, Yazhou Ren, Huayi Tang, Zhimeng Yang, Lili Pan, Yang Yang, Xiaorong Pu, Philip S. Yu, Lifang He

Main category: cs.LG

TL;DR: SDMVC proposes self-supervised discriminative feature learning for deep multi-view clustering to address the negative impact of views with unclear clustering structures.

Details

Motivation: Existing multi-view clustering methods often fail to handle views with unclear clustering structures, which negatively affects overall clustering performance.

Method: Uses deep autoencoders for feature learning per view, concatenates all views’ features to form global features, and employs self-supervised learning with pseudo-labels to create unified target distribution for discriminative feature learning.

Result: Outperforms 14 competitors including classic and state-of-the-art methods on various multi-view datasets.

Conclusion: SDMVC effectively handles views with unclear clustering structures, learns consistent cluster assignments while preserving feature diversity, and achieves superior multi-view clustering performance.

Abstract: Multi-view clustering is an important research topic due to its capability to utilize complementary information from multiple views. However, there are few methods to consider the negative impact caused by certain views with unclear clustering structures, resulting in poor multi-view clustering performance. To address this drawback, we propose self-supervised discriminative feature learning for deep multi-view clustering (SDMVC). Concretely, deep autoencoders are applied to learn embedded features for each view independently. To leverage the multi-view complementary information, we concatenate all views’ embedded features to form the global features, which can overcome the negative impact of some views’ unclear clustering structures. In a self-supervised manner, pseudo-labels are obtained to build a unified target distribution to perform multi-view discriminative feature learning. During this process, global discriminative information can be mined to supervise all views to learn more discriminative features, which in turn are used to update the target distribution. Besides, this unified target distribution can make SDMVC learn consistent cluster assignments, which accomplishes the clustering consistency of multiple views while preserving their features’ diversity. Experiments on various types of multi-view datasets show that SDMVC outperforms 14 competitors including classic and state-of-the-art methods. The code is available at https://github.com/SubmissionsIn/SDMVC.

[388] A low-rank non-convex norm method for multiview graph clustering

Alaeddine Zahir, Khalide Jbilou, Ahmed Ratnani

Main category: cs.LG

TL;DR: A novel multi-view clustering method called CGMVC-NC that uses low-rank non-convex tensor norms to integrate information from multiple data sources, achieving superior clustering accuracy while remaining computationally efficient.

Details

Motivation: Multi-view clustering is challenging due to the need to integrate information from multiple data sources or views for accurate clustering, requiring methods that can effectively capture correlations between different views.

Method: Uses structural characteristics of multi-view data tensors with a non-convex tensor norm to identify correlations between views, and employs efficient optimization algorithms despite the non-convex nature.

Result: Demonstrates superior clustering accuracy across several benchmark datasets compared to conventional methods, while remaining computationally efficient.

Conclusion: Provides a valuable tool for multi-view data analysis with potential applications in understanding complex systems across various fields, with opportunities for extension to other data types and machine learning tasks.

Abstract: This study introduces a novel technique for multi-view clustering known as the “Consensus Graph-Based Multi-View Clustering Method Using Low-Rank Non-Convex Norm” (CGMVC-NC). Multi-view clustering is a challenging task in machine learning as it requires the integration of information from multiple data sources or views to cluster data points accurately. The suggested approach makes use of the structural characteristics of multi-view data tensors, introducing a non-convex tensor norm to identify correlations between these views. In contrast to conventional methods, this approach demonstrates superior clustering accuracy across several benchmark datasets. Despite the non-convex nature of the tensor norm used, the proposed method remains amenable to efficient optimization using existing algorithms. The approach provides a valuable tool for multi-view data analysis and has the potential to enhance our understanding of complex systems in various fields. Further research can explore the application of this method to other types of data and extend it to other machine-learning tasks.

[389] Sparse-PGD: A Unified Framework for Sparse Adversarial Perturbations Generation

Xuyang Zhong, Chen Liu

Main category: cs.LG

TL;DR: Proposes Sparse-PGD framework for generating sparse adversarial perturbations and adversarial training to build robust models against such attacks.

Details

Motivation: To study and evaluate model robustness against both unstructured and structured sparse adversarial perturbations, which are more practical and stealthy than dense perturbations.

Method: Developed Sparse-PGD, a white-box PGD-like attack method for efficient sparse perturbation generation, combined with black-box attacks for comprehensive evaluation, and used it for adversarial training.

Result: Sparse-PGD shows strong attack performance across scenarios, and adversarially trained models achieve state-of-the-art robustness against various sparse attacks.

Conclusion: The proposed framework effectively generates sparse perturbations and builds robust models, demonstrating superior defense capabilities compared to existing methods.

Abstract: This work studies sparse adversarial perturbations, including both unstructured and structured ones. We propose a framework based on a white-box PGD-like attack method named Sparse-PGD to effectively and efficiently generate such perturbations. Furthermore, we combine Sparse-PGD with a black-box attack to comprehensively and more reliably evaluate the models’ robustness against unstructured and structured sparse adversarial perturbations. Moreover, the efficiency of Sparse-PGD enables us to conduct adversarial training to build robust models against various sparse perturbations. Extensive experiments demonstrate that our proposed attack algorithm exhibits strong performance in different scenarios. More importantly, compared with other robust models, our adversarially trained model demonstrates state-of-the-art robustness against various sparse attacks. Codes are available at https://github.com/CityU-MLO/sPGD.

[390] Structural Disentanglement of Causal and Correlated Concepts

Qilong Zhao, Shiyu Wang, Zeeshan Memon, Yang Qiao, Guangji Bai, Bo Pan, Zhaohui Qin, Liang Zhao

Main category: cs.LG

TL;DR: C2VAE is a unified framework that jointly models causal and correlational relationships in latent factors for controllable data generation, improving generation quality and intervention fidelity.

Details

Motivation: Existing methods only model part of the causal-correlational structure in real-world generative factors, limiting reliable control over data generation.

Method: Proposes Causal-Correlation Variational Autoencoder (C2VAE) that organizes latent space into a structured graph with root causes, enabling efficient control by optimizing only relevant root factors.

Result: Experiments on synthetic and real-world datasets show C2VAE improves generation quality, disentanglement, and intervention fidelity compared to existing baselines.

Conclusion: C2VAE provides a unified approach for modeling both causal and correlational dependencies, enabling more faithful and efficient controllable data generation.

Abstract: Controllable data generation aims to synthesize data by specifying values for target concepts. Achieving this reliably requires modeling the underlying generative factors and their relationships. In real-world scenarios, these factors exhibit both causal and correlational dependencies, yet most existing methods model only part of this structure. We propose the Causal-Correlation Variational Autoencoder (C2VAE), a unified framework that jointly captures causal and correlational relationships among latent factors. C2VAE organizes the latent space into a structured graph, identifying a set of root causes that govern the generative processes. By optimizing only the root factors relevant to target concepts, the model enables efficient and faithful control. Experiments on synthetic and real-world datasets demonstrate that C2VAE improves generation quality, disentanglement, and intervention fidelity over existing baselines.

[391] Provably Robust Pre-Trained Ensembles for Biomarker-Based Cancer Classification

Chongmin Lee, Jihie Kim

Main category: cs.LG

TL;DR: Meta-trained Hyperfast models achieve state-of-the-art cancer classification with high AUC (0.9929) and robustness to class imbalance, using fewer features than previous methods.

Details

Motivation: Early cancer detection is challenging, especially for cancers like pancreatic cancer. Liquid biopsies offer non-invasive monitoring but current ML methods struggle with hyperparameter tuning and class imbalance in high-dimensional tabular data.

Method: Used meta-trained Hyperfast models for binary classification and proposed a novel ensemble combining Hyperfast, XGBoost, and LightGBM for multi-class classification with only 500 PCA features.

Result: Achieved highest AUC of 0.9929 in binary classification and accuracy of 0.9464 in multi-class classification with just 500 features, demonstrating robustness under class imbalance via balanced accuracy and minority-class recall.

Conclusion: Pre-trained tabular models and simple ensembling can deliver state-of-the-art accuracy and improved minority-class performance with far fewer features and no additional tuning, making them suitable for robust cancer detection.

Abstract: Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, motivating robust biomarker-based screening. Liquid biopsies enable non-invasive monitoring of circulating biomarkers, but typical machine learning pipelines for high-dimensional tabular data (e.g., random forests, SVMs) rely on expensive hyperparameter tuning and can be brittle under class imbalance. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness especially on highly imbalanced datasets compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while merely using 500 PCA features; distinguishable from previous studies where they used more than 2,000 features for similar results. Crucially, we demonstrate robustness under class imbalance: empirically via balanced accuracy and minority-class recall across cancer-vs.-noncancer and cancer-vs.-rest settings, and theoretically by showing (i) a prototype-form final layer for Hyperfast that yields prior-insensitive decisions under bounded bias, and (ii) minority-error reductions for majority vote under mild error diversity. Together, these results indicate that pre-trained tabular models and simple ensembling can deliver state-of-the-art accuracy and improved minority-class performance with far fewer features and no additional tuning.

[392] Distance-Preserving Representations for Genomic Spatial Reconstruction

Wenbin Zhou, Jin-Hong Du

Main category: cs.LG

TL;DR: dp-VAE is a representation learning framework that reconstructs spatial coordinates from single-cell gene expression data using distance-preserving regularization, enabling spatial context recovery for datasets lacking spatial information.

Details

Motivation: Spatial context is crucial for single-cell gene expression analysis but often inaccessible due to technical limitations, limiting the utility of many datasets that lack spatial coordinates.

Method: A distance-preserving VAE framework with a regularizer in the loss function to capture spatial context from reference datasets, using latent representations to reconstruct spatial coordinates through constrained optimization.

Result: The method demonstrates effectiveness in training robustness, out-of-sample evaluation, and transfer learning across 27 publicly available datasets, showing broad applicability for genomics studies.

Conclusion: dp-VAE provides a practical solution for recovering spatial context in single-cell genomics, overcoming previous limitations and enabling spatial analysis for datasets that previously lacked this crucial information.

Abstract: The spatial context of single-cell gene expression data is crucial for many downstream analyses, yet often remains inaccessible due to practical and technical limitations, restricting the utility of such datasets. In this paper, we propose a generic representation learning and transfer learning framework dp-VAE, capable of reconstructing the spatial coordinates associated with the provided gene expression data. Central to our approach is a distance-preserving regularizer integrated into the loss function during training, ensuring the model effectively captures and utilizes spatial context signals from reference datasets. During the inference stage, the produced latent representation of the model can be used to reconstruct or impute the spatial context of the provided gene expression by solving a constrained optimization problem. We also explore the theoretical connections between distance-preserving loss, distortion, and the bi-Lipschitz condition within generative models. Finally, we demonstrate the effectiveness of dp-VAE in different tasks involving training robustness, out-of-sample evaluation, and transfer learning inference applications by testing it over 27 publicly available datasets. This underscores its applicability to a wide range of genomics studies that were previously hindered by the absence of spatial data.

[393] Asymptotic and Finite Sample Analysis of Nonexpansive Stochastic Approximations with Markovian Noise

Ethan Blaser, Shangtong Zhang

Main category: cs.LG

TL;DR: This paper analyzes stochastic approximation algorithms with nonexpansive operators under Markovian noise, providing both asymptotic and finite sample analysis, with applications to average reward temporal difference learning.

Details

Motivation: Previous stochastic approximation analysis focused on contractive operators, which doesn't apply to important reinforcement learning settings like average reward problems where operators are only nonexpansive.

Method: The authors study nonexpansive stochastic approximations with Markovian noise, using novel bounds of noise terms from the Poisson equation for analysis.

Result: The paper provides both asymptotic and finite sample analysis for nonexpansive stochastic approximations, and proves for the first time that classical tabular average reward temporal difference learning converges to a sample-path dependent fixed point.

Conclusion: The work extends stochastic approximation theory to nonexpansive operators with Markovian noise, enabling analysis of important reinforcement learning algorithms like average reward TD learning that were previously not covered by contractive operator theory.

Abstract: Stochastic approximation is a powerful class of algorithms with celebrated success. However, a large body of previous analysis focuses on stochastic approximations driven by contractive operators, which is not applicable in some important reinforcement learning settings like the average reward setting. This work instead investigates stochastic approximations with merely nonexpansive operators. In particular, we study nonexpansive stochastic approximations with Markovian noise, providing both asymptotic and finite sample analysis. Key to our analysis are novel bounds of noise terms resulting from the Poisson equation. As an application, we prove for the first time that classical tabular average reward temporal difference learning converges to a sample-path dependent fixed point.

[394] Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Ann Huang, Satpreet H. Singh, Flavio Martinelli, Kanaka Rajan

Main category: cs.LG

TL;DR: A framework to quantify and control solution degeneracy in task-trained RNNs across behavior, neural dynamics, and weight space levels, tested on 3,400 networks across four neuroscience tasks.

Details

Motivation: To systematically understand and control the phenomenon of solution degeneracy where different RNNs trained on the same task achieve similar performance but exhibit different internal solutions.

Method: Developed a unified framework to quantify degeneracy across three levels, applied to 3,400 RNNs trained on four neuroscience tasks while varying task complexity, learning regime, network size, and regularization.

Result: Higher task complexity and stronger feature learning reduce neural dynamics degeneracy but increase weight space degeneracy. Larger networks and structural regularization reduce degeneracy at all levels, validating the Contravariance Principle.

Conclusion: Provides a principled framework for controlling RNN solution variability, offering tools for building more interpretable and biologically grounded neural computation models.

Abstract: Task-trained recurrent neural networks (RNNs) are widely used in neuroscience and machine learning to model dynamical computations. To gain mechanistic insight into how neural systems solve tasks, prior work often reverse-engineers individual trained networks. However, different RNNs trained on the same task and achieving similar performance can exhibit strikingly different internal solutions, a phenomenon known as solution degeneracy. Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks: flip-flop memory, sine wave generation, delayed discrimination, and path integration, while systematically varying task complexity, learning regime, network size, and regularization. We find that higher task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems. This work provides a principled framework for quantifying and controlling solution degeneracy in task-trained RNNs, offering new tools for building more interpretable and biologically grounded models of neural computation.

[395] TopoTune : A Framework for Generalized Combinatorial Complex Neural Networks

Mathilde Papillon, Guillermo Bernárdez, Claudio Battiloro, Nina Miolane

Main category: cs.LG

TL;DR: GCCNs generalize CCNNs to transform any graph neural network into a topological deep learning model, outperforming CCNNs with less complexity, with TopoTune software for easy implementation.

Details

Motivation: Real-world systems have multi-way interactions that GNNs fail to capture, and TDL lacks standardized frameworks, restricting accessibility.

Method: Introduce Generalized CCNNs (GCCNs) that systematically transform any graph neural network into its TDL counterpart, and TopoTune software for building GCCNs.

Result: GCCNs consistently match or outperform CCNNs, often with less model complexity, and generalize/subsumes CCNNs.

Conclusion: GCCNs accelerate and democratize TDL by providing a flexible, easy-to-use framework for topological deep learning.

Abstract: Graph Neural Networks (GNNs) effectively learn from relational data by leveraging graph symmetries. However, many real-world systems – such as biological or social networks – feature multi-way interactions that GNNs fail to capture. Topological Deep Learning (TDL) addresses this by modeling and leveraging higher-order structures, with Combinatorial Complex Neural Networks (CCNNs) offering a general and expressive approach that has been shown to outperform GNNs. However, TDL lacks the principled and standardized frameworks that underpin GNN development, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software for defining, building, and training GCCNs with unprecedented flexibility and ease.

[396] xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories

Maurice Kraus, Felix Divo, Devendra Singh Dhami, Kristian Kersting

Main category: cs.LG

TL;DR: xLSTM-Mixer combines recurrent models with mixing architectures for time series forecasting, achieving superior long-term performance with minimal memory requirements.

Details

Motivation: Time series data is prevalent across fields and requires robust forecasting models that can capture patterns within and between temporal and multivariate components.

Method: Uses linear forecast shared across variates refined by xLSTM blocks, then reconciles two distinct views to produce final forecast by combining recurrent models with mixing architectures.

Result: Demonstrates superior long-term forecasting performance compared to recent state-of-the-art methods while requiring very little memory.

Conclusion: Contributes to the resurgence of recurrent models in forecasting by combining them for the first time with mixing architectures, confirming robustness and effectiveness.

Abstract: Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions. We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting. Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks. They serve as key elements for modeling the complex dynamics of challenging time series data. xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast. Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods while requiring very little memory. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. This work contributes to the resurgence of recurrent models in forecasting by combining them, for the first time, with mixing architectures.

[397] A Closer Look at Adversarial Suffix Learning for Jailbreaking LLMs: Augmented Adversarial Trigger Learning

Zhe Wang, Yanjun Qi

Main category: cs.LG

TL;DR: ATLA is an adversarial trigger learning method that improves jailbreak attacks on LLMs using weighted loss to focus on response format tokens, achieving near 100% success with 80% fewer queries.

Details

Motivation: Previous gradient optimization-based adversarial attack methods use negative log-likelihood loss, which doesn't sufficiently optimize towards response format tokens needed for effective jailbreaking.

Method: ATLA uses a weighted loss formulation that emphasizes response format tokens, enabling learning from just one query-response pair. It also includes an auxiliary loss to suppress evasive responses.

Result: ATLA achieves nearly 100% attack success rate while requiring 80% fewer queries. Learned jailbreak suffixes generalize well to unseen queries and transfer to new LLMs.

Conclusion: ATLA outperforms state-of-the-art techniques in adversarial trigger learning for jailbreaking LLMs and system prompt extraction, demonstrating high efficiency and generalization.

Abstract: Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs. We released our code https://github.com/QData/ALTA_Augmented_Adversarial_Trigger_Learning

[398] Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them

Guanyu Chen, Peiyang Wang, Yizhou Jiang, Yuqian Liu, Chujie Zhao, Ying Fang, Tianren Zhang, Feng Chen

Main category: cs.LG

TL;DR: The paper proposes Misleading Fine-Tuning (MisFT) to test if LLMs perform abstract reasoning by training them on contradictory rules and assessing generalization.

Details

Motivation: To determine whether LLMs engage in genuine task abstraction and rule-based reasoning beyond memorization.

Method: Misleading Fine-Tuning (MisFT) using datasets with math expressions and logical formulas that contradict correct principles, then evaluating generalization on unseen domains.

Result: LLMs can apply contradictory rules to solve math word problems and natural language reasoning tasks, indicating internal abstraction mechanisms.

Conclusion: Current LLMs possess internal mechanisms that abstract before reasoning, supporting their capability for genuine rule-based reasoning.

Abstract: Large language models (LLMs) have been able to perform various forms of reasoning tasks in a wide range of scenarios, but are they truly engaging in task abstraction and rule-based reasoning beyond mere memorization? To answer this question, we propose a novel experimental approach, Misleading Fine-Tuning (MisFT), to examine whether LLMs perform abstract reasoning by altering their original understanding of fundamental rules. In particular, by constructing datasets with math expressions or logical formulas that contradict correct principles, we fine-tune the model to learn those contradictory rules and assess its generalization ability on unseen test domains. Through a series of experiments, we find that current LLMs are capable of applying contradictory rules to solve practical math word problems and natural language reasoning tasks, implying the presence of an internal mechanism in LLMs that abstracts before reasoning.

[399] Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment

Sy-Tuyen Ho, Koh Jun Hao, Ngoc-Bao Nguyen, Alexander Binder, Ngai-Man Cheung

Main category: cs.LG

TL;DR: The paper identifies critical flaws in the standard Model Inversion evaluation framework, revealing high false positive rates due to Type-I adversarial examples, and proposes a new MLLM-based framework for more reliable assessment.

Details

Motivation: The standard evaluation framework for Model Inversion attacks relies on an evaluation model E trained under the same task as target model T, but this creates false positives where reconstructions don't actually capture private data features yet are deemed successful.

Method: Introduces a new evaluation framework that replaces the standard evaluation model E with Multimodal Large Language Models (MLLMs), leveraging their general-purpose visual understanding to reduce Type-I transferability and provide more faithful reconstruction assessments.

Result: Reevaluation of 27 diverse MI attack setups reveals consistently high false positive rates under standard framework, showing many SOTA MI methods report inflated attack accuracy and actual privacy leakage is significantly lower than believed.

Conclusion: The work uncovers critical issues in MI evaluation standards, enables reassessment of MI research progress, and sets a new standard for reliable and robust evaluation using MLLMs.

Abstract: Model Inversion (MI) attacks aim to reconstruct information from private training data by exploiting access to machine learning models T. To evaluate such attacks, the standard evaluation framework relies on an evaluation model E, trained under the same task design as T. This framework has become the de facto standard for assessing progress in MI research, used across nearly all recent MI studies without question. In this paper, we present the first in-depth study of this evaluation framework. In particular, we identify a critical issue of this standard framework: Type-I adversarial examples. These are reconstructions that do not capture the visual features of private training data, yet are still deemed successful by T and ultimately transferable to E. Such false positives undermine the reliability of the standard MI evaluation framework. To address this issue, we introduce a new MI evaluation framework that replaces the evaluation model E with advanced Multimodal Large Language Models (MLLMs). By leveraging their general-purpose visual understanding, our MLLM-based framework does not depend on training of shared task design as in T, thus reducing Type-I transferability and providing more faithful assessments of reconstruction success. Using our MLLM-based evaluation framework, we reevaluate 27 diverse MI attack setups and empirically reveal consistently high false positive rates under the standard evaluation framework. Importantly, we demonstrate that many state-of-the-art (SOTA) MI methods report inflated attack accuracy, indicating that actual privacy leakage is significantly lower than previously believed. By uncovering this critical issue and proposing a robust solution, our work enables a reassessment of progress in MI research and sets a new standard for reliable and robust evaluation. Code can be found in https://github.com/hosytuyen/MI-Eval-MLLM

[400] BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang

Main category: cs.LG

TL;DR: Proposes BanditSpec, a training-free online learning framework that adaptively selects speculative decoding hyperparameters using multi-armed bandit algorithms (UCBSpec and EXP3Spec) to accelerate LLM inference while maintaining generation quality.

Details

Motivation: Existing speculative decoding methods use fixed configurations or require expensive draft model training, lacking adaptability to different input contexts. There's a need for dynamic hyperparameter selection that responds to varying prefix tokens without additional training.

Method: Formulates hyperparameter selection as a Multi-Armed Bandit problem, proposes BanditSpec framework with two algorithms: UCBSpec (for stochastic rewards) and EXP3Spec (for adversarial rewards). Analyzes performance using novel stopping time regret metric and provides theoretical guarantees.

Result: Theoretical analysis shows UCBSpec achieves optimal regret performance up to universal constants. Empirical experiments with LLaMA3 and Qwen2 demonstrate effectiveness, with throughput approaching oracle best hyperparameter performance in diverse real-life LLM serving scenarios.

Conclusion: BanditSpec provides an effective training-free solution for adaptive speculative decoding that dynamically optimizes hyperparameters during text generation, achieving near-optimal throughput without requiring draft model training or offline preparation.

Abstract: Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.

[401] Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness

Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, Volkan Cevher

Main category: cs.LG

TL;DR: A hybrid optimization method combining steepest descent and conditional gradient approaches with gradient norm clipping, achieving optimal convergence rates and incorporating weight decay.

Details

Motivation: To develop an optimization method that combines the benefits of both steepest descent and conditional gradient approaches while generalizing gradient norm clipping for better performance in deep learning.

Method: Hybrid non-Euclidean optimization combining steepest descent and conditional gradient methods, incorporating weight decay via Frank-Wolfe short step connection, using momentum-based gradient estimator for stochastic optimization.

Result: Achieves descent property under generalized (L₀,L₁)-smoothness, demonstrates order optimal O(n⁻¹/⁴) convergence rate in stochastic case, and shows promising performance on image classification and language modeling tasks.

Conclusion: Clipped Scion provides an effective optimization framework that bridges steepest descent and conditional gradient methods, achieving optimal convergence rates and practical performance in deep learning applications.

Abstract: This work introduces a hybrid non-Euclidean optimization method which generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds by establishing a descent property under a generalized notion of ($L_0$,$L_1$)-smoothness. Weight decay is incorporated in a principled manner by identifying a connection to the Frank-Wolfe short step. In the stochastic case, we show an order optimal $O(n^{-1/4})$ convergence rate by leveraging a momentum based gradient estimator. We discuss how to instantiate the algorithms for deep learning, which we dub Clipped Scion, and demonstrate their properties on image classification and language modeling. The code is available at https://github.com/LIONS-EPFL/ClippedScion.

[402] Fast-DataShapley: Neural Modeling for Training Data Valuation

Haifeng Sun, Yu Xiong, Runze Wu, Xinyu Cai, Changjie Fan, Lan Zhang, Xiang-Yang Li

Main category: cs.LG

TL;DR: Fast-DataShapley is a one-pass training method that uses weighted least squares to create a reusable explainer model for calculating Shapley values of training data without retraining for new test samples, achieving 2x performance improvement and 100x training speedup.

Details

Motivation: Shapley value is theoretically superior for evaluating data contributions but computationally expensive, requiring retraining for each test sample, which is impractical for real applications.

Method: Proposes Fast-DataShapley using weighted least squares characterization to train a reusable explainer model, with three optimization methods for approximate utility calculation and group training data computation.

Result: Achieves more than 2x performance improvement and two orders of magnitude training speed increase on various image datasets compared to baselines.

Conclusion: Fast-DataShapley provides an efficient and practical solution for real-time Shapley value calculation in data valuation, overcoming computational bottlenecks of traditional methods.

Abstract: The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers’ legitimate rights and fairly reward them for their contributions. Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent works based on Shapley values attempt to mitigate computation complexity by approximation algorithms. However, they need to retrain for each test sample, leading to intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two aspects: the approximate calculation of the utility function and the group calculation of the training data. We analyze time complexity to show the efficiency of our methods. The experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, the performance is improved to more than 2 times, and the explainer’s training speed can be increased by two orders of magnitude.

[403] Do-PFN: In-Context Learning for Causal Effect Estimation

Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf

Main category: cs.LG

TL;DR: Do-PFN: A method using pre-trained networks on synthetic causal data to estimate causal effects from observational data without requiring knowledge of the true causal graph or interventional data.

Details

Motivation: Existing causal effect estimation methods require interventional data, ground truth causal graphs, or unconfoundedness assumptions, limiting real-world applicability. PFNs have shown strong predictive performance in tabular ML, motivating their adaptation to causal estimation.

Method: Pre-train Prior-data fitted networks (PFNs) on synthetic data from diverse causal structures including interventions, enabling prediction of interventional outcomes from observational data via in-context learning.

Result: Extensive experiments show accurate causal effect estimation without knowledge of underlying causal graph. Ablation studies demonstrate scalability and robustness across datasets with various causal characteristics.

Conclusion: Do-PFN successfully transfers PFN capabilities to causal effect estimation, providing a practical approach that works without requiring ground truth causal knowledge or interventional data.

Abstract: Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN’s scalability and robustness across datasets with a variety of causal characteristics.

[404] Efficient Solution and Learning of Robust Factored MDPs

Yannik Schnitzer, Alessandro Abate, David Parker

Main category: cs.LG

TL;DR: The paper proposes novel methods for solving and learning robust Markov decision processes (r-MDPs) using factored state-space representations to improve sample efficiency and provide tighter performance guarantees.

Details

Motivation: Robust MDPs model epistemic uncertainty about transition dynamics to synthesize robust policies with provable guarantees, but existing methods require many sample interactions. The authors aim to leverage independence between model uncertainty across system components to improve efficiency.

Method: The authors develop methods for factored r-MDPs that reformulate hard non-convex optimization problems into tractable linear programs, and propose methods to learn factored model representations directly from data.

Result: Experimental results show that exploiting factored structure yields dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees than state-of-the-art methods.

Conclusion: Factored representations in r-MDPs enable more efficient learning and synthesis of robust policies with improved performance guarantees compared to existing approaches.

Abstract: Robust Markov decision processes (r-MDPs) extend MDPs by explicitly modelling epistemic uncertainty about transition dynamics. Learning r-MDPs from interactions with an unknown environment enables the synthesis of robust policies with provable (PAC) guarantees on performance, but this can require a large number of sample interactions. We propose novel methods for solving and learning r-MDPs based on factored state-space representations that leverage the independence between model uncertainty across system components. Although policy synthesis for factored r-MDPs leads to hard, non-convex optimisation problems, we show how to reformulate these into tractable linear programs. Building on these, we also propose methods to learn factored model representations directly. Our experimental results show that exploiting factored structure can yield dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees than state-of-the-art methods.

[405] Rep-GLS: Report-Guided Generalized Label Smoothing for Robust Disease Detection

Kunyu Zhang, Fukang Ge, Binyang Wang, Yingke Chen, Kazuma Kobayashi, Lin Gu, Jinhao Bi, Yingying Zhu

Main category: cs.LG

TL;DR: A framework using LLMs to mine medical reports for uncertainty expressions, converting them into adaptive label smoothing rates to improve medical disease detection by incorporating expert skepticism.

Details

Motivation: Medical image interpretation often involves uncertainty (e.g., "probable" or "likely"), but existing datasets typically use binary labels, ignoring this nuance.

Method: Collect uncertainty keywords from medical reports, use Qwen-3 4B to identify textual uncertainty and map it to adaptive Generalized Label Smoothing rates for training.

Result: Significantly outperforms state-of-the-art methods in medical disease detection on a new clinical expert uncertainty-aware benchmark.

Conclusion: The approach effectively incorporates expert skepticism into training and will release uncertainty words database, code, and benchmark publicly.

Abstract: Unlike nature image classification where groundtruth label is explicit and of no doubt, physicians commonly interpret medical image conditioned on certainty like using phrase “probable” or “likely”. Existing medical image datasets either simply overlooked the nuance and polarise into binary label. Here, we propose a novel framework that leverages a Large Language Model (LLM) to directly mine medical reports to utilise the uncertainty relevant expression for supervision signal. At first, we collect uncertainty keywords from medical reports. Then, we use Qwen-3 4B to identify the textual uncertainty and map them into an adaptive Generalized Label Smoothing (GLS) rate. This rate allows our model to treat uncertain labels not as errors, but as informative signals, effectively incorporating expert skepticism into the training process. We establish a new clinical expert uncertainty-aware benchmark to rigorously evaluate this problem. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in medical disease detection. The curated uncertainty words database, code, and benchmark will be made publicly available upon acceptance.

[406] PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning

Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, Peter Zhiping Zhang

Main category: cs.LG

TL;DR: PepThink-R1 is a generative framework that combines LLMs with chain-of-thought fine-tuning and reinforcement learning to design therapeutic peptides with improved properties through interpretable monomer-level modifications.

Details

Motivation: Current peptide design faces challenges including vast sequence space, limited experimental data, and poor interpretability of existing generative models.

Method: Integrates LLMs with chain-of-thought supervised fine-tuning and reinforcement learning, using explicit monomer-level reasoning and a tailored reward function balancing chemical validity and property improvements.

Result: Generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing LLMs (including GPT-5) and domain-specific baselines in optimization success and interpretability.

Conclusion: First LLM-based peptide design framework combining explicit reasoning with RL-driven property control, representing progress toward reliable and transparent peptide optimization for therapeutic discovery.

Abstract: Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.

[407] Interpretability as Alignment: Making Internal Understanding a Design Principle

Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu

Main category: cs.LG

TL;DR: Mechanistic interpretability should be used as a technical foundation for private AI governance to verify internal model alignment through auditability, provenance, and transparency.

Details

Motivation: Current governance mechanisms focus on behavioral compliance rather than internal alignment verification. Private governance (audits, certification, insurance, procurement) needs technical substrates for generating verifiable causal evidence about model behavior.

Method: Framing interpretability as a design constraint that embeds auditability, provenance, and bounded transparency within model architectures. Integrates causal abstraction theory with empirical benchmarks like MIB and LoBOX to create interpretability-first models.

Result: Interpretability-first models can support private assurance pipelines and role-calibrated transparency frameworks, serving as infrastructure for AI governance.

Conclusion: Mechanistic interpretability bridges the gap between technical reliability and institutional accountability, providing the necessary substrate for effective private AI governance systems.

Abstract: Frontier AI systems require governance mechanisms that can verify internal alignment, not just behavioral compliance. Private governance mechanisms audits, certification, insurance, and procurement are emerging to complement public regulation, but they require technical substrates that generate verifiable causal evidence about model behavior. This paper argues that mechanistic interpretability provides this substrate. We frame interpretability not as post-hoc explanation but as a design constraint embedding auditability, provenance, and bounded transparency within model architectures. Integrating causal abstraction theory and empirical benchmarks such as MIB and LoBOX, we outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks. This reframing situates interpretability as infrastructure for private AI governance bridging the gap between technical reliability and institutional accountability.

[408] Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics

Guillermo Hijano Mendizabal, Davide Lancierini, Alex Marshall, Andrea Mauri, Patrick Haworth Owen, Mitesh Patel, Konstantinos Petridis, Shah Rukh Qasim, Nicola Serra, William Sutcliffe, Hanae Tilquin

Main category: cs.LG

TL;DR: This paper introduces a novel approach using Reinforcement Learning (RL) combined with Genetic Algorithms (GAs) to systematically identify critical background processes in beauty hadron decay measurements, addressing computational limitations and reliance on physicist intuition.

Details

Motivation: Current methods for identifying relevant background processes in beauty hadron decays are limited by computational constraints and reliance on expert intuition, lacking systematic approaches to handle the wide range of possible decay channels with similar final states.

Method: The paper proposes a hybrid approach combining Reinforcement Learning with Genetic Algorithms, using GAs to efficiently explore the large trajectory space and identify successful trajectories to guide RL training. A transformer architecture is incorporated to handle token sequences representing decays.

Result: The method successfully develops a systematic approach to determine critical backgrounds affecting beauty hadron decay measurements, overcoming computational limitations and reducing reliance on physicist intuition.

Conclusion: The proposed RL-GA hybrid strategy provides an effective systematic method for background identification in particle physics measurements, with broad applicability beyond beauty hadron physics to other types of particle physics analyses.

Abstract: Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, the process for ascertaining the most relevant background processes necessitates a detailed analysis of final state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist’s intuition and expertise, as no systematic method exists. This paper has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent’s training. Our method also incorporates a transformer architecture for the RL agent to handle token sequences representing decays.

[409] HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Zhinan Xie, Peisong Wang, Shuang Qiu, Jian Cheng

Main category: cs.LG

TL;DR: HiViS accelerates Vision-Language Models by hiding visual tokens from the drafter during speculative decoding, using semantic fusion and time-step-aware training to improve efficiency without compromising quality.

Details

Motivation: Visual tokens in large VLMs are highly redundant and most can be removed without affecting generation quality, but current speculative decoding methods struggle with computational burden and semantic inconsistency when handling visual tokens.

Method: Uses the target VLM as a semantic fusion model to let drafter access visual information without processing visual tokens directly, and employs time-step-aware aligned training with bias-correction residuals for autonomous semantic propagation during drafting.

Result: Extensive experiments show HiViS achieves significant improvements in average acceptance length and speedup ratio across representative VLMs and benchmarks.

Conclusion: HiViS effectively extends speculative decoding to VLMs by addressing visual token redundancy and computational challenges, enabling faster inference while maintaining generation quality.

Abstract: Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that utilizes the target VLM as a semantic fusion model, allowing the drafter to obtain visual information without explicitly processing visual tokens, ensuring that the drafter’s prefill sequence length matches that of the textual tokens. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.

[410] Multi-Objective $\textit{min-max}$ Online Convex Optimization

Rahul Vaze, Sumiran Mishra

Main category: cs.LG

TL;DR: Extends online convex optimization to multi-objective setting with K loss sequences, introduces min-max regret metric, and provides O(√(T log K)) regret algorithm for i.i.d. inputs.

Details

Motivation: To broaden online convex optimization beyond single loss sequences by considering multiple objectives simultaneously, capturing tradeoffs between tracking different loss sequences.

Method: Proposes a simple algorithm combining Hedge and online gradient descent (OGD) for the i.i.d. input setting where loss functions are generated from an unknown distribution.

Result: Achieves expected min-max regret of O(√(T log K)) with a remarkably simple proof.

Conclusion: Successfully extends OCO to multi-objective setting with strong theoretical guarantees for min-max regret in i.i.d. environments.

Abstract: In online convex optimization (OCO), a single loss function sequence is revealed over a time horizon of $T$, and an online algorithm has to choose its action at time $t$, before the loss function at time $t$ is revealed. The goal of the online algorithm is to incur minimal penalty (called $\textit{regret}$ compared to a static optimal action made by an optimal offline algorithm knowing all functions of the sequence in advance. In this paper, we broaden the horizon of OCO, and consider multi-objective OCO, where there are $K$ distinct loss function sequences, and an algorithm has to choose its action at time $t$, before the $K$ loss functions at time $t$ are revealed. To capture the tradeoff between tracking the $K$ different sequences, we consider the $\textit{min-max}$ regret, where the benchmark (optimal offline algorithm) takes a static action across all time slots that minimizes the maximum of the total loss (summed across time slots) incurred by each of the $K$ sequences. An online algorithm is allowed to change its action across time slots, and its {\it min-max} regret is defined as the difference between its $\textit{min-max}$ cost and that of the benchmark. The $\textit{min-max}$ regret is a stringent performance measure and an algorithm with small regret needs to `track’ all loss function sequences closely at all times. We consider this $\textit{min-max}$ regret in the i.i.d. input setting where all loss functions are i.i.d. generated from an unknown distribution. For the i.i.d. model we propose a simple algorithm that combines the well-known $\textit{Hedge}$ and online gradient descent (OGD) and show via a remarkably simple proof that its expected $\textit{min-max}$ regret is $O(\sqrt{T \log K})$.

[411] GAPO: Robust Advantage Estimation for Real-World Code LLMs

Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao

Main category: cs.LG

TL;DR: GAPO is a reinforcement learning method that improves code editing for LLMs by adaptively handling skewed reward distributions using outlier-free highest-density intervals.

Details

Motivation: Real-world code editing scenarios often have skewed reward distributions with unpredictable outliers, which distort advantage computation in group-relative RL methods like GRPO.

Method: GAPO adaptively finds an outlier-free highest-density interval per prompt and uses the median of that interval as an adaptive Q to replace group mean in advantage calculation.

Result: Validated on nine instruction-tuned LLMs (3B-14B) using 51,844 real-world code-editing tasks across 10 languages, showing consistent improvements in exact match accuracy over GRPO and DAPO.

Conclusion: GAPO robustly handles skewed reward distributions while remaining plug-and-play and efficient, making it suitable for real-world code editing applications.

Abstract: Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.

[412] Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation

Dong Chen, Yanzhe Wei, Zonglin He, Guan-Ming Kuang, Canhua Ye, Meiru An, Huili Peng, Yong Hu, Huiren Tao, Kenneth MC Cheung

Main category: cs.LG

TL;DR: A clinician-centered framework was developed to quantify hallucination risks in LLMs for spine surgery, evaluating six models across 30 spinal cases. DeepSeek-R1 performed best overall, while reasoning-enhanced models didn’t uniformly outperform standard versions, revealing limitations of extended chain-of-thought reasoning for clinical reliability.

Details

Motivation: LLMs offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations that may compromise patient safety, necessitating a systematic framework to quantify and mitigate these risks.

Method: Introduced a clinician-centered framework evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. Assessed six leading LLMs across 30 expert-validated spinal cases with multidimensional stress-testing.

Result: DeepSeek-R1 demonstrated superior overall performance (86.03 ± 2.08). Reasoning-enhanced models didn’t uniformly outperform standard counterparts - Claude-3.7-Sonnet’s extended thinking mode underperformed relative to standard version. Recommendation quality degraded by 7.4% under amplified complexity.

Conclusion: Extended chain-of-thought reasoning alone is insufficient for clinical reliability. Findings advocate integrating interpretability mechanisms into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.

Abstract: Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet’s extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.

[413] Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation

Kevin Bönisch, Leandro Losaria

Main category: cs.LG

TL;DR: Analysis of 15 years of Kaggle’s evolution from competition platform to comprehensive data science ecosystem, examining growth patterns, technological trends, and community impact through metadata analysis.

Details

Motivation: To understand Kaggle's transformation over 15 years and its role in shaping the data science community, leveraging newly available Kaggle Meta Code and Datasets to explore competitions, technologies, and real-world ML applications.

Method: Analyzed millions of kernels and discussion threads using longitudinal trend analysis and exploratory data analysis of Kaggle metadata, code repositories, and community interactions.

Result: Kaggle shows steady growth with increasingly diverse use cases; Kagglers quickly adapt to new trends and produce models with solid generalization capabilities; platform evolution documented through historical and technological analysis.

Conclusion: Kaggle has successfully evolved into a comprehensive data science ecosystem that drives innovation, community learning, and practical ML applications while maintaining strong competition foundations.

Abstract: Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle’s growth, its impact on the data science community, uncover hidden technological trends, analyze competition winners, how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience.

[414] A Hybrid Deep Learning based Carbon Price Forecasting Framework with Structural Breakpoints Detection and Signal Denoising

Runsheng Ren, Jing Li, Yanxiu Li, Shixun Huang, Jun Shen, Wanqing Li, John Le, Sheng Wang

Main category: cs.LG

TL;DR: Proposes a hybrid framework combining structural break detection, wavelet denoising, and deep learning models for carbon price forecasting, achieving significant error reduction compared to existing methods.

Details

Motivation: Carbon price forecasting is crucial for energy markets and decarbonization strategies but challenging due to structural breaks and high-frequency noise from policy interventions and market shocks. Existing approaches lack systematic integration of denoising and modeling.

Method: Comprehensive hybrid framework integrating structural break detection (Bai-Perron, ICSS, PELT algorithms), wavelet signal denoising, and three deep learning models (LSTM, GRU, TCN) using EUA spot prices and exogenous features.

Result: PELT-WT-TCN achieved highest accuracy, reducing forecasting errors by 22.35% in RMSE and 18.63% in MAE compared to state-of-the-art baseline, and by 70.55% in RMSE and 74.42% in MAE compared to original LSTM without decomposition.

Conclusion: Integrating structural awareness and multiscale decomposition into deep learning architectures significantly enhances accuracy and interpretability for carbon price forecasting and other nonstationary financial time series.

Abstract: Accurately forecasting carbon prices is essential for informed energy market decision-making, guiding sustainable energy planning, and supporting effective decarbonization strategies. However, it remains challenging due to structural breaks and high-frequency noise caused by frequent policy interventions and market shocks. Existing studies, including the most recent baseline approaches, have attempted to incorporate breakpoints but often treat denoising and modeling as separate processes and lack systematic evaluation across advanced deep learning architectures, limiting the robustness and the generalization capability. To address these gaps, this paper proposes a comprehensive hybrid framework that integrates structural break detection (Bai-Perron, ICSS, and PELT algorithms), wavelet signal denoising, and three state-of-the-art deep learning models (LSTM, GRU, and TCN). Using European Union Allowance (EUA) spot prices from 2007 to 2024 and exogenous features such as energy prices and policy indicators, the framework constructs univariate and multivariate datasets for comparative evaluation. Experimental results demonstrate that our proposed PELT-WT-TCN achieves the highest prediction accuracy, reducing forecasting errors by 22.35% in RMSE and 18.63% in MAE compared to the state-of-the-art baseline model (Breakpoints with Wavelet and LSTM), and by 70.55% in RMSE and 74.42% in MAE compared to the original LSTM without decomposition from the same baseline study. These findings underscore the value of integrating structural awareness and multiscale decomposition into deep learning architectures to enhance accuracy and interpretability in carbon price forecasting and other nonstationary financial time series.

[415] CaberNet: Causal Representation Learning for Cross-Domain HVAC Energy Prediction

Kaiyuan Zhai, Jiacheng Cui, Zhehao Zhang, Junyu Xue, Yang Deng, Kui Wu, Guoming Tang

Main category: cs.LG

TL;DR: CaberNet is a causal deep sequence model that learns invariant representations for robust cross-domain HVAC energy prediction, achieving 22.9% NMSE reduction over baselines.

Details

Motivation: Cross-domain HVAC energy prediction is challenging due to data scarcity and heterogeneity across buildings, climate zones, and seasons, causing existing methods to overfit or require expert intervention.

Method: CaberNet integrates a global feature gate with self-supervised Bernoulli regularization to identify causal features, and a domain-wise training scheme that balances domain contributions and promotes latent factor independence.

Result: CaberNet consistently outperforms all baselines on real-world datasets from three climatically diverse buildings, achieving 22.9% reduction in normalized mean squared error.

Conclusion: CaberNet provides a purely data-driven approach for robust cross-domain HVAC energy prediction without requiring prior knowledge, effectively handling data heterogeneity across different buildings and climate conditions.

Abstract: Cross-domain HVAC energy prediction is essential for scalable building energy management, particularly because collecting extensive labeled data for every new building is both costly and impractical. Yet, this task remains highly challenging due to the scarcity and heterogeneity of data across different buildings, climate zones, and seasonal patterns. In particular, buildings situated in distinct climatic regions introduce variability that often leads existing methods to overfit to spurious correlations, rely heavily on expert intervention, or compromise on data diversity. To address these limitations, we propose CaberNet, a causal and interpretable deep sequence model that learns invariant (Markov blanket) representations for robust cross-domain prediction. In a purely data-driven fashion and without requiring any prior knowledge, CaberNet integrates i) a global feature gate trained with a self-supervised Bernoulli regularization to distinguish superior causal features from inferior ones, and ii) a domain-wise training scheme that balances domain contributions, minimizes cross-domain loss variance, and promotes latent factor independence. We evaluate CaberNet on real-world datasets collected from three buildings located in three climatically diverse cities, and it consistently outperforms all baselines, achieving a 22.9% reduction in normalized mean squared error (NMSE) compared to the best benchmark. Our code is available at https://github.com/SusCom-Lab/CaberNet-CRL.

[416] Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction

Ihab Tabbara, Yuxuan Yang, Hussein Sibai

Main category: cs.LG

TL;DR: A conformal prediction framework is introduced to provide probabilistic safety guarantees for learned Hamilton-Jacobi value functions and policies in autonomous systems.

Details

Motivation: Hamilton-Jacobi reachability analysis is computationally expensive for high-dimensional systems, and learned value functions/policies lack formal safety guarantees, creating a need for uncertainty quantification.

Method: Uses conformal prediction to calibrate switching between unsafe nominal controllers and learned HJ-based safe policies, and investigates ensemble approaches for safety filtering.

Result: The framework provides probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states.

Conclusion: Conformal prediction enables practical deployment of learned HJ reachability methods by bounding uncertainty and providing probabilistic safety assurances.

Abstract: Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.

[417] STAMP: Spatial-Temporal Adapter with Multi-Head Pooling

Brad Shook, Abby Turner, Jieshi Chen, Michał Wiliński, Mononito Goswami, Jonathan Elmer, Artur Dubrawski

Main category: cs.LG

TL;DR: The paper introduces STAMP, a lightweight spatial-temporal adapter that enables general time series foundation models to perform competitively with EEG-specific foundation models on EEG classification tasks.

Details

Motivation: To compare EEG-specific foundation models versus general time series foundation models on EEG tasks and develop a method to bridge the performance gap without requiring specialized EEG foundation models.

Method: Proposed STAMP (Spatial-Temporal Adapter with Multi-Head Pooling) that leverages univariate embeddings from general TSFMs and implicitly models EEG’s spatial-temporal characteristics through a lightweight adapter architecture.

Result: STAMP achieves performance comparable to state-of-the-art EEG-specific foundation models across 8 benchmark EEG classification datasets, with significantly fewer trainable parameters and flexible input support.

Conclusion: General TSFMs with appropriate adapters can match specialized EEG foundation models, offering a lightweight and flexible alternative for EEG data modeling without domain-specific pretraining.

Abstract: Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.

[418] Mesh-based Super-resolution of Detonation Flows with Multiscale Graph Transformers

Shivam Barwey, Pinaki Pal

Main category: cs.LG

TL;DR: A novel multiscale graph transformer approach (SR-GT) is developed for mesh-based super-resolution of reacting flows, outperforming traditional interpolation methods.

Details

Motivation: Super-resolution flow reconstruction is valuable for applications like subgrid modeling, forecasting acceleration, data compression, and experimental upscaling, especially for complex reacting flows.

Method: Uses a graph-based flow-field representation compatible with complex geometries and unstructured grids, with a transformer backbone to capture long-range dependencies and important features from low-resolution input to generate high-resolution output.

Result: Demonstrated high super-resolution accuracy for 2D detonation propagation in hydrogen-air mixtures, showing superior performance compared to traditional interpolation-based schemes.

Conclusion: The SR-GT framework provides an effective data-driven approach for super-resolution reconstruction of complex reacting flows, handling multiscale behavior and complex geometries better than conventional methods.

Abstract: Super-resolution flow reconstruction using state-of-the-art data-driven techniques is valuable for a variety of applications, such as subgrid/subfilter closure modeling, accelerating spatiotemporal forecasting, data compression, and serving as an upscaling tool for sparse experimental measurements. In the present work, a first-of-its-kind multiscale graph transformer approach is developed for mesh-based super-resolution (SR-GT) of reacting flows. The novel data-driven modeling paradigm leverages a graph-based flow-field representation compatible with complex geometries and non-uniform/unstructured grids. Further, the transformer backbone captures long-range dependencies between different parts of the low-resolution flow-field, identifies important features, and then generates the super-resolved flow-field that preserves those features at a higher resolution. The performance of SR-GT is demonstrated in the context of spectral-element-discretized meshes for a challenging test problem of 2D detonation propagation within a premixed hydrogen-air mixture exhibiting highly complex multiscale reacting flow behavior. The SR-GT framework utilizes a unique element-local (+ neighborhood) graph representation for the coarse input, which is then tokenized before being processed by the transformer component to produce the fine output. It is demonstrated that SR-GT provides high super-resolution accuracy for reacting flow-field features and superior performance compared to traditional interpolation-based SR schemes.

[419] Multi-View Polymer Representations for the Open Polymer Prediction

Wonjin Jung, Yongseok Choi

Main category: cs.LG

TL;DR: Multi-view polymer property prediction system combining RDKit descriptors, graph neural networks, 3D representations, and SMILES language models via uniform ensemble, achieving 9th place in NeurIPS 2025 challenge.

Details

Motivation: To leverage complementary representations of polymers for improved property prediction by integrating multiple data modalities.

Method: Integrated four representation families: tabular RDKit/Morgan descriptors, graph neural networks, 3D-informed representations, and pretrained SMILES language models, using uniform ensemble averaging with 10-fold training and SMILES test-time augmentation.

Result: Ranked 9th out of 2241 teams in Open Polymer Prediction Challenge at NeurIPS 2025, with public MAE of 0.057 and private MAE of 0.082.

Conclusion: Multi-view approach combining diverse polymer representations through ensemble methods effectively improves property prediction performance.

Abstract: We address polymer property prediction with a multi-view design that exploits complementary representations. Our system integrates four families: (i) tabular RDKit/Morgan descriptors, (ii) graph neural networks, (iii) 3D-informed representations, and (iv) pretrained SMILES language models, and averages per-property predictions via a uniform ensemble. Models are trained with 10-fold splits and evaluated with SMILES test-time augmentation. The approach ranks 9th of 2241 teams in the Open Polymer Prediction Challenge at NeurIPS 2025. The submitted ensemble achieves a public MAE of 0.057 and a private MAE of 0.082.

[420] Node-Level Uncertainty Estimation in LLM-Generated SQL

Hilaf Hasson, Ruocheng Guo

Main category: cs.LG

TL;DR: A framework for detecting errors in LLM-generated SQL by estimating uncertainty at individual AST nodes using schema-aware features and supervised classification.

Details

Motivation: To provide fine-grained error detection in LLM-generated SQL queries beyond aggregate sequence-level confidence measures, enabling targeted repair and human-in-the-loop review.

Method: Two-stage approach: 1) Semantically aware labeling algorithm for node-level correctness assessment, 2) Training supervised classifier with schema-aware features (identifier validity, alias resolution, type compatibility, scope ambiguity, typo signals) to predict per-node error probabilities.

Result: Substantially outperforms token log-probabilities with +27.44% average AUC improvement across multiple databases and datasets, maintaining robustness in cross-database evaluation.

Conclusion: Node-centric, semantically grounded uncertainty estimation is a strong and interpretable alternative to aggregate sequence-level confidence measures, supporting targeted repair, human review, and selective execution.

Abstract: We present a practical framework for detecting errors in LLM-generated SQL by estimating uncertainty at the level of individual nodes in the query’s abstract syntax tree (AST). Our approach proceeds in two stages. First, we introduce a semantically aware labeling algorithm that, given a generated SQL and a gold reference, assigns node-level correctness without over-penalizing structural containers or alias variation. Second, we represent each node with a rich set of schema-aware and lexical features - capturing identifier validity, alias resolution, type compatibility, ambiguity in scope, and typo signals - and train a supervised classifier to predict per-node error probabilities. We interpret these probabilities as calibrated uncertainty, enabling fine-grained diagnostics that pinpoint exactly where a query is likely to be wrong. Across multiple databases and datasets, our method substantially outperforms token log-probabilities: average AUC improves by +27.44% while maintaining robustness under cross-database evaluation. Beyond serving as an accuracy signal, node-level uncertainty supports targeted repair, human-in-the-loop review, and downstream selective execution. Together, these results establish node-centric, semantically grounded uncertainty estimation as a strong and interpretable alternative to aggregate sequence level confidence measures.

[421] Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling

Julia Peters, Karin Mora, Miguel D. Mahecha, Chaonan Ji, David Montero, Clemens Mosig, Guido Kraemer

Main category: cs.LG

TL;DR: A framework for integrating Earth observation modalities into unified high-resolution embeddings that combine fine spatial detail with temporal fidelity for ecological analysis.

Details

Motivation: Existing EO foundation models operate at fixed spatial/temporal scales, limiting ecological analyses that require both fine spatial detail and high temporal fidelity.

Method: Two-stage representation learning: first model sensors independently, then combine into shared model at native 10m resolution and cloud-free Sentinel-2 frequency, enabling modality-specific optimization and easy sensor extension.

Result: Embeddings show high spatial/semantic consistency across landscapes and encode ecologically meaningful patterns for Gross Primary Production modeling with sufficient temporal fidelity for fine-scale analyses.

Conclusion: The framework provides flexible, analysis-ready representation learning for environmental applications requiring diverse spatial and temporal resolutions.

Abstract: Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.

[422] Linear time small coresets for k-mean clustering of segments with applications

David Denisov, Shlomi Dolev, Dan Felmdan, Michael Segal

Main category: cs.LG

TL;DR: First coreset construction for k-means clustering on arbitrary segments with O(log²n) size, enabling efficient streaming/distributed computation with minimal accuracy loss.

Details

Motivation: Need efficient clustering methods for segment data that can handle streaming, distributed, or parallel computation while maintaining accuracy.

Method: Proposed ε-coreset construction that approximates k-means clustering on segments within (1±ε) factor, computable in O(nd) time for constant k and ε.

Result: Coreset of size O(log²n) that enables substantial speedups in experiments, including real-time video tracking, with minimal clustering accuracy loss.

Conclusion: The method provides both theoretical guarantees and practical efficiency for segment clustering, making it suitable for real-time applications.

Abstract: We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.

[423] AIF: Asynchronous Inference Framework for Cost-Effective Pre-Ranking

Zhi Kou, Xiang-Rong Sheng, Shuguang Han, Zhishan Zhao, Yueyao Cheng, Han Zhu, Jian Xu, Bo Zheng

Main category: cs.LG

TL;DR: AIF is an asynchronous inference framework that decouples user/item computations from real-time prediction to reduce latency and improve efficiency in industrial recommendation systems.

Details

Motivation: Traditional sequential execution in pre-ranking models causes redundant computations and increased latency due to strictly sequential operations between retrieval and pre-ranking stages.

Method: Decouples interaction-independent components from real-time prediction, performs user-side computations in parallel with retrieval, and item-side computations in nearline manner. Uses approximated methods for interaction-dependent components in online predictions.

Result: Achieves notable performance gains without significantly increasing computational and latency costs, enabling successful deployment in Taobao display advertising system.

Conclusion: AIF framework enhances computational efficiency, reduces latency, and allows for improved feature sets and model architectures while maintaining system performance.

Abstract: In industrial recommendation systems, pre-ranking models based on deep neural networks (DNNs) commonly adopt a sequential execution framework: feature fetching and model forward computation are triggered only after receiving candidates from the upstream retrieval stage. This design introduces inherent bottlenecks, including redundant computations of identical users/items and increased latency due to strictly sequential operations, which jointly constrain the model’s capacity and system efficiency. To address these limitations, we propose the Asynchronous Inference Framework (AIF), a cost-effective computational architecture that decouples interaction-independent components, those operating within a single user or item, from real-time prediction. AIF reorganizes the model inference process by performing user-side computations in parallel with the retrieval stage and conducting item-side computations in a nearline manner. This means that interaction-independent components are calculated just once and completed before the real-time prediction phase of the pre-ranking stage. As a result, AIF enhances computational efficiency and reduces latency, freeing up resources to significantly improve the feature set and model architecture of interaction-independent components. Moreover, we delve into model design within the AIF framework, employing approximated methods for interaction-dependent components in online real-time predictions. By co-designing both the framework and the model, our solution achieves notable performance gains without significantly increasing computational and latency costs. This has enabled the successful deployment of AIF in the Taobao display advertising system.

[424] AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate

Meng Zhu, Quan Xiao, Weidong Min

Main category: cs.LG

TL;DR: AdamNX is a new optimization algorithm that improves upon Adam by introducing a novel second-order moment estimation exponential decay rate that gradually reduces learning step correction strength, eventually degrading to momentum SGD for better training stability and generalization.

Details

Motivation: Adam optimization algorithm tends to converge to non-flat minima compared to SGD-based methods, which can limit generalization ability in large language model training.

Method: Proposed AdamNX algorithm with a novel second-order moment estimation exponential decay rate that weakens learning step correction strength over time and degrades to momentum SGD during stable training periods.

Result: Experimental results show AdamNX’s second-order moment estimation exponential decay rate outperforms current methods, and AdamNX consistently outperforms Adam and its variants in performance.

Conclusion: AdamNX provides improved training stability and potentially better generalization by addressing Adam’s tendency to converge to non-flat minima through adaptive decay rate adjustment.

Abstract: Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamNX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to momentum SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamNX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamNX.

[425] Nonparametric estimation of conditional probability distributions using a generative approach based on conditional push-forward neural networks

Nicola Rares Franco, Lorenzo Tedesco

Main category: cs.LG

TL;DR: CPFN is a generative framework for conditional distribution estimation that learns stochastic maps for efficient conditional sampling without requiring invertibility or adversarial training.

Details

Motivation: To develop an efficient method for conditional distribution estimation that enables straightforward conditional sampling and estimation of conditional statistics through Monte Carlo methods.

Method: Learns a stochastic map φ(x,u) such that φ(x,U) and Y|X=x follow approximately the same distribution, using a Kullback-Leibler formulation for training without invertibility or adversarial training requirements.

Result: CPFN achieves performance competitive with or superior to state-of-the-art methods including kernel estimators, tree-based algorithms, and deep learning techniques, while remaining lightweight and easy to train.

Conclusion: CPFN provides an effective and efficient framework for conditional distribution estimation with theoretical consistency guarantees and practical advantages over existing methods.

Abstract: We introduce conditional push-forward neural networks (CPFN), a generative framework for conditional distribution estimation. Instead of directly modeling the conditional density $f_{Y|X}$, CPFN learns a stochastic map $\varphi=\varphi(x,u)$ such that $\varphi(x,U)$ and $Y|X=x$ follow approximately the same law, with $U$ a suitable random vector of pre-defined latent variables. This enables efficient conditional sampling and straightforward estimation of conditional statistics through Monte Carlo methods. The model is trained via an objective function derived from a Kullback-Leibler formulation, without requiring invertibility or adversarial training. We establish a near-asymptotic consistency result and demonstrate experimentally that CPFN can achieve performance competitive with, or even superior to, state-of-the-art methods, including kernel estimators, tree-based algorithms, and popular deep learning techniques, all while remaining lightweight and easy to train.

[426] DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models

Yifan Li, Qin Li, Min Zhang, Min Zhang

Main category: cs.LG

TL;DR: The paper introduces Derivation Capability (DC) - the ability to modify outputs based on input changes, evaluates LLMs using DEVAL framework, and proposes Derivation Prompting to improve DC by 15.2%.

Details

Motivation: Current LLMs lack comprehensive evaluation for reasoning patterns that involve modifying outputs based on input changes, unlike human reasoning which can apply abstract rules to handle data variations.

Method: Formally defines Derivation Relation (DR) and Derivation Capability (DC), creates DEVAL evaluation framework to test 5 LLMs and 1 large reasoning model across 7 tasks, and proposes Derivation Prompting technique.

Result: Mainstream LLMs show moderate DR recognition but significant drop-offs in applying DR for problem-solving. Derivation Prompting achieves 15.2% average improvement in DC across all tested models.

Conclusion: Derivation Capability is a critical reasoning skill that current LLMs lack, but can be significantly improved through specialized prompting techniques like Derivation Prompting.

Abstract: Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules that govern relationships between changes of data, has not been comprehensively described or evaluated in LLMs. In this paper, we formally define this reasoning pattern as the Derivation Relation (DR) and introduce the concept of Derivation Capability (DC), i.e. applying DR by making the corresponding modification to the output whenever the input takes certain changes. To assess DC, a systematically constructed evaluation framework named DEVAL is proposed and used to evaluate five popular LLMs and one Large Reasoning Model in seven mainstream tasks. The evaluation results show that mainstream LLMs, such as GPT-4o and Claude3.5, exhibit moderate DR recognition capabilities but reveal significant drop-offs on applying DR effectively in problem-solving scenarios. To improve this, we propose a novel prompt engineering approach called Derivation Prompting (DP). It achieves an average improvement of 15.2% in DC for all tested LLMs, outperforming commonly used prompt engineering techniques.

[427] Complex variational autoencoders admit Kähler structure

Andrew Gracyk

Main category: cs.LG

TL;DR: Complex VAEs exhibit Kähler geometric structure in their latent space, with the Fisher information metric derived from complex Gaussian regularization. A computationally efficient Kähler potential is proposed that approximates the Fisher metric while preserving Kähler geometry.

Details

Motivation: To extend Riemannian structure analysis from real-valued VAEs to complex VAEs and explore their Kähler geometric properties for improved latent space regularization and sampling.

Method: Derived Fisher information metric for complex VAEs with complex Gaussian regularization, proposed a Kähler potential derivative for complex Gaussian mixtures, and developed efficient computation methods using plurisubharmonic functions.

Result: The method enables efficient computation of the metric, allows latent space regularization with decoder geometry, and facilitates sampling with weighted complex volume elements, yielding smoother representations and fewer semantic outliers.

Conclusion: Complex VAEs naturally admit Kähler geometric structure, and the proposed computational framework provides efficient regularization and sampling strategies that improve representation quality.

Abstract: It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments but for complex VAEs with a complex latent stage. We show that complex VAEs reveal to some level Kähler geometric structure. Our methods will be tailored for decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian regularization with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kähler potential relation is exactly achieved under relative entropy. We propose a Kähler potential derivative of complex Gaussian mixtures that has rough equivalence to the Fisher information metric while still being faithful to the underlying Kähler geometry. Computation of the metric via this potential is efficient, and through our potential, valid as a plurisubharmonic (PSH) function, large scale computational burden of automatic differentiation is displaced to small scale. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate these strategies, at the exchange of sample variation, yield consistently smoother representations and fewer semantic outliers.

[428] Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs

Rayen Dhahri, Steffen Urban

Main category: cs.LG

TL;DR: Quant-Trim is a training-phase method that creates hardware-neutral checkpoints robust to backend and precision variations, reducing the FP-to-low-bit gap without per-backend retraining.

Details

Motivation: Specialized edge accelerators use low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, causing inconsistent accuracy across backends and forcing practitioners to tweak flags or refactor models.

Method: Combines progressive fake quantization to align training with deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Agnostic to quantization schemes and requires no vendor-specific graph changes.

Result: Narrows FP-to-low-bit gap, reduces dependence on compiler heuristics/calibration, avoids per-backend retraining. Reports accuracy and edge metrics (latency, throughput, energy per inference, cost) under static/dynamic activation scaling and varying operator coverage.

Conclusion: Quant-Trim produces robust checkpoints that work consistently across different hardware backends and precision choices without requiring backend-specific modifications.

Abstract: Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization to align training with the deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP-to-low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics latency, throughput, energy per inference, and cost under static/dynamic activation scaling and varying operator coverage.

cs.MA

[429] The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems

Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi

Main category: cs.MA

TL;DR: A framework for simulating uncooperative behaviors in LLM-based multi-agent systems, showing that even minor uncooperative actions can cause rapid system collapse while cooperative systems remain stable.

Details

Motivation: To address the gap in understanding how uncooperative behaviors can destabilize LLM-based multi-agent systems and develop tools to analyze system vulnerabilities.

Method: Game theory-based taxonomy of uncooperative behaviors combined with a multi-stage simulation pipeline that dynamically generates and refines behaviors as agent states evolve, evaluated in collaborative resource management settings.

Result: 96.7% accuracy in generating realistic uncooperative behaviors; cooperative agents maintain 100% survival over 12 rounds with 0% resource overuse, while uncooperative behaviors cause system collapse within 1-7 rounds.

Conclusion: Uncooperative agents significantly degrade collective outcomes, highlighting the critical need for designing more resilient multi-agent systems to withstand disruptive behaviors.

Abstract: This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents’ states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.

[430] Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems

Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao

Main category: cs.MA

TL;DR: A coordination framework for LLM agents that enables adaptiveness through dynamic task routing, bidirectional feedback, and parallel agent evaluation, improving performance over static baselines.

Details

Motivation: Existing multi-agent frameworks rely on static workflows and limited communication, reducing effectiveness in open-ended, high-complexity domains.

Method: Proposed framework with three core mechanisms: dynamic task routing based on confidence and workload, bidirectional feedback with structured critiques, and parallel agent evaluation with competition on ambiguous subtasks.

Result: Substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines.

Conclusion: Incorporating both adaptiveness and structured competition in multi-agent LLM systems provides significant benefits for collaborative task completion.

Abstract: Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.

cs.MM

eess.AS

[431] A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification

Mohit Sharma, Robbe Van Rompaey, Wouter Lanneer, Marc Moonen

Main category: eess.AS

TL;DR: This paper introduces a generalized WOLA filter bank that improves subband system identification by repositioning subband filters before downsampling, eliminating constraints of traditional WOLA, and proposes a low-complexity PT-WOLA implementation.

Details

Motivation: Traditional STFT domain subband adaptive filtering using WOLA filter banks imposes constraints on subband filters when transformed to full-rate representation, limiting performance in subband system identification.

Method: Proposes a generalized WOLA filter bank that repositions subband filters before downsampling, analyzes MSE performance analytically, and introduces PT-WOLA for low-complexity implementation comparable to conventional WOLA.

Result: The generalized WOLA filter bank significantly enhances subband system identification performance, with analytical and empirical evidence supporting improved MSE performance while maintaining computational efficiency.

Conclusion: The proposed generalized WOLA framework overcomes limitations of traditional subband filtering approaches and enables more effective subband system identification with practical computational complexity.

Abstract: This paper addresses the challenges in short-time Fourier transform (STFT) domain subband adaptive filtering, in particular, subband system identification. Previous studies in this area have primarily focused on setups with subband filtering at a downsampled rate, implemented using the weighted overlap-add (WOLA) filter bank, popular in audio and speech-processing for its reduced complexity. However, this traditional approach imposes constraints on the subband filters when transformed to their full-rate representation. This paper makes three key contributions. First, it introduces a generalized WOLA filter bank that repositions subband filters before the downsampling operation, eliminating the constraints on subband filters inherent in the conventional WOLA filter bank. Second, it investigates the mean square error (MSE) performance of the generalized WOLA filter bank for full-band system identification, establishing analytical ties between the order of subband filters, the full-band system impulse response length, the decimation factor, and the prototype filters. Third, to address the increased computational complexity of the generalized WOLA, the paper proposes a low-complexity implementation termed per-tone weighted overlap-add (PT-WOLA), which maintains computational complexity on par with conventional WOLA. Analytical and empirical evidence demonstrates that the proposed generalized WOLA filter bank significantly enhances the performance of subband system identification.

[432] Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio

Mohan Shi, Xiong Xiao, Ruchao Fan, Shaoshi Ling, Jinyu Li

Main category: eess.AS

TL;DR: JEDIS-LLM is an end-to-end Speech-LLM that performs joint ASR and speaker diarization on long-form audio using short audio training, featuring streamable inference with Speaker Prompt Cache.

Details

Motivation: To create a unified system that answers "who spoke what" in multi-speaker scenarios with streamable long-form audio processing, overcoming limitations of cascaded offline approaches.

Method: Uses Speech-LLM with Speaker Prompt Cache (SPC) for chunk-wise streaming inference, incorporates word-level speaker supervision during training, and enables pre-enrolled speaker profiles.

Result: Outperforms Sortformer, Meta-Cat on short audio and DiarizationLM on long-form audio, achieving state-of-the-art performance with fully end-to-end streamable processing.

Conclusion: First work enabling zero-shot streamable joint ASR and diarization on long audio using Speech-LLM trained only on short audio, demonstrating effective Speaker Prompt Cache mechanism.

Abstract: Joint automatic speech recognition (ASR) and speaker diarization aim to answer the question “who spoke what” in multi-speaker scenarios. In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). The model is trained only on short audio under 20s but is capable of streamable inference on long-form audio without additional training. This is achieved by introducing a Speaker Prompt Cache (SPC) with an on-the-fly update mechanism during chunk-wise streaming inference, inspired by the autoregressive nature of LLMs. The SPC also allows the seamless use of pre-enrolled speaker profiles which is common in many scenarios like meeting transcription. To further enhance diarization capability, we incorporate word-level speaker supervision into the speech encoder during training. Experimental results demonstrate that our system outperforms strong baselines, including Sortformer and Meta-Cat in the local setting on audio up to 20s, and DiarizationLM on long-form audio, despite being fully end-to-end and streamable while DiarizationLM follows a cascaded offline pipeline. To the best of our knowledge, this is the first work enabling zero-shot streamable joint ASR and diarization on long audio using a Speech-LLM trained only on short audio, achieving state-of-the-art performance.

[433] SUNAC: Source-aware Unified Neural Audio Codec

Ryo Aihara, Yoshiki Masuyama, Francesco Paissan, François G. Germain, Gordon Wichern, Jonathan Le Roux

Main category: eess.AS

TL;DR: Proposes a source-aware neural audio codec that encodes individual sources directly from mixtures using source type prompts, enabling selective encoding of specific sources without needing separate source separation preprocessing.

Details

Motivation: Current neural audio codecs encode mixtures of multiple sources in an entangled manner, which impedes efficient downstream processing when only specific sources are needed for applications like sound analysis or speaker transcription.

Method: Developed a source-aware codec that encodes individual sources directly from audio mixtures, conditioned on source type prompts, allowing user-driven selection of which sources to encode including multiple sources of the same type.

Result: The model achieves competitive resynthesis and separation quality compared to a cascade of source separation followed by conventional neural audio codec, with lower computational cost.

Conclusion: Source-aware neural audio codecs provide an efficient alternative to traditional approaches by enabling direct encoding of selected sources from mixtures, reducing computational overhead while maintaining quality.

Abstract: Neural audio codecs (NACs) provide compact representations that can be leveraged in many downstream applications, in particular large language models. Yet most NACs encode mixtures of multiple sources in an entangled manner, which may impede efficient downstream processing in applications that need access to only a subset of the sources (e.g., analysis of a particular type of sound, transcription of a given speaker, etc). To address this, we propose a source-aware codec that encodes individual sources directly from mixtures, conditioned on source type prompts. This enables user-driven selection of which source(s) to encode, including separately encoding multiple sources of the same type (e.g., multiple speech signals). Experiments show that our model achieves competitive resynthesis and separation quality relative to a cascade of source separation followed by a conventional NAC, with lower computational cost.

[434] Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs

Wei-Cheng Tseng, David Harwath

Main category: eess.AS

TL;DR: Codec2Vec is the first speech representation learning framework using discrete audio codec units, achieving competitive performance with 16.5x storage reduction and 2.3x faster training.

Details

Motivation: To leverage neural audio codecs as universal acoustic feature extractors for broader speech processing tasks, offering improved efficiency, storage, and privacy benefits.

Method: Uses discrete audio codec units with masked prediction and various training target derivation strategies for speech representation learning.

Result: Achieves competitive performance on SUPERB benchmark with 16.5x storage reduction and 2.3x training speedup compared to continuous-input models.

Conclusion: Codec2Vec demonstrates scalable and efficient speech representation learning using discrete codec units, making it practical for real-world applications.

Abstract: Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.

[435] Recent Advances in Discrete Speech Tokens: A Review

Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

Main category: eess.AS

TL;DR: Survey of discrete speech tokens in LLM era, covering acoustic vs semantic token types, their advantages for speech representation, and future research directions.

Details

Motivation: The rise of LLMs has made discrete speech tokens fundamental for speech representation due to their efficiency, compactness, and compatibility with text-based LLM architectures.

Method: Systematic synthesis of existing taxonomy, critical examination of strengths/limitations of acoustic vs semantic tokens, and experimental comparisons across token types.

Result: Comprehensive survey categorizing discrete speech tokens into acoustic and semantic classes, analyzing their unique design philosophies and methodological approaches.

Conclusion: Identifies persistent challenges and proposes future research directions to advance discrete speech token development and applications in speech generation.

Abstract: The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.

[436] UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

Main category: eess.AS

TL;DR: UniVoice is a unified LLM framework that integrates speech recognition and synthesis using continuous representations, achieving state-of-the-art performance in both ASR and zero-shot TTS tasks.

Details

Motivation: Current approaches treat speech recognition and text-to-speech as separate tasks, but a unified framework could better leverage LLM capabilities. Discrete speech tokenization causes information loss, limiting performance.

Method: Uses continuous representations with autoregressive modeling for ASR and flow matching for TTS. Implements dual attention mechanism switching between causal masks (recognition) and bidirectional masks (synthesis). Includes text-prefix-conditioned speech infilling for zero-shot voice cloning.

Result: Achieves or exceeds current single-task modeling methods in both ASR and zero-shot TTS tasks. Enables high-fidelity zero-shot voice cloning.

Conclusion: Demonstrates new possibilities for end-to-end speech understanding and generation through unified modeling of recognition and synthesis tasks.

Abstract: Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.

[437] FxSearcher: gradient-free text-driven audio transformation

Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim

Main category: eess.AS

TL;DR: FxSearcher is a gradient-free framework that uses Bayesian Optimization and CLAP-based scoring to find optimal audio effect configurations for text-guided audio transformations, with AI-based evaluation showing alignment with human preferences.

Details

Motivation: Existing audio transformation methods are limited by their reliance on a small set of differentiable audio effects, making it challenging to achieve diverse and high-quality transformations from text prompts.

Method: Proposes FxSearcher framework using Bayesian Optimization and CLAP-based score function to search for optimal audio effect configurations, with a guiding prompt to prevent artifacts and enhance human preference.

Result: The method achieves highest scores on proposed AI-based evaluation metrics that closely align with human preferences.

Conclusion: FxSearcher effectively discovers optimal audio effect configurations for text-guided transformations, with evaluation showing strong correlation with human judgment.

Abstract: Achieving diverse and high-quality audio transformations from text prompts remains challenging, as existing methods are fundamentally constrained by their reliance on a limited set of differentiable audio effects. This paper proposes FxSearcher, a novel gradient-free framework that discovers the optimal configuration of audio effects (FX) to transform a source signal according to a text prompt. Our method employs Bayesian Optimization and CLAP-based score function to perform this search efficiently. Furthermore, a guiding prompt is introduced to prevent undesirable artifacts and enhance human preference. To objectively evaluate our method, we propose an AI-based evaluation framework. The results demonstrate that the highest scores achieved by our method on these metrics align closely with human preferences. Demos are available at https://hojoonki.github.io/FxSearcher/

eess.IV

[438] UniUltra: Interactive Parameter-Efficient SAM2 for Universal Ultrasound Segmentation

Yue Li, Qing Xu, Yixuan Zhang, Xiangjian He, Qian Zhang, Yuan Yao, Fiseha B. Tesem, Xin Chen, Ruili Wang, Zhen Chen, Chang Wen Chen

Main category: eess.IV

TL;DR: UniUltra adapts SAM2 for universal ultrasound segmentation using parameter-efficient fine-tuning and knowledge distillation, achieving competitive performance with 94.08% parameter reduction for clinical deployment.

Details

Motivation: SAM2 performs poorly on ultrasound images due to domain disparities, requiring efficient adaptation while maintaining parameter efficiency and enabling deployment in resource-constrained clinical environments.

Method: Proposes context-edge hybrid adapter (CH-Adapter) for fine-grained perception and deep-supervised knowledge distillation (DSKD) to transfer knowledge from large encoder to lightweight encoder.

Result: Outperforms state-of-the-art methods with superior generalization. Achieves competitive performance using only 8.91% of SAM2’s parameters during fine-tuning, and final compressed model reduces parameters by 94.08% compared to original SAM2.

Conclusion: UniUltra provides an effective solution for adapting SAM2 to ultrasound segmentation while achieving significant parameter reduction, making it highly suitable for practical clinical deployment.

Abstract: The Segment Anything Model 2 (SAM2) demonstrates remarkable universal segmentation capabilities on natural images. However, its performance on ultrasound images is significantly degraded due to domain disparities. This limitation raises two critical challenges: how to efficiently adapt SAM2 to ultrasound imaging while maintaining parameter efficiency, and how to deploy the adapted model effectively in resource-constrained clinical environments. To address these issues, we propose UniUltra for universal ultrasound segmentation. Specifically, we first introduce a novel context-edge hybrid adapter (CH-Adapter) that enhances fine-grained perception across diverse ultrasound imaging modalities while achieving parameter-efficient fine-tuning. To further improve clinical applicability, we develop a deep-supervised knowledge distillation (DSKD) technique that transfers knowledge from the large image encoder of the fine-tuned SAM2 to a super lightweight encoder, substantially reducing computational requirements without compromising performance. Extensive experiments demonstrate that UniUltra outperforms state-of-the-arts with superior generalization capabilities. Notably, our framework achieves competitive performance using only 8.91% of SAM2’s parameters during fine-tuning, and the final compressed model reduces the parameter count by 94.08% compared to the original SAM2, making it highly suitable for practical clinical deployment. The source code is available at https://github.com/xq141839/UniUltra.

[439] Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images

Erwan Dereure, Robin Louiset, Laura Parkkinen, David A Menassa, David Holcman

Main category: eess.IV

TL;DR: Developed automated pipeline for segmenting and classifying alpha-synuclein aggregates in Parkinson’s disease histopathology images using weakly supervised segmentation and ResNet50 classifier.

Details

Motivation: To better understand spatial organization of alpha-synuclein aggregates in Parkinson's disease through automated analysis of immunohistochemistry images, overcoming limitations of manual analysis.

Method: Used weakly supervised segmentation robust to immunohistochemical labeling variability with ResNet50 classifier on whole-slide images of midbrain tissue from PD and iLBD cases.

Result: Achieved 80% balanced accuracy in differentiating between major aggregate morphologies including Lewy bodies and neurites.

Conclusion: The framework enables large-scale characterization of spatial distribution and heterogeneity of alpha-synuclein aggregates, facilitating investigation of their relationships with surrounding cells like microglia and astrocytes.

Abstract: Parkinson’s disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, forming Lewy bodies and neuritic shape used for pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach allows to differentiate between major aggregate morphologies, including Lewy bodies and neurites with a balanced accuracy of $80%$. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.

[440] Introducing DEFORMISE: A deep learning framework for dementia diagnosis in the elderly using optimized MRI slice selection

Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis

Main category: eess.IV

TL;DR: DEFORMISE is a deep learning framework for dementia diagnosis using 3D brain MRI scans with optimized slice selection, achieving 94.12% accuracy on OASIS datasets and demonstrating robustness on ADNI dataset.

Details

Motivation: Dementia presents significant diagnostic challenges affecting millions worldwide, requiring more accurate and efficient diagnostic tools for clinical applications.

Method: Uses selective processing of MRI slices focusing on relevant brain regions, complemented by a confidence-based classification committee with three novel deep learning models and explainable AI techniques.

Result: Achieved 94.12% accuracy on Open OASIS datasets, surpassing existing methods, and demonstrated robustness on ADNI dataset with comprehensive validation.

Conclusion: Provides a significant advancement in dementia diagnosis with highly accurate and efficient tool for clinical applications, validated through explainable AI and ablation studies.

Abstract: Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce DEFORMISE, a novel DEep learning Framework for dementia diagnOsis of eldeRly patients using 3D brain Magnetic resonance Imaging (MRI) scans with Optimized Slice sElection. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three novel deep learning models. Tested on the Open OASIS datasets, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the ADNI dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.

[441] LEARNER: Contrastive Pretraining for Learning Fine-Grained Patient Progression from Coarse Inter-Patient Labels

Jana Armouti, Nikhil Madaan, Rohan Panda, Tom Fox, Laura Hutchins, Amita Krishnan, Ricardo Rodriguez, Bennett DeBoisblanc, Deva Ramanan, John Galeotti, Gautam Gare

Main category: eess.IV

TL;DR: LEARNER uses contrastive pretraining on inter-patient data to predict individual treatment response, improving accuracy over standard methods on lung ultrasound and brain MRI datasets.

Details

Motivation: Predicting treatment outcomes requires detecting subtle visual changes over time, but obtaining large-scale longitudinal data for each patient is impractical.

Method: Proposed LEARNER framework uses contrastive pretraining on coarsely labeled inter-patient data to learn patient-specific representations that capture intra-patient progression.

Result: Improves downstream classification accuracy and F1-score compared to standard MSE pretraining across both lung ultrasound and brain MRI datasets.

Conclusion: Inter-patient contrastive learning shows potential for individualized outcome prediction by using inter-patient variability as proxy for intra-patient progression.

Abstract: Predicting whether a treatment leads to meaningful improvement is a central challenge in personalized medicine, particularly when disease progression manifests as subtle visual changes over time. While data-driven deep learning (DL) offers a promising route to automate such predictions, acquiring large-scale longitudinal data for each individual patient remains impractical. To address this limitation, we explore whether inter-patient variability can serve as a proxy for learning intra-patient progression. We propose LEARNER, a contrastive pretraining framework that leverages coarsely labeled inter-patient data to learn fine-grained, patient-specific representations. Using lung ultrasound (LUS) and brain MRI datasets, we demonstrate that contrastive objectives trained on coarse inter-patient differences enable models to capture subtle intra-patient changes associated with treatment response. Across both modalities, our approach improves downstream classification accuracy and F1-score compared to standard MSE pretraining, highlighting the potential of inter-patient contrastive learning for individualized outcome prediction.

[442] Neuromorphic Split Computing via Optical Inter-Satellite Links

Zihang Song, Petar Popovski

Main category: eess.IV

TL;DR: A neuromorphic split-computing framework that partitions SNNs between edge and core nodes, using lossless channel-block-sparse event representation for efficient transmission over optical inter-satellite links, achieving 10x reduction in energy and transmission load.

Details

Motivation: To enable energy-efficient low-latency inference over optical inter-satellite links by addressing the challenges of transmission efficiency and reliability in space communications.

Method: Partitions SNNs between edge/core nodes, uses lossless channel-block-sparse event representation, hierarchical error protection with FEC and CRC, end-to-end training with sparsity/clustering regularizers, and channel-aware stochastic masking.

Result: Achieves 10x reduction in computational energy and transmission load compared to dense split systems with <1% accuracy loss, 3.7x better transmission efficiency than address-event-based SNNs, and superior resilience to optical pointing jitter.

Conclusion: The framework provides an effective solution for energy-efficient and reliable neuromorphic computing in space applications, significantly outperforming conventional approaches in transmission efficiency and robustness.

Abstract: We present a neuromorphic split-computing framework for energy-efficient low-latency inference over optical inter-satellite links. The system partitions a spiking neural network (SNN) between edge and core nodes. To transmit sparse spiking features efficiently, we introduce a lossless channel-block-sparse event representation that exploits inter- and intra-channel sparsity. We employ hierarchical error protection using multi-level forward error correction and cyclic redundancy checks to ensure reliable communication without retransmission. The framework uses end-to-end training with sparsity and clustering regularizers, combined with channel-aware stochastic masking to optimize feature compression and channel robustness jointly. In a proof-of-concept implementation on remote sensing imagery, the framework achieves over $10 \times$ reduction in both computational energy and transmission load compared to conventional dense split systems, with less than 1% accuracy loss. The proposed approach also outperforms address-event-based split SNNs by $3.7 \times$ in transmission efficiency and shows superior resilience to optical pointing jitter.

Today’s Research Highlights

Table of Contents

cs.CL

[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

[4] Liars’ Bench: Evaluating Lie Detectors for Language Models

[5] Learning Tractable Distributions Of Language Model Continuations

[6] Early science acceleration experiments with GPT-5

[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

[13] NLP Datasets for Idiom and Figurative Language Tasks

[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

[15] AICC: Parse HTML Finer, Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

[16] Classification of worldwide news articles by perceived quality, 2018-2024

[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

[19] Arctic-Extract Technical Report

[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

[26] GPTopic: Dynamic and Interactive Topic Representations

[27] LLMs as Models for Analogical Reasoning

[28] CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering

[29] Atomic Calibration of LLMs in Long-Form Generations

[30] Crowdsourcing Lexical Diversity

[31] OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

[32] Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks

[33] CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

[34] One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

[35] AutoJudge: Judge Decoding Without Manual Annotation

[36] Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

[37] An Iterative Question-Guided Framework for Knowledge Base Question Answering

[38] AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

[39] Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models

[40] Eliciting Reasoning in Language Models with Cognitive Tools

[41] Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement

[42] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

[43] CRISP: Persistent Concept Unlearning via Sparse Autoencoders

[44] From Confidence to Collapse in LLM Factual Robustness

[45] CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples

[46] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

[47] Verbalized Algorithms

[48] Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes

[49] Steering Evaluation-Aware Language Models to Act Like They Are Deployed

[50] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning

[51] LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

[52] HalluClean: A Unified Framework to Combat Hallucinations in LLMs

[53] MajinBook: An open catalogue of digital world literature with likes

[54] Auditing Google’s AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy

[55] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

[56] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

[57] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

[58] Multimodal Evaluation of Russian-language Architectures

cs.CV

[59] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

[60] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

[61] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

[62] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

[63] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

[64] Box6D : Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

[65] Adaptive Guided Upsampling for Low-light Image Enhancement

[66] Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

[67] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

[68] Boosting Medical Visual Understanding From Multi-Granular Language Learning

[69] Automated Interpretable 2D Video Extraction from 3D Echocardiography

[70] Segmenting Collision Sound Sources in Egocentric Videos

[71] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

[72] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

[73] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

[74] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

[75] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

[76] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection